Saturday, March 16, 2013

Twin Sons of Different Mothers

I watched the webcast of the Boeing presentation on their comprehensive solution for the 787 battery system. People may have different opinions on Boeing’s presentation, but I was particularly impressed with the remarks of Mike Sinnett, Vice President and Chief Project Engineer. He articulated an approach to an engineering issue that is the same as the method used by effective pilots. Boeing’s approach has been validated by the FAA, the only people outside Boeing with all the available information. The NTSB is continuing its investigation.

The 787 Dreamliner battery saga is providing many lessons for the air transportation and aerospace engineering industries. I am not an engineer by training, but I have tremendous respect for the profession. The ability to apply science and manufacturing to an idea and produce a sophisticated product is very impressive. An engineer’s goal is to produce a product that reliably performs to the design criteria. In the case of the Dreamliner, as well as Boeing’s other aircraft, the primary design criterion espoused by Boeing is safety.
My education and my training are in the operation of sophisticated machinery, specifically airplanes, in a manner that accomplishes the desired objective(s). The primary objective is safety for me as well.

Engineers and pilots have much the same approach to their disciplines. We both have challenges to deal with: risk, causal factors, unexpected outcomes, testing of assumptions, and unknowns. Engineers, however, are afforded a little more of one critical resource than pilots: time. When a problem occurs, a pilot is always faced with a fixed amount of time to deal with it. The time can be as short as seconds or as long as hours, but there is always a limit. Engineers have some time limits as well. But for engineers, running out of their “fuel,” money, is a matter of weeks or months instead of minutes and hours.

Pilots and engineers do have the same approach to the obstacles to their desired objective, safety. It is awareness, avoidance and resolution. What issues might be anticipated? Can these issues be avoided? If an issue cannot be avoided, or my avoidance strategy is ineffective, will the outcome still be acceptable?

When the engineer is asked if he is certain his part will not fail, the answer is no.  When the pilot is asked if he is certain he will be able to land at the scheduled destination, the answer is also no.  That is the case every day throughout the air transportation industry.  With so much uncertainty, how can air transportation have such a remarkable safety record? 

Both the engineer and the pilot have the same mindset. They ask, “Can something go wrong?”, “Can it be prevented?” and, if prevention is ineffective, “Can it be dealt with to a safe conclusion?” Successful engineers build redundancy into their systems. Successful pilots build redundancy into their decision making. This multifaceted and balanced approach is essential in the dynamic environment that is aviation.

Failures, mechanical and human, are a part of air transportation. They are neither desired nor acceptable, but they are a reality. The existence of failure or error is part of any mechanical or human system. An engineer who believes he can build a part that will never fail is as naive as the pilot who believes he can avoid all errors. How failures or errors are resolved indicates a great deal about the effectiveness of an engineer or pilot. Is the failure or error recognized in a timely manner? Is it trapped before there are negative consequences? Are the circumstances of the error or failure fully considered for possible future countermeasures? Are opportunities explored for additional levels of redundancy?

I am sure the aerospace engineering community has a name for their process to achieve a safe aircraft. For pilots, our process is known as Threat and Error Management. Identifying and managing (avoiding) threats and errors before they result in a negative impact to safe operations is how we both do it.

Sunday, March 3, 2013

What is Acceptable Risk?

"There are no new types of air crashes - only people with short memories.  Every accident has its own forerunners, and every one happens either because somebody did not know where to draw the vital dividing line between the unforeseen and the unforeseeable or because well-meaning people deemed the risk acceptable." "The Final Call" (Why Airline Disasters Continue to Happen)

I have referenced this 1990 Stephen Barlay quote before and it has never been more applicable than it is today. Not just because of the 787 battery problems, but because it is the essence of safety, whether aviation related or not.

Recently Boeing met with the FAA to give its proposal for addressing the battery issues associated with the 787 Dreamliner. The terms unforeseen and unforeseeable may not have been used in those meetings, but I am confident those concepts were discussed at length. Trying to sort out what we know, what we don’t know and what we don’t know we don’t know can be a daunting task. However, all the risks associated with what we know and what we don’t know become academic if we don’t fly. That, of course, was the rationale the FAA used to issue an order grounding the 787. So now, like every other airliner, if the 787 ever flies again, there will be actual risks on every flight. I don’t believe any manufacturer or operator has ever intentionally placed an airplane and its crew, passengers and cargo in a situation where they determined the risk to be unacceptable. The question that must be asked is simple yet complex. It is the question that all Captains must always answer in their decision making process. Under what circumstances are the risks acceptable or unacceptable?

The idea of an acceptable risk is confusing. I have jokingly referred to it as an “aviation oxymoron”. Acceptable risk is one of aviation’s least understood concepts. Most pilots have a very difficult time articulating exactly what it is. It is not difficult for most pilots to point out things that are clearly risky. Conditions that portend “bad things could happen” are usually considered risky. A worn-out tire, severe icing, microburst wind shear and an inadequate fuel supply are all examples of things that would commonly be termed “risky”. In fact, everyone would probably agree these things would be not just risky, but unacceptably risky.

Conversely, what about things like engine failure on takeoff, contaminated runways with diminished braking action, inoperative communications or navigation equipment? These are all considered risky scenarios, but we fly safely with them every day. What is the difference? These risks have been deemed acceptable by the FAA, by the manufacturer, by the operator and by the pilots; otherwise they would not be routine. If some risks in aviation were not acceptable, we would never fly.

Therefore, what is an acceptable risk? Is an acceptable risk just subjective opinion, or are there ways we can more objectively determine acceptability? I believe there is a process for evaluating whether a risk is acceptable or unacceptable.

To begin this discussion we must first talk about risk. The risky or threatening event has two components, probability and severity. What is the chance the event will occur? What is the severity of the outcome if the event does occur? Is the risk acceptable if the probability is low (not zero)? If we use probability to determine acceptable risk, then we are making safety a game of chance. Do you feel lucky? If we use severity to determine acceptable risk, then anything like engine failure during takeoff is a deal breaker. In that scenario, only when we develop an engine that cannot fail will we fly. That is not realistic either. How can we effectively address the probability / severity issue? The answer is rather simple.

Let’s take the probability issue first. Since most people, especially airline customers, are not comfortable with luck as a safety management system, we must assume the probability of the event is 1.0. Simply saying an engine only fails once in 10 to the 15th flights is not enough. With those odds, the same person who buys a Powerball ticket will never get on an airplane. In practice, however, the probability must be very small but does not need to be zero. If we knew there was a high probability of certain numbers being picked, the lotto would go broke. Similarly, if we knew there was a high probability of an engine failing on takeoff, the manufacturer and airline would go broke. Who would fly on that airplane or airline? The probability question is actually not if, but when.

That brings us to severity. This is where, I’m sorry, the rubber meets the road. This is really where risk becomes acceptable or unacceptable. With every risk there is a corresponding set of associated threats. Using the engine failure on takeoff example, some of the threats would be climb performance, stopping distance, controllability, pilot training and terrain, to name a few. The severity question becomes a function of the ability to manage the threats associated with an engine failure (the risk). If all the associated threats can be effectively managed to a safe outcome, the risk becomes acceptable. It is the difference between “We hope this doesn’t happen” and “If this happens, we will be safe.” The FAA has a statement in pilot certification standards that is applicable here, “The outcome of the maneuver must never be in doubt.” If there is doubt in the outcome, the risk is unacceptable.

This is why the safety management strategy known as Threat and Error Management has brought about a new paradigm in risk assessment and decision-making. Risk assessment is no longer just a graph where one axis is probability and the other severity, with some green areas under the curve labeled low (acceptable?). With probability and severity as risk criteria, only the origin (0,0) would be truly acceptable. In an effective safety management system, the ability to mitigate threats defines the “acceptable risk” area, not probability/severity. Threat management creates a clear distinction between acceptable and unacceptable risk, in contrast to the spectrum defined by probability and severity. The good news is that there is a rather large area under the threshold that is acceptable. The threat management definition of risk assessment allows for a safe operation in a very dynamic environment. Another difference between those two risk assessment models is that Threat and Error Management is a cultural mindset and a skill, not a fixed value like probability or severity. Threat and Error Management can be trained and embedded in an operation, whereas probability and severity are fixed parameters.

What is an acceptable risk?   It is a risk where the set of associated threats can be effectively managed to a safe outcome.  A threat management strategy is simply a process of identifying and preparing for things that potentially make the environment more complex or reduce safety margins.   Therefore, an unacceptable risk is where the set of associated threats cannot be effectively managed.  Why is microburst windshear an unacceptable risk?  We cannot guarantee that the associated threats can be effectively managed if we fly through it.  Therefore, we avoid it.  But what if we get into windshear inadvertently?  We must have a strategy for that as well.
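
For readers who think in code, the definition above can be sketched as a simple rule: a risk is acceptable only when every associated threat has an effective management strategy. This is only an illustration, not an official Threat and Error Management tool. The class names, the `mitigation` field and the example mitigation labels are all assumptions made up for the sketch; only the threats themselves come from the engine failure and microburst examples in this post.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Threat:
    name: str
    # Hypothetical field: a short label for the strategy that manages the
    # threat, or None when no effective strategy exists.
    mitigation: Optional[str] = None


@dataclass
class Risk:
    name: str
    threats: List[Threat] = field(default_factory=list)

    def unmanaged(self) -> List[str]:
        """Names of threats that have no effective management strategy."""
        return [t.name for t in self.threats if t.mitigation is None]

    def is_acceptable(self) -> bool:
        """Acceptable only if every associated threat can be managed.

        Probability is deliberately absent: the event is assumed to happen.
        Note that a risk with no threats listed would pass this check, so
        the list of threats has to be complete for the answer to mean much.
        """
        return not self.unmanaged()


# Mitigation labels below are illustrative assumptions, not procedures.
engine_failure = Risk("engine failure on takeoff", [
    Threat("climb performance", "takeoff performance planning"),
    Threat("stopping distance", "runway and weight limits"),
    Threat("controllability", "aircraft certification"),
    Threat("pilot training", "recurrent simulator training"),
    Threat("terrain", "engine-out departure procedure"),
])

microburst = Risk("flight through microburst windshear", [
    Threat("aircraft performance inside the microburst", None),
])

print(engine_failure.is_acceptable())  # True
print(microburst.is_acceptable())      # False
print(microburst.unmanaged())
```

The point of the sketch is the shape of the question. It never asks how likely the event is; it asks whether anything on the list of threats is left without an answer.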

Finally, the ubiquitous potential for human error must be embraced and always considered a threat. Error management is inextricably linked to threat management and its strategic principles. Threat management strategies in aviation are developed around two basic components, the system (hardware) and the individual (human). The system component includes those inanimate factors (e.g. checklists, ground proximity warning systems) engineered into the operation that exist whether or not the crew is present or chooses to use them. The individual component includes the effective human behaviors that are derived from training and cognitive thought (e.g. communication, situation awareness, decision making). Both components of threat management, the system and the individual, are tools that must be understood and effectively employed if a risk is to be deemed acceptable.