Introduction

System safety is a critical aspect of engineering that aims to prevent accidents, injuries, and damage caused by complex systems. The term “system” refers to any combination of people, equipment, software, and environment that interact to perform a specific function. The objective of system safety is to identify, evaluate, and control the risks associated with the system, so that it operates safely and reliably under all foreseeable conditions.

One of the most commonly used techniques for system safety is fault tree analysis. Fault tree analysis is a deductive method that helps engineers and safety experts to identify the potential causes of system failures and accidents. By using a graphical representation of the system, fault tree analysis can identify the combinations of events and conditions that can lead to a specific failure mode. The outcome of the analysis is a fault tree diagram, which shows the logical relationships between the events and the ultimate failure mode.

This article aims to provide an overview of fault tree analysis as a tool for system safety. It will begin by explaining the basics of fault tree analysis, including its components, process, and event probabilities. It will then discuss the applications of fault tree analysis in system safety, including identifying critical system components, analyzing system failures and accidents, evaluating system safety performance, and improving system design and reliability.

The article will also examine the challenges and limitations of fault tree analysis, including assumptions and uncertainties, complexity and data requirements, cost and time constraints, and human factors and cognitive biases. Finally, the article will provide case studies and examples of fault tree analysis in different industries, best practices and recommendations for conducting fault tree analysis, and a conclusion on the importance of fault tree analysis for system safety.

Background

Fault tree analysis (FTA) is a method of analyzing systems and their potential failure modes. It is a top-down, deductive approach that starts with an undesired event and then works backward to identify the causes or factors that could lead to that event. FTA involves breaking down the system into its component parts and then using a logical diagram to represent the potential pathways that could lead to the undesired event.

Fault tree analysis was initially developed in 1962 by H.A. Watson at Bell Telephone Laboratories, under a U.S. Air Force contract to evaluate the launch control system of the Minuteman intercontinental ballistic missile. The work addressed the need for a systematic approach to analyzing the causes of complex system failures.

Watson developed the basic concepts of fault tree analysis, which involved breaking down the complex system into its components and analyzing the potential causes of a top event or undesired outcome. Engineers at Boeing, notably Dave Haasl, soon recognized the value of the method, applied it to the Minuteman system as a whole, and later extended its use to the design of commercial aircraft.

The method was later adopted in other high-hazard fields. The nuclear power industry applied it in the Reactor Safety Study (WASH-1400), published in 1975, and NASA placed greater emphasis on fault tree analysis and probabilistic risk assessment following the Challenger accident in 1986.

Since then, fault tree analysis has been widely used in many different industries, including nuclear power, chemical processing, transportation, and healthcare. The method has been continually refined and improved over the years, with many variations and extensions developed to suit specific applications and industries. Today, fault tree analysis remains a powerful tool for identifying and mitigating the risks associated with complex systems.

The fault tree analysis process begins with the selection of an undesired event or failure mode to be analyzed. The system is then decomposed into its component parts, and the potential causes of the undesired event are identified. These causes are represented in a fault tree diagram, which shows the logical relationships between the events and conditions that could lead to the undesired event. The fault tree can be quantified by assigning probabilities or failure rates to each event, allowing engineers to assess the likelihood of the undesired event occurring and to identify critical components of the system that require further attention.

FTA is often used in conjunction with other safety analysis techniques, such as hazard analysis, failure mode and effects analysis (FMEA), and reliability analysis, to provide a comprehensive understanding of the system and its potential failure modes.

The Basics of Fault Tree Analysis

Components of a fault tree

A fault tree consists of basic events, logical gates, intermediate events, and the top event. Basic events are the lowest level of events in the tree: component failures, human errors, or external conditions that are not developed any further. Intermediate events result from a combination of lower-level events and appear as the output of a gate. The top event is the undesired event or failure mode that is being analyzed.
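
One way to picture these components is as a small data structure. The sketch below is illustrative only: the class names `BasicEvent` and `Gate`, the helper `basic_event_names`, and the example events are assumptions made for this article, not part of any standard tool.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class BasicEvent:
    """Lowest-level event; it is not developed any further."""
    name: str


@dataclass
class Gate:
    """An intermediate (or top) event defined by a logic gate over its inputs."""
    name: str
    kind: str  # "AND" or "OR"
    inputs: List[Union["Gate", BasicEvent]] = field(default_factory=list)


def basic_event_names(node: Union[Gate, BasicEvent]) -> List[str]:
    """Collect the names of all basic events beneath a node, left to right."""
    if isinstance(node, BasicEvent):
        return [node.name]
    names: List[str] = []
    for child in node.inputs:
        names.extend(basic_event_names(child))
    return names


# Hypothetical example tree: the top event is the root gate.
fuel = Gate("fuel system problem", "OR",
            [BasicEvent("pump fails"), BasicEvent("line blocked")])
top = Gate("engine fails", "OR", [fuel, BasicEvent("mechanical failure")])
```

Here `top` plays the role of the top event, `fuel` is an intermediate event, and the three `BasicEvent` objects are the leaves of the tree.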

Logical gates in FTA

There are two main types of gates used in fault tree analysis: the AND gate and the OR gate. These gates are used to combine the basic events and intermediate events in the fault tree, and to determine the overall probability of the top event.

The AND gate represents a logical AND operation, meaning that the gate’s output event occurs only if all of the events connected to the gate occur. For example, in a fault tree for an aircraft engine failure, an AND gate could be used to represent the requirement for both a fuel system problem and a mechanical failure to occur in order for the engine to fail. If the input events are independent, the probability of the output event is calculated by multiplying the probabilities of the events connected to the AND gate.

The OR gate represents a logical OR operation, meaning that the gate’s output event occurs if any one or more of the events connected to the gate occur. For example, in the same fault tree for an aircraft engine failure, an OR gate could be used to represent the possibility of the engine failing due to either a fuel system problem or a mechanical failure. For two independent input events, the probability of the output event is calculated by adding the probabilities of the input events and subtracting the probability of both events occurring together, which is the product of their probabilities. More generally, for independent inputs it is one minus the product of the probabilities that each input does not occur.
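
The two gate calculations can be sketched in a few lines. This is a minimal illustration that assumes the input events are statistically independent; the function names and the probabilities are invented for the example.

```python
from math import prod
from typing import Sequence


def and_gate(probs: Sequence[float]) -> float:
    """P(all inputs occur), assuming the inputs are independent."""
    return prod(probs)


def or_gate(probs: Sequence[float]) -> float:
    """P(at least one input occurs), assuming the inputs are independent."""
    return 1.0 - prod(1.0 - p for p in probs)


# Hypothetical probabilities for a fuel system problem and a mechanical failure.
p_fuel, p_mech = 0.01, 0.02
p_and = and_gate([p_fuel, p_mech])  # product of the two probabilities
p_or = or_gate([p_fuel, p_mech])    # sum of the two minus their product
```

Writing the OR gate as one minus the product of the complements gives the same result as the add-and-subtract rule for two inputs, and it extends to any number of independent inputs.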

In addition to these basic gates, there are also more complex gates and symbols that can be used in fault tree analysis, such as the priority AND gate, the voting gate, the Inhibit gate, and the Transfer symbol. These are used to represent more complex system behaviors, such as redundant systems and voting systems. The priority AND gate requires all of its input events to occur in a specific order, while the voting gate (also called a k-out-of-n or combination gate) requires at least a specified number of its input events to occur, in any order. The Inhibit gate produces its output only when its input event occurs while an enabling condition is present, and the Transfer symbol links one part of the fault tree to a subtree that is developed elsewhere, for example on another page. These gates are particularly useful in complex safety-critical systems where redundancy, fault tolerance, and modularity are essential.
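
A voting gate of the kind described above can be quantified with a binomial sum. The sketch below assumes the simplest case, in which all inputs are independent and share the same probability; the function name and the numbers are illustrative assumptions.

```python
from math import comb


def voting_gate(k: int, n: int, p: float) -> float:
    """P(at least k of n inputs occur), for n independent inputs
    that each occur with the same probability p."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical example: the gate output occurs when at least two of
# three redundant channels have failed, each with probability 0.1.
p_two_of_three = voting_gate(2, 3, 0.1)
```

With `k` equal to `n` the gate reduces to an AND gate, and with `k` equal to 1 it reduces to an OR gate, which is a quick way to sanity-check the formula.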

The fault tree analysis process

The fault tree analysis process involves several steps. The first step is to clearly define the top event and the system that is being analyzed. The next step is to identify the basic events and intermediate events that could lead to the top event. These events are then arranged in a logical diagram to represent the potential pathways that could lead to the top event. The fault tree can then be analyzed quantitatively, by assigning probabilities or failure rates to each event, or qualitatively, by identifying the minimal combinations of events (minimal cut sets) that are sufficient to cause the top event.
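
The qualitative step can be illustrated by listing minimal cut sets, the smallest combinations of basic events that are sufficient to cause the top event. The following is a simple sketch for small trees built from AND and OR gates only; the nested-tuple representation and the function names are assumptions made for this example, and practical tools use more efficient algorithms.

```python
from itertools import product
from typing import FrozenSet, List, Set, Tuple, Union

# A node is either a basic event name or a (gate kind, children) pair.
Node = Union[str, Tuple[str, list]]


def cut_sets(node: Node) -> Set[FrozenSet[str]]:
    """Return cut sets of a tree of AND/OR gates over basic events."""
    if isinstance(node, str):
        return {frozenset([node])}
    kind, children = node
    child_sets = [cut_sets(c) for c in children]
    if kind == "OR":
        # Any cut set of any child is a cut set of the OR gate.
        return set().union(*child_sets)
    if kind == "AND":
        # Combine one cut set from every child.
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown gate kind: {kind}")


def minimal(sets: Set[FrozenSet[str]]) -> List[FrozenSet[str]]:
    """Drop every cut set that strictly contains another cut set."""
    return sorted((s for s in sets if not any(o < s for o in sets)),
                  key=lambda s: (len(s), sorted(s)))


# Hypothetical tree: TOP = A OR (B AND (A OR C))
tree = ("OR", ["A", ("AND", ["B", ("OR", ["A", "C"])])])
mcs = minimal(cut_sets(tree))
```

For this tree the combination of A and B is discarded because A alone already causes the top event, leaving A by itself and the pair B, C. Single-event cut sets such as A are single points of failure and usually deserve the most attention.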

Categories of events in FTA

There are three categories of events in fault trees: initiating events, primary events, and secondary events. Initiating events are external events that trigger the fault tree analysis, such as equipment failures or human errors. Primary events are events that are directly related to the top event and are considered to be the most important events in the tree. Secondary events are events that are not directly related to the top event but can still contribute to its occurrence.

Event probabilities and failure rates

Event probabilities and failure rates are used to quantify the likelihood of events in the fault tree. A failure rate describes how often a component is expected to fail per unit of time; combined with an exposure or mission time, it yields the probability that the corresponding basic event occurs. These event probabilities are then combined through the gates in quantitative analysis to calculate the overall likelihood of the top event occurring. Qualitative analysis, by contrast, relies on the logical structure of the tree rather than on numerical values.
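
A failure rate can be converted into an event probability before it is used in the tree. The sketch below assumes a constant failure rate (the exponential model) and a component that is not repaired during the exposure time; the function name and the numbers are illustrative.

```python
from math import exp


def failure_probability(rate_per_hour: float, hours: float) -> float:
    """P(component has failed by the end of the exposure time),
    assuming a constant failure rate and no repair."""
    return 1.0 - exp(-rate_per_hour * hours)


# Hypothetical component: 1e-5 failures per hour over a 1000-hour exposure.
p = failure_probability(1e-5, 1000.0)
```

When the product of rate and time is small, the result is close to that product (here about 0.01), which is why the approximation "rate times time" is often used for rare events.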

Applications of Fault Tree Analysis in System Safety

Identifying critical system components

In order to identify the critical components of a system using fault tree analysis, engineers and safety experts must first define the top event or undesired outcome they wish to avoid. This could be a system failure, an accident, or any other event that poses a risk to safety or reliability.

Once the top event has been defined, the fault tree is constructed by analyzing the potential causes and contributing factors that could lead to the top event. These factors are represented in the fault tree as basic events, intermediate events, and the top event itself.

The fault tree can then be analyzed to identify the components of the system that contribute most to the top event. This is done by assessing the probability and consequence of each event in the tree, and determining the impact that the event would have on the overall system if it were to occur.

By identifying the critical components of the system, engineers and safety experts can focus their efforts on improving the reliability and safety of those components. This could involve implementing redundancy measures, improving the quality of the components, or redesigning the system to eliminate or reduce the risk of failure.

For example, fault tree analysis could be used to identify the critical components of an aircraft engine that are most likely to cause a failure. By analyzing the potential causes of an engine failure, such as fuel system problems, mechanical failures, or pilot error, engineers can identify the weak points in the system and take steps to improve their reliability and safety. This could involve implementing redundancy measures, such as multiple fuel pumps or engine control systems, or improving the quality of the components used in the engine.
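
One common way to rank components is an importance measure. The sketch below uses the Birnbaum measure, which is the change in top-event probability between the event certainly occurring and certainly not occurring. The tree, the event names, and the probabilities are assumptions chosen to echo the engine example, not data from a real analysis.

```python
from typing import Callable, Dict


def top_probability(p: Dict[str, float]) -> float:
    """Hypothetical tree: the engine fails if both fuel pumps fail
    OR the controller fails (independent basic events)."""
    both_pumps = p["pump_a"] * p["pump_b"]                      # AND gate
    return 1.0 - (1.0 - both_pumps) * (1.0 - p["controller"])   # OR gate


def birnbaum(model: Callable[[Dict[str, float]], float],
             p: Dict[str, float], event: str) -> float:
    """Birnbaum importance: P(top | event occurs) - P(top | event does not occur)."""
    return model({**p, event: 1.0}) - model({**p, event: 0.0})


probs = {"pump_a": 0.01, "pump_b": 0.01, "controller": 0.001}  # assumed values
ranking = sorted(probs, key=lambda e: birnbaum(top_probability, probs, e),
                 reverse=True)
```

In this example the controller ranks first even though it is the most reliable part, because it is a single point of failure while each pump is backed up by the other.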

Analyzing system failures and accidents

Fault tree analysis can be used to investigate system failures and accidents, by analyzing the events and conditions that led to the undesired event. This can help engineers and safety experts to identify the root causes of the failure or accident, and to take corrective action to prevent similar incidents from occurring in the future.

Evaluating system safety performance

Fault tree analysis can be used to evaluate the safety performance of a system, by assessing the likelihood and consequences of different failure modes. This can help engineers and safety experts to identify areas for improvement, and to optimize the system design for safety and reliability.

Improving system design and reliability

Fault tree analysis can be used to inform the design of new systems or to improve the reliability of existing systems. By identifying the critical components and failure modes, engineers can take steps to improve the design of the system, such as adding redundancy or improving the quality of the components.

Other applications

Fault tree analysis has many other applications in system safety, including risk assessment, hazard identification, safety certification, and regulatory compliance. Fault tree analysis can also be used in conjunction with other safety analysis techniques, such as failure mode and effects analysis (FMEA), hazard and operability (HAZOP) analysis, event tree analysis (ETA), or Systems Theoretic Process Analysis (STPA) to provide a comprehensive understanding of the system and its potential failure modes.

Advantages and Limitations of Fault Tree Analysis

Fault tree analysis offers several advantages for identifying and mitigating the risks associated with complex systems. Some of the main advantages of using fault tree analysis include:

Systematic approach: Fault tree analysis provides a systematic and structured approach to identifying the potential causes of system failures or accidents. This allows engineers and safety experts to identify the weak points in the system and take steps to improve their reliability and safety.

Quantitative analysis: Fault tree analysis enables engineers to quantitatively assess the probability and consequences of system failures or accidents. This helps in making informed decisions about system design, maintenance, and risk mitigation strategies.

Clear visualization: Fault tree analysis provides a clear and concise visualization of the potential causes of system failures or accidents. This allows engineers and safety experts to communicate complex ideas and analysis to stakeholders in a clear and easily understandable manner.

Despite its advantages, fault tree analysis also has some limitations that should be taken into consideration. Some of the main limitations of using fault tree analysis include:

Limited scope: Fault tree analysis is limited to the scope of the system being analyzed. It may not account for external factors or events that could impact the system’s safety or reliability.
- Mitigation: Expand the scope. Engineers can include external factors or events that could impact the system’s safety or reliability, either by adding events or conditions to the fault tree or by using complementary analysis methods, such as event tree analysis or hazard analysis.

Uncertainty: Fault tree analysis relies on probability estimates for the events and components in the fault tree. These estimates may be uncertain and subject to variation based on the available data and assumptions made during the analysis.
- Mitigation: Reduce uncertainty. Engineers can gather more data and refine their assumptions about the events and components in the fault tree, which improves the accuracy of the probability estimates used in the analysis. Sensitivity analysis can also be performed to identify the factors that contribute most to the uncertainty.

Resource-intensive: Fault tree analysis can be resource-intensive, requiring significant time, effort, and expertise to perform effectively. This can make it challenging to apply in some situations where resources may be limited.
- Mitigation: Streamline the process. Engineers can use software tools that automate some of the tasks involved, such as generating the fault tree structure or performing the probability calculations. Training and development of specialized personnel can also help to reduce the time and effort required to perform the analysis.

Fault tree analysis is a powerful tool for identifying and mitigating the risks associated with complex systems. While it has some limitations, its benefits outweigh its drawbacks, making it an essential tool in many industries, including aerospace, nuclear power, and transportation.

Applications of Fault Tree Analysis by Industry

Fault tree analysis has been widely used in many different industries to improve the safety and reliability of complex systems. Some of the main applications of fault tree analysis include:

Aerospace: Fault tree analysis is widely used in the aerospace industry to analyze the safety and reliability of aircraft systems. By identifying the potential causes of system failures or accidents, engineers can take steps to improve the design, maintenance, and operation of these systems to minimize risks.
- Hixenbaugh AF. Fault tree for safety. BOEING CO SEATTLE WA SUPPORT SYSTEMS ENGINEERING; 1968 Nov 8.

Nuclear power: Fault tree analysis is also commonly used in the nuclear power industry to analyze the safety and reliability of nuclear power plants. By identifying the potential causes of accidents or failures, engineers can design safety features and backup systems to prevent or mitigate the impact of such events.
- Kang HG, Kim MC, Lee SJ, Lee HJ, Eom HS, Choi JG, Jang SC. An overview of risk quantification issues for digitalized nuclear power plants using a static fault tree. Nuclear Engineering and Technology. 2009;41(6):849-58.

Chemical processing: Fault tree analysis is used in the chemical processing industry to identify and mitigate the risks associated with hazardous chemicals and processes. By analyzing the potential causes of accidents or failures, engineers can implement safety measures and emergency response plans to protect workers and the environment.
- Khakzad N, Khan F, Amyotte P. Safety analysis in process facilities: Comparison of fault tree and Bayesian network approaches. Reliability Engineering & System Safety. 2011 Aug 1;96(8):925-32.

Healthcare: Fault tree analysis is increasingly being used in the healthcare industry to analyze the risks associated with medical devices and procedures. By identifying the potential causes of adverse events, healthcare professionals can take steps to improve patient safety and prevent medical errors.
- Abecassis ZA, McElroy LM, Patel RM, Khorzad R, Carroll IV C, Mehrotra S. Applying fault tree analysis to the prevention of wrong-site surgery. Journal of Surgical Research. 2015 Jan 1;193(1):88-94.

Transportation: Fault tree analysis is used in the transportation industry to analyze the safety and reliability of complex transportation systems, such as railroads, highways, and airports. By identifying the potential causes of accidents or failures, engineers can design safety features and emergency response plans to protect passengers and the public.
- Lambert HE. Use of fault tree analysis for automotive reliability and safety analysis. SAE transactions. 2004 Jan 1:690-6.

In addition to these applications, fault tree analysis is also used in many other industries and contexts where safety and reliability are critical, such as the military, mining, and oil and gas. Fault tree analysis is a versatile tool that can be applied to many different systems and situations, making it an essential tool for engineers and safety experts around the world.

Best Practices and Recommendations for FTA

While fault tree analysis can be a powerful tool for analyzing the safety and reliability of complex systems, it is important to follow best practices and recommendations to ensure its effectiveness. The following are some recommended practices for fault tree analysis:

A. Preparing for fault tree analysis

Before conducting fault tree analysis, it is important to define the system boundary and scope, as well as the specific failure modes and top events of interest. This will help to ensure that the fault tree analysis is focused and effective.

B. Selecting appropriate fault tree software tools

There are many software tools available for conducting fault tree analysis, each with its own strengths and weaknesses. It is important to carefully consider the requirements and constraints of the analysis and select a software tool that is appropriate for the task.

C. Involving subject matter experts

Fault tree analysis requires a deep understanding of the system being analyzed and its components. Involving subject matter experts in the analysis can provide valuable insights into the system and its potential failure modes.

D. Defining probability estimates

Fault tree analysis relies on probability estimates for the events and components in the tree. It is important to carefully define these probabilities, taking into account available data and expert judgment, and to document the sources of these estimates.

E. Using conservative estimates

In cases where data is limited or uncertain, it may be appropriate to use conservative estimates for the probability of events and components. This helps to ensure that the analysis errs on the side of safety and does not understate the risks associated with the system. Conservative assumptions should be documented so that they can be revisited when better data becomes available.

F. Conducting sensitivity analysis

Sensitivity analysis involves varying the input parameters of the fault tree model to assess their impact on the results. This can help to identify the most important contributors to system failure and guide decision-making around risk mitigation strategies.
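
A simple form of sensitivity analysis changes one input at a time and compares the result with the baseline. The sketch below is a minimal illustration; the tree, the event names, the probabilities, and the scaling factor are all assumptions made for the example.

```python
from typing import Dict


def top_probability(p: Dict[str, float]) -> float:
    """Hypothetical tree: TOP = (sensor fails AND backup fails) OR operator error."""
    both = p["sensor"] * p["backup"]                          # AND gate
    return 1.0 - (1.0 - both) * (1.0 - p["operator"])         # OR gate


def one_at_a_time(p: Dict[str, float], factor: float) -> Dict[str, float]:
    """Scale each input probability by `factor` in turn (capped at 1) and
    report the ratio of the new top-event probability to the baseline."""
    base = top_probability(p)
    return {
        name: top_probability({**p, name: min(1.0, value * factor)}) / base
        for name, value in p.items()
    }


baseline = {"sensor": 1e-2, "backup": 1e-2, "operator": 1e-3}  # assumed values
ratios = one_at_a_time(baseline, factor=2.0)
most_sensitive = max(ratios, key=ratios.get)
```

In this example, doubling the operator error probability nearly doubles the top-event probability, while doubling either hardware probability changes it by less than ten percent, so the operator error estimate deserves the most scrutiny.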

G. Considering human factors

Human factors, such as operator error or fatigue, can play a significant role in system failures. It is important to consider these factors in fault tree analysis and to include appropriate mitigation strategies in the analysis.

H. Considering common cause

When conducting fault tree analysis, it is important to consider the potential for common cause failures, which occur when multiple components or events share a common cause. This can lead to a greater likelihood of failure than predicted by traditional fault tree analysis methods. To address this, it may be necessary to use specialized techniques or software tools that account for common cause failures.
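
The effect of common cause failures can be illustrated with a beta-factor model, one of the simpler techniques used for this purpose. The sketch below assumes a redundant pair of identical components and an assumed fraction `beta` of each component's failure probability that comes from a shared cause; the function name and the numbers are illustrative.

```python
def redundant_pair_failure(p: float, beta: float) -> float:
    """P(both components of a redundant pair fail) under a simple beta-factor model.

    p    -- total failure probability of each component
    beta -- assumed fraction of that probability due to a shared cause
    """
    p_common = beta * p              # one shared-cause event fails both components
    p_indep = (1.0 - beta) * p       # independent part, per component
    both_indep = p_indep * p_indep   # AND gate over the independent failures
    return 1.0 - (1.0 - p_common) * (1.0 - both_indep)  # OR gate


naive = redundant_pair_failure(1e-3, beta=0.0)     # plain AND gate, no shared cause
with_ccf = redundant_pair_failure(1e-3, beta=0.1)  # 10% of failures share a cause
```

With these assumed numbers, the shared-cause term raises the pair's failure probability by roughly two orders of magnitude compared with the plain AND gate, which is why ignoring common causes can make redundancy look far more effective than it is.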

I. Iterative analysis and refinement

Fault tree analysis is an iterative process that may require multiple rounds of analysis and refinement to achieve an accurate and effective model. It is important to be prepared to revise the analysis as new information becomes available or to refine the model based on feedback from subject matter experts.

J. Incorporating feedback from stakeholders

Feedback from stakeholders, including management, operators, and maintenance personnel, can provide valuable insights into the system being analyzed and the potential failure modes. It is important to incorporate this feedback into the fault tree analysis to ensure that it accurately reflects the system and its operation.

K. Validating and verifying the fault tree model

It is important to validate and verify the fault tree model to ensure that it accurately reflects the system being analyzed. This can involve reviewing the assumptions and data inputs used in the analysis, as well as testing the model against historical data or other sources of information.
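
One inexpensive cross-check is to compare the calculated top-event probability with an independent estimate from random sampling. The sketch below does this for a small hypothetical tree; the tree, the probabilities, the trial count, and the seed are assumptions made for the example. Note that this only checks the arithmetic of the model, not whether the tree structure matches the real system.

```python
import random


def analytic_top(p_a: float, p_b: float, p_c: float) -> float:
    """Hypothetical tree: TOP = (A AND B) OR C, with independent basic events."""
    return 1.0 - (1.0 - p_a * p_b) * (1.0 - p_c)


def simulated_top(p_a: float, p_b: float, p_c: float,
                  trials: int, seed: int) -> float:
    """Estimate the same probability by sampling the basic events directly."""
    rng = random.Random(seed)  # seeded so that the run is repeatable
    hits = 0
    for _ in range(trials):
        a = rng.random() < p_a
        b = rng.random() < p_b
        c = rng.random() < p_c
        if (a and b) or c:
            hits += 1
    return hits / trials


exact = analytic_top(0.2, 0.3, 0.1)
estimate = simulated_top(0.2, 0.3, 0.1, trials=200_000, seed=1)
```

The two values should agree to within sampling error. A large gap would point to a mistake in either the gate logic or the calculation.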

L. Communicating and documenting the results

It is important to communicate the results of the fault tree analysis to stakeholders in a clear and concise manner. This can involve creating visualizations or reports that summarize the results and their implications, as well as documenting the assumptions and limitations of the analysis.

M. Taking actions based on the fault tree analysis results

The ultimate goal of fault tree analysis is to identify opportunities for improvement and risk reduction. It is important to take action based on the results of the analysis to implement the recommended changes and improve the safety and reliability of the system. This may involve designing new safety features, modifying existing processes or systems, or providing additional training to personnel. It is important to prioritize actions based on the severity and likelihood of the identified risks and track progress towards implementing the recommended changes.

N. Avoiding common errors

The following are common errors that can occur when constructing or analyzing a fault tree:

Missing or incomplete events
Over-complication
Unclear or ambiguous event descriptions
Success logic instead of failure logic
Not considering the operating environment
Not accounting for human error
Not accounting for different operating modes
Not considering interaction failures
Unrealistic probabilities
Incorrect use of logic gates

By following these best practices and recommendations, engineers and safety experts can maximize the effectiveness of fault tree analysis and improve the safety and reliability of complex systems.

Conclusion

Fault tree analysis is a widely used method for analyzing the safety and reliability of complex systems. It involves constructing a graphical model that represents the possible combinations of events and components that could lead to a specific top event. By identifying the potential causes of system failures or accidents, engineers can take steps to improve the design, maintenance, and operation of these systems to minimize risks. However, fault tree analysis also has some limitations that should be taken into consideration, including limited scope, uncertainty, and resource-intensiveness.

While fault tree analysis has proven to be a valuable tool in ensuring system safety and reliability, there is still room for further research and development in this field. Some potential areas of future research include exploring the use of dynamic fault trees, which incorporate time-dependent variables and events, as well as Bayesian Networks, which allow for more flexible and intuitive modeling of complex systems. Additionally, there is growing interest in the use of systems-theoretic process analysis (STPA) as an alternative or complement to fault tree analysis, particularly in the healthcare and transportation industries. Continued research and development in these areas could lead to even more effective and efficient methods for analyzing system safety and reliability.

Fault tree analysis has proven to be a valuable tool for improving the safety and reliability of complex systems in a variety of industries. However, it is important to approach fault tree analysis with care and follow best practices and recommendations to ensure its effectiveness. By doing so, engineers can gain valuable insights into the potential causes of system failures and take steps to minimize risks and improve system safety.

Further Study

Here is a list of books and other media that provide an introduction to fault tree analysis:

“Fault Tree Handbook” by W. Vesely et al., U.S. Nuclear Regulatory Commission
“Fault Tree Handbook with Aerospace Applications” by M. Stamatelatos et al., NASA
“Fault Tree Analysis” by Clifton Ericson
“Building a Fault Tree from a Schematic” by Isograph Software (Video)
“Fault Tree Analysis FTA” by the Bosch Group
“Fault tree analysis: A survey of the state-of-the-art in modeling, analysis and tools” by Ruijters et al.

These resources provide a good starting point for anyone seeking to learn more about fault tree analysis and related topics.

Stephen Thomas, PE, CFSE

Stephen is the founder and editor of functionalsafetyengineer.com. He is a functional safety expert with over 26 years of experience. He is currently a system safety engineer with a leading developer of autonomous vehicle technology. He is a member of the IEC 61508 and IEC 61511 functional safety committees. He is a member of the non-profit CFSE Advisory Board advising the exida CFSE program. He is the Director of Education & Professional Development for the International System Safety Society and an associate editor for the Journal of System Safety.
