Failure Analysis: Definition

Definition: Failure analysis is the systematic investigation of why an asset, component, or system failed to perform its intended function. It identifies the failure mode, the physical mechanism, and the underlying root cause so that corrective actions can be taken and the failure can be prevented from recurring.

What Is Failure Analysis?

Failure analysis is the structured process of examining a failed asset or component, understanding how and why it failed, and using those findings to guide corrective and preventive action.

It is used across manufacturing, energy, aerospace, food production, and any industry where equipment failure creates safety risk, production loss, or significant repair cost.

A failure analysis investigation answers three questions:

  • What failed and how did it fail (the failure mode and mechanism)?
  • Why did it fail (the root cause)?
  • What changes will prevent it from failing again?

Without answers to all three, maintenance teams are left fixing symptoms rather than eliminating causes.

Why Failure Analysis Matters in Maintenance

Most unplanned failures are not random. They follow identifiable patterns, triggered by specific conditions such as misalignment, contamination, improper lubrication, or design limitations.

Unplanned maintenance is often cited as costing two to five times as much as scheduled work once emergency labor, expedited parts, and lost production are counted.

Failure analysis converts reactive events into reliability knowledge. Each investigation produces findings that can reduce the failure rate of a similar asset class, adjust preventive maintenance intervals, or justify a design change that eliminates the failure mode permanently.

Over time, a structured failure analysis program raises Mean Time Between Failure (MTBF), lowers maintenance cost per unit of output, and builds the institutional knowledge that supports reliability-centered decision-making.
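
As a rough illustration of the MTBF metric mentioned above, the sketch below divides operating time by the number of failures in the same period. The function name and the pump figures are hypothetical, chosen only to show how fewer failures over the same operating hours raise MTBF.

```python
def mtbf_hours(total_operating_hours: float, failure_count: int) -> float:
    """Mean Time Between Failure = operating time / number of failures."""
    if failure_count <= 0:
        raise ValueError("MTBF is undefined without at least one failure")
    return total_operating_hours / failure_count


# Hypothetical figures for one pump before and after a corrective action.
before = mtbf_hours(total_operating_hours=4000, failure_count=8)
after = mtbf_hours(total_operating_hours=4000, failure_count=2)
print(f"MTBF before: {before:.0f} h, after: {after:.0f} h")
```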

Types of Failure Analysis

Different failure analysis methods serve different purposes. Some are reactive (applied after a failure occurs); others are proactive (used during design or to anticipate future failures).

Root Cause Analysis (RCA)

Root cause analysis is a structured method for tracing a failure back to its originating cause. It is the most common form of failure analysis applied after an unplanned event.

RCA uses tools such as the Five Whys, fishbone (Ishikawa) diagrams, and fault trees to systematically eliminate contributing factors and isolate the true root cause.

The output is a corrective action plan that addresses the cause rather than the symptom.

Failure Mode and Effects Analysis (FMEA)

FMEA is a proactive method that identifies potential failure modes before they occur and evaluates their effects on the system. Each failure mode is scored on severity, occurrence likelihood, and detectability to produce a Risk Priority Number (RPN).

FMEA is used during equipment design, process design, and maintenance planning to prioritize risk reduction efforts. Two common variants are DFMEA (applied at the design stage) and PFMEA (applied to manufacturing processes).
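
The RPN scoring described above can be sketched in a few lines. This is a minimal example, assuming the commonly used 1 to 10 scale for each score and a detection score where a higher number means the failure is harder to detect. The class name and the three failure modes are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (most severe); scale is an assumption
    occurrence: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (almost certainly detected) .. 10 (effectively undetectable)

    def rpn(self) -> int:
        """Risk Priority Number = severity x occurrence x detection."""
        for score in (self.severity, self.occurrence, self.detection):
            if not 1 <= score <= 10:
                raise ValueError(f"score {score} is outside the 1-10 scale")
        return self.severity * self.occurrence * self.detection


# Hypothetical failure modes for a pump skid.
modes = [
    FailureMode("bearing seizure", severity=8, occurrence=4, detection=3),
    FailureMode("seal leak", severity=5, occurrence=6, detection=2),
    FailureMode("coupling misalignment", severity=6, occurrence=5, detection=5),
]

# Highest RPN first, so risk reduction effort goes to the top of the list.
ranked = sorted(modes, key=lambda m: m.rpn(), reverse=True)
for mode in ranked:
    print(f"{mode.name}: RPN {mode.rpn()}")
```

Because the three scores are multiplied, very different combinations can produce the same RPN, so teams often review high-severity modes separately rather than relying on the product alone.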

Failure Mode Effects and Criticality Analysis (FMECA)

FMECA extends FMEA by adding a criticality ranking, which weights each failure mode by both probability of occurrence and severity of consequence. This produces a prioritized list of failure modes that require the most attention in maintenance planning.

FMECA is widely used in aerospace, defense, and process industries where certain failure modes carry catastrophic or safety-critical consequences.
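
One simple way to picture a criticality ranking is to multiply each failure mode's probability of occurrence by a weight for its severity class. The sketch below does that. The weights, class names, and failure modes are hypothetical and are not taken from any specific FMECA standard.

```python
# Hypothetical severity weights; a real program would define its own classes.
SEVERITY_WEIGHT = {"minor": 1, "marginal": 3, "critical": 7, "catastrophic": 10}


def criticality(probability: float, severity_class: str) -> float:
    """Weight a failure mode by probability of occurrence and severity."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability * SEVERITY_WEIGHT[severity_class]


# (failure mode, probability of occurrence over the period, severity class)
failure_modes = [
    ("gasket weep", 0.30, "minor"),
    ("shaft fracture", 0.05, "catastrophic"),
    ("impeller erosion", 0.20, "marginal"),
]

prioritized = sorted(
    failure_modes,
    key=lambda fm: criticality(fm[1], fm[2]),
    reverse=True,
)
for name, probability, severity_class in prioritized:
    print(f"{name}: {criticality(probability, severity_class):.2f}")
```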

Fault Tree Analysis (FTA)

Fault tree analysis is a top-down, deductive method. It starts with a defined undesirable event (such as a pump stopping) and works backward through a logic diagram to identify all the combinations of events and conditions that could cause it.

FTA is particularly useful for complex systems where multiple failure paths may lead to the same outcome. It is common in nuclear, chemical, and aviation industries.
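
A fault tree's logic can be sketched with AND and OR gates over basic events. The example below is a minimal sketch that assumes all basic events are statistically independent, which real systems often violate. The tree, event names, and probabilities are hypothetical.

```python
from dataclasses import dataclass
from math import prod
from typing import Tuple, Union


@dataclass(frozen=True)
class BasicEvent:
    name: str
    probability: float


@dataclass(frozen=True)
class Gate:
    name: str
    kind: str  # "AND" or "OR"
    inputs: Tuple[Union["Gate", BasicEvent], ...]


def event_probability(node: Union[Gate, BasicEvent]) -> float:
    """Probability of a node occurring, assuming independent basic events."""
    if isinstance(node, BasicEvent):
        return node.probability
    probs = [event_probability(child) for child in node.inputs]
    if node.kind == "AND":   # every input must occur
        return prod(probs)
    if node.kind == "OR":    # at least one input occurs
        return 1.0 - prod(1.0 - p for p in probs)
    raise ValueError(f"unknown gate kind: {node.kind}")


# Top event: the pump stops if the motor fails, or if both power feeds are lost.
power_lost = Gate("all power lost", "AND", (
    BasicEvent("primary feed lost", 0.10),
    BasicEvent("backup feed lost", 0.05),
))
pump_stops = Gate("pump stops", "OR", (
    BasicEvent("motor fails", 0.02),
    power_lost,
))
print(f"P(pump stops) = {event_probability(pump_stops):.4f}")
```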

Failure Reporting Analysis and Corrective Action System (FRACAS)

FRACAS is a closed-loop process for capturing failure data, analyzing it, and verifying that corrective actions are effective. Unlike a single-event investigation, FRACAS operates continuously across all assets and failure events.

FRACAS feeds data into reliability programs and enables trend analysis across similar asset populations, making it one of the most powerful long-term failure analysis frameworks.
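
The trend analysis described above often starts with a simple count of failure reports by asset class and failure mode. The sketch below shows that idea with hypothetical report data. A real FRACAS would read these records from its own database.

```python
from collections import Counter

# Hypothetical failure reports: (asset class, failure mode).
reports = [
    ("centrifugal pump", "seal leak"),
    ("centrifugal pump", "bearing wear"),
    ("centrifugal pump", "seal leak"),
    ("conveyor motor", "bearing wear"),
    ("centrifugal pump", "seal leak"),
    ("conveyor motor", "winding short"),
]

# Count each (asset class, failure mode) pair and list the most frequent first.
counts = Counter(reports)
for (asset_class, failure_mode), n in counts.most_common():
    share = n / len(reports)
    print(f"{asset_class} / {failure_mode}: {n} reports ({share:.0%})")

top_pair, top_count = counts.most_common(1)[0]
```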

Field Failure Analysis

Field failure analysis investigates failures that occur in operating environments rather than in controlled test conditions. It is used when a component fails in service and the failure must be analyzed in context, incorporating actual operating conditions, load cycles, and environmental factors.

Physical Failure Analysis

Physical failure analysis examines the failed component directly using material science techniques: metallurgical analysis, scanning electron microscopy, spectroscopy, and hardness testing.

This method identifies the physical mechanism (corrosion, fatigue, wear, fracture, overheating) and supports the broader root cause investigation by confirming what happened to the material.

Failure Analysis Process: Step by Step

The failure analysis process follows a consistent sequence regardless of the specific method used.

Step 1: Define the Problem

Describe what failed, when, under what operating conditions, and what the consequences were. A precise problem statement prevents scope creep and focuses the investigation on the right evidence.

Step 2: Collect Evidence

Gather the failed component, maintenance records, sensor data, operator logs, and any photographs taken at the time of failure. Preserve physical evidence before the site is cleared or the component is discarded.

Maintenance history from a CMMS and condition monitoring data are primary evidence sources that reveal trends leading up to the failure.

Step 3: Examine the Failed Component

Apply visual inspection first, then non-destructive testing, and finally destructive examination where necessary. Document what is found at each stage before proceeding to the next.

Techniques include vibration analysis, oil analysis, thermography, ultrasonic inspection, and metallurgical examination depending on the asset type and failure mechanism.

Step 4: Identify the Failure Mode and Mechanism

The failure mode is the way in which the asset stopped performing its function (fracture, excessive wear, loss of seal, electrical short). The failure mechanism is the physical or chemical process that caused the failure mode (fatigue, corrosion, erosion, thermal degradation).

These two elements describe what happened, but not yet why.

Step 5: Determine the Root Cause

Trace the failure mode and mechanism back to their originating cause using a structured method such as the Five Whys, fishbone diagram, or fault tree.

Root causes typically fall into three categories:

  • Physical causes: material defect, incorrect specification, manufacturing flaw
  • Human causes: incorrect installation, inadequate lubrication, improper operation
  • Latent/organizational causes: inadequate procedures, lack of training, poor inspection intervals

Step 6: Develop and Implement Corrective Actions

Define specific actions that address the root cause, assign owners, and set completion dates. Corrective actions may include design changes, updated procedures, new inspection tasks, spare parts adjustments, or operator training.

Verify effectiveness after implementation by monitoring the asset and tracking whether the failure recurs.

Failure Analysis vs Root Cause Analysis

These terms are often used interchangeably, but they have distinct meanings in a rigorous maintenance program.

| Aspect | Failure Analysis | Root Cause Analysis (RCA) |
|---|---|---|
| Scope | Broad: covers physical examination, mechanism identification, and root cause | Focused: traces symptoms to their originating cause |
| Timing | Reactive (post-failure) or proactive (FMEA, FRACAS) | Primarily reactive, applied after an event |
| Output | Failure mode, mechanism, root cause, corrective action | Root cause and corrective action |
| Relationship | Contains RCA as one phase | Is a component of failure analysis |
| Tools | Physical inspection, lab analysis, FMEA, FRACAS, FTA, RCA | Five Whys, fishbone diagram, fault tree |
| Who uses it | Reliability engineers, metallurgists, maintenance teams | Maintenance managers, quality teams, operations leaders |

In practice: failure analysis is the full investigation; RCA is the analytical phase that identifies the originating cause within that investigation.

How Failure Analysis Improves Maintenance Strategy

Failure analysis findings have direct applications in how maintenance programs are designed and executed.

Feeding Reliability-Centered Maintenance

Reliability-centered maintenance (RCM) relies on knowing which failure modes are most likely, most consequential, and most detectable. Failure analysis data populates these inputs, making RCM decisions evidence-based rather than assumption-based.

Adjusting Preventive Maintenance Intervals

When failure analysis reveals that an asset is consistently failing between scheduled PM cycles, the interval is too long or the PM task does not address the failure mode. When assets are consistently replaced well before they show any sign of degradation, the interval is too short. Failure analysis supplies the data that justifies interval adjustments.

This directly improves preventive maintenance efficiency and reduces unnecessary labor and parts costs.
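
One way to check whether an interval is too long is to look at how many recorded failures occurred before the next PM was due. The sketch below assumes each failure record carries the number of days in service since the last PM. The function name and the figures are hypothetical.

```python
from typing import Sequence


def fraction_failing_before_pm(
    days_since_pm_at_failure: Sequence[float], pm_interval_days: float
) -> float:
    """Share of recorded failures that happened before the next PM was due."""
    if not days_since_pm_at_failure:
        raise ValueError("at least one failure record is required")
    early = sum(1 for age in days_since_pm_at_failure if age < pm_interval_days)
    return early / len(days_since_pm_at_failure)


# Hypothetical records: days in service since the last PM when each failure occurred.
ages = [62, 75, 81, 58, 120]
share_early = fraction_failing_before_pm(ages, pm_interval_days=90)
print(f"{share_early:.0%} of failures occurred inside the 90-day PM interval")
```

A high share suggests the interval or the task content needs review, but the number alone does not say which.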

Enabling Predictive Maintenance

Understanding which physical mechanisms precede failure allows teams to select the right sensors and monitoring techniques. If failure analysis shows that bearing degradation precedes motor failure, vibration and temperature monitoring can detect that degradation early, enabling predictive maintenance intervention before the asset fails.

The P-F curve formalizes this relationship: the gap between potential failure (when a defect becomes detectable) and functional failure (when the asset stops working) defines how much time is available for condition-based intervention.
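
The P-F interval can be computed directly from the two timestamps. The sketch below also applies a common rule of thumb that inspections should be spaced no further apart than half the P-F interval. That rule is an assumption here, not a requirement from this article, and the dates are hypothetical.

```python
from datetime import datetime, timedelta


def pf_interval(potential_failure: datetime, functional_failure: datetime) -> timedelta:
    """Time between a defect becoming detectable and the asset failing."""
    if functional_failure <= potential_failure:
        raise ValueError("functional failure must come after potential failure")
    return functional_failure - potential_failure


# Hypothetical dates taken from one investigation.
detected = datetime(2024, 3, 1)   # defect first detectable (point P)
failed = datetime(2024, 4, 12)    # asset stopped working (point F)

interval = pf_interval(detected, failed)
max_inspection_gap = interval / 2  # rule-of-thumb assumption: half the P-F interval
print(f"P-F interval: {interval.days} days, inspect at least every {max_inspection_gap.days} days")
```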

Reducing Repeat Failures

Without failure analysis, teams correct the same failure repeatedly. With it, they identify whether the cause is a design issue, a procedural gap, or an operational condition and address it at the source. This is the mechanism by which MTBF increases over time.

Improving Spare Parts Decisions

Failure analysis reveals which components fail most frequently and under what conditions. This data supports more accurate spare parts inventory planning, reducing both stockouts and excess inventory.

Supporting Criticality Analysis

Criticality analysis ranks assets by the consequence of their failure. Failure analysis provides the failure rate and severity data that make criticality rankings accurate.

Failure Analysis in the Context of Asset Health

Failure analysis is most effective when integrated with continuous asset health monitoring. Real-time sensor data captures the operating conditions at the time of failure, providing context that a post-failure inspection alone cannot recover.

Condition monitoring platforms that record vibration, temperature, pressure, and electrical signatures create a pre-failure data trail. When a failure does occur, analysts can look back through that trail to identify the onset of degradation and correlate it with specific operating events.

This integration shortens investigation time, improves accuracy, and makes it easier to detect the same signature in similar assets before they fail.
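
Looking back through a pre-failure data trail can be sketched as a search for the first sustained departure from a healthy baseline. The example below is one simple approach, assuming the first few readings represent healthy operation and using a mean plus three standard deviations limit. The function, thresholds, and vibration values are hypothetical.

```python
from statistics import mean, stdev
from typing import Optional, Sequence


def degradation_onset(
    readings: Sequence[float],
    baseline_size: int,
    sigmas: float = 3.0,
    run_length: int = 3,
) -> Optional[int]:
    """Index where readings first stay above the baseline limit for
    `run_length` consecutive samples, or None if that never happens."""
    if baseline_size < 2 or baseline_size >= len(readings):
        raise ValueError("baseline_size must be >= 2 and smaller than the series")
    baseline = readings[:baseline_size]
    limit = mean(baseline) + sigmas * stdev(baseline)
    run = 0
    for i in range(baseline_size, len(readings)):
        run = run + 1 if readings[i] > limit else 0
        if run == run_length:
            return i - run_length + 1  # first sample of the sustained run
    return None


# Hypothetical vibration trend (mm/s) recorded before a bearing failure.
vibration = [2.0, 2.1, 1.9, 2.0, 2.1, 1.9, 2.0, 2.6, 2.1, 2.7, 2.9, 3.2, 3.8]
onset = degradation_onset(vibration, baseline_size=6)
print(f"Sustained degradation begins at sample index {onset}")
```

Requiring several consecutive readings above the limit keeps a single spike, such as the 2.6 reading above, from being reported as the onset.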

Corrective maintenance workflows benefit from this integration as well: corrective maintenance work orders can be linked to the failure analysis findings, ensuring that the repair addresses the confirmed cause rather than just restoring function.

Tools and Software Used in Failure Analysis

Diagnostic Instruments

  • Vibration analyzers: detect mechanical imbalance, misalignment, bearing defects, and resonance
  • Thermographic cameras: identify hotspots from electrical faults or friction
  • Ultrasonic detectors: detect leaks, electrical discharge, and early-stage bearing faults
  • Oil analysis kits: identify contamination, wear particles, and lubricant degradation
  • Non-destructive testing equipment: includes ultrasonic thickness gauges, magnetic particle testing, and eddy current instruments

Software Platforms

  • Condition monitoring software: aggregates sensor data, detects anomalies, and triggers alerts when signatures indicate developing faults
  • CMMS (Computerized Maintenance Management System): stores maintenance history, work order data, and corrective action records that provide the operational context for failure investigations
  • FMEA software: structured tools for completing and documenting FMEA and FMECA worksheets
  • FRACAS platforms: closed-loop systems that capture failure reports, track investigations, and verify corrective action effectiveness
  • APM software: asset performance management platforms that combine monitoring, analytics, and reliability workflows in one interface

Integrated platforms that connect condition data with maintenance workflows give reliability teams the fastest path from failure detection to verified resolution.

Failure Analysis: Reactive vs Proactive Applications

| Application Type | When Applied | Primary Method | Goal |
|---|---|---|---|
| Post-failure investigation | After an unplanned failure occurs | RCA, physical examination | Prevent recurrence |
| Design review | During product or equipment design | DFMEA, FMECA | Eliminate failure modes before manufacture |
| Process review | During process design or change | PFMEA | Identify process-induced failure risks |
| Continuous reliability improvement | Ongoing, across all failure events | FRACAS | Build failure knowledge base, track trends |
| System safety analysis | For safety-critical systems | FTA | Map all failure paths leading to a hazardous event |

Common Mistakes in Failure Analysis

Jumping to corrective action before confirming the root cause. Teams under production pressure often repair and restart equipment without completing the investigation. The failure recurs because the cause was never addressed.

Confusing the failure mode with the root cause. A bearing failure is a failure mode. The root cause might be contaminated lubricant, improper installation, or shaft misalignment. Fixing the bearing without addressing the cause does not solve the problem.

Discarding or contaminating physical evidence. Once a site is cleaned or a component is discarded, critical evidence is gone. Preserving the failed component and documenting the surrounding conditions before any work begins is essential.

Investigating in isolation. Failure analysis produces the most value when findings are shared across similar asset populations and fed into a FRACAS or reliability database. Investigations that remain siloed in one site or one team do not scale.

Not verifying corrective action effectiveness. A failure analysis is not complete until the corrective action has been implemented and confirmed to work. Without follow-up, the investigation cycle is broken.

Frequently Asked Questions

When should a failure analysis be conducted?

Failure analysis should be conducted after any unplanned failure that resulted in significant downtime, safety risk, quality loss, or cost. It should also be triggered when a recurring failure pattern is identified, even if individual events seem minor. Some organizations set thresholds based on repair cost or production impact to determine when a formal investigation is warranted versus a standard corrective maintenance response.

Who conducts failure analysis?

The lead investigator is typically a reliability engineer or a senior maintenance technician with experience in the relevant asset type. Physical failure analysis involving material science or metallurgy may require a specialist or laboratory. FMEA and FMECA reviews are typically conducted by cross-functional teams including maintenance, operations, engineering, and quality.

How long does a failure analysis take?

A basic post-failure investigation using RCA can take anywhere from a few hours to several days, depending on evidence availability and investigation depth. Physical failure analysis involving laboratory testing may take weeks. Proactive methods such as FMEA are typically planned activities that run over days or weeks as part of a project.

What is the difference between failure analysis and failure mode analysis?

Failure mode analysis identifies and documents the ways in which an asset can fail, which is one component of a broader failure analysis investigation. Failure analysis encompasses the full investigation: identifying the failure mode, determining the physical mechanism, tracing the root cause, and defining corrective actions.

Is failure analysis the same as the Failure Finding Interval (FFI)?

No. Failure analysis is the investigation process applied after (or in anticipation of) a failure. The Failure Finding Interval is a maintenance task interval used for hidden failure modes, specifying how often a protective device or redundant function must be tested to verify it is still operational. Both concepts relate to managing failures but operate at different stages of the reliability process.

Can failure analysis be automated?

Parts of the process can be accelerated by technology. Condition monitoring platforms with anomaly detection can automatically flag deviating signals and trigger an investigation workflow. AI-assisted diagnostic tools can suggest probable failure causes based on sensor patterns. However, the judgment-intensive steps, including root cause determination and corrective action development, still require experienced human analysis.

The Bottom Line

Failure analysis is the discipline that prevents maintenance from becoming a permanent exercise in damage control. When every significant failure is analyzed to its root cause, and when corrective actions are implemented and verified, the failure rate of that failure mode decreases over time. The alternative, repairing without understanding, produces the same failures on the same schedule indefinitely.

The most effective failure analysis programs are integrated with the CMMS and condition monitoring infrastructure. When failure codes are standardized, failure events trigger investigation workflows automatically, and findings are stored with the asset record, the accumulated history becomes a searchable database of lessons that improves maintenance planning across the entire asset base over time.
