Root Cause Analysis: Definition
Key Takeaways
- RCA investigates the origin of a failure, not just its surface symptoms, so corrective actions produce lasting results instead of temporary fixes.
- The four primary methods are the 5 Whys, the Fishbone (Ishikawa) diagram, Fault Tree Analysis (FTA), and Failure Mode and Effects Analysis (FMEA); each suits different problem types and complexity levels.
- A completed RCA produces three outputs: the physical cause (what failed), the human cause (what decision or action allowed it), and the latent cause (what systemic condition made the failure possible).
- RCA is not reserved for catastrophic failures. Applying it to recurring low-severity faults often delivers greater total reliability improvement than analysing single high-impact events.
- Integrating RCA findings with condition monitoring and a CMMS ensures corrective actions are tracked, verified, and incorporated into future maintenance planning.
What Is Root Cause Analysis?
Root cause analysis is a formal investigation process that traces a failure or quality problem back to its origin. Rather than stopping at the immediate cause (a bearing seized, a motor tripped, a valve leaked), RCA continues asking why until it reaches the underlying condition or decision that made the failure possible in the first place. That underlying condition is the root cause, and correcting it is the only way to eliminate the failure mode permanently.
In a manufacturing and maintenance context, RCA sits at the intersection of reliability engineering and continuous improvement. It is the practical mechanism that converts failure data into process change. Without RCA, teams repair the same equipment repeatedly, consuming labour, parts, and production capacity in a loop that never closes. With it, each failure becomes a learning event that strengthens the overall maintenance program.
Modern RCA practice recognises three causal layers. The physical cause is the component or material that failed. The human cause is the act or omission that triggered or failed to prevent the failure. The latent cause is the organisational condition (an inadequate procedure, a missing inspection, insufficient training) that allowed the human and physical causes to align. Effective RCA addresses all three layers; correcting only the physical cause is the most common reason failures repeat.
Root Cause vs. Symptom: Why the Distinction Matters
Every failure presents visible symptoms: abnormal vibration, elevated temperature, unexpected shutdown, reduced throughput. Symptoms are what prompt maintenance teams to respond. The physical cause is one level deeper: the bearing that failed, the insulation that broke down, the seal that wore through. The root cause goes deeper still, to the reason the physical cause occurred.
Consider a pump that repeatedly overheats. The symptom is high bearing temperature. The physical cause might be lubrication breakdown. The root cause might be that the grease specification in the maintenance procedure was written for a lower-load application and has never been updated since the pump duty was changed. Replacing the bearing each time the overheating occurs addresses the symptom. Updating the lubrication specification and adding a temperature alarm addresses the root cause. After the RCA-driven fix, the failure stops recurring.
This distinction matters financially as well. Over a multi-year maintenance cycle, the cumulative downtime cost of a single high-frequency, medium-severity failure mode can far exceed that of a rare catastrophic event. RCA is the tool that reduces the frequency of repeating failures and controls the cumulative cost they generate.
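As a rough illustration of this arithmetic, the sketch below compares the cumulative downtime cost of a recurring fault with that of a one-off major failure. Every figure (event counts, durations, hourly cost) is a hypothetical assumption chosen for illustration, not data from any plant.

```python
def downtime_cost(events: int, hours_per_event: float, cost_per_hour: float) -> float:
    """Total downtime cost over a period: events x duration x hourly cost."""
    return events * hours_per_event * cost_per_hour


# Hypothetical three-year window on one production line.
COST_PER_HOUR = 8_000.0  # assumed cost of lost production per hour

# Recurring medium-severity fault: once a month, 1.5 hours each time.
recurring = downtime_cost(events=36, hours_per_event=1.5, cost_per_hour=COST_PER_HOUR)

# Rare catastrophic failure: once in the window, 24 hours of downtime.
catastrophic = downtime_cost(events=1, hours_per_event=24.0, cost_per_hour=COST_PER_HOUR)

print(f"Recurring fault:      {recurring:,.0f}")
print(f"Catastrophic failure: {catastrophic:,.0f}")
```

Under these assumed numbers the recurring fault costs more than twice as much as the single major event, which is the pattern the paragraph above describes.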
The Four Primary RCA Methods
5 Whys
The 5 Whys is the simplest and most widely used RCA technique. Starting from the failure statement, the analyst asks "Why did this occur?" and records the answer. That answer becomes the next problem statement, and the process repeats until no further useful answer can be given. Five iterations is a guideline, not a rule; some problems resolve in three whys, others require seven or more.
Example: A conveyor belt drive motor fails unexpectedly.
- Why did the motor fail? Because it overheated.
- Why did it overheat? Because airflow to the cooling fins was blocked.
- Why was airflow blocked? Because dust had accumulated on the motor casing.
- Why had dust accumulated? Because there was no cleaning task in the preventive maintenance schedule for this motor.
- Why was there no cleaning task? Because the motor was added to the line during a production expansion and was never fully onboarded into the CMMS maintenance plan.
The root cause is not dust; it is a gap in the asset onboarding process. The corrective action is to add a PM cleaning task and audit newly installed assets for completeness of maintenance coverage. The 5 Whys is most effective for straightforward, single-strand cause chains and requires no specialist tools or software.
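The conveyor motor chain above can also be recorded as structured data so it can be stored alongside the work order. This is a minimal sketch; the `Why` class and `root_cause` helper are hypothetical names, not part of any CMMS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Why:
    question: str
    answer: str


def root_cause(chain: list[Why]) -> str:
    """Return the final answer in the chain, which the analyst treats as the root cause."""
    if not chain:
        raise ValueError("a 5 Whys chain needs at least one question")
    return chain[-1].answer


conveyor_motor = [
    Why("Why did the motor fail?", "It overheated."),
    Why("Why did it overheat?", "Airflow to the cooling fins was blocked."),
    Why("Why was airflow blocked?", "Dust had accumulated on the motor casing."),
    Why("Why had dust accumulated?", "No cleaning task existed in the PM schedule."),
    Why("Why was there no cleaning task?",
        "The motor was never fully onboarded into the CMMS maintenance plan."),
]

for depth, step in enumerate(conveyor_motor, start=1):
    print(f"{depth}. {step.question} {step.answer}")
print("Root cause:", root_cause(conveyor_motor))
```

Keeping the whole chain, rather than only the final answer, preserves the evidence trail for anyone who later reviews the corrective action.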
Fishbone (Ishikawa) Diagram
The Fishbone diagram, developed by quality engineer Kaoru Ishikawa, maps causes visually. The effect (failure) is placed at the head of the fish. Major cause categories branch off the spine, and specific causes are added as smaller bones on each branch. In maintenance, the standard categories are: People, Machine, Method, Material, Measurement, and Environment (the 6Ms framework, in which People and Environment are traditionally labelled Manpower and Mother Nature).
Example: A hydraulic press produces inconsistent clamp force, causing rejected parts to reach assembly.
- People: operator not following warm-up sequence; technician calibrating pressure gauge with incorrect reference standard.
- Machine: accumulator pre-charge pressure has drifted; seal wear not detected between planned PM intervals.
- Method: maintenance procedure does not specify accumulator check frequency.
- Material: hydraulic fluid viscosity grade was changed during the last top-up without engineering approval.
The Fishbone diagram excels at revealing the breadth of contributing factors across categories, making it particularly useful when a cross-functional team suspects multiple independent causes. It is less effective at establishing causal sequence and is often followed by the 5 Whys on the most promising branches to reach the true root cause.
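The hydraulic press example can be captured as a simple mapping from category to causes. The sketch below is one possible representation, with hypothetical helper names; it also lists the categories that have no causes yet, which can prompt the team to check whether they were considered at all.

```python
SIX_M = ["People", "Machine", "Method", "Material", "Measurement", "Environment"]


def build_fishbone(effect, causes):
    """Group (category, cause) pairs under the six standard categories."""
    bones = {category: [] for category in SIX_M}
    for category, cause in causes:
        if category not in bones:
            raise ValueError(f"unknown category: {category}")
        bones[category].append(cause)
    return {"effect": effect, "bones": bones}


def unexplored(fishbone):
    """Categories with no recorded causes yet."""
    return [category for category, causes in fishbone["bones"].items() if not causes]


press = build_fishbone(
    "Inconsistent clamp force on hydraulic press",
    [
        ("People", "Operator not following warm-up sequence"),
        ("People", "Pressure gauge calibrated with incorrect reference standard"),
        ("Machine", "Accumulator pre-charge pressure has drifted"),
        ("Machine", "Seal wear not detected between PM intervals"),
        ("Method", "Procedure does not specify accumulator check frequency"),
        ("Material", "Fluid viscosity grade changed without engineering approval"),
    ],
)

print("Categories still to review:", unexplored(press))
```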
Fault Tree Analysis (FTA)
Fault Tree Analysis is a top-down, deductive method that uses formal Boolean logic (AND/OR gates) to model how combinations of component failures or human errors can propagate to a defined top-level event. FTA originated in aerospace reliability engineering and is standard practice in high-consequence industries including oil and gas, nuclear, and chemical processing.
Example: An emergency shutdown system fails to activate when a pressure vessel exceeds its safe operating limit. The fault tree maps every logical path to that top event: sensor failure, signal cable open circuit, logic controller software fault, solenoid valve mechanical failure, and combinations of partial failures that together defeat the system. By calculating probabilities at each branch, engineers can identify the weakest points and prioritise design changes or redundancy additions.
FTA is quantitative when failure probability data is available, and qualitative when it is not. It requires specialist training and is typically reserved for safety-critical systems and complex multi-component failure scenarios.
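When probabilities are available, the gate arithmetic is straightforward if the input events are assumed to be independent: an AND gate multiplies the input probabilities, and an OR gate is one minus the probability that none of the inputs occurs. The sketch below applies this to a simplified version of the shutdown example. The probabilities and the redundant-sensor arrangement are hypothetical assumptions for illustration only.

```python
from math import prod


def and_gate(probabilities):
    """Output occurs only if ALL inputs occur (independent inputs assumed)."""
    return prod(probabilities)


def or_gate(probabilities):
    """Output occurs if ANY input occurs (independent inputs assumed)."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical per-demand failure probabilities.
p_sensor = and_gate([0.01, 0.01])  # assumed: two redundant sensors must both fail
p_cable = 0.001                    # signal cable open circuit
p_logic = 0.002                    # logic controller software fault
p_solenoid = 0.005                 # solenoid valve mechanical failure

# Top event: the shutdown system fails to act if any branch fails.
p_top = or_gate([p_sensor, p_cable, p_logic, p_solenoid])
print(f"P(shutdown fails on demand) = {p_top:.5f}")
```

With these assumed values the solenoid valve dominates the top-event probability, so it would be the first candidate for redundancy or a design change.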
FMEA (Failure Mode and Effects Analysis)
FMEA is a proactive RCA-adjacent method. Rather than investigating a failure that has already occurred, FMEA systematically anticipates every way a system, component, or process could fail and evaluates the consequences of each failure mode. Each failure mode is scored on three dimensions: Severity (how serious is the effect?), Occurrence (how often is this failure mode likely to happen?), and Detection (how likely is it that the failure would go undetected until it causes harm?). A higher score means higher risk on every dimension, so a failure mode that is hard to detect receives a high Detection score. The three scores are multiplied to produce a Risk Priority Number (RPN).
High RPN failure modes are prioritised for corrective action before they produce real failures. FMEA is used during equipment design, during process change reviews, and during reliability improvement programs as a forward-looking complement to retrospective RCA. It is particularly valuable during the design phase of new equipment because changes made on paper are far less expensive than changes made after installation.
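The RPN calculation and ranking can be sketched in a few lines. The example assumes the common 1 to 10 scale for each score; the failure modes and their scores are invented for illustration.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number: Severity x Occurrence x Detection."""
    for name, score in (("severity", severity),
                        ("occurrence", occurrence),
                        ("detection", detection)):
        if not 1 <= score <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {score}")
    return severity * occurrence * detection


# Hypothetical worksheet rows: (failure mode, severity, occurrence, detection).
failure_modes = [
    ("Seal leak", 6, 5, 4),
    ("Bearing seizure", 8, 3, 6),
    ("Coupling misalignment", 5, 4, 3),
]

# Highest RPN first, so the top rows are addressed first.
ranked = sorted(failure_modes, key=lambda fm: rpn(*fm[1:]), reverse=True)
for name, s, o, d in ranked:
    print(f"{name}: RPN {rpn(s, o, d)}")
```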
Comparing RCA Methods
| Method | Approach | Best For | Skill Level | Output |
|---|---|---|---|---|
| 5 Whys | Iterative questioning | Simple, single-strand failures; rapid investigations | Low; any technician can apply it | Linear cause chain leading to one root cause |
| Fishbone (Ishikawa) | Categorical cause mapping | Multi-factor problems; cross-functional teams | Low to medium; facilitation skills helpful | Visual cause map; hypothesis list for further investigation |
| Fault Tree Analysis | Top-down Boolean logic tree | Safety-critical systems; complex multi-failure scenarios | High; requires reliability engineering background | Logic diagram; minimal cut sets; failure probability (if quantified) |
| FMEA | Proactive failure mode scoring | Design review; process change; new asset commissioning | Medium; structured worksheet and team required | RPN-ranked failure mode register; prioritised action list |
The RCA Process: Step by Step
Step 1: Define the Problem
Write a precise problem statement that describes the failure, the asset affected, when and where it occurred, and the measurable impact. Vague problem statements produce vague investigations. "Pump P-102 lost prime three times in the past 30 days, each incident causing 45 to 90 minutes of unplanned downtime on the packaging line" is an effective problem statement. "The pump keeps failing" is not.
Step 2: Collect Data
Before interviewing anyone or drawing diagrams, gather objective evidence. Pull work order history from the CMMS, review sensor trend data from the period before failure, collect maintenance logs, inspection records, and any alarm histories. Physical evidence from the failed components should be preserved rather than discarded. Time-sequenced evidence is particularly valuable because it reveals the order in which conditions developed.
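One simple way to build time-sequenced evidence is to merge events from each source into a single timeline sorted by timestamp. The sketch below assumes each source has already been exported as a list of (timestamp, source, description) tuples; the events themselves are hypothetical.

```python
from datetime import datetime


def build_timeline(*sources):
    """Merge event lists from several sources and sort them by timestamp."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event[0])


# Hypothetical records for one pump.
work_orders = [
    (datetime(2024, 3, 4, 9, 0), "CMMS", "Seal replaced"),
]
alarms = [
    (datetime(2024, 3, 18, 14, 5), "Alarm log", "Low suction pressure"),
    (datetime(2024, 3, 18, 14, 20), "Alarm log", "Pump tripped"),
]
sensor_events = [
    (datetime(2024, 3, 18, 13, 40), "Vibration sensor", "Vibration trending upward"),
]

timeline = build_timeline(work_orders, alarms, sensor_events)
for timestamp, source, description in timeline:
    print(f"{timestamp:%Y-%m-%d %H:%M}  {source:<16}  {description}")
```

Reading the merged list from top to bottom shows which condition appeared first, which is often the strongest clue about where the causal chain begins.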
Step 3: Map the Causal Chain
Apply the appropriate RCA method based on problem complexity. For straightforward failures, the 5 Whys or a simple timeline analysis is sufficient. For complex, multi-variable failures, the Fishbone diagram helps ensure no major cause category is overlooked, and FTA provides rigorous logical structure when system safety is involved. Document every causal link with supporting evidence; unsupported assertions weaken the analysis and make corrective action harder to justify.
Step 4: Identify the Root Cause
The root cause is the point at which the causal chain terminates: the condition that, if changed, would prevent the failure from recurring. Test this by asking whether removing the identified root cause would have prevented the failure. If the answer is yes, the root cause is correctly identified. If the answer is "probably, but only if another condition also changed," further investigation is needed to separate true root causes from contributing factors.
Step 5: Develop Corrective Actions
Generate corrective actions for each causal layer. Physical causes typically require engineering or component-level fixes. Human causes require procedural updates, training, or job aids. Latent causes require systemic changes: revised maintenance strategies, updated PM schedules, improved inspection criteria, or management system changes. Each corrective action should be specific, measurable, and assigned to a named owner with a completion date.
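A corrective action record can enforce the owner and due-date requirements and make it easy to see whether all three causal layers are covered. The following is a minimal sketch; the class, field names, owners, and dates are hypothetical and not taken from any particular CMMS.

```python
from dataclasses import dataclass
from datetime import date

LAYERS = ("physical", "human", "latent")


@dataclass
class CorrectiveAction:
    layer: str
    description: str
    owner: str
    due: date

    def __post_init__(self):
        if self.layer not in LAYERS:
            raise ValueError(f"layer must be one of {LAYERS}, got {self.layer!r}")
        if not self.owner.strip():
            raise ValueError("every corrective action needs a named owner")


def missing_layers(actions):
    """Causal layers that have no corrective action assigned yet."""
    covered = {action.layer for action in actions}
    return [layer for layer in LAYERS if layer not in covered]


plan = [
    CorrectiveAction("physical", "Replace overheated pump bearing",
                     "Maintenance technician A", date(2024, 5, 10)),
    CorrectiveAction("latent", "Update grease specification for current pump duty",
                     "Reliability engineer B", date(2024, 6, 1)),
]

print("Layers without an action:", missing_layers(plan))
```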
Step 6: Implement and Verify
Implement corrective actions through the work order system and track their completion in the CMMS. After implementation, monitor the asset to verify that the failure mode has been eliminated. If the failure recurs, the analysis must be reopened; recurrence is evidence that the root cause was not correctly identified or that the corrective action was not fully effective. Verification is not optional; an RCA that is not verified has no reliability value.
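The verification rule can be expressed as a small check against the failure history. The sketch below assumes a minimum monitoring window of 180 days before an RCA is considered verified; that window is an illustrative assumption and should be set according to the expected failure interval of the asset.

```python
from datetime import date


def recurrences(failure_dates, implemented_on):
    """Failures of the same mode recorded after the corrective action went live."""
    return sorted(d for d in failure_dates if d > implemented_on)


def verify(failure_dates, implemented_on, today, min_days=180):
    """Classify the RCA as needing reopening, still under watch, or verified."""
    if recurrences(failure_dates, implemented_on):
        return "reopen analysis"
    if (today - implemented_on).days < min_days:
        return "keep monitoring"
    return "verified"


history = [date(2024, 1, 10), date(2024, 2, 20)]  # hypothetical failures before the fix
fix_date = date(2024, 3, 1)

print(verify(history, fix_date, today=date(2024, 4, 1)))
print(verify(history, fix_date, today=date(2024, 10, 1)))
print(verify(history + [date(2024, 4, 5)], fix_date, today=date(2024, 10, 1)))
```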
Step 7: Share Findings
Distribute the RCA findings to all teams managing similar equipment. A failure analysis completed on a pump in one facility is directly applicable to identical pumps in other facilities. Sharing findings multiplies the return on the investigation investment and builds institutional knowledge that reduces the organisation's overall failure rate over time.
Practical Examples from Industrial Equipment
Centrifugal Pump Seal Failure (Chemical Plant)
A centrifugal pump handling a mildly corrosive process fluid experienced repeated mechanical seal failures at an average interval of 60 days, against an expected seal life of 18 months. Initial repairs replaced the seal each time. An RCA using the 5 Whys revealed: the seal was failing due to dry running; dry running was occurring because the automatic priming valve was sticking closed; the valve was sticking because it was specified for clean water service and had not been upgraded when the process fluid was changed; the specification was not updated because there was no formal management-of-change process requiring engineering review when process fluids were changed. The root cause was a gap in the management-of-change procedure. The corrective action replaced the valve with a chemically compatible type and established a formal MOC checklist. Seal life returned to specification.
Induction Motor Bearing Failure (Food and Beverage Plant)
A 75 kW induction motor driving a mixer had its drive-end bearing replaced four times in 18 months. Each replacement was treated as a routine corrective task. A Fishbone RCA identified causes across three categories: Machine (incorrect bearing fit tolerance allowing micro-movement), Method (technician using a hammer rather than an induction heater for bearing installation, causing installation damage), and Measurement (no vibration baseline was taken after installation to confirm correct fit). The root cause spanned two categories: an inadequate bearing installation procedure and absence of post-installation verification. Updated procedures specifying thermal installation and mandatory post-installation vibration analysis eliminated the failure; the bearing subsequently ran without replacement for more than two years.
Hydraulic System Contamination (Automotive Press Line)
A press line suffered eight servo valve failures in a single quarter. Each valve was replaced under warranty, but valve life remained short. An RCA combining Fishbone and 5 Whys analysis found that hydraulic fluid particle contamination was exceeding ISO 4406 cleanliness targets. The fluid was clean when sampled at the reservoir but highly contaminated at the valve block. The investigation found that the return line filter element had not been changed in 14 months, even though the OEM specification required a change every six months, because the filter change task had been entered in the CMMS with the wrong interval. The PM interval data entry error was the root cause. Correcting the CMMS PM frequency and adding a filter differential pressure alarm resolved the contamination problem, and valve life returned to the OEM expected range.
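A mismatch between the PM interval held in the CMMS and the OEM specification, as in the press line case, is the kind of error a periodic audit can catch. The sketch below compares two mappings of task name to interval in days. The task names and values are hypothetical, and a real audit would read them from a CMMS export and the OEM manuals.

```python
def interval_mismatches(cmms_intervals, oem_intervals):
    """Return {task: (cmms_days, oem_days)} for every task where the two disagree."""
    mismatches = {}
    for task, oem_days in oem_intervals.items():
        cmms_days = cmms_intervals.get(task)  # None if the task is missing from the CMMS
        if cmms_days != oem_days:
            mismatches[task] = (cmms_days, oem_days)
    return mismatches


# Hypothetical intervals in days.
oem = {
    "Filter element change": 180,
    "Oil sample": 90,
    "Hose inspection": 365,
}
cmms = {
    "Filter element change": 365,  # entered incorrectly
    "Oil sample": 90,
    # "Hose inspection" was never created in the CMMS
}

for task, (cmms_days, oem_days) in interval_mismatches(cmms, oem).items():
    print(f"{task}: CMMS={cmms_days}, OEM={oem_days}")
```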
When to Use Root Cause Analysis
RCA is not appropriate for every failure event. Applying a full formal investigation to every minor fault would consume more resources than the failures themselves. Maintenance teams typically set threshold criteria to decide when a formal RCA is warranted. Common triggers include:
- Any failure that has occurred more than twice in a rolling 12-month period on the same asset or asset class.
- Any unplanned downtime event exceeding a defined cost or duration threshold (for example, more than two hours on a critical production line).
- Any failure involving a safety incident, near-miss, or regulatory notification requirement.
- Any failure causing a product quality escape that reached the customer.
- Any failure of a safety-critical system, regardless of outcome.
Below these thresholds, a simplified analysis or a quick 5 Whys conversation is often sufficient. The objective is to match investigation rigour to the actual risk and cost of the failure, not to generate paperwork. Corrective maintenance that feeds documented RCA findings back into the maintenance planning cycle produces the highest long-term return.
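The trigger criteria listed above translate naturally into a screening function. This is a sketch only: the field names are hypothetical, and the two-hour downtime threshold is the example figure from the list rather than a recommended value.

```python
from dataclasses import dataclass


@dataclass
class FailureEvent:
    occurrences_last_12_months: int
    downtime_hours: float
    safety_incident: bool = False
    quality_escape: bool = False
    safety_critical_system: bool = False


def rca_triggers(event, downtime_threshold_hours=2.0):
    """Return the reasons a formal RCA is warranted; an empty list means none apply."""
    reasons = []
    if event.occurrences_last_12_months > 2:
        reasons.append("recurring failure")
    if event.downtime_hours > downtime_threshold_hours:
        reasons.append("downtime threshold exceeded")
    if event.safety_incident:
        reasons.append("safety incident or near-miss")
    if event.quality_escape:
        reasons.append("quality escape reached customer")
    if event.safety_critical_system:
        reasons.append("safety-critical system failure")
    return reasons


minor = FailureEvent(occurrences_last_12_months=1, downtime_hours=0.5)
repeat = FailureEvent(occurrences_last_12_months=3, downtime_hours=1.0)
shutdown = FailureEvent(occurrences_last_12_months=1, downtime_hours=0.0,
                        safety_critical_system=True)

for label, event in (("minor", minor), ("repeat", repeat), ("shutdown", shutdown)):
    print(label, "->", rca_triggers(event) or "simplified analysis is enough")
```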
RCA and Reliability Engineering
In a mature maintenance program, RCA is not a standalone reactive tool; it is integrated into the broader reliability engineering framework. Predictive maintenance technologies detect early failure signals and allow condition-triggered investigations before a full breakdown occurs. This means the physical evidence is preserved (the component has not yet failed catastrophically), the data record is complete (sensor trends leading up to the anomaly are available), and corrective action can be planned rather than rushed.
FMEA, applied proactively, reduces the number of failure modes that require reactive RCA by eliminating high-risk modes during the design and planning phase. Reliability data collected from closed RCA investigations feeds back into FMEA worksheets, improving their accuracy over time. The two methods are complementary rather than interchangeable.
Organisations that systematically close the loop between RCA findings, maintenance planning, and equipment design see measurable increases in mean time between failures (MTBF) and reductions in unplanned downtime rates and maintenance cost per unit of production over a two to three year horizon.
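MTBF is commonly calculated as total operating time divided by the number of failures in that time, so it rises as failures become less frequent. The sketch below uses hypothetical operating hours and failure counts to show the direction of the change after a recurring failure mode is eliminated.

```python
def mtbf(operating_hours: float, failures: int) -> float:
    """Mean time between failures: operating time divided by failure count."""
    if failures <= 0:
        raise ValueError("MTBF is undefined without at least one failure")
    return operating_hours / failures


# Hypothetical asset running 8,000 hours per year.
before = mtbf(operating_hours=8000, failures=8)  # year before the RCA
after = mtbf(operating_hours=8000, failures=2)   # year after corrective actions

print(f"MTBF before: {before:.0f} h, after: {after:.0f} h")
```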
The Bottom Line
Root cause analysis is the mechanism that converts reactive maintenance into a learning system. Every failure contains information about what the maintenance program failed to prevent. RCA extracts that information systematically and translates it into procedural, technical, or organisational changes that reduce the probability of recurrence. For maintenance managers, the practical value is straightforward: teams that conduct and act on RCA findings spend progressively less time on repetitive repairs and more time on planned, value-adding work.
The method matters less than the discipline. A well-facilitated 5 Whys analysis acted on promptly delivers more reliability value than a technically perfect FTA report that sits in a file with no follow-through. The key steps are defining the problem accurately, collecting objective evidence, identifying causes at all three layers (physical, human, and latent), assigning corrective actions with owners and deadlines, verifying that the actions worked, and sharing the findings across the organisation.
When RCA is supported by continuous condition monitoring data and a complete asset history in a CMMS, investigation time drops, evidence quality improves, and corrective actions are more precisely targeted. The result is a maintenance program that gets measurably better with each failure it investigates, rather than one that simply reacts to the same failures repeatedly.
Frequently Asked Questions
What is root cause analysis?
Root cause analysis (RCA) is a structured problem-solving process used to identify the fundamental cause of a failure or defect, rather than addressing only its symptoms. By tracing the chain of contributing factors back to its origin, RCA enables teams to implement corrective actions that prevent recurrence rather than simply restoring equipment to operation.
What are the main methods used in root cause analysis?
The four most widely used RCA methods in maintenance are: the 5 Whys (iteratively asking why until the root cause is reached), the Fishbone or Ishikawa diagram (mapping causes across categories such as people, equipment, and process), Fault Tree Analysis (a top-down logic diagram that maps the combinations of faults leading to a defined top event), and FMEA (a proactive method that anticipates failure modes before they occur). Method selection depends on problem complexity, data availability, and team experience.
When should you use root cause analysis?
Root cause analysis is appropriate after any recurring failure, high-impact unplanned breakdown, safety incident, quality escape, or regulatory non-conformance. It is also used proactively as part of reliability programs to analyse near-misses and low-severity failures before they escalate. Maintenance teams typically set criticality thresholds based on asset risk and downtime cost to prioritise which failures warrant a formal investigation.
What is the difference between a root cause and a contributing factor?
A root cause is the deepest underlying condition that, if eliminated, would prevent the failure from recurring. Contributing factors are conditions that increased the likelihood or severity of the failure but are not sufficient on their own to have caused it. Effective RCA distinguishes between the two to avoid wasting corrective resources on factors that would not, by themselves, have produced the same outcome.
How does root cause analysis differ from troubleshooting?
Troubleshooting is focused on restoring equipment to operation as quickly as possible. Root cause analysis is a subsequent, structured investigation aimed at understanding why the failure occurred and preventing it from happening again. Troubleshooting asks "What broke and how do I fix it now?" RCA asks "Why did it break and what must change so it does not break again?"
How does root cause analysis integrate with a CMMS and condition monitoring?
A CMMS captures the failure history, work order data, and parts consumption records that form the evidence base for RCA. Condition monitoring sensors provide the early-warning trend data showing how an asset behaved before failure, which helps analysts pinpoint when the failure mode initiated and which variables correlated with it. Together, these tools shorten investigation time, improve accuracy, and ensure corrective actions are tracked through to closure.