Fault Tolerance: Definition
Key Takeaways
- Fault tolerance is the ability of a system to keep functioning after one or more component failures, not the ability to prevent failures
- Redundancy is the most common mechanism for achieving fault tolerance, but it is not the only one: fail-safe design and graceful degradation are equally important approaches
- Active redundancy provides seamless fault tolerance with no interruption; passive redundancy provides it with a brief switchover delay
- Fault tolerance does not reduce maintenance requirements: it shifts the focus to keeping redundant elements in a ready state and testing standby systems regularly
- Availability, MTBF, MTTR, and fault tolerance level (N+1, 2N) are the primary metrics used to measure and specify fault-tolerant systems
- Industrial safety systems, control architectures, and power supply infrastructure all use fault-tolerant design as a core reliability strategy
What Is Fault Tolerance?
Fault tolerance is a design property that allows a system to continue performing its required function after a fault occurs in one or more of its components. The system detects the fault, isolates it, and either compensates through redundancy or moves to a reduced but acceptable operating state.
The distinction is important: fault tolerance is not about preventing failures. It is about designing systems so that individual failures do not produce system-level failures. A single failed sensor, pump, power supply, or controller should not bring the whole process down.
In industrial contexts, fault tolerance is applied wherever the consequences of an unplanned outage are severe: continuous process plants, safety instrumented systems, power distribution networks, industrial automation, and data-critical control architectures.
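The concept is formalized in standards including IEC 61508 (functional safety of electrical, electronic, and programmable electronic safety-related systems) and IEC 61511 (safety instrumented systems for the process industry sector), both of which set hardware fault tolerance requirements tied to safety integrity levels. ISO/IEC 25010 (systems and software quality models) treats fault tolerance as a sub-characteristic of reliability rather than assigning integrity levels.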
Fault Tolerance vs Fault Avoidance vs Redundancy
These three terms are often confused or used interchangeably. They address related but distinct aspects of system design:
| Concept | Goal | Mechanism | Relationship |
|---|---|---|---|
| Fault avoidance | Prevent faults from occurring | High-quality components, robust design, strict tolerances, thorough testing | Reduces fault frequency |
| Fault tolerance | Continue operating despite faults | Redundancy, fail-safe states, graceful degradation, automatic switchover | Reduces fault consequence |
| Redundancy | Provide backup capacity | Duplicate components operating in parallel or in standby | One mechanism within fault tolerance |
Fault avoidance and fault tolerance are complementary strategies, not alternatives. High-quality components reduce the rate of faults. Fault-tolerant architecture limits the impact of faults that still occur. Together they produce systems with high inherent reliability and high operational availability.
Types of Fault Tolerance
Active Redundancy
In active redundancy, two or more identical components operate simultaneously and share the load. If one fails, the others absorb its share of the workload without any interruption to system function.
A dual-feed power supply arrangement where both feeds are always live is a classic example. A processing system with multiple CPUs all running the same computation in parallel is another. Active redundancy provides seamless fault tolerance because there is no switchover delay: the healthy component is already operational at the moment of failure.
The trade-off is cost and complexity. All active components carry load continuously, so all are subject to wear, and all require maintenance. The system must also have logic to detect when one component has failed silently so the fault is not propagated.
Passive Redundancy (Standby Redundancy)
In passive redundancy, a standby component is held in reserve and switched in only when the primary component fails. The standby may be hot (powered and ready to take over immediately), warm (partially initialised with a brief activation delay), or cold (completely offline until needed).
A standby pump that starts automatically when the duty pump fails is the most common industrial example. Standby power generators that start on mains failure are another. Passive redundancy is less expensive than active redundancy because the standby components are not continuously loaded, but it introduces a short interruption at switchover and requires regular testing to confirm that the standby is functional.
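The duty/standby arrangement described above can be sketched as a small selection routine. This is an illustration only, not a controller implementation: the `Pump` class, its `healthy` flag, and `select_running_pump` are hypothetical names, and real fault detection and start-up timing are far more involved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pump:
    """Hypothetical pump model; `healthy` stands in for real fault detection."""
    name: str
    healthy: bool = True


def select_running_pump(duty: Pump, standby: Pump) -> Optional[Pump]:
    """Return the pump that should run, or None if neither is available."""
    if duty.healthy:
        return duty
    if standby.healthy:
        # Switchover: in a real system this step takes time (hot, warm, or cold start).
        return standby
    # Redundancy exhausted: this is a system-level failure.
    return None
```

The `None` branch is the case that regular standby testing is meant to prevent: a standby that failed silently is only discovered when the duty pump trips.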
Fail-Safe Design
Fail-safe design ensures that when a component or system fails, it moves to a predetermined safe state rather than continuing to operate in an undefined or dangerous condition. The safe state is chosen to minimize risk, even if it means the system is no longer productive.
Industrial examples include normally closed (NC) solenoid valves that close on power or signal loss, emergency shutdown systems that de-energize to trip, and pressure relief valves that open on overpressure. In each case, the failure state is safe by design.
Fail-safe is particularly important in safety instrumented systems (SIS), where a failure to act on a real demand (a dangerous failure) is far more serious than a spurious trip. Not every application has a single obvious safe state: some final elements are instead specified to fail in place, holding their last position, and the right choice depends on the process hazard being controlled.
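The de-energize-to-trip principle can be expressed in a few lines. The sketch below assumes a valve whose safe state is closed; the function name and its three inputs are hypothetical and chosen only for illustration.

```python
def valve_command(power_ok: bool, signal_valid: bool, trip_demand: bool) -> str:
    """De-energize-to-trip: the output is energized only when everything is healthy.

    Assumption: the safe state for this valve is CLOSED.
    """
    energized = power_ok and signal_valid and not trip_demand
    # Any loss of power, loss of signal, or trip demand drops the output,
    # so the failure modes and the deliberate trip lead to the same safe state.
    return "OPEN" if energized else "CLOSED"
```

The point of the design is that no separate action is needed to reach the safe state: it is what remains when anything is missing.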
Graceful Degradation
Graceful degradation allows a system to continue operating at reduced capacity or capability when a fault occurs, rather than failing entirely. The system sacrifices some performance to preserve core function.
An industrial control system that loses one of three operator workstations but continues to run on the remaining two is degrading gracefully. A distributed sensor network that continues to monitor most assets after one sensor node goes offline is another example. The system is impaired but not stopped, and operators are informed of the degraded state so they can manage it.
Graceful degradation is closely related to the concept of the failure mode: the design must define both the failure behaviour and the acceptable reduced-capability state so that operators know what to expect and how to respond.
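A degraded state is easier to manage when it is defined explicitly. The sketch below classifies a sensor network as normal, degraded, or failed based on how many nodes are still reporting. The function name, the use of `None` for an offline node, and the 50% `min_coverage` threshold are all assumptions made for illustration.

```python
from typing import Dict, Optional


def monitoring_state(readings: Dict[str, Optional[float]], min_coverage: float = 0.5) -> dict:
    """Classify a sensor network as NORMAL, DEGRADED, or FAILED.

    A reading of None marks a node that is offline. `min_coverage` is an
    assumed threshold below which monitoring is no longer acceptable.
    """
    online = {name: value for name, value in readings.items() if value is not None}
    coverage = len(online) / len(readings) if readings else 0.0
    if coverage == 1.0:
        state = "NORMAL"
    elif coverage >= min_coverage:
        state = "DEGRADED"  # keep running, but tell the operator
    else:
        state = "FAILED"
    offline = sorted(set(readings) - set(online))
    return {"state": state, "coverage": coverage, "offline": offline}
```

Returning the list of offline nodes alongside the state reflects the requirement that operators are informed of what has been lost, not only that something has.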
Fault Tolerance in Industrial Systems
Process Control Systems
Distributed control systems (DCS) and programmable logic controllers (PLC) in continuous process industries are designed with multiple layers of fault tolerance. Redundant controllers, redundant I/O modules, redundant communication buses, and redundant power supplies mean that the failure of any single element does not interrupt process control.
Voting architectures, such as 2-out-of-3 (2oo3) sensor voting, are common in safety-critical loops. Three identical sensors measure the same process variable. The control system acts on the majority vote, so a single sensor failure or a false reading does not trigger an incorrect action. The failed sensor is flagged for replacement while the process continues under the control of the remaining two.
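Two common forms of 2oo3 logic can be sketched briefly: majority voting on discrete trip signals, and median selection on analog readings with a deviation check to flag a suspect transmitter. The function names and the `max_deviation` tolerance are hypothetical, and a real logic solver would also handle channel diagnostics and degraded voting modes.

```python
from statistics import median
from typing import List, Sequence, Tuple


def vote_2oo3(trips: Sequence[bool]) -> bool:
    """Trip only if at least two of the three channels demand it."""
    if len(trips) != 3:
        raise ValueError("2oo3 voting needs exactly three channels")
    return sum(trips) >= 2


def median_select(values: Sequence[float], max_deviation: float) -> Tuple[float, List[int]]:
    """Return the median of three readings and the indices of suspect channels.

    `max_deviation` is an assumed tolerance in the units of the measurement.
    """
    if len(values) != 3:
        raise ValueError("median selection here assumes three transmitters")
    mid = median(values)
    suspects = [i for i, v in enumerate(values) if abs(v - mid) > max_deviation]
    return mid, suspects
```

With readings of 10.1, 10.0, and 57.3 and a tolerance of 0.5, the selected value is 10.1 and the third channel is flagged, so one wild reading neither trips the process nor goes unnoticed.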
Power Supply Systems
Industrial facilities use layered fault-tolerant power architectures. Dual utility feeds from separate substations provide the first level. Uninterruptible power supply (UPS) systems with battery or flywheel backup maintain power during the switchover interval. Standby generators provide longer-duration backup if the mains supply is lost for extended periods.
For critical loads, N+1 and 2N UPS configurations are standard. In an N+1 configuration, one more UPS module is installed than is needed to carry the full load, so any single module can fail without loss of capacity. In a 2N configuration, two completely independent UPS systems, each rated to carry the full load, operate in parallel, so either system can carry the load alone if the other fails or is taken out of service.
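The module counts behind N, N+1, and 2N follow from simple arithmetic. The sketch below assumes identical modules and looks at capacity only; the function names and the example figures are hypothetical.

```python
import math


def modules_required(load_kw: float, module_kw: float) -> int:
    """N: the number of modules needed to carry the load with no spare."""
    return math.ceil(load_kw / module_kw)


def installed_modules(load_kw: float, module_kw: float, scheme: str) -> int:
    """Number of modules installed under an N, N+1, or 2N scheme."""
    n = modules_required(load_kw, module_kw)
    if scheme == "N":
        return n
    if scheme == "N+1":
        return n + 1
    if scheme == "2N":
        return 2 * n
    raise ValueError(f"unknown redundancy scheme: {scheme}")


def survives(installed: int, failed: int, load_kw: float, module_kw: float) -> bool:
    """True if the remaining modules can still carry the full load."""
    return (installed - failed) * module_kw >= load_kw
```

For an assumed 400 kW load on 100 kW modules, N is 4, N+1 installs 5, and 2N installs 8. A capacity count like this understates the benefit of 2N, whose main advantage is that the two systems are independent, which a module count does not capture.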
Safety Instrumented Systems
Safety instrumented systems (SIS) are purpose-designed fault-tolerant architectures that bring a process to a safe state when a hazardous condition is detected. They are independent of the basic process control system. Their design is governed by IEC 61511, which specifies fault tolerance requirements in terms of Safety Integrity Levels (SIL).
Where a safety function requires a hardware fault tolerance (HFT) of one, for example, the system must continue to perform its safety function with any single hardware fault present. This typically means redundant sensors, redundant logic solvers, and redundant final elements. The HFT required for a given SIL depends on the demand mode and, under IEC 61508, on the safe failure fraction and device type, so it should be taken from the applicable edition of the standard rather than assumed. Reliability-centered maintenance (RCM) principles are used to derive the test intervals that maintain the required SIL over the system's operating life.
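The link between proof test interval and SIL can be illustrated with the widely used simplified approximation for a single channel, PFDavg ≈ λ_DU × TI / 2, together with the low-demand PFDavg bands associated with each SIL. This is a rough sketch under stated assumptions: it ignores common cause failures, diagnostics, repair time, and imperfect proof tests, and the function names and example failure rate are hypothetical.

```python
def pfd_avg_1oo1(lambda_du_per_hour: float, test_interval_hours: float) -> float:
    """Simplified average PFD for a single channel: lambda_DU * TI / 2."""
    return lambda_du_per_hour * test_interval_hours / 2.0


def sil_band_low_demand(pfd_avg: float) -> int:
    """Map PFDavg to its low-demand SIL band (0 means no band is met).

    Bands: SIL 1 below 1e-1, SIL 2 below 1e-2, SIL 3 below 1e-3, SIL 4 below 1e-4.
    Values under 1e-5 are still reported as 4, the highest band defined.
    """
    for sil, upper in ((4, 1e-4), (3, 1e-3), (2, 1e-2), (1, 1e-1)):
        if pfd_avg < upper:
            return sil
    return 0
```

With an assumed dangerous undetected failure rate of 2e-6 per hour, a one-year test interval (8,760 hours) gives a PFDavg of about 8.8e-3, inside the SIL 2 band, while a two-year interval pushes the same device into the SIL 1 band. Meeting the PFD band is only one of the conditions for claiming a SIL; architectural constraints and systematic capability also apply.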
Communication and Network Infrastructure
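Industrial networks use ring topologies, dual-path routing, and redundancy protocols such as MRP and RSTP to ensure that communication continues if a cable, switch, or network segment fails. Fault detection and automatic reconfiguration are fast, but recovery time varies by protocol and topology: ring protocols such as MRP are specified with recovery times from tens to hundreds of milliseconds, while RSTP can take noticeably longer in large networks. The protocol should be chosen so that worst-case recovery is shorter than the interruption the control application can tolerate.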
How Fault Tolerance Affects Maintenance Strategy
Fault-tolerant design does not reduce maintenance requirements. It changes them.
When a system is designed with redundancy, the failure of the primary component is tolerated, but the redundant protection is now consumed. The system is still running, but it is running without its safety net. If the redundant component then fails before the primary is restored, the system fails entirely.
This means maintenance teams must treat the loss of a redundant component as an urgent event, even though process continuity has not been disrupted. The window between the first failure and the restoration of full redundancy is a period of hidden vulnerability.
Three maintenance priorities emerge from this:
1. Monitor redundant components actively. Standby elements that are not being used can fail silently. Condition-based maintenance techniques applied to standby components ensure that their readiness is known, not assumed. Vibration analysis, thermal monitoring, and electrical testing on standby pumps, motors, and generators keep the standby genuinely available.
2. Test protective and standby systems at defined intervals. Passive redundancy and fail-safe systems can fail in a dormant state that is only discovered when the system is demanded. Scheduled functional tests at intervals derived from the system's failure rate and the acceptable unavailability target are required to confirm that protection is in place. This is the logic behind failure finding intervals in RCM programs.
3. Restore redundancy quickly after any fault. Once a fault has consumed the redundant protection, the mean time to recovery (MTTR) becomes critical. A fast restoration, whether through spare parts availability, pre-defined repair procedures, or rapid escalation paths, limits the window of vulnerability. Predictive maintenance tools that detect degradation before failure allow teams to plan restoration work before redundancy is consumed rather than after.
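The effect of restoration time on risk can be estimated with a simple model. The sketch below assumes a constant failure rate (exponential model) and independent failures, and asks how likely the surviving unit is to fail before redundancy is restored. The function name and the example figures are hypothetical.

```python
import math


def prob_failure_during_repair(mtbf_hours: float, mttr_hours: float) -> float:
    """Probability that the surviving unit fails before redundancy is restored.

    Assumes a constant failure rate (exponential model) and independent failures.
    """
    return 1.0 - math.exp(-mttr_hours / mtbf_hours)
```

For a surviving unit with an assumed MTBF of 20,000 hours, an 8-hour repair leaves a chance of roughly 0.04% of losing the system during the window, while a 30-day wait for a spare (720 hours) raises it to about 3.5%. Common cause failures, which often affect both units at once, would make the real figure worse.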
Measuring Fault Tolerance
Fault tolerance is quantified through a combination of reliability metrics and design specifications:
| Metric | What It Measures | Relevance to Fault Tolerance |
|---|---|---|
| System availability | Proportion of scheduled time the system is operational | Primary measure of whether fault tolerance is working: a well-designed fault-tolerant system maintains high asset availability even with component failures |
| MTBF (system level) | Average operating time between system-level failures | Mean time between failures at the system level should be much longer than MTBF of any individual component, demonstrating that redundancy is absorbing component failures |
| MTTR | Average time to restore a failed component to service | Mean time to recovery governs how long the system operates without its redundant protection after a fault; a low MTTR limits vulnerability |
| Fault tolerance level | Number of simultaneous component failures the system can tolerate without losing function | Expressed as N+1 (one spare), 2N (full duplication), or hardware fault tolerance (HFT) 0/1/2 per IEC 61508; defines the design requirement |
| PFD (probability of failure on demand) | Probability that a protective system fails to operate when a demand occurs | Used for safety instrumented systems; a lower PFD means a more dependable safety function, and each SIL corresponds to a target PFD band |
| RAM analysis | Reliability, availability, and maintainability modelling of a system | RAM analysis uses reliability block diagrams or fault tree models to predict availability and quantify the contribution of redundancy to fault tolerance |
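
The relationship between the first three metrics in the table can be shown with two standard formulas: steady-state availability A = MTBF / (MTBF + MTTR), and the availability of redundant units in parallel, 1 - (1 - A)^n, when any one unit is sufficient. The sketch assumes independent failures and perfect switchover; the function names and figures are illustrative.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single repairable unit."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def parallel_availability(unit_availability: float, units: int) -> float:
    """Availability of `units` redundant units when any one is sufficient.

    Assumes independent failures and perfect switchover (no common cause).
    """
    return 1.0 - (1.0 - unit_availability) ** units
```

A unit with an assumed MTBF of 990 hours and MTTR of 10 hours is 99% available on its own. Two such units in parallel reach about 99.99% under these assumptions, which is why system-level MTBF should sit well above component MTBF in a working fault-tolerant design.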
Fault Tolerance and Condition Monitoring
A fault-tolerant design is only as good as the readiness of its redundant components. Condition monitoring is the operational discipline that keeps redundancy genuinely available rather than theoretically available.
Continuous monitoring of critical and standby assets provides the early warning needed to address developing faults before they consume the protective redundancy. If a duty pump is running with a bearing that is beginning to fail, predictive detection via vibration or temperature sensors allows the team to plan a controlled switchover to the standby and repair the duty pump during a scheduled window. The standby absorbs the fault without any production impact, and the duty pump is restored before the standby's readiness degrades.
This is the operational expression of fault tolerance: the design creates the architecture, and condition monitoring keeps that architecture performing as intended over years of operation.
Failure mode analysis tools, including FMEA, are used at the design stage to identify which components require redundancy and what their expected failure rate is. Condition monitoring at the operational stage validates those assumptions with real-world data and surfaces deterioration that the design analysis could not predict in advance.
Fault Tolerance in Maintenance Strategy Design
When building a maintenance strategy for a fault-tolerant system, risk-based maintenance principles apply. The strategy must account for two distinct failure consequences:
Component failure within redundant capacity. If the system can absorb the failure without losing function, the maintenance response can be planned and scheduled. The priority is to restore the redundant element before the next fault occurs, and the urgency is driven by the likelihood and consequence of that next fault.
Component failure that exceeds redundant capacity. If all redundant elements have failed, or if a fault bypasses the redundant protection, the system is down. At this point the response is emergency corrective maintenance, with all the cost and disruption that entails.
A well-designed maintenance strategy for a fault-tolerant system keeps the first scenario common and the second scenario rare. It does this through a combination of preventive maintenance on redundant components, regular functional testing of standby and protective systems, fast response times when faults are detected, and condition monitoring that provides early warning before failures occur.
Frequently Asked Questions
Is a fault-tolerant system the same as a reliable system?
Not exactly. A reliable system has a low probability of failing under stated conditions. A fault-tolerant system is designed to keep functioning even when individual components fail. High reliability reduces fault frequency; fault tolerance reduces fault consequence. The most dependable industrial systems combine both: high-quality, well-maintained components (reliability) within an architecture that can absorb failures when they do occur (fault tolerance).
What does N+1 mean in fault-tolerant design?
N+1 means that one additional (redundant) unit is provided beyond the number needed to carry the full load. If N units are required for normal operation, N+1 units are installed so the system can continue to operate if any one fails. The "+1" is the spare capacity. Higher redundancy levels, such as 2N (full duplication) or N+2, provide greater protection at greater cost.
Can software be fault-tolerant?
Yes. Software fault tolerance is achieved through techniques including error detection and correction, exception handling, watchdog timers that detect and restart hung processes, redundant software modules running in parallel, and checkpoint-and-restart mechanisms that allow a process to resume from a known-good state after a crash. In industrial control systems, redundant controllers running identical software with voter logic to detect discrepancies between outputs are a common hardware-software fault tolerance combination.
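Checkpoint-and-restart, one of the techniques listed above, can be sketched in a few lines. This is a minimal illustration that stores progress in a JSON file; the function name, the checkpoint layout, and the use of a local file are assumptions, and the results must be JSON-serializable for this version to work.

```python
import json
import os


def process_with_checkpoint(items, work, checkpoint_path):
    """Apply `work` to each item, saving progress so a restart resumes where it stopped."""
    start, results = 0, []
    if os.path.exists(checkpoint_path):
        # A previous run left a checkpoint: resume from the known-good state.
        with open(checkpoint_path) as fh:
            saved = json.load(fh)
        start, results = saved["next_index"], saved["results"]
    for i in range(start, len(items)):
        results.append(work(items[i]))  # may raise; the last checkpoint survives
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as fh:
            json.dump({"next_index": i + 1, "results": results}, fh)
        # Replace the old checkpoint in one step so it is never half-written.
        os.replace(tmp_path, checkpoint_path)
    return results
```

If `work` raises partway through, calling the function again with the same `checkpoint_path` skips the items already completed and retries only from the point of failure.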
How does fault tolerance relate to safety integrity levels (SIL)?
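SIL requirements and fault tolerance are directly linked. IEC 61508 expresses hardware fault tolerance (HFT) as the number of hardware faults a subsystem can contain while still performing its safety function: HFT 0 means a single fault can cause loss of the function, and HFT 1 means the function survives any single fault. The minimum HFT is not a fixed one-to-one mapping from SIL. It rises with the SIL, but it also depends on the demand mode and, under IEC 61508, on the safe failure fraction and device type, so lower SILs can often be met with HFT 0 while higher SILs generally call for HFT 1 or more. These requirements drive the redundancy architecture and test interval of safety instrumented systems.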
Does fault tolerance eliminate the need for maintenance?
No. Fault tolerance changes the nature and urgency of maintenance tasks but does not eliminate them. Redundant components still age, degrade, and fail. Standby systems can fail silently if not regularly tested. If maintenance allows redundant elements to fall into a failed state undetected, the fault-tolerant protection disappears and the system is exposed to the very outages it was designed to prevent. Fault tolerance and active maintenance are not alternatives; they work together.
The Bottom Line
Fault tolerance is engineered resilience. It acknowledges that failures will occur despite maintenance efforts and builds the system's ability to absorb those failures without losing function. For critical infrastructure, production systems, and safety-relevant equipment, fault tolerance is not an alternative to good maintenance — it is the risk control layer that bridges the gap between maintenance intervals.
The maintenance implication of fault tolerance is important: redundant and standby systems must themselves be maintained to preserve their protective function. A redundant pump that has been quietly failing for months provides no protection when the primary pump fails. Regular testing and maintenance of fault-tolerant components is what keeps the protection they provide real rather than theoretical.