What is the difference between fault tolerance and redundancy?

Redundancy is one of several design techniques used to achieve fault tolerance. Redundancy means duplicating critical components so a spare is available if the primary fails. Fault tolerance is the broader system property: the ability to continue functioning through a fault, regardless of the mechanism used to achieve it. A fault-tolerant system may use redundancy, fail-safe design, graceful degradation, automatic switchover, or a combination of these. Redundancy is a means; fault tolerance is the outcome.

What is the difference between fault tolerance and fault avoidance?

Fault avoidance seeks to prevent faults from occurring in the first place, through high-quality components, robust design, and strict manufacturing tolerances. Fault tolerance accepts that faults will occur and designs the system to continue functioning despite them. Both strategies are complementary: fault avoidance reduces the frequency of failures, while fault tolerance reduces the consequence of failures that do occur.

What are the main types of fault tolerance?

The main types of fault tolerance are: active redundancy (multiple identical components operating in parallel, all sharing the load, so that failure of one is absorbed immediately without interruption); passive redundancy (standby components that are switched in only when the primary fails); fail-safe design (the system moves to a safe, predefined state on failure rather than continuing at risk); and graceful degradation (the system continues at reduced capability or performance when a fault occurs, rather than failing entirely).

How is fault tolerance measured?

Fault tolerance is measured through availability, reliability, and recovery metrics. Key measures include system availability (the proportion of scheduled time the system is operational), mean time between failures (MTBF), mean time to recovery (MTTR), and the number of tolerated simultaneous faults (fault tolerance level, often expressed as N+1 or 2N). Formal methods such as reliability block diagrams and RAM analysis are used to model and quantify the fault tolerance of complex systems before deployment.

How does fault tolerance affect maintenance strategy?

Fault-tolerant design shifts the maintenance focus from preventing any failure to managing the health of redundant components so that the redundancy is always available when needed. Maintenance teams must actively monitor standby elements, test them regularly, and restore them to readiness promptly after use. If redundant components are allowed to fail silently, the fault-tolerant protection is lost and the system becomes vulnerable to a full outage. Condition monitoring and predictive maintenance tools are essential for keeping redundancy genuinely available.

Fault Tolerance: Definition

Definition: Fault tolerance is the property of a system that allows it to continue operating correctly, or at a reduced level of performance, even when one or more components have failed. Rather than preventing failures entirely, fault-tolerant design anticipates them and builds in the mechanisms needed to survive them: redundant components, automatic switchover, fail-safe states, and graceful degradation. It is a foundational concept in reliability engineering, safety system design, and industrial control architecture.

What Is Fault Tolerance?

Fault tolerance is a design property that allows a system to continue performing its required function after a fault occurs in one or more of its components. The system detects the fault, isolates it, and either compensates through redundancy or moves to a reduced but acceptable operating state.

The distinction is important: fault tolerance is not about preventing failures. It is about designing systems so that individual failures do not produce system-level failures. A single failed sensor, pump, power supply, or controller should not bring the whole process down.

In industrial contexts, fault tolerance is applied wherever the consequences of an unplanned outage are severe: continuous process plants, safety instrumented systems, power distribution networks, industrial automation, and data-critical control architectures.

The concept is formalized in several standards. IEC 61508 (functional safety of electrical, electronic, and programmable electronic safety-related systems) and IEC 61511 (safety instrumented systems for the process industry) set hardware fault tolerance requirements against defined safety integrity levels. ISO/IEC 25010 (systems and software quality) lists fault tolerance as a sub-characteristic of reliability in its quality model.

Fault Tolerance vs Fault Avoidance vs Redundancy

These three terms are often confused or used interchangeably. They address related but distinct aspects of system design:

Concept	Goal	Mechanism	Relationship
Fault avoidance	Prevent faults from occurring	High-quality components, robust design, strict tolerances, thorough testing	Reduces fault frequency
Fault tolerance	Continue operating despite faults	Redundancy, fail-safe states, graceful degradation, automatic switchover	Reduces fault consequence
Redundancy	Provide backup capacity	Duplicate components operating in parallel or in standby	One mechanism within fault tolerance

Fault avoidance and fault tolerance are complementary strategies, not alternatives. High-quality components reduce the rate of faults. Fault-tolerant architecture limits the impact of faults that still occur. Together they produce systems with high inherent reliability and high operational availability.

Types of Fault Tolerance

Active Redundancy

In active redundancy, two or more identical components operate simultaneously and share the load. If one fails, the others absorb its share of the workload without any interruption to system function.

A dual-feed power supply arrangement where both feeds are always live is a classic example. A processing system with multiple CPUs all running the same computation in parallel is another. Active redundancy provides seamless fault tolerance because there is no switchover delay: the healthy component is already operational at the moment of failure.

The trade-off is cost and complexity. All active components carry load continuously, so all are subject to wear, and all require maintenance. The system must also have logic to detect when one component has failed silently so the fault is not propagated.

Passive Redundancy (Standby Redundancy)

In passive redundancy, a standby component is held in reserve and switched in only when the primary component fails. The standby may be hot (powered and ready to take over immediately), warm (partially initialised with a brief activation delay), or cold (completely offline until needed).

A standby pump that starts automatically when the duty pump fails is the most common industrial example. Standby power generators that start on mains failure are another. Passive redundancy is less expensive than active redundancy because the standby components are not continuously loaded, but it introduces a short interruption at switchover and requires regular testing to confirm that the standby is functional.

Fail-Safe Design

Fail-safe design ensures that when a component or system fails, it moves to a predetermined safe state rather than continuing to operate in an undefined or dangerous condition. The safe state is chosen to minimize risk, even if it means the system is no longer productive.

Industrial examples include normally closed (NC) solenoid valves that close on power or signal loss, emergency shutdown systems that de-energize to trip, and pressure relief valves that open on overpressure. In each case, the failure state is safe by design.

Fail-safe is particularly important in safety instrumented systems (SIS), where the consequence of an incorrect output is more serious than a spurious trip. Some applications instead require the system to remain in its last known good state on failure (often called fail-in-place or fail-last). The choice between moving to a defined safe state and holding the last state depends on the application context.

Graceful Degradation

Graceful degradation allows a system to continue operating at reduced capacity or capability when a fault occurs, rather than failing entirely. The system sacrifices some performance to preserve core function.

An industrial control system that loses one of three operator workstations but continues to run on the remaining two is degrading gracefully. A distributed sensor network that continues to monitor most assets after one sensor node goes offline is another example. The system is impaired but not stopped, and operators are informed of the degraded state so they can manage it.

Graceful degradation is closely related to the concept of the failure mode: the design must define both the failure behaviour and the acceptable reduced-capability state so that operators know what to expect and how to respond.

Fault Tolerance in Industrial Systems

Process Control Systems

Distributed control systems (DCS) and programmable logic controllers (PLC) in continuous process industries are designed with multiple layers of fault tolerance. Redundant controllers, redundant I/O modules, redundant communication buses, and redundant power supplies mean that the failure of any single element does not interrupt process control.

Voting architectures, such as 2-out-of-3 (2oo3) sensor voting, are common in safety-critical loops. Three identical sensors measure the same process variable. The control system acts on the majority vote, so a single sensor failure or a false reading does not trigger an incorrect action. The failed sensor is flagged for replacement while the process continues under the control of the remaining two.
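As a rough illustration of the voting logic described above, the following Python sketch implements a 2oo3 trip vote and a simple check for a sensor that disagrees with the other two. The function names, the trip limit, and the deviation threshold are hypothetical and chosen only for the example; a real safety loop would run in a certified logic solver rather than in general-purpose code.

```python
from statistics import median


def vote_2oo3(readings, trip_limit):
    """Return True (trip) when at least two of the three readings exceed trip_limit."""
    if len(readings) != 3:
        raise ValueError("2oo3 voting needs exactly three readings")
    votes = sum(1 for r in readings if r > trip_limit)
    return votes >= 2


def suspect_sensor(readings, max_deviation):
    """Return the index of the single reading that deviates from the median.

    Returns None when all readings agree or when more than one deviates,
    because in that case the faulty sensor cannot be identified by voting.
    """
    mid = median(readings)
    outliers = [i for i, r in enumerate(readings) if abs(r - mid) > max_deviation]
    return outliers[0] if len(outliers) == 1 else None


# Hypothetical example: the third transmitter has failed high.
readings = [10.1, 10.3, 55.0]
trip = vote_2oo3(readings, trip_limit=12.0)
faulty = suspect_sensor(readings, max_deviation=1.0)
```

With these example values the single high reading is outvoted, so no trip is raised, and the sensor at index 2 is identified for replacement.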

Power Supply Systems

Industrial facilities use layered fault-tolerant power architectures. Dual utility feeds from separate substations provide the first level. Uninterruptible power supply (UPS) systems with battery or flywheel backup maintain power during the switchover interval. Standby generators provide longer-duration backup if the mains supply is lost for extended periods.

For critical loads, N+1 and 2N UPS configurations are standard. In an N+1 configuration, one more UPS module is installed than is needed to carry the full load, so any single module can fail without loss of capacity. In a 2N configuration, two completely independent UPS systems each rated to carry the full load operate in parallel, providing the highest level of protection.
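The arithmetic behind N+1 and 2N can be sketched in a few lines of Python. The load and module ratings below are made-up figures, and the function names are illustrative. The capacity check treats all modules as one pooled group, which is a simplification: it is adequate for the single-failure case but does not model how failures are distributed across the two independent systems of a 2N design.

```python
import math


def modules_required(load_kw, module_kw):
    """N: the smallest number of modules that can carry the full load."""
    return math.ceil(load_kw / module_kw)


def installed_modules(load_kw, module_kw, scheme):
    """Number of modules installed under an 'N', 'N+1' or '2N' scheme."""
    n = modules_required(load_kw, module_kw)
    if scheme == "N":
        return n
    if scheme == "N+1":
        return n + 1
    if scheme == "2N":
        return 2 * n
    raise ValueError(f"unknown redundancy scheme: {scheme}")


def carries_load_after(load_kw, module_kw, installed, failed):
    """True if the surviving modules can still carry the full load (pooled model)."""
    return (installed - failed) * module_kw >= load_kw


# Hypothetical example: 400 kW critical load, 100 kW UPS modules.
n_plus_1 = installed_modules(400, 100, "N+1")   # 5 modules
two_n = installed_modules(400, 100, "2N")       # 8 modules
ok_after_one_failure = carries_load_after(400, 100, n_plus_1, failed=1)
```

In this example an N+1 installation of five modules still carries the load after one module fails, but not after two.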

Safety Instrumented Systems

Safety instrumented systems (SIS) are purpose-designed fault-tolerant architectures that bring a process to a safe state when a hazardous condition is detected. They are independent of the basic process control system. Their design is governed by IEC 61511, which specifies fault tolerance requirements in terms of Safety Integrity Levels (SIL).

A SIL 2 safety function, for example, may require a hardware fault tolerance (HFT) of one: the system must continue to perform its safety function with any single hardware fault present. The exact minimum depends on the edition of the standard, the demand mode, and the type of device, so the figure must be confirmed against the applicable requirements. An HFT of one typically means redundant sensors, redundant logic solvers, and redundant final elements. Reliability-centered maintenance (RCM) principles are used to derive the test intervals that maintain the required SIL over the system's operating life.
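The link between proof test interval and SIL can be illustrated with the widely used simplified approximation for a single (1oo1) channel in low-demand mode, PFDavg ≈ λDU × TI / 2, where λDU is the dangerous undetected failure rate and TI is the proof test interval. The Python sketch below applies it and maps the result to the low-demand PFD bands. The failure rate and intervals are invented for illustration, and the approximation ignores common-cause failures, repair time, and imperfect proof testing. A PFD band on its own does not establish a SIL claim, since architectural (HFT) requirements also apply.

```python
def pfd_avg_1oo1(lambda_du_per_hour, test_interval_hours):
    """Simplified average PFD for a single channel: lambda_DU * TI / 2."""
    return lambda_du_per_hour * test_interval_hours / 2.0


def sil_band_low_demand(pfd_avg):
    """Return the SIL whose low-demand PFD band contains pfd_avg, or None."""
    bands = [
        (4, 1e-5, 1e-4),
        (3, 1e-4, 1e-3),
        (2, 1e-3, 1e-2),
        (1, 1e-2, 1e-1),
    ]
    for sil, low, high in bands:
        if low <= pfd_avg < high:
            return sil
    return None


# Hypothetical device: lambda_DU = 2e-6 per hour.
yearly = pfd_avg_1oo1(2e-6, 8760)       # proof test every year
two_yearly = pfd_avg_1oo1(2e-6, 17520)  # proof test every two years
```

With these assumed numbers, a yearly proof test keeps the function in the SIL 2 band, while stretching the interval to two years drops it into the SIL 1 band.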

Communication and Network Infrastructure

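Industrial networks use ring topologies, dual-path routing, and redundancy protocols such as MRP and RSTP to ensure that communication continues if a cable, switch, or network segment fails. Fault detection and reconfiguration are automatic. Ring protocols such as MRP are specified with bounded recovery times in the range of tens to hundreds of milliseconds, while RSTP convergence depends on the topology and can take longer. Process control is unaffected as long as the recovery time is shorter than the interruption the control application can tolerate.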

How Fault Tolerance Affects Maintenance Strategy

Fault-tolerant design does not reduce maintenance requirements. It changes them.

When a system is designed with redundancy, the failure of the primary component is tolerated, but the redundant protection is now consumed. The system is still running, but it is running without its safety net. If the redundant component then fails before the primary is restored, the system fails entirely.

This means maintenance teams must treat the loss of a redundant component as an urgent event, even though process continuity has not been disrupted. The window between the first failure and the restoration of full redundancy is a period of hidden vulnerability.

Three maintenance priorities emerge from this:

1. Monitor redundant components actively. Standby elements that are not being used can fail silently. Condition-based maintenance techniques applied to standby components ensure that their readiness is known, not assumed. Vibration analysis, thermal monitoring, and electrical testing on standby pumps, motors, and generators keep the standby genuinely available.

2. Test protective and standby systems at defined intervals. Passive redundancy and fail-safe systems can fail in a dormant state that is only discovered when the system is demanded. Scheduled functional tests at intervals derived from the system's failure rate and the acceptable unavailability target are required to confirm that protection is in place. This is the logic behind failure finding intervals in RCM programs.

3. Restore redundancy quickly after any fault. Once a fault has consumed the redundant protection, the mean time to recovery (MTTR) becomes critical. A fast restoration, whether through spare parts availability, pre-defined repair procedures, or rapid escalation paths, limits the window of vulnerability. Predictive maintenance tools that detect degradation before failure allow teams to plan restoration work before redundancy is consumed rather than after.
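Priorities 2 and 3 can both be put into numbers. The Python sketch below uses two common simplifications: the failure-finding interval for a single protective device, FFI = 2 × target unavailability × MTBF (reasonable only for small unavailability targets), and an exponential failure model for the chance that the surviving unit fails during the repair window. All figures and function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 8760


def failure_finding_interval(mtbf_hours, target_unavailability):
    """Test interval (hours) that keeps a dormant device's average
    unavailability at the target, using FFI = 2 * U * MTBF."""
    if not 0 < target_unavailability < 1:
        raise ValueError("target_unavailability must be between 0 and 1")
    return 2.0 * target_unavailability * mtbf_hours


def second_failure_probability(mtbf_hours, repair_window_hours):
    """Probability that the surviving unit fails before redundancy is
    restored, assuming a constant failure rate of 1 / MTBF."""
    return 1.0 - math.exp(-repair_window_hours / mtbf_hours)


# Hypothetical standby pump: MTBF of 10 years, 1% unavailability target.
standby_mtbf = 10 * HOURS_PER_YEAR
ffi_hours = failure_finding_interval(standby_mtbf, 0.01)

# Exposure during a 72-hour repair compared with an 8-hour repair.
p_slow = second_failure_probability(standby_mtbf, 72)
p_fast = second_failure_probability(standby_mtbf, 8)
```

Under these assumptions the standby would need a functional test roughly every 1,750 hours, and cutting the repair window from 72 hours to 8 hours reduces the exposure to a second failure by about a factor of nine.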

Measuring Fault Tolerance

Fault tolerance is quantified through a combination of reliability metrics and design specifications:

Metric	What It Measures	Relevance to Fault Tolerance
System availability	Proportion of scheduled time the system is operational	Primary measure of whether fault tolerance is working: a well-designed fault-tolerant system maintains high asset availability even with component failures
MTBF (system level)	Average operating time between system-level failures	Mean time between failures at the system level should be much longer than MTBF of any individual component, demonstrating that redundancy is absorbing component failures
MTTR	Average time to restore a failed component to service	Mean time to recovery governs how long the system operates without its redundant protection after a fault; a low MTTR limits vulnerability
Fault tolerance level	Number of simultaneous component failures the system can tolerate without losing function	Expressed as N+1 (one spare), 2N (full duplication), or hardware fault tolerance (HFT) 0, 1, or 2 per IEC 61508; defines the design requirement
PFD (probability of failure on demand)	Probability that a protective system fails to operate when a demand occurs	Used for safety instrumented systems; a lower average PFD means the safety function is more likely to act when demanded; each SIL corresponds to a target PFD band
RAM analysis	Reliability, availability, and maintainability modelling of a system	RAM analysis uses reliability block diagrams or fault tree models to predict availability and quantify the contribution of redundancy to fault tolerance
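To show how the availability, MTBF, and MTTR rows relate, the Python sketch below computes steady-state availability for one unit as MTBF / (MTBF + MTTR) and then for a group of identical units where any one is enough to keep the system running. It assumes independent failures, independent repair, and perfect switchover, which real systems only approximate. The figures are illustrative.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of a single repairable unit."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def parallel_availability(unit_availability, units):
    """Availability of a group in which any one working unit is sufficient.

    The group is down only when every unit is down at the same time.
    """
    return 1.0 - (1.0 - unit_availability) ** units


# Hypothetical pump: MTBF 1000 h, MTTR 10 h.
single = availability(1000, 10)
duplicated = parallel_availability(single, 2)
```

With these numbers a single unit is available about 99.0% of the time, while the duplicated pair reaches about 99.99%, which is the effect the system-level MTBF row describes.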

Fault Tolerance and Condition Monitoring

A fault-tolerant design is only as good as the readiness of its redundant components. Condition monitoring is the operational discipline that keeps redundancy genuinely available rather than theoretically available.

Continuous monitoring of critical and standby assets provides the early warning needed to address developing faults before they consume the protective redundancy. If a duty pump is running with a bearing that is beginning to fail, predictive detection via vibration or temperature sensors allows the team to plan a controlled switchover to the standby and repair the duty pump during a scheduled window. The standby absorbs the fault without any production impact, and the duty pump is restored before the standby's readiness degrades.

This is the operational expression of fault tolerance: the design creates the architecture, and condition monitoring keeps that architecture performing as intended over years of operation.

Failure mode analysis tools, including FMEA, are used at the design stage to identify which components require redundancy and what their expected failure rate is. Condition monitoring at the operational stage validates those assumptions with real-world data and surfaces deterioration that the design analysis could not predict in advance.

Fault Tolerance in Maintenance Strategy Design

When building a maintenance strategy for a fault-tolerant system, risk-based maintenance principles apply. The strategy must account for two distinct failure consequences:

Component failure within redundant capacity. If the system can absorb the failure without losing function, the maintenance response can be planned and scheduled. The priority is to restore the redundant element before the next fault occurs, and the urgency is driven by the likelihood and consequence of that next fault.

Component failure that exceeds redundant capacity. If all redundant elements have failed, or if a fault bypasses the redundant protection, the system is down. At this point the response is emergency corrective maintenance, with all the cost and disruption that entails.

A well-designed maintenance strategy for a fault-tolerant system keeps the first scenario common and the second scenario rare. It does this through a combination of preventive maintenance on redundant components, regular functional testing of standby and protective systems, fast response times when faults are detected, and condition monitoring that provides early warning before failures occur.

Frequently Asked Questions

Is a fault-tolerant system the same as a reliable system?

Not exactly. A reliable system has a low probability of failing under stated conditions. A fault-tolerant system is designed to keep functioning even when individual components fail. High reliability reduces fault frequency; fault tolerance reduces fault consequence. The most dependable industrial systems combine both: high-quality, well-maintained components (reliability) within an architecture that can absorb failures when they do occur (fault tolerance).

What does N+1 mean in fault-tolerant design?

N+1 means that one additional (redundant) unit is provided beyond the number needed to carry the full load. If N units are required for normal operation, N+1 units are installed so the system can continue to operate if any one fails. The "+1" is the spare capacity. Higher redundancy levels, such as 2N (full duplication) or N+2, provide greater protection at greater cost.

Can software be fault-tolerant?

Yes. Software fault tolerance is achieved through techniques including error detection and correction, exception handling, watchdog timers that detect and restart hung processes, redundant software modules running in parallel, and checkpoint-and-restart mechanisms that allow a process to resume from a known-good state after a crash. In industrial control systems, redundant controllers running identical software with voter logic to detect discrepancies between outputs are a common hardware-software fault tolerance combination.
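One of the techniques listed above, the watchdog timer, can be sketched in Python as follows. The class and function names are hypothetical, and the restart action is left as a callback. Industrial controllers normally implement this in hardware or firmware, so this is only a minimal software analogue.

```python
import time


class Watchdog:
    """Flags a monitored task as hung if it stops sending heartbeats."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock              # injectable so the logic can be tested
        self._last_kick = clock()

    def kick(self):
        """Called by the monitored task to signal that it is still alive."""
        self._last_kick = self._clock()

    def expired(self):
        """True when no heartbeat has arrived within the timeout."""
        return self._clock() - self._last_kick > self.timeout_s


def supervise(watchdog, restart):
    """Restart the task if the watchdog has expired; return True if restarted."""
    if watchdog.expired():
        restart()
        watchdog.kick()                  # give the restarted task a fresh window
        return True
    return False
```

A supervisor loop would call `supervise` periodically, while the monitored task calls `kick` on every healthy cycle.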

How does fault tolerance relate to safety integrity levels (SIL)?

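SIL requirements and fault tolerance are directly linked. IEC 61508 defines hardware fault tolerance (HFT) as the number of hardware faults a subsystem can sustain while still performing its safety function. An HFT of 0 means there is no redundancy, so a single dangerous fault can defeat the function; an HFT of 1 means the function survives any one fault. As a common rule of thumb, SIL 1 is associated with an HFT of 0, SIL 2 with an HFT of 1, and SIL 3 with an HFT of 2, although the required minimum varies with the standard and edition applied, the demand mode, and the safe failure fraction of the devices used. These requirements drive the redundancy architecture and test interval of safety instrumented systems.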

Does fault tolerance eliminate the need for maintenance?

No. Fault tolerance changes the nature and urgency of maintenance tasks but does not eliminate them. Redundant components still age, degrade, and fail. Standby systems can fail silently if not regularly tested. If maintenance allows redundant elements to fall into a failed state undetected, the fault-tolerant protection disappears and the system is exposed to the very outages it was designed to prevent. Fault tolerance and active maintenance are not alternatives; they work together.

The Bottom Line

Fault tolerance is engineered resilience. It acknowledges that failures will occur despite maintenance efforts and builds the system's ability to absorb those failures without losing function. For critical infrastructure, production systems, and safety-relevant equipment, fault tolerance is not an alternative to good maintenance; it is the risk control layer that bridges the gap between maintenance intervals.

The maintenance implication of fault tolerance is important: redundant and standby systems must themselves be maintained to preserve their protective function. A redundant pump that has been quietly failing for months provides no protection when the primary pump fails. Regular testing and maintenance of fault-tolerant components is what keeps the protection they provide real rather than theoretical.

Keep Your Fault-Tolerant Architecture Genuinely Protected

Fault-tolerant design creates the architecture. Continuous condition monitoring keeps it working. Tractian's condition monitoring platform tracks the health of both duty and standby assets in real time, detects degradation before it consumes your redundant protection, and gives reliability teams the data they need to restore readiness before the next fault occurs.
