Failure Finding Interval (FFI): Definition
Key Takeaways
- FFI applies only to hidden functions: equipment that is dormant and whose failure would not be noticed until a protective demand occurs
- The FFI formula is: FFI = 2 × MTBF × P(unavailability), where MTBF is the mean time between functional failures and P(unavailability) is the maximum acceptable probability of finding the device failed
- FFI is not the same as a standard preventive maintenance interval: it is driven by risk tolerance, not by a wear or degradation mechanism
- Common FFI assets include fire suppression systems, emergency shutdown valves, standby generators, alarms, protective relays, and safety instrumented systems
- FFI should be revisited as actual failure data accumulates: the initial interval is always an estimate that improves over time
What Is a Failure Finding Interval (FFI)?
A failure finding interval is a scheduled inspection frequency designed for one specific class of asset: equipment that performs a hidden function. A hidden function is one that is not called on under normal operating conditions, so its failure goes unnoticed until a demand occurs. The only way to detect whether such equipment is in a failed state is to deliberately test it.
Examples include a fire suppression deluge system, an emergency shutdown valve, a standby diesel generator, and a high-pressure relief valve. These assets sit dormant for extended periods. If one fails while dormant, no one knows until a fire, process excursion, or power outage actually occurs, at which point the failure can have catastrophic consequences.
The FFI answers a precise question: how often must we test this device to keep the probability that it is currently failed below an acceptable threshold?
The concept originates from RCM methodology, formalized in documents such as SAE JA1011 and popularized by John Moubray's RCM II. It is now standard practice in oil and gas, power generation, chemical processing, aviation, and any industry that relies heavily on protective layers.
Why Hidden Failures Need Their Own Maintenance Task Category
Most maintenance tasks address evident failures: degradation that produces noise, heat, vibration, or performance loss that operators or sensors will notice. For these failures, the task frequency is set by the rate at which the asset deteriorates, as governed by the P-F curve.
Hidden failures follow completely different logic. A functional failure of a protective device leaves the system looking entirely normal. No alarm sounds. No performance metric changes. No operator notices anything unusual.
The hazard is not the hidden failure itself. It is the combination of the hidden failure with a second, separate event: the demand on the protective function. This is called a multiple failure. A pump seal may fail with the fire deluge system simultaneously out of service. Neither event alone causes a catastrophe. Together, they can.
Because the failure mode is different, the maintenance logic is different. The task is not to prevent the failure: protective equipment often fails at random, with no age-related pattern that a time-based task could intercept. The task is to find the failure before the demand occurs, by testing the device at a frequency that keeps the probability of an undetected failure acceptably low.
The FFI Formula
The standard RCM formula for deriving an FFI is:
FFI = 2 × MTBF × P(unavailability)
Where:
- MTBF is the mean time between failures of the protective device. This is the average interval between hidden functional failures: how often, on average, does this device fail silently?
- P(unavailability) is the maximum acceptable probability that the device is currently in a failed state at any given moment. This is expressed as a decimal (for example, 0.05 for 5%).
Worked Example
A facility has a gas detection system. Based on manufacturer data and industry records, the system has a mean time between hidden failures of 8,000 hours. The site safety case requires a maximum unavailability of 5% (0.05) for this protective function.
FFI = 2 × 8,000 × 0.05
FFI = 800 hours
The gas detection system must be functionally tested at least every 800 hours to keep the probability of it being in a failed state below 5%.
If the MTBF or the acceptable unavailability changes, for example if the risk assessment tightens the target from 5% to 2%, the FFI must be recalculated.
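The arithmetic above can be sketched in a few lines of Python. The function name `failure_finding_interval` and its parameter names are illustrative and not taken from any standard or library. The sketch applies the formula as stated and reruns it for the tightened 2% target.

```python
def failure_finding_interval(mtbf_hours: float, max_unavailability: float) -> float:
    """Return the FFI in hours using FFI = 2 x MTBF x P(unavailability)."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    if not 0 < max_unavailability < 1:
        raise ValueError("unavailability must be a fraction between 0 and 1")
    return 2 * mtbf_hours * max_unavailability


# Gas detection example from the text: MTBF 8,000 h, 5% target.
ffi_5_percent = failure_finding_interval(8_000, 0.05)

# Same device with the target tightened to 2%.
ffi_2_percent = failure_finding_interval(8_000, 0.02)

print(f"FFI at 5%: {ffi_5_percent:.0f} h")
print(f"FFI at 2%: {ffi_2_percent:.0f} h")
```

Tightening the target from 5% to 2% shortens the interval from 800 hours to 320 hours for the same device.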
Where the Formula Comes From
The derivation assumes that hidden failures occur at a constant, random failure rate (exponential distribution). Under this assumption, and provided the test interval is short compared with the MTBF, the probability that the device has failed rises approximately linearly from zero immediately after the last test to a maximum just before the next one. The average probability over the full interval is half the probability at the end of the interval, that is, roughly FFI / (2 × MTBF). Setting this average equal to the acceptable unavailability and solving for FFI puts the factor of 2 in the numerator of the standard form.
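Written out, the reasoning looks as follows. The symbols are introduced here for illustration only: λ = 1/MTBF is the constant hidden failure rate, T is the test interval (the FFI), U(t) is the probability that the device is failed at time t after a test, and Ū is the average unavailability over one interval. The linearization assumes λt is much smaller than 1.

```latex
\begin{align*}
U(t) &= 1 - e^{-\lambda t} \approx \lambda t \qquad (\lambda t \ll 1) \\
\bar{U} &= \frac{1}{T}\int_0^T U(t)\,dt \approx \frac{1}{T}\int_0^T \lambda t\,dt
         = \frac{\lambda T}{2} = \frac{T}{2\,\mathrm{MTBF}} \\
T &\approx 2 \times \mathrm{MTBF} \times \bar{U}
\end{align*}
```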
This is an approximation appropriate for early planning. More precise calculations using actual failure distributions (Weibull, for example) can be applied when sufficient failure data is available.
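To get a feel for how good the approximation is, the sketch below compares the linear result T / (2 × MTBF) with the exact time-averaged unavailability under the same constant-failure-rate assumption. The function names are illustrative. The exact expression comes from averaging 1 − exp(−t / MTBF) over one test interval.

```python
import math


def approx_unavailability(interval: float, mtbf: float) -> float:
    """Linear approximation: average unavailability = T / (2 x MTBF)."""
    return interval / (2 * mtbf)


def exact_unavailability(interval: float, mtbf: float) -> float:
    """Time average of 1 - exp(-t / MTBF) over one test interval."""
    ratio = interval / mtbf
    return 1 - (1 - math.exp(-ratio)) / ratio


mtbf = 8_000  # hours, from the worked example
for interval in (800, 4_000, 8_000):
    approx = approx_unavailability(interval, mtbf)
    exact = exact_unavailability(interval, mtbf)
    print(f"T = {interval:>5} h  approx = {approx:.4f}  exact = {exact:.4f}")
```

For the 800-hour interval the two figures agree closely (about 0.048 exact against 0.050 approximate). The linear form errs on the conservative side, and the gap widens as the interval approaches the MTBF.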
FFI vs Preventive Maintenance Interval: Key Differences
| Attribute | FFI (Failure Finding Task) | Standard PM Interval |
|---|---|---|
| Applies to | Hidden functions: standby and protective equipment | Active, in-service equipment with evident failure modes |
| Failure visibility | Failure is not apparent during normal operation | Failure produces an immediate, observable symptom |
| Interval basis | Desired unavailability probability and MTBF | Degradation rate, P-F interval, or manufacturer recommendation |
| Task objective | Detect a failure that has already occurred but is undetected | Prevent or reduce the likelihood of the next failure |
| Task type | Functional test | Inspection, lubrication, calibration, component replacement |
| Source of interval | Risk/safety analysis and statistical formula | OEM data, engineering analysis, historical records |
| Failure pattern assumed | Random (no age relationship) | Age-related (wear-out or fatigue pattern) |
Understanding this distinction is important when building a maintenance interval library. FFI tasks should not be treated as ordinary time-based PMs: their logic, documentation, and scheduling rationale are fundamentally different.
How FFI Fits Into RCM
In a formal RCM analysis, every function of every asset is assessed through a structured logic tree. Each function is first classified as either evident or hidden. For evident functions, the maintenance task options include condition-based, time-based, or redesign responses. For hidden functions, once proactive tasks (on-condition, scheduled restoration, or scheduled discard) have been ruled out as not feasible or not worthwhile, the question becomes: can a failure finding task be identified that will reduce the multiple failure risk to an acceptable level?
The FMEA component of the RCM study identifies each hidden failure mode and its effects. The FFI calculation then provides the test frequency needed to manage that risk. If no practical failure finding task can reduce the risk sufficiently, the RCM process escalates to redesign: adding redundancy, changing the system architecture, or modifying the operating context to eliminate the hidden failure mode.
FFI tasks are documented in the maintenance plan with a specific task description (what exactly constitutes a functional test), the required frequency, the acceptance criteria (what result confirms the device is functional), and the restoration action if the device is found failed.
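One way to hold these four documented elements is a small record type. The sketch below is purely illustrative: the class name, field names, and sample values are hypothetical and do not correspond to any particular maintenance management system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureFindingTask:
    """Hypothetical record of a failure finding task in a maintenance plan."""
    asset_tag: str
    task_description: str      # what exactly constitutes the functional test
    interval_hours: float      # the FFI
    acceptance_criteria: str   # result that confirms the device is functional
    restoration_action: str    # what to do if the device is found failed


# Sample values are invented for illustration only.
task = FailureFindingTask(
    asset_tag="GD-101",
    task_description="Apply calibration gas to the sensor head and confirm alarm",
    interval_hours=800,
    acceptance_criteria="Alarm activates at the configured setpoint",
    restoration_action="Raise corrective work order to repair or replace detector",
)
print(task)
```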
Applying FFI in Practice
Common Asset Classes That Require FFI Tasks
Any system whose sole purpose is to respond to an abnormal demand condition is a candidate for FFI management. The most common examples are:
- Fire and gas detection systems. Smoke detectors, heat detectors, combustible gas detectors, and flame detectors are dormant until a fire or gas release occurs. Their hidden failure rate and the consequences of unavailability during a fire drive the FFI calculation.
- Fire suppression systems. Sprinkler systems, deluge systems, and gaseous suppression systems must be periodically actuated or inspected to confirm they will operate correctly under demand.
- Emergency shutdown systems (ESD/ESDV). These valves and logic systems are designed to close on a process excursion. They may be dormant for months or years. Spurious trip rates must be balanced against the risk of failing to close when demanded, which determines both the FFI and the acceptable failure probability.
- Standby equipment. Standby pumps, standby generators, and standby HVAC systems require regular run tests to verify that they will start and perform to specification when the primary system fails. The FFI governs how often these run tests must occur.
- Pressure relief valves. Relief valves that protect vessels from overpressure are a classic hidden function. They are tested by lifting to confirm the set pressure has not drifted and that the valve will open freely when required.
- Protective relays and circuit breakers. In electrical systems, protective relays detect fault conditions and command circuit breakers to open. If the relay or breaker has failed silently, a fault will not be interrupted. FFI testing involves injecting a test signal to verify relay pickup and breaker operation.
Setting the Acceptable Unavailability Target
The unavailability probability P(unavailability) used in the FFI formula is not arbitrary. It must be determined by a risk assessment that considers:
- The severity of the multiple failure consequence (safety, environmental, operational)
- Regulatory and industry standards that specify minimum integrity levels (for example, IEC 61511 Safety Integrity Levels for process safety systems)
- The frequency at which the primary failure, or the demand on the protective function, is expected to occur
- Whether other protective layers are in place that reduce the net risk
For safety-critical functions, unavailability targets are typically 10% or lower, depending on the consequence severity and the SIL (Safety Integrity Level) assigned to the function. Under IEC 61511, SIL 1 corresponds to an average probability of failure on demand between 1% and 10% in low-demand mode, and each higher SIL tightens the band by a factor of ten. For less critical protective functions, higher unavailability may be acceptable. The conditional probability of failure framework used in risk-based maintenance programs provides a structured basis for these decisions.
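As a rough illustration of how the demand frequency feeds the target, the sketch below assumes that the hidden failure and the demand are independent, so that the multiple failure rate is approximately the demand rate multiplied by the unavailability of the protective device. The function name and the example figures are assumptions for illustration, not values from any standard.

```python
def target_unavailability(tolerable_multiple_failure_rate: float,
                          demand_rate: float) -> float:
    """Largest unavailability that keeps the multiple failure rate tolerable.

    Assumes multiple_failure_rate ~= demand_rate * unavailability,
    with both rates in the same units (for example, events per year).
    """
    if tolerable_multiple_failure_rate <= 0 or demand_rate <= 0:
        raise ValueError("rates must be positive")
    return min(1.0, tolerable_multiple_failure_rate / demand_rate)


# Hypothetical figures: a demand about once in 10 years, and a multiple
# failure tolerated no more often than once in 1,000 years.
u_max = target_unavailability(tolerable_multiple_failure_rate=1 / 1_000,
                              demand_rate=1 / 10)
print(f"Target unavailability: {u_max:.2%}")
```

With these figures the protective device may be unavailable about 1% of the time. A more frequent demand or a stricter tolerance pushes the target lower.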
What Happens When a Device Is Found Failed
A functional test that reveals a failed state is not a maintenance failure: it is the system working exactly as designed. The purpose of the FFI is to find hidden failures before a demand occurs. When a failure is found:
- The device is restored to a functional state immediately (repair or replacement).
- The failure is recorded with the date of the last successful test. This provides an upper bound on the time the device was unavailable.
- The failure event is added to the historical record for the device. As this record accumulates, the actual MTBF can be estimated and compared to the value used in the FFI calculation.
- If failures are being found at a rate that suggests the actual MTBF is significantly shorter than assumed, the FFI must be shortened to maintain the target unavailability.
This feedback loop of test, find, record, analyze, and adjust is what makes FFI management a living program rather than a static schedule. It is also what separates a mature risk-based maintenance program from one where intervals are set once and never reviewed.
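The analyze-and-adjust step can be sketched as below. This is a minimal illustration that assumes a simple point estimate, cumulative in-service time divided by the number of failures found, which ignores the uncertainty that comes with small failure counts. The function names and the fleet figures are hypothetical.

```python
def estimate_mtbf(total_service_hours: float, failures_found: int) -> float:
    """Point estimate of MTBF: cumulative in-service time / failures found."""
    if failures_found <= 0:
        raise ValueError("need at least one recorded failure for a point estimate")
    return total_service_hours / failures_found


def recalculated_ffi(mtbf_hours: float, max_unavailability: float) -> float:
    """FFI = 2 x MTBF x P(unavailability)."""
    return 2 * mtbf_hours * max_unavailability


# Hypothetical history: 12 detectors in service for 10,000 hours each,
# with 20 failed states found during functional tests.
observed_mtbf = estimate_mtbf(total_service_hours=12 * 10_000, failures_found=20)
new_ffi = recalculated_ffi(observed_mtbf, max_unavailability=0.05)

print(f"Observed MTBF: {observed_mtbf:.0f} h")
print(f"Revised FFI:   {new_ffi:.0f} h")
```

In this invented history the observed MTBF of 6,000 hours is shorter than the 8,000 hours assumed in the worked example, so the interval for a 5% target drops from 800 to 600 hours.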
Integrating FFI Into the Maintenance Schedule
FFI tasks are scheduled in the same maintenance management system as all other work orders. However, a few practical considerations are specific to failure finding tasks:
The task must be a genuine functional test. A visual inspection of a sprinkler head is not a functional test of the sprinkler system. The test must actually verify that the protective function will operate correctly under its required conditions. Partial tests or proxy measures that do not confirm full functionality do not satisfy the FFI requirement.
Access and safety during testing. Many functional tests involve temporarily defeating or bypassing the protective function in order to test it. This creates a window of unavailability. Good maintenance practice minimizes this window, documents it, and ensures that other protective layers are in place during the test period.
Record keeping. Regulatory audits and safety cases require evidence that FFI tasks have been carried out at the required frequency and that the results have been recorded. Work orders must capture the test procedure followed, the result (pass or fail), and any corrective action taken.
FFI and Modern Condition Monitoring
For some protective devices, continuous or periodic condition-based maintenance techniques can supplement or replace traditional functional testing. Self-diagnostic features in modern safety instrumented systems, for example, detect some failure modes continuously, which effectively reduces the rate of failures that remain hidden and may support a longer FFI without increasing the unavailability.
Online partial stroke testing of emergency shutdown valves allows a portion of the valve's travel to be tested during normal operation without fully closing the process. This tests some failure modes (mechanical binding, actuator fault) while avoiding the operational disruption of a full stroke test, and it can support higher test frequencies that would otherwise be operationally impractical.
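The effect of partial testing on unavailability can be sketched with the same linear approximation used in the FFI formula. The sketch assumes that a fixed fraction of failure modes (the coverage) is revealed by the partial test and the remainder only by the full test. The function name, the coverage figure, and the valve data are hypothetical.

```python
def unavailability_with_partial_tests(mtbf_hours: float,
                                      full_test_interval: float,
                                      partial_test_interval: float,
                                      partial_coverage: float) -> float:
    """Approximate average unavailability with partial and full tests.

    Failure modes covered by the partial test are found at the partial
    interval; the remainder are only found at the full test.
    """
    if not 0 <= partial_coverage <= 1:
        raise ValueError("coverage must be between 0 and 1")
    failure_rate = 1 / mtbf_hours
    covered = partial_coverage * failure_rate * partial_test_interval / 2
    uncovered = (1 - partial_coverage) * failure_rate * full_test_interval / 2
    return covered + uncovered


# Hypothetical valve: MTBF 100,000 h, full stroke test yearly (8,760 h),
# partial stroke test monthly (730 h) covering 60% of failure modes.
full_only = unavailability_with_partial_tests(100_000, 8_760, 8_760, 0.0)
with_partial = unavailability_with_partial_tests(100_000, 8_760, 730, 0.6)
print(f"Full stroke only:    {full_only:.4f}")
print(f"With partial stroke: {with_partial:.4f}")
```

Under these assumptions the monthly partial test cuts the average unavailability from about 4.4% to about 2.0% without changing the full stroke schedule. The uncovered failure modes still set a floor that only the full test can lower.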
Predictive maintenance technologies, including vibration analysis, electrical current signature analysis, and thermal imaging, can detect degradation in standby equipment that would not be caught by a binary pass/fail functional test. Integrating these signals alongside scheduled FFI tasks gives a fuller picture of protective system health.
The key principle is that any monitoring technique used to reduce or replace an FFI task must be demonstrably effective at detecting the specific failure modes that the FFI was designed to find. The logic is the same; only the technology changes.
Common Mistakes in FFI Management
Using a fixed schedule without a calculation. Many maintenance programs assign test intervals to protective equipment based on OEM recommendations, regulatory minimums, or habit rather than calculating from MTBF and unavailability targets. This may result in intervals that are far longer than the risk profile justifies.
Treating the FFI as a target rather than a maximum. The FFI defines the maximum allowable interval consistent with the target unavailability. Testing more frequently is generally permissible and may be appropriate when operational access makes it convenient. Testing less frequently violates the safety or risk objective.
Not recording test results consistently. The value of the failure finding program depends entirely on the quality of the failure records. If failed devices are restored without being documented, the MTBF estimate is never corrected and the FFI remains based on assumptions rather than evidence.
Conflating the FFI task with the restoration task. The FFI task is the test. If the test reveals a failure, a separate corrective maintenance work order should be raised to restore the device. Mixing these two activities in the same work order makes it harder to track failure occurrences accurately.
Ignoring common cause failures. When multiple identical protective devices are installed in parallel (redundant safety loops, for example), a single cause can fail all of them simultaneously. FFI calculations that assume independent failure modes will underestimate the true unavailability of the system. Staggering the tests of redundant devices helps surface common cause failures that synchronized testing would miss.
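The size of the underestimate can be illustrated with a simplified beta-factor model, in which an assumed fraction beta of each channel's failures has a shared cause and takes out both channels together. This is a sketch under stated assumptions: it treats the channel unavailability as a single point value and ignores time-averaging effects. The function name and the 10% beta are hypothetical.

```python
def redundant_pair_unavailability(channel_unavailability: float,
                                  beta: float = 0.0) -> float:
    """Unavailability of two redundant channels (either one can protect).

    beta is the assumed fraction of channel failures that share a common
    cause and therefore take out both channels together.
    """
    independent = ((1 - beta) * channel_unavailability) ** 2
    common_cause = beta * channel_unavailability
    return independent + common_cause


u = 0.05  # per-channel unavailability, as in the worked example
print(f"Independent only:      {redundant_pair_unavailability(u):.4f}")
print(f"With 10% common cause: {redundant_pair_unavailability(u, beta=0.10):.4f}")
```

With a 5% channel unavailability, the independent calculation gives about 0.0025, while a 10% common cause fraction raises it to about 0.0070, nearly three times higher.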
Frequently Asked Questions
Is FFI the same as a proof test?
The terms are closely related and often used interchangeably in practice. In IEC 61511 and process safety literature, the term "proof test" is used for functional tests of safety instrumented system components. Strictly, the proof test is the task and the FFI is the RCM-derived interval at which it must be performed. The calculation method, objective, and documentation requirements are the same.
What if no MTBF data is available for the protective device?
When MTBF data is unavailable, engineers use generic industry databases (OREDA, EXIDA, IEEE 493), manufacturer reliability data, or conservative estimates from similar device classes. The initial FFI should err on the side of more frequent testing, with the interval extended as actual site data accumulates. A sensitivity analysis, recalculating the FFI across a range of MTBF assumptions, helps quantify the uncertainty and set a conservative starting point.
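A sensitivity analysis of this kind can be as simple as the sketch below, which recalculates the FFI over a range of assumed MTBF values. The function name and the candidate values are hypothetical; real candidates would come from the databases or manufacturer data mentioned above.

```python
def ffi_sensitivity(mtbf_candidates, max_unavailability):
    """Return (MTBF, FFI) pairs for a range of assumed MTBF values."""
    return [(mtbf, 2 * mtbf * max_unavailability) for mtbf in mtbf_candidates]


# Hypothetical spread of MTBF assumptions (hours) around a generic estimate.
candidates = [4_000, 8_000, 16_000]
table = ffi_sensitivity(candidates, max_unavailability=0.05)

for mtbf, ffi in table:
    print(f"MTBF {mtbf:>6} h -> FFI {ffi:>6.0f} h")

# A conservative starting point is the shortest interval in the range.
starting_ffi = min(ffi for _, ffi in table)
print(f"Conservative starting FFI: {starting_ffi:.0f} h")
```

Because the FFI scales linearly with MTBF, halving the assumed MTBF halves the interval. Here the conservative starting point would be 400 hours.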
Does the FFI apply to redundant systems differently?
Yes. When redundant protective channels are installed (for example, a 2-out-of-3 voting configuration), the unavailability of the overall system is lower than the unavailability of any single channel. The system unavailability formula accounts for the redundancy configuration. This means that each individual channel may be tested less frequently than a single-channel system would require, while still maintaining the same overall system unavailability target. The calculation must be done at the system level, not the individual component level.
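A system-level calculation for a k-out-of-n voting arrangement can be sketched as follows. The sketch assumes identical channels that fail independently (no common cause) and treats the channel unavailability as a single point value, so it is an illustration rather than a full treatment. The function name is hypothetical.

```python
from math import comb


def koon_unavailability(k: int, n: int, channel_unavailability: float) -> float:
    """Probability that fewer than k of n independent channels are working."""
    if not 1 <= k <= n:
        raise ValueError("require 1 <= k <= n")
    u = channel_unavailability
    # The system fails when more than n - k channels are failed at once.
    return sum(comb(n, failed) * u**failed * (1 - u)**(n - failed)
               for failed in range(n - k + 1, n + 1))


u = 0.05
print(f"Single channel: {koon_unavailability(1, 1, u):.5f}")
print(f"2-out-of-3:     {koon_unavailability(2, 3, u):.5f}")
```

With each channel unavailable 5% of the time, the 2-out-of-3 arrangement is unavailable about 0.7% of the time under these assumptions, which is the margin that allows longer per-channel test intervals for the same system target.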
How does the FFI relate to asset availability?
The FFI is calculated to control the unavailability of the protective function, which is a specific type of asset availability concern. However, the testing itself introduces a brief planned unavailability period. For protective systems that must be taken offline to be tested, the time the system is out of service for testing must be factored into the overall availability calculation for the safety layer. This is one reason why modern online testing methods (partial stroke testing, self-diagnostics) are preferred for critical applications.
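The trade-off can be sketched by adding a test-downtime term to the linear approximation behind the FFI formula. The sketch assumes the protective function is fully unavailable for a fixed duration during each test. The function names and the 4-hour downtime are hypothetical.

```python
import math


def total_unavailability(interval: float, mtbf: float, test_downtime: float) -> float:
    """Hidden-failure unavailability plus the share of time spent under test."""
    return interval / (2 * mtbf) + test_downtime / interval


def interval_minimising_unavailability(mtbf: float, test_downtime: float) -> float:
    """Interval at which the two contributions above are balanced."""
    return math.sqrt(2 * mtbf * test_downtime)


mtbf = 8_000        # hours, from the worked example
test_downtime = 4   # hours offline per test (hypothetical)

for interval in (200, 800):
    u = total_unavailability(interval, mtbf, test_downtime)
    print(f"T = {interval} h -> total unavailability {u:.4f}")
print(f"Minimum near T = {interval_minimising_unavailability(mtbf, test_downtime):.0f} h")
```

With 4 hours offline per test, the 800-hour interval yields about 5.5% total unavailability rather than 5%. Under these assumptions, testing more often than roughly every 250 hours would start to increase total unavailability, because test downtime then dominates.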
Is FFI used outside of RCM programs?
FFI calculations are used whenever maintenance intervals for protective or standby equipment need to be formally justified. Regulatory frameworks for process safety, nuclear power, aviation maintenance, and defence systems all require evidence that protective equipment test frequencies are grounded in a quantitative risk assessment. The FFI formula is the standard tool for providing that evidence, regardless of whether the overall maintenance program is formally RCM-structured.
The Bottom Line
The failure-finding interval is the quantitative foundation for testing standby and protective equipment at the right frequency. It replaces guesswork with a defensible, risk-grounded calculation that balances the cost of inspection against the probability of the hidden failure going undetected and contributing to a dangerous or production-impacting event.
In regulated industries such as oil and gas, nuclear power, and aviation, FFI calculations are not optional: they are audit requirements tied to formal safety cases. For maintenance teams in less strictly regulated environments, applying FFI methodology to hidden function tests improves program quality and provides the documentation needed to justify inspection intervals to engineers, managers, and regulators alike.