Fault Tolerance: Definition, Testing & Importance
Fault tolerance refers to a system's ability to keep operating when one or more of its components fail.
Even the most well-designed system fails from time to time. Viruses strike. Servers overheat. Computer components wear out. Fault tolerance allows for smooth operation despite these failures.
Losing even a moment or two of connectivity can be catastrophic. Just ask Disney+. When the organisation's servers delivered glitchy performance in February 2021, users got mad. Instead of watching WandaVision, they wrote nasty tweets.
Fault tolerance plans may not keep your entire organisation running smoothly all the time. But your work could prevent a worst-case scenario from happening.
What is fault tolerance?
When a computer, server, network, or other IT system keeps operating even though one of its components has failed, fault tolerance is responsible.
Create a fault-tolerant design to:
- Stay operational. Make sure your system doesn't go down altogether when something breaks.
- Reduce risks. Prevent disruptions caused by the failure of a single critical piece of hardware or software. Overlap functions so you can share the load in a crisis.
- Buy time. Fixing any kind of IT problem requires investigation and savvy. Fault tolerance ensures people can keep working while you hunt down the source.
Imagine that you run servers in Washington, D.C., and you just opened a portal for vaccine registration. Users flood you with responses, and your servers crash. Reporters take notice and write about your mistake all over the United States.
Now imagine that you've built a fault-tolerant system. When the influx overloads one server, another takes over, and users never know that anything went wrong.
The fault-tolerance concept isn't new. IT professionals have used it since the 1950s to describe systems that must stay online, no matter what.
But early fault-tolerance plans involved alerts. A system notified staff when something was about to fail, and they had to step in and do something immediately. Modern plans involve backups and redundancies, so the team can work while the system stays online.
People sometimes confuse fault tolerance with high availability. A company's high-availability score is the share of its total run time during which the system stays up. To maintain high availability, a system switches to a backup system when something fails. The backup often provides reduced capacity and a degraded experience. The company stays online, but work can slow.
In a true fault-tolerant system, redundant hardware does exactly the same job when the original system is offline.
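The availability score described above can be worked out with simple arithmetic. The sketch below is illustrative only: the function names `availability` and `allowed_downtime_minutes` and the 365-day window are assumptions, not part of any standard tool.

```python
def availability(uptime_seconds: float, total_seconds: float) -> float:
    """Return the fraction of the observation window the system was up."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if not 0 <= uptime_seconds <= total_seconds:
        raise ValueError("uptime_seconds must be between 0 and total_seconds")
    return uptime_seconds / total_seconds


def allowed_downtime_minutes(target: float, window_days: int = 365) -> float:
    """Return the downtime budget, in minutes, implied by an availability target.

    `target` is a fraction such as 0.999; `window_days` is an assumed
    measurement window (one year by default).
    """
    if not 0 <= target <= 1:
        raise ValueError("target must be between 0 and 1")
    return (1 - target) * window_days * 24 * 60
```

Under these assumptions, a 99.9 percent target over a year leaves roughly 525.6 minutes of downtime. A fault-tolerant design aims higher: the redundant hardware takes over with no loss of capacity, so users should not notice the failure at all.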
How does fault tolerance work?
How can you keep something up and running even while parts and pieces of it are breaking? Answer this question with a comprehensive fault-tolerance plan.
At its core, your plan should:
- Eliminate. Don’t allow a single point of failure. The system should keep operating without stopping, even while you make repairs.
- Isolate. You should remove the defective piece from system operation rather than letting it cause a cascade of problems.
- Engage. When you complete the repair, the part should come back online with no noticeable disruption.
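The three principles above can be sketched in a few lines of code. This is a minimal illustration, not a production design: `ReplicaPool`, its methods, and the idea of modelling each replica as a plain callable are all hypothetical choices made for the example.

```python
from typing import Callable, Dict, Set


class ReplicaPool:
    """Hypothetical pool of interchangeable replicas (illustrative only)."""

    def __init__(self, replicas: Dict[str, Callable[[str], str]]) -> None:
        # Eliminate: refuse to run with a single point of failure.
        if len(replicas) < 2:
            raise ValueError("need at least two replicas")
        self._replicas = dict(replicas)
        self._isolated: Set[str] = set()

    @property
    def isolated(self) -> Set[str]:
        """Names of replicas currently removed from service."""
        return set(self._isolated)

    def handle(self, request: str) -> str:
        """Serve the request from the first healthy replica."""
        for name, replica in self._replicas.items():
            if name in self._isolated:
                continue
            try:
                return replica(request)
            except Exception:
                # Isolate: take the failing replica out of rotation so it
                # cannot affect later requests, then try the next one.
                self._isolated.add(name)
        raise RuntimeError("all replicas are isolated")

    def engage(self, name: str) -> None:
        """Engage: return a repaired replica to service."""
        if name not in self._replicas:
            raise KeyError(name)
        self._isolated.discard(name)
```

In this sketch a caller only sees an error when every replica has failed. A real system would also need health checks, timeouts, and shared state between replicas, which are left out here.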
Your fault-tolerance plan might include:
- Hardware. Build in backups so one can take over when another breaks. Run them in parallel, so they're always online and ready to go.
- Software. Run multiple instances so that one can take over if another fails.
- Power. Add backup power sources so your IT system always has current, even if your power company experiences a catastrophe.
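To make the "run them in parallel" idea concrete, here is a small sketch in which every redundant software instance receives the same request and the first successful answer wins. The helper name `first_successful` and the use of plain callables to stand in for instances are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List, Sequence


def first_successful(instances: Sequence[Callable[[str], str]], request: str) -> str:
    """Send the request to every instance at once and return the first success."""
    if not instances:
        raise ValueError("at least one instance is required")
    errors: List[Exception] = []
    with ThreadPoolExecutor(max_workers=len(instances)) as executor:
        futures = [executor.submit(instance, request) for instance in instances]
        for future in as_completed(futures):
            try:
                # Leaving the `with` block still waits for the slower
                # instances to finish; their results are simply discarded.
                return future.result()
            except Exception as exc:
                # A failed instance is recorded and otherwise ignored.
                errors.append(exc)
    raise RuntimeError(f"all {len(instances)} instances failed: {errors}")
```

The trade-off in this approach is cost: every instance does the work for every request. In exchange, a single failure causes no delay, because a healthy instance is already answering.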
There are multiple fault-tolerance techniques, including:
- Replication. Everything breaks in time. For example, most computers last about eight years, even with appropriate maintenance. Duplicating hard