Fault Tolerance in Distributed System

In this tutorial you are going to learn about fault tolerance in distributed systems.

Fault Tolerance:

Fault tolerance is the ability of a system to continue operating properly when one or more of its components fail. The system as a whole continues working despite faults, provided the number of faults stays below some assumed maximum.

Failure: The system as a whole is not working.

Fault: Some part of the system is not working correctly.

  • Node fault – a node crashes or deviates from the algorithm.
  • Network fault – the network drops or significantly delays messages.

Failure detector: An algorithm that detects whether another node is faulty.

Perfect failure detector: Labels a node as faulty if and only if it has crashed.
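A common way to approximate a failure detector is with heartbeats and a timeout. The sketch below is only an illustration: the class name `HeartbeatFailureDetector`, its methods, and the timeout value are all hypothetical. Note that a timeout alone cannot tell a crashed node from a slow node or a slow network, so this detector only suspects nodes and is not a perfect failure detector.

```python
import time


class HeartbeatFailureDetector:
    """Suspects a node when no heartbeat arrived within timeout_s seconds.

    Hypothetical sketch: a timeout cannot tell a crashed node from a slow
    one, so this is NOT a perfect failure detector.
    """

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock      # injectable so the detector can be tested
        self.last_seen = {}     # node id -> time of the last heartbeat

    def heartbeat(self, node_id):
        """Record that a heartbeat message from node_id has just arrived."""
        self.last_seen[node_id] = self.clock()

    def suspected(self, node_id):
        """Return True if node_id is currently suspected to be faulty."""
        last = self.last_seen.get(node_id)
        if last is None:
            return True         # never heard from this node at all
        return self.clock() - last > self.timeout_s
```

In use, each node would call `heartbeat()` whenever a message from a peer arrives and poll `suspected()` before relying on that peer. Choosing the timeout is a trade-off: a short timeout detects crashes quickly but wrongly suspects slow nodes more often.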

Introduction to fault tolerance:

Being fault tolerant is strongly related to what are called dependable systems.

Key features:

  1. Availability
  2. Reliability
  3. Safety
  4. Maintainability

System Attributes:

Availability – Availability is the property that a system is ready to be used immediately.

  • A highly available system is one that will most likely be working at any given instant in time.
  • The system is always ready for use; more precisely, availability is the probability that the system is ready for use at a given time.


Online shops want to sell stuff 24/7!

Service unavailability = downtime = losing money.

Availability = uptime = fraction of time that a service is functioning correctly.

  • “Two nines” = 99% up = down 3.7 days/year.
  • “Three nines” = 99.9% up = down 8.8 hours/year.
  • “Four nines” = 99.99% up = down 53 minutes/year.
  • “Five nines” = 99.999% up = down 5.3 minutes/year.
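The downtime for each level of "nines" follows from simple arithmetic: yearly downtime is (1 − availability) multiplied by the length of a year. The short sketch below reproduces that calculation, assuming a 365-day year; the function name is illustrative only.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def downtime_per_year_seconds(availability):
    """Yearly downtime in seconds for an availability given as a fraction."""
    return (1.0 - availability) * SECONDS_PER_YEAR


for label, a in [("two nines", 0.99), ("three nines", 0.999),
                 ("four nines", 0.9999), ("five nines", 0.99999)]:
    minutes = downtime_per_year_seconds(a) / 60
    print(f"{label}: {a:.3%} up, about {minutes:.1f} minutes down per year")
```

Each extra nine divides the allowed downtime by ten, which is why going from three nines to five nines is much harder than it sounds.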

Service level objective (SLO): A measurable target for a service, e.g. "99.9% of requests receive a response within 200 ms".

Service level agreement (SLA): A contract specifying some SLO and the penalties for violating it.
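An SLO of this kind can be checked mechanically against measured response times. The following is a minimal sketch, assuming the SLO means "at least 99.9% of requests are answered within 200 ms"; the function name, parameters, and sample data are hypothetical.

```python
def meets_slo(latencies_ms, threshold_ms=200.0, target_fraction=0.999):
    """Return True if enough requests finished within threshold_ms.

    latencies_ms holds one measured response time per request. An empty
    measurement window is treated as meeting the SLO (an assumption).
    """
    if not latencies_ms:
        return True
    fast = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return fast / len(latencies_ms) >= target_fraction


# Hypothetical window: 990 fast requests and 10 slow ones (99.0% within 200 ms).
window = [50.0] * 990 + [450.0] * 10
print(meets_slo(window))  # False: 99.0% is below the 99.9% target
```

An SLA would then attach consequences, such as refunds, to windows in which the check returns False.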

Reliability – Reliability refers to the property that a system can run continuously without failure.

  • A highly reliable system is one that will most likely continue to work without interruption.
  • Example: If a system goes down for one millisecond every hour, it is more than 99.9999% available, but it is still highly unreliable.
  • Reliability is the property that a system can run without failure for a given period of time.
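The difference between availability and reliability can be checked with a few lines of arithmetic. The sketch below assumes an outage of exactly one millisecond in every hour; the function names are illustrative.

```python
MS_PER_HOUR = 60 * 60 * 1000
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def availability(downtime_ms_per_hour):
    """Fraction of time the system is up, given its downtime per hour."""
    return 1.0 - downtime_ms_per_hour / MS_PER_HOUR


def interruptions_per_year(outages_per_hour=1):
    """How many times per year continuous operation is interrupted."""
    return outages_per_hour * HOURS_PER_YEAR


print(f"availability: {availability(1.0):.5%}")
print(f"interruptions per year: {interruptions_per_year()}")
```

The availability is extremely high, yet the system never runs for more than an hour without failing, so any task that needs longer than an hour of uninterrupted service cannot rely on it.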


Safety – Safety refers to the situation that when a system temporarily fails to operate correctly, nothing bad or dangerous happens.

  • Safety describes the consequences when the system fails.

Example: Sending people into space requires a high degree of safety. If the control system fails even for a very brief moment, the effect can be disastrous.

Maintainability – Maintainability refers to how easily a failed system can be repaired.

  • Failure in a distributed system = when a service cannot be fully provided.
  • System failure may be partial.
  • A single failure may affect other parts of a system (failure escalation).

Omission failure – A server fails to respond.

Receive omission – A server fails to receive messages.

Send omission – A server fails to send messages.

Timing failure – A server’s response lies outside the specified time interval.

Arbitrary failure – A server may produce arbitrary responses at arbitrary times.
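From a client's point of view, some of these failure types can be told apart by looking at whether a response arrived and how long it took. The sketch below is a hypothetical illustration: the names `Outcome` and `classify` and the 200 ms limit are assumptions. It deliberately does not try to detect arbitrary failures, because a wrong but well-formed response cannot be recognised from timing alone.

```python
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    OMISSION = "omission failure"
    TIMING = "timing failure"


def classify(response, elapsed_ms, min_ms=0.0, max_ms=200.0):
    """Classify one request as seen by the client.

    response is None when nothing came back before the client gave up;
    elapsed_ms is the time that passed until the response arrived.
    """
    if response is None:
        return Outcome.OMISSION      # the server failed to respond
    if not (min_ms <= elapsed_ms <= max_ms):
        return Outcome.TIMING        # responded outside the allowed interval
    return Outcome.OK


print(classify(None, 1000.0))        # Outcome.OMISSION
print(classify("result", 350.0))     # Outcome.TIMING
print(classify("result", 20.0))      # Outcome.OK
```

The client cannot tell a send omission from a receive omission this way, since in both cases it simply sees no response.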

This article on Fault Tolerance in Distributed System is contributed by Hemalatha P. If you like TheCode11 and would like to contribute, you can also write your article and mail to thecode11info@gmail.com
