Partial Failures

At any given time, many elements of a distributed system may have failed. If the system is designed correctly, these failures are largely invisible to the customer of the system. This property is called high availability. It is usually achieved by replicating a service across multiple components and by duplicating information.
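Replication can hide a failed component from the customer by letting a request fall through to another copy of the service. The following is a minimal sketch of that idea, not a definitive implementation. The names `call_with_failover`, `AllReplicasFailed`, `down`, and `up` are hypothetical, and a failed replica is assumed to signal itself by raising `ConnectionError`.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each replica in order and return the first successful response."""
    failures = 0
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError:
            # Assumption: a failed replica shows up as a ConnectionError.
            failures += 1
    raise AllReplicasFailed(f"{failures} replicas failed")


def down(request: str) -> str:
    """Hypothetical replica that has failed."""
    raise ConnectionError("replica unreachable")


def up(request: str) -> str:
    """Hypothetical replica that is healthy."""
    return f"ok:{request}"


if __name__ == "__main__":
    # Two of three replicas are down, yet the caller still gets an answer.
    print(call_with_failover([down, down, up], "read"))
```

The caller sees a failure only when every replica is down, which is the sense in which individual failures have little visibility.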

Many network protocols used in distributed systems assume that failures occur and that packets are lost. Some, such as TCP, are designed to recover automatically from many classes of failure; others, such as UDP, make the same assumption but leave recovery to the application. Where the protocol recovers, the recovery is nearly transparent to the customer. From the customer's perspective, functionality is not lost, though response time might be slower.
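Recovery from lost packets is commonly built on acknowledgements and retransmission. The sketch below illustrates that pattern under simplifying assumptions: `send_reliably` and `make_lossy_channel` are hypothetical names, and a lost packet is modeled as a `None` reply rather than a real network timeout.

```python
from typing import Callable, Optional, Tuple


def send_reliably(
    send_once: Callable[[bytes], Optional[bytes]],
    payload: bytes,
    max_attempts: int = 5,
) -> Tuple[bytes, int]:
    """Resend payload until it is acknowledged; return (ack, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        ack = send_once(payload)  # None models a lost packet or a lost ack
        if ack is not None:
            return ack, attempt
    raise TimeoutError(f"no acknowledgement after {max_attempts} attempts")


def make_lossy_channel(drops: int) -> Callable[[bytes], Optional[bytes]]:
    """Hypothetical channel that loses the first `drops` transmissions."""
    remaining = drops

    def send_once(payload: bytes) -> Optional[bytes]:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            return None
        return b"ACK:" + payload

    return send_once


if __name__ == "__main__":
    ack, attempts = send_reliably(make_lossy_channel(2), b"hello")
    print(ack, attempts)
```

The caller still receives its acknowledgement after two losses; the only visible effect is the extra attempts, which correspond to the slower response time.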

If a distributed system gradually loses capabilities as more and more of its elements fail, it is said to exhibit graceful degradation. The fact that some services are unavailable does not mean that useful work cannot be accomplished. For example, changes might be recorded in a journal or log file and applied as a batch when the service eventually becomes available again.
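The journal idea can be sketched as follows. This is an illustrative sketch only: `JournaledStore` is a hypothetical class, the remote service is stood in for by an in-memory dictionary, and the journal is kept in memory rather than in a durable file.

```python
from typing import Dict, List, Tuple


class JournaledStore:
    """Key-value client that queues writes while the service is unavailable."""

    def __init__(self) -> None:
        self.service: Dict[str, str] = {}  # stands in for the remote service
        self.available = True
        self.journal: List[Tuple[str, str]] = []

    def put(self, key: str, value: str) -> None:
        """Apply the write now if possible, otherwise record it for later."""
        if self.available:
            self.service[key] = value
        else:
            self.journal.append((key, value))

    def reconnect(self) -> int:
        """Mark the service available and replay the journal in order."""
        self.available = True
        replayed = len(self.journal)
        for key, value in self.journal:
            self.service[key] = value
        self.journal.clear()
        return replayed


if __name__ == "__main__":
    store = JournaledStore()
    store.available = False       # the service goes down
    store.put("a", "1")           # work continues; changes are journaled
    store.put("a", "2")
    print(store.reconnect(), store.service)
```

Replaying the entries in their original order means later writes to the same key win, as they would have had the service been available throughout.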

Eventually, a point is reached where so many elements have failed that the distributed system dissolves into individual partitions. The partitions are usually regrouped with the aid of voting algorithms once quorums have been established.
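One common way to define a quorum is a strict majority of all nodes, which guarantees that at most one partition can hold it. The sketch below assumes that definition; the text does not say which voting algorithm is meant, and `has_quorum` and `primary_partition` are hypothetical names.

```python
from typing import List, Optional, Set


def has_quorum(partition: Set[str], all_nodes: Set[str]) -> bool:
    """True if the partition contains a strict majority of all nodes."""
    return 2 * len(partition) > len(all_nodes)


def primary_partition(
    partitions: List[Set[str]], all_nodes: Set[str]
) -> Optional[Set[str]]:
    """Return the one partition holding a quorum, or None if none does."""
    for partition in partitions:
        if has_quorum(partition, all_nodes):
            return partition
    return None


if __name__ == "__main__":
    nodes = {"a", "b", "c", "d", "e"}
    # A 3/2 split: only the three-node side may continue as the system.
    print(primary_partition([{"a", "b", "c"}, {"d", "e"}], nodes))
    # An even 2/2 split of four nodes leaves no side with a quorum.
    print(primary_partition([{"a", "b"}, {"c", "d"}], {"a", "b", "c", "d"}))
```

Because two disjoint partitions cannot both contain more than half of the nodes, the majority rule prevents two partitions from each acting as the whole system.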

Many distributed applications can tolerate the failure of a single point in their underlying services: any one point in the system can fail, yet the application as a whole continues to function correctly. Such an application is called one-point-failure safe. Similarly, an application that tolerates the failure of any two points is two-point-failure safe. In general, as the number of tolerated failures increases, so do the complexity and cost of the system needed to support the application.
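For a small system, the definition can be checked by brute force: fail every possible set of up to k components and ask whether the application still works. The sketch below does this under stated assumptions. `is_k_failure_safe` and `works` are hypothetical names, and the example application is assumed to need at least one database replica and one web server.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable


def is_k_failure_safe(
    components: Iterable[str],
    works: Callable[[FrozenSet[str]], bool],
    k: int,
) -> bool:
    """True if the application works after any set of at most k failures."""
    all_components = frozenset(components)
    for failures in range(k + 1):
        for failed in combinations(sorted(all_components), failures):
            alive = all_components - frozenset(failed)
            if not works(alive):
                return False
    return True


def works(alive: FrozenSet[str]) -> bool:
    """Assumed rule: needs one live database replica and one live web server."""
    has_db = any(name.startswith("db") for name in alive)
    has_web = any(name.startswith("web") for name in alive)
    return has_db and has_web


if __name__ == "__main__":
    system = ["db1", "db2", "web1", "web2"]
    print(is_k_failure_safe(system, works, 1))  # any single failure is survivable
    print(is_k_failure_safe(system, works, 2))  # losing db1 and db2 is not
```

Making this example two-point-failure safe would require a third database replica and a third web server, which illustrates how cost grows with the number of tolerated failures.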

An application that, over extended periods of time, never fails because of hardware errors is fault tolerant. Fault-tolerant hardware usually includes triple redundancy for every component, together with a vote-and-compare unit that establishes results and detects potential faults. Each component exists as three identical units, each running exactly the same software. The outputs are given to the compare unit. If all three units produce the same output, the result is considered correct. If two of the three units agree but the third differs, the third unit is considered to be at fault and the result of the two agreeing units is presented as the answer.
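The voting rule just described can be sketched as a small function. The name `vote` is hypothetical. The case in which all three outputs differ is not covered by the text, so raising an error there is an assumption of this sketch.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple


def vote(outputs: Sequence[Hashable]) -> Tuple[Hashable, Optional[int]]:
    """Compare three unit outputs.

    Returns (agreed value, index of the unit considered at fault or None).
    """
    if len(outputs) != 3:
        raise ValueError("expected outputs from exactly three units")
    value, count = Counter(outputs).most_common(1)[0]
    if count == 3:
        return value, None  # all units agree: result considered correct
    if count == 2:
        # The single dissenting unit is considered at fault.
        faulty = next(i for i, out in enumerate(outputs) if out != value)
        return value, faulty
    # Assumption: with no two units in agreement, no result can be established.
    raise RuntimeError("no two units agree")


if __name__ == "__main__":
    print(vote([7, 7, 7]))
    print(vote([7, 9, 7]))
```

Returning the index of the dissenting unit lets the surrounding system flag that unit for repair while still presenting the majority result.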

Ronald LeRoi Burback
1998-12-16