A NonStop Kernel

Joel Bartlett

One-line summary: This paper presents a fault-tolerant, expandable and distributed computer system designed for online transaction processing.

Overview/Main Points

System Motivation and Design:
- Attempt to build a fault-tolerant system for general-purpose computing. Traditionally, such systems were custom-designed for specific tasks with high failure costs, such as telephone switching. The target workload (online transaction processing) mostly moves data between discs and terminals, with little actual data processing.
- Redundant hardware (processors, interconnect, power, I/O control, disk mirroring) is meant to provide continuous operation in the presence of any single fault. The system is designed to detect, diagnose, and repair/reintegrate hardware failures.
- Three classes of physical failure
  - Permanent hardware failure. Requires recovery algorithms and may suffer from contamination occurring before detection.
  - Intermittent component failure. Much more likely to corrupt data.
  - External interference. Likely to crash entire system.
- Fault tolerance in the OS.
  - Every processor runs a monitor process and a memory management process.
  - All information flow is carried by messages rather than shared storage, to provide location independence. Positive acknowledgment is used for fault tolerance and detection. Both localized and end-to-end checking are used.
  - The OS controls the message system, providing protection, information hiding, and control of error recovery for message failures.
  - Server processes are given a guaranteed minimum of resources, to prevent deadlock. All processors use checksums and periodically send "are you alive?" messages to each other to check for failure.
  - Server and I/O processes have checkpointed backup processes that take over when the primary process fails.
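
The failure detection and process-pair ideas in the list above can be illustrated with a small sketch. This is not the paper's implementation: the class names (FailureDetector, CounterServer, ProcessPair), the timeout value, and the single-counter state are all hypothetical, and real message passing between processors is replaced by plain method calls.

```python
from dataclasses import dataclass, field


@dataclass
class FailureDetector:
    """Records when each peer processor was last heard from (hypothetical)."""
    timeout: float                               # assumed silence threshold, seconds
    last_seen: dict = field(default_factory=dict)

    def heard_from(self, cpu: int, now: float) -> None:
        # Called whenever a liveness message from `cpu` arrives.
        self.last_seen[cpu] = now

    def suspected_down(self, now: float) -> list:
        # Any peer silent for longer than the timeout is presumed failed.
        return sorted(cpu for cpu, t in self.last_seen.items()
                      if now - t > self.timeout)


@dataclass
class CounterServer:
    """A toy server whose entire state is one running total."""
    total: int = 0

    def checkpoint(self) -> dict:
        return {"total": self.total}

    def restore(self, state: dict) -> None:
        self.total = state["total"]


class ProcessPair:
    """A primary server plus a backup that is kept current by checkpoints."""

    def __init__(self) -> None:
        self.primary = CounterServer()
        self.backup = CounterServer()
        self.primary_alive = True

    def request(self, amount: int) -> int:
        if not self.primary_alive:
            raise RuntimeError("primary is down; call take_over() first")
        self.primary.total += amount
        # Checkpoint to the backup before replying, so an acknowledged
        # request is never lost if the primary fails afterwards.
        self.backup.restore(self.primary.checkpoint())
        return self.primary.total                # reply acts as the positive ack

    def fail_primary(self) -> None:
        self.primary_alive = False

    def take_over(self) -> None:
        # The backup becomes the primary and a fresh backup is brought
        # up to date from it.
        self.primary, self.backup = self.backup, CounterServer()
        self.backup.restore(self.primary.checkpoint())
        self.primary_alive = True
```

In this sketch, a requester that has received a reply can rely on the update surviving a primary failure, because the checkpoint reaches the backup before the reply is returned.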
Performance:
Not covered quantitatively at all. It would have been interesting to see how such a system compared in cost and performance to a non-fault-tolerant system (especially with 2 MB of memory per processor in 1974!).
Qualitatively, the message system imposes a performance penalty in exchange for fault tolerance, ease of system expansion, and the possibility of system evolution. A major problem was designing non-monolithic software that would run optimally on such a machine.

Conclusions

This system must have been among the first of its kind, which makes it remarkable for its time, although it seems very expensive and somewhat simple/obvious today.

The modern functional equivalent might be a NOW (network of workstations).
