A NonStop Kernel, by Joel Bartlett
One-line summary: This paper presents a fault-tolerant, expandable and
distributed computer system designed for online transaction processing.
- System Motivation and Design:
- An attempt to build a fault-tolerant system for general-purpose computing.
Traditionally, such systems were custom-designed for tasks with high failure
costs, such as telephone switching. The primary function of the target
transaction-processing workload is generally to move data between discs and
terminals, with little actual data processing.
- Redundant hardware (processors, interconnect, power, I/O control, disk
mirroring), meant to provide continuous operation in the presence of a
single fault. The system is designed to detect, diagnose, and
repair/reintegrate failed components.
- Three classes of physical failure
- Permanent hardware failure. Requires recovery algorithms and may
suffer from contamination occurring before detection.
- Intermittent component failure. Much more likely to corrupt data.
- External interference. Likely to crash entire system.
- Fault tolerance in the OS.
- All processors contain a monitor and memory management process.
- All information flow is carried by messages rather than shared
storage, to provide location independence. Positive acknowledgment is used
for fault detection and tolerance. Both localized and end-to-end checking
are used.
- The OS controls the message system, providing protection, information
hiding, and control of error recovery for message failures.
- Server processes are given a guaranteed minimum of resources, to
prevent deadlock. All processors use checksums and periodically send
"are you alive?" messages to each other to detect failures.
- Server and I/O processes have checkpointed backup processes that
take over when the primary process fails.
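The liveness checks and checkpointed backups in the list above can be
illustrated with a minimal sketch of a process pair. This is not the paper's
implementation: the class names (Backup, Primary), the account-balance state,
and the 2-unit timeout are all hypothetical, and time is passed in explicitly
so the example is deterministic. The primary checkpoints its new state to the
backup before applying it, and the backup takes over once it has heard
nothing from the primary for longer than the timeout.

```python
from dataclasses import dataclass, field


@dataclass
class Backup:
    """Backup half of a process pair (hypothetical, for illustration)."""
    timeout: float                 # silence tolerated before takeover (assumed)
    last_heard: float = 0.0        # time of the last message from the primary
    state: dict = field(default_factory=dict)  # last checkpointed state
    active: bool = False           # True once the backup has taken over

    def on_checkpoint(self, now: float, snapshot: dict) -> None:
        # A checkpoint refreshes the saved state and also proves liveness.
        self.state = dict(snapshot)
        self.last_heard = now

    def on_heartbeat(self, now: float) -> None:
        # A bare liveness message carries no state.
        self.last_heard = now

    def tick(self, now: float) -> None:
        # Declare the primary failed after more than `timeout` of silence.
        if not self.active and now - self.last_heard > self.timeout:
            self.active = True


@dataclass
class Primary:
    """Primary half of the pair; holds the live state."""
    backup: Backup
    balance: int = 0

    def deposit(self, now: float, amount: int) -> None:
        # Checkpoint the new state first, then apply it locally, so the
        # backup never lags behind a completed update.
        new_balance = self.balance + amount
        self.backup.on_checkpoint(now, {"balance": new_balance})
        self.balance = new_balance


backup = Backup(timeout=2.0)
primary = Primary(backup)
primary.deposit(now=0.0, amount=100)
primary.deposit(now=1.0, amount=50)
# The primary is assumed to crash here: no further messages arrive.
backup.tick(now=2.5)   # 1.5 units of silence, still within the timeout
backup.tick(now=3.5)   # 2.5 units of silence, so the backup takes over
```

After the second tick the backup is active and holds the state from the last
checkpoint. A real system would also need to handle requests that were in
flight at the moment of failure, which this sketch ignores.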
- Performance:
Not covered quantitatively at all. It would have been interesting to see
how such a system compared in cost and performance to a non-fault-tolerant
system (especially with 2 MB of memory per processor in 1974!).
Qualitatively, the message system imposes a performance penalty in
exchange for fault tolerance, ease of system expansion, and the possibility
of system evolution. A major problem was designing non-monolithic software
that would run optimally on such a machine.
- Conclusions:
This system must have been among the first of its kind,
which makes it remarkable for its time, although it seems very expensive and
somewhat simple/obvious today.
The modern functional equivalent might be a NOW (network of workstations).