Recovery Management in QuickSilver

Haskin, Malachi, Sawdon and Chan

One line summary:

Uses atomic transactions as the mechanism for failure recovery in QuickSilver, a distributed system structured as clients and servers.

Timeouts: clients set timeouts on their requests to servers. Problem: a timeout cannot distinguish a slow server from a crashed one, which can lead to inconsistencies in the system.
Connectionless protocols: servers are stateless, connectionless and idempotent. Problems: some actions cannot be made idempotent; quitting in the middle of a request and retrying can also lead to inconsistencies.
Virtual circuits: failures are detected by the communication system employing connection-oriented protocols. Problem: cannot achieve multiserver atomicity because virtual circuits can fail independently.
Replication: e.g. the NonStop kernel. Problem: too expensive, and it still needs a transaction system underneath to provide recovery.
Transactions: Basic idea: everything belongs to some transaction, and transactions are designated by globally unique transaction ids. A transaction has an owner process and multiple other participant processes. The owner may commit or abort the transaction, but a participant can only abort it.
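
The ownership rule above can be sketched in a few lines of Python. This is only an illustration, not QuickSilver's interface: the `Transaction` class, its method names, and the use of `uuid` as a stand-in for a globally unique transaction id are all assumptions made for this example.

```python
import uuid


class Transaction:
    """Hypothetical model: one owner, many participants (not QuickSilver's API)."""

    def __init__(self, owner):
        # Assumption: a random UUID stands in for a globally unique transaction id.
        self.tid = uuid.uuid4().hex
        self.owner = owner
        self.participants = set()
        self.state = "active"

    def join(self, process):
        """Record a process as a participant of this transaction."""
        if self.state != "active":
            raise RuntimeError("transaction is no longer active")
        self.participants.add(process)

    def commit(self, process):
        """Only the owner may commit."""
        if process != self.owner:
            raise PermissionError(f"{process} is not the owner and cannot commit")
        if self.state != "active":
            raise RuntimeError("transaction is no longer active")
        self.state = "committed"

    def abort(self, process):
        """The owner or any participant may abort."""
        if process != self.owner and process not in self.participants:
            raise PermissionError(f"{process} does not belong to this transaction")
        if self.state != "active":
            raise RuntimeError("transaction is no longer active")
        self.state = "aborted"
```

For example, after `t = Transaction("client")` and `t.join("file_server")`, a call to `t.commit("file_server")` raises `PermissionError`, while `t.abort("file_server")` succeeds.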

One phase: used by servers that maintain only volatile state (a volatile server does not maintain permanent storage). The coordinator sends an end request to each one-phase participant.
Two phase: used by servers that maintain recoverable state. Participants need to vote on the commit:
- vote-abort: the participant undoes its actions, and the second phase is used to announce the abort to everybody else.
- vote-commit-read-only: the participant has not modified any recoverable resources and requests not to be included in phase two of the commit.
- vote-commit-volatile: same as vote-commit-read-only, but the participant wants to be notified of the outcome.
- vote-commit-recoverable: participant has modified recoverable state, so needs to be informed of the results of phase two.
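
The four votes can be illustrated with a small coordinator sketch. The vote names come from the list above; the `Vote` enum, the `run_commit` function, and the participant names are hypothetical. The sketch also assumes that a read-only voter is left out of phase two whatever the outcome, which simplifies the rules in the paper.

```python
from enum import Enum


class Vote(Enum):
    ABORT = "vote-abort"
    COMMIT_READ_ONLY = "vote-commit-read-only"
    COMMIT_VOLATILE = "vote-commit-volatile"
    COMMIT_RECOVERABLE = "vote-commit-recoverable"


def run_commit(votes):
    """Decide the outcome from phase-one votes (hypothetical helper).

    votes maps a participant name to its Vote. Returns the outcome and the
    sorted list of participants that must hear about it in phase two.
    """
    # A single vote-abort aborts the whole transaction.
    outcome = "abort" if Vote.ABORT in votes.values() else "commit"
    notify = []
    for name, vote in votes.items():
        if vote is Vote.ABORT:
            continue  # already undid its own actions
        if vote is Vote.COMMIT_READ_ONLY:
            continue  # assumption: asked to be left out of phase two
        notify.append(name)  # volatile and recoverable voters want the result
    return outcome, sorted(notify)
```

With votes of recoverable, volatile and read-only from three participants, the outcome is a commit and only the first two are contacted in phase two; if any one of them votes abort instead, the outcome flips to abort and the remaining non-read-only voters are told.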
Rules are needed to handle special cases such as: commit before participate; cycles in the transaction graph; new requests after becoming prepared; reappearance of a forgotten transaction, etc.
The coordinator is at the transaction's birth-site, which usually means a user workstation that is likely to fail. To ensure reliability, the coordinator can be migrated, replicated, or both.

The paper gives a lot of detailed numbers, but what is the big picture? The approach should be applicable to real systems, because QuickSilver is used at IBM as a production system.