Report Number: CSL-TR-99-776
Institution: Stanford University, Computer Systems Laboratory
Title: Novel Checkpointing Algorithm for Fault Tolerance on a
Tightly-Coupled Multiprocessor
Author: Sunada, Dwight
Author: Glasco, David
Author: Flynn, Michael
Date: January 1999
Abstract: The tightly-coupled multiprocessor (TCMP), where specialized
hardware maintains the image of a single shared memory,
offers the highest performance in a computer system. To be
deployed in the commercial world, a TCMP must be fault
tolerant. Researchers have designed various
checkpointing algorithms to implement fault tolerance in a
TCMP. To date, these algorithms fall into two principal
classes, in which processors can be checkpoint-dependent on
each other. We introduce a new apparatus and algorithm that
represent a third class of checkpointing scheme. Our algorithm
is distributed recoverable shared memory with logs (DRSM-L)
and is the first of its kind for TCMPs. DRSM-L has the
desirable property that a processor can establish a
checkpoint or roll back to the last checkpoint in a manner
that is independent of any other processor. In this paper, we
describe DRSM-L, show the optimal value of its principal
design parameter, and present results indicating its
performance under simulation.
http://i.stanford.edu/pub/cstr/reports/csl/tr/99/776/CSL-TR-99-776.pdf