Recovery Techniques for Database Systems, by Joost Verhofstad
One-line summary: This paper presents seven techniques commonly used
for recovery in database systems.
- Failure: An event at which the system does not perform according
to specifications. There are three kinds of failures:
- failure of a program or transaction
- failure of the total system
- hardware failure
- Recovery Data: Data required by the recovery system for the
recovery of the primary data. In very high reliability systems, this data
might also need to be covered by a recovery mechanism... Recovery data
is divided into two categories: 1) data required to keep current values,
and 2) data to make the restoration of previous values possible.
- Transaction: The base unit of locking and recovery (for undo,
redo, or completion), appears atomic to the user.
- Database: A collection of related storage objects together with
controlled redundancy that serves one or more applications. Data is stored
in a way that is independent of programs using it, with a single approach
used to add, modify, or retrieve data.
- Correct State: Information in the database consists of the most
recent copies of data put in the database by users and contains no data
deleted by users.
- Valid State: The database contains part of the information of the
correct state. There is no spurious data, although pieces may be missing.
- Consistent State: A valid state in which the information satisfies the
user's consistency constraints. What these constraints are varies from
database to database.
- Crash: A failure of a system that is covered by a recovery mechanism.
- Catastrophe: A failure of a system that is not covered by a recovery
mechanism.
- Possible Levels of Recovery:
- Recovery to the correct state.
- Recovery to a checkpointed (past) correct state.
- Recovery to a possible previous state.
- Recovery to a valid state.
- Recovery to a consistent state.
- Crash resistance (prevention).
The bigger the damage, the cruder the recovery technique used.
- Recovery Techniques:
- Salvation program: Run after a crash to attempt to restore the
system to a valid state. No recovery data used. Used when all other
techniques fail or were not used. Good for cases where buffers were lost in
a crash and one wants to reconstruct what was lost...(4,5)
- Incremental dumping: Modified files copied to archive after job
completed or at intervals. (3,4)
- Audit trail: Sequences of actions on files are recorded. Optimal
for "backing out" of transactions. (Ideal if the trail is written out
before the update itself is performed.)
- Differential files: Separate file is maintained to keep track of
changes, periodically merged with the main file. (2,3)
- Backup/current version: Present files form the current version of
the database. Files containing previous values form a consistent backup
version.
- Multiple copies: Multiple active copies of each file are
maintained during normal operation of the database. In cases of failure,
comparison between the versions can be used to find a consistent version.
- Careful replacement: Nothing is updated in place, with the
original only being deleted after operation is complete. (2,6)
(The parenthesized numbers indicate which of the recovery levels above
each technique supports.)
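The audit trail technique above lends itself to a short sketch: record the old value of each item before updating it, so an in-progress transaction can be backed out. This is a minimal illustration, not code from the paper; all names (`AuditedStore`, `back_out`, etc.) are hypothetical.

```python
# Minimal sketch of an audit trail over an in-memory key-value store.
# Old values are recorded BEFORE each update (write-ahead), so a
# transaction's changes can be undone ("backed out") in reverse order.

class AuditedStore:
    def __init__(self):
        self.data = {}
        self.trail = []  # (txn_id, key, old_value) records, oldest first
                         # old_value is None when the key was absent

    def update(self, txn_id, key, value):
        # Write the trail entry before the update itself, so the
        # recovery data always covers the change.
        self.trail.append((txn_id, key, self.data.get(key)))
        self.data[key] = value

    def back_out(self, txn_id):
        # Restore old values in reverse order of the updates.
        for tid, key, old in reversed(self.trail):
            if tid != txn_id:
                continue
            if old is None:
                self.data.pop(key, None)
            else:
                self.data[key] = old
        self.trail = [r for r in self.trail if r[0] != txn_id]

store = AuditedStore()
store.update(1, "x", 10)
store.update(2, "y", 20)
store.update(1, "x", 11)
store.back_out(1)      # transaction 1's updates are undone; "y" survives
print(store.data)      # {'y': 20}
```

Writing the trail entry before the data update is what makes the trail usable for recovery: after a crash the trail never lags behind the data.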
Combinations of two techniques can be used to offer similar protection
against different kinds of failures. The techniques above, when implemented,
force changes to:
- The way data is structured (4,5,6).
- The way data is updated and manipulated (7).
- nothing (available as utilities) (1,2,3).
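Careful replacement (technique 7, the one that changes how data is updated) can be sketched for a single file: write the new version beside the old one and swap it in atomically, discarding the original only after the new version is safely on disk. This is an illustrative sketch, assuming a POSIX-style filesystem; the function name and file name are hypothetical.

```python
# Sketch of careful replacement: the file is never updated in place.
# A crash at any point leaves either the complete old version or the
# complete new version, never a half-written file.

import os
import tempfile

def careful_write(path, data: bytes):
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)   # new version, written aside
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # force the new version to disk first
        os.replace(tmp, path)      # atomic rename: old version discarded
                                   # only once the operation is complete
    except BaseException:
        os.unlink(tmp)             # a failure leaves the original intact
        raise

careful_write("db_file", b"new contents")
```

The temporary file is created in the same directory as the target so that `os.replace` is a same-filesystem rename, which is the atomic step.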
- Examples and bits of wisdom:
- Original Multics system: all disk files updated or created by the user
are copied when the user signs off. All newly created or modified files not
previously dumped are copied to tapes once per hour. High reliability, but
very high overhead. Changed to a system using a mix of incremental dumping,
full checkpointing, and salvage programs.
- Several other systems maintain backup copies of data through the paging
system (keep backups in the swap space).
- Use of buffers is dangerous for consistency.
- Intention lists: the sequence of intended actions is recorded, as an
audit trail, before any of the actions is actually performed.
- Recovery among interacting processes is hard. You can either prevent the
interaction or synchronize with respect to recovery.
- Error detection is difficult, and can be costly.
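The intention-list idea above can be sketched briefly: record the complete list of planned updates first, then apply them; if a crash interrupts the application step, recovery simply re-applies (redoes) the recorded intentions, which is safe because each is idempotent. A minimal illustration with hypothetical names, not code from the paper:

```python
# Sketch of an intention list: planned updates are recorded in full
# before any of them is applied. A crash mid-way is recovered by
# redoing the whole list; blind overwrites are idempotent, so
# repeating them is harmless.

data = {}
intention_log = []   # committed intention lists; durable storage in a real system

def apply_intentions(intentions):
    for key, value in intentions:
        data[key] = value

def commit(intentions):
    intention_log.append(list(intentions))  # 1. record intentions first
    apply_intentions(intentions)            # 2. only then carry them out

def recover():
    # Redo every committed intention list, including any whose
    # application a crash may have interrupted.
    for intentions in intention_log:
        apply_intentions(intentions)

commit([("x", 1), ("y", 2)])
data.clear()          # simulate losing the updated state in a crash
recover()
print(data)           # {'x': 1, 'y': 2}
```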
Relevance: Recovery from failure is a critical factor in databases. In
case of disaster, it is very important that as much as possible (if not
everything) is recovered. This paper surveys the methods that were in use
at the time for data recovery.
Flaws: This paper is excessively verbose; it could and should have been
shorter and more concise. The examples especially could have been clearer
and less involved. It might have been more valuable to give a single
complete view of several systems than to detail the migration of (now
obsolete) systems over time. The terminology and categories presented at
the beginning were useful and potentially timeless, which the examples
were not.