Recovery Techniques for Database Systems, by Joost Verhofstad
One-line summary: This paper presents seven techniques commonly used
for recovery in database systems.
- Failure: An event at which the system does not perform according
to specifications. There are three kinds of failures:
- failure of a program or transaction
- failure of the total system
- hardware failure
- Recovery Data: Data required by the recovery system for the
recovery of the primary data. In very high reliability systems, this data
might also need to be covered by a recovery mechanism... Recovery data
is divided into two categories: 1) data required to keep current values,
and 2) data to make the restoration of previous values possible.
- Transaction: The base unit of locking and recovery (for undo,
redo, or completion), appears atomic to the user.
- Database: A collection of related storage objects together with
controlled redundancy that serves one or more applications. Data is stored
in a way that is independent of programs using it, with a single approach
used to add, modify, or retrieve data.
- Correct State: Information in the database consists of the most
recent copies of data put in the database by users and contains no data
deleted by users.
- Valid State: The database contains part of the information of the
correct state. There is no spurious data, although pieces may be missing.
- Consistent State: A valid state in which the information satisfies the
user's consistency constraints. What these constraints are varies from
database to database.
- Crash: A failure of a system that is covered by a recovery mechanism.
- Catastrophe: A failure of a system that is not covered by a recovery
mechanism.
- Possible Levels of Recovery:
- Recovery to the correct state.
- Recovery to a checkpointed (past) correct state.
- Recovery to a possible previous state.
- Recovery to a valid state.
- Recovery to a consistent state.
- Crash resistance (prevention).
The bigger the damage, the cruder the recovery technique used.
- Recovery Techniques:
- Salvation program: Run after a crash to attempt to restore the
system to a valid state. No recovery data used. Used when all other
techniques fail or were not used. Good for cases where buffers were lost in
a crash and one wants to reconstruct what was lost...(4,5)
- Incremental dumping: Modified files copied to archive after job
completed or at intervals. (3,4)
- Audit trail: Sequences of actions on files are recorded. Optimal
for "backing out" of transactions. (Ideal if the trail is written out
before the update itself is performed.)
- Differential files: Separate file is maintained to keep track of
changes, periodically merged with the main file. (2,3)
- Backup/current version: Present files form the current version of
the database. Files containing previous values form a consistent backup
version.
- Multiple copies: Multiple active copies of each file are
maintained during normal operation of the database. In cases of failure,
comparison between the versions can be used to find a consistent version.
- Careful replacement: Nothing is updated in place, with the
original only being deleted after operation is complete. (2,6)
(The parenthesized numbers indicate which of the recovery levels above
each technique supports.)
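The audit trail technique above lends itself to a short sketch: record the old value of each item before updating it, so an in-progress transaction can be backed out. This is a minimal illustration, not code from the paper; all names (`AuditedStore`, `back_out`, etc.) are hypothetical.

```python
# Minimal sketch of an audit trail over an in-memory key-value store.
# Old values are recorded BEFORE each update (write-ahead), so a
# transaction's changes can be undone ("backed out") in reverse order.

class AuditedStore:
    def __init__(self):
        self.data = {}
        self.trail = []  # (txn_id, key, old_value) records, oldest first
                         # old_value is None when the key was absent

    def update(self, txn_id, key, value):
        # Write the trail entry before the update itself, so the
        # recovery data always covers the change.
        self.trail.append((txn_id, key, self.data.get(key)))
        self.data[key] = value

    def back_out(self, txn_id):
        # Restore old values in reverse order of the updates.
        for tid, key, old in reversed(self.trail):
            if tid != txn_id:
                continue
            if old is None:
                self.data.pop(key, None)
            else:
                self.data[key] = old
        self.trail = [r for r in self.trail if r[0] != txn_id]

store = AuditedStore()
store.update(1, "x", 10)
store.update(2, "y", 20)
store.update(1, "x", 11)
store.back_out(1)      # transaction 1's updates are undone; "y" survives
print(store.data)      # {'y': 20}
```

Writing the trail entry before the data update is what makes the trail usable for recovery: after a crash the trail never lags behind the data.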
Combinations of two techniques can be used to offer similar protection
against different kinds of failures. The techniques above, when implemented,
force changes to:
- The way data is structured (4,5,6).
- The way data is updated and manipulated (7).
- nothing (available as utilities) (1,2,3).
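Careful replacement (technique 7, the one that changes how data is updated) can be sketched for a single file: write the new version beside the old one and swap it in atomically, discarding the original only after the new version is safely on disk. This is an illustrative sketch, assuming a POSIX-style filesystem; the function name and file name are hypothetical.

```python
# Sketch of careful replacement: the file is never updated in place.
# A crash at any point leaves either the complete old version or the
# complete new version, never a half-written file.

import os
import tempfile

def careful_write(path, data: bytes):
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)   # new version, written aside
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # force the new version to disk first
        os.replace(tmp, path)      # atomic rename: old version discarded
                                   # only once the operation is complete
    except BaseException:
        os.unlink(tmp)             # a failure leaves the original intact
        raise

careful_write("db_file", b"new contents")
```

The temporary file is created in the same directory as the target so that `os.replace` is a same-filesystem rename, which is the atomic step.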
- Examples and bits of wisdom:
- Original Multics system: all disk files updated or created by the user
are copied when the user signs off. All newly created or modified files not
previously dumped are copied to tapes once per hour. High reliability, but
very high overhead. Changed to a system using a mix of incremental dumping,
full checkpointing, and salvage programs.
- Several other systems maintain backup copies of data through the paging
system (keep backups in the swap space).
- Use of buffers is dangerous for consistency.
- Intention lists: the sequence of intended actions is recorded, as an
audit trail, before any of the actions is actually performed.
- Recovery among interacting processes is hard. You can either prevent the
interaction or synchronize with respect to recovery.
- Error detection is difficult, and can be costly.
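The intention-list idea above can be sketched briefly: record the complete list of planned updates first, then apply them; if a crash interrupts the application step, recovery simply re-applies (redoes) the recorded intentions, which is safe because each is idempotent. A minimal illustration with hypothetical names, not code from the paper:

```python
# Sketch of an intention list: planned updates are recorded in full
# before any of them is applied. A crash mid-way is recovered by
# redoing the whole list; blind overwrites are idempotent, so
# repeating them is harmless.

data = {}
intention_log = []   # committed intention lists; durable storage in a real system

def apply_intentions(intentions):
    for key, value in intentions:
        data[key] = value

def commit(intentions):
    intention_log.append(list(intentions))  # 1. record intentions first
    apply_intentions(intentions)            # 2. only then carry them out

def recover():
    # Redo every committed intention list, including any whose
    # application a crash may have interrupted.
    for intentions in intention_log:
        apply_intentions(intentions)

commit([("x", 1), ("y", 2)])
data.clear()          # simulate losing the updated state in a crash
recover()
print(data)           # {'x': 1, 'y': 2}
```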
Relevance: Recovery from failure is a critical factor in databases. In
case of disaster, it is very important that as much as possible (if not
everything) is recovered. This paper surveys the methods that were in use
at the time for data recovery.
Flaws: This paper is excessively verbose; it could and should have been
shorter and more concise. The examples especially could have been clearer
and less involved. It might have been more valuable to give a single
complete view of several systems than to detail the migration of (now
obsolete) systems over time. The terminology and categories presented at
the beginning were useful and potentially timeless, which the examples
were not.