    A.  Databases and Their Objectives


    B.  Introduction to Databases


       1.  Components of databases

       2.  File management systems versus database management systems

       3.  Related systems


    C.  Scientific Basis for Database Technology


       1.  The schema

       2.  The data model

       3.  Types of data models


    D.  Database Operation


       1.  Entering data into the database

       2.  Data storage

       3.  Data organization for retrieval

       4.  Data presentation

       5.  Database administration


    Acknowledgements


    References







In this paper we will introduce the concepts of database technology in a

way that will make it easy to relate the issues of the technology to

problems in health care.  After the objectives of the database approach

have been defined, the major components of

databases and their function will be discussed. The remainder of this

paper presents the scientific and the operational issues associated

with database technology in healthcare.  The importance and growth of

these systems have been documented [Lindbe79].  Rather than providing a

survey of the field, this exposition is intended to link general concepts

to the practices observed by us [Henley75] and others [Palley75,




A.  Databases and Their Objectives


A database is a collection of related data, organized so that

useful information may be extracted.  The effectiveness of databases

derives from the fact that much of the information relevant to a variety

of organizational purposes may be obtained from one single, comprehensive

database.  In health care the same database may be used by medical

personnel for patient care recording, for surveillance of patient status,

and for treatment advice; it may be used by researchers in assessing the

effectiveness of drugs and clinical procedures; and it can be used by

administrative personnel in cost accounting and by management for the

planning of service facilities.


The fact that data are shared promotes consistency of information for

decision-making and reduces duplicate data collection. A major benefit

of databases in health care is due to the application of the information

to the management of services and the allocation of resources needed for

those services.  Communication through the shared information among

health care providers, and the validation of medical care hypotheses from

observations on patients are also significant aspects of sharing data

and can be the primary objective in certain health care settings.


The contents and the description of a database have to be carefully managed

in order to provide for this wide range of services, so that some degree

of formal data management is implied when we speak of databases. The

formalization and the large data quantity implied in effective database

operations make computerization of the database function essential; in fact,

much of the incentive for early [Bush45] and current computing technology

[Barsam79] is due to the demands made by information processing needs.


In order to process data we need data and processing tools.  The notion

of a database hence encompasses the stored data themselves, the hardware

used to store the data, and the software used to manipulate the data.  When

the database is used for multiple purposes we find also an administration

which controls and assigns the resources needed to maintain the data

collection and permit the generation of information.


We will now define the technical scope of databases, and begin with a more

precise definition of the terms.

B.  Introduction to Databases


      Within the scope of databases are a number of concepts that are

easily confused with each other.  The objective of a database is to provide

information, but not all systems that provide information are databases.

We will first define the term 'database', and then some terms that describe

aspects of database technology.  As we refine the descriptions more terms

will appear and be defined.  For further clarification we will in Section

B.3 present types of systems which are related or similar to databases, but

are not considered databases within this review.


            A database is a collection of related data,

            with facilities that process these data to

            yield information.            


A database system facilitates the collection, organization, storage,

and processing of data.  The processing of data from many sources can

provide information that would not have been available before the data

were combined into a database.  Hence, a collection of data is not by

itself a database, a system that supports data storage is not necessarily

a database system, and not all the information provided by computer

systems is produced from databases.


B.1  Components of Databases


We stated earlier that a database is composed of data, of programs or

software to enter and

manipulate the data, and of computers or hardware. Both data and software

are stored within the computers which support the database.

While multiple, interconnected computers may be used to support a single

database, today such systems are few, and we will define the software

systems in terms of a single, central computer.  The users may of course be

remote, and use terminals connected by telephone lines to access the

database.  The internal organization of the stored data, the hardware, and

the software may not be obvious to the users.  For many users that innocence

may be desirable, but when decisions about databases have to be made

a good understanding of their components and the interactions among them is

required.



We will now describe some of the components that are

part of database software.  Databases require the availability of

well-written programming tools, or software subsystems.  Some of these tools

that are used to support databases can also be used independently,

and hence they are at times confused with the database system itself.

Important subsystems are:


     a) File Storage Systems : software to allocate and manage space

      for data kept on large computer storage devices, such as disks or


     b) File Access Methods : software to rapidly access and update

      data stored on those devices.

     c) Data Description Languages : languages that are used to describe

      data so that users and machines can refer to data elements and

      aggregations of similar data elements conveniently and unambiguously.

     d) Data Manipulation Languages : languages to allow the user to

      write programs that retrieve and process data from the database.


In a database system these subsystems have to be well integrated, so that

the entire data manipulation is carried out in response to simple

commands that refer to the terms defined in the data descriptions.  Storage

is allocated and rearranged as new data enter the database, and access

to old and new data is provided as needed for manipulation.  To provide the

reliability that is needed to satisfy demands by users who cannot be

bothered by the technical complexities of computer systems, redundant

backup data are stored separately and appropriately identified whenever

the database is changed.  Optional software components of a database may

provide on-line, conversational access to the database, help with the

formulation of statistical queries, and provide printed reports on a regular

basis.


B.2  File Management Systems versus Database Management Systems


Of primary concern to a database effort is the reliable operation of the

devices used to store the data over long periods of time.  The programming

systems which provide such services, typically inclusive of the tools listed

in a) and b) above, are called file management systems.            


When data are shared, then access conflicts can occur due to concurrent

updates of data, use of data while they are being changed, and attempts to read

confidential data without having the proper access privilege.   

To resolve access conflicts the system must be able to control database

requests made by the individual users, and recognize the specific

data units which these users will be referencing.  Control

over the data and its use can only be achieved if all users access the

database always via programs that will protect the reliability, privacy, and

integrity of the database. We achieve reliability when data are not lost due

to hardware and software errors.  We protect privacy when we guarantee that

only authorized access will occur.  We define integrity as freedom from

errors that could be introduced by simultaneous use of the database by users

that may update its contents.  A database management system should

provide all the required database support programs, including management of

files, scheduling of user programs, database manipulation, and recovery from

errors.  To make the system easy to use all of these functions should be

well integrated.
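
The integrity problem defined above can be illustrated with a minimal sketch in Python.  Several simulated users update one shared record, and the read-modify-write step is carried out as one unit under a lock.  The names Record, update, and run_users are our own illustrative assumptions and do not describe any of the systems named in this paper.

```python
import threading


class Record:
    """A shared data unit whose updates are serialized by a lock."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def update(self, change):
        # Read, modify, and write back while holding the lock, so that
        # no other user can interleave between the read and the write.
        with self._lock:
            current = self.value
            self.value = current + change


def run_users(record, users=8, updates_each=1000):
    """Let several simulated users update the same record concurrently."""

    def work():
        for _ in range(updates_each):
            record.update(1)

    threads = [threading.Thread(target=work) for _ in range(users)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return record.value


total = run_users(Record())
```

Without the lock, two users could read the same old value and one of the two updates would be lost; with it, every update is reflected in the final value.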


Not every database is managed by a database management system (DBMS). 

Database support can also be provided by programs that use one of the

available file management systems (FMS).  When an FMS is used, programs in

some computer language will have to be written to augment the FMS with the

functions that carry out the query understanding, the data processing,

the output presentation, and any protection tasks that are desired.

The contents of the database can be identical for a system using

a generalized DBMS product or one using programs and an FMS. 

A locally developed collection of programs rarely has all of

the protective features that are desirable when multiple users interact with

the database from terminals.  The manner in which users gain access

to shared data will always depend on the choice of the DBMS or

the file management system.  For instance, a file system does not

provide automatic scheduling of user requested activities.   Without a

DBMS the users will have to schedule their own activities in such a

way that conflicting data entry or update is avoided.  Some

file systems will simply disallow such access; in other systems such usage

could lead to inconsistent data.  If data entry activities are organized so

that such conflicts are avoided then there is less need for the complexity

of a DBMS.  A very popular file management system in medicine is MUMPS,

developed at Massachusetts General Hospital to support clinical use of

relatively small computer systems [Bowie77].

Both file management systems (FMS) and database management systems (DBMS)

are available from commercial vendors for most computers.  Some DBMS's will

make use of an existing FMS; others will perform all but the most primitive file

access functions themselves.  A DBMS interacts closely with the user

of the database, but the specification of the interaction varies greatly.

We find that several distinct types of DBMSs have been

developed.  Some systems stress user-oriented data description and operation,

and others stress a high performance for large databases, and expect that a

computer professional will design the database structure, and provide the

users with the functions they need.

DBMS's also differ in terms of the comprehensiveness of software services,

and some commercial DBMS's are large, have more functions than any

single user would need, and hence are often costly to obtain and to maintain.

Most manufacturers provide an FMS at no additional direct cost, but

acquisition of a DBMS is rarely free.


The choice of a particular type of database management system will influence

the structure of the future database.  Not every type of DBMS will be

available on a given computer, but for most medium to large computers there

is some choice.  Simplicity versus generality and cost are often a trade-off.

Even so-called generalized database management systems impose, to a great

extent, the view of the designer or sponsor of such a DBMS. Many of the

major systems now being marketed were designed to solve the complexities

of specific applications.  We find today some DBMS's that excel in inventory

management, some do excellent retrieval of bibliographic citations, others

have a strong bias towards statistical processing.  Even within the medical

area different DBMS's will emphasize one of the many objectives that are

found within the range from patient care to medical research.  The following

table will list some database systems found in medicine with an indication

of their objective.  We distinguish in this table: general ambulatory patient

care, clinical or speciality outpatient care, hospital inpatient care, or

patient management and record keeping in these areas. Clinical studies refers

to research data collection on defined populations.  Guidance refers to the

giving of medical advice during the inquiry process.  Details of these types

of application are given in [Wieder80].  The types of database organizations

can be categorized as tabular, relational, hierarchical, or network.  These

terms will be defined in Section C.3.



  | Name    | Application  | Type          | FMS used      | Computers   | Reference |
  |---------+--------------+---------------+---------------+-------------+-----------|
  | CCSS    | Clin.Studies | Tab. DBMS     | Seq. files    | various     | Kronma78  |
  | CIS     | Med.Guidance | Tab. DBMS     | DEC seq.files | DEC11       | McDona77  |
  | CLINFO  | Clin.Studies | Tab. DBMS     | DG ISAM       | DG Eclipse  | Groner78  |
  | COSTAR  | Amb.Pat.Rec. | Hier.DBMS     | MUMPS         | DEC 11      | Barnet79  |
  | GEMISCH | Clin.Recs.   | Hier.DBMS     | DEC-11 DOS    | DEC 11      | Hammon78  |
  | GMDB    | Clin.Studies | Tab. DBMS     | IBM VSAM      | IBM 370     | Wirtsc78  |
  | FAME    | Clin.Studies | Hier.DBMS     | CDC SIS       | CDC Cyber   | Brown78   |
  | IDMS    | Pat.Mgmt.    | Netw.DBMS     | Basic Access  | IBM 370     | Penick76  |
  | IMS     | Hosp.Recs.   | Multi-        | IBM VSAM      | IBM 370     | Sauter76  |
  |         |              | hierarch.DBMS |               |             |           |
  | LIM     | Regional Rec | Hier.DBMS     | IBM DL/1(IMS) | IBM 360     | Jainz76   |
  | MIDAS   | Regional Rec | Hier. FMS     | Direct files  | Univac494   | Fenna78   |
  | MISAR   | Res.Data     | Tab. DBMS     | MUMPS         | DEC 15      | Karpin71  |
  | MUMPS   | Med.Records  | Hier. FMS     | self.cont.    | DEC 15,11   | Barnet76  |
  |         |              |               |               | DG and more | Bowie77   |
  | OCIS    | Clin.Recs.   | Mult.Hier.    | MUMPS-11      | DEC 11-70   | Blum79    |
  | TOD     | Clin.Recs.   | Tab. DBMS     | PL/1 ind.seq. | IBM 370     | Wieder75  |
  | PROMIS  | Hosp.Recs.&  | Tab. FMS      | direct files  | Univac      | Schult79  |
  |         | Med.Guidance |               |               | V77-600     |           |
  | RISS    | Hosp.Recs.   | Relat.DBMS    | RTS11 ind.seq | DEC 11      | Meldma78  |



         Database and File Management Systems Found in Health Care

B.3  Related Systems


Data are collected and stored into a database with the expectation that at a

later time the data can be analyzed, conclusions can be drawn, and that the

information obtained can be used to influence future actions.  Information

is generated from data through processing, and should increase the knowledge

of the receiver of this information.  This person then should have the means

to act upon the information, perhaps to the benefit of a larger community.


 || The production of information is the central objective of a database. ||


There are other automated information processing systems which are not

considered databases, although they may share some of the technology.  In the

remainder of this section two categories of such related systems will be

described.



INFORMATION SYSTEMS store information - often the output of earlier data

analyses - for rapid selective retrieval [Beckle77].  A well known example

is the MEDLARS system [Leiter77,Doszko80], a service of the National Library

of Medicine, which provides access to papers published in the medical

literature.  The task of such an information system is the selection and

retrieval of information, but not the generation of information [Lucas78].

Index Medicus for instance only provides the references, and depends on the

user's own library [Kunz79]. Even maintenance of personal reference files

can be effectively automated [Reiche68].  The benefits are due to the speed

and improved coverage with which the documents can be found.


The boundary between information systems and database systems is not at all

absolute.  One can perhaps even speak of a spectrum of system types.  When

the queries are simple the two system types are in fact indistinguishable. 

Retrieval of the age of a patient, for instance, can be carried out with

equal facility on either type of system. But when another observation, say

cholesterol level, has to be compared with the average cholesterol level for

all other patients of the same age, then a computation to generate this

information is needed, and a system which is able to do this is placed more

on the database side of the spectrum.
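
The distinction drawn above can be sketched in a few lines of Python.  The first function only retrieves a stored value; the second has to compute over the records of other patients before it can answer.  The patient records and the function names are invented for this illustration.

```python
from statistics import mean

# Invented sample records, used only to illustrate the two kinds of query.
patients = [
    {"id": "P001", "age": 54, "cholesterol": 262},
    {"id": "P002", "age": 54, "cholesterol": 210},
    {"id": "P003", "age": 54, "cholesterol": 198},
    {"id": "P004", "age": 31, "cholesterol": 180},
]


def find(patient_id):
    """Locate one patient record by its identifier."""
    return next(p for p in patients if p["id"] == patient_id)


def age_of(patient_id):
    """Simple retrieval: either type of system can answer this."""
    return find(patient_id)["age"]


def cholesterol_vs_peers(patient_id):
    """Generated information: compare one value with the average for
    all other patients of the same age."""
    me = find(patient_id)
    peers = [
        p["cholesterol"]
        for p in patients
        if p["age"] == me["age"] and p["id"] != patient_id
    ]
    # mean() raises StatisticsError when there are no peers of that age.
    return me["cholesterol"] - mean(peers)
```

The second query needs access to the whole collection and a computation over it, which is what places a system on the database side of the spectrum.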



DECISION SUPPORT SYSTEMS assist with the manipulation of data supplied by

the user [Davis78].  The help may be principally algorithmic - perhaps

assuring that Bayes' rule is properly applied.  More specialized systems

embody medical knowledge [Johnso79], for instance in acid-base balance

assessment [Bleich72] and anti-microbial therapy [Yu79].  While these systems

could be coupled to databases, so that they become also knowledgeable about

a specific patient, today they are typically separate [Gabriel78].

The knowledge embodied in databases could provide an objective basis for

knowledge-oriented systems [Blum78].  Work in decision making for health

care cost control has indicated a need for database facilities in these

applications [BrookW76].
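
The algorithmic help mentioned at the start of this paragraph, the proper application of Bayes' rule, can be sketched as follows.  This is a minimal illustration for a single test with two outcomes; the function name and its parameters are our own assumptions and are not taken from any of the cited systems.

```python
def posterior(prior, sensitivity, specificity):
    """Probability of disease given a positive test result (Bayes' rule).

    prior        P(disease) before the test
    sensitivity  P(positive | disease)
    specificity  P(negative | no disease)
    """
    true_positive = sensitivity * prior
    false_positive = (1.0 - specificity) * (1.0 - prior)
    return true_positive / (true_positive + false_positive)
```

With a prior of 0.01, a sensitivity of 0.90, and a specificity of 0.95, the posterior is about 0.154: most positive results still come from patients without the disease, a point that is easily misjudged without such a calculation.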


The HELP system, at the LDS hospital in Salt Lake City, does keep a separate

file of clinical decision criteria and applies them to the patient database

as it is updated.  The system then advises the physician to consider certain

actions or further diagnostic tests [Warner78].  As medical databases become

more reliable and comprehensive we can envisage increased exploitation of the

information contained in them by systems which embody medical knowledge.
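
The general idea of keeping decision criteria in a separate file and applying them whenever the patient database is updated can be sketched as below.  This is not the mechanism of the HELP system itself; the Criterion structure, the thresholds, and the advice texts are invented placeholders and carry no clinical meaning.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    element: str                      # data element the criterion watches
    applies: Callable[[float], bool]  # test applied to each new value
    advice: str                       # message returned when the test holds


# Kept apart from the patient data; thresholds are invented placeholders.
CRITERIA = [
    Criterion("TEMP", lambda v: v >= 39.0, "review temperature observation"),
    Criterion("PULSE", lambda v: v > 120, "review pulse observation"),
]

patient_db = {}


def update(patient_id, element, value):
    """Store a new value, then apply every criterion for that element."""
    patient_db.setdefault(patient_id, {})[element] = value
    return [c.advice for c in CRITERIA if c.element == element and c.applies(value)]
```

Because the criteria are data rather than program text, they can be revised without changing the update procedure.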

C.  The Scientific Basis for Database Technology



The emergence of databases is not so much due to particular inventions, but

is a logical step in the natural development of computing technology.  The

evolution of computational power began with the achievement of adequate

reliability of complex electronic devices. The mean-time-to-failure reached

several hours for powerful computers about 1955.  At that point the concerns

moved to the development of programming languages, so that programs of

reasonable power could be written.  These programs had the capacity to

process large quantities of data, and in the early sixties magnetic tape and

disk devices were developed to make the data available. Operating systems to

allocate storage and processing power to the programs became the next

challenge.  By the late sixties these systems had matured so that multi-user

operation became the norm. As these foundations were laid it became feasible

to keep data available on-line, i.e., directly accessible by the computer

system without manual intervention such as fetching and mounting computer

tapes. Now a variety of application programs can use those data as needed. 

In current systems valuable data can be kept on-line over long periods

without fear of loss or damage to the database.


C.1  The Schema


The one technical concept which is central to database management systems

is the schema.  A schema is a formalized description of the data that are

contained in the database, available to the programs that wish to use the

data.  All data kept in such a database are identified with a name, say DOB

for date-of-birth.  With a schema it is sufficient for application programs

to specify the name of the data they wish to retrieve.  A command may state:


    date-of-birth = GET ( current-patient, DOB )


The database system will use the schema to match the name of the requested

data.  When a corresponding entry in the schema is found, the database system

can use information associated with the entry to determine where the

requested data have been stored, locate the data values, and retrieve them

into the application program area ( date-of-birth ) for analysis or display.

During this process it is possible to check that the requestor is authorized

to access the data.  The DBMS may also have to change the data into a

representation that the program can handle [Feins70A].  Similar processes

are carried out by the DBMS when old data are to be updated and when new

data are to be added to the database.
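
The steps just described (match the name in the schema, check authorization, locate the stored value, and convert its representation) can be sketched in Python.  The layout of SchemaEntry, the STORAGE stand-in, the role names, and the stored date format are all assumptions made for this illustration; an actual DBMS will differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaEntry:
    name: str           # reference name used by programs, e.g. "DOB"
    file: str           # which stored file holds the values (assumed)
    stored_as: str      # stored representation (assumed)
    readers: frozenset  # roles authorized to read this element (assumed)


SCHEMA = {
    "DOB": SchemaEntry("DOB", "patients", "yyyymmdd",
                       frozenset({"clinician", "researcher"})),
}

# Stand-in for the stored data: file -> patient ID -> element -> raw value.
STORAGE = {"patients": {"P001": {"DOB": "19420317"}}}


def get(patient_id, name, role):
    """Retrieve one data element by name, guided entirely by the schema."""
    entry = SCHEMA.get(name)
    if entry is None:
        raise KeyError(f"{name} is not described in the schema")
    if role not in entry.readers:
        raise PermissionError(f"role {role!r} may not read {name}")
    raw = STORAGE[entry.file][patient_id][name]
    # Convert the stored form into one the program can handle.
    if entry.stored_as == "yyyymmdd":
        return f"{raw[0:4]}-{raw[4:6]}-{raw[6:8]}"
    return raw


date_of_birth = get("P001", "DOB", role="clinician")
```

The application program names only the element it wants; where and how the value is stored is known to the schema alone.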


The schema is established before any data can be placed into the database

and embodies all the decisions that have been made about the contents and

the structure of the database.  Each individual type of data element will

receive a reference name. The data to be kept under this name may be further

defined.  The most important specification is whether the data are numeric,

a character string, or a code.  Codes then need tables or programs for their

definition.  Other schema entries give the format and length of the data

element, and perhaps the range of acceptable values.  For observations of

body temperature the five descriptors might be:


      TEMP, temperature in degrees C, numeric, XX.X, 36.0 to 44.0.
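
A schema entry of this kind can be held as a small record and used to check incoming values.  The sketch below is one possible encoding of the five descriptors; the class name, the field names, and the way the XX.X format is rendered are our own assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementDescriptor:
    name: str         # reference name
    description: str  # meaning and unit
    kind: str         # numeric, character string, or code
    fmt: str          # external format, e.g. "XX.X"
    low: float        # lowest acceptable value
    high: float       # highest acceptable value

    def accepts(self, value):
        """True if a numeric value lies within the acceptable range."""
        return self.kind == "numeric" and self.low <= value <= self.high

    def render(self, value):
        """Render a value as XX.X (the only format handled in this sketch)."""
        return f"{value:04.1f}"


TEMP = ElementDescriptor(
    "TEMP", "temperature in degrees C", "numeric", "XX.X", 36.0, 44.0
)
```

A value of 41.9 is accepted by this entry, while a value such as 45.2 would be rejected at data entry.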


The data elements so described will have to fit into a structure; a value by

itself, say TEMP = 41.9, is of course meaningless.  This data element belongs

in an observation record, and the observation record must contain other

data elements, namely a patient identification (ID), a date, and a time.

These data elements, which are used to identify the entity described in the

record, constitute the ruling part; without these there is insufficient

information present to make the TEMPerature observation useful.  The ruling

part data types { ID, DATE, TIME } will also appear in the schema. 


The observation record may contain, in addition to TEMP, other dependent 

data elements, such as the pulse rate, the blood pressure, and the name of the

observer.  The entire observation record can then be described as a list of

seven attributes, as follows:


   Observations:  ID, DATE, TIME > TEMP, PULSE, BP, OBSERVER;


The first three attributes form the ruling part, the other four are the

dependent part; we separate the two parts with a > symbol.  Each attribute

has associated with it a schema entry with the five descriptors shown for

the TEMP entry above.  There will be other kinds of records in the database: 

a patient demographic data record will exist in most databases we consider.

Here the only data element in the ruling part will be the ID field. This

record may be in part as below:


   Patients:      ID > PATIENT-NAME, ADDRESS, DOB, SEX, ... ;


Matching of the ID fields establishes the relationship between patient

demographic data records and the observation records. The known relationships

between record types should also be described in the schema, so that the use

of the schema is simplified [Manach75, ChangO76].
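
The two record types above can be sketched with the ruling part as the key and the dependent part as the value.  The sample values are invented, and the use of Python dictionaries is only one convenient way to show how matching on ID relates the two record types.

```python
# Patients:  ID > PATIENT-NAME, DOB, SEX   (invented sample values)
patients = {
    ("P001",): {"PATIENT-NAME": "Doe, J.", "DOB": "1942-03-17", "SEX": "F"},
    ("P002",): {"PATIENT-NAME": "Roe, R.", "DOB": "1951-09-02", "SEX": "M"},
}

# Observations:  ID, DATE, TIME > TEMP, PULSE, BP, OBSERVER
observations = {
    ("P001", "1980-02-11", "08:00"):
        {"TEMP": 37.2, "PULSE": 72, "BP": "120/80", "OBSERVER": "RN-4"},
    ("P001", "1980-02-11", "20:00"):
        {"TEMP": 38.4, "PULSE": 88, "BP": "125/85", "OBSERVER": "RN-7"},
    ("P002", "1980-02-12", "09:30"):
        {"TEMP": 36.8, "PULSE": 64, "BP": "110/70", "OBSERVER": "RN-4"},
}


def observations_for(patient_id):
    """Match the ID in the ruling part of each observation record."""
    return {
        key: dependent
        for key, dependent in observations.items()
        if key[0] == patient_id
    }


def patient_of(observation_key):
    """Follow the ID of an observation back to the demographic record."""
    return patients[(observation_key[0],)]
```

A dependent value such as TEMP is reachable only through its complete ruling part, which is what gives the value its meaning.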


We use three types of connections to describe relationships between records

[Wieder79]; their use is also sketched in the figure in Section C.2.


 a) The Identity Connection - used where the ruling parts are similar,

               but different groupings are described;

       for instance both hospital patients and diabetes clinic patients

         are patients with patient ID's, but have different dependent

         data stored in their files.


 b) The Reference Connection - used where there is a common descriptive

                record referred to by multiple data records;

         for instance the Physician-seen is a record type referred to

         from the patient's clinic visit records.


 c) The Ownership Connection - used where there are many subsidiary records

           of some type which depend on a higher level record;

           multiple ownership connections define an association;

       for instance the multiple clinic visits of a specific patient,

         each with data on his temperature, blood pressure, etc. form

         an owned nest of the patient record. 

         An association occurs in the figure where a physician has

         admitting privileges at one or more hospitals, and each hospital

         grants admitting privileges to a number of physicians. The

         admitting-privileges file has as ruling part both the physician's

         ID and the hospital name, a dependent data element might be the

         date the privilege was granted.


Associated with the connection types may be rules for the maintenance of

database integrity.  Such rules can inform the database system that certain

update operations are not permissible, since they would make the database

inconsistent.  For example we would not want to add a clinic patient without

adding a corresponding record to the general patient file, if the patient

did not yet exist there.  Similarly, deletion of a physician's record from

the database implies deletion of the associated admitting privileges.
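
Both integrity rules can be demonstrated with a present-day relational system.  The sketch below uses SQLite through Python's standard sqlite3 module, which is a substitute chosen for illustration and not one of the systems discussed in this paper; the table and column names are assumptions.  Note that the sketch rejects a clinic patient who has no general patient record, rather than adding the missing record automatically.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite enforces references only when asked
db.executescript("""
CREATE TABLE patients (id TEXT PRIMARY KEY, name TEXT);
-- identity connection: every clinic patient must also be a patient
CREATE TABLE clinic_patients (id TEXT PRIMARY KEY REFERENCES patients(id));
CREATE TABLE physicians (id TEXT PRIMARY KEY, name TEXT);
-- association: ruling part is physician ID plus hospital name
CREATE TABLE admitting_privileges (
    physician_id TEXT REFERENCES physicians(id) ON DELETE CASCADE,
    hospital     TEXT,
    granted      TEXT,
    PRIMARY KEY (physician_id, hospital)
);
""")

# Rule 1: a clinic patient without a general patient record is refused.
try:
    db.execute("INSERT INTO clinic_patients VALUES ('P9')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Once the general patient record exists, the same kind of update is accepted.
db.execute("INSERT INTO patients VALUES ('P1', 'Doe, J.')")
db.execute("INSERT INTO clinic_patients VALUES ('P1')")

# Rule 2: deleting a physician deletes the associated admitting privileges.
db.execute("INSERT INTO physicians VALUES ('D1', 'Dr. A')")
db.execute("INSERT INTO admitting_privileges "
           "VALUES ('D1', 'General Hospital', '1979-06-01')")
db.execute("DELETE FROM physicians WHERE id = 'D1'")
remaining = db.execute("SELECT COUNT(*) FROM admitting_privileges").fetchone()[0]
```

The rules are stated once, with the description of the data, and the system then applies them to every update regardless of which program issues it.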



C.2  The Data Model


In order to provide guidance for the creator of the schema it is important

to have design tools.  A large database can contain many types of records,

and even more relationships between the record types.  These have to be

understood and used by a variety of people: the programmers who devise data

entry and analysis programs, the researchers who wish to explore the database

in order to formulate or verify new hypotheses, and the planners who wish to

use the data as basis for modelling so that they can predict the response

to future actions.  A variety of models exist [ACM76]; some models are

abstractions of the facilities that certain types of database management

systems can provide, others use more generalized, mathematical abstractions

to represent the data and their relationships.  Recent work in database

research is directed towards improving the representation of the semantics

of the data [Hammer78, ElMasr79, Codd79] so that the constraints of the

relationships that exist in the real world can be used to verify the

appropriateness of data that are entered into the database. 


Any reasonable model of the database can provide a common ground for

communication between users and implementors; without a model there is apt

to be an excess of detail [Wieder78].  An example of a data model for a

clinical database is shown below.


  ----------           -----------------
 | Patients |  <<<<<  | Clinic Patients |
  ----------           -----------------
                               |              -----------       ----------
                               |    seen .-->| Physician |     | Hospital |
                               *        /     -----------       ----------
                        ---------------/           |                 |
                       | Clinic Visits |           |                 |
                        ---------------            |                 |
                               |                   *                 *
  --------------               |                  ---------------------
 | Pharmacopeia |              |                 | Admitting Privilege |
  --------------               |                  ---------------------
         |                     |
         *                     *
        -------------------------
       |    Drugs Prescribed     |
        -------------------------


      The ownership ( --* ) connections indicate that there may be

          multiple inferior instances for each superior instance.

      The reference ( --> ) connections indicate that there may be

          multiple references to each instance.

      The identity ( >>> ) connection defines a subgroup.

C.3  Types of Database Models


A popular approach to database analysis distinguishes several categories

of databases. Database system implementations can be associated

with each category.  These categories are represented by database

model types; the best known types are the


      Relational model - derived from the mathematical theory of

            relations and sets.

      Hierarchical model - related to tree-shaped database implementations,

            similar to corporate organization diagrams.

      Network model - permits interconnections that are more complex

            than hierarchies, based on a definition developed by a

            committee of specialists in commercial system languages.


The structural model, used here, can describe the structures of any of these

three models, as well as of other database implementations.  If only a

single record-type - a box in the above diagram - is implemented then we are

dealing with a 'universal relation' [Ullman80].  A single box for a

complex database would have many columns and rows, and contain many null

entries.  If the data are organized into several record-types, each

corresponding to some meaningful entity, then we are dealing with a

'tabular database'; if a completely general query and processing capability

exists in such a system, we have implemented the 'relational model'.



At this point the entities stand alone, and some analysis is needed to

relate them. If any of the indicated connections have been implemented then

we may have a network or a hierarchical database. In the hierarchical model

a record-type may have only one ownership connection ( --* ) pointing to it.

Hierarchies model well the view of the data held by physicians or other

controlling groups [Davis70].

The implementation of multiple ownership connections, which creates a

network with associations, is considerably more complex [Stoneb75, Wieder77].

Several of the larger commercial DBMS's are based on work by the Data Base

Task Group of CODASYL, and do support such network structures [Olle78].

These systems often do not support the general inquiry capability of the

relational model implementations.
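
The rule that separates a hierarchy from a network can be checked mechanically.  The sketch below lists only the ownership connections stated in the text of Section C.1 and counts how many point to each record type; the function names are our own.

```python
from collections import Counter

# (owner, owned) pairs for the ownership connections described in C.1.
OWNERSHIP = [
    ("Patient", "Clinic Visits"),
    ("Physician", "Admitting Privilege"),
    ("Hospital", "Admitting Privilege"),
]


def association_types(connections):
    """Record types with more than one ownership connection pointing to them."""
    owners = Counter(owned for _, owned in connections)
    return sorted(name for name, count in owners.items() if count > 1)


def is_hierarchical(connections):
    """A hierarchy allows at most one ownership connection per record type."""
    return not association_types(connections)
```

Here the admitting-privileges file is owned by both physician and hospital, so the structure as a whole is a network; the patient and clinic visit records taken alone form a hierarchy.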


It is important to note that there is a distinction between a model and its

implementation.  A model is an abstraction and provides a level of insight

which can cut through masses of confusing detail.  Some models specialize

in selected aspects of the use of data [Bolour79].  In the implementation

this detail has to be considered.  It is likely that the implementation will

differ considerably from the model used to describe it.  As more powerful

models are developed this distinction may become greater. An implementation

may then be best described in terms of transformations that are applied to

the model which defines the database at a high conceptual level.  Most

transformations are done for reasons of operational performance and

reliability.  A large class of transformations adds redundant data into

the implementation in order to speed up retrieval and gain reliability,

whereas our models should have minimal redundancy.  Examples of redundant

data that may be added during implementation are indexes, duplicated

reference information, and pointers between related data which physically

implement connections.




D.  Database Operations


In the section above we have discussed the scientific basis of databases.

In order to use and benefit from that science a database operation has to

be established, and that involves many decisions of practical, but critical

importance. This section will consider such topics.


After a database has been established and a suitable software system has

been obtained, data collection can commence.  Data are often collected

partially from sources that were in existence before a database was

considered.  To complete the database, so it can serve the intended broad

scope, new data collection points may have to be defined.  The value of

adding any quantity of data to the database has to be considered since

data collection and entry are costly and susceptible to errors.  We will

begin with a discussion of issues in entering data, and then proceed to

data storage and organization concerns, discuss data presentation issues,

and finish with some remarks about database administration.


D.1  Entering Data into the Database


The relatively high cost of data entry is a major concern.  It is obvious

that data that cost more to collect than they are worth should be avoided.

When a certain data element is entered its utility is hard to predict: its

usefulness may depend on its value, on the completeness of this patient's

record, and on the patient's returning to the clinic, so that follow-up is

possible.  These factors are not easy to control.  The actual problem of

data acquisition can, however, be addressed.  Much less formal attention

has been given in the literature to this subject than to the topic of data

retrieval [Greenf76].


When data are to be collected there are the costs of the actual collection,

of the transcription to some processable form, and of the actual entry into

a computer.  The data collection is to a great extent the physician's task. 

While automated clinical instruments can collect objective values [Friedm78],

and the patients themselves can enter their own history [Slack66], many

subjective and important findings emanate from the physician [Collen74].


It may be considered desirable to minimize changes to the traditional

manner of medical data recording, so that the physicians continue to collect

their findings as notes in free text or by dictation. These reports are

then transcribed by clerical personnel into the computer.  This format

presents the medical information in a way that is least affected by

mechanical restrictions.  To enable retrieval of such observations the

specific statements or paragraphs may be categorized into functional

groups such as findings, treatment, plans, etc., as proposed by [Korein71].  A

system based on these concepts has served well in a city hospital

pediatric clinic setting.  Of particular importance was that patient data

retrieval for emergency and unscheduled visits became possible [Lyman76].

When textual data are to be used for analysis, we find that they are nearly

impossible to process in the form they were entered.  An immediate problem

is that the natural language text has to be parsed so that its meaning

can be extracted.  Both the parsers and the associated dictionaries are

substantial pieces of software.  But even when language understanding is

achieved, consistent data for entry may not have been obtained since

medical terminology varies over time and among health care providers.  In

general some encoding is needed.  It may then be of benefit both to the

physician and to the system to choose a method of data collection which

encodes data immediately into a more rigorous form.

This is done typically through the use of preprinted, perhaps problem

area specific, forms [Barnet79].  A system for patient surveillance

uses computer-printed forms which are prepared to be specific to each

patient visit [McDonal75].

Various choices exist to encode data:


    1.  The encoding can be carried out by clerical personnel [Valbon75].


    2.  Natural language, i.e. English text, may be analyzed and converted

      by a program that processes the text within the medical context

        [Pratt73, Okubo75].


    3.  A constrained set of keywords for data values, for example the list:

             {no, light, moderate, serious},

      can be attached to the schema entry for a specific data type.

      These data values will be converted on data entry to an internal

      code [Wieder75].


    4.  Where the number of possible data elements for which data are to be

      collected is large, the name of the data element, e.g. 'facial rash',

      may be encoded in addition to the data value itself [Hammon73, Wong78].


    5.  Keywords may be checked on a form or selected from a menu presented

      on a display screen [Schult76].  Selection can be accomplished using

      touch-sensitive screens, lightpens, cursors operated by joysticks or

      key-pads, or by entering on a keyboard a digit which refers to a

      line of the presented menu.


    6.  Where the list of keywords is too long for screen presentation

      a hierarchical menu selection can be provided or a subset of the

      keywords corresponding to a few initial letters can be displayed.



    7.  The forms or menus to be used for data collection may be generated

      using the schema of the database management system [Hanley78].
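
Choice 3 above can be sketched in a few lines of Python.  This is only an
illustration under stated assumptions, not the method of any cited system:
the schema layout, the attribute name `rash_severity`, and the functions
`encode` and `decode` are hypothetical.

```python
# Hypothetical schema entry: each attribute lists its permitted keywords.
# The position of a keyword in the list serves as its internal code.
SCHEMA = {
    "rash_severity": ["no", "light", "moderate", "serious"],
}


def encode(attribute, keyword):
    """Convert a keyword entered by the user into its internal code."""
    keywords = SCHEMA[attribute]
    if keyword not in keywords:
        raise ValueError(f"{keyword!r} is not permitted for {attribute}")
    return keywords.index(keyword)


def decode(attribute, code):
    """Convert an internal code back into its keyword for presentation."""
    return SCHEMA[attribute][code]
```

Because the keyword list lives in the schema, the same entry could also drive
the generation of a form or menu, as in choice 7.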


With the continuing development of fast display technology the latter choices

seem to have the most promise.  The response for screen selection and

presentation of the next menu has to be extremely rapid ( 0.4 sec. per screen

is cited ) to encourage direct physician use of the devices [Watson77].  Such

speeds are very difficult to achieve today, since the display frames reside

on remote disk storage devices and have to be fetched, formatted, and

transmitted by file, application dependent, and communication programs for

presentation on terminals.  When those terminals are connected via telephone

lines to the computer another bottleneck appears. To transmit a display

frame of 24 lines of 50 characters each, at the fastest available rate, 9600

bits/second, still requires one second. To cope with this problem either

special communication lines or storage devices local to the terminal are

needed.
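
The one-second figure follows from simple arithmetic.  The sketch below
assumes 8 bits per character and no start, stop, or protocol overhead; the
text does not state the framing, so `bits_per_char` is an assumption.

```python
def frame_transmission_seconds(lines, chars_per_line, bits_per_second,
                               bits_per_char=8):
    """Time to send one display frame, ignoring protocol overhead."""
    total_bits = lines * chars_per_line * bits_per_char
    return total_bits / bits_per_second


# 24 lines x 50 characters x 8 bits = 9600 bits, sent at 9600 bits/second.
seconds = frame_transmission_seconds(24, 50, 9600)
```

With 10 bits per character, as on an asynchronous line with start and stop
bits, the same frame would take 1.25 seconds, so the estimate in the text is
if anything optimistic.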


Numeric values are not as easily entered on a touch-screen as are choices

among discrete elements. Keyboard entry may continue to dominate this part

of data entry, unless the values can be obtained directly from medical

instrumentation.  Typed data require much editing.  Comprehensive commands

for specification of input editing are part of the MUMPS language, and

have contributed greatly to its acceptance.  Modern computer languages, such

as PASCAL, also provide within the variable declarations a capability to limit

the range or the set of choices of values to be entered.
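
The idea of a declared range or set of permitted values, checked when a
value is typed, can be sketched in Python.  The attribute names and the
numeric limits are illustrative assumptions only; they are not taken from
MUMPS, PASCAL, or any cited system.

```python
# Hypothetical declarations: a whole-number range for a numeric attribute
# and a set of choices for a keyword attribute.
DECLARATIONS = {
    "systolic_pressure": range(40, 301),   # illustrative limits only
    "rash_severity": {"no", "light", "moderate", "serious"},
}


def accept(attribute, value):
    """Return True if the typed value satisfies the declaration."""
    return value in DECLARATIONS[attribute]
```

A value such as 1600, a likely keying error for 160, is rejected at entry
time instead of reaching the stored record.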


D.2  Data Storage


The cost of data storage is now much lower than the cost of data entry.  This

means that if data entry is worthwhile, the entered data can be stored for a

reasonably long time.  The characteristics of medical record structure can,

however, easily lead to a waste of computer storage space which is an order

of magnitude greater than the actual data storage space needed.  This

occurs when data are stored in simple rectangular tables, since the variety

of medical data requires many columns, but at one encounter only a few

values will be collected.  Hierarchical file organizations allow linkage to

a variable number of subsidiary data elements, and in this manner provide

efficient storage utilization, whereas the older tabular files dealt

poorly with medical data [Greene69].  The encoding techniques used for data

entry can also provide compaction of stored data since short codes are

used to denote long keywords.


Data structures can often be compressed by suitable data encoding

techniques applied to the files.  Unobserved data elements should

not need actual storage space.  Data compression can reduce both the

storage requirements and the access times greatly [Wieder77]. When space

considerations are no longer an important issue in data organization, an

apparent tabular format can again be used, and this can simplify data

analysis programs.  In a clinical databank, TOD [Weyl75], the compressed

data, after encoding to account for missing, zero, or repeating data,

occupied only 15% of the original storage space.
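
One simple way to avoid spending space on missing, zero, or repeating values
is run-length encoding.  The sketch below is an assumption chosen for
illustration; it is not claimed to be the scheme used in TOD.  `None` stands
for an unobserved value.

```python
def compress(values):
    """Run-length encode a column of values as [value, count] pairs.

    Long runs of missing (None), zero, or repeated values each collapse
    into a single pair.
    """
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def expand(runs):
    """Restore the original column from its run-length encoding."""
    return [value for value, count in runs for _ in range(count)]


column = [None, None, None, 0, 0, 120, 120, 120, None]
packed = compress(column)
```

After expansion the data again appear in tabular form, so analysis programs
need not be aware of the compression.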


Older data often become less interesting, and can be moved to archival

storage. Storage on magnetic tape is quite inexpensive and the data can be

recovered, if needed for analysis, with a moderate delay.  The cost of

long term data storage on tape is less than $1 per million characters per

year; for recent large disk devices on-line costs are still about $200 per

million characters per year, and several times that on older or smaller

devices.  Keyboard entry of a million characters costs at

least $500, and much more from typical medical data sources.  In a well-run

operation data can be recovered from tapes for on-line access in about

an hour [Soffee76]. The major problems are the development of effective

criteria for selection of data for archival storage and the cataloging of

archival data, so that they can be retrieved when needed.  Candidates for

archiving are detailed records of past hospitalizations and episodes of

acute illnesses.




D.3  Data Organization for Retrieval


The important point in research usage of databases is that information is

not produced by the retrieval and inspection of a few values, but rather

from the relating of many findings in accordance with hypothesized cause

and effect relationships. When the data files grow very large, repetitive

scans for data selection may become prohibitively slow, especially during

the data exploration phase.  We distinguish the following phases in the

research use of clinical databases:


    1.  Initial definition of the data to be collected, with consideration

      for clinical needs. The expected usefulness is often based on

      vague or ill-defined initial hypotheses.


    2.  Exploratory analysis, using tabulations and simple graphics in

      order to compare subsets of the population.


    3.  Hypothesis generation based on perceived patterns, definition of

          independent and dependent variables according to some clinical



    4.  Data validation and sometimes expansion of data collection in the

      areas in which patterns appear interesting.


    5.  Subset definition and generation so that differences due to the

          independent variables can be made explicit.


    6.  Exhaustive statistical analysis of the subsets to verify or refute

      the hypotheses.


It is important to have good subsetting facilities and efficient access to

defined subsets.  Such services are provided in many clinical systems, but

the techniques vary widely.  Often the subsets are extracted and

manipulated as distinct databases [Mabry77].  In other systems a subset is

kept as a collection of references to records in the main database

[German75], and in yet another system the subset is recreated from the

definition of the subset [Todd75].
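
The three subsetting techniques behave differently once the main database
changes.  The following sketch uses a hypothetical in-memory list of visit
records and a hypothetical selection rule; it does not reproduce any of the
cited systems.

```python
visits = [
    {"patient": 1, "systolic": 182},
    {"patient": 2, "systolic": 118},
    {"patient": 3, "systolic": 171},
]


def is_hypertensive(visit):
    """Hypothetical subset definition."""
    return visit["systolic"] > 160


# 1. Subset extracted as a distinct database (copies of the records).
extracted = [dict(v) for v in visits if is_hypertensive(v)]

# 2. Subset kept as references (record numbers) into the main database.
references = [i for i, v in enumerate(visits) if is_hypertensive(v)]


# 3. Subset recreated from its definition each time it is needed.
def recreate():
    return [v for v in visits if is_hypertensive(v)]


# A later correction to the main database.
visits[2]["systolic"] = 130
```

After the correction the extracted copy still holds the old value, the
reference list sees the new value but still counts the record as a member,
and only the recreated subset drops it.  The price of recreation is a fresh
scan of the main database.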


Since access to data in research is primarily by attribute field rather than

by patient record, it can be profitable to transpose the database [Wieder77].

Transposition generates one, possibly very long, record for each attribute

of the database.  Such a record now contains a sequence of values of this

particular attribute for all patients or all visits of all patients.  Many

current computer systems cannot manage such long records easily but the

benefits should be clear:  to relate blood pressure results to dosage of

an anti-hypertensive drug only two records have to be retrieved from the

transposed file.  In a conventional file organized by patient visit every

visit record is accessed to retrieve the two fields needed to accomplish

this comparison.
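
Transposition can be illustrated with a small in-memory example.  The
attribute names `systolic` and `dose_mg` and the function `transpose` are
hypothetical; a real transposed file would hold each attribute record on
secondary storage.

```python
def transpose(visit_records, attributes):
    """Build one long record (a list) per attribute from per-visit records."""
    return {name: [visit.get(name) for visit in visit_records]
            for name in attributes}


visits = [
    {"patient": 1, "systolic": 182, "dose_mg": 50},
    {"patient": 2, "systolic": 118, "dose_mg": 0},
    {"patient": 3, "systolic": 171, "dose_mg": 25},
]

transposed = transpose(visits, ["patient", "systolic", "dose_mg"])

# Relating dosage to blood pressure touches only two attribute records.
pairs = list(zip(transposed["dose_mg"], transposed["systolic"]))
```

The values in each attribute record stay in visit order, which is what allows
the two records to be paired position by position.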


As indicated earlier, the addition of redundant data can speed up the

retrieval process.  An index is an access structure which repeats key

data in a compact form, and is used to avoid scanning an entire

conventional file to search for a record.  Attributes which are expected

to be used in record selection are entered into the auxiliary index file,

which is then maintained in sorted order.  If the attribute being kept sorted

is "bloodpressure", then all hypertensives will appear at the beginning of

the "bloodpressure" index file.  With every bloodpressure value a reference

pointer to the corresponding visit record will be kept.  Now only the data

records for patient visits where the bloodpressure was high will have to be

retrieved.
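
A minimal sketch of such an index follows.  It assumes the index is kept
with the highest values first, so that hypertensive visits appear at its
beginning; the function names and the threshold of 160 are hypothetical.

```python
from itertools import takewhile


def build_index(visit_records, attribute):
    """Return (value, record_number) pairs, highest values first."""
    entries = [(record[attribute], number)
               for number, record in enumerate(visit_records)
               if record.get(attribute) is not None]
    return sorted(entries, reverse=True)


def select_high(index, threshold):
    """Record numbers whose indexed value exceeds the threshold.

    The scan stops at the first entry that fails the test, so the rest
    of the index and the data file are never read.
    """
    return [number for value, number in
            takewhile(lambda entry: entry[0] > threshold, index)]


visits = [
    {"patient": 1, "systolic": 118},
    {"patient": 2, "systolic": 182},
    {"patient": 3},                      # no blood pressure recorded
    {"patient": 4, "systolic": 171},
]

index = build_index(visits, "systolic")
high = select_high(index, 160)
```

Only the visit records numbered in `high` then need to be fetched from the
data file.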



Bitmaps provide a simplified form of indexing.  Whereas an index is based

on the actual data values, a bitmap uses simple categorizations of these

values.  In a list with entries which correspond to the records in the

datafile a bit is set to one if the data values in the record meet a certain

condition. This condition could be a blood pressure greater than 160/100

[Ragan78].  Both indexing and bitmaps can be viewed as providing

the capability of preselection of relevant records.  If the selection of

indexes or bit map definitions matches the retrieval requests well,

access to conventional files can become much faster.  The maintenance of

such access structures will of course require additional effort at the

time of data entry.
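
A bitmap can be sketched with one Python integer per condition.  The reading
of "greater than 160/100" as systolic above 160 or diastolic above 100 is an
assumption, as are the record layout and the function names.

```python
def build_bitmap(visit_records, condition):
    """Pack one bit per record into an int; bit i is 1 if record i qualifies."""
    bitmap = 0
    for position, record in enumerate(visit_records):
        if condition(record):
            bitmap |= 1 << position
    return bitmap


def selected_positions(bitmap):
    """List the record positions whose bit is set."""
    positions = []
    position = 0
    while bitmap:
        if bitmap & 1:
            positions.append(position)
        bitmap >>= 1
        position += 1
    return positions


visits = [
    {"systolic": 182, "diastolic": 104, "on_drug": True},
    {"systolic": 118, "diastolic": 76, "on_drug": False},
    {"systolic": 150, "diastolic": 102, "on_drug": False},
    {"systolic": 171, "diastolic": 95, "on_drug": True},
]

high = build_bitmap(
    visits, lambda v: v["systolic"] > 160 or v["diastolic"] > 100)
treated = build_bitmap(visits, lambda v: v["on_drug"])

# Combining conditions is a single bitwise operation on the two bitmaps.
high_and_treated = selected_positions(high & treated)
```

The bitmap answers only the question it was built for; a request with a
different threshold needs a new bitmap or a value-based index.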


There are many cases where more computation at the time of data entry can

reduce the effort required at data retrieval time. In some applications

it may be known that certain computable results of the collected data will

be needed at a later time. Then such results may actually be already

computed and stored within the database when the source data are entered. 

Typical of a precomputed or actual result is the maximal value of a

clinical observation on a given patient, say blood pressure, which could

be kept available so that no search through multiple visits is needed to

identify a patient with evidence of hypertension [Melski78]. Other

candidates for precomputation are totals, averages, or the range of values

of a variable [Wieder75]. The total amount outstanding on a bill and the

range of a diabetic's blood-sugar level are other examples.
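
Precomputation at entry time can be sketched as a small summary that is
updated whenever a source value is stored.  The class and its field names
are hypothetical and not drawn from the cited systems.

```python
class ObservationSummary:
    """Precomputed results for one observation of one patient."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.minimum = None
        self.maximum = None

    def enter(self, value):
        """Update the stored results when a new source value is entered."""
        self.count += 1
        self.total += value
        if self.minimum is None or value < self.minimum:
            self.minimum = value
        if self.maximum is None or value > self.maximum:
            self.maximum = value

    def average(self):
        """Mean of the values entered so far, or None if there are none."""
        return self.total / self.count if self.count else None


systolic = ObservationSummary()
for reading in (140, 182, 155):
    systolic.enter(reading)
```

A query for evidence of hypertension can then test `systolic.maximum`
directly without visiting the individual visit records.  Note that a later
correction or deletion of a source value would require the summary to be
recomputed.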



D.4  Data Presentation


Data from databases can be presented in the form of extensive reports

for manual scanning, as summary tabulations, or as graphs to provide rapid

visual comprehension of trends. An extensive data analysis may lead to a

printout of statistical findings and their significance, or may provide

clinical advice in terms of diagnosis or treatment.  When simple facts are

to be retrieved the results are apt to be compact and easy to display or

print.  Queries formulated in English can provide retrieval of such

data [Epstein78].  If much computation is used to generate the information

then presentation of the end-results alone is rarely acceptable.  Most medical

researchers will want an explanation of the data sources and algorithms

that led to the output results, as well as information about the expected

reliability of the final values.


These human interface requirements increase the volume of the output for

research studies, so that printed reports dominate.  In most clinical

situations less output is used, so that other methods may be practical.

We have seen the following alternatives [Fries74]:


      1. Detailed listings or rapid video display terminal presentations

         for quick scanning of data.


      2. Cross tabulations or graphics to aid human pattern detection.


      3. Well-structured summaries, with automatic data selection and

         advice for patient care.


      4. Summaries with explanatory backup available on a terminal

         when needed.


      5. Structured report presentation for outside distribution,

         as with billing or result publication.


During a routine patient encounter a paper summary is probably least

distracting, but in emergency situations video terminal access can be much

more rapid.  Terminal access helps the researcher in the formulation of

queries, and graphics provide insight to clinicians uncomfortable with

long columns of numbers. As systems mature and become more accepted the

user should be able to move smoothly from one form of output presentation

to another, but most systems now in use do not provide many options for

data presentation, and even fewer offer a smooth transition between

interaction modes.



D.5  Database Administration


Even when all the right decisions have been made and a database exists,

there has to be an ongoing concern with reliability, adaptation to

changing institutional needs, planning for growth, and technical updating

of the facilities.  In many institutions a new function, that of database

administrator, is defined to deal with these operational issues.  The

database administrator needs strong support from management and high

quality technical assistance.  Since the function is responsible for

day-to-day operations it is not reasonable to expect a high level of

innovation from the database administrator, but responsiveness to the

institutional goals is essential.





Acknowledgements


This paper was prepared as part of a review for a Compendium on Computers

in Health Care, Dr. D.A.B. Lindberg, Univ. of Missouri, Editor.  Much

intellectual stimulation has come from my colleagues on many evaluation

committees.  Marty Epstein of NIH DCRT, Dr. Bob Blum of Stanford, and Dr.

Alan Rector of Nottingham have provided important remarks and insights

into the development of databases.  Dr. Blum was also of major assistance

in making this paper more readable and Marty Epstein provided many

corrections and references.  The resources provided by the SUMEX Facility,

supported by grant NIH RR-00785 were essential to the preparation of this

document.  Background was provided through a study sponsored by the National

Center for Health Services Research (NCHSR) at the University of California

[Henley75] and is aided by research carried out as part of the RX Project,

sponsored by NCHSR under grant No. 1RO3 HS93650.  Basic research in database

design leading to some of the concepts presented here is supported by ARPA

within the Knowledge-Based Management Systems project at Stanford University.

Wiederhold:   Database Technology in Healthcare




