From Data Engineering to Information Engineering

Gio Wiederhold

Stanford University

Based on an extended abstract for the IEEE Data Engineering Conference keynote, 17 Feb 1994. Historical remarks have been removed.

Databases are ubiquitous, but they are also seen as contributing to 'information overload'. We must reconsider why people need databases or, more specifically, the data retrieved from them.
We need data to make decisions.
We can exploit the definition implied by Shannon in his Information Theory in 1949, that information is novel, i.e., previously unknown to its receiver, and hence can lead to action. Making a decision means choosing among alternatives, and having more information helps in assessing the costs and benefits of those alternatives. Information hence also reduces risk, because it becomes less likely that alternatives with poor benefit/cost ratios will be chosen.

Today's databases do not support decision-making processes directly. Decision making is typically preceded by a planning activity, in which alternatives are developed, assessed, and pruned. Databases of various flavors can contribute, but human and artificial intelligence are needed to help in selecting the data, summarizing them in the forms needed for the decision to be made, merging the results with other sources, and attaching the resulting information to a branch of the tree of alternatives. The decision tree is rarely fully under our control: for every action of ours there are likely to be reactions by other parties, and we need intelligence to enumerate those branches and assign probabilities and costs to them.

Planning differs from record keeping in one crucial respect: whereas past data should report a consistent history, there are many possible futures. An information system must be able to deal with forward projections: the costs and benefits of today's and tomorrow's actions, carried into the future. That means that information systems must deal with multiple future worlds. Using relational terms, every tuple must be stamped with a projected time and a label identifying the alternate world. At every decision point at least one more alternate world is created. If one decision applies to multiple alternatives, then that many future worlds will be created. The possible reactions that we enumerate create yet more.
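The stamping and branching described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a design taken from the talk: the names StampedTuple, WorldStore, branch, decide, record, and history are hypothetical, world 0 is assumed to stand for the single recorded past, and every option applied to every current world is assumed to spawn one new world.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class StampedTuple:
    world: int       # label identifying the alternate world
    time: int        # projected time, e.g. a planning period number
    attribute: str
    value: float


class WorldStore:
    """Hypothetical store in which every tuple carries a world label and a time."""

    def __init__(self):
        self._next_id = count(1)
        self.parent = {0: None}   # world 0: the consistent recorded past (assumption)
        self.action = {0: None}   # action or reaction that created each world
        self.tuples = []

    def branch(self, parent, action):
        """Create one new world that follows `parent` after `action`."""
        world = next(self._next_id)
        self.parent[world] = parent
        self.action[world] = action
        return world

    def decide(self, worlds, options):
        """Apply each option to each given world; every pair yields a new world."""
        return [self.branch(w, o) for w in worlds for o in options]

    def record(self, world, time, attribute, value):
        self.tuples.append(StampedTuple(world, time, attribute, value))

    def history(self, world):
        """Actions and reactions that lead from the past to `world`, oldest first."""
        steps = []
        while self.parent[world] is not None:
            steps.append(self.action[world])
            world = self.parent[world]
        return steps[::-1]


store = WorldStore()
first = store.decide([0], ["build", "lease"])                      # our decision
second = store.decide(first, ["rival enters", "rival stays out"])  # their reactions
store.record(second[0], time=2, attribute="revenue", value=120.0)
print(len(second), store.history(second[0]))
```

One decision with two options followed by two enumerated reactions already yields four leaf worlds, which illustrates how quickly the number of worlds grows.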

It is obvious that dealing with planning information requires new engineering concepts. Data for these multiple worlds is highly redundant, and must be represented effectively. Each world must be identified with a sequence of actions and reactions, and, when one of them changes, must be rapidly recomputed. To assess alternative actions it must be possible to summarize, at a given point in future time, the benefits and costs incurred, and the risks remaining in the future beyond. The summarization must also identify which world is best, so that the best actions can be determined. Note that a MAX-function by itself is inadequate. Some spreadsheets today permit recording of several alternatives, but their matrix representation limits the complexity that can be represented and hence assessed.
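A minimal sketch of such a summarization follows. The rows, the world labels "A" and "B", and the functions summarize and best_world are invented for illustration; net benefit is assumed to be simply benefit minus cost, and remaining risk is not modeled. One reading of the remark about MAX is that a plain maximum returns only the best value, whereas the planner needs to know which world attains it; the sketch therefore returns the world label together with its total.

```python
from collections import defaultdict

# Hypothetical projected rows: (world, projected_time, benefit, cost)
rows = [
    ("A", 1, 10.0, 4.0),
    ("A", 2, 12.0, 5.0),
    ("A", 3, 30.0, 2.0),
    ("B", 1, 20.0, 6.0),
    ("B", 2, 9.0, 8.0),
    ("B", 3, 5.0, 1.0),
]


def summarize(rows, horizon):
    """Net benefit accumulated per world up to and including `horizon`."""
    totals = defaultdict(float)
    for world, time, benefit, cost in rows:
        if time <= horizon:
            totals[world] += benefit - cost
    return dict(totals)


def best_world(rows, horizon):
    """Return (world, total), not just the maximal total."""
    totals = summarize(rows, horizon)
    world = max(totals, key=totals.get)
    return world, totals[world]


print(best_world(rows, horizon=2))  # ('B', 15.0)
print(best_world(rows, horizon=3))  # ('A', 41.0)
```

With these made-up numbers the preferred world changes with the horizon, which is why the point in future time at which worlds are compared has to be an explicit parameter.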

If databases are to move closer to applications in planning and decision making they must be engineered to serve information needs. In addition to dealing with the demands to manage more and more varied data, as discussed above, we will list here some technological hurdles that information systems must address in the next decade.

Temporal algebras that permit aggregation, analysis, and projection from the past into the future.

Concurrent management of multiple future worlds.

Application of planned actions to all relevant future worlds.

Rapid recomputation of future worlds based on recent factual observations.

Assignment and computation of probabilities over the tree of future worlds.

Evaluation functions that can assess probabilities and multi-factor costs and benefits of future worlds at remote points in time.

Efficient pruning of future worlds, based on automatic or computer-assisted human evaluation.
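Several of the items above (probabilities over the tree, evaluation functions, and pruning) can be illustrated together on a toy tree. The sketch below is one conventional way to do this, expected-value rollback over a decision tree, and is not claimed to be the method the talk had in mind. The Node class, the functions value and prune, the margin parameter, and all numbers are hypothetical; costs and benefits are collapsed into a single net figure for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    kind: str = "leaf"     # "decision" (our choice), "chance" (their reaction), or "leaf"
    prob: float = 1.0      # probability of this branch, used when the parent is a chance node
    net: float = 0.0       # net benefit incurred on reaching this node
    children: list = field(default_factory=list)


def value(node):
    """Expected net benefit: maximize at decisions, probability-weight at chance nodes."""
    if not node.children:
        return node.net
    if node.kind == "decision":
        return node.net + max(value(c) for c in node.children)
    return node.net + sum(c.prob * value(c) for c in node.children)


def prune(node, margin):
    """At each decision node, drop options worth less than the best option minus `margin`."""
    for child in node.children:
        prune(child, margin)
    if node.kind == "decision" and node.children:
        best = max(value(c) for c in node.children)
        node.children = [c for c in node.children if value(c) >= best - margin]
    return node


tree = Node("now", kind="decision", children=[
    Node("expand", kind="chance", net=-10.0, children=[
        Node("rival matches", prob=0.5, net=5.0),
        Node("rival holds", prob=0.5, net=30.0),
    ]),
    Node("wait", net=4.0),
    Node("exit", net=-3.0),
])

print(value(tree))                        # 7.5
prune(tree, margin=5.0)
print([c.label for c in tree.children])   # ['expand', 'wait']
```

A fixed margin is only one possible pruning rule; the list item also allows for computer-assisted human evaluation, which this sketch does not attempt.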

This list cannot be exhaustive, but it provides a direction for research that can help carry the success of database technology over to the needs of the application world. This direction should not be isolated from ongoing research in dealing with the larger volumes, distribution, and heterogeneity of data we encounter. However, without a focus on the actionable benefits of data, just supplying more and fancier data to the end-user will create overload and frustration. Databases and their abstract models have largely removed the frustration of dealing with programmer-controlled files, and replaced those files with effective data services. Understanding, abstracting, and modeling information requirements can lead to a new level of services.