Towards a Synopsis Warehouse
           Peter Haas, IBM Almaden Research Center

Data synopses are an essential ingredient of methods for fast
approximate analytical processing, interactive data exploration,
auditing, and automated metadata discovery. We consider the problem of
maintaining a warehouse of synopses that "shadows" a full-scale data
warehouse. Incoming data is decomposed into partitions, and a synopsis
is created for each partition. As the data partitions are rolled in
and out of the full-scale warehouse, the corresponding synopses are
rolled in and out of the synopsis warehouse. Synopses are combined as
needed to yield synopses of the corresponding combination of
partitions. This approach is flexible and, because the partitions can
be processed in parallel, efficient. We discuss some recent work aimed
at supporting a
warehouse of synopses. Our focus is on two types of synopses: uniform
random samples and synopses for estimating the number of distinct data
values in a partition. Our algorithms correct, improve, and extend
techniques such as classical reservoir and Bernoulli sampling, the
"concise" and "sample counting" schemes of Gibbons and Matias, and
various probabilistic-counting methods.
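
To make the idea of combining synopses concrete, here is a minimal
sketch of one standard way to merge two uniform random samples, drawn
without replacement from disjoint partitions, into a uniform sample of
their union. It is an illustration only and is not claimed to be the
algorithm presented in the talk. The function name merge_samples and
its parameters are hypothetical, and the sketch assumes that the
partition sizes n1 and n2 are stored alongside each sample.

```python
import random


def merge_samples(s1, n1, s2, n2, k, rng=random):
    """Merge two uniform samples into a uniform sample of the union.

    s1, s2 : lists holding uniform without-replacement samples of two
             disjoint partitions (hypothetical inputs)
    n1, n2 : sizes of the full partitions the samples were drawn from
    k      : size of the merged sample
    """
    if k < 0 or k > min(len(s1), len(s2)):
        raise ValueError("k must not exceed the smaller sample size")

    # Decide how many of the k merged items come from the first
    # partition. This count is hypergeometric: pick k positions out of
    # n1 + n2 and count those that land in the first partition.
    from_first = sum(1 for i in rng.sample(range(n1 + n2), k) if i < n1)

    # Subsample each input sample uniformly and concatenate.
    return rng.sample(s1, from_first) + rng.sample(s2, k - from_first)


if __name__ == "__main__":
    rng = random.Random(0)
    part1 = list(range(100))        # partition of size 100
    part2 = list(range(100, 400))   # partition of size 300
    s1 = rng.sample(part1, 20)
    s2 = rng.sample(part2, 20)
    print(merge_samples(s1, len(part1), s2, len(part2), 10, rng))
```

The hypergeometric draw is what keeps the merged sample uniform:
taking a fixed share from each input would over-represent the smaller
partition. Only the samples and the two partition sizes are needed, so
the merge never touches the full-scale data.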