Panda: A System for Provenance and Data


(Excerpted from December 2009 short overview paper)

In its most general form, provenance (sometimes also called lineage) captures where data came from, how it was derived, manipulated, and combined, and how it has been updated over time. Provenance can serve a number of important functions.

There has been a large body of very interesting work in lineage and provenance over the past two decades. Nevertheless, we believe there are still many limitations and open areas. Specifically:
  1. Most work has been either: data-based, in which fine-grained provenance of data elements is tracked based on well-defined, transparent properties of data models and query languages; or process-based, in which coarse-grained provenance is tracked, typically involving workflows and data at the schema level.

  2. Often the primary focus is on modeling and capturing provenance: How is provenance information represented? How is it generated? There has been considerably less work on querying provenance: What can we do with provenance information once we've captured it?

  3. Many projects have focused on specific functions or application domains, rather than developing a general provenance system that can be used for different purposes and across domains.

Our goal is to fill these gaps. Specifically, we want to:
  1. Seamlessly merge data-based and process-based provenance, so that the two types of provenance can be combined (e.g., workflows that combine "opaque" processing nodes with well-understood relational queries and transformations). We also want to develop a model and system that offers users a full range from fine-grained to coarse-grained provenance.

  2. Define a set of useful operators for taking advantage of provenance after it has been captured, as well as a general-purpose language for querying and analyzing provenance, and for combining provenance with relevant data.

  3. Develop a general-purpose open-source system that is flexible and configurable enough to be used for a wide variety of applications. The system will support its own mechanisms for provenance capture, storage, operators, and queries, while also offering interfaces for coupling with outside data sources, processes, and systems.
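
As a rough illustration of the first two goals, the sketch below records fine-grained provenance for a well-understood filter and coarse-grained provenance for an "opaque" processing node in the same store, then answers a simple backward-tracing query over both. It is not Panda's model or API: `ProvenanceStore`, `transparent_filter`, `opaque_node`, `trace_back`, and the sample data are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceStore:
    """Hypothetical store: maps each derived item ID to the IDs it came from."""
    parents: dict = field(default_factory=dict)

    def record(self, output_id, input_ids):
        self.parents.setdefault(output_id, set()).update(input_ids)

    def trace_back(self, item_id):
        """Return the source items (no recorded parents) that item_id derives from."""
        sources, seen, stack = set(), set(), [item_id]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            direct = self.parents.get(current)
            if direct:
                stack.extend(direct)
            else:
                sources.add(current)
        return sources


def transparent_filter(store, rows, predicate, step):
    """Well-understood transformation: each output row depends on one input row."""
    out = {}
    for row_id, value in rows.items():
        if predicate(value):
            out_id = f"{step}:{row_id}"
            out[out_id] = value
            store.record(out_id, [row_id])  # fine-grained link
    return out


def opaque_node(store, rows, func, step):
    """Opaque processing: func is a black box, so the output is
    conservatively linked to every input (coarse-grained)."""
    out_id = f"{step}:result"
    store.record(out_id, rows.keys())
    return {out_id: func(list(rows.values()))}


store = ProvenanceStore()
raw = {"r1": 5, "r2": 42, "r3": 17}
kept = transparent_filter(store, raw, lambda v: v > 10, "filter")
total = opaque_node(store, kept, sum, "agg")

# Backward tracing crosses both the opaque and the transparent step.
print(sorted(store.trace_back("agg:result")))
```

Because both kinds of step write into the same parent map, one query can cross them. The opaque node simply contributes less precise links than the filter does.
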

The Panda project is supported by the National Science Foundation (grant IIS-0904497), the Boeing Corporation, KAUST, and an Amazon Web Services Research Grant.

Last edited by J. Widom, July 2011