Tracing the Provenance and Flow of Data

In this talk, I shall describe an annotation management system that
can be used to "eagerly" trace the provenance (i.e., origins) or flow
of a piece of data. In this system, every piece of data is assumed to
have one or more annotations attached to it. As data is being
transformed, e.g., through a query, the relevant annotations are
automatically propagated along. This system also has potential
applications in other areas such as markup of data and quality
control.
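
The propagation idea can be illustrated with a minimal sketch in
Python. It is not the system described in the talk: the cell
representation, the helper names select and project, the sample
relation, and the rule that merged duplicates union their annotations
are all assumptions made for illustration.

```python
from typing import Callable

# Assumed representation: a cell is a (value, annotations) pair and a
# row maps column names to cells. Nothing here is taken from the talk.
Cell = tuple[object, frozenset[str]]
Row = dict[str, Cell]


def select(rows: list[Row], predicate: Callable[[Row], bool]) -> list[Row]:
    """Keep the rows that satisfy the predicate; cells keep their annotations."""
    return [row for row in rows if predicate(row)]


def project(rows: list[Row], columns: list[str]) -> list[Row]:
    """Copy the chosen columns, carrying each cell's annotations along.

    Rows that agree on all projected values are merged (set semantics),
    and the annotations of the merged cells are unioned (an assumption).
    """
    merged: dict[tuple, Row] = {}
    for row in rows:
        key = tuple(row[c][0] for c in columns)
        if key not in merged:
            merged[key] = {c: row[c] for c in columns}
        else:
            old = merged[key]
            merged[key] = {c: (old[c][0], old[c][1] | row[c][1]) for c in columns}
    return list(merged.values())


# Hypothetical data: two source rows mention "Ann", each with its own note.
employees: list[Row] = [
    {"name": ("Ann", frozenset({"from HR feed"})), "dept": ("Sales", frozenset())},
    {"name": ("Ann", frozenset({"checked by hand"})), "dept": ("Sales", frozenset({"renamed"}))},
    {"name": ("Bob", frozenset()), "dept": ("IT", frozenset({"unverified"}))},
]

# SELECT name FROM employees WHERE dept = 'Sales'
result = project(select(employees, lambda r: r["dept"][0] == "Sales"), ["name"])
```

In this sketch the single output cell "Ann" ends up carrying both
"from HR feed" and "checked by hand", so a reader of the result can
see which source cells it was copied from.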

We show that optimizing a query in such an annotation management
system can differ substantially from traditional query optimization:
two queries that a traditional query optimizer considers equivalent
may not be annotation-equivalent (i.e., may not generate the same
annotated result) in general. Despite this, we show that the
same annotated result is obtained whether intermediate constructs of a
query are evaluated with set or bag semantics. We also give a
necessary and sufficient condition, via homomorphisms, that checks
whether a query is annotation-contained in another. Even though our
characterization suggests that annotation-containment is more complex
than query containment, we show that the annotation-containment
problem is NP-complete, thus putting it in the same complexity class
as query containment. In addition, we show that the annotation
placement problem, which was previously shown to be NP-hard, is in
fact DP-hard; the exact complexity of this problem remains open.
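
To make the gap between classical equivalence and annotation
equivalence concrete, here is a small Python sketch. It assumes a
copy-based propagation rule in which an output cell inherits
annotations only from the column named in the SELECT clause; the
abstract does not spell out the rule, so this rule, the relations R
and S, and the helper names are hypothetical.

```python
# Assumed representation: a cell is a (value, annotations) pair.
Cell = tuple[object, frozenset[str]]
Row = dict[str, Cell]

# Hypothetical relations with one column A each and distinct annotations.
R: list[Row] = [{"A": (1, frozenset({"note on R"}))}]
S: list[Row] = [{"A": (1, frozenset({"note on S"}))}]


def join_select(left: list[Row], right: list[Row], source: str) -> list[Row]:
    """SELECT <source>.A FROM R, S WHERE R.A = S.A.

    Assumed rule: the output cell is copied from the relation named in
    the SELECT clause, so only that relation's annotations propagate.
    """
    out = []
    for r in left:
        for s in right:
            if r["A"][0] == s["A"][0]:
                picked = r if source == "R" else s
                out.append({"A": picked["A"]})
    return out


def values(rows: list[Row]) -> list[dict[str, object]]:
    """Strip annotations, leaving what a traditional optimizer compares."""
    return [{col: cell[0] for col, cell in row.items()} for row in rows]


q1 = join_select(R, S, "R")  # SELECT R.A FROM R, S WHERE R.A = S.A
q2 = join_select(R, S, "S")  # SELECT S.A FROM R, S WHERE R.A = S.A

same_values = values(q1) == values(q2)
same_annotated_result = q1 == q2
```

Because the join condition forces R.A = S.A, the two queries always
return the same values, yet under the assumed rule q1 carries the
annotation from R and q2 the one from S, so same_values holds while
same_annotated_result does not.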