DataMotion Report June 2007
Research
During our third year, we have continued our DataMotion research in the
four thrust areas. In what follows we describe our results.
** Finding Frequent Elements in Non-Bursty Streams
We developed an algorithm for finding frequent elements in a stream
where the arrivals are not bursty. The algorithm requires a modest
amount of memory, depending on the burstiness of the stream. A major
advantage of our algorithm is that even if the relative frequencies of
the different elements are fixed, the space complexity decreases with
the length of the stream when the stream is not bursty. To our
knowledge this is the only algorithm with this property.
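The algorithm itself is given in the paper below and is not reproduced
here. As a point of comparison, the general frequent-elements setting
can be illustrated with a classic small-memory summary, the Misra-Gries
sketch; the stream and threshold in this example are illustrative only.

```python
def misra_gries(stream, k):
    """Return candidate elements that may appear more than len(stream)/k
    times, using at most k-1 counters (a classic frequent-elements sketch)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement all counters and drop zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a"]
print(misra_gries(stream, 3))  # "a" survives as the heavy hitter
```

The sketch's memory bound depends only on the threshold k, whereas the
result above concerns space that shrinks with stream length when arrivals
are not bursty.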
Rina Panigrahy and Dilys Thomas. Finding Frequent Elements in
Non-bursty Streams. European Symposium on Algorithms (ESA), 2007.
Available at http://www.stanford.edu/~dilys/papers/FreqElem.pdf.
** Optimization of Continuous Queries with Shared Expensive Filters
We considered the problem of optimizing and executing multiple
continuous queries, where each query is a conjunction of filters and
each filter may occur in multiple queries. When filters are expensive,
significant performance gains are achieved by sharing filter
evaluations across queries. A shared execution strategy in our
scenario can either be fixed, in which filters are evaluated in the
same predetermined order for all input, or adaptive, in which case the
next filter to be evaluated is chosen at runtime based on the results
of the filters evaluated so far. Our research has shown that as filter
costs increase, the best adaptive strategy is superior to any fixed
strategy, despite the overhead of adaptivity. We have also shown that
it is NP-hard to find the optimal adaptive strategy, even if we are
willing to approximate within any factor smaller than logarithmic in
the number of queries. We developed a greedy execution strategy and
showed that it approximates the best adaptive strategy to within a
factor polylogarithmic in the number of queries and filters. We have
also shown how the execution overhead of adaptive strategies can be
reduced by appropriate precomputation. We have carried out an
experimental evaluation that demonstrates the effectiveness of our
techniques.
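The paper's greedy strategy and its guarantee are not reproduced here,
but the general shape of adaptive shared execution can be sketched: at
each step, evaluate the pending filter that touches the most undecided
queries per unit cost, and share its result. The queries, filters, and
costs below are hypothetical.

```python
def evaluate_adaptively(tup, queries, costs, filters):
    """Evaluate shared filters adaptively for one input tuple.

    queries: dict query name -> set of filter names (a conjunction)
    costs:   dict filter name -> evaluation cost
    filters: dict filter name -> predicate over the tuple
    Returns (set of satisfied queries, total cost spent).
    """
    undecided = {q: set(fs) for q, fs in queries.items()}
    results = {}                  # filter -> bool, shared across queries
    satisfied, spent = set(), 0.0
    while undecided:
        pending = {f for fs in undecided.values() for f in fs}
        # Benefit = number of undecided queries containing the filter,
        # per unit cost; pick the best at runtime (the adaptive step).
        best = max(pending,
                   key=lambda f: sum(f in fs for fs in undecided.values())
                                 / costs[f])
        results[best] = filters[best](tup)
        spent += costs[best]
        for q in list(undecided):
            if not results[best] and best in undecided[q]:
                del undecided[q]        # one failed filter kills the conjunction
            elif results[best]:
                undecided[q].discard(best)
                if not undecided[q]:
                    satisfied.add(q)    # all of q's filters have passed
                    del undecided[q]
    return satisfied, spent

queries = {"q1": {"f1", "f2"}, "q2": {"f1", "f3"}}
costs = {"f1": 1.0, "f2": 5.0, "f3": 5.0}
filters = {"f1": lambda x: x > 0,
           "f2": lambda x: x % 2 == 0,
           "f3": lambda x: x < 10}
# f1 is cheap and shared by both queries, so it is evaluated first.
print(evaluate_adaptively(4, queries, costs, filters))
```

When the shared cheap filter fails, both conjunctions are decided at
once and the expensive filters are never paid for, which is the source
of the savings discussed above.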
K. Munagala, U. Srivastava, and J. Widom. Optimization of Continuous
Queries with Shared Expensive Filters. To appear in Proceedings of the
26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database
Systems, Beijing, China, June 2007.
Available at http://dbpubs.stanford.edu/pub/2005-36.
** Beyond Just Data Privacy
In our project we have studied how to enforce security and privacy of
data streams. In doing so, we have found that designing a system that
"guarantees" the privacy of its information may not be enough. One
must also consider the price of providing that protection: For
example, is the information preserved adequately? Does the system
perform well? Thus, we have started to explore how one can design
systems that strike a good balance among security, privacy,
performance, and data preservation. In the paper cited below we
explore how this balance may be achieved.
Mungamuru, B. and Garcia-Molina, H. Beyond Just Data Privacy. 3rd
Biennial Conference on Innovative Data Systems Research (CIDR),
Pacific Grove, CA, USA, January 2007.
Available at http://dbpubs.stanford.edu/pub/2006-18.
** Configurations -- A Model For Distributed Data Storage
We have continued our work on "configurations," which facilitate the
description and analysis of secure data systems. Both the longevity
and privacy of sensitive data are considered by configurations. The
model uses two basic operators: copy, which replicates data for
longevity, and split, which decomposes data (e.g., into ciphertext and
a key) for privacy. The operators can be recursively composed to
describe how data and their associated "keys" are managed. We have
defined various classes of configurations, where each class has
desirable properties with respect to physical realizability and
semantic correctness. We have developed formal techniques to verify
these properties for a given configuration.
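As an illustration of the two operators (not the formal model from the
paper), a configuration can be rendered as a small expression tree over
storage sites; the site names here are hypothetical.

```python
def copy(a, b):
    # copy: either replica suffices to recover the data (longevity).
    return ("copy", a, b)

def split(a, b):
    # split: both pieces are needed to recover the data (privacy).
    return ("split", a, b)

def recoverable(config, alive):
    """Can the data be reconstructed from the surviving sites?"""
    if isinstance(config, str):          # a leaf is a storage site
        return config in alive
    op, a, b = config
    if op == "copy":
        return recoverable(a, alive) or recoverable(b, alive)
    return recoverable(a, alive) and recoverable(b, alive)

# Ciphertext replicated on siteA and siteB; the key lives on siteC.
config = split(copy("siteA", "siteB"), "siteC")
print(recoverable(config, {"siteA", "siteC"}))  # True: one replica + key
print(recoverable(config, {"siteA", "siteB"}))  # False: the key is lost
```

The same check doubles as a privacy test: a coalition of sites
compromises the data exactly when it is recoverable from their combined
holdings.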
Mungamuru, B., Garcia-Molina, H. and Olston, C. Brief Announcement:
Configurations -- A Model For Distributed Data Storage. 26th ACM
Symposium on Principles of Distributed Computing, Portland, OR, USA,
August 2007.
Available at http://dbpubs.stanford.edu/pub/2005-41.
** Collusion and Data Privacy
In certain situations where a firm or organization has an interest in
privacy, external parties can be in a position to collude with each
other in a manner that violates the firm's privacy. These external
parties often have a monetary incentive to collude against the firm.
We have studied this threat of collusion within the context of a
simple economic model. Although a simple reward scheme can be used to
deter collusion, a deterrent based on a long-term business
relationship is more effective. We have developed an optimal
collusion-resistant mechanism with which a firm can filter out parties
with low discount factors.
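The mechanism itself is in the report below; the intuition behind
filtering on discount factors can be sketched numerically. A party
weighs a one-time collusion payoff against the discounted value of an
ongoing business relationship. All payoffs and parameters below are
hypothetical, not taken from the paper.

```python
def colludes(one_shot_gain, per_period_payment, discount_factor):
    """A party defects when the one-time collusion gain exceeds the
    present value of the ongoing relationship: p * d / (1 - d)."""
    future_value = per_period_payment * discount_factor / (1 - discount_factor)
    return one_shot_gain > future_value

# A patient party (d = 0.95) values the relationship at 19 per-period
# payments and is deterred; an impatient one (d = 0.5) values it at
# only 1 payment and colludes.
print(colludes(10.0, 1.0, 0.95))  # False
print(colludes(10.0, 1.0, 0.5))   # True
```

This is why screening out low-discount-factor parties matters: the
long-term deterrent only binds on parties patient enough to value the
future relationship.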
Mungamuru, B., Padilla, M. and Garcia-Molina, H. Collusion and Data
Privacy. Technical Report.
Available at http://dbpubs.stanford.edu/pub/2007-21.
** Distributing Data For Secure Database Services
The advent of database services has resulted in privacy concerns on
the part of the client storing data with third-party database service
providers. Previous approaches to enabling such a service have been
based on data encryption, causing a large overhead in query
processing. Instead, we have proposed a distributed architecture for
secure database services as a solution to this problem, where data is
stored at multiple sites. The distributed architecture provides both
privacy and fault tolerance to the client. We have developed
algorithms for (1) distributing data: our results include
hardness-of-approximation results and hence a heuristic greedy
hill-climbing algorithm for the distribution problem, and
(2) partitioning the query at the client into queries for the various
sites, which is done by a bottom-up state-based algorithm. Finally,
the results from the sites are integrated at the client to obtain the
answer. We have provided an experimental validation and performance
study of our algorithms.
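The distribution and query-partitioning algorithms are in the report
below. The basic idea of answering a query over vertically distributed
data can be sketched as follows; the schema, the sensitive attribute
pairing, and the data are all illustrative.

```python
# Attributes are partitioned across sites so that no single site holds
# a sensitive combination (here, name together with disease). The
# client recombines partial results on a shared tuple id.
rows = [
    {"id": 1, "name": "alice", "age": 34, "disease": "flu"},
    {"id": 2, "name": "bob",   "age": 51, "disease": "asthma"},
]

site1 = [{"id": r["id"], "name": r["name"], "age": r["age"]} for r in rows]
site2 = [{"id": r["id"], "disease": r["disease"]} for r in rows]

def query_names_with(disease):
    """Client-side recombination: filter at site2, fetch names from site1."""
    ids = {r["id"] for r in site2 if r["disease"] == disease}
    return sorted(r["name"] for r in site1 if r["id"] in ids)

print(query_names_with("flu"))  # ['alice']
```

Each site answers its sub-query over its own fragment; only the client,
which sees both fragments, can link names to diseases, and no encryption
is needed during query processing.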
Vignesh Ganapathy, Dilys Thomas, Tomas Feder, Hector Garcia-Molina,
and Rajeev Motwani. Distributing Data For Secure Database Services.
Technical Report.
Available at http://dbpubs.stanford.edu/pub/2007-23.
PhD Theses
This past year, two PhD theses were completed by students with partial
DataMotion support:
** Dilys Thomas
Thesis Title:
Algorithms and Architectures for Data Privacy
Abstract:
The explosive progress in networking, storage, and processor
technologies has resulted in an unprecedented volume of digital data.
In concert with this escalating increase in digital data, concerns
about the privacy of personal information have emerged globally. The
ease with which data can be collected automatically, stored in
databases, and queried efficiently over the Internet has worsened the
privacy situation and has raised numerous ethical and legal concerns.
Privacy enforcement today is handled primarily through legislation. We
aim to provide technological solutions to achieve a tradeoff between
data privacy and data utility. We focus on three problems in the area
of database privacy in this thesis.
The first problem is that of data sanitization before publication.
Publishing health and financial information for research purposes
requires that the data be anonymized so that the privacy of
individuals in the database is protected. This anonymized information
can be used as is, or can be combined with another (anonymized)
dataset that shares columns or rows with the original anonymized
dataset. We explore both these scenarios in this thesis. Another
reason for sanitization is to give the data to an outsourced software
developer for testing software applications without the outsourced
developer learning information about its client. We briefly describe
such a tool in this thesis.
The second part of the thesis addresses auditing query logs for
privacy. Given certain forbidden views of a database that must be kept
confidential, a batch of SQL queries that were posed over this
database, and a definition of suspiciousness, we study the problem of
determining whether the batch of queries is suspicious with respect to
the forbidden views.
The third part of the thesis deals with distributed architectures for
data privacy. The advent of databases as an outsourced service has
resulted in privacy concerns on the part of the client storing data
with third-party database service providers. Previous approaches to
enabling such a service have been based on data encryption, causing a
large overhead in query processing. In this thesis we provide a
distributed architecture for secure database services. We develop
algorithms for distributing data and executing queries over this
distributed data.
** Krishnaram Kenthapadi
Thesis Title: Models and Algorithms for Data Privacy
Abstract:
Over the last twenty years, there has been a tremendous growth in the
amount of private data collected about individuals. With the rapid
growth in database, networking, and computing technologies, such data
can be integrated and analyzed digitally. On the one hand, this has
led to the development of data mining tools that aim to infer useful
trends from this data. On the other hand, easy access to personal data
poses a threat to individual privacy. In this thesis, we provide
models and algorithms for protecting the privacy of individuals in
such large data sets while still allowing users to mine useful trends
and statistics.
We focus on the problem of statistical disclosure control: revealing
aggregate statistics about a population while preserving the privacy
of individuals. A statistical database can be viewed as a table
containing personal records, where the rows correspond to individuals
and the columns correspond to different attributes. For example, a
medical database may contain attributes such as name, social security
number, address, age, gender, ethnicity, and medical history for each
patient. We would like medical researchers to have some form of access
to this database so as to learn trends, such as the correlation
between age and heart disease, while maintaining individual privacy.
There are broadly two frameworks for protecting privacy in statistical
databases. In the interactive framework, the user (researcher) queries
the database through a privacy mechanism, which may deny the query or
alter the answer to the query in order to ensure privacy. In the
non-interactive framework, the original database is first sanitized so
as to preserve privacy and then the modified version is released. We
study methods under both these frameworks, as each is useful in
different contexts.
The first part of the thesis focuses on the interactive framework and
provides models and algorithms for two methods used in this framework.
We first consider the online query auditing problem: given a sequence
of queries that have already been posed about the data, their
corresponding answers, and a new query, deny the answer if privacy can
be breached, or give the true answer otherwise. We uncover the
fundamental problem that query denials themselves leak information.
Because this problem was overlooked in previous work, some of the
previously suggested auditors can be used by an attacker to compromise
the privacy of a large fraction of the individuals in the data.
To overcome this problem, we introduce a new model called simulatable
auditing, in which query denials provably do not leak information. We
also describe a probabilistic notion of (partial) compromise, in order
to overcome the known limitations of the existing privacy definition.
We then present simulatable auditing algorithms under both these
definitions. The second problem we consider is output perturbation, in
which the database administrator computes the exact answer to the
query and then outputs a perturbed answer (by adding random noise) as
the response to the query. Inspired by the desire to enable
individuals to retain control over their information, we provide a
fault-tolerant distributed implementation of output perturbation
schemes, thereby eliminating the need for a trusted database
administrator. In the process, we provide protocols for the
cooperative generation of shares of random noise according to
different distributions.
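The distributed share-generation protocols from the thesis are not
reproduced here, but output perturbation itself is simple to sketch:
the exact aggregate is computed and noise is added before release. The
noise scale and data below are illustrative.

```python
import random

def perturbed_sum(values, scale, rng):
    """Compute the exact sum, then release it with additive noise.
    Laplace(0, scale) noise is sampled as the difference of two
    independent exponentials with mean `scale`."""
    exact = sum(values)
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return exact + noise

ages = [34, 51, 29, 47]
# The released answer is close to, but not exactly, the true sum (161).
print(perturbed_sum(ages, scale=1.0, rng=random.Random(0)))
```

With a small noise scale, aggregate statistics remain useful while any
single individual's contribution is masked by the perturbation.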
The second part of the thesis focuses on the non-interactive framework
and considers two anonymization methods for publishing data for
analysis from a table containing personal records. We consider the
k-Anonymity model proposed by Samarati and Sweeney, and present
approximation algorithms for anonymizing databases. We then propose a
new method for anonymizing data records, in which the data records are
clustered and the cluster centers are published. To ensure the privacy
of the data records, we impose the constraint that each cluster must
contain no fewer than a pre-specified number of data records. We
provide approximation algorithms for computing such a clustering.
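The thesis gives approximation algorithms with provable guarantees; the
cluster-centers idea itself can be sketched with a naive greedy
grouping over one-dimensional data. The values and minimum cluster
size below are illustrative, and the code assumes at least r points.

```python
def publish_centroids(points, r):
    """Greedily group sorted 1-D points into clusters of size >= r and
    publish only the cluster centroids (illustration only; not the
    thesis's approximation algorithm)."""
    pts = sorted(points)
    # Full groups of r, in sorted order.
    clusters = [pts[i:i + r] for i in range(0, len(pts) - len(pts) % r, r)]
    if len(pts) % r:
        # Fold any remainder into the last cluster so every cluster
        # still has at least r members.
        clusters[-1].extend(pts[-(len(pts) % r):])
    return [sum(c) / len(c) for c in clusters]

ages = [21, 22, 23, 40, 41, 42, 60]
print(publish_centroids(ages, r=3))  # [22.0, 45.75]
```

Because each published centroid aggregates at least r records, no
released value can be traced back to fewer than r individuals, which is
the privacy constraint stated above.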