DataMotion Report June 2007
Research
During our third year, we have continued our DataMotion research in the
four thrust areas. In what follows we describe our results.
** Finding Frequent Elements in Non-Bursty Streams
We developed an algorithm for finding frequent elements in a stream
where the arrivals are not bursty. The algorithm requires a modest
amount of memory, depending on the burstiness of the stream. A major
advantage of our algorithm is that even if the relative frequencies of
the different elements are fixed, the space complexity decreases with
the length of the stream when the stream is not bursty. To our
knowledge this is the only algorithm with this property.
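The algorithm itself is given in the paper below and is not reproduced
here. As a point of comparison, the general frequent-elements setting
can be illustrated with a classic small-memory summary, the Misra-Gries
sketch; the stream and threshold in this example are illustrative only.

```python
def misra_gries(stream, k):
    """Return candidate elements that may appear more than len(stream)/k
    times, using at most k-1 counters (a classic frequent-elements sketch)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement all counters and drop zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a"]
print(misra_gries(stream, 3))  # "a" survives as the heavy hitter
```

The sketch's memory bound depends only on the threshold k, whereas the
result above concerns space that shrinks with stream length when arrivals
are not bursty.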
Rina Panigrahy and Dilys Thomas. Finding Frequent Elements in
Non-bursty Streams. European Symposium on Algorithms (ESA), 2007.
Available at http://www.stanford.edu/~dilys/papers/FreqElem.pdf.
** Optimization of Continuous Queries with Shared Expensive Filters
We considered the problem of optimizing and executing multiple
continuous queries, where each query is a conjunction of filters and
each filter may occur in multiple queries. When filters are expensive,
significant performance gains are achieved by sharing filter
evaluations across queries. A shared execution strategy in our
scenario can either be fixed, in which filters are evaluated in the
same predetermined order for all input, or adaptive, in which case the
next filter to be evaluated is chosen at runtime based on the results
of the filters evaluated so far. Our research has shown that as filter
costs increase, the best adaptive strategy is superior to any fixed
strategy, despite the overhead of adaptivity. We have also shown that
it is NP-hard to find the optimal adaptive strategy, even if we are
willing to approximate within any factor smaller than logarithmic in
the number of queries. We developed a greedy execution strategy and
showed that it approximates the best adaptive strategy to within a
factor polylogarithmic in the number of queries and filters. We have
also shown how the execution overhead of adaptive strategies can be
reduced by appropriate precomputation. We have carried out an
experimental evaluation that demonstrates the effectiveness of our
techniques.
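The paper's greedy strategy and its guarantee are not reproduced here,
but the general shape of adaptive shared execution can be sketched: at
each step, evaluate the pending filter that touches the most undecided
queries per unit cost, and share its result. The queries, filters, and
costs below are hypothetical.

```python
def evaluate_adaptively(tup, queries, costs, filters):
    """Evaluate shared filters adaptively for one input tuple.

    queries: dict query name -> set of filter names (a conjunction)
    costs:   dict filter name -> evaluation cost
    filters: dict filter name -> predicate over the tuple
    Returns (set of satisfied queries, total cost spent).
    """
    undecided = {q: set(fs) for q, fs in queries.items()}
    results = {}                  # filter -> bool, shared across queries
    satisfied, spent = set(), 0.0
    while undecided:
        pending = {f for fs in undecided.values() for f in fs}
        # Benefit = number of undecided queries containing the filter,
        # per unit cost; pick the best at runtime (the adaptive step).
        best = max(pending,
                   key=lambda f: sum(f in fs for fs in undecided.values())
                                 / costs[f])
        results[best] = filters[best](tup)
        spent += costs[best]
        for q in list(undecided):
            if not results[best] and best in undecided[q]:
                del undecided[q]        # one failed filter kills the conjunction
            elif results[best]:
                undecided[q].discard(best)
                if not undecided[q]:
                    satisfied.add(q)    # all of q's filters have passed
                    del undecided[q]
    return satisfied, spent

queries = {"q1": {"f1", "f2"}, "q2": {"f1", "f3"}}
costs = {"f1": 1.0, "f2": 5.0, "f3": 5.0}
filters = {"f1": lambda x: x > 0,
           "f2": lambda x: x % 2 == 0,
           "f3": lambda x: x < 10}
# f1 is cheap and shared by both queries, so it is evaluated first.
print(evaluate_adaptively(4, queries, costs, filters))
```

When the shared cheap filter fails, both conjunctions are decided at
once and the expensive filters are never paid for, which is the source
of the savings discussed above.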
K. Munagala, U. Srivastava, and J. Widom. Optimization of Continuous
Queries with Shared Expensive Filters. To appear in Proceedings of the
26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database
Systems, Beijing, China, June 2007.
Available at http://dbpubs.stanford.edu/pub/2005-36.
** Beyond Just Data Privacy
In our project we have studied how to enforce security and privacy of
data streams. In doing so, we have found that designing a system that
"guarantees" the privacy of its information may not be enough. One
must also consider the price of providing that protection: For
example, is the information preserved adequately? Does the system
perform well? Thus, we have started to explore how one can design
systems that strike a good balance among security, privacy,
performance, and data preservation. In the paper cited below we
explore how this balance may be achieved.
Mungamuru, B. and Garcia-Molina, H. Beyond Just Data Privacy. 3rd
Biennial Conference on Innovative Data Systems Research (CIDR),
Pacific Grove, CA, USA, January 2007.
Available at http://dbpubs.stanford.edu/pub/2006-18.
** Configurations -- A Model For Distributed Data Storage
We have continued our work on "configurations," which facilitate the
description and analysis of secure data systems. Both the longevity
and privacy of sensitive data are considered by configurations. The
model uses two basic operators: copy, which replicates data for
longevity, and split, which decomposes data (e.g., into ciphertext and
a key) for privacy. The operators can be recursively composed to
describe how data and their associated "keys" are managed. We have
defined various classes of configurations, where each class has
desirable properties with respect to physical realizability and
semantic correctness. We have developed formal techniques to verify
these properties for a given configuration.
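As an illustration of the two operators (not the formal model from the
paper), a configuration can be rendered as a small expression tree over
storage sites; the site names here are hypothetical.

```python
def copy(a, b):
    # copy: either replica suffices to recover the data (longevity).
    return ("copy", a, b)

def split(a, b):
    # split: both pieces are needed to recover the data (privacy).
    return ("split", a, b)

def recoverable(config, alive):
    """Can the data be reconstructed from the surviving sites?"""
    if isinstance(config, str):          # a leaf is a storage site
        return config in alive
    op, a, b = config
    if op == "copy":
        return recoverable(a, alive) or recoverable(b, alive)
    return recoverable(a, alive) and recoverable(b, alive)

# Ciphertext replicated on siteA and siteB; the key lives on siteC.
config = split(copy("siteA", "siteB"), "siteC")
print(recoverable(config, {"siteA", "siteC"}))  # True: one replica + key
print(recoverable(config, {"siteA", "siteB"}))  # False: the key is lost
```

The same check doubles as a privacy test: a coalition of sites
compromises the data exactly when it is recoverable from their combined
holdings.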
Mungamuru, B., Garcia-Molina, H. and Olston, C. Brief Announcement:
Configurations -- A Model For Distributed Data Storage. 26th ACM
Symposium on Principles of Distributed Computing, Portland, OR, USA,
August 2007.
Available at http://dbpubs.stanford.edu/pub/2005-41.
** Collusion and Data Privacy
In certain situations where a firm or organization has an interest in
privacy, external parties can be in a position to collude with each
other in a manner that violates the firm's privacy. These external
parties often have a monetary incentive to collude against the firm.
We have studied this threat of collusion within the context of a
simple economic model. Although a simple reward scheme can be used to
deter collusion, a deterrent based on a long-term business
relationship is more effective. We have developed an optimal
collusion-resistant mechanism with which a firm can filter out parties
with low discount factors.
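The mechanism itself is in the report below; the intuition behind
filtering on discount factors can be sketched numerically. A party
weighs a one-time collusion payoff against the discounted value of an
ongoing business relationship. All payoffs and parameters below are
hypothetical, not taken from the paper.

```python
def colludes(one_shot_gain, per_period_payment, discount_factor):
    """A party defects when the one-time collusion gain exceeds the
    present value of the ongoing relationship: p * d / (1 - d)."""
    future_value = per_period_payment * discount_factor / (1 - discount_factor)
    return one_shot_gain > future_value

# A patient party (d = 0.95) values the relationship at 19 per-period
# payments and is deterred; an impatient one (d = 0.5) values it at
# only 1 payment and colludes.
print(colludes(10.0, 1.0, 0.95))  # False
print(colludes(10.0, 1.0, 0.5))   # True
```

This is why screening out low-discount-factor parties matters: the
long-term deterrent only binds on parties patient enough to value the
future relationship.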
Mungamuru, B., Padilla, M. and Garcia-Molina, H. Collusion and Data
Privacy. Technical Report.
Available at http://dbpubs.stanford.edu/pub/2007-21.
** Distributing Data For Secure Database Services
The advent of database services has resulted in privacy concerns on
the part of the client storing data with third-party database service
providers. Previous approaches to enabling such a service have been
based on data encryption, causing a large overhead in query
processing. Instead, we have proposed a distributed architecture for
secure database services as a solution to this problem, where data is
stored at multiple sites. The distributed architecture provides both
privacy and fault tolerance to the client. We have developed
algorithms for (1) distributing data: our results include
hardness-of-approximation results and hence a heuristic greedy
hill-climbing algorithm for the distribution problem, and
(2) partitioning the query at the client into queries for the various
sites, which is done by a bottom-up state-based algorithm. Finally,
the results from the sites are integrated at the client to obtain the
answer. We have provided an experimental validation and performance
study of our algorithms.
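The distribution and query-partitioning algorithms are in the report
below. The basic idea of answering a query over vertically distributed
data can be sketched as follows; the schema, the sensitive attribute
pairing, and the data are all illustrative.

```python
# Attributes are partitioned across sites so that no single site holds
# a sensitive combination (here, name together with disease). The
# client recombines partial results on a shared tuple id.
rows = [
    {"id": 1, "name": "alice", "age": 34, "disease": "flu"},
    {"id": 2, "name": "bob",   "age": 51, "disease": "asthma"},
]

site1 = [{"id": r["id"], "name": r["name"], "age": r["age"]} for r in rows]
site2 = [{"id": r["id"], "disease": r["disease"]} for r in rows]

def query_names_with(disease):
    """Client-side recombination: filter at site2, fetch names from site1."""
    ids = {r["id"] for r in site2 if r["disease"] == disease}
    return sorted(r["name"] for r in site1 if r["id"] in ids)

print(query_names_with("flu"))  # ['alice']
```

Each site answers its sub-query over its own fragment; only the client,
which sees both fragments, can link names to diseases, and no encryption
is needed during query processing.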
Vignesh Ganapathy, Dilys Thomas, Tomas Feder, Hector Garcia-Molina,
and Rajeev Motwani. Distributing Data For Secure Database Services.
Technical Report.
Available at http://dbpubs.stanford.edu/pub/2007-23.
PhD Theses
This past year, two PhD theses were completed by students with partial
DataMotion support:
** Dilys Thomas
Thesis Title:
Algorithms and Architectures for Data Privacy
Abstract:
The explosive progress in networking, storage, and processor
technologies has resulted in an unprecedented volume of digital data.
In concert with this escalating increase in digital data, concerns
about the privacy of personal information have emerged globally. The
ease with which data can be collected automatically, stored in
databases, and queried efficiently over the Internet has worsened the
privacy situation and has raised numerous ethical and legal concerns.
Privacy enforcement today is handled primarily through legislation. We
aim to provide technological solutions to achieve a tradeoff between
data privacy and data utility. We focus on three problems in the area
of database privacy in this thesis.
The first problem is that of data sanitization before publication.
Publishing health and financial information for research purposes
requires that the data be anonymized so that the privacy of
individuals in the database is protected. This anonymized information
can be used as is, or can be combined with another (anonymized)
dataset that shares columns or rows with the original anonymized
dataset. We explore both these scenarios in this thesis. Another
reason for sanitization is to give the data to an outsourced software
developer for testing software applications without the outsourced
developer learning information about its client. We briefly describe
such a tool in this thesis.
The second part of the thesis addresses auditing query logs for
privacy. Given certain forbidden views of a database that must be kept
confidential, a batch of SQL queries that were posed over this
database, and a definition of suspiciousness, we study the problem of
determining whether the batch of queries is suspicious with respect to
the forbidden views.
The third part of the thesis deals with distributed architectures for
data privacy. The advent of databases as an outsourced service has
resulted in privacy concerns on the part of the client storing data
with third-party database service providers. Previous approaches to
enabling such a service have been based on data encryption, causing a
large overhead in query processing. In this thesis we provide a
distributed architecture for secure database services. We develop
algorithms for distributing data and executing queries over this
distributed data.
** Krishnaram Kenthapadi
Thesis Title: Models and Algorithms for Data Privacy
Abstract:
Over the last twenty years, there has been a tremendous growth in the
amount of private data collected about individuals. With the rapid
growth in database, networking, and computing technologies, such data
can be integrated and analyzed digitally. On the one hand, this has
led to the development of data mining tools that aim to infer useful
trends from this data. On the other hand, easy access to personal data
poses a threat to individual privacy. In this thesis, we provide
models and algorithms for protecting the privacy of individuals in
such large data sets while still allowing users to mine useful trends
and statistics.
We focus on the problem of statistical disclosure control: revealing
aggregate statistics about a population while preserving the privacy
of individuals. A statistical database can be viewed as a table
containing personal records, where the rows correspond to individuals
and the columns correspond to different attributes. For example, a
medical database may contain attributes such as name, social security
number, address, age, gender, ethnicity, and medical history for each
patient. We would like medical researchers to have some form of access
to this database so as to learn trends, such as the correlation
between age and heart disease, while maintaining individual privacy.
There are broadly two frameworks for protecting privacy in statistical
databases. In the interactive framework, the user (researcher) queries
the database through a privacy mechanism, which may deny the query or
alter the answer to the query in order to ensure privacy. In the
non-interactive framework, the original database is first sanitized so
as to preserve privacy and then the modified version is released. We
study methods under both these frameworks, as each is useful in
different contexts.
The first part of the thesis focuses on the interactive framework and
provides models and algorithms for two methods used in this framework.
We first consider the online query auditing problem: given a sequence
of queries that have already been posed about the data, their
corresponding answers, and a new query, deny the answer if privacy can
be breached, or give the true answer otherwise. We uncover the
fundamental problem that query denials themselves leak information.
Because this problem was overlooked in previous work, some of the
previously suggested auditors can be used by an attacker to compromise
the privacy of a large fraction of the individuals in the data.
To overcome this problem, we introduce a new model called simulatable
auditing, in which query denials provably do not leak information. We
also describe a probabilistic notion of (partial) compromise, in order
to overcome the known limitations of the existing privacy definition.
We then present simulatable auditing algorithms under both these
definitions. The second problem we consider is output perturbation, in
which the database administrator computes the exact answer to the
query and then outputs a perturbed answer (by adding random noise) as
the response to the query. Inspired by the desire to enable
individuals to retain control over their information, we provide a
fault-tolerant distributed implementation of output perturbation
schemes, thereby eliminating the need for a trusted database
administrator. In the process, we provide protocols for the
cooperative generation of shares of random noise according to
different distributions.
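The distributed share-generation protocols from the thesis are not
reproduced here, but output perturbation itself is simple to sketch:
the exact aggregate is computed and noise is added before release. The
noise scale and data below are illustrative.

```python
import random

def perturbed_sum(values, scale, rng):
    """Compute the exact sum, then release it with additive noise.
    Laplace(0, scale) noise is sampled as the difference of two
    independent exponentials with mean `scale`."""
    exact = sum(values)
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return exact + noise

ages = [34, 51, 29, 47]
# The released answer is close to, but not exactly, the true sum (161).
print(perturbed_sum(ages, scale=1.0, rng=random.Random(0)))
```

With a small noise scale, aggregate statistics remain useful while any
single individual's contribution is masked by the perturbation.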
The second part of the thesis focuses on the non-interactive framework
and considers two anonymization methods for publishing data for
analysis from a table containing personal records. We consider the
k-Anonymity model proposed by Samarati and Sweeney, and present
approximation algorithms for anonymizing databases. We then propose a
new method for anonymizing data records, in which the data records are
clustered and the cluster centers are published. To ensure the privacy
of the data records, we impose the constraint that each cluster must
contain no fewer than a pre-specified number of data records. We
provide approximation algorithms for computing such a clustering.
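The thesis gives approximation algorithms with provable guarantees; the
cluster-centers idea itself can be sketched with a naive greedy
grouping over one-dimensional data. The values and minimum cluster
size below are illustrative, and the code assumes at least r points.

```python
def publish_centroids(points, r):
    """Greedily group sorted 1-D points into clusters of size >= r and
    publish only the cluster centroids (illustration only; not the
    thesis's approximation algorithm)."""
    pts = sorted(points)
    # Full groups of r, in sorted order.
    clusters = [pts[i:i + r] for i in range(0, len(pts) - len(pts) % r, r)]
    if len(pts) % r:
        # Fold any remainder into the last cluster so every cluster
        # still has at least r members.
        clusters[-1].extend(pts[-(len(pts) % r):])
    return [sum(c) / len(c) for c in clusters]

ages = [21, 22, 23, 40, 41, 42, 60]
print(publish_centroids(ages, r=3))  # [22.0, 45.75]
```

Because each published centroid aggregates at least r records, no
released value can be traced back to fewer than r individuals, which is
the privacy constraint stated above.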