Research

A list of recent publications by topic. (Some publications span multiple topics; I list them under only one.)

Clicking on a paper title toggles a detailed view of the paper that includes the abstract, links to different versions of the paper, and powerpoint slides, if any.

Peer-to-Peer Systems [ Show Details | Hide Details ]

Conference Papers

On Cooperative Content Distribution and the Price of Barter, Prasanna Ganesan and Mukund Seshadri, To appear in Proc. ICDCS 2005.

LSH Forest: Self-Tuning Indexes for Similarity Search, Mayank Bawa, Tyson Condie and Prasanna Ganesan, To appear in Proc. WWW 2005.

Adlib: A Self-Tuning Index for Dynamic Peer-to-Peer Systems, Prasanna Ganesan, Qixiang Sun and Hector Garcia-Molina, To appear in Proc. ICDE, 2005. (Short paper)

Online Balancing of Range-Partitioned Data with Applications to Peer-to-Peer Systems, Prasanna Ganesan, Mayank Bawa and Hector Garcia-Molina, Proc. VLDB 2004.

Canon in G Major: Designing DHTs with Hierarchical Structure, Prasanna Ganesan, Krishna Gummadi and Hector Garcia-Molina, Proc. ICDCS 2004.

Optimal Routing in Chord, Prasanna Ganesan and Gurmeet Singh Manku, Proc. SODA 2004.

YAPPERS: A Peer-to-Peer Lookup Service over Arbitrary Topology, Prasanna Ganesan, Qixiang Sun and Hector Garcia-Molina, Proc. INFOCOM 2003.

Workshop Papers

One Torus to Rule Them All: Multi-dimensional Queries in P2P Systems, Prasanna Ganesan, Beverly Yang and Hector Garcia-Molina, Proc. WebDB 2004.

DHT Routing using Social Links, Sergio Marti, Prasanna Ganesan and Hector Garcia-Molina, Proc. IPTPS 2004. (Also in P2P&DB 2004)

Invited Papers

Peer-to-peer research at Stanford, Mayank Bawa, Brian F. Cooper, Arturo Crespo, Neil Daswani, Prasanna Ganesan, Hector Garcia-Molina, Sepandar Kamvar, Sergio Marti, Mario Schlosser, Qi Sun, Patrick Vinograd and Beverly Yang, SIGMOD Record, September 2003.

In this paper, we present recent and ongoing research projects of the Peers group at Stanford University.
( Paper )

Technical Reports

A Distributed Ordered Dictionary with (1+e) Load Balance, Brian Babcock and Prasanna Ganesan, Stanford Technical Report, 2004.

Apocrypha: Making P2P Overlays Network-aware, Prasanna Ganesan, Qixiang Sun and Hector Garcia-Molina, Stanford Technical Report, 2004.

Distributed Balanced Tables: Not Making a Hash of it All, Prasanna Ganesan and Mayank Bawa, Stanford Technical Report, 2003.

A Case for Locally Organized Lookup Services, Prasanna Ganesan, Qixiang Sun and Hector Garcia-Molina, Stanford Technical Report, 2002.

Database Systems [ Show Details | Hide Details ]

Primitives for Workload Summarization and Implications for SQL, Surajit Chaudhuri, Prasanna Ganesan and Vivek Narasayya, Proc. VLDB 2003.

Factorizing Complex Predicates in Queries to Exploit Indexes, Surajit Chaudhuri, Prasanna Ganesan and Sunita Sarawagi, Proc. SIGMOD 2003.

Data Privacy [ Show Details | Hide Details ]

Vision Paper: Enabling Privacy for the Paranoids, Gagan Aggarwal, Mayank Bawa, Prasanna Ganesan, Hector Garcia-Molina, Krishnaram Kenthapadi, Nina Mishra, Rajeev Motwani, Utkarsh Srivastava, Dilys Thomas, Jennifer Widom, Proc. VLDB 2004.

Two Can Keep a Secret: A Distributed Architecture for Secure Database Services, Gagan Aggarwal, Mayank Bawa, Prasanna Ganesan, Hector Garcia-Molina, Krishnaram Kenthapadi, Rajeev Motwani, Utkarsh Srivastava, Dilys Thomas, Ying Xu, Proc. CIDR 2005.

Miscellaneous

Exploiting Hierarchical Domain Structure to Compute Similarity, Prasanna Ganesan, Hector Garcia-Molina and Jennifer Widom, ACM Transactions on Information Systems(TOIS), Jan 2003.


Abstract: We study how a server may disseminate a large volume of data to a set of clients in the shortest possible time. We first consider a cooperative scenario where clients are willing to upload data to each other and, under a simple bandwidth model, derive an optimal solution involving communication on a hypercube-like overlay network. We also study different randomized algorithms, and show that their performance is surprisingly good. We then consider non-cooperative scenarios based on the principle of barter, in which one client does not upload to another unless it receives data in return. A strict barter requirement increases the optimal completion time considerably compared to the cooperative case. We consider relaxations of the barter model in which an efficient solution is theoretically feasible, and show that obtaining a high-performance practical solution may require careful choices of overlay networks and data-transfer algorithms. (Coming Soon!)


Abstract: We consider the problem of indexing high-dimensional data for answering (approximate) similarity search queries. Similarity indexes prove to be important in a wide variety of settings: Web search engines desire fast, parallel, main-memory-based indexes for similarity search on text data; database systems desire disk-based similarity indexes for a variety of high-dimensional data, including text and images; peer-to-peer systems desire distributed similarity indexes that can index data with low communication cost. We propose an indexing scheme called LSH Forest which is applicable in all the above contexts. Our index uses the well-known technique of locality-sensitive hashing (LSH), but improves upon previous designs by (a) eliminating the different data-dependent parameters for which LSH must be constantly hand-tuned, and (b) improving on LSH's performance guarantees for skewed data distributions while retaining the same storage and query overhead. We show how to construct this index in main memory, on disk, in parallel systems, and in peer-to-peer systems. We evaluate the design with experiments on multiple text corpora and demonstrate both the self-tuning nature and the superior performance of LSH Forest. (Coming Soon!)


Abstract: Peer-to-peer (P2P) systems enable queries over a large database horizontally partitioned across a dynamic set of nodes. We devise a self-tuning index for such systems that can trade off index maintenance cost against query efficiency, in order to optimize the overall system cost. The index, Adlib, dynamically adapts itself to operate at the optimal trade-off point, even as the optimal configuration changes with nodes joining and leaving the system. We use experiments on realistic workloads to demonstrate that Adlib can reduce the overall system cost by a factor of four. ( Conference Version \| Extended Technical Report)


Abstract: Distributed Hash Tables have been proposed as flat, non-hierarchical structures, in contrast to most scalable distributed systems of the past. We show how to construct hierarchical DHTs while retaining the homogeneity of load and functionality offered by flat designs. Our generic construction, Canon, offers the same routing state v/s routing hops trade-off provided by standard DHT designs. The advantages of Canon include (but are not limited to) (a) fault isolation, (b) efficient caching and effective bandwidth usage for multicast, (c) adaptation to the underlying physical network, (d) hierarchical storage of content, and (e) hierarchical access control. Canon can be applied to many different proposed DHTs to construct their Canonical versions. We show how four different DHTs---Chord, Symphony, CAN and Kademlia---can be converted into their Canonical versions that we call Crescendo, Cacophony, Can-Can and Kandy respectively. ( Conference Version \| Extended Technical Report \| Talk Slides )


Abstract: We propose optimal routing algorithms for Chord, a popular topology for routing in peer-to-peer networks. Chord is an undirected graph on $2^b$ nodes arranged in a circle, with edges connecting pairs of nodes that are $2^k$ positions apart for any $k \geq 0$. The standard Chord routing algorithm uses edges in only one direction. Our algorithms exploit the bidirectionality of edges for optimality. At the heart of the new protocols lie algorithms for writing a positive integer $d$ as the difference of two non-negative integers $d'$ and $d''$ such that the total number of 1-bits in the binary representation of $d'$ and $d''$ is minimized. Given that Chord is a variant of the hypercube, the optimal routes possess a surprising combinatorial structure. ( Conference Version\| Gurmeet's Talk Slides )


Abstract: Existing peer-to-peer search networks generally fall into two categories: Gnutella-style systems that use arbitrary topology and rely on controlled flooding for search, and systems that explicitly build an underlying topology to efficiently support a distributed hash table (DHT). In this paper, we propose a hybrid scheme for building a peer-to-peer lookup service over arbitrary network topology. Specifically, for each node in the search network, we build a small DHT consisting of nearby nodes and then provide an intelligent search mechanism that can traverse all the small DHTs. Our hybrid approach can reduce the nodes contacted for a lookup by an order of magnitude compared to Gnutella, allows rapid searching of nearby nodes through quick fan-out, does not reorganize the underlying overlay, and isolates the effect of topology changes to small areas for better scalability and stability. (Conference Version)


Abstract: Peer-to-peer systems enable access to data spread over an extremely large number of machines. Most P2P systems support only simple lookup queries. However, many new applications, such as P2P photo sharing and massively multi-player games, would benefit greatly from support for multi-dimensional range queries. We show how such queries may be supported in a P2P system by adapting traditional spatial-database technologies with novel P2P routing networks and load-balancing algorithms. We show how to adapt two popular spatial-database solutions -- kd-trees and space-filling curves -- and experimentally compare their effectiveness. ( Workshop Version \| Talk Slides )


Abstract: The equality and anonymity of peer-to-peer networks makes them vulnerable to routing denial of service attacks from misbehaving nodes. In this paper, we investigate how existing social networks can benefit P2P networks by leveraging the inherent trust associated with social links. We present a trust model that lets us compare routing algorithms for P2P networks overlaying social networks. We propose, SPROUT, a DHT routing algorithm that significantly increases the probability of successful routing by using social links. Finally, we discuss further optimization and design choices for both the model and the routing algorithm. ( IPTPS Version \| P2P&DB Version )


In this paper, we present recent and ongoing research projects of the Peers group at Stanford University. ( Paper )


Abstract: We consider how to maintain a distributed dictionary over a set of nodes such that each node stores all the keys in one contiguous range of the (ordered) key domain. Such range-partitioned dictionaries are commonly used in parallel databases as they enable efficient range queries. As keys are inserted and removed from the dictionary, the partitioning needs to be adjusted in order to ensure storage balance across nodes. We develop an online algorithm that ensures that the asymptotic ratio of storage load between any pair of nodes is at most $(1+\epsilon)$, for any constant $\epsilon >0$, while ensuring that the amortized cost per key insertion or deletion, measured as the number of keys that are migrated across nodes, is constant. Our algorithm can be extended to work for peer-to-peer systems where nodes themselves may join and leave the distributed dictionary. ( Tech Report )


Abstract: Workload information has proved to be a crucial component for database-administration tasks as well as for analysis of query logs to understand user behavior and system usage. These tasks require the ability to summarize large SQL workloads. In this paper, we identify primitives that are important to enable many important workload-summarization tasks. These primitives also appear to be useful in a variety of practical scenarios besides workload summarization. Today’s SQL is inadequate to express these primitives conveniently. We discuss possible extensions to SQL and the relational engine to efficiently support such summarization primitives. ( Conference Paper \| Talk Slides)


Abstract: P3P is a set of standards that allow corporations to declare their privacy policies. Hippocratic Databases have been proposed to implement such policies within a corporation's datastore. From an end-user individual's point of view, both of these rest on an uncomfortable philosophy of trusting corporations to protect his/her privacy. Recent history chronicles several episodes when such trust has been willingly or accidentally violated by corporations facing bankruptcy courts, civil subpoenas or lucrative mergers. We contend that data management solutions for information privacy must restore controls in the individual's hands. We suggest that enabling such control will require a radical re-think on modeling, release/acquisition, and management of personal data. ( Conference Paper )


Abstract: The notion of similarity between objects finds use in many contexts, e.g., in search engines, collaborative filtering, and clustering. Objects being compared often are modeled as sets, with their similarity traditionally determined based on set intersection. Intersection-based measures do not accurately capture similarity in certain domains, such as when the data is sparse or when there are known relationships between items within sets. We propose new measures that exploit a hierarchical domain structure in order to produce more intuitive similarity scores. We also extend our similarity measures to provide appropriate results in the presence of multisets (also handled unsatisfactorily by traditional measures), e.g., to correctly compute the similarity between customers who buy several instances of the same product (say milk), or who buy several products in the same category (say dairy products). We also provide an experimental comparison of our measures against traditional similarity measures, and describe an informal user study that evaluated how well our measures match human intuition. ( Technical Report Version )


Abstract: Many P2P systems organize nodes in an overlay network, in order to enable communication between nodes. We devise a generic mechanism called Apocrypha that can optimize a variety of overlay networks with respect to the underlying physical-network structure. In particular, Apocrypha is applicable to deterministic and randomized overlay topologies, which offer optimal degree-diameter trade-offs but are not amenable to common overlay-optimization techniques proposed in prior work. We show that the use of Apocrypha can lead to optimized overlays of higher quality than was possible with earlier work. We discuss how Apocrypha operates under high rates of system churn, with nodes joining and leaving all the time, by trading off optimization cost against the benefit obtainable at high churn rates. ( Tech Report )


Abstract: DHTs implement a distributed dictionary, supporting key insertion, deletion and lookup. They use hashing to (a) enable efficient dictionary operations, and (b) achieve storage balance across the participant nodes. Hashing can be inappropriate for both problems, as it (a) destroys data ordering, thus making sequential key access and range queries expensive, and (b) fails to provide storage balance when keys are not unique. We propose generalizing DHTs to create Distributed Balanced Tables (DBTs), which eliminate the above two problems. To solve problem (a), we discuss how DHT routing structures can be adapted for use in DBTs, while preserving the costs of the standard dictionary operations and supporting efficient range queries. To solve problem (b), we describe an efficient algorithm that guarantees storage balance, even against an adversarial insertion and deletion of keys. ( Tech Report )


Abstract: Distributed lookup services have predominantly fallen into one of two categories: Gnutella-based systems and DHTs. In this paper, we identify a set of applications for P2P lookup services, and analyze each of their requirements along a commonly-chosen set of dimensions. We show that neither Gnutella nor DHTs may provide the desired trade-offs among these dimensions. We go on to demonstrate a locally-organized P2P lookup service that matches more closely with these application requirements. ( Tech Report )


Abstract: Decision-support applications generate queries with complex predicates. We show how the factorization of complex query expressions exposes significant opportunities for exploiting available indexes. We also present a novel idea of relaxing predicates in a complex condition to create possibilities for factoring. Our algorithms are designed for easy integration with existing query optimizers and support multiple optimization levels, providing different trade-offs between plan complexity and optimization time. ( Conference Paper \| Talk Slides )


Abstract: Recent trends towards database outsourcing, as well as concerns and laws governing data privacy, have led to great interest in enabling secure database services. Previous approaches to enabling such a service have been based on data encryption, causing a large overhead in query processing. We propose a new, distributed architecture that allows an organization to outsource its data management to {\em two} untrusted servers while preserving data privacy. We show how the presence of two servers enables efficient partitioning of data so that the contents at any one server are guaranteed not to breach data privacy. We show how to optimize and execute queries in this architecture, and discuss new challenges that emerge in designing the database schema. ( Technical Report \| Talk Slides )