In applications such as
network monitoring, telecommunications data management, clickstream
monitoring, manufacturing, sensor networks, and others, data takes the
form of continuous data streams rather than finite stored data
sets, and clients require long-running continuous queries as
opposed to one-time queries. Traditional database systems and data
processing algorithms are ill-equipped to handle complex and numerous
continuous queries over data streams, and many aspects of data
management and processing need to be reconsidered in their
presence. In the STREAM project, we are reinvestigating data
management and query processing in the presence of multiple,
continuous, rapid, time-varying data streams. We are attacking
problems ranging from basic theory results to algorithms
to implementing a comprehensive prototype data stream management
system.
The STREAM project has been supported in part by
the National Science Foundation under grants IIS-0118173, IIS-9817799,
IIS-0324431, and IIS-1098447.
- (January 2006) The STREAM project
has officially wound down. Key students have finished their
Ph.D.'s and left Stanford, while others have moved on to new and
exciting different research topics. (Not to be taken as an
indication that research in streams and continuous queries is
"done" -- many topics are still wide open.) We will endeavor to
continue answering questions and supporting the publicly available
system, however resources are minimal at this stage.
- (ongoing) We maintain a Stream Query Repository
as a resource for researchers in data streams.
- (March 2004) A new overview paper on the STREAM project is
available: STREAM: The
Stanford Data Stream Management System. It will appear in a book on
data stream management edited by Garofalakis, Gehrke, and Rastogi.
- (August 2003) The latest informal get-together of data streams
research groups was hosted by David Maier's group at OGI in Portland. Meeting
notes are available from the Stream
Team Web page. The next Stream Team meeting is planned for spring '04,
location TBD. [Summer '04: It was held in Berkeley, but the notes are in an indefinite state of "being assembled".]
- (May 2003) We instantiated the Linear Road Benchmark
with a detailed
suite of CQL schemas and queries.
Our prototype Data Stream Management System is
available for public use. You may try the system over the internet, or
you may download the code.
Internet System
You can try the STREAM prototype without downloading and installing the
source -- we start up a server at Stanford and a client on your
machine. CLICK HERE to give it a
try, and don't forget to read the Online Help so you know what
you're doing. If the server above is overloaded or unavailable,
try this
one instead.
Using the internet-accessible system you can try out much of the
prototype's functionality: a few predefined streams and queries
(including the Linear
Road Benchmark), registering new streams and queries, visualizing
and monitoring query plans, and other features.
Source Code
If you're interested in installing the prototype system at your own
site, the source
code is available for download. This release contains the code
for the main STREAM server and for a GUI client to interact with the
server over a network. The server can also be used as a library, and
directly linked from a C++ program. The code release includes a fairly
comprehensive user
manual with detailed information about the functionality
and design of the STREAM prototype.
Faculty
Students (grad and undergrad)
Alums
- The Stanford Data Stream Management System.
Jennifer Widom's
standard talk on the project, updated January 2006
- CQL: A Language for Continuous Queries over Streams and Relations.
Invited talk given by Jennifer Widom at the 2003 DBPL Workshop
(Sept. 2003)
- Randomization for Massive and Streaming Data Sets.
Talk given
by Rajeev Motwani at the Stanford
Computer Science Forum - Annual Affiliates Meeting (May 2003)
- Models and Issues in Data Stream Systems.
Invited talk given by
Rajeev Motwani at the 2002
PODS Conference (June 2002)
Overviews and Surveys
- The STREAM Group.
Stanford Data Stream Management System (latest overview paper)
To appear in a book on
data stream management edited by Garofalakis, Gehrke, and Rastogi.
- The STREAM Group. STREAM:
The Stanford Stream Data Manager (short overview paper)
IEEE Data
Engineering Bulletin, March 2003
- R. Motwani, J. Widom, A. Arasu, B. Babcock, S.
Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein, and R. Varma. Query Processing, Resource
Management, and Approximation in a Data Stream Management System
In
Proc. of CIDR 2003, Jan.
2003
This paper describes our ongoing work developing
the Stanford Stream Data Manager (STREAM), a system for executing continuous
queries over multiple continuous data streams. The STREAM system supports a
declarative query language, and it copes with high data rates and query
workloads by providing approximate answers when resources are limited. This
paper describes specific contributions made so far and enumerates our next
steps in developing a general-purpose Data Stream Management System.
- B. Babcock, S. Babu, M. Datar, R. Motwani, and J.
Widom. Models and Issues in
Data Stream Systems
Invited paper in Proc. of PODS 2002, June 2002
In
this overview paper we motivate the need for and research issues arising from
a new model of data processing. In this model, data does not take the form of
persistent relations, but rather arrives in multiple, continuous, rapid,
time-varying data streams. In addition to reviewing past work relevant to data
stream systems and current projects in the area, the paper explores topics in
stream query languages, new requirements and challenges in query processing,
and algorithmic issues.
- S. Babu and J. Widom. Continuous Queries over Data
Streams
In SIGMOD Record, Sep. 2001
We
specify a general and flexible framework for query processing in the presence
of data streams. The framework captures most previous work on continuous
queries and data streams, as well as subsuming related concepts such as
triggers and materialized views. We further map out problems, techniques, and
challenges in processing continuous queries over data streams.
System
- U. Srivastava and J. Widom. Flexible Time Management in Data
Stream Systems
In Proc. of PODS 2004, June 2004
Flexible application-defined time poses challenges to a Data Stream
Management System, since streams may be out of order and uncoordinated with
each other, they may incur latency reaching the DSMS, and they may pause or
stop. We formalize these challenges and specify how to generate heartbeats so
that queries can be evaluated correctly and continuously in an
application-defined time domain. Our heartbeat generation algorithm is based
on parameters capturing skew between streams, unordering within streams, and
latency in streams reaching the DSMS. We also describe how to estimate these
parameters at run-time, and we discuss how heartbeats can be used for
processing continuous queries.
- B. Babcock, M. Datar, and R. Motwani. Load Shedding for Aggregation Queries over Data Streams
In Proc. of ICDE 2004, March 2004
We present load shedding techniques for a
restricted class of stream queries: Aggregation queries over sliding windows,
possibly with selections, projections and foreign key joins with stored
relations. We present optimal solutions for placing load shedders (operators
which randomly drop tuples) in the query plan, which reduce the load on the
system below the required threshold, while minimizing the inaccuracy
introduced in the queries.
- D. Thomas and R. Motwani. Caching Queues in Memory Buffers
In Proc. of SODA 2004, Jan. 2004
We study the
problem of maintaining queues in a cache, which occurs in a number of
important settings like DataStream systems, Network Router design and
Distributed Messaging services. We analyze why DataStream systems built on top
of buffer managers that use traditional algorithms like LRU perform badly. We
provide online competitive algorithms for this problem for different
interesting cost models.
- B. Babcock, S. Babu, M. Datar, R. Motwani, and D.
Thomas. Operator Scheduling
in Data Stream Systems
To appear in VLDB Journal, 2005
This paper is an extended version of our paper titled "Chain: Operator
Scheduling for Memory Minimization in Data Stream Systems" that appeared in
the proceedings of SIGMOD 2003. This paper extends the Chain
operator-scheduling strategy proposed in the SIGMOD paper to minimize run-time
memory requirements subject to user-specified latency constraints. This paper
also proves an NP-completeness result showing the intractability of the
problem of minimizing run-time memory requirements in the stream setting.
Query Language
- A. Arasu and J. Widom. A Denotational Semantics
for Continuous Queries over Streams and Relations
In SIGMOD Record, Sep. 2004
We present formal,
denotational semantics for a generic continuous query language
based on streams, time-varying relations, and three classes of operators
over streams and relations.
- A. Arasu, S. Babu and J. Widom. The CQL Continuous Query
Language: Semantic Foundations and Query Execution
To appear in VLDB Journal, 2005
We first present an abstract semantics
for continuous queries over streams based on several building blocks: formal
definitions for streams and relations, mappings among them, and any relational
query language. We then propose a concrete language, CQL (for Continuous Query
Language), which instantiates the abstract semantics using SQL as the
relational query language and window specifications derived from SQL-99 to map
from streams to relations. Finally, we present the implementation of CQL in
the STREAM prototype, describing query execution plans, operators,
inter-operator queues, synopses, and sharing of data and computation among
multiple operators and queries. Examples throughout the paper are drawn from
the Linear Road Benchmark recently proposed for Data Stream Management
Systems.
Note: A preliminary, much shorter version of this paper
appeared as the November 2002 Technical Report An Abstract Semantics and
Concrete Language for Continuous Queries over Streams and Relations, and
an even shorter version titled "CQL: A Language for Continuous Queries over
Streams and Relations" appeared as an invited paper in the DBPL workshop,
Sep. 2003.
Query Processing
- K. Munagala, U. Srivastava, and J. Widom. Optimization of Continuous Queries with Shared Expensive Filters
Technical Report, Nov. 2005
We consider the problem of optimizing and executing multiple
continuous queries, where each query is a conjunction of filters and
each filter may occur in multiple queries. When filters are expensive,
significant performance gains are achieved by sharing filter
evaluations across queries. A shared execution strategy in our
scenario can either be fixed, in which filters are evaluated in the
same predetermined order for all input, or adaptive, in which
the next filter to be evaluated is chosen at runtime based on the
results of the filters evaluated so far. We show that as filter costs
increase, the best adaptive strategy is superior to any fixed
strategy, despite the overhead of adaptivity. We show that it is
NP-hard to find the optimal adaptive strategy, even if we are willing
to approximate within any factor smaller than logarithmic in the
number of queries. We present a greedy execution strategy and show
that it approximates the best adaptive strategy to within a factor
polylogarithmic in the number of queries and filters. We also show how
the execution overhead of adaptive strategies can be reduced by
appropriate precomputation. Finally, we present an experimental
evaluation demonstrating the effectiveness of our techniques.
- U. Srivastava, K. Munagala and J. Widom. Operator Placement for
In-Network Query Processing
In Proc. of PODS 2005, June 2005
In this paper we consider the problem of executing queries over a
network of nodes where data is acquired at low-capability sensors and
then transmitted through a hierarchy of nodes having progressively
increasing network bandwidth and computational power. The goal is to
perform ``in-network'' query processing and to decide the placement of
the query plan operators such that the total cost of computation and
transmission is minimized. We give optimal operator-placement algorithms
for queries involving possibly expensive conjunctive filters, and joins.
- S. Babu, K. Munagala, J. Widom, and R. Motwani. Adaptive Caching for
Continuous Queries
In Proc. of ICDE 2005, April 2005
We study the problem of using caches to improve performance and
adaptivity in continuous multiway joins. We propose different cache
types and algorithms for cache maintenance, monitoring cache cost and
benefits, selecting caches to use, allocating memory to caches, and
adapting over the entire spectrum between stateless MJoins and
cache-rich join trees as stream and system conditions change.
Although we focus on joins, our algorithms generalize easily to
query plans composed of of one or more operator pipelines, and to any
number of such query plans.
- K. Munagala, S. Babu, R. Motwani, and J. Widom.
The Pipelined Set Cover Problem
In Proc. of ICDT 2005, Jan. 2005
A classical problem in query optimization is to find the optimal
ordering of a set of possibly correlated selections or joins. We
provide an abstraction of this problem as a generalization of set
cover called pipelined set cover, where the sets are applied
sequentially to the elements to be covered and the elements covered at
each stage are discarded. We show that several natural heuristics for
this NP-hard problem, such as the greedy set-cover heuristic and a
local-search heuristic, can be analyzed using a linear-programming
framework which bounds not only the approximation ratio, but also the
running time of the corresponding algorithms.
We also consider the online version of pipelined
set cover and present a competitive algorithm with a logarithmic
performance guarantee.
- A. Arasu and J. Widom. Resource Sharing in Continuous
Sliding-Window Aggregates
In Proc. of VLDB 2004,
Sep. 2004
We consider the problem of resource sharing when processing large
numbers of continuous queries. We specifically address sliding-window
aggregates over data streams, an important class of continuous operators
for which sharing has not been addressed. We present a suite of sharing
techniques that cover a wide range of possible scenarios: different
classes of aggregation functions (algebraic, distributive, holistic),
different window types (time-based, tuple-based, suffix, historical),
and different input models (single stream, multiple substreams). We
provide precise theoretical performance guarantees for our techniques,
and show their practical effectiveness through a thorough experimental
study.
- U. Srivastava and J. Widom.
Memory-Limited Execution of Windowed Stream Joins
In Proc. of VLDB 2004, Sep. 2004
We address the problem of computing approximate answers to
sliding-window
joins over data streams when the available memory may be insufficient
to
keep the entire join state. The objective of the approximation may be
either to return a maximum-size subset of the result or a random sample
of
the result. We introduce a new age-based model of stream arrival that
is
often more appropriate for addressing these problems than the
traditional
frequency-based model used in previous work. We also provide an
algorithm
for optimal memory allocation across multiple joins being executed in
the
system.
- S. Babu, U. Srivastava, and J. Widom. Exploiting k-Constraints to
Reduce Memory Overhead in Continuous Queries over Data Streams
In ACM TODS, Sep. 2004
We introduce the
important concept of k-constraints, which are likely to hold in data
stream environments even when strict constraints do not hold. We demonstrate
how to incorporate k-constraints into a data stream query processor in
order to reduce memory overhead for continuous queries. We show empirically
that k-constraints are very effective at reducing the memory
requirement in a wide variety of SPJ queries and that these constraints can be
monitored and exploited with very low computational overhead.
- S. Babu, R. Motwani, K. Munagala, I. Nishizawa, and J. Widom. Adaptive Ordering of
Pipelined Stream Filters
In Proc. of SIGMOD 2004, June 2004
We consider the problem of pipelined filters, where a continuous
stream of elements is processed by a set of commutative filters.
We focus on the problem of
ordering the filters adaptively to minimize processing cost in an
environment where stream and filter characteristics vary unpredictably
over time. Our core algorithm, A-Greedy (for Adaptive
Greedy), has strong theoretical guarantees: If stream and filter
characteristics were to stabilize, A-Greedy would converge to an
ordering within a small constant factor of optimal.
(In experiments A-Greedy
usually converges to the optimal ordering.)
We identify and study a three-way tradeoff
among provable convergence to good orderings, run-time overhead, and
speed of adaptivity.
- A. Arasu, B. Babcock, S. Babu, J.
McAlister, and J. Widom. Characterizing Memory
Requirements for Queries over Continuous Data Streams
In
ACM TODS, March 2004.
We
consider conjunctive queries with arithmetic comparisons over multiple
continuous data streams. We specify an algorithm for determining whether or
not a query can be evaluated using a bounded amount of memory for all possible
instances of the data streams. When a query can be evaluated using bounded
memory, our algorithm produces an evaluation plan based on constant-sized
synopses of the data streams.
Note: This paper is an extension of the
paper of the same name that appeared in PODS 2002.
- U. Srivastava, S. Babu, and J. Widom. Monitoring Stream Properties for
Continuous Query Processing (short paper)
In Proc. of MPDS 2003, June 2003
Distributed Streams
- C. Olston, J. Jiang, and J. Widom. Adaptive Filters for
Continuous Queries over Distributed Data Streams
In Proc. of SIGMOD 2003, June 2003
We consider an environment where distributed data sources continuously
stream updates to a centralized processor that monitors continuous queries
over the distributed data. Significant communication overhead is incurred in
the presence of rapid update streams, and we propose a new technique for
reducing the overhead. Users register continuous queries with precision
requirements at the central stream processor, which installs filters at remote
data sources. The filters adapt to changing conditions to minimize stream
rates while guaranteeing that all continuous queries still receive the updates
necessary to provide answers of adequate precision at all times.
- B. Babcock and C. Olston. Distributed
Top-K Monitoring
In Proc. of SIGMOD 2003, June 2003
We study a useful class of
queries that continuously report the k largest values obtained from
distributed data streams ("top-k monitoring queries"), which are of particular
interest because they can be used to reduce the overhead incurred while
running other types of monitoring queries. We show that transmitting entire
data streams is unnecessary to support these queries and present an
alternative approach that reduces communication significantly. In our
approach, arithmetic constraints are maintained at remote stream sources to
ensure that the most recently provided top-k answer remains valid to within a
user-specified error tolerance.
Statistics
- A. Arasu and G. Manku. Approximate Counts and
Quantiles over Sliding Windows
In Proc. of PODS 2004, June 2004
We consider the problem of maintaining approximate counts and quantiles
over fixed- and variable-size sliding windows in limited space. For
quantiles, we present deterministic algorithms whose space requirements
are O(1/e log(1/e)log N) and O(1/e log(1/e) log(eN) log N) in the
worst-case for fixed- and variable-size windows, respectively, where N
denotes the current number of elements in the window and e, the relative
error. Our space bounds improve upon the previous best bounds of O(1/e^2
polylog (1/e,N)). For counts, we present both deterministic and randomized
algorithms. The deterministic algorithms require O(1/e log^2 (1/e)) and
O(1/e log^2 (1/e) log eN) for worst-case space for fixed- and
variable-size windows, respectively, while the randomized ones require
O(1/e log (1/(e d))) and O(1/e log(1/(ed)) log eN) worst-case space, where
d denotes the probability of failure. We believe no previous work on
space-efficient approximate counts for sliding windows exists.
- B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan. Maintaining Variance and
k-Medians over Data Stream Windows
In Proc. of PODS 2003, June 2003
We present a novel technique for solving two important and related
problems in the sliding window model -- maintaining variance and maintaining
k-medians clustering.
- M. Datar and S. Muthukrishnan. Estimating
Rarity and Similarity over Data Stream Windows
In Proc. of
European Symposium of Algorithms, Sep. 2002
We present
solutions to two problems in the sliding window model: estimating
rarity and similarity over streams. The rarity of a stream is
defined as the ratio of the number of elements that occur exactly once to the
number of distinct elements. The similarity between two streams is defined as
the similarity between the sets of distinct elements seen over the two
streams; the ratio of the size of the intersection to the size of the union.
- G. Cormode, M. Datar, P. Indyk, and S. Muthukrishnan. Comparing Data Streams Using
Hamming Norm
In Proc. of VLDB 2002, Aug. 2002
We present
solution to the problem of computing Hamming norm over data streams. Hamming
norm computation is more general than the well studied distinct value
estimation problem. Our solution uses sketching techniques and works in the
presence of inserts and deletes.
- G. S. Manku and R. Motwani. Approximate
Frequency Counts over Streaming Data
In Proc. of VLDB 2002, Aug. 2002
This paper present algorithms for computing frequency counts exceeding
a user-specified threshold over data streams. Some applications are also
presented.
- M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining
Stream Statistics Over Sliding Windows
In SIAM Journal on Computing,
Vol. 31 No. 6
We consider the problem of
maintaining statistics over sliding windows. We design data structures with
small memory requirements and provide matching lower bounds.
Note: This
paper is an extension of the paper of the same name that appeared in SODA
2002.
- B. Babcock, M. Datar, and R. Motwani. Sampling From a Moving Window
Over Streaming Data
In Proc. of SODA 2002, Jan. 2002
We
introduce the problem of sampling from a moving window of recent items from a
data stream and develop the "chain-sample" and "priority-sample" algorithms
for this problem.
Clustering
- S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering Data Streams:
Theory and Practice
IEEE Trans. on Knowledge and Data
Engineering, vol. 15 (2003)
Under the data stream model,
the data set to be processed is assumed to be too large to be processed
together in RAM, and to be only accessible via linear scans, so that, for
example, random access is unavailable. This model has recently attracted
attention for its applicability to numerous types of data, including telephone
records, web documents and clickstreams. For algorithms designed to analyze
such data, the ability to process the data in a single pass, or a small number
of passes, while using little memory, is crucial. We describe a one-pass,
memory-efficient streaming algorithm that effectively clusters large data
streams. We also provide empirical evidence of the algorithm's performance on
synthetic and real data streams.
- L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani. High-Performance
Clustering of Streams and Large Data Sets
In Proc. of ICDE 2002, Feb. 2002
We give innovative techniques to transform theoretically well-founded
algorithms for clustering into ones that perform well in practice. We further
show that their performance is competitive with popular empirical approaches
for clustering data streams.
- S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering
Data Streams
In Proc. of FOCS 2000, Nov. 2000
We study
clustering under the data stream model of computation where given a sequence
of points, the objective is to maintain a consistently good clustering of the
sequence observed so far using little memory and time.
Applications
- S. Babu, L. Subramanian, and J. Widom. A Data Stream Management System
for Network Traffic Management
In Proc. of NRDM 2001, May 2001
In
this short position paper, we describe the demands of network traffic
management applications and we discuss how a Data Stream Management System can
provide a general and scalable platform for deploying these applications.
Last modified: January 5, 2006