Streams Meeting, March 20,2002.
Stanford:
Jennifer Widom | widom@db.stanford.edu |
Rajeev Motwani | !
rajeev@cs.stanford.edu |
Jeff Ullman | ullman@cs.stanford.edu |
Arvind Arasu | arvinda@cs.stanford.edu |
Brian Babcock | babcock@cs.stanford.edu |
Shivnath Babu | shivnath@cs.stanford.edu |
Prasanna Ganesan | prasanna.ganesan@cs.stanford.edu |
Gurmeet Manku | manku@cs.stanford.edu |
Liadan O'Callaghan | loc@cs.stanford.edu |
Rohit Varma | rohit.varma@cs.stanford.edu |
Berkeley:
Mike Franklin | franklin@cs.berkeley.edu |
Sirish Chandrasekaran | sirish@cs.berkeley.edu |
Yanlei Diao | diaoyl@cs.berkeley.edu |
Sam Madden | madden@cs.berkeley.edu |
OGI:
Dave Maier | maier@cse.ogi.edu |
Pete Tucker | ptucker@cse.ogi.edu |
Dave Meier:
Pipelined Evaluation
Requiressome ! local state at operators
Monotonic Evaluation:
Basic SPJ, dupl.. elim aremonotonic
Is Nest (aka groupby) monotonic?
What does monotonic mean?
Maier: If a tupleÕs in the answerit stays in the answer forever.
Widow: If the input gets bigger,the answer gets bigger.
Groups donÕt grow monotonically inWidomÕs view Ð may change as
new values arrive
In MaierÕs view, montonicity of nest depends on definitionof Òless-thanÓ (e.g. ordering) Ð if ordering is su! bset, not nest not monotonic,if substructure, nest is monotonic.
One way to do it is to make Òreconstitution functionÓ handleit Ð e.g. make client compute groups. Jennifer sez reconstitution function is a lot like deltas.
Maier sez you need to Òdescribe the intent of the streamÓ Ðis it deltas, values, groups, partial groups?
Maier: Jayaval / Kristen showed that streaming aggregates(e.g. nonmonotonic) results might ma! ke deltas preferable.
Maier: Kinds of stream semantics:
-Append
- Delta
- Merge
- ! ÒBroadcast StreamsÓ Ð a sequence of snapshots, cyclic updateswith modified entries in the cycle.
- Strict snapshots?
- Invalidations
What does merge mean for two XML documents?
Two streams; which i! s the correct merge?
A Ð B Ð D &n!
bsp;
&nb!
sp;/
A Ð B A
\
C
\
C
A Ð B Ð C
OGI wants to do it via templates. DTDs donÕt specify this in enough detail, Xschema probablydoesnÕt either.
Pete Tucker:
Punctuated Streams
Deal with both unbounded state and blocking operators
Punctuations terminate subsets!
within the streams;
Punctuation semantics must be understood by operators.
Used so far:
Seen everything in a range. ( or up to a value)
Seen everything in a list
Seen all occurrences of a v! alue
Seen everything (e.g EOAttr)
Can punctuations have a ÒleaseÓ Ð e.g. valid for a certaintime range?
For now, punctuations are idempotent.
Three kinds of rules:
Pass rules:
Purge rules: What cana stateful operator discard when it sees punctuation?
Propagation rules:
Sirish: Whatabout pre-punctuation that indicates what the user can expect to see?
Maier: ThatÕdbe a different kind of stream semantics Ð useful.
Current state: partially implemented in Niagara Query Engine.
&!
nbsp;
Are punctuations and windows and epochs the same thing?
Sam: windowsare essentially ÒimplicitÓ punctuations
Jennifer: different kinds of constraints on streams:
- Schema level streamconstraints (e.g. ordering, or strict referential integrity)
o
- Data-level constraints(e.g. punctuations)
- Query-level constraints(e.g. time windows)
- Operator-levelconstraints
Are constraints always about optimization?
Correlated aggregates (Gehrke), last years PODS (or SIGMOD),talks about ÒlandmarksÓ, which look a lot like punctuations.
Chronicle data model Ð any join is an implicit band joi! n ontime, which constrains things.
Ullman Ð seems like you should be able to reason about thetypes of punctuations that can be output, demonstrate that state actually getspurged everywhere, etc. At least for the very restricted set of punctuationscurrently being considered.
Data Stream Queries
- Answer availability
o
o
o
- Registration time
o
o
- Stream Access
o
o
Applications
Networkmanagement and traffic engineering
Streamsof measurements and packet traces
Networksecurity
Networkpacket streams, user session information
Queries:URL filtering, intrusion detection DOS attacks, viruses
Finan! cialApplications
Streamsof trading data, stock tickers, news feeds
Queries:
Webtrack and personalization
MassiveDatabases (e.g. astronomy, physics data)
Query Evaluation: Distributed Streams
Manyphysical streams but one logical stream (not physically combining data!)
e.gyahoo top 100 pages
Correlatestreams at distributed servers
e.g.network monitoring
&! nbsp; Manystreams controlled by a few servers
e.g.sensor networks in a centralized environment
Issues/ concepts:
move processing tostreams, not streams to processor
approximation-bandwidthtradeoff
Architecture:
- Operators have synopses; more memory == bigger synopses
- Synopses orthogonal to operators; operators can use different synopses
- &nb! sp; Operators in query plan, connected by queries, scheduled byscheduler
- Approximation inherent: e.g. duplicate elimination Ð more memory improves the quality of theresults (fewer duplicates)
- Hope to share synopses Ð donÕt know how.
- Global memory allocation problem: lower operator with a small synopses limits amount of useful state at higher level.
- Query plans == Fjords!
DSMS Internals
Operators,synopses, queues
Accuracyvs. memory tradeoff ! in every operator
Everyoperator adapts gracefully irrespective of memory
Scheduler:
System Implementation
ÒDevelopersworkbenchÓ:
- Submit queries in extended SQL or algebra
- Submit or edit query plans in XML or GUI
- Query plan execution visualizer
- On-the-fly modification of memory allocation, schedulingpolicies, etc.
Lots of algorithmic work on synopses (published stuff viaRajeev &! ; co.)
Constraints:
- Important to exploit constraints in query processing(specified at schema level)
o
o
o
- Need not be exact (e.g. k-clustered)
- Reduce memory requirements
- Unblock blocking operators (e.g. if you know a stream issorted on time, you can output results up to the most recent time received)
Some relationship to punctuations Ð seems as though some punctuationscan be expressed as schema level propertiesÉ
Approximation in query processing
- Understanding behavior of approximate operators when composed
- Memory allocation to operators in a plan, given per operatormemory-accuracy curve
- Best query plan, assuming best memory allocation
- Multiple (weighted) queries sharing resources
Operator tells you how itÕs approximating, maybe outputsconfidence intervals. If weunderstand how approximate operators compose, how do I allocate memory g!
loballyto maximize accuracy? Then, how doI choose the best query plan, assuming best memory allocation?
Mike: What doaccuracy curves look like? DonÕtreally know
Maier: Couldyou do query checkpoints as XML? Jennifer: XML in DSMS is not about state storage Ð itÕs about a queryplan, or memory allocation, but not actual tuples.
Answers tend to get less accu! rate as they flow up the queryplan in this scheme. Lowerlevel operators could make use of state stored in higher level operators todecide what to discard / keep.
Adaptive to resource allocation / memory, not adaptivity ofquery plans. Not even reallyattacking problem of query plan generation Ð in initial implementation users implementjust query graphs.
Tuples are all timestamped.
Windows are based on timestamps. (also can specify by # oftuples? or punc?)
&!
nbsp;
Tuple timestamps are applied either when the data is created
in the source, or when the tuples arrive.
Time series analysis vs. most current
reconstitution - twofunctions: one to time series, one to most recent
Maier: What issent on to the person when a match is found? Whole document
or some subset or the matching!
path?
has to be delivered to the user. Delivery seems to be the bottleneck -- probably
need distributed routing / delivery.
Maier: Is itnecessary to find all matches or just one match? Need to support both. NFA supports both. Can we optimize in the case where you just want to
XML parsing is bottleneck! Not filter cost.
Maier: Parsingis a problem in their world too. Found that combining 10 documents
into a single document yielded a significant performancebenefit. Theory: symbol
table construction is important.
Yanlei: Javaparsers 60-80ms, C parsers 6-10 ms.
Maier: This isa stream shredder? Yup.
Jennifer: Thisreally looks like a query plan. Yup.
Delivery seems to dominate result construction.
Open Questions:
Streams benchmark?
Streaming data model? Can we converge on a data / query model we agree on?
Benchmark might be a good starting point -- we'll have toagree what queries
to run and what data to run them over.
| Sensors | ClickStream | Networking Monitoring and Routers | Call Record Stuff | Financial | Baseball (Event Streams) | XML dissemination Pub/Sub |
# Of Clients | 1-1000 | 100 | 1 | 2 | 10^7 | | 10^6 |
Delivery Requirements | | | | | | | |
Content Complexity | low | low (homog) | med | low | low | med (hetero) | high |
Query Types | relational, time-series, alerts | data-mining | relational, time-series, data-mining, alerts | data-mining | alerts, time series | xquery, aggregates | filtering, alerts, excerpting, restructuring (not aggregating or combining) |
Distribution | | | | | | | |
Stream Semantics | | | | | | stream of xml fragments | stream of xml |
Data Arrival Rate | low-high | med | very high | high | med | low-med | med |
Data Delivery | both | push | push | push | both | both | push |
Query Flux | ? | low | low | low | ? | ? | ? |
Query Load | ? | med | high | low | high | ? | high |
Accuracy Requirements | low; variable | low | med? | low (load), high (fraud) | high | ? | high |
Lossiness | yes | no | no | no | no | no | no |
rate flux | | | | | | | |
Band Join
Sliding Window Join
Window Join
Acti! ve/Inactive Queries
Transfer Queries: Logging a view
Predefined == queries exist before data
Ad-hoc == streams exist before queries
Flying join