Streams Meeting, March 20,2002.

@ Stanford

 

 

Stanford:

!

Jennifer Widom

widom@db.stanford.edu

Rajeev Motwani

rajeev@cs.stanford.edu

Jeff Ullman

ullman@cs.stanford.edu

Arvind Arasu

arvinda@cs.stanford.edu

Brian Babcock

babcock@cs.stanford.edu

Shivnath Babu

shivnath@cs.stanford.edu

Prasanna Ganesan

prasanna.ganesan@cs.stanford.edu

Gurmeet Manku

manku@cs.stanford.edu

Liadan O'Callaghan

loc@cs.stanford.edu

Rohit Varma

rohit.varma@cs.stanford.edu

 

Berkeley:

Mike Franklin

franklin@cs.berkeley.edu

Sirish Chandrasekaran

sirish@cs.berkeley.edu

Yanlei Diao

diaoyl@cs.berkeley.edu

Sam Madden

madden@cs.berkeley.edu

 

OGI:

Dave Maier

maier@cse.ogi.edu

Pete Tucker

ptucker@cse.ogi.edu

 

 

Dave Meier:

 

Pipelined Evaluation

            Requiressome ! local state at operators

Monotonic Evaluation:

Basic SPJ, dupl.. elim aremonotonic

Is Nest (aka groupby) monotonic?

What does monotonic mean?

Maier: If a tupleÕs in the answerit stays in the answer forever.

Widow: If the input gets bigger,the answer gets bigger.

Groups donÕt grow monotonically inWidomÕs view  Ð may change as

new values arrive

 

In MaierÕs view, montonicity of nest depends on definitionof Òless-thanÓ (e.g. ordering) Ð if ordering is su! bset, not nest not monotonic,if substructure, nest is monotonic.

 

One way to do it is to make Òreconstitution functionÓ handleit Ð e.g. make client compute groups. Jennifer sez reconstitution function is a lot like deltas.  Maier wants fancy reconstitutionfunctions (e.g. partial preaggregation.) Zaniello talks about making aggregates monotonic.

 

Maier sez you need to Òdescribe the intent of the streamÓ Ðis it deltas, values, groups, partial groups?

 

Maier: Jayaval / Kristen showed that streaming aggregates(e.g. nonmonotonic) results might ma! ke deltas preferable.

 

Maier: Kinds of stream semantics:
            -Append

-      Delta

-      Merge

-      ! ÒBroadcast StreamsÓ Ð a sequence of snapshots, cyclic updateswith modified entries in the cycle.

-      Strict snapshots?

-      Invalidations

 

What does merge mean for two XML documents?

 

 

Two streams; which i! s the correct merge?

 

                                                                               D

A Ð B Ð D                                            &n! bsp;               /

                                      D                                  B

                 &nb! sp;/                                    /

     A Ð B                                       A

                  \                                     \

       C                                  B

      \

         C

A Ð B Ð C

 

 

 

OGI wants to do it via templates.  DTDs donÕt specify this in enough detail, Xschema probablydoesnÕt either.

 

 

Pete Tucker: 

 

Punctuated Streams

Deal with both unbounded state and blocking operators

 

Punctuations terminate subsets! within the streams;  e.g. end of time windows, groups, etc.  Allows state to be discarded early, andallow partial results to be output.

 

Punctuation semantics must be understood by operators.  For now, punctuations are very simple Ðterminate subsets, apply only to a single attribute.  But could imagine lots of more complicated punctuations.

 

Used so far:

 

Seen everything in a range. ( or up to a value)

Seen everything in a list

Seen all occurrences of a v! alue

Seen everything (e.g EOAttr)

 

Can punctuations have a ÒleaseÓ Ð e.g. valid for a certaintime range?

 

For now, punctuations are idempotent.  Stream is ÒgrammaticalÓ Ð always ok tosend punctuations later, or to periodically send cumulative punctuations.  Not clear what this means for morecomplicated punctuations.

 

Three kinds of rules:

 

Pass rules:  Indicates what a blocking operat! oroutput when its sees a punctuation.

Purge rules: What cana stateful operator discard when it sees punctuation?

Propagation rules:  When can a operator pass a punctuation?

 

Sirish:  Whatabout pre-punctuation that indicates what the user can expect to see?

Maier:  ThatÕdbe a different kind of stream semantics Ð useful.

 

Current state: partially implemented in Niagara Query Engine.

&! nbsp;

Are punctuations and windows and epochs the same thing?

Sam:  windowsare essentially ÒimplicitÓ punctuations

Jennifer: different kinds of constraints on streams:

 

-      Schema level streamconstraints (e.g. ordering, or strict referential integrity)

o      Allowdata to be thrown out

-      Data-level constraints(e.g. punctuations)

-      Query-level constraints(e.g. time windows)

-      Operator-levelconstraints

 

Are constraints always about optimization?  Schema level, and data level are, querylevel maybe  (system constraintsare about optimization, user constraints are about query correctness).  (Yarick Geryz Ð Last yearÕs SIGMODindustrial track Ð inexact properties.) 

 

Correlated aggregates (Gehrke), last years PODS (or SIGMOD),talks about ÒlandmarksÓ, which look a lot like punctuations.

 

Chronicle data model Ð any join is an implicit band joi! n ontime, which constrains things.

 

Ullman Ð seems like you should be able to reason about thetypes of punctuations that can be output, demonstrate that state actually getspurged everywhere, etc. At least for the very restricted set of punctuationscurrently being considered.

 

 

 

Jennifer

 

Data Stream Queries

 

-      Answer availability

o      Onetime

o      Multipletime

o      Continuous(ÒstandingÓ)

-      Registration time

o      Predefined

o      AdHoc

-      Stream Access

o      Arbitrary

o      SlidingWindow (special case size = 1)

 

 

Applications

            Networkmanagement and traffic engineering

                        Streamsof measurements and packet traces

            Networksecurity

            Networkpacket streams, user session information

            Queries:URL filtering, intrusion detection DOS attacks, viruses

            Finan! cialApplications

                        Streamsof trading data, stock tickers, news feeds

                        Queries:  arbitrage opportunities, analytics,patterns

            Webtrack and personalization

            MassiveDatabases (e.g. astronomy, physics data)

 

Query Evaluation: Distributed Streams

            Manyphysical streams but one logical stream (not physically combining data!)

                        e.gyahoo top 100 pages

            Correlatestreams at distributed servers

                        e.g.network monitoring

  &! nbsp;         Manystreams controlled by a few servers

                        e.g.sensor networks in a centralized environment

           

            Issues/ concepts: 

move processing tostreams, not streams to processor

approximation-bandwidthtradeoff

 

Architecture:

 

-      Operators have synopses; more memory == bigger synopses

-      Synopses orthogonal to operators;  operators can use different synopses

-    &nb! sp; Operators in query plan, connected by queries, scheduled byscheduler

-      Approximation inherent: e.g. duplicate elimination Ð more memory improves the quality of theresults (fewer duplicates)

-      Hope to share synopses Ð donÕt know how.

-      Global memory allocation problem:  lower operator with a small synopses limits amount of useful state at higher level.

-      Query plans == Fjords!

 

 

DSMS Internals

            Operators,synopses, queues

            Accuracyvs. memory tradeoff ! in every operator

            Everyoperator adapts gracefully irrespective of memory

            Scheduler:  round robin scheduler

 

System Implementation

            ÒDevelopersworkbenchÓ:

-      Submit queries in extended SQL or algebra

-      Submit or edit query plans in XML or GUI

-      Query plan execution visualizer

-      On-the-fly modification of memory allocation, schedulingpolicies, etc.

 

Lots of algorithmic work on synopses (published stuff viaRajeev &! ; co.)

 

Constraints:

-      Important to exploit constraints in query processing(specified at schema level)

o      Foreignkey joins

o      Referentialintegrity : combine streams

o      Clustering,ordering : what tuples are in time window

-      Need not be exact (e.g. k-clustered)

-      Reduce memory requirements

-      Unblock blocking operators (e.g. if you know a stream issorted on time, you can output results up to the most recent time received)

Some relationship to punctuations Ð seems as though some punctuationscan be expressed as schema level propertiesÉ

 

Approximation in query processing

-      Understanding behavior of approximate operators when composed

-      Memory allocation to operators in a plan, given per operatormemory-accuracy curve

-      Best query plan, assuming best memory allocation

-      Multiple (weighted) queries sharing resources

Operator tells you how itÕs approximating, maybe outputsconfidence intervals.  If weunderstand how approximate operators compose, how do I allocate memory g! loballyto maximize accuracy?  Then, how doI choose the best query plan, assuming best memory allocation?  Finally, how to I allocate memorybetween multiple queries or multiple query plans.

 

Mike:  What doaccuracy curves look like?  DonÕtreally know

Maier:  Couldyou do query checkpoints as XML? Jennifer: XML in DSMS is not about state storage Ð itÕs about a queryplan, or memory allocation, but not actual tuples. 

 

Answers tend to get less accu! rate as they flow up the queryplan in this scheme.   Lowerlevel operators could make use of state stored in higher level operators todecide what to discard / keep. 

 

Adaptive to resource allocation / memory, not adaptivity ofquery plans.  Not even reallyattacking problem of query plan generation Ð in initial implementation users implementjust query graphs.

 

Sam and Sirish

Tuples are all timestamped.

 

Windows are based on timestamps. (also can specify by # oftuples? or punc?)

&! nbsp;

Tuple timestamps are applied either when the data is created

in the source, or when the tuples arrive.

 

Time series analysis vs. most current

        reconstitution - twofunctions: one to time series, one to most recent

 

 

Yanlei

 

Maier:  What issent on to the person when a match is found?  Whole document

or some subset or the matching! path?  XQuery has a construction phase --once   you know a documentmatches, you have to build the result. Then the result

has to be delivered to the user.  Delivery seems to be the bottleneck -- probably

need distributed routing / delivery.

 

Maier:  Is itnecessary to find all matches or just one match?  Need to support both. NFA supports both.  Can we optimize in the case where you just want to  find one?  Not really -- don't prune the NFA to eliminate paths you'vealready navigated.

 

XML parsing is bottleneck! Not filter cost.  Result collection is more expensivethan filter too, in YFilter (not Xfilter or hybrid scheme (RajeevRastogi?).)  

 

Maier:  Parsingis a problem in their world too. Found that combining 10 documents

into a single document yielded a significant performancebenefit.  Theory: symbol

table construction is important.

 

Yanlei:  Javaparsers 60-80ms, C parsers 6-10 ms.

 

Maier:  This isa stream shredder?  Yup.

 

Jennifer:  Thisreally looks like a query plan. Yup.

 

Delivery seems to dominate result construction.

 

 

Open Questions:!

 

Streams benchmark?

Streaming data model? Can we converge on a data / query model we agree on?

 

 

Benchmark might be a good starting point -- we'll have toagree what queries

to run and what data to run them over.

 

Stream Taxonomy

 

 

Sensors

ClickStream

Networking Monitoring and Routers

Call Record Stuff

Financial

Baseball (Event Streams)

XML dissemination Pub/Sub

# Of Clients

1-1000

100

1

2

10^7

 

10^6

Delivery Requirements

 

 

 

 

 

 

 

Content Complexity

low

low (homog)

med

low

low

med (hetero)

high

Query Types

relational, time-series, alerts

data-mining

relational, time-series, data-mining, alerts

data-mining

alerts, time series

xquery, aggregates

filtering, alerts, excerpting, restructuring

(not aggregating or combining)

Distribution

 

 

 

 

 

 

 

Stream Semantics

 

 

 

 

 

stream of xml fragments

stream of xml

Data Arrival Rate

low-high

med

very high

high

med

low-med

med

Data Delivery

both

push

push

push

both

both

push

Query Flux

?

low

low

low

?

?

?

Query Load

?

med

high

low

high

?

high

Accuracy Requirements

low; variable

low

med?

low (load), high (fraud)

high

?

high

Lossiness

yes

no

no

no

no

no

no

rate flux

 

 

 

 

 

 

 

 

 

 

Glossary

 

Band Join

Sliding Window Join

Window Join

Acti! ve/Inactive Queries

Transfer Queries: Logging a view

Predefined == queries exist before data

Ad-hoc == streams exist before queries

Flying join