Profiting from Data Mining
Gio Wiederhold
November 2003
Steps needed to profit
Obtaining relevant data
Always incomplete
Extracting relationships
Imputing causality
Finding applicability
Determining leverage points
Inventing candidate actions
Assessing likely outcomes and benefits
Selecting action to be taken
Measuring the outcome
-> Collecting data for next round
Today's Problem: Disjointness
Database administrators
Focus on data collection, organization, currency
Analysts
Focus on slicing, dicing, relationships
Middle managers
Focus on their costs, profits
MBAs
Focus on business models, planning
Executives
Must make decisions based on diverse inputs
1. Data Collection
Two choices
(rare) Collect data specifically for analysis
allows careful design --
model causes and effects
Purchase = f(price, color, size, customer income, gender, ...)
costly
often small to make collection manageable
imposes delays
(common) Use data collected for other purposes
take advantage of what is readily available
low cost
filtering, reformatting, integration
incomplete - rarely covers all causes / effects
biased -- missing categories
only people with phones, cars -- shopping in supermarkets
1a. Data Integration
Needed when sources have inadequate coverage
in distinct DBs for
 Prices,  Number purchased
Customer segments (supermarket, stores, on-line)
implies some expectations
append attributes where keys match: Joe
include semantic match  Joe = 012 34 567
append rows where key types match: customer
include semantic match  customer = owner
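The two integration patterns above can be sketched in Python. This is a minimal illustration under stated assumptions, not a prescribed method: the table contents, the `ID_OF_NAME` and `TYPE_SYNONYMS` mappings, and the function names are hypothetical. Only the matches Joe = 012 34 567 and customer = owner come from the slide.

```python
# Hypothetical semantic matches; only these two pairs come from the slide.
ID_OF_NAME = {"Joe": "012 34 567"}     # key values that denote the same entity
TYPE_SYNONYMS = {"owner": "customer"}  # key types that denote the same concept


def canonical_key(key):
    """Map a key value to its canonical form using the semantic matches."""
    return ID_OF_NAME.get(key, key)


def append_attributes(left, right):
    """Append attributes: merge rows whose (canonical) keys match."""
    right_by_key = {canonical_key(k): v for k, v in right.items()}
    merged = {}
    for key, attrs in left.items():
        ckey = canonical_key(key)
        if ckey in right_by_key:
            merged[ckey] = {**attrs, **right_by_key[ckey]}
    return merged


def append_rows(tables):
    """Append rows: stack tables whose (canonical) key types match."""
    combined = {}
    for key_type, rows in tables:
        ctype = TYPE_SYNONYMS.get(key_type, key_type)
        combined.setdefault(ctype, []).extend(rows)
    return combined


# Made-up sample data from two distinct sources.
prices = {"Joe": {"price_paid": 3.50}}
purchases = {"012 34 567": {"number_purchased": 4}}
joined = append_attributes(prices, purchases)

stacked = append_rows([
    ("customer", [{"name": "Joe"}]),
    ("owner", [{"name": "Ann"}]),
])
```

Here `joined` holds one row for Joe that carries both the price and the number purchased, and `stacked` holds a single customer table with the rows of both sources.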
2. Data analysis
Find relationships
already known - ignore or adjust in next round
requires comparison with expert knowledge
now have quantification
unknown
uninteresting per expert
interesting per expert
3. Establish causality
Already known -- Prior Model
But is it complete, i.e., does it explain all effects?
Analyze relationships
 use expertise to decide direction
often obvious
"common world knowledge"
sometimes ambiguous
smoking -> cancer -> not-smoking
often major true cause not captured in data
food color 10%,
food price 20%,
buyer gender 2%
unknown  75%
guess: ethnicity, income
Establishing causality is risky
1.  Is a Volvo a safe car?
Change cause -> create effects
To use results of data mining
have to understand direction of relationships
4. Causes provide the leverage
Language of analyst / Language of modeling
Many causes -- independent variables
A few may be controllable
Some may be controlled by our competition
Others are forces-of-nature
Even more effects -- dependent variables
A few may be desired
Some may be disastrous
Many are poorly understood
Intermediate effects
Provide a means for measuring effectiveness
Allow correction of actions taken
5. Planning & Assessment
Analyze Alternatives
Current Capabilities
Future Expectations
Process tasks:
List resources
Enumerate alternatives
Prune alternatives
Compare alternatives
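The process tasks above can be sketched as a small pipeline. This is only an illustration: the `Alternative` class, the budget, and every figure are made-up assumptions, and ranking by net benefit is one possible comparison rule, not one the slides prescribe.

```python
from dataclasses import dataclass


@dataclass
class Alternative:
    name: str
    cost: float              # resources the alternative consumes
    expected_benefit: float  # assumed estimate, e.g. from a simulation


def prune(alternatives, budget):
    """Drop alternatives that exceed the available resources."""
    return [a for a in alternatives if a.cost <= budget]


def compare(alternatives):
    """Rank the remaining alternatives by net benefit, best first."""
    return sorted(
        alternatives,
        key=lambda a: a.expected_benefit - a.cost,
        reverse=True,
    )


budget = 100.0  # list resources (made-up figure)
candidates = [  # enumerate alternatives (made-up figures)
    Alternative("lower price", cost=40.0, expected_benefit=90.0),
    Alternative("new packaging", cost=70.0, expected_benefit=100.0),
    Alternative("national campaign", cost=250.0, expected_benefit=400.0),
]
ranked = compare(prune(candidates, budget))
```

With these figures the national campaign is pruned because it exceeds the budget, and the two remaining alternatives are ordered by net benefit.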
Prediction Requires Tools
Simulations predict
Back-of-the-envelope
Common
Adequate if model is simple
Assumptions are easily forgotten after some time, not distinguished from data: "Why are we doing this?"
Spreadsheets
Most common computing tool
Specialist modeler can help
New, recent data can be pasted in
Awkward for the tree of future alternatives
Constructed to order
Costly, powerful technology
Specialist modelers required
Expressive simulation languages
Requires specialists to set up, run, and rerun with new data
Simulation results: likelihoods
Next period alternatives
Simulation services
Wide variety, but common principle
Inputs -> Model -> Output (time, $, place, ...)
Spreadsheets
Identify independent, controllable, and resulting values
2. Execution specific to query: what-if assessment
may require HPC power for adequate response
3. Continuously executing: weather prediction
Search for best match (location, time)
4. Past simulation results collected for future use
Typically sparse -- the dimension of the futures is too large:
Tables in a design handbook: materials
Perform interpolations or extrapolations to match query parameters
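Matching a query against a sparse table of stored results can be sketched as follows. This is a minimal one-dimensional example that assumes linear behaviour between and beyond the stored points; the `lookup` function and the sample table are hypothetical, and a real handbook table would usually have several dimensions.

```python
from bisect import bisect_left


def lookup(table, x):
    """Estimate y at x from a sparse table of (x, y) results sorted by x.

    Interpolates linearly between the two neighbouring entries, and
    extrapolates linearly from the nearest two entries outside the range.
    """
    if len(table) < 2:
        raise ValueError("need at least two stored results")
    xs = [point[0] for point in table]
    i = bisect_left(xs, x)
    # Clamp so that (i - 1, i) is always a valid pair of neighbours.
    i = min(max(i, 1), len(table) - 1)
    (x0, y0), (x1, y1) = table[i - 1], table[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Made-up stored simulation results: (query parameter, simulated output).
stored = [(10.0, 200.0), (20.0, 260.0), (40.0, 300.0)]
inside = lookup(stored, 15.0)  # interpolated between the first two entries
beyond = lookup(stored, 50.0)  # extrapolated past the last entry
```

Extrapolated values deserve more caution than interpolated ones, since nothing in the stored results confirms that the trend continues outside the covered range.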
6. Specify Value of Effects
Still needed: Value of alternative outcomes
Decision maker / owner input
Benefits  and  Costs
Potential  Profit
Correct for risk, and adjust to present value
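One simple way to correct for risk and adjust to present value is sketched below. It assumes the cost is paid now, the benefit arrives after a number of years with a given probability, and nothing is received otherwise. The function name, the discount rate, and all figures are illustrative assumptions, not values from the slides.

```python
def present_value(benefit, cost, probability, rate, years):
    """Risk-corrected potential profit, discounted to the present.

    Assumes the cost is paid now and the benefit arrives after `years`
    with the given probability; otherwise nothing is received.
    """
    expected_benefit = probability * benefit                  # correct for risk
    discounted = expected_benefit / (1.0 + rate) ** years     # adjust to present value
    return discounted - cost


# Made-up figures: a 50% chance of 242 in two years, at a 10% discount rate.
value = present_value(benefit=242.0, cost=50.0, probability=0.5, rate=0.10, years=2)
```

With these figures the expected benefit of 121 discounts to about 100 today, leaving a potential profit of about 50 after the cost.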
Having it all together
Relationships from analyses of past data
Data representing the current state
List of actionable alternatives
Tree of subsequent alternatives
Probabilities of those alternatives
Values of the outcomes
Ability to predict the likelihood of futures
Vision: Putting it all together
Needed: Information Systems that also
project seamlessly into the Futures
Support of decision-making requires dealing with the futures, as well as the past
Databases deal well with the past
Streaming sensors supply current status
Spreadsheets, simulations deal with the likely futures
Future information systems should combine all these sources
Connecting it all
Build super systems
Coherent, consistent
Expensive
Unmaintainable
Too many cooks:
Database folk
Data miners
Analysts
Planners
Simulation specialists
Decision makers
Interfaces enable integration:
New: SimQL to access Simulations
Demonstration of SimQL
Information system
use of simulation results
Simulation results are mapped to
alternative courses of action
The information system should support a model that drives the computation and recomputation of likelihoods
Likelihoods change as the NOW point moves forward and eliminates earlier alternatives.
The likelihoods multiply out to the end-effects  
then their values can be applied to earlier nodes
Recomputation is needed
at the next time phase
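The computation described above can be sketched on a small tree of alternatives. This is a hedged illustration: the `Node` class, the tree, and all probabilities and values are made up, leaf names are assumed to be unique, and every branch point is treated as a chance node rather than as a choice the decision maker controls.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    probability: float = 1.0  # likelihood of reaching this node from its parent
    value: float = 0.0        # value of the outcome (used at leaves only)
    children: list = field(default_factory=list)


def leaf_likelihoods(node, reach=1.0):
    """Multiply likelihoods along each path down to the end-effects."""
    if not node.children:
        return {node.name: reach}
    result = {}
    for child in node.children:
        result.update(leaf_likelihoods(child, reach * child.probability))
    return result


def expected_value(node):
    """Apply leaf values back to earlier nodes as probability-weighted sums."""
    if not node.children:
        return node.value
    return sum(c.probability * expected_value(c) for c in node.children)


# Made-up tree: from NOW, branch A (0.6) splits again; branch B (0.4) ends.
a = Node("A", 0.6, children=[Node("A1", 0.5, 100.0), Node("A2", 0.5, 20.0)])
now = Node("now", children=[a, Node("B", 0.4, 10.0)])

before = leaf_likelihoods(now)   # likelihoods seen from the current NOW
value_now = expected_value(now)

# Next time phase: A has happened, so B is eliminated and A is the new NOW.
after = leaf_likelihoods(a)
value_after = expected_value(a)
```

Once the NOW point moves to A, the likelihoods of A1 and A2 rise from 0.3 to 0.5 each, and the expected value rises from 40 to 60, which is why recomputation is needed at each phase.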
Even the present needs SimQL
Integrative information systems: research questions
 What human interfaces can support the decision maker?
 How to move seamlessly from the past to the future?
 What system interfaces are good now and stay adaptable?
 How can multiple futures be managed (indexed)?
 How can multiple futures be compared, selected?
 How should joint uncertainty be computed?
 How can the NOW point be moved automatically?
SimQL research questions
How little of the model needs to be exposed?
How can defaults be set rationally?
How should expected execution cost be reported?
How should uncertainty be reported?
Are there differences among application areas that require different language structures?
Are there differences among application areas that require different language features?
How will the language interface support effective partitioning and distribution?
Moving to a Service Paradigm
Interfaces define service potentials
Server is an independent contractor, defines service
Client selects service, and specifies parameters
Server’s success depends on value provided
Some form of payment is due for services
Summary of SimQL
A new service for Decision Making:
follows database paradigm
(by about 25 years)
coherence in prediction
displacement of ad-hoc practices
seamless information integration
single paradigm for decision makers
simulation industry infrastructure
investment has a potential market
should follow the database industry model:
 Interfaces promote new industries
Summary:
Today decision-making support is disjoint: each community improves its own area and ignores the others
The decision maker has few tools
Coda:
 Put relevant work together and move on