CS145 Lecture Notes (1) -- Introductory Material
This lecture material complements the assigned readings: Chapter 1
of the textbook, and the Introduction to "SQL for Web Nerds" by Philip
Greenspun (linked from the Course Schedule). Please do the
readings.
Database Management System (DBMS):
Provides
efficient, reliable, convenient, and safe multi-user
storage of and access to massive amounts of persistent
data
Most familiar use: many web sites rely heavily on DBMS's. (And you
will build one!)
Examples: (solicit from class)
Non-web examples of DBMS's: (solicit from class)
Example: Online Bookseller
Data = information on books (including categories, bestsellers, etc.),
customers, pending orders, order histories, trends and preferences, etc.
- Massive data
Hundreds of gigabytes at a minimum for medium-size
bookseller, more if keep all order histories over all time, even more
if keep clickstream logs, even more if keep images of book covers and
sample pages
=> Far too big for memory
- Persistent data
Data outlives programs that operate on it
- Safe data
From hardware failures, software failures, power outages, ...
From malicious users
- Multi-user access
Many people/programs accessing same database, or
even same data, simultaneously
=> Need careful controls
Multi-user example and discussion
Jane and John both have ID number for bookseller gift certificate
(credit) of $200 they got as a wedding gift.
Jane @ her office: orders "Database Systems: The Complete Book" ($100)
prompt user for credit ID;
get credit from database;
if credit >= 100 then
credit := credit - 100;
issue order to mail book;
if OK then put new credit into database (else handle error);
else print "sorry"
John @ his office: orders "A First Course in Database Systems" ($85)
prompt user for credit ID;
get credit from database;
if credit >= 85 then
credit := credit - 85;
issue order to mail book;
if OK then put new credit into database (else handle error);
else print "sorry"
Initial credit = $200
Ending credit = ???
- Appears similar to concurrent programming problems (synchronization,
semaphores, etc.)
BUT: data not main-memory variables
- Appears similar to file system concurrent access
BUT: want to control at smaller granularity
- Also database may be distributed, replicated
- Convenient operations
- "Physical data independence"
- Logical data model -- storage scheme invisible
- High-level query language -- optimization and execution plan invisible
Simple commands to: find a specific book,
list all books in a certain category and price range, generate an
order history, produce sales figures grouped by state, etc.
=> Also unpredicted queries should be easy
- Efficient operations
- Thousands (or more) queries/updates per second
- Massive data => DBMS's carefully tuned for performance
- Cannot search all files in order to: get
price of one book, get all customers from northern California, get
bestselling books from last week
- Reliable system
99.999% uptime
A DBMS is a Software System
- Buy or download, install, set up for particular application
- Available for handhelds, PCs, clusters, mainframes, supercomputers
- Frequently used in conjunction with application
server or other middleware, or exposed as a web service
- Basic software infrastructure: (figure)
Mainstream DBMS Vendors and Open-Source
- Oracle
- IBM: DB2, Informix
- Microsoft: SQL Server, Access
- Sybase
- MySQL (Sun)
- PostgreSQL
- SQLite
All are "relational" (or "object-relational") database systems at
their core.
All commercial vendors also provide XML support, either
built-in or built-on-top.
Terminology and Basic Ideas
Data model
- Describes conceptual structuring of data stored in database
- Example: Data model is set of records. E.g., records
might each have student-ID, name, address, courses, photo
- Example: Data is XML document. E.g., document might
contain list of books with ISBN numbers as ID's, titles and author
names as subelements
- Example: Data is graph. E.g., nodes might represent
cities, edges represent airline routes
Schema versus data
- Schema describes how data is to be structured -- defined at set-up
time, rarely changes (part of the "metadata")
- Data is actual "instance" of database, may change rapidly
- Compare to types and variables in programming languages
Data Definition Language (DDL)
- Commands for setting up schema of database
(add box to basic software infrastructure)
- Process of designing schema can be complex, may use design methodology
and/or tool
(add box to basic software infrastructure)
Data Manipulation Language (DML)
- Commands to manipulate data in database:
RETRIEVE, INSERT, DELETE, MODIFY
- Also called "query language"
People
- DBMS implementor: builds system (245, 346)
- Database designer: establishes schema (145)
- Database application developer: writes programs that query and modify data (145)
- Database administrator: loads data, tunes system, keeps whole thing running (145)
Nontraditional Data Management
- Real-time and historical "data streams": E.g., stock
monitoring, air traffic, telecom call records, network and system
monitoring, sensor data
- Distributed and heterogeneous data: E.g., collaborative
design, medical information systems, comparison shopping, querying
[not searching] the web
- Scientific data: E.g., Experimental results, satellite
feeds, telescopes
Towards the end of the course we will learn about systems being
developed to address some of these applications.