Gio's Movie files: DOC

Movies Database Documentation

This the Master file on [Varese]/MyDocs/mov/xml/doc.html; copyright 1996, 1997, 1998, 2004 by Gio Wiederhold.  Free use is permitted with identification of the source.


Loaded into DB web files 25 September 1997. Updated November 1999, June 2004. Fixes to XML linkage made Sept.2001, May 2003.

This material was entered by Gio Wiederhold

The initial objective was as to provide test data for students, to allow non-trivial exercises. There are fields suitable for complex joins, outerjoins, temporal joins, and recursion. It has also been used to investigate object-structured database technologies.

This is a HTML version of the description of the movies database. It was based originally on the description from the Kamens and Wiederhold temporal paper. It also contains literature references in appendix A.

The HTML files are no longer being updated.  Starting Nov.1999 and completed in 2004 the same and more information is available in XML form. The working documentation for the full XML conversion is dtd.html, with links to the current XML files and the documents describing the XML files.

The XML versions may undergo further restructuring to explore the most effective way to present XML and eventually RDF information.

While the description below refers to the original SQL files, the TeX version, and the subsequent HTML version, the meaning of the fields has remained substantially the same.

1. Introduction

The database was used originally in implementing the temporal queries in SQL is Wiederhold's movies database (see [21]) and for a number of other projects, both in Stanford classes and by research at the Un.of Maryland and others. The database allows testing of theories about an implementation.  A `remake' file allows testing of recursive procedures. Since it contains "real data", it allows checking of results for semantic as well as syntactic correctness. Finally, the tables in the database consist of both journal and history relations (see 21, section 3), allowing us to test all of the different elements of an implementation.

The Overview section (2) below contains a summary of all the files and the URL pointers to them, or their parts or sections; see Section 3 for a full schema description.

A number of the fields are unsuitable for relational implementation, because they contain sets of values. These are best used in an object-capable implementation. The schema describes when such sets are to be expected. Encodings for several of the fields are listed in Section 4.

At the end of this document, in Appendix B., are some conversion rules for converting the HTML format to fields for relational or object databases.

2. Overview:

The central file (MAIN) is a list of movies, each with a unique identifier. These identifiers may change in successive versions. The actors (CAST) for those movies are listed with their roles in a distinct file. More information about individual actors (ACTORS) is in a third file. All directors in MAIN are listed in a fourth file (PEOPLE), with a number of important producers, writers, and cinematographers. A fifth file (REMAKES) links movies that were copied to a substantial extent from each other. The sixth file (STUDIOS) provides some information about studios show in MAIN. This documentation file provides supplementary information, and is an essential part of the database. Some images are also available, but not on-line now. There are many cross-linking names throughout the files, so many in fact that web browsers have choked when trying to make them live. Currently we simply try to be careful in naming films and people consistently. More detail follows below. Counts have been made using emacs `count-matches'.

2.1 -- MOVIES --

The main movies file is main.html. It contains 11435 `tr/td' table row entries (July 1999). It is complete for Hitchcock movies and TV shows, and has most films that can be related by topic, actors, director, history, and such to those films. It has also many films corresponding to a variety of interests of mine, former students etc., with an emphasis on historical value. MAIN is the "central" relation in the database; it is a journal relation that contains information about the movies themselves. The relation contains information about movies such as their titles, type, directors, and producers, as well as their year of release.
There is also a subset file of 82 Hitchcock full length Movies.

2.2 -- ACTORS --

The file actors.html has 6813 `tr/td' table row entries (July.1999) for many of the actors appearing in CASTS.
Also First part and Second part of actors.html, not maintained.
The key of the relation is "stagename", and there are temporal intervals indicating the dates that the actor worked and the actor's lifetime. Other information in this relation is the actor's real name, background, and the type of roles he/she typically plays. References to images are kept here too.

2.3 -- DIRECTORS --

in file people.html The file lists all directors, as well as some other movie people, as producers and cinematographers (A total of 3290 `tr/td' table row entries, 3011 `@' directors as of July 1999).
The directors table is similar to the actors table in that it contains temporal intervals for when the director worked and when he/she lived. The key of the relation is the field "name", which is the name under which the director directed. Director's key names do not contain any blanks. Typically the last name is used, when needed prefaced by an initial. A secondary unique key is defined for each director, up to three letters, based on the initial letters of the first, middle, and last names. This key will provide HTML HREF linkages among many of the files. As with the actors table, this table also includes the real name of the director among its fields ("lastname" and "firstname"). It also contains important producers, cinematographers, musicians and composers, etc.

2.4 -- STUDIOS --

in file studios.html are important studios only (203 `tr/td' entries, sparse information).
The key of the studios relation is the name of the studio. The temporal information that is included is an interval indicating the years the studio was (or is) in operation, represented by the fields "startdate" and "enddate". This is a history relation.

2.5 -- CASTS --

in file casts.html This is a large (too large?) file documenting who acted in what role in which movie. (46 009 tr/td entries, only partial for movies and roletypes, July 1999). Casts is an association relation, linking actors with movies. The key of the relation is the catenation of the two fields "film_id" and "actor"; no temporal information is included in this relation.
This file was too big for Netscape in 1996, so that also five working subsets were made available, however these are not kept up-to-date.

  1. casts of Hitchcock films;
  2. casts Part 1, directors coded A-B;
  3. casts Part 2, directors coded C-H;
  4. casts Part 3, directors coded H-O;
  5. casts Part 4, directors coded P-Z;

The four parts are still large files (6000-15000 entries; castsCH includes Hitchcock films

2.6 -- REMAKES --

in file remakes.html (1278 `tr/td' entries in July 1999).
This table (which is not extensively used in the temporal DB paper) gives information about movies that are remakes of other movies. It is very useful to test recursion in databases.

2.7 -- SYNONYMS --

This file has been superseded. All entries are now in the main file, as Alt(T:...) in the notes field. synonyms.html, contained 379 entries in Sep 1997) This list relates to the MOVIES in main.html.
Some movies are known by alternate titles, and can be accessed indirectly via this file.

2.8 -- QUOTES --

in file casts.html QUOTES. A few (26) memorable quotes from movies are listed in quotes.html.|1.

2.9 -- AWARDS --

in file actors.html. Types of awards and the awarding agencies are in awtypes.html.

2.10 -- AWARDS-RECEIVED --

is no longer a distinct file Awards received for special occasions are listed with individual entries in the files for ACTORS (actors.html).or MOVIE PEOPLE (people.html). Regular awards associated with a particular movies are given in MOVIES (main.html) and with a particular performance are listed in CASTS (casts.html).

2.11 -- REFERENCES --

Books that provided material for this database are listed within this documentation file as Appendix A.

2.12 -- GEOGRAPHY --

Codes for countries and origins are listed within this documentation file as section 4.3: doc.html GEO.

2.13 -- CATEGORIES --

Codes for movie categories are listed within this documentation file as Section 4.4: doc.html CATS.

2.14 -- COLOR-CODES --

Codes for color processes used for movies are listed within this documentation file as Section 4.5: doc.html COLS.

2.15 -- ROLE-TYPES --

Codes that specify role-types for actors

are listed in the preamble for casts.html ROLES.

2.16 -- FIELD-IDENTIFIERS --

Codes that identify subfields in various files are listed within this documentation file as Section 4.2: doc.html FIELDS.

2.17 - AWARD TYPES --

Lists the award types used in MAIN, ACTORS, and PEOPLE, with the organizations who award them, and the span of years they were awarded.

2.19 -- IMAGES --

there is a small collection of .tiff files for actors and directors. They are kept individually in an images subdirectory.

2.20 -- ICONS --

There are about a dozen icons to be used to identify subfiles. Some of them come from the New Yorker Magazine Jan.1993. There are kept individually in an icons subdirectory.


3. Schema Definition for the Movies Database

Here we give a detailed description of the schema of the movies database, which is used for all examples in this paper and was used to implement the temporal SQL additions. General descriptions are given in Section 2, above.
This file is being updated to desctribe the HTML version. Where updates were made, the old material is in curly {brackets}.

3.1 The MOVIES Table

Col-Name = Description
. There is a distinct table for each director (Hitchcock has multiple tables, one for early silent, one for British, one for American, and one for TV movies).
The tables are broken up by year of first known film by the directors. There are some break and header records for each year.
Each director table has two types of records:

  1. one header record for the director, with the director id, as shown in people, the first year known for movies by that director, prefixed by an @ symbol, matching the people entry, and the standard name for the director, also matching the people entry.
    The remainder shows the format for the data records that follow below.. The note field is often used to describe the set of detail records
    For movies where the director in not known there is a dummy entry, either by topic (Unknown) or by year (UnYear), as shown in the people file.
  2. any number of records, one per film, formatted as shown below.

film_id =

title =

year =

director =

producers=

studios =

prc =

cat =

awards =

lc =

notes =

3.2. The ACTORS Table

There is one record for each actor listed, but not all actors listed in CAST are documented. There are also break and header records for each letter of the alphabet.
Col-Name = Description

stagenm =

dowstrt =

dowend =

birthnm =

firstnm =

gender =

dob =

dod =

type =

origin =

photo =

notes =

3.3 The PEOPLE Table

Directors are the major subset of the general people.html table. Other entries are significant producers, writers, art directors and some authors. Being a director is indicated in the Pcode field, and has some effect on other fields. There are also break and header records for each letter of the alphabet.
Col-Name = Description

id-name =

Pcode =

Did =

yearstart=

yearend=

lastnm =

firstnm =

dob =

dod =

backgrd =

notes =

3.4 The STUDIOS Table

Col-Name = Description ----------- -------------

name =

company =

city =

country =

fddate =

enddate =

founder =

successor =

notes =

3.5 The CASTS Table

The large CAST file is broken up into sections by initial letter(s) of the directors identifying code. In each section will be a number of directors, ordered by code. There is a distinct table for each director.
Col-Name = Description

Each director table has two types of entries

  1. one header record giving the director's id, name, and format information.
  2. multiple records for each movie and listed actor. There are no headers for distinct movies.

film_id =

title =

actor =

roletype=

role =

awards =

notes =

3.6 The REMAKES Table

Col-Name = Description

film_id =

title =

year =

part =

wasfilm =

wastitle=

wasyear =

3.7 -- SYNONYMS table

This file has been superseded. All entries are now in the main file, as Alt(T:...) in the notes field. Col-Name = Description

film-id =

s-title =

s-country =

p-country =

p-title =

3.8 -- QUOTES table

Col-Name = Description ----------- -----------

film-id =

title =

speaker =

role =

listenr =

quote =

3.9 -- AWARDS table

Col-Name = Description ----------- -----------

award =

agency =

place =

3.10 -- AWARDS table

Col-Name = Description ----------- -----------

recepnt =

award =

year =

reason =

notes =

3.11 -- REFERENCES table

Col-Name = Description ----------- -----------

no. =

author =

title =

pub-inf =

3.12 -- GEOGRAPHY table

Col-Name = Description ----------- -----------

code =

. See 4.1 for the encoding used. country-name =

c-adjective =

3.13 -- CATEGORIES table

Col-Name = Description


ctcode = Four-letter code

Movie categories

code

category

|

code

category

|

code

category

|

Ctxx

uncategorized

|

 

 

 

 

 

 

Actn

violence

|

Advt

adventure

|

AvGa

Avant Garde

|

Camp

now - camp

|

Cart

cartoon

|

CnR

Cops and Robbers

|

Comd

comedy

|

 

 

 

 

 

 

Disa

Disaster

|

Docu

documentary

|

Dram

drama

|

Epic

epic

|

Faml

family

|

Hist

history

|

Horr

horror

|

Musc

musical

|

Myst

mystery

|

Noir

black

|

Porn

pornography

|

Romt

romantic

|

ScFi

science fiction

|

Surl

sureal

|

Susp

thriller

|

West

western

|

 

 

 

 

 

 

3.14 -- COLOR-CODES table

Col-Name = Description ----------- -----------

color-code =

full =

·         specific color processes

They are listed in the preamble for movies.macros.

3.15 ROLE-TYPES table

Col-Name = Description ----------- -----------

role-codes =

3.16 FIELD-IDENTIFIERS table

Col-Name = Description ----------- -----------

codes =


4: CODE TABLES

A fair amount of the information is encoded for consistency of reference. Directors' names are always treated as codes, and many other movie people as well. Names with out spaces are codes, and can be found in the PEOPLE relation. Actors names are treated as codes as well, although here first names have been retained. Many actors can be found in the ACTORS relation.
All movies have been assigned a code by catenating a director's identifier, found in the PEOPLE relation with sequence digits.
Several code tables appear below, other used are

Remote code tables

Local code tables

4.1 -- FIELD-DESIGNATORS --

These codes are used in certain filed to further identify the contents.

Check this, much changed when moving to HTML.

code

definition

|

T:

film title

|

T2:

redefinition of title in main.html SYNS.

|

T3:

title used for locale file in MAIN

|

T4:

title used for License plate list in MAIN

|

T5:

title

|

T6:

title used in casts.html SAYINGS.

|

T6:

title used in quotes.html.

|

TS:

not sure of actor spelling

used in CASTS

obsolete

|

TZ:

title from Movies-dir.html

|

 

 

TZ:

not sure if actor in this film

used in CASTS

|

 

P:

producer in PEOPLE

 

|

 

PN:

producer full name

not yet classified

|

 

PU:

unknown producer

not yet classified

|

 

St:

listed studio

not yet consistent

|

 

SN:

studio name

|

 

 

SU:

studio name unknown

|

 

 

SL:

country or city of studio

|

 

 

SD:

distributor

old codes: Dtr, Ds, Dis{\Dtr

|

 

R:

role

used in CASTS

|

 

RZ:

role uncertain

used in CASTS

|

 

RU:

role unknown

used in CASTS

|

 

RN:

only name in role

used in CASTS

|

 

RS:

spelling of actor's name unsure

used in CASTS

|

 

D:

|

 

 

 

DN:

director with full name, may not be in PEOPLE file

was \DiF

|

 

DU:

|

 

 

 

4.2 -- FIELD-IDENTIFIERS --

Ancillary information, and social and professional relationships is frequently given as sub-fields in notes. Below are the prefixes used for such variable information. Some refer to candidate entries in other files. Many occur in the MAIN file, in the final NOTES field, others in the PEOPLE and ACTORS files. There should be only one sub- field of any type in the notes field, but they may have contain multiple entries If the contents references entries in a file, it is listed, but such a reference is only assured if coded with a colon ( : ). Some of these designations are repeated with the files themselves.

Code

Contents

Ocurrs in

References what

|

Alias

other professional names

PEOPLE, ACTORS

none

|

Also

other professions

MAIN, ACTORS

PEOPLE

|

Alt

alternative title

MAIN (SYNONYMS)

none

|

B

Book Author

MAIN, ACTORS

PEOPLE

|

BS

was Brother or Sister of, use Si

PEOPLE, ACTORS

PEOPLE

|

C

Cinematographer

MAIN, PEOPLE, ACTORS

PEOPLE

|

CoD

CoDirector

MAIN

PEOPLE

|

Cost

Cost to make film [M/K]

MAIN

none

|

Ch

child of

PEOPLE ACTORS

PEOPLE, ACTORS

|

Chor

choreographer

MAIN, ACTORS

PEOPLE

|

Dtr

distributor

MAIN

STUDIO

|

Er

possible error, verify

all

none

|

Fd

Founded:

STUDIOS

PEOPLE , ACTORS

|

Fdr

Founder of:

PEOPLE , ACTORS

STUDIOS

|

Gn

F

PEOPLE

used for Female producer or director

|

Inc

Income from film [M/K]

MAIN

none

|

Inf

influenced

PEOPLE, ACTORS

PEOPLE , ACTORS

|

Inb

influenced by

PEOPLE, ACTORS

PEOPLE, ACTORS

|

Lw

Lived with

PEOPLE, ACTORS

PEOPLE, ACTORS

|

M

composer

MAIN, PEOPLE, ACTORS

PEOPLE

|

Mt

Married to

PEOPLE, ACTORS

PEOPLE, ACTORS

|

Nt

note

all

none

|

P

playwright

MAIN, ACTORS

PEOPLE

|

Pt

parent of

PEOPLE, ACTORS

PEOPLE, ACTORS

|

Ph

unusual physical characteristics,
as height, weight, ..

ACTORS

none

|

R

ARt director, now V

MAIN, PEOPLE, ACTORS

PEOPLE

|

Seen

[dd[MMM]]yy

MAIN

none

|

St

studio

MAIN, ACTORS

PEOPLE

|

Si

 

PEOPLE, ACTORS

PEOPLE, ACTORS

|

Ty

typical style

PEOPLE, ACTORS

Table 4.4

|

V

Visual art director

MAIN, ACTORS, PEOPLE

PEOPLE

|

VT

have video tape (number)

MAIN

none

|

W

writer

MAIN, PEOPLE, ACTORS

PEOPLE

|

Ww

worked with

PEOPLE, ACTORS

PEOPLE, ACTORS

|

4.3 -- GEOGRAPHY --

Codes for countries and origins are listed below. It is still incomplete.

Common countries for movie making are listed first, followed by other countries in alphabetical order

. code

country

adjective

|

code

country

adjective

|

Am

USA

American

|

 

 

 

 

Br

not used

British

|

GB

Great Britain

not used

|

Fr

France

French

|

 

 

 

 

Ge

Germany

German

|

 

 

 

 

It

Italy

Italian

|

 

 

 

 

Ja

Japan

Japanese

|

 

 

 

 

Alphabetical by code

|

 

 

 

 

Ar

Argentinia

Argentine

|

 

 

 

 

Au

Australia

Australian

|

 

 

 

 

Be

Belgium

Belgian

|

 

 

 

 

Bz

Brazil

Brazilian

|

 

 

 

 

Ca

Canada

Canadian

|

 

 

 

 

Ch

China, PRC

Chinese

|

 

 

 

 

Cz

Czechoslovakia

Czech

|

 

 

 

 

Da

Denmark

Danish

|

 

 

 

 

Gr

Greece

Greek

|

 

 

 

 

Du

Holland

Dutch

|

 

 

 

 

Hu

Hungary

Hungarian

|

 

 

 

 

In

India

Indian

|

 

 

 

 

Ir

Ireland

Irish

|

 

 

 

 

Me

Mexico

Mexican

|

 

 

 

 

Os

Austria

Austrian

|

 

 

 

 

Pe

Peru

Peruvian

|

 

 

 

 

Ru

USSR, Russia

Russian

|

 

 

 

 

Sp

Spain

Spanish

|

 

 

 

 

SA

South-Africa

South-African

|

 

 

 

 

Yu

Yugoslavia

Yugoslav

|

 

 

 

 

Zw

Switzerland

Swiss

|

 

 

 

 

4.4 -- CATEGORIES --

Codes for movie categories are listed below. Mainly used in MAIN, also content of Type fields. Multiple entries are possible.

code

category

|

Susp

thriller

|

CnR

cops and robbers

|

Dram

drama

|

West

western

|

Myst

mystery

|

 

 

 

S.F.

science fiction

|

Advt

adventure

|

Horr

horror

|

Romt

romantic

|

Comd

comedy

|

Musc

musical

|

Docu

documentary

|

Porn

pornography, including soft

|

Noir

black

|

BioP

biographical Picture

|

TV

TV show

|

TVs

TV series

|

TVm

TV miniseries

|

4.5 -- COLOR-CODES --

Codes for color processes used for movies in MAIN are listed below.

code

full name

description

|

prc

unknown

 

|

col

color

color film, common after 1955

|

bnw

black-and-white

b-w film common before 1945

|

sbw

silent

silent black-and-white film

|

cld

colored

black-and-white film recolored

|

Cart

cartoon

Cartoons are normally colored

|

Tcol

Technicolor

high quality color

|

Ecol

Eastmancolor

color by Kodak

N\t(unstable)

|

Wcol

Warnercolor

 

|

 

Mcol

Metrocolor

Color by MGM

|

 

Acol

Anscocolor

color by Kodak?

|

 

Agcol

Agfacolor

 

|

 

Fcol

Fujicolor

 

|

 

DeLuxe

DeLuxe

low cost color

|

 

DuArt

DuArt

color

|

 

Movielab

MovieLab

color

|

 

CS

Cinemascope

widescreen, mostly color

|

 

Trama

Technirama

widescreen color

|

 

Pan

PanaVision

 

|

 

TV

film made for TV

various processes

|

 

Vst

Vistavision

 

|

 


Appendix A: References

Books, etc.

Electronic material

Books about Actors

Web pages

BOOKS I have for Movie Stories:


Appendix B: CONVERSION

This section refers to the original HTML files. The notes are still being developed.

To convert the source files from HTML format to another type of database:

(we use [] to denote HTML `french' brackets.)

  1. remove header notes
  2. remove miscellaneous HTML commands, as [HTML], [/HTML], [BODY], [/BODY], [HR], ...
  3. for relational files ignore all lines starting with [tr][th]. These are header lines suitable for schema definitions. They could also become the roots of large director objects for the main and cast files.
  4. remove tabs and carriage returns. All content lines end with [td]| .
  5. records are divided into fields, as documented in the file schemas above, by [td]. A space follows [td] entries, preceding the content of the next field. Missing fields are indicated by two `[td] [td]', or by `[td] dummy entry[td]', as indicated in the file descriptions above.
  6. Many fields can have multiple entries. Simple relational transforms may drop such fields, others may require normalized sub-relations.
    Multiple values in a field a separated by
    1. 1. `,' if the values are of the same type, as [td] Romt, Dram[td]
    2. 2. `;' or `:' if they are of different types, as W(Ben Hecht; AAN)

If both `;' and `,' appear in a field, then the `;' typically distinguishes a major group relative to the `,' value separator, as [td] island, South Pacific; court, SF, CA[td].

  1. In the Note fields may be a variable number of different types of entries, each of the form
    TypeCode(field), as W(Eliot Stannard)
  2. When names of producers, writers, etc, in fields do not contain blanks, as [td] P:A.Hughes[td], then the name exists in the people.html file, and can be used as an interfile reference.