References to the XML Files and DTD Schemas of the Movies Database

Updated 28 April 2004

The XML version of the movies database should contain 7 XML files, with their corresponding dtd files. The intent is to eventually make them available in .rdf form as well, for the DAML Ontoagents project. Background about these files is found in a documentation file for the earlier TeX and HTML versions.

The files are named data#.xml or data#.dtd, where data is the type of data they contain and # is the version number of the master file they are derived from. The master files are kept on the Varese computer in .doc or .txt formats. Errors in the .xml files were corrected using the XLint program of Vincent Chu and Juan Arguello.
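The naming rule can be sketched in Python as follows. This is only an illustration: the function name parse_name and the exact pattern (lowercase letters, then digits, then .xml or .dtd) are assumptions inferred from the file names listed in this document.

```python
import re

# Assumed pattern: a lowercase data-type name, a numeric version, then .xml or .dtd.
_NAME = re.compile(r"^([a-z]+)(\d+)\.(xml|dtd)$")


def parse_name(filename):
    """Split a name such as 'mains243.xml' into (data, version, extension).

    The version is kept as a string so that leading zeros, as in
    'remakes05.xml', are preserved.
    """
    match = _NAME.match(filename)
    if match is None:
        raise ValueError("not a data#.xml or data#.dtd name: " + filename)
    return match.groups()


if __name__ == "__main__":
    for name in ("mains243.xml", "remakes05.xml", "cast93.dtd"):
        print(name, parse_name(name))
```

Keeping the version as text rather than an integer is a deliberate choice here, because versions such as 05 and 00 appear in the list with leading zeros.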

  1. mains243.xml; main list of about 12 000 movies, fully converted and validated.
    main32.dtd; dtd for main list of movies, not yet validated.
  2. people55.xml; list of about 3 500 important people in the movies, fully converted and checked with IE.
    peopl11.dtd; dtd for people, not yet validated.
  3. actors63.xml; list of about 6800 movie actors, converted and completely checked with IE.
    actor6.dtd; dtd for actors, not yet validated.
  4. casts124.xml; cast listings with about 48 000 entries showing actors and their roles in about 9000 movies by 2700 directors, converted and checked with IE.
    cast93.dtd; dtd for casts, updated, not yet validated.
  5. remakes05.xml; linkages for about 1300 remade movies, converted and completely checked.
    remake4.dtd; dtd for remakes, not yet validated.
  6. studios00.xml; list of studios, not yet converted.
    studios01.dtd; dtd for studios, not yet validated.
  7. codes09.xml; code tables for variables used in various movie files.
    codes00.dtd; dtd for codes, not yet created.
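
Several files above are described as checked but with a dtd that is not yet validated. A check of well-formedness alone, without any dtd, can be sketched with the Python standard library. The function name is_well_formed and the element names in the sample strings are hypothetical; this sketch does not validate a file against its dtd, which the standard library parser cannot do.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, False otherwise.

    This only checks syntax (matching tags, a single root, and so on).
    It does NOT check the document against a dtd.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # Hypothetical element names, used only to exercise the check.
    print(is_well_formed("<movies><film><t>Ben Hur</t></film></movies>"))
    print(is_well_formed("<movies><film></movies>"))
```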

Other files

Much of this material has been integrated into codes.xml. See the file describing earlier versions, doc.doc, for missing codes.

2.11 -- REFERENCES --

Books that provided material for this database are listed within this documentation file as Appendix A.

2.12 -- GEOGRAPHY --

Codes for countries and origins are listed within this documentation file as Section 4.3: doc.html GEO.

2.13 -- CATEGORIES --

Codes for movie categories are listed within this documentation file as Section 4.4: doc.html CATS.

2.14 -- COLOR-CODES --

Codes for color processes used for movies are listed within this documentation file as Section 4.5: doc.html COLS.

2.15 -- ROLE-TYPES --

Codes that specify role-types for actors are listed in the preamble for casts.html ROLES.


Codes that identify subfields in various files are listed within this documentation file as Section 4.2: doc.html FIELDS.

2.17 -- AWARD TYPES --

Lists the award types used in MAIN, ACTORS, and PEOPLE, with the organizations that award them and the span of years in which they were awarded.

2.19 -- IMAGES --

There is a small collection of .tiff files for actors and directors. They are kept individually in an images subdirectory.

2.20 -- ICONS --

There are about a dozen icons to be used to identify subfiles. Some of them come from the New Yorker Magazine, Jan. 1993. They are kept individually in an icons subdirectory.

Appendix A: References

Books, etc.

Books about Actors

Web pages

Electronic material

BOOKS I have for Movie Stories:


This section refers to the original HTML files. The notes are still being developed.

To convert the source files from HTML format to another type of database:

(We use [] to denote HTML angle brackets, so [td] stands for the HTML tag written with < and >.)

  1. Remove header notes.
  2. Remove miscellaneous HTML commands, such as [HTML], [/HTML], [BODY], [/BODY], [HR], ...
  3. For relational files, ignore all lines starting with [tr][th]. These are header lines suitable for schema definitions. They could also become the roots of large director objects for the main and cast files.
  4. Remove tabs and carriage returns. All content lines end with [td]|.
  5. Records are divided into fields by [td], as documented in the file schemas above. A space follows each [td] entry, preceding the content of the next field. Missing fields are indicated by two adjacent markers, `[td] [td]', or by `[td] dummy entry[td]', as indicated in the file descriptions above.
  6. Many fields can have multiple entries. Simple relational transforms may drop such fields; others may require normalized sub-relations.
    Multiple values in a field are separated by
    1. `,' if the values are of the same type, as in [td] Romt, Dram[td]
    2. `;' or `:' if they are of different types, as in W(Ben Hecht; AAN)

If both `;' and `,' appear in a field, then the `;' typically distinguishes a major group relative to the `,' value separator, as [td] island, South Pacific; court, SF, CA[td].
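
The field and value separators described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the conversion program used for the XML files. The function names split_record and split_values are hypothetical. The code uses the real HTML tag with angle brackets where this document writes [td], and it assumes that anything before the first such tag on a content line (for example a row tag) can be discarded.

```python
import re


def split_record(line):
    """Split one content line into a list of field strings.

    Tabs and line breaks are removed first. The text before the first
    <td> is dropped (assumed to be empty or a row tag), and so is the
    trailing "|" that closes each content line. A missing field comes
    back as an empty string.
    """
    line = re.sub(r"[\t\r\n]", "", line)
    parts = re.split(r"<td>", line, flags=re.IGNORECASE)
    fields = parts[1:]
    if fields and fields[-1].strip() == "|":
        fields.pop()
    return [field.strip() for field in fields]


def split_values(field):
    """Split a plain multi-valued field into groups of values.

    `;' separates major groups and `,' separates values in a group.
    Not meant for Note fields, where `;' can occur inside parentheses.
    """
    groups = []
    for group in field.split(";"):
        values = [value.strip() for value in group.split(",") if value.strip()]
        if values:
            groups.append(values)
    return groups


if __name__ == "__main__":
    sample = "<tr><td> Romt, Dram<td> <td> island, South Pacific; court, SF, CA<td>|\r\n"
    for field in split_record(sample):
        print(repr(field), split_values(field))
```

Returning a list of groups, even for a field with a single group, keeps the result shape uniform; a relational transform that wants one value per row can then flatten it as needed.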

  1. The Note fields may contain a variable number of entries of different types, each of the form
    TypeCode(field), as in W(Eliot Stannard)
  2. When names of producers, writers, etc., in fields do not contain blanks, as in [td] P:A.Hughes[td], the name exists in the people.html file and can be used as an interfile reference.
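
The last two rules can be sketched in Python as follows. The function names parse_note and people_key are hypothetical, and two points are assumptions drawn only from the examples in this document: that a type code consists of letters, and that a credit field has the layout code, colon, name, as in P:A.Hughes.

```python
import re

# Assumed shape of a Note entry: a type code made of letters, then
# a parenthesised body with no nested parentheses.
_ENTRY = re.compile(r"([A-Za-z]+)\(([^()]*)\)")


def parse_note(note):
    """Return a list of (type_code, parts) for each TypeCode(field) entry.

    The body is split on `;' so that W(Ben Hecht; AAN) yields the
    parts 'Ben Hecht' and 'AAN'.
    """
    entries = []
    for code, body in _ENTRY.findall(note):
        parts = [part.strip() for part in body.split(";")]
        entries.append((code, parts))
    return entries


def people_key(field):
    """Return the name to look up in the people file, or None.

    A name without blanks is treated as an interfile reference;
    a name containing a blank is not.
    """
    code, sep, name = field.strip().partition(":")
    name = name.strip()
    if not sep or not name or " " in name:
        return None
    return name


if __name__ == "__main__":
    print(parse_note("W(Eliot Stannard)"))
    print(parse_note("W(Ben Hecht; AAN)"))
    print(people_key(" P:A.Hughes"))
    print(people_key("P:Howard Hughes"))
```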