Creating the Synonym table
The synonym table was generated from a comma-separated version of an
early-1900s edition of Roget's Thesaurus, whose copyright has
expired. Each row in the table contains the root word (the thesaurus
entry), and a number of child words (the words listed as synonyms of the
word). CreateThesaurus.java is a simple Java program that handles this
parsing.
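As a minimal sketch of this parsing step (not the actual CreateThesaurus.java, which is not reproduced here), the following assumes one thesaurus entry per line, with the root word first and its synonyms after it, separated by commas. The class name SynonymRowSketch and the exact file layout are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the real CreateThesaurus.java may differ.
final class SynonymRowSketch {
    final String root;            // the thesaurus entry
    final List<String> children;  // the words listed as its synonyms

    SynonymRowSketch(String root, List<String> children) {
        this.root = root;
        this.children = children;
    }

    // Assumed input format: "root, synonym1, synonym2, ..."
    static SynonymRowSketch parseLine(String line) {
        String[] fields = line.split(",");
        String root = fields[0].trim();
        List<String> children = new ArrayList<>();
        for (int i = 1; i < fields.length; i++) {
            String word = fields[i].trim();
            if (!word.isEmpty()) {   // skip empty fields such as "a,,b"
                children.add(word);
            }
        }
        return new SynonymRowSketch(root, children);
    }
}
```

Calling SynonymRowSketch.parseLine("happy, glad, cheerful") would yield a row whose root is "happy" and whose children are "glad" and "cheerful"; each such row could then be inserted into the synonym table.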
Parsing Author, Work, and Criticism Metadata
The information about authors, works, and literary criticism was parsed from online resources at Project Gutenberg and the Internet Public Library. I wrote special-case parsers to handle these pages. The parsers for the author data are given below; the others are similar.
Parsing Works
Works are stored in Project Gutenberg as plain text. I wrote a script to
download these texts and then parse their contents into the
database. After downloading the text, the header and footer that Project
Gutenberg places on the text were removed. Then the body was tokenized
to find individual words.
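The header removal and tokenizing steps can be sketched roughly as follows. This is not my original script. It assumes the body of the text sits between lines beginning with "*** START OF" and "*** END OF", which is the convention in current Project Gutenberg files (older files use different markers), and it treats any run of letters or apostrophes as a word. The class and method names are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; marker strings and tokenizing rule are assumptions.
final class GutenbergTextSketch {
    static final String START_MARKER = "*** START OF";
    static final String END_MARKER = "*** END OF";

    // Returns the lines strictly between the start and end marker lines.
    // With no start marker the body begins at the first line;
    // with no end marker it runs to the last line.
    static String stripBoilerplate(String text) {
        String[] lines = text.split("\\R");
        int first = 0;
        int last = lines.length;
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i].trim();
            if (line.startsWith(START_MARKER)) {
                first = i + 1;
            } else if (line.startsWith(END_MARKER)) {
                last = i;
                break;
            }
        }
        StringBuilder body = new StringBuilder();
        for (int i = first; i < last; i++) {
            body.append(lines[i]).append('\n');
        }
        return body.toString();
    }

    // Lowercases the body and splits on anything that is not a letter or apostrophe.
    static List<String> tokenize(String body) {
        List<String> words = new ArrayList<>();
        for (String token : body.toLowerCase().split("[^a-z']+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }
}
```

A downloaded text would be passed through stripBoilerplate first and the result through tokenize, giving the list of words to load into the database.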
Words were then "stemmed", meaning that their suffixes were stripped off
so that similar words like "falling", "fallen", and "falls"
would all map to the same stem, "fall".
This was done so that keyword searches would also return
hits containing similar words. It had the additional benefit of keeping
the vocabulary size smaller. The algorithm I used was the Porter
Algorithm, for which a public-domain implementation was available.
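To illustrate the idea of stemming only, here is a toy suffix stripper. It is deliberately not the Porter Algorithm, which applies several ordered rule phases with conditions on the shape of the remaining stem; the suffix list, the minimum stem length, and the class name are all assumptions chosen so that the example words above collapse to one stem.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy illustration of suffix stripping; NOT the Porter Algorithm.
final class ToyStemmerSketch {
    // Assumed suffix list, checked in this order.
    private static final String[] SUFFIXES = {"ing", "en", "s"};
    // Never strip a suffix if fewer than this many letters would remain.
    private static final int MIN_STEM_LENGTH = 3;

    static String stem(String word) {
        String lower = word.toLowerCase();
        for (String suffix : SUFFIXES) {
            int stemLength = lower.length() - suffix.length();
            if (lower.endsWith(suffix) && stemLength >= MIN_STEM_LENGTH) {
                return lower.substring(0, stemLength);
            }
        }
        return lower;
    }

    // Collects the distinct stems of a word list, showing how
    // stemming shrinks the vocabulary.
    static Set<String> vocabulary(List<String> words) {
        Set<String> stems = new LinkedHashSet<>();
        for (String word : words) {
            stems.add(stem(word));
        }
        return stems;
    }
}
```

With this sketch, "falling", "fallen", "falls", and "fall" all reduce to the single vocabulary entry "fall". A rule set this crude would mangle many other words, which is why a tested implementation of a real stemming algorithm is the better choice in practice.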
My source code for acquiring data is very unpolished, since we were
not required to turn it in.