Web Crawler: User Documentation

What does the Web Crawler do?

The Web Crawler allows the user to extract data from a set of hyperlinked HTML pages, convert it into OEM format, and load the result into a Lore database. It takes as input a specification file, a user profile, and a set of templates. The specification file describes how the extraction should be performed (e.g. where to start, how deep to crawl, which pages to include, which templates to use for each set of pages, etc.). The user profile gives the locations of tools needed by the Crawler (e.g. the Extraction script and the Lore database loader), while the set of templates declaratively states where the data of interest is located in the target HTML pages.

The Web Crawler makes use of the Web Extractor script (proxygen.py), written by Arturo Crespo and Junghoo Cho, to perform individual extractions. There is a related Perl script, lwp-rget (included in the Perl 5 distribution), which downloads a web subgraph and adjusts links for off-line browsing. The Perl script provides some interesting features, such as a sleep parameter to stretch out document retrievals, but it does not support the main features of the Web Crawler such as filtering by document type, extracting document contents, and converting the extracted information into database objects.

A Sample Specification File

The following is a sample specification file which can be used to gather the structure of our DB Seminar site. Suppose that we want to extract a different set of information for the abstracts of the talks, and have designed a set of special templates for them. For other pages, we will use another set of templates.

Please note that the line numbers in this example are not part of the specification file. They are added only to aid the explanation.

  1.   START
  2.   http://www-db.stanford.edu/~kyau/dbseminar/seminar.html
  3.  
  4.   DEPTH
  5.   2
  6.  
  7.   INCLUDE_PREFIXES
  8.   http://www-db.stanford.edu/
  9.   ftp
  10.  
  11.   INCLUDE_TEXT_TYPES
  12.   text/html
  13.   text/plain
  14.  
  15.   INCLUDE_BINARY_TYPES
  16.   image/gif
  17.   image/jpeg
  18.  
  19.   INCLUDE_TEXT_SUFFIXES
  20.   .txt
  21.  
  22.   INCLUDE_BINARY_SUFFIXES
  23.   .gif
  24.   .jpg
  25.  
  26.   SAVE_TEXT_SRC
  27.   true
  28.  
  29.   SAVE_BINARY_SRC
  30.   true
  31.  
  32.   SAVE_BINARIES_IN
  33.   images
  34.  
  35.   TEMPLATES
  36.   regular1    template/regular.py
  37.   regular2    template/regularNolink.py
  38.   abstract1   template/abstract.py
  39.   abstract2   template/abstractNolink.py
  40.  
  41.   DOCUMENT_MAPS
  42.   http://www-db.stanford.edu/~kyau/dbseminar/abstract/*   abstract1 abstract2
  43.   http://www-db.stanford.edu/*.html                       regular1  regular2
  44.   * *
  45.  
  46.   LORE_DB_NAME
  47.   test
  48.  
  49.   LORE_ENTRY_TAG
  50.   dbseminar
  51.  
  52.   USE_CACHE_WITHIN_DAYS
  53.   0.5
This specification file consists of fifteen sections separated by blank lines. Six of them are required (START, DEPTH, INCLUDE_PREFIXES, INCLUDE_TEXT_TYPES, TEMPLATES, DOCUMENT_MAPS). Each section starts with a keyword that states the purpose of the section.
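The section layout lends itself to simple parsing: blank lines separate sections, and the first line of each section is its keyword. The following sketch is illustrative only (the function names are not part of the Crawler), showing how a specification file could be read into a keyword-to-values mapping:

```python
REQUIRED = {"START", "DEPTH", "INCLUDE_PREFIXES",
            "INCLUDE_TEXT_TYPES", "TEMPLATES", "DOCUMENT_MAPS"}

def parse_spec(text):
    """Parse a specification file into {KEYWORD: [value lines]}.

    Sections are separated by blank lines; the first line of each
    section is its keyword, the remaining lines are its values.
    """
    sections = {}
    block = []
    for line in text.splitlines() + [""]:   # sentinel blank line flushes the last block
        if line.strip():
            block.append(line.strip())
        elif block:
            sections[block[0]] = block[1:]
            block = []
    return sections

def check_required(sections):
    """Raise if any of the six required sections is missing."""
    missing = REQUIRED - sections.keys()
    if missing:
        raise ValueError("missing required sections: " + ", ".join(sorted(missing)))
```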
 
Keyword (with example lines) and purpose:

START (lines 1-2)
  The URL of the starting page.

DEPTH (lines 4-5)
  The crawling depth.

INCLUDE_PREFIXES (lines 7-9)
  URL prefixes of pages to include. Only URLs starting with one of the given prefixes will be extracted.

INCLUDE_TEXT_TYPES (lines 11-13)
  MIME types of text files to include. Valid only for documents originating from HTTP servers.

INCLUDE_BINARY_TYPES* (lines 15-17)
  MIME types of binary files to include. Valid only for documents originating from HTTP servers.

INCLUDE_TEXT_SUFFIXES* (lines 19-20)
  Suffixes of text files to include. Only URLs ending with one of the given suffixes will be extracted and treated as text documents. Valid for documents originating from non-HTTP servers.

INCLUDE_BINARY_SUFFIXES* (lines 22-24)
  Suffixes of binary files to include. Only URLs ending with one of the given suffixes will be extracted and treated as binary documents. Valid for documents originating from non-HTTP servers.

SAVE_TEXT_SRC* (lines 26-27)
  Whether to save the HTML source of accepted text documents as part of the resultant OEM object.

SAVE_BINARY_SRC* (lines 29-30)
  Whether to download the accepted binary files. If true, the resultant OEM object will include references to local copies of the accepted files. If false, only the URLs will be included and no local copies will be saved.

SAVE_BINARIES_IN* (lines 32-33)
  The directory in which to save the local copies of accepted binary files.

TEMPLATES (lines 35-39)
  The set of templates available. Each subsequent line gives a mnemonic name for a template and the location of the template file.

DOCUMENT_MAPS (lines 41-44)
  Mapping of groups of documents to lists of templates. Each subsequent line lists a group and its associated templates. A single wildcard symbol '*' can be used in specifying a group name. After the Crawler consults the inclusion lists and determines that a page should be extracted, it tries to match the page's URL against the groups one by one. When there is a match, the Crawler tries to extract the page with each template associated with the group, in order, until an extraction succeeds. If all the templates in the group fail, the Crawler tries to match the URL against the next group and repeats the process. Line 44 shows a 'catch-all' group which will try all templates on all documents.
  NOTE: only text files are subject to extraction. All binary files are saved without modification.

LORE_DB_NAME* (lines 46-47)
  Name of the Lore database into which the result of the extraction is loaded. If this field is omitted, the result of the extraction is stored in a file without being loaded into a Lore database.

LORE_ENTRY_TAG* (lines 49-50)
  The persistent Symbolic Object Identifier used to identify the root of the resultant OEM object in a Lore database.

USE_CACHE_WITHIN_DAYS* (lines 52-53)
  How old a cached copy may be and still be considered valid. The Crawler will use a cached copy if it was retrieved from the source within the given number of days. Setting this field to 0 disables the use of the cache. If this field is omitted, the Crawler fetches the header of each requested page and compares the page's last-modified time with the cached copy; the cached copy is used only when the page has not been modified since it was cached.

 * optional fields
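The DOCUMENT_MAPS lookup described above — match the URL against each group pattern in order, then try that group's templates until one succeeds — can be sketched as follows. This is an illustration, not the Crawler's actual code; the try_template callback is a placeholder, and the catch-all '*' group is expanded by hand into the full template list:

```python
from fnmatch import fnmatch

# (pattern, [template names]) pairs, in the order they appear in the file
DOCUMENT_MAPS = [
    ("http://www-db.stanford.edu/~kyau/dbseminar/abstract/*", ["abstract1", "abstract2"]),
    ("http://www-db.stanford.edu/*.html", ["regular1", "regular2"]),
    ("*", ["regular1", "regular2", "abstract1", "abstract2"]),  # catch-all group
]

def choose_extraction(url, try_template):
    """Try each matching group's templates in turn.

    try_template(url, name) is assumed to return a truthy result on a
    successful extraction and a falsy one on failure. If every template
    in a matching group fails, matching continues with the next group.
    """
    for pattern, templates in DOCUMENT_MAPS:
        if fnmatch(url, pattern):
            for name in templates:
                result = try_template(url, name)
                if result:
                    return result
    return None  # no group/template combination succeeded
```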
 
For our example, the Crawler will start crawling at the page http://www-db.stanford.edu/~kyau/dbseminar/seminar.html. It will then follow the hyperlinks within the document, and extract all the HTML and text documents from our web server or any ftp server. If a page is located in the directory http://www-db.stanford.edu/~kyau/dbseminar/abstract/, the Crawler will try to extract the document with templates 'abstract1' and 'abstract2'. For other documents from the web server of the Stanford Database Group, the templates 'regular1' and 'regular2' will be used. All the templates will be tried on all other documents. We will accept cached copies of HTML pages if they were retrieved within half a day. The extraction process repeats until all documents within two levels of indirection from the starting URL have been extracted. The resultant OEM object, with all the document sources, will be loaded into the Lore database 'test' with the entry tag 'dbseminar'.

In addition, the Crawler will download all the jpeg and gif files along the crawling path, and save them into the directory 'images'. References to these local images will be included in the resultant OEM object.
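The half-day cache window corresponds to a simple age check. A minimal sketch of the rule, assuming the cache is a directory of saved files (the function name and file layout are illustrative, not the Crawler's actual implementation):

```python
import os
import time

def cache_is_fresh(cache_path, max_age_days):
    """Return True if the cached copy may be reused.

    A setting of 0 disables the cache entirely; otherwise the copy is
    fresh if it was saved within max_age_days (0.5 = half a day).
    """
    if max_age_days <= 0 or not os.path.exists(cache_path):
        return False
    age_seconds = time.time() - os.path.getmtime(cache_path)
    return age_seconds <= max_age_days * 24 * 60 * 60
```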

A Sample User Profile

The user can include an optional user profile '.oem_profile' to configure the Web Crawler's properties that are common to all invocations. LORE_PATH and EXTRACTOR_PATH give the paths of the Lore Database Loader (dbload2) and the Web Extractor script (proxygen.py), respectively; the default for each is the current directory. LORE_STRING_MAX_SIZE states the maximum length of strings to be loaded into a Lore database; longer strings will be truncated to the given limit. If this value is unspecified, no limit is imposed.
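The original sample profile is not reproduced here. Assuming the profile uses the same keyword-section layout as the specification file, it might look like the following (all paths and the size limit are illustrative values, not defaults):

```
LORE_PATH
/usr/local/lore/bin

EXTRACTOR_PATH
/usr/local/lore/extractor

LORE_STRING_MAX_SIZE
4096
```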
 
Templates

The templates to use must follow the formats accepted by the Web Extractor script (proxygen.py), with three additional requirements. For a detailed description of Web Extractor and the corresponding template format, please refer to the paper 'Extracting Semistructured Information from the Web'.

The following is an example template that extracts the title, date, HTML source, and all hyperlinks of an HTML page. Please note that the line numbers are not part of the template.
  1.   [["root",
  2.     "get('%URL')",
  3.     "#"
  4.    ],
  5.    ["title",
  6.     "root",
  7.     "*<title>#</title>*"
  8.    ],
  9.    ["date(:root)",
  10.     "getHead('%URL')",
  11.     "*Last-modified:#\r*"
  12.    ],
  13.    ["source",
  14.     "root",
  15.     "#"
  16.    ],
  17.    ["_link",
  18.     "split(root, '<a href')",
  19.     "#"
  20.    ],
  21.    ["hyperlink:url",
  22.     "_link[1:]",
  23.     "*=*\"#\"*"
  24.    ]
  25.   ]
This template file consists of six commands, each delimited by square brackets. The first command (lines 1-4) retrieves the content of an HTML page and assigns it to the variable root. Note the use of the parameter %URL, which allows the template to be reused on different pages. The Crawler substitutes the appropriate URL for the parameter during the actual extraction, according to the Document Maps specified by the user. The same applies to the getHead command (line 10), which gets the HTTP header of a URL.

The second command (lines 5-8) extracts the title of the page, located between the <title> and </title> tags. The third command (lines 9-12) extracts the last-modified date of the page from the HTTP header and assigns it to the variable date. The extension (:root) following the variable name on line 9 attaches the result to the correct place. By default, the result of an extraction command is attached to the source of the extraction; for example, title is attached to root in the second command. Every get or getHead command starts a new tree. In this case, the use of (:root) on line 9 overrides the default and attaches date as a child of root instead of making 'date' a separate tree.

The fourth command (lines 13-16) includes the source of the HTML page in the OEM graph. This is required if the user wishes to save the source in the Crawler's resultant OEM.

The last two commands split the page using the <a href tag as the separator, and store the hyperlinks in the variable list hyperlink. The Crawler recognizes all OEM nodes labeled hyperlink and follows them to start the next level of crawling. The extension :url following the variable name on line 21 sets the type of the OEM objects to url instead of the default string.
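The split-and-match behavior of the last two commands can be mimicked in plain Python. This is only a rough equivalent of what the template asks the Extractor to do, not the Extractor's actual implementation: split the page on '<a href', skip the text before the first anchor (the _link[1:] slice), and take the first double-quoted string in each fragment (the *=*"#"* pattern):

```python
import re

def extract_hyperlinks(html):
    """Collect hyperlink targets the way the sample template does:
    split on '<a href', then grab the first quoted string per chunk."""
    links = []
    for chunk in html.split("<a href")[1:]:   # [1:] skips text before the first anchor
        match = re.search(r'=\s*"([^"]*)"', chunk)
        if match:
            links.append(match.group(1))
    return links
```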
 
GUI for Template Generation

A GUI, developed by Wei Yuan, helps users generate templates for the Web Extractor. The interface displays the HTML page of interest and helps users select extraction commands for the page. It also provides a view of the structure and content of the resultant OEM object.

Currently, the GUI only supports the Web Extractor script (proxygen.py). In the future, we intend to extend it to cover the features of the Web Crawler as well.

Questions/Comments? Please contact Ka Fai Yau <kyau@db.stanford.edu>.