CS154 Project #3

Due Wednesday, March 8, 2000, 3:15PM

Mining the Web

The requirements are simple: use what you have built in Projects 1 and 2 to find something interesting on the Stanford Web, which we have downloaded as a single 100Mb file (shorter segments of this file are also available); see Resources on the class Web page. It is up to you what to look for, but some suggestions are:

  1. Phone numbers, credit cards, or other numerical information.
  2. Mentions of a particular state or country.
  3. Portions that are in some foreign language.
  4. Faculty and their offices.
  5. Movies being shown on campus (at the time the Web snapshot was taken).
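
As one illustration of suggestion 1, a pattern for US-style phone numbers might be sketched as follows. The sketch uses Python's re module rather than your Project 2 software, and the names PHONE and SAMPLE, along with the made-up 555 numbers, are assumptions for illustration only. The lookaround assertions are a Python convenience for rejecting longer digit runs; a pure regular-expression matcher would need that context written out explicitly.

```python
import re

# Hypothetical pattern: an optional area code, written "(650) " or "650-",
# followed by a 7-digit local number such as 555-0123.
# (?<!\d) and (?!\d) reject candidates embedded in a longer run of digits.
PHONE = re.compile(r"(?<!\d)(?:\(\d{3}\) ?|\d{3}[-.])?\d{3}[-.]\d{4}(?!\d)")

# Made-up sample text; none of these are real numbers.
SAMPLE = (
    "Call (650) 555-0100 or 555-0123; fax 650-555-0199. "
    "Not a phone: 12345-67890."
)

found = PHONE.findall(SAMPLE)
for number in found:
    print(number)
```

Even this small pattern makes choices you would want to defend in your report, such as which separators to allow and whether a bare 7-digit number counts.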

Credit for the project will be based on correct use of your software (e.g., selection of appropriate regular expressions), but there will be cheap prizes for originality.

Selecting the proper regular expressions is not trivial, as those of you following the FAQ sheet for Project 2 will have noticed. For example, my test proposal that you look for a capitalized "Something of Something" had to be corrected at least twice. First, it was pointed out that you want to include the whitespace at the end of the second Something, or you get only "Something of S". Then, it was pointed out that the actual context could involve punctuation, such as a period or comma, and that these had to be added as options. Is that good enough to capture all, or the great majority, of instances of the intended pattern?
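
The progression described above can be sketched in Python's re syntax, which is not necessarily the syntax of your own pattern-matcher. The sketch assumes the "Something of S" problem came from a matcher that reports the shortest match; the lazy quantifier `*?` imitates that behavior here. The sample text and variable names are made up for illustration.

```python
import re

# Made-up text: the four names are followed by a comma, a period,
# whitespace, and the end of the text, respectively.
TEXT = (
    "Visit the Department of Music, the School of Engineering. "
    "Ask the Office of Admissions or the University of California"
)

# Attempt 1: no right-hand context, so a shortest-match scan stops
# after the first letter of the second word.
attempt1 = re.findall(r"[A-Z][a-z]* of [A-Z][a-z]*?", TEXT)

# Attempt 2: require whitespace after the second word. The parentheses
# keep the trailing context out of the reported match.
attempt2 = re.findall(r"([A-Z][a-z]* of [A-Z][a-z]*?)\s", TEXT)

# Attempt 3: also allow a period or comma as the right-hand context.
attempt3 = re.findall(r"([A-Z][a-z]* of [A-Z][a-z]*?)[\s.,]", TEXT)

print(attempt1)
print(attempt2)
print(attempt3)
```

On this sample the third attempt still misses "University of California", because nothing follows it at the end of the text. Other punctuation and multi-word names such as "Department of Computer Science" raise similar questions.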

What to Hand In

Please give us a report including:

  1. An informal description of what you were looking for.

  2. The regular expressions you used.

  3. The files you were able to scan, and the approximate time your scans took. Note: you can get exact timings using the time command in front of the command that runs your pattern-matcher. That is,

         time foo
    

    runs command foo and prints a synopsis of the user, system, and wall-clock time taken by foo at the end. However, we are only interested in whether you are able to do the search in a reasonable amount of time. Thus, anything that is too short to measure on a watch can be reported as "instantaneous." For example, tell us whether you were able to get through the full 100Mb at all, or whether you took long or short times on the shorter files.

  4. The results of your searches. If you received a large number of matches for one expression, you can give us a sample and tell us (roughly) how many there were.
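
For items 3 and 4, a rough count, a small sample, and an approximate timing can be collected in a single pass. The following is a minimal sketch in Python, assuming a line-oriented scan with Python's re module standing in for your own matcher; the function name scan_lines and its parameters are hypothetical.

```python
import re
import time


def scan_lines(lines, pattern, sample_size=5):
    """Return (match count, first few matches, elapsed seconds)."""
    regex = re.compile(pattern)
    count = 0
    sample = []
    start = time.perf_counter()
    for line in lines:
        for match in regex.finditer(line):
            count += 1
            # Keep only the first few matches as the reported sample.
            if len(sample) < sample_size:
                sample.append(match.group(0))
    elapsed = time.perf_counter() - start
    return count, sample, elapsed


# Usage with an in-memory list; an open file handle works the same way,
# since iterating over a file yields its lines one at a time.
count, sample, elapsed = scan_lines(["a1 b22", "c333", "none"], r"\d+", sample_size=2)
print(count, sample, f"{elapsed:.3f}s")
```

Reading line by line keeps memory use small on a large file, at the cost of missing any match that spans a line break.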

All this information can be handed in electronically using submit, or given to us in class as hardcopy.