Project #2 FAQ
Question:
Where would you like this project to go? What suggestions do you have?
Answer:
Here are some Project Hints to get you started thinking about what kinds
of things you might do with this project.
Question:
When I search for ``School of Business'' I first match ``School of B''.
Answer:
That was a mistake in the specification.
We really should have asked for a pattern that ended with white space,
e.g., a blank or newline.
The parser.h file describes codes for whitespace in the regular
expressions that are allowed to be input to Nathan's parser, including
{s} for any one whitespace character.
Added later: It was pointed out that we then miss occurrences that end
a sentence or are followed by a comma or other punctuation.
Thus, you may wish to end with an expression like ({s}|[\.\,;]).
Remember that a literal period is written \. and a literal comma \,.
Added much later:
It was also pointed out that a tag is a logical ender for the expression
we are looking for.
Thus, including < as a possible ender character makes sense too.
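The set of ender characters discussed above can also be checked directly
in C, for instance to verify a match by hand. This is only a sketch:
is_ender is a hypothetical helper, not part of parser.h, and it assumes
that the standard isspace is an acceptable stand-in for {s}.

```c
#include <ctype.h>

/* Hypothetical helper: return nonzero if c may follow the phrase we are
   searching for.  Enders are any whitespace character (assumed to
   correspond to {s}), the punctuation marks . , ; and the < that opens
   a tag. */
int is_ender(char c)
{
    /* The cast avoids undefined behavior for negative char values. */
    if (isspace((unsigned char)c))
        return 1;
    return c == '.' || c == ',' || c == ';' || c == '<';
}
```

With this check, the character after ``School of B'' in ``School of
Business'' is 'u', which is not an ender, so the short match is rejected.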
Question:
There are ctrl-M's in the Web text that make it hard to read, and that
the parser fails to recognize as newlines.
Answer:
Apparently some of the Stanford Web was created using Windows (shame on
them), and Microsoft, in one of its early attempts at incompatibility
with UNIX, used ctrl-M (carriage return) rather than the ASCII newline
control character ('\n' in C) to separate lines.
One of the students, B.C. Wong, suggested the following UNIX ``translate''
command:
tr '\r' '\n' <inputFile >outputFile
which replaces each carriage return ('\r') with a newline.
I actually did that for the truncated files, which are now available
through the Web as x1000.txt and so on.
I was not able to do it for the entire 100MB file, because we are not
allocated enough disk space.
However, if you feel the ctrl-M's are giving you trouble, you can
write project #2 to take the data as its standard input, and use the
UNIX ``pipe'' symbol | to have the translation done piecemeal as the
input to your program is read.
The idea is:
tr '\r' '\n' </usr/class/cs154/WWW/w.txt | yourProj2...
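If the pipe is inconvenient, the same translation can be done inside the
program as it reads its input. What follows is a minimal sketch, assuming
the project is written in C and reads one character at a time; cr_to_nl
and copy_translated are hypothetical names, not part of the course code.
Like the tr command, it turns every carriage return into a newline.

```c
#include <stdio.h>

/* Hypothetical helper: map a carriage return to a newline and pass
   every other value (including EOF) through unchanged. */
int cr_to_nl(int c)
{
    return (c == '\r') ? '\n' : c;
}

/* Copy everything from in to out, translating as we go.  The input is
   handled one character at a time, so no second copy of the file is
   ever stored. */
void copy_translated(FILE *in, FILE *out)
{
    int c;

    while ((c = getc(in)) != EOF)
        putc(cr_to_nl(c), out);
}
```

Calling copy_translated(stdin, stdout) from main behaves like the tr
command above. In the project itself you would more likely hand each
cr_to_nl(getchar()) result straight to your matcher.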