Project #3 FAQ
Question:
How are we supposed to search the entire Stanford Web? We don't have that
much quota in our leland accounts to copy the 100 MB file to the directory
of our program.
Answer:
If you're running on Unix, you can specify the absolute path as opposed to
the relative path to get to any directory in the file system. So you
could try something like this:
parser regexfile /usr/class/cs154/WWW/w.txt
Which would run the parser with regexfile (in the local directory) and the
w.txt file in the class directory (you have to start with / in your
absolute path, so it knows that you're specifying a path from the root).
Also, if you ever need to get from one place in your directory tree to
another, you can use .. to specify the parent directory of the current
directory (or of any directory for that matter):
parser ../project1/regexfile ../testfile/test.txt
Finally, you can always copy files into the /usr/tmp directory which does
not count toward your quota, so you could do something like this:
cp /usr/class/cs154/WWW/w.txt /usr/tmp
parser regexfile /usr/tmp/w.txt
This way, if you need to make changes to w.txt for whatever reason (like
say you wanted to add a unique word that appears nowhere else in the file,
but you still want to search through a large file to see how long it takes
you to find one accepting pattern), then you can alter the file by reading
and writing from it (which you can't do in the /usr/class/cs154/WWW
folder, since it's read only).
Question:
How many regular expressions are we expected to turn in?
Answer:
You're looking for interesting information -- I suppose it doesn't really
matter whether you use one regular expression or multiple regular
expressions. My guess is that Professor Ullman intends for you to try different
ways of formulating your expressions to find the same *kind* of useful
information, and then to comment on what patterns worked the best, or that
kind of thing. Alternatively, you could use multiple expressions to find
several different instances of related information and compare the
locations of the accepting states to find out their proximities, for example. As far as email addresses go, you will probably have only
one RE, but you might recall that sometimes email addresses don't include
the domain, sometimes they do, and other transformations that might occur,
so you might see how useful it is to search for slightly more lenient
patterns (things that will accept well-formed email addresses as well as
predictable poorly formed email addresses). For example, I'm reading a
paper right now by Miguel Castro and Barbara Liskov at MIT. In the header
it tells their names, their address, and their email addresses, but it
presents the email addresses as follows:
{castro, liskov}@lcs.mit.edu
Which, of course, represents castro@lcs.mit.edu and liskov@lcs.mit.edu --
if you were searching for normal emails you might only get
liskov}@lcs.mit.edu, which is no one's email, because the right-bracket
makes it poorly-formed. This is just one example -- there are probably
many others. You might try to think about some alternative ways of doing
this, and seeing whether or not they make any difference in the kind of
information you get out.
You are expected to look for something interesting and
tell us what you found. If you only need one regular expression to find
something interesting, then bully for you. If you use only a simple
regular expression, though, you might miss a lot of interesting cases.
You can always Union expressions together to make several regular
expressions one single expression, but depending on how you're running
your search, this might lose useful information. You may want to search
for several different interesting things, in which case you might want
more than one expression and more than one summary of what you found.
This is really mostly up to you.
Question:
Any other useful tips?
Response:
Also, for those of you who don't know how to redirect UNIX standard input
and output, you can use the < and > operators on the command line to tell
the standard input to be read from a given file or the standard output to
be written to a given file (stdin and stdout are the files you read from
in scanf and write to in printf respectively). You can use this to
collect the data you get out of your program to compare it to, say, the
output you'd get by running grep over the same file. For example:
setenv CS154 "/usr/class/cs154/WWW"
parser regexfile $CS154/w.txt > myoutput
grep -f regexfile $CS154/w.txt > grepoutput
diff myoutput grepoutput
The first line shows how you can set your environment variables to a
certain string, so that you don't need to keep retyping the path. The
second line shows how to use the environment variable to get to the file
you want and output the results of your scan into another file (that will
appear in the current directory). The third line shows how to run grep
with a regular expression in a regexfile. The one caveat here is that
grep uses a slightly different regular expression grammar than we do. You
might have to alter your regular expressions to match those of grep,
depending on what you're searching for. (You can find out more about
grep's regular expression grammar by typing "man grep" in the UNIX
console). The fourth line runs a file called diff over the two outputs to
see where they differ. grep returns the lines on which an accepting
pattern is found, so your output might be different than theirs.
The one last note of advice I have is that you try running grep anyway,
even if your output is different, just to see how long the execution takes
relative to your own. You can tell the machine to ignore the standard
output of grep or your own program by redirecting the output to /dev/null:
time grep -f regexfile $CS154/w.txt > /dev/null
(the output of time is sent to standard error, so it won't be redirected
to /dev/null, but will instead be sent to your console). If you beat grep
and get the same output, let us know (you might try fgrep and egrep as
well, since these have different scanning techniques). You might also run
grep on regular expressions that will grow to exponential size and see how
it handles those relative to your NFA expression (and see if it handles
them correctly). Good luck with everything!
Last modified: Fri Mar 3 03:52:20 PST 2000