Project #3 FAQ

Question: How are we supposed to search the entire Stanford Web? We don't have that much quota in our leland accounts to copy the 100 MB file to the directory of our program.

Answer: If you're running on Unix, you can specify the absolute path as opposed to the relative path to get to any directory in the file system. So you could try something like this:

parser regexfile /usr/class/cs154/WWW/w.txt

Which would run the parser with regexfile (in the local directory) and the w.txt file in the class directory (you have to start with / in your absolute path, so it knows that you're specifying a path from the root).

Also, if you ever need to get from one place in your directory tree to another, you can use .. to specify the parent directory of the current directory (or of any directory for that matter):

parser ../project1/regexfile ../testfile/test.txt

Finally, you can always copy files into the /usr/tmp directory which does not count toward your quota, so you could do something like this:

cp /usr/class/cs154/WWW/w.txt /usr/tmp
parser regexfile /usr/tmp/w.txt

This way, if you need to make changes to w.txt for whatever reason (like say you wanted to add a unique word that appears nowhere else in the file, but you still want to search through a large file to see how long it takes you to find one accepting pattern), then you can alter the file by reading and writing from it (which you can't do in the /usr/class/cs154/WWW folder, since it's read only).

Question: How many regular expressions are we expected to turn in?

Answer: You're looking for interesting information -- I suppose it doesn't really matter whether you use one regular expression or multiple regular expressions. My guess is that Professor Ullman intends for you to try different ways of formulating your expressions to find the same *kind* of useful information, and then to comment on what patterns worked the best, or that kind of thing. Alternatively, you could use multiple expressions to find several different instances of related information and compare the locations of the accepting states to find out their proximities, for example. As far as email addresses go, you will probably have only one RE, but you might recall that sometimes email addresses don't include the domain, sometimes they do, and other transformations that might occur, so you might see how useful it is to search for slightly more lenient patterns (things that will accept well-formed email addresses as well as predictable poorly formed email addresses). For example, I'm reading a paper right now by Miguel Castro and Barbara Liskov at MIT. In the header it tells their names, their address, and their email addresses, but it presents the email addresses as follows:

{castro, liskov}

Which, of course, represents and -- if you were searching for normal emails you might only get liskov}, which is no one's email, because the right-bracket makes it poorly-formed. This is just one example -- there are probably many others. You might try to think about some alternative ways of doing this, and seeing whether or not they make any difference in the kind of information you get out.

You are expected to look for something interesting and tell us what you found. If you only need one regular expression to find something interesting, then bully for you. If you use only a simple regular expression, though, you might miss a lot of interesting cases. You can always Union expressions together to make several regular expressions one single expression, but depending on how you're running your search, this might lose useful information. You may want to search for several different interesting things, in which case you might want more than one expression and more than one summary of what you found. This is really mostly up to you.

Question: Any other useful tips?

Response: Also, for those of you who don't know how to redirect UNIX standard input and output, you can use the < and > operators on the command line to tell the standard input to be read from a given file or the standard output to be written to a given file (stdin and stdout are the files you read from in scanf and write to in printf respectively). You can use this to collect the data you get out of your program to compare it to, say, the output you'd get by running grep over the same file. For example:

setenv CS154 "/usr/class/cs154/WWW"
parser regexfile $CS154/w.txt > myoutput
grep -f regexfile $CS154/w.txt > grepoutput
diff myoutput grepoutput

The first line shows how you can set your environment variables to a certain string, so that you don't need to keep retyping the path. The second line shows how to use the environment variable to get to the file you want and output the results of your scan into another file (that will appear in the current directory). The third line shows how to run grep with a regular expression in a regexfile. The one caveat here is that grep uses a slightly different regular expression grammar than we do. You might have to alter your regular expressions to match those of grep, depending on what you're searching for. (You can find out more about grep's regular expression grammar by typing "man grep" in the UNIX console). The fourth line runs a file called diff over the two outputs to see where they differ. grep returns the lines on which an accepting pattern is found, so your output might be different than theirs.

The one last note of advice I have is that you try running grep anyway, even if your output is different, just to see how long the execution takes relative to your own. You can tell the machine to ignore the standard output of grep or your own program by redirecting the output to /dev/null:

time grep -f regexfile $CS154/w.txt > /dev/null

(the output of time is sent to standard error, so it won't be redirected to /dev/null, but will instead be sent to your console). If you beat grep and get the same output, let us know (you might try fgrep and egrep as well, since these have different scanning techniques). You might also run grep on regular expressions that will grow to exponential size and see how it handles those relative to your NFA expression (and see if it handles them correctly). Good luck with everything!

Last modified: Fri Mar 3 03:52:20 PST 2000