CS154 Project #2

Due Wednesday, February 23, 2000, 3:15PM

Simulating a Finite Automaton

Your program should take two inputs:

  1. A file containing a regular expression R in the notation used for Project 1.

  2. A file containing a string of characters w.

That is, your program, perhaps called myProg should be able to run with the command line

     myProg reFile dataFile
where reFile is a file containing a regular expression, and dataFile is a file containing the data that will be input to the resulting FA.

We will also accept programs that take a different input form, if you are more comfortable with that. For example, you might want to concatenate the RE in front of the data file, and use only the standard input. However, if you do so, better make sure that you have a way of telling where the RE ends.

Your output should be an indication (discussed below) of at what points in string w, the automaton for the regular expression .*R accepts. Notice that while R might be a simple pattern, say abc, what you are looking for is all the positions in R where abc occurs. Thus, the regular expression you match against w is not R, but .*abc, i.e., anything followed by abc.

Presumably, your program will modify R to become .*R, then feed that expression to your program from Project 1. The new code you write for this project will simulate whatever kind of automaton your code from Project 1 produces. If for some reason you can't handle the ``.*'' then an alternative is to start your automaton at every character of w, and simulate all these (copies of) automata, throwing them away when they reach a dead state.

We strongly advise you to design your program to ignore any input character that the automaton cannot use. The reason is that we can not be sure the 100Mb of text we downloaded doesn't have some weird character that is not part of the usual 128 ASCII set. If there is even one such character, the FA will die on reading it, and you won't be able to find anything beyond that point. Note that reading an unexpected character is different from reading an expected character that leads to a dead state (or, if the FA is an NFA, having no transitions on that character).

Response to Acceptance

You may find that your automaton accepts many times as it reads w. Each time it does so, it has just seen something that matches the pattern of R. But what? The answer is that it is hard to tell, unless you take the alternative mentioned above, where you start the automaton for R at each character (and we do not advise this approach, but will accept it). Two possibilities:

  1. Produce in the output the previous n characters of w, say the last 100 characters.

  2. Produce the line containing the point at which the acceptance occurs.

Either way, you have to ``buffer'' w, that is, remember the last so-many characters. You can make a reasonable assumption about how long lines could ever be, if you choose line-based output.

What to Hand In

  1. Your code. Submit it using the submit command as per Project 1.

  2. A writeup telling us:

    a)
    The essential idea(s) behind your simulation.

    b)
    Several examples of RE's and strings that you compiled, and the resulting output.

    c)
    We shall also distribute several examples, and you should show us your results on these.