CS154 Project #1
Due Wednesday, February 2, 2000, 3:15PM
Overview of Course Project
The project is in three phases.
- A program that compiles regular expressions into transition
tables for either a DFA or an NFA (this assignment).
This program will be tested on some ``ugly'' RE's that produce very large
DFA's, so you should make a choice for each RE.
If you simply turn everything into an NFA, you will be too slow
on simple searches once we get to part (3).
- A program that interprets the tables you created in part (1).
You should be able to process very long strings of input.
-
A demonstration of what you can do with your FA interpreter from part (2).
We shall give you a large body of text (10Mb - 100Mb), perhaps
the Stanford Web pages or a similar repository.
You should ``data-mine'' something from the data by designing some regular
expressions and searching for occurrences of substrings that match those
expressions.
Think, for example, about examining Web pages from all Stanford departments
and finding course numbers and their instructors, although it will be
up to you what to search for.
Programming Venue
Despite what you may have heard, NT is not taking over the world;
Linux will dominate, and every computer scientist should assume that
UNIX is and will remain the platform of choice.
Thus, we assume you will write your code using one of the UNIX clusters,
e.g., the ``elaines'' or ``epics.''
We also assume that you will write your code in C or C++; Java is acceptable,
but you may lose some support that has been provided for C/C++ programming.
If you insist on programming in your dorm on a PC or MAC, then remember
that the code has to run on a very large body of data in part (3).
The files will appear in the class directory, and it is your responsibility
to be able to access them.
Regular-Expression Parsing
Thanks to Nathan Folkert, there is a parser for regular expressions
in subdirectory /usr/class/cs154/project.
Check the file parser.h for a description of the parser and
its input language.
The file parser.c has the C code for the parser.
You can just include it in your program (you don't even have to acknowledge this
use; we'll assume it), or modify it should you like (again, no acknowledgement
is necessary).
This code takes a regular expression of the type we expect you to process
and produces a parse tree, i.e., a tree of structs, each of which represents
a node of the tree; i.e., an operator or operand of the expression.
This code has been compiled under gcc; we hope it will compile
under other C compilers, but it has not be tested.
You should report any problems to cs154-help@lists, and we
shall try to adapt to what people are actually using, but there are no
guarantees.
The Assignment
You should write a program that accepts a file containing a regular expression
in the extended (UNIX-style) notation including simple character classes
(bracketed lists of characters and ranges of characters like a-z).
You should also allow the ? and positive closure (superscript +)
operators.
These extensions are described in Section 3.3.1 of the course reader.
Since we do not have the easy ability to distinguish superscripts from
in-line characters, use | as the ``or'' operator and +
as the superscript + (positive closure) operator.
See the file project/parser.h for details, even if you do not use the
code provided.
The output of your program should be a structure representing either a
DFA or NFA for the same language as the input RE.
You should design your structure carefully, because it will be used in
part (2).
Some data-structure things to think about:
Roughly, an automaton is a set of states.
This set could be represented as an array or a list.
A state is a set of input/next-state(s) pairs.
Your automata will have a large input alphabet, perhaps all 128
ASCII characters, or the printable subset (i.e., those that
appear on a typical keyboard).
You might, for example, plan to ignore all characters that don't
form part of the printable set of ASCII characters.
How will you represent the transitions?
By a list of symbol-state(s) pairs?
By an array indexed by ASCII characters?
Will you represent character classes specially, e.g., by handling the
common situation where a state q goes to state p
on a large number of different inputs such as all letters?
We're not trying to tell you what should be done; we just want to warn you
to think about these issues, since you will eventually have to
simulate automata in whatever data structure you choose.
What to Hand In
-
Your code.
You may be able to submit this on-line.
Details later.
-
A writeup telling us:
- a)
-
What strategy you used to decide whether to generate a DFA or NFA.
- b)
-
The essential ideas behind your representation(s) of automata.
- c)
-
Several examples of RE's that you compiled and the result.
You may wish to develop a way to display the contents of your automaton
data structures in a ``human-readable'' way.
- d)
-
We shall also distribute several hard examples of RE's, and you should
describe the results on these expressions:
how long did it take; how big were your data structures?