CS154 - Project Part I Cheat Sheet

This document contains a few helpful hints on how to approach Part I of
the project.  The first section will include a Guide for the Project,
outlining the various steps in the conversion from a regular expression to
a finite automata, including pseudocode and hints and tips.  

Note:  You do not need to follow the outline provided here.  It is an
example only.  It is not any more "correct" than any other method, nor is
it necessarily the most efficient or easiest way to do it, but it may help
you in getting an idea of what approaches you might take.

SECTION 1: Outline of Construction

STEP 1: Convert Regular Expression Into a Parse Tree
- this function is supplied for you in parser.h.

STEP 2: Convert Parse Tree Into an epsilon-NFA

Proposed Method: In-Order Tree Traversal by a Recursive Function:

This function assumes you have some way to represent a graph, using node
structures and edge structures.  It also assumes you've created a node
representing a start state, s, and a node representing a final state f,
which you pass to the function along with the regular expression parse
tree to begin the recursion.

The design is up to you, but keep in mind
the kind of information you'll need:

void RecursiveConvert(RegularExpressionT *t, NodeT *s, NodeT *f) {
  switch (t->type) {
    Terminal Cases: (Base Case)
      Create edge with appropriate label.
      link edge from s to f

    Union:
      Create a new state node, q1, q2.
      Create a new state node, p1, p2.
      Create edge with epsilon label, e1, e2.
      Create edge with epsilon label, f1, f2.
      link e1 from s to q1.
      link e2 from s to q2.
      link f1 from p1 to f.
      link f2 from p2 to f.
      RecursiveConvert(t->term, q1, p1);
      RecursiveConvert(t->next, q2, p2)

    Concat:
      Create a new state node, q.
      Create a new state node, p.
      Create edge with epsilon label, e.
      link e from q to p.
      RecursiveConvert(t->term, s, q);
      RecursiveConvert(t->next, p, f);

    Kleene Closure:
      Create a new state node, q.
      Create a new state node, p.
      create edge with epsilon label, e.
      create edge with epsilon label, f.
      link e from s to q.
      link e from p to f.
      create edge with epsilon label, d.
      create edge with epsilon labal, g.
      link d from s to f.
      link g from p to q.
      RecursiveConvert(t->term, q, p);

    ... etc.: (cover the rest of the cases)
  }
}

Suggestions: Keep track of the edges into nodes as the edges out - it
makes epsilon-removal easier (you can tell when a node has no more edges
into it).  Remember that each state must be uniquely named - one solution:
a global increment variable.  You may also find it useful to save all
states in an array.

STEP 3: Remove epsilon-Transitions

Proposed Method: Depth First Graph Traversal by Recursive Function

This function assumes that you have a start state, s, and that you've
passed it in to begin the recursion.

RecursiveEpsilonRemoval(NodeT *s) {
  If this state is marked return, otherwise mark it and:
  For each edge out of s, if the label is epsilon, let q = destination
    if q has not yet been visited in this function, and q != s, then
      add edges in q not in s to edges of s (add to the end of the list of
      edges, so that these edges will be checked in this function as well.
      If q is accepting, let s be accepting as well.
    Remove the edge.
  For each edge out of s, let q = destination
    RecursiveEpsilonRemoval(q);
}

Suggestions: Remember to remove unreachable states after you've finished,
otherwise you'll waste memory.  Save each node in some kind of ordered
array -- this will simplify things when you convert to a transition table
anyway.

Tips on how to free stuff (for this function, and in general):  The
easiest way is not to worry about freeing states on the fly, but rather to
save a table of states and then free them all when you're done (you can
free all non-reachable states after STEP 3, and then free all the states
after STEP 4, along with all remaining edges).  In STEP 3, you'll want to
free epsilon-edges as you remove them.

STEP 4: Convert NFA to transition table

Proposed Method: Depth First Graph Traversal by Recursive Function

This function assumes that you've created at the root level some structure
for representing an NFA transition table, and that the sets of next states
have been initialized to being considered empty sets.  I'm going to access
this structure as one would access a two-dimensional array, though you can
have any suitable data structure for this.

ConvertToNFATable(NFATableT *t, NodeT *s) {
  if s has not been visited {
    Mark s as having been visited.
    For each e = edge out of s, let q = destination
      For each v = value of the label of e
        Add q.name to t->table[s][v]
      For each edge out of s, let q = destination,
        ConvertToNFATable(t, q);
  }
}

STEP 5: Convert NFA to DFA Transition Table

Proposed Method: Lazy Evaluation for Subset Construction on NFA State
Table.

This function assumes that you've created at the root level some structure
for representing a DFA transition table.  In this case I'll be using a
dynamic 2-dimensional array of values representing the states.  This
solution also implies some way of labelling each state in the DFA with the
NFA nodes that it represents, though this representation can be discarded
after construction.  Again, this is just pseudocode, so you can approach
this problem however you'd like.

ConvertToDFATable(DFATableT *d, NFATableT *n) {
  add row to 2-D array representing NFA state {0}
  for each q = element in the array of DFA states {
    for each n = NFA state value in representation of q, {
      For each a = input value {
        Fetch S = next state set from NFA table [q][a]
        If S represents a new DFA state (check array of DFA state
        representations), create a new row in the DFA table, and add the
        state representation to list of DFA state representations.  Also,
        if S contains a final state, remember to make the DFA state
        representing S a final state.
        Add transition to state represented by S to DFAtable[q][a]
      }
    }
  }
}

At some point in this construction, if the size of the DFA table exceeds
some maximum value, you can quit, thus maximizing the size of the DFA
state table you'll produce.  Remember, the general size of the DFA table
(unless you cleverly reduce it) is going to be n * input *
sizeof(elements).  For a table of 256 inputs and 4 byte integers
representing states, a 1024 state table requires a megabyte of storage.

SECTION 2: Reducing Memory Requirements for DFA Table

You want to try to allow more states with less memory, without losing any
expressivity in your DFA.  There are several different approaches you
might want to take:

Element Size: You can immediately reduce memory use by 50% by storing
state values as shorts instead of ints.  This caps the table size at 2^16,
since that will be the maximum amount of information a two byte short can
represent.  It might not be a problem, because that still allows some
64000 states.  You could save 75% by storing states as chars, but then you
effectively max out at 2^8 states, which is a small enough table that you
wouldn't have to worry about memory anyway, so the benefits are
eliminated.

Reductions in Input Size: You can try to determine the "important
characters" from your regular expression.  You can save a separate array
of 256 chars, indexed by character.  Set the "important characters"
indices to unique values [0..255], representing the indices of that input
in the DFA table.  

Why?  We can thus reduce the width of the DFA table to the size of the
"important character" set plus one, representing any non-distinguished
characters (or any character).  e.g. Say we have 9 distinguished
characters (our regular expression is "a.b.*c..def.hi.*(j)*.*", for
instance).  Our DFA table thus requires only n * 10 * sizeof(element)
bytes as opposed to n * 256 * sizeof(element) bytes, because all other
characters can be represented by the same transitions.

How to determine or expand on the idea of "important characters":
Characters which appear only parallel to each other in single character
union expressions can be given the same index value, since they will
always transition to the same state.  In general, *any* two input symbols
whose transitions match for all states can be merged and represented by
the same index in the DFA table (they would be given the same value in the
input index array that I mentioned above).  e.g. in "a|b|c", a, b, and c
can be represented by the same input index.

Alternatively, we can save character sets *as* character sets rather than
expanding them, and when an input doesn't match any distinguished
character, you can check character sets to see if values are represented
therein.  This only makes sense if you merge equivalent character sets,
and might be difficult to implement.  

A combination of these approaches may be the best solution, or you may
find a construction that provides a better solution.  

SECTION 3: Other Useful Suggestions and Concepts:

* Graph representations for FA's are directed graphs.  An edge goes from
one node to another, but does not imply any sort of edge going the other
direction.  Thus edge (q,t) is very different than edge (t,q).  If you use
some kind of edgelist in your construction, keep this in mind.

* Once you've constructed a DFA, try to minimize it.  This doesn't speed
up the DFA, but it reduces the size, allowing you to store more DFA's in
memory, if you decide multiple DFAs is important to your search strategy.

* You might try to break up large expressions into smaller ones, and use a
different search strategy for a combination of these machines.  For
instance, you could break r|s, where r and s represent regular
expressions, into two expressions, convert both to DFAs, and minimize, and
then run them in parallel over your input.  This should result in a more
efficient implementation than running an NFA on the larger expression, and
should result in a less memory intensive implementation than the combined
DFA, especially if you can then take advantage of a reduction in input
symbols.

* For very large NFAs or DFAs in which many states have transitions on
relatively few symbols, but you can't reduce the input alphabet, you might
try using a "sparse matrix".  Memory savings could be quite substantial.
There are many ways to implement these.  You could implement it using a
character set to state which symbols are defined for that state, and then
storing in a much smaller array, indexed in some way based on the defined
inputs for that state, the state transitions only for those symbols on
which transitions are defined.

* If, on the other hand, you have lots of transitions on lots of inputs
for each state of your NFA, you might consider representing the
transitions, the set of current states, and the set of final states by
bitfields and using bitwise operators on them to determine set inclusion,
etc. (this is also a fast way to do an NFA search).  The number of bits
required is one per state, so this saves memory only if your next state
representation arrays are larger than these fields on average.  If you
have 1024 states, each bitfield would have 1024 bits, or 128 bytes.  This
is pretty big, considering that you'll have a table of 1024 * 256 *
sizeof(next state structure) bytes.  If you store next states as a pointer
(4 bytes) to an array of shorts, the average number of next states
necessary would be 6 states or more to see any beneficial memory savings
(6 edges out of each state, on average).  The search time should be
better, however, and it's unlikely (though entirely possible) you'll have
NFA's of 1024 states.  (Note that the transition table for an NFA of this
size would be about 32 MB.) 

* Be creative!  There are 1000's of ways to make a really kick-ass DFA
and/or NFA, and many more implementations that make it easier to do
certain kinds of searches.  Think about Part 2 and 3 of the project when
designing your tables.  Think speed.  Think compactness.  Think
flexibility.  Good luck!

SECTION 4: How to Output Your NFA for the End of Part 1:

Note that you may have to adjust this to fit your program structure.  I am
demonstrating by accessing values as if they were a part of some
structure, though you may have to access them differently.

printf("NFA\n");
printf("%s\n", regexpstring);
printf("Start State = %d\n", nfa->start);
printf("Final States = {");
for (i = 0; i < nfa->finalStateNum - 1; i++) {
  printf("%d, ", nfa->finalStates[i]);
}
printf("%d}\n", nfa->finalStates[i]);
for (i = 0; i < nfa->numStates; i++) {
  printf("State q%d:\n", nfa->states[i].name);
  for (j = 0; j < 256; j++) {
    if (isprint(j))
      printf("\tInput '%c': {", (char)j);
    else
      printf("\tInput #%d: {", j);
    for (k = 0; k < nfa->nextstatesize[i][j] - 1; k++) {
      printf("%d, "), nfa->nextstate[i][j][k]);
    printf("%d}\n", nfa->nextstate[i][j][k]);
    }
  }
printf("End NFA\n");

SECTION 5: How to Output Your DFA for the End of Part 1:

printf("DFA\n");
printf("%s\n", regexpstring);
printf("Start State = %d\n", dfa->start);
printf("Final States = {");
for (i = 0; i < dfa->finalStateNum - 1; i++) {
  printf("%d, ", dfa->finalStates[i]);
}
printf("%d}\n", dfa->finalStates[i]);
for (i = 0; i < dfa->numStates; i++) {
  printf("State q%d:\n", dfa->states[i].name);
  for (j = 0; j < 256; j++) {
    if (isprint(j))
      printf("\tInput '%c': ", (char)j);
    else
      printf("\tInput #%d: ", j);
    printf("%d}\n", dfa->nextstate[i][j]);
    }
  }
printf("End DFA\n");

Note on SECTION 4 and SECTION 5: We would like to see this format output.
If you choose to restrict your inputs, please do so here as well.  If you
choose to work with charsets, come up with some way of labelling them and
outputting the charset itself (this function is provided in parser.h)
after you've completed listing out the whold DFA or NFA output, paired
with the label you gave it while printing out the machine.  (and if you
can, try to match the ordering as well).  

We strongly suggest that you do this, so that it makes it easier for us to
grade and test your project.  

SECTION 6: Good luck!

Good luck!

ADDENDUM 1: Programming Hints

A lot of people have been saying that they don't know how to get started,
or what kinds of data structures to use.  I'm going to provide a more
concrete example of how you might actually build some of these things in C
below.  The following code is, again, an example, and moreover, it has
never been tested for ease of use or compiled, even, so there may be
syntax errors. I am providing it as sort of an outline of how *I* would go
about implementing the methods that I outlined above.  I recommend that
you think about these problems *first*, before looking at these proposed
programming approaches, because it's a good intellectual exercise and
important for you to develop your own problem solving approaches.

-Nate

PART 1:
Example Structures:
Here are some example structures you might try, along the lines of the
methods proposed above.

typedef enum { SYM, ANY, E, CSET } EdgeLabelT;
// SYM = symbol
// ANY = any character
// E = epsilon
// CSET = character set

struct node_t;
struct edge_t;

typedef struct node_t {
  int name; // unique identifier for each state

  // The following are dynamic arrays
  struct edge_t *edges_out; 
  struct edge_t **edges_in; // saves you hassle in removing unreachable
	// states.  Save as pointers, so you'll be able to identify the
	// edge structure itself.

  // You'll want to save the allocated size as well to allow dynamic
  	//reallocation.
  int numout, numoutalloc;
  int numin, numinalloc;
} NodeT;

typedef struct edge_t {
  struct node_t *from;
  struct node_t *to;
  EdgeLabelT type;
  char value; // for when type = SYM;
  CharSetT *charset; // for when type = CSET;
} EdgeT;

NodeT *stategraph;
int numstates;
// You can come up with a way to not have to dynamically reallocate the
// stategraph.  What do we know about the number of states, given this
// construction, based on the regular expression input string?


// Table Constructions:
// For both of these tables, I am assuming that your states are going to
// be numbered 0 through n-1, where n is the number of states.  You'll
// want to find a way to rename your NFA states so this is true, or to
// come up with an alternative way of indexing the stategraph.  This
// construction also assumes that you assume a set number of inputs (in
// this case 256).

// NFA

// You'll need a dynamically-reallocatable array of next states:
typedef struct {
  int *nextStates;
  int nextStateNum;
  int nextStateAlloc;
} TransitionT;

typedef struct {
  int start;

  // You'll need a dynamically-reallocatable array of final states:
  int *final;
  int numFinalStates;
  int numFinalStatesAlloc;

  TransitionT *(nextstates[256]); // coming in, you'll know the number of
  	// states.
  int numstates;
} NFATableT;

// DFA:

typedef struct {
  int start;

  int *final;
  int numFinalStates;
  int numFinalStatesAlloc;

  int *(nextStates[256]);
  int numstates;
  int numstatesalloc; // You need to dynamically reallocate.  You won't
	//know how many DFA states you need.
} DFATableT;

PART 2:
How to dynamically-reallocate an array:

First, start by allocating some minimum number of expected states.  If you
make this number large enough, you may never need to dynamically
reallocate anything:

int arrNum = 0;
int arrNumAlloc = 64; 
ArrayTypeT arr = (ArrayTypeT *) malloc (sizeof(ArrayTypeT) * arrNumAlloc);
// in this case, we allocate 64 elements to begin with.

if (!arr) Error("No memory!");
// remember that malloc doesn't check for you if you don't have enough
	// memory.  You can handle this however you'd like.  The best
	// response is probably to bail from the program and try and
	// figure out why this failed, and how you can predict failure and
	// prevent it in the future, for instance by capping the number of
	// states that your DFA or NFA can have (the SUN machines have a
	// lot of memory, but remember that these graphs can be *really*
	// big). 

As you add elements to the array, increment the value of numArr:

arr[arrNum] = element;
arrNum++;

Whenever you increment arrNum, check to see if the new value of arrNum is
equal to the number of allocated elements (it should never be greater than
this number -- you should probably produce an error if it is, because then
you know you've written your code incorrectly).  If it is equal, you need
to reallocate the array to a larger size.  A safe bet is generally to
double the size of the array.

assert(arrNum <= arrNumAlloc);
if (arrNum == arrNumAlloc) {
  arrNumAlloc *= 2;
  arr = (ArrayTypeT *)realloc(arr, sizeof(ArrayTypeT) * arrNumAlloc);
  if (!arr) Error("No Memory!");
}

The function realloc will copy the array contents into a block large
enough to hold it, and return to you a pointer to the new memory.  You
again have to check to see that arr is not equal to NULL, as this is what
realloc, like malloc, returns when there is not enough memory to fulfill
your request.  You can free a block that you have realloc'ed in the same
way that you free a block that you've malloc'ed, namely, with the free
function.

Note that when you are reducing the size of the array (say, removing
epsilon transitions), it does not pay to realloc the array smaller,
because usually realloc just ignores the request and returns the same
pointer.  It shouldn't matter, because the only thing that is important in
the end is the memory you allocate to your NFA and DFA table.  For the NFA
table, you will know the number of states coming in (it will be the same
as the number of states in your graph).  For the DFA table, you should
pick as the maximum number of states your DFA can have a value that is a
power of 2 (if you approach reallocation in this way).  Alternatively,
when you know the true size of the array, you can save memory by copying
it into an array of exactly the size that you needed instead of leaving it
in an array that may be wasting up to 50% of its allocated memory.  

There may be a method in UNIX to find out exactly how much RAM is
available for allocation (I'm not sure, and if there is, it would be
platform dependent, I believe).  I don't *think* there's an ANSI function
to do this, but I could be wrong.  If that's the case, the more ambitious
of you can *dynamically* decide the maximum size of your DFA based on the
amount of memory available!

Good luck!

ADDENDUM 2:
Another Note on Memory:
This came up in section tonight.  On the Unix machines, you have a vast
amount of memory at your disposal.  The RAM is very large.  You can
represent very large DFAs in it.  The problem is that if you have a very
large DFA, it is likely that it will get paged out of main memory by the
virtual memory system, and put onto the disk.  This will probably start
happening long before malloc complains that you don't have enough memory
and returns NULL, which is why checking for malloc being NULL as a test to
your DFA being too large is a bad test.  Why is it bad if main memory
pages out part of your DFA structure?  Because DFAs can have very poor
locality of reference.  What does that mean?  You're not guarenteed that
when you read from one location (say the inupt on a certain state), that
the next state you have to read is going to be anywhere near the original
state in memory.  This is especially true if you don't have your DFA in a
contiguous block.  So it's likely that when you're going from state to
state, you'll move to a different page in memory that may have been paged
out.  This is very bad because it takes about 1000 times as long (that's a
guess, but it seems to be the general number that they were throwing out
in EE182), so it's going to reduce the benefits of having converted to a
DFA in the first place (and after having gone through so much trouble, we
wouldn't want that!).

If anyone wants to try a solution that works around this, one possible
approach would be to attempt to put your DFA in contiguous memory (once
you've built the DFA, allocate a giant block of memory of the size of your
DFA), and copy the rows over to that block, attempting to put rows that
reference each other closer together, to increase locality of reference.
This would result in much extra credit.  Moreover, finding an optimal
solution to locality of reference in a DFA state graph,  I am willing to
bet, is NP complete, so you'd want to find a heuristic for it.  I'm not
sure on that though.  Extra credit also for anyone who can prove that this
is or is not NP complete.  :)  If you try this approach, try the other
approach as well, and compare the running times of your solution, if you
can.  

Good luck!