CS154 - Project Part I Cheat Sheet This document contains a few helpful hints on how to approach Part I of the project. The first section will include a Guide for the Project, outlining the various steps in the conversion from a regular expression to a finite automata, including pseudocode and hints and tips. Note: You do not need to follow the outline provided here. It is an example only. It is not any more "correct" than any other method, nor is it necessarily the most efficient or easiest way to do it, but it may help you in getting an idea of what approaches you might take. SECTION 1: Outline of Construction STEP 1: Convert Regular Expression Into a Parse Tree - this function is supplied for you in parser.h. STEP 2: Convert Parse Tree Into an epsilon-NFA Proposed Method: In-Order Tree Traversal by a Recursive Function: This function assumes you have some way to represent a graph, using node structures and edge structures. It also assumes you've created a node representing a start state, s, and a node representing a final state f, which you pass to the function along with the regular expression parse tree to begin the recursion. The design is up to you, but keep in mind the kind of information you'll need: void RecursiveConvert(RegularExpressionT *t, NodeT *s, NodeT *f) { switch (t->type) { Terminal Cases: (Base Case) Create edge with appropriate label. link edge from s to f Union: Create a new state node, q1, q2. Create a new state node, p1, p2. Create edge with epsilon label, e1, e2. Create edge with epsilon label, f1, f2. link e1 from s to q1. link e2 from s to q2. link f1 from p1 to f. link f2 from p2 to f. RecursiveConvert(t->term, q1, p1); RecursiveConvert(t->next, q2, p2) Concat: Create a new state node, q. Create a new state node, p. Create edge with epsilon label, e. link e from q to p. RecursiveConvert(t->term, s, q); RecursiveConvert(t->next, p, f); Kleene Closure: Create a new state node, q. Create a new state node, p. create edge with epsilon label, e. create edge with epsilon label, f. link e from s to q. link e from p to f. create edge with epsilon label, d. create edge with epsilon labal, g. link d from s to f. link g from p to q. RecursiveConvert(t->term, q, p); ... etc.: (cover the rest of the cases) } } Suggestions: Keep track of the edges into nodes as the edges out - it makes epsilon-removal easier (you can tell when a node has no more edges into it). Remember that each state must be uniquely named - one solution: a global increment variable. You may also find it useful to save all states in an array. STEP 3: Remove epsilon-Transitions Proposed Method: Depth First Graph Traversal by Recursive Function This function assumes that you have a start state, s, and that you've passed it in to begin the recursion. RecursiveEpsilonRemoval(NodeT *s) { If this state is marked return, otherwise mark it and: For each edge out of s, if the label is epsilon, let q = destination if q has not yet been visited in this function, and q != s, then add edges in q not in s to edges of s (add to the end of the list of edges, so that these edges will be checked in this function as well. If q is accepting, let s be accepting as well. Remove the edge. For each edge out of s, let q = destination RecursiveEpsilonRemoval(q); } Suggestions: Remember to remove unreachable states after you've finished, otherwise you'll waste memory. Save each node in some kind of ordered array -- this will simplify things when you convert to a transition table anyway. Tips on how to free stuff (for this function, and in general): The easiest way is not to worry about freeing states on the fly, but rather to save a table of states and then free them all when you're done (you can free all non-reachable states after STEP 3, and then free all the states after STEP 4, along with all remaining edges). In STEP 3, you'll want to free epsilon-edges as you remove them. STEP 4: Convert NFA to transition table Proposed Method: Depth First Graph Traversal by Recursive Function This function assumes that you've created at the root level some structure for representing an NFA transition table, and that the sets of next states have been initialized to being considered empty sets. I'm going to access this structure as one would access a two-dimensional array, though you can have any suitable data structure for this. ConvertToNFATable(NFATableT *t, NodeT *s) { if s has not been visited { Mark s as having been visited. For each e = edge out of s, let q = destination For each v = value of the label of e Add q.name to t->table[s][v] For each edge out of s, let q = destination, ConvertToNFATable(t, q); } } STEP 5: Convert NFA to DFA Transition Table Proposed Method: Lazy Evaluation for Subset Construction on NFA State Table. This function assumes that you've created at the root level some structure for representing a DFA transition table. In this case I'll be using a dynamic 2-dimensional array of values representing the states. This solution also implies some way of labelling each state in the DFA with the NFA nodes that it represents, though this representation can be discarded after construction. Again, this is just pseudocode, so you can approach this problem however you'd like. ConvertToDFATable(DFATableT *d, NFATableT *n) { add row to 2-D array representing NFA state {0} for each q = element in the array of DFA states { for each n = NFA state value in representation of q, { For each a = input value { Fetch S = next state set from NFA table [q][a] If S represents a new DFA state (check array of DFA state representations), create a new row in the DFA table, and add the state representation to list of DFA state representations. Also, if S contains a final state, remember to make the DFA state representing S a final state. Add transition to state represented by S to DFAtable[q][a] } } } } At some point in this construction, if the size of the DFA table exceeds some maximum value, you can quit, thus maximizing the size of the DFA state table you'll produce. Remember, the general size of the DFA table (unless you cleverly reduce it) is going to be n * input * sizeof(elements). For a table of 256 inputs and 4 byte integers representing states, a 1024 state table requires a megabyte of storage. SECTION 2: Reducing Memory Requirements for DFA Table You want to try to allow more states with less memory, without losing any expressivity in your DFA. There are several different approaches you might want to take: Element Size: You can immediately reduce memory use by 50% by storing state values as shorts instead of ints. This caps the table size at 2^16, since that will be the maximum amount of information a two byte short can represent. It might not be a problem, because that still allows some 64000 states. You could save 75% by storing states as chars, but then you effectively max out at 2^8 states, which is a small enough table that you wouldn't have to worry about memory anyway, so the benefits are eliminated. Reductions in Input Size: You can try to determine the "important characters" from your regular expression. You can save a separate array of 256 chars, indexed by character. Set the "important characters" indices to unique values [0..255], representing the indices of that input in the DFA table. Why? We can thus reduce the width of the DFA table to the size of the "important character" set plus one, representing any non-distinguished characters (or any character). e.g. Say we have 9 distinguished characters (our regular expression is "a.b.*c..def.hi.*(j)*.*", for instance). Our DFA table thus requires only n * 10 * sizeof(element) bytes as opposed to n * 256 * sizeof(element) bytes, because all other characters can be represented by the same transitions. How to determine or expand on the idea of "important characters": Characters which appear only parallel to each other in single character union expressions can be given the same index value, since they will always transition to the same state. In general, *any* two input symbols whose transitions match for all states can be merged and represented by the same index in the DFA table (they would be given the same value in the input index array that I mentioned above). e.g. in "a|b|c", a, b, and c can be represented by the same input index. Alternatively, we can save character sets *as* character sets rather than expanding them, and when an input doesn't match any distinguished character, you can check character sets to see if values are represented therein. This only makes sense if you merge equivalent character sets, and might be difficult to implement. A combination of these approaches may be the best solution, or you may find a construction that provides a better solution. SECTION 3: Other Useful Suggestions and Concepts: * Graph representations for FA's are directed graphs. An edge goes from one node to another, but does not imply any sort of edge going the other direction. Thus edge (q,t) is very different than edge (t,q). If you use some kind of edgelist in your construction, keep this in mind. * Once you've constructed a DFA, try to minimize it. This doesn't speed up the DFA, but it reduces the size, allowing you to store more DFA's in memory, if you decide multiple DFAs is important to your search strategy. * You might try to break up large expressions into smaller ones, and use a different search strategy for a combination of these machines. For instance, you could break r|s, where r and s represent regular expressions, into two expressions, convert both to DFAs, and minimize, and then run them in parallel over your input. This should result in a more efficient implementation than running an NFA on the larger expression, and should result in a less memory intensive implementation than the combined DFA, especially if you can then take advantage of a reduction in input symbols. * For very large NFAs or DFAs in which many states have transitions on relatively few symbols, but you can't reduce the input alphabet, you might try using a "sparse matrix". Memory savings could be quite substantial. There are many ways to implement these. You could implement it using a character set to state which symbols are defined for that state, and then storing in a much smaller array, indexed in some way based on the defined inputs for that state, the state transitions only for those symbols on which transitions are defined. * If, on the other hand, you have lots of transitions on lots of inputs for each state of your NFA, you might consider representing the transitions, the set of current states, and the set of final states by bitfields and using bitwise operators on them to determine set inclusion, etc. (this is also a fast way to do an NFA search). The number of bits required is one per state, so this saves memory only if your next state representation arrays are larger than these fields on average. If you have 1024 states, each bitfield would have 1024 bits, or 128 bytes. This is pretty big, considering that you'll have a table of 1024 * 256 * sizeof(next state structure) bytes. If you store next states as a pointer (4 bytes) to an array of shorts, the average number of next states necessary would be 6 states or more to see any beneficial memory savings (6 edges out of each state, on average). The search time should be better, however, and it's unlikely (though entirely possible) you'll have NFA's of 1024 states. (Note that the transition table for an NFA of this size would be about 32 MB.) * Be creative! There are 1000's of ways to make a really kick-ass DFA and/or NFA, and many more implementations that make it easier to do certain kinds of searches. Think about Part 2 and 3 of the project when designing your tables. Think speed. Think compactness. Think flexibility. Good luck! SECTION 4: How to Output Your NFA for the End of Part 1: Note that you may have to adjust this to fit your program structure. I am demonstrating by accessing values as if they were a part of some structure, though you may have to access them differently. printf("NFA\n"); printf("%s\n", regexpstring); printf("Start State = %d\n", nfa->start); printf("Final States = {"); for (i = 0; i < nfa->finalStateNum - 1; i++) { printf("%d, ", nfa->finalStates[i]); } printf("%d}\n", nfa->finalStates[i]); for (i = 0; i < nfa->numStates; i++) { printf("State q%d:\n", nfa->states[i].name); for (j = 0; j < 256; j++) { if (isprint(j)) printf("\tInput '%c': {", (char)j); else printf("\tInput #%d: {", j); for (k = 0; k < nfa->nextstatesize[i][j] - 1; k++) { printf("%d, "), nfa->nextstate[i][j][k]); printf("%d}\n", nfa->nextstate[i][j][k]); } } printf("End NFA\n"); SECTION 5: How to Output Your DFA for the End of Part 1: printf("DFA\n"); printf("%s\n", regexpstring); printf("Start State = %d\n", dfa->start); printf("Final States = {"); for (i = 0; i < dfa->finalStateNum - 1; i++) { printf("%d, ", dfa->finalStates[i]); } printf("%d}\n", dfa->finalStates[i]); for (i = 0; i < dfa->numStates; i++) { printf("State q%d:\n", dfa->states[i].name); for (j = 0; j < 256; j++) { if (isprint(j)) printf("\tInput '%c': ", (char)j); else printf("\tInput #%d: ", j); printf("%d}\n", dfa->nextstate[i][j]); } } printf("End DFA\n"); Note on SECTION 4 and SECTION 5: We would like to see this format output. If you choose to restrict your inputs, please do so here as well. If you choose to work with charsets, come up with some way of labelling them and outputting the charset itself (this function is provided in parser.h) after you've completed listing out the whold DFA or NFA output, paired with the label you gave it while printing out the machine. (and if you can, try to match the ordering as well). We strongly suggest that you do this, so that it makes it easier for us to grade and test your project. SECTION 6: Good luck! Good luck! ADDENDUM 1: Programming Hints A lot of people have been saying that they don't know how to get started, or what kinds of data structures to use. I'm going to provide a more concrete example of how you might actually build some of these things in C below. The following code is, again, an example, and moreover, it has never been tested for ease of use or compiled, even, so there may be syntax errors. I am providing it as sort of an outline of how *I* would go about implementing the methods that I outlined above. I recommend that you think about these problems *first*, before looking at these proposed programming approaches, because it's a good intellectual exercise and important for you to develop your own problem solving approaches. -Nate PART 1: Example Structures: Here are some example structures you might try, along the lines of the methods proposed above. typedef enum { SYM, ANY, E, CSET } EdgeLabelT; // SYM = symbol // ANY = any character // E = epsilon // CSET = character set struct node_t; struct edge_t; typedef struct node_t { int name; // unique identifier for each state // The following are dynamic arrays struct edge_t *edges_out; struct edge_t **edges_in; // saves you hassle in removing unreachable // states. Save as pointers, so you'll be able to identify the // edge structure itself. // You'll want to save the allocated size as well to allow dynamic //reallocation. int numout, numoutalloc; int numin, numinalloc; } NodeT; typedef struct edge_t { struct node_t *from; struct node_t *to; EdgeLabelT type; char value; // for when type = SYM; CharSetT *charset; // for when type = CSET; } EdgeT; NodeT *stategraph; int numstates; // You can come up with a way to not have to dynamically reallocate the // stategraph. What do we know about the number of states, given this // construction, based on the regular expression input string? // Table Constructions: // For both of these tables, I am assuming that your states are going to // be numbered 0 through n-1, where n is the number of states. You'll // want to find a way to rename your NFA states so this is true, or to // come up with an alternative way of indexing the stategraph. This // construction also assumes that you assume a set number of inputs (in // this case 256). // NFA // You'll need a dynamically-reallocatable array of next states: typedef struct { int *nextStates; int nextStateNum; int nextStateAlloc; } TransitionT; typedef struct { int start; // You'll need a dynamically-reallocatable array of final states: int *final; int numFinalStates; int numFinalStatesAlloc; TransitionT *(nextstates[256]); // coming in, you'll know the number of // states. int numstates; } NFATableT; // DFA: typedef struct { int start; int *final; int numFinalStates; int numFinalStatesAlloc; int *(nextStates[256]); int numstates; int numstatesalloc; // You need to dynamically reallocate. You won't //know how many DFA states you need. } DFATableT; PART 2: How to dynamically-reallocate an array: First, start by allocating some minimum number of expected states. If you make this number large enough, you may never need to dynamically reallocate anything: int arrNum = 0; int arrNumAlloc = 64; ArrayTypeT arr = (ArrayTypeT *) malloc (sizeof(ArrayTypeT) * arrNumAlloc); // in this case, we allocate 64 elements to begin with. if (!arr) Error("No memory!"); // remember that malloc doesn't check for you if you don't have enough // memory. You can handle this however you'd like. The best // response is probably to bail from the program and try and // figure out why this failed, and how you can predict failure and // prevent it in the future, for instance by capping the number of // states that your DFA or NFA can have (the SUN machines have a // lot of memory, but remember that these graphs can be *really* // big). As you add elements to the array, increment the value of numArr: arr[arrNum] = element; arrNum++; Whenever you increment arrNum, check to see if the new value of arrNum is equal to the number of allocated elements (it should never be greater than this number -- you should probably produce an error if it is, because then you know you've written your code incorrectly). If it is equal, you need to reallocate the array to a larger size. A safe bet is generally to double the size of the array. assert(arrNum <= arrNumAlloc); if (arrNum == arrNumAlloc) { arrNumAlloc *= 2; arr = (ArrayTypeT *)realloc(arr, sizeof(ArrayTypeT) * arrNumAlloc); if (!arr) Error("No Memory!"); } The function realloc will copy the array contents into a block large enough to hold it, and return to you a pointer to the new memory. You again have to check to see that arr is not equal to NULL, as this is what realloc, like malloc, returns when there is not enough memory to fulfill your request. You can free a block that you have realloc'ed in the same way that you free a block that you've malloc'ed, namely, with the free function. Note that when you are reducing the size of the array (say, removing epsilon transitions), it does not pay to realloc the array smaller, because usually realloc just ignores the request and returns the same pointer. It shouldn't matter, because the only thing that is important in the end is the memory you allocate to your NFA and DFA table. For the NFA table, you will know the number of states coming in (it will be the same as the number of states in your graph). For the DFA table, you should pick as the maximum number of states your DFA can have a value that is a power of 2 (if you approach reallocation in this way). Alternatively, when you know the true size of the array, you can save memory by copying it into an array of exactly the size that you needed instead of leaving it in an array that may be wasting up to 50% of its allocated memory. There may be a method in UNIX to find out exactly how much RAM is available for allocation (I'm not sure, and if there is, it would be platform dependent, I believe). I don't *think* there's an ANSI function to do this, but I could be wrong. If that's the case, the more ambitious of you can *dynamically* decide the maximum size of your DFA based on the amount of memory available! Good luck! ADDENDUM 2: Another Note on Memory: This came up in section tonight. On the Unix machines, you have a vast amount of memory at your disposal. The RAM is very large. You can represent very large DFAs in it. The problem is that if you have a very large DFA, it is likely that it will get paged out of main memory by the virtual memory system, and put onto the disk. This will probably start happening long before malloc complains that you don't have enough memory and returns NULL, which is why checking for malloc being NULL as a test to your DFA being too large is a bad test. Why is it bad if main memory pages out part of your DFA structure? Because DFAs can have very poor locality of reference. What does that mean? You're not guarenteed that when you read from one location (say the inupt on a certain state), that the next state you have to read is going to be anywhere near the original state in memory. This is especially true if you don't have your DFA in a contiguous block. So it's likely that when you're going from state to state, you'll move to a different page in memory that may have been paged out. This is very bad because it takes about 1000 times as long (that's a guess, but it seems to be the general number that they were throwing out in EE182), so it's going to reduce the benefits of having converted to a DFA in the first place (and after having gone through so much trouble, we wouldn't want that!). If anyone wants to try a solution that works around this, one possible approach would be to attempt to put your DFA in contiguous memory (once you've built the DFA, allocate a giant block of memory of the size of your DFA), and copy the rows over to that block, attempting to put rows that reference each other closer together, to increase locality of reference. This would result in much extra credit. Moreover, finding an optimal solution to locality of reference in a DFA state graph, I am willing to bet, is NP complete, so you'd want to find a heuristic for it. I'm not sure on that though. Extra credit also for anyone who can prove that this is or is not NP complete. :) If you try this approach, try the other approach as well, and compare the running times of your solution, if you can. Good luck!