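NFAs don't seem all that important; they feel like something for mathematicians to care about. For a language whose lexical rules aren't pathologically complex, drawing its DFA directly isn't hard. The price of bypassing regular expressions, though, is that initializing the DFA by hand becomes very important.
The previous post, "Goodbye, Regular Expressions", gave a small example of wiring up the DFA's state network; this post finishes off DFA initialization.
First we need some states. We don't actually need to know exactly how many, or even to have the DFA fully drawn; a rough estimate will do. Then write down statements like these: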
#define NR_STATES (64)
struct State* jerryStates;
jerryStates = (struct State*)malloc(NR_STATES * sizeof(struct State));
memset(jerryStates, 0, NR_STATES * sizeof(struct State));
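The definition of struct State comes from the previous post and is not repeated here. As a rough sketch of what the snippets in this post appear to assume, each state could hold an accept type plus a 256-entry transition table indexed by byte value. The enumerator list, the table size, and the helper name allocStates are assumptions made for illustration, not the original definitions.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed accept types; the real list lives in the previous post. */
typedef enum { DENY = 0, SKIP, INTEGER, REAL, IDENT } AcceptType;

/* Assumed layout: one transition slot per possible byte value. */
struct State {
    AcceptType type;
    struct State* nextState[256];
};

/* Hypothetical helper: allocate `count` zeroed states.
   Zeroed memory means every transition starts out as NULL ("no edge")
   on platforms where a null pointer is all-bits-zero, which is the
   same assumption the malloc + memset version makes. */
static struct State* allocStates(size_t count)
{
    struct State* states = calloc(count, sizeof(struct State));
    assert(states != NULL);  /* sketch only: real code should handle failure */
    return states;
}
```

calloc is used here only because it combines the malloc and memset steps; the behavior is otherwise the same as the two statements above.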
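jerryStates points to an array. Since the DFA hasn't been drawn, it's inconvenient to assign each state a number by hand (and even with a drawing it would be tedious). So we use another approach: keep a sequence-number variable and increment it each time a state is allocated. Here is a short demonstration that initializes the states for recognizing integers and real numbers: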
int stateNr = 0, s;
integer = jerryStates + (++stateNr);
realnum = jerryStates + (++stateNr);
integer->type = INTEGER;
realnum->type = REAL;
for(s = '0'; s <= '9'; ++s) {
    jerryStates->nextState[s] = integer;
    integer->nextState[s] = integer;
    realnum->nextState[s] = realnum;
}
jerryStates->nextState['.'] = realnum;
integer->nextState['.'] = realnum;
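stateNr is the sequence number. The code first allocates the states, then sets their types, and then builds the transitions (everything from the loop to the end). With that, the DFA can recognize integer and real constants.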
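To see what this table accepts, here is a minimal sketch that builds only the number part and walks it over a string. The struct layout, the reduced enum, and the names buildNumberDfa and classify are assumptions for illustration; classify simply consumes the whole string and reports the type of the state it ends in, or DENY if a transition is missing.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { DENY = 0, INTEGER, REAL } AcceptType;  /* assumed subset */

struct State {                        /* assumed layout */
    AcceptType type;
    struct State* nextState[256];
};

/* Build only the number-recognizing states described above. */
static struct State* buildNumberDfa(void)
{
    int stateNr = 0, s;
    struct State* states = calloc(3, sizeof(struct State));
    struct State* integer;
    struct State* realnum;

    assert(states != NULL);
    integer = states + (++stateNr);
    realnum = states + (++stateNr);
    states[0].type = DENY;
    integer->type = INTEGER;
    realnum->type = REAL;
    for (s = '0'; s <= '9'; ++s) {
        states->nextState[s] = integer;
        integer->nextState[s] = integer;
        realnum->nextState[s] = realnum;
    }
    states->nextState['.'] = realnum;
    integer->nextState['.'] = realnum;
    return states;
}

/* Hypothetical driver: feed the whole string and return the final
   state's type, or DENY as soon as no transition exists. */
static AcceptType classify(const struct State* start, const char* text)
{
    const struct State* cur = start;
    for (; *text != '\0'; ++text) {
        cur = cur->nextState[(unsigned char)*text];
        if (cur == NULL) {
            return DENY;
        }
    }
    return cur->type;
}
```

With these transitions, "42" ends in the INTEGER state and "3.14" in the REAL state. Note that a lone "." also ends in the REAL state, because the initial state moves straight to realnum on a dot; a stricter lexer might want an extra intermediate state there.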
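Identifiers are handled much the same way. Keywords, in turn, look just like identifiers, so here every keyword is treated as an identifier. Later stages of analysis will, in some way, look the identifier up in a keyword table to decide whether it is a keyword, and handle it accordingly.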
identifier = jerryStates + (++stateNr);
identifier->type = IDENT;
for(s = 'a'; s <= 'z'; ++s) {
    jerryStates->nextState[s] = identifier;
    identifier->nextState[s] = identifier;
}
for(s = 'A'; s <= 'Z'; ++s) {
    jerryStates->nextState[s] = identifier;
    identifier->nextState[s] = identifier;
}
for(s = '0'; s <= '9'; ++s) {
    identifier->nextState[s] = identifier;
}
jerryStates->nextState['_'] = identifier->nextState['_'] = identifier;
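The keyword lookup mentioned above is left for later, but one simple way it could work is a linear scan over a small table once the DFA has accepted an identifier. This is only a sketch: the TokenKind enum, the three sample keywords, and the function name keywordOrIdent are all made up for illustration and are not taken from the actual lexer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds: IDENT plus a few illustrative keywords. */
typedef enum { TK_IDENT, TK_IF, TK_ELSE, TK_WHILE } TokenKind;

static const struct {
    const char* word;
    TokenKind kind;
} KEYWORDS[] = {
    {"if", TK_IF}, {"else", TK_ELSE}, {"while", TK_WHILE}, {NULL, TK_IDENT}
};

/* Reclassify a lexeme of `length` bytes that the DFA accepted as an
   identifier. The lexeme need not be NUL-terminated at `length`. */
static TokenKind keywordOrIdent(const char* lexeme, size_t length)
{
    size_t i;
    assert(lexeme != NULL);
    for (i = 0; KEYWORDS[i].word != NULL; ++i) {
        if (strlen(KEYWORDS[i].word) == length
            && memcmp(KEYWORDS[i].word, lexeme, length) == 0) {
            return KEYWORDS[i].kind;
        }
    }
    return TK_IDENT;
}
```

A linear scan is fine for a handful of keywords; a sorted table with binary search or a hash table would be the usual next step if the list grew.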
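As for comments, multi-line comments in particular, I've always felt that writing the regular expression is more trouble than constructing the automaton directly...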
("/*") ( (~("*")) | (("*")+ ~("*")) )* ("*")+ ("/")
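The regular expression for a multi-line comment seems to be the one above; the automaton, by contrast, has 3 states, as follows:
From the initial state (jerryStates[0]), the input "/*" leads to the multi-line comment start state (commentMultiLineStart). From there, an asterisk moves to the second multi-line comment state (commentMultiLine2, an odd name, admittedly), while any other character loops back to the same state. The second state is similar: an asterisk keeps it in place and a slash moves it to the comment accepting state (comment), but any other character sends it back to commentMultiLineStart. The process isn't complicated: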
commentInLineStart = jerryStates + (++stateNr);
commentMultiLineStart = jerryStates + (++stateNr);
commentMultiLine2 = jerryStates + (++stateNr);
comment = jerryStates + (++stateNr);
commentInLineStart->type = commentMultiLineStart->type = commentMultiLine2->type = DENY;
comment->type = SKIP;
/* jerryStates->nextState['/'] must already exist here: it is the '/' state
   created by the symbol initialization shown further below. */
jerryStates->nextState['/']->nextState['/'] = commentInLineStart;
jerryStates->nextState['/']->nextState['*'] = commentMultiLineStart;
for(s = 0; s < (1 << 8); ++s) {
    commentInLineStart->nextState[s] = commentInLineStart;
    commentMultiLineStart->nextState[s] = commentMultiLineStart;
    commentMultiLine2->nextState[s] = commentMultiLineStart;
}
commentInLineStart->nextState['\n'] = comment;
commentMultiLineStart->nextState['*'] = commentMultiLine2;
commentMultiLine2->nextState['*'] = commentMultiLine2;
commentMultiLine2->nextState['/'] = comment;
commentInLineStart represents the start of a single-line comment ("//").
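The comment states can be tried out in isolation with a sketch like the following. It creates the '/' state by hand (in the full function the symbol table does that), then wires the comment transitions as described. The struct layout, the reduced enum, the shortened local names, and the functions buildCommentDfa and classify are assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { DENY = 0, SKIP, DIVIDE } AcceptType;  /* assumed subset */

struct State {                        /* assumed layout */
    AcceptType type;
    struct State* nextState[256];
};

static struct State* buildCommentDfa(void)
{
    int stateNr = 0, s;
    struct State* states = calloc(6, sizeof(struct State));
    struct State *divide, *lineStart, *multiStart, *multi2, *comment;

    assert(states != NULL);
    /* The '/' state has to exist before the comment states hang off it. */
    divide = states + (++stateNr);
    divide->type = DIVIDE;
    states->nextState['/'] = divide;

    lineStart = states + (++stateNr);
    multiStart = states + (++stateNr);
    multi2 = states + (++stateNr);
    comment = states + (++stateNr);
    comment->type = SKIP;             /* the others stay DENY (0) from calloc */

    divide->nextState['/'] = lineStart;
    divide->nextState['*'] = multiStart;
    for (s = 0; s < 256; ++s) {       /* defaults, overridden just below */
        lineStart->nextState[s] = lineStart;
        multiStart->nextState[s] = multiStart;
        multi2->nextState[s] = multiStart;
    }
    lineStart->nextState['\n'] = comment;
    multiStart->nextState['*'] = multi2;
    multi2->nextState['*'] = multi2;
    multi2->nextState['/'] = comment;
    return states;
}

/* Hypothetical driver: final state's type, or DENY on a missing edge. */
static AcceptType classify(const struct State* start, const char* text)
{
    const struct State* cur = start;
    for (; *text != '\0'; ++text) {
        cur = cur->nextState[(unsigned char)*text];
        if (cur == NULL) {
            return DENY;
        }
    }
    return cur->type;
}
```

Feeding it "/* a * b **/" ends in the comment state, since the stray asterisk followed by a space falls back to the start state and the closing run of asterisks stays in the second state until the slash arrives. An unterminated "/* open" stays in a DENY state.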
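Finally come the tricky symbols. These fiddly little things are not only numerous but also vary in length. An implementation like this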
struct State* plus = jerryStates + (++i);
struct State* minus = jerryStates + (++i);
struct State* multiply = jerryStates + (++i);
struct State* divide = jerryStates + (++i);
jerryStates[0].nextState['+'] = plus;
jerryStates[0].nextState['-'] = minus;
jerryStates[0].nextState['*'] = multiply;
jerryStates[0].nextState['/'] = divide;
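would obviously be painful. So after some deliberation I decided to do it in a more automated way: bind each symbol to its accept type in a struct, then walk an array of such structs to initialize the symbols. Symbols of length 2 can be handled the same way: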
struct {
    const char* symbol;
    AcceptType type;
} SYMS[] = {
    {"+", PLUS}, {"-", MINUS}, {"*", MULTIPLY}, {"/", DIVIDE},
    {"=", ASSIGN}, {"!", NOT}, {"<", LT}, {">", GT},
    {";", EOS}, {",", COMMA},
    {"(", LPARENT}, {")", RPARENT},
    {"[", LBRACKET}, {"]", RBRACKET},
    {"{", LBRACE}, {"}", RBRACE},
    {"&", DENY}, {"|", DENY},
    {"==", EQ}, {"<=", LE}, {">=", GE}, {"!=", NE},
    {"&&", AND}, {"||", OR},
    {NULL, DENY}
};
struct State* iter;
/* s is reset explicitly because the earlier snippets reuse it as a loop counter */
for(s = 0; NULL != SYMS[s].symbol; ++s) {
    iter = jerryStates;
    // printf("--INFO-- %d\n", s);
    for(character = SYMS[s].symbol; *character; ++character) {
        // printf("---CHAR-- %c %d\n", *character, SYMS[s].type);
        if(NULL == iter->nextState[(int)*character]) {
            iter->nextState[(int)*character] = jerryStates + (++stateNr);
            iter->nextState[(int)*character]->type = SYMS[s].type;
            // printf("---APPEND-- %c %d %d\n", *character, stateNr, SYMS[s].type);
        }
        iter = iter->nextState[(int)*character];
    }
}
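The outer loop walks the struct array, while the inner loop chains the state that accepts a length-2 symbol after the state that accepts its length-1 prefix. Because a newly created state takes the type of the entry being processed, each single-character symbol must appear in the table before any two-character symbol that starts with it. With that, the automaton for recognizing symbols is done.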
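A cut-down version of the table makes the chaining easier to follow. The sketch below uses only eight of the symbols and checks where a few inputs end up; the struct layout, the reduced enum, the 16-state capacity, and the names buildSymbolDfa and classify are assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed subset of the accept types. */
typedef enum { DENY = 0, ASSIGN, NOT, LT, EQ, LE, NE, AND } AcceptType;

struct State {                        /* assumed layout */
    AcceptType type;
    struct State* nextState[256];
};

enum { CAPACITY = 16 };               /* hypothetical, enough for 8 symbols */

static struct State* buildSymbolDfa(void)
{
    static const struct { const char* symbol; AcceptType type; } SYMS[] = {
        {"=", ASSIGN}, {"!", NOT}, {"<", LT}, {"&", DENY},
        {"==", EQ}, {"<=", LE}, {"!=", NE}, {"&&", AND},
        {NULL, DENY}
    };
    int stateNr = 0, s;
    const char* ch;
    struct State* iter;
    struct State* states = calloc(CAPACITY, sizeof(struct State));

    assert(states != NULL);
    for (s = 0; SYMS[s].symbol != NULL; ++s) {
        iter = states;                /* every symbol starts at state 0 */
        for (ch = SYMS[s].symbol; *ch != '\0'; ++ch) {
            unsigned char c = (unsigned char)*ch;
            if (iter->nextState[c] == NULL) {
                assert(stateNr + 1 < CAPACITY);   /* guard the array bound */
                iter->nextState[c] = states + (++stateNr);
                iter->nextState[c]->type = SYMS[s].type;
            }
            iter = iter->nextState[c];
        }
    }
    return states;
}

/* Hypothetical driver: final state's type, or DENY on a missing edge. */
static AcceptType classify(const struct State* start, const char* text)
{
    const struct State* cur = start;
    for (; *text != '\0'; ++text) {
        cur = cur->nextState[(unsigned char)*text];
        if (cur == NULL) {
            return DENY;
        }
    }
    return cur->type;
}
```

Here "<" lands in the LT state and "<=" continues from it into the LE state, while a single "&" sits in a DENY state that only becomes useful once the second "&" arrives.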
Appendix: the initialization function with its related variables and macros
/* dfa.c */
#include <stdlib.h> /* malloc */
#include <string.h> /* memset */

#define NR_STATES (64)

/* jerryStates is the file-scope pointer declared earlier */
static struct State* initStates(void)
{
    int stateNr = 0, s = 0;
    const char* character;
    struct {
        const char* symbol;
        AcceptType type;
    } SYMS[] = {
        {"+", PLUS}, {"-", MINUS}, {"*", MULTIPLY}, {"/", DIVIDE},
        {"=", ASSIGN}, {"!", NOT}, {"<", LT}, {">", GT},
        {";", EOS}, {",", COMMA},
        {"(", LPARENT}, {")", RPARENT},
        {"[", LBRACKET}, {"]", RBRACKET},
        {"{", LBRACE}, {"}", RBRACE},
        {"&", DENY}, {"|", DENY},
        {"==", EQ}, {"<=", LE}, {">=", GE}, {"!=", NE},
        {"&&", AND}, {"||", OR},
        {NULL, DENY}
    };
    struct State* iter;
    struct State* commentInLineStart,* commentMultiLineStart,
                * commentMultiLine2,* comment;
    struct State* space;
    struct State* integer,* realnum;
    struct State* identifier;

    jerryStates = (struct State*)malloc(NR_STATES * sizeof(struct State));
    memset(jerryStates, 0, NR_STATES * sizeof(struct State));
    jerryStates[0].type = DENY;

    /* symbols: must run first, the comment states hang off the '/' state */
    for(s = 0; NULL != SYMS[s].symbol; ++s) {
        iter = jerryStates;
        // printf("--INFO-- %d\n", s);
        for(character = SYMS[s].symbol; *character; ++character) {
            // printf("---CHAR-- %c %d\n", *character, SYMS[s].type);
            if(NULL == iter->nextState[(int)*character]) {
                iter->nextState[(int)*character] = jerryStates + (++stateNr);
                iter->nextState[(int)*character]->type = SYMS[s].type;
                // printf("---APPEND-- %c %d %d\n", *character, stateNr, SYMS[s].type);
            }
            iter = iter->nextState[(int)*character];
        }
    }

    /* comments */
    commentInLineStart = jerryStates + (++stateNr);
    commentMultiLineStart = jerryStates + (++stateNr);
    commentMultiLine2 = jerryStates + (++stateNr);
    comment = jerryStates + (++stateNr);
    commentInLineStart->type = commentMultiLineStart->type = commentMultiLine2->type = DENY;
    comment->type = SKIP;
    jerryStates->nextState['/']->nextState['/'] = commentInLineStart;
    jerryStates->nextState['/']->nextState['*'] = commentMultiLineStart;
    for(s = 0; s < (1 << 8); ++s) {
        commentInLineStart->nextState[s] = commentInLineStart;
        commentMultiLineStart->nextState[s] = commentMultiLineStart;
        commentMultiLine2->nextState[s] = commentMultiLineStart;
    }
    commentInLineStart->nextState['\n'] = comment;
    commentMultiLineStart->nextState['*'] = commentMultiLine2;
    commentMultiLine2->nextState['*'] = commentMultiLine2;
    commentMultiLine2->nextState['/'] = comment;

    /* identifiers (keywords included) */
    identifier = jerryStates + (++stateNr);
    identifier->type = IDENT;
    for(s = 'a'; s <= 'z'; ++s) {
        jerryStates->nextState[s] = identifier;
        identifier->nextState[s] = identifier;
    }
    for(s = 'A'; s <= 'Z'; ++s) {
        jerryStates->nextState[s] = identifier;
        identifier->nextState[s] = identifier;
    }
    jerryStates->nextState['_'] = identifier->nextState['_'] = identifier;

    /* integers and reals; digits may also continue an identifier */
    integer = jerryStates + (++stateNr);
    realnum = jerryStates + (++stateNr);
    integer->type = INTEGER;
    realnum->type = REAL;
    for(s = '0'; s <= '9'; ++s) {
        jerryStates->nextState[s] = integer;
        integer->nextState[s] = integer;
        realnum->nextState[s] = realnum;
        identifier->nextState[s] = identifier;
    }
    jerryStates->nextState['.'] = realnum;
    integer->nextState['.'] = realnum;

    /* whitespace */
    space = jerryStates + (++stateNr);
    space->type = SKIP;
    jerryStates->nextState[' '] = space;
    jerryStates->nextState['\t'] = space;
    jerryStates->nextState['\r'] = space;
    jerryStates->nextState['\n'] = space;
    space->nextState[' '] = space;
    space->nextState['\t'] = space;
    space->nextState['\r'] = space;
    space->nextState['\n'] = space;

    // printf("--INFO-- DFA built. %d\n", stateNr);
    return jerryStates;
}
#undef NR_STATES
The commented-out lines are all debug output statements.