[Lexical Analysis] Initializing the DFA

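NFAs do not seem all that important; they are more the kind of thing mathematicians care about. For a language whose lexical rules are not pathologically complex, drawing its DFA directly is not hard. The price of bypassing regular expressions is that initializing the DFA by hand becomes a significant piece of work.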

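The previous article, "Goodbye, Regular Expressions" (《再见,正则表达式》), gave a small example of building the DFA's state transition network. This article finishes the DFA initialization.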

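First we need some states. There is no need to know exactly how many, or even to have drawn the DFA completely. A rough estimate will do, followed by statements like these: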

#define NR_STATES (64)
struct State* jerryStates;    
jerryStates = (struct State*)malloc(NR_STATES * sizeof(struct State));
memset(jerryStates, 0, NR_STATES * sizeof(struct State));

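jerryStates points to an array. Because the DFA has not been drawn, it is inconvenient to assign each state a fixed index (and even with a drawing, doing so would be tedious). Instead we keep a serial number variable and increment it every time a state is allocated. Below is a short demonstration that initializes the states recognizing integers and real numbers: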

    int stateNr = 0, s;

    integer = jerryStates + (++stateNr);
    realnum = jerryStates + (++stateNr);
    integer->type = INTEGER;
    realnum->type = REAL;
    for(s = '0'; s <= '9'; ++s) {
        jerryStates->nextState[s] = integer;
        integer->nextState[s] = integer;
        realnum->nextState[s] = realnum;
    }
    jerryStates->nextState['.'] = realnum;
    integer->nextState['.'] = realnum;

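stateNr is the serial number. The states are allocated first, then their types are set, and everything from the loop to the end establishes the transitions. With this, the DFA can recognize integer constants and real constants.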
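To see this wiring in action, here is a minimal self-contained sketch. The layout of `struct State` (a `type` field plus a 256-entry `nextState` table), the reduced `AcceptType` enumeration, and the helpers `newState`, `buildNumberDfa` and `classify` are all assumptions made for illustration; the real definitions come from the previous article and may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed definitions; the real ones come from the previous article. */
typedef enum { DENY = 0, INTEGER, REAL } AcceptType;

struct State {
    AcceptType type;
    struct State* nextState[256];
};

#define NR_STATES (64)

/* Hypothetical helper: hand out the next unused state, guarding the bound. */
static struct State* newState(struct State* pool, int* stateNr)
{
    assert(*stateNr + 1 < NR_STATES);
    return pool + (++*stateNr);
}

/* Build only the integer/real part of the DFA, wired as in the snippet above. */
static struct State* buildNumberDfa(void)
{
    int stateNr = 0, s;
    struct State* integer;
    struct State* realnum;
    /* calloc zero-fills the pool: no transitions yet, every type is DENY (0) */
    struct State* pool = (struct State*)calloc(NR_STATES, sizeof(struct State));

    if (NULL == pool) {
        return NULL;
    }
    integer = newState(pool, &stateNr);
    realnum = newState(pool, &stateNr);
    integer->type = INTEGER;
    realnum->type = REAL;
    for (s = '0'; s <= '9'; ++s) {
        pool->nextState[s] = integer;
        integer->nextState[s] = integer;
        realnum->nextState[s] = realnum;
    }
    pool->nextState['.'] = realnum;
    integer->nextState['.'] = realnum;
    return pool;
}

/* Walk the whole string and report the type of the state we end in. */
static AcceptType classify(const struct State* start, const char* text)
{
    const struct State* cur = start;

    for (; *text; ++text) {
        cur = cur->nextState[(unsigned char)*text];
        if (NULL == cur) {
            return DENY; /* no transition for this character: reject */
        }
    }
    return cur->type;
}
```

With this sketch, `classify(dfa, "123")` ends in the integer state and `classify(dfa, "3.14")` in the real-number state, while `"12a"` and `"1.2.3"` are rejected. Note that a lone `"."` also ends in `realnum` with this wiring, so a later stage would have to reject it if that is not wanted.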

 

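Identifiers work much the same way, and keywords look much like identifiers, so here every keyword is treated as an identifier. A later stage of analysis then looks the identifier up in a keyword table to decide whether it is a keyword, and handles it accordingly.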

    identifier = jerryStates + (++stateNr);
    identifier->type = IDENT;
    for(s = 'a'; s <= 'z'; ++s) {
        jerryStates->nextState[s] = identifier;
        identifier->nextState[s] = identifier;
    }
    for(s = 'A'; s <= 'Z'; ++s) {
        jerryStates->nextState[s] = identifier;
        identifier->nextState[s] = identifier;
    }
    for(s = '0'; s <= '9'; ++s) {
        identifier->nextState[s] = identifier;
    }

    jerryStates->nextState['_'] = identifier->nextState['_'] = identifier;
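The keyword table lookup mentioned before this snippet is not shown in the article. A minimal sketch of one possible shape follows. The keyword list (`if`, `else`, `while`), the `TokenKind` enumeration and the function `lookupKeyword` are hypothetical; only IDENT appears in the article itself.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds; the article only names IDENT. */
typedef enum { IDENT = 0, KW_IF, KW_ELSE, KW_WHILE } TokenKind;

/* NULL-terminated table, in the same style as the symbol table. */
static const struct {
    const char* image;
    TokenKind kind;
} KEYWORDS[] = {
    {"if", KW_IF}, {"else", KW_ELSE}, {"while", KW_WHILE},
    {NULL, IDENT}
};

/* Linear search over the table; a lexeme that is not listed stays IDENT. */
static TokenKind lookupKeyword(const char* image)
{
    int i;

    for (i = 0; NULL != KEYWORDS[i].image; ++i) {
        if (0 == strcmp(KEYWORDS[i].image, image)) {
            return KEYWORDS[i].kind;
        }
    }
    return IDENT;
}
```

The DFA stays small this way: it only needs the single `identifier` state, and the distinction between `while` and `whileX` is made after the lexeme has been read. A linear search is enough for a handful of keywords; a sorted table or a hash would be an alternative for larger sets.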

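As for comments, multi-line comments in particular, I have always felt that writing the regular expression is more trouble than constructing the automaton directly...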

("/*") ( (~("*")) | (("*")+ ~("*" | "/")) )* ("*")+ ("/")

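The regular expression for a multi-line comment seems to be something like this, whereas the automaton has just 3 states, as follows: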

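From the initial state (jerryStates[0]), "/*" leads to the multi-line comment head state (commentMultiLineStart). In that state an asterisk moves to the second multi-line comment state (commentMultiLine2, an odd name, admittedly), while any other character loops back to the state itself. The second state stays where it is on an asterisk, moves to the comment accepting state (comment) on a slash, and returns to commentMultiLineStart on any other character. The code reaches the "/" state through jerryStates->nextState['/'], so that state must already exist; this is why the full function initializes the symbols first. The process is not complicated: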

    commentInLineStart = jerryStates + (++stateNr);
    commentMultiLineStart = jerryStates + (++stateNr);
    commentMultiLine2 = jerryStates + (++stateNr);
    comment = jerryStates + (++stateNr);
    commentInLineStart->type = commentMultiLineStart->type
                             = commentMultiLine2->type = DENY;
    comment->type = SKIP;

    jerryStates->nextState['/']->nextState['/'] = commentInLineStart;
    jerryStates->nextState['/']->nextState['*'] = commentMultiLineStart;

    for(s = 0; s < (1 << 8); ++s) {
        commentInLineStart->nextState[s] = commentInLineStart;
        commentMultiLineStart->nextState[s] = commentMultiLineStart;
        commentMultiLine2->nextState[s] = commentMultiLineStart;
    }
    commentInLineStart->nextState['\n'] = comment;
    commentMultiLineStart->nextState['*'] = commentMultiLine2;
    commentMultiLine2->nextState['*'] = commentMultiLine2;
    commentMultiLine2->nextState['/'] = comment;

commentInLineStart represents the head of a single-line comment ("//").
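The multi-line comment automaton can also be checked in isolation. The sketch below encodes the same states as a `switch` instead of a transition table; the function `isBlockComment` and its local state names are hypothetical, with comments noting which state of the article each one mirrors.

```c
#include <stddef.h>

/* Returns 1 when text is exactly one complete block comment, 0 otherwise. */
static int isBlockComment(const char* text)
{
    enum { START, SLASH, BODY, STARS, DONE } state = START;

    for (; *text; ++text) {
        switch (state) {
        case START: /* initial state: a comment can only begin with '/' */
            if ('/' != *text) return 0;
            state = SLASH;
            break;
        case SLASH: /* seen "/", need '*' to open the comment */
            if ('*' != *text) return 0;
            state = BODY;
            break;
        case BODY:  /* commentMultiLineStart: loop until a '*' shows up */
            if ('*' == *text) state = STARS;
            break;
        case STARS: /* commentMultiLine2: one or more '*' just seen */
            if ('/' == *text) state = DONE;
            else if ('*' != *text) state = BODY;
            break;
        case DONE:  /* comment: nothing may follow the closing slash */
            return 0;
        }
    }
    return DONE == state;
}
```

For instance, `"/* a * b */"` and `"/***/"` are accepted, while `"/*/"` and an unterminated `"/* a"` are not. The `STARS` case is the part that is awkward to express in the regular expression: a run of asterisks either ends the comment on `/` or falls back to the body on any other character.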

 

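Last come the troublesome symbols: these fiddly little things are not only numerous but also vary in length. An implementation like this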

    struct State* plus = jerryStates + (++stateNr);
    struct State* minus = jerryStates + (++stateNr);
    struct State* multiply = jerryStates + (++stateNr);
    struct State* divide = jerryStates + (++stateNr);

    jerryStates[0].nextState['+'] = plus;
    jerryStates[0].nextState['-'] = minus;
    jerryStates[0].nextState['*'] = multiply;
    jerryStates[0].nextState['/'] = divide;

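would obviously be painful. After some thought I decided on a more automated approach: bind each symbol to its accept type in a structure, then walk an array of such structures to initialize the symbols. Two-character symbols can be handled the same way: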

    struct {
        char* symbol;
        AcceptType type;
    } SYMS[] = {
        {"+", PLUS}, {"-", MINUS}, {"*", MULTIPLY}, {"/", DIVIDE},
        {"=", ASSIGN}, {"!", NOT}, {"<", LT}, {">", GT}, {";", EOS},
        {",", COMMA}, {"(", LPARENT},
        {")", RPARENT}, {"[", LBRACKET}, {"]", RBRACKET}, {"{", LBRACE},
        {"}", RBRACE}, {"&", DENY}, {"|", DENY},
        {"==", EQ}, {"<=", LE}, {">=", GE}, {"!=", NE},
        {"&&", AND}, {"||", OR},
        {NULL, DENY}
    };
    struct State* iter;

    for(s = 0; NULL != SYMS[s].symbol; ++s) {
        iter = jerryStates;
//        printf("--INFO-- %d\n", s);
        for(character = SYMS[s].symbol; *character; ++character) {
//            printf("---CHAR-- %c %d\n", *character, SYMS[s].type);
            if(NULL == iter->nextState[(int)*character]) {
                iter->nextState[(int)*character] = jerryStates + (++stateNr);
                iter->nextState[(int)*character]->type = SYMS[s].type;
//                printf("---APPEND-- %c %d %d\n", *character, stateNr, SYMS[s].type);
            }
            iter = iter->nextState[(int)*character];
        }
    }

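The outer loop walks the structure array, while the inner loop chains the state that accepts a two-character symbol after the state that accepts its one-character prefix. That completes the automaton for recognizing symbols.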
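One detail of this loop is worth spelling out: a newly created state takes the type of the symbol being inserted, so a one-character prefix such as "=" or "&" has to be listed before the two-character symbol that extends it, as the table does. The sketch below replays the loop on a reduced table. The `struct State` layout, the reduced `AcceptType` enumeration, and the helpers `buildSymbolDfa` and `symbolType` are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdlib.h>

/* Assumed layout and a reduced set of accept types. */
typedef enum { DENY = 0, ASSIGN, LT, EQ, LE, AND } AcceptType;

struct State {
    AcceptType type;
    struct State* nextState[256];
};

#define NR_STATES (16)

static const struct {
    const char* symbol;
    AcceptType type;
} SYMS[] = {
    /* one-character symbols first, so each prefix state gets its own type */
    {"=", ASSIGN}, {"<", LT}, {"&", DENY},
    {"==", EQ}, {"<=", LE}, {"&&", AND},
    {NULL, DENY}
};

static struct State* buildSymbolDfa(void)
{
    int stateNr = 0, s;
    const char* character;
    struct State* iter;
    struct State* pool = (struct State*)calloc(NR_STATES, sizeof(struct State));

    if (NULL == pool) {
        return NULL;
    }
    for (s = 0; NULL != SYMS[s].symbol; ++s) {
        iter = pool;
        for (character = SYMS[s].symbol; *character; ++character) {
            unsigned char c = (unsigned char)*character;
            if (NULL == iter->nextState[c]) {
                /* first time this path is taken: allocate and label it */
                iter->nextState[c] = pool + (++stateNr);
                iter->nextState[c]->type = SYMS[s].type;
            }
            iter = iter->nextState[c];
        }
    }
    return pool;
}

/* Follow text from the start state; DENY if the path does not exist. */
static AcceptType symbolType(const struct State* start, const char* text)
{
    const struct State* cur = start;

    for (; *text && NULL != cur; ++text) {
        cur = cur->nextState[(unsigned char)*text];
    }
    return NULL == cur ? DENY : cur->type;
}
```

Here "=" ends in a state typed ASSIGN and "==" in one typed EQ, while "&" alone ends in a DENY state that exists only as a stepping stone to "&&". If "==" were listed before "=", the intermediate state would be created while inserting "==" and would wrongly carry EQ.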

 

Appendix: the initialization function and its related variables and macros

/* dfa.c */

#include <stdlib.h> /* malloc */
#include <string.h> /* memset */

#define NR_STATES (64)

static struct State* initStates(void)
{
    int stateNr = 0, s = 0;
    char* character;
    struct {
        char* symbol;
        AcceptType type;
    } SYMS[] = {
        {"+", PLUS}, {"-", MINUS}, {"*", MULTIPLY}, {"/", DIVIDE},
        {"=", ASSIGN}, {"!", NOT}, {"<", LT}, {">", GT}, {";", EOS},
        {",", COMMA}, {"(", LPARENT},
        {")", RPARENT}, {"[", LBRACKET}, {"]", RBRACKET}, {"{", LBRACE},
        {"}", RBRACE}, {"&", DENY}, {"|", DENY},
        {"==", EQ}, {"<=", LE}, {">=", GE}, {"!=", NE},
        {"&&", AND}, {"||", OR},
        {NULL, DENY}
    };
    struct State* iter;
    struct State* commentInLineStart,* commentMultiLineStart,
                * commentMultiLine2,* comment;
    struct State* space;
    struct State* integer,* realnum;
    struct State* identifier;

    jerryStates = (struct State*)malloc(NR_STATES * sizeof(struct State));
    memset(jerryStates, 0, NR_STATES * sizeof(struct State));

    jerryStates[0].type = DENY;
    for(; NULL != SYMS[s].symbol; ++s) {
        iter = jerryStates;
//        printf("--INFO-- %d\n", s);
        for(character = SYMS[s].symbol; *character; ++character) {
//            printf("---CHAR-- %c %d\n", *character, SYMS[s].type);
            if(NULL == iter->nextState[(int)*character]) {
                iter->nextState[(int)*character] = jerryStates + (++stateNr);
                iter->nextState[(int)*character]->type = SYMS[s].type;
//                printf("---APPEND-- %c %d %d\n", *character, stateNr, SYMS[s].type);
            }
            iter = iter->nextState[(int)*character];
        }
    }
    commentInLineStart = jerryStates + (++stateNr);
    commentMultiLineStart = jerryStates + (++stateNr);
    commentMultiLine2 = jerryStates + (++stateNr);
    comment = jerryStates + (++stateNr);
    commentInLineStart->type = commentMultiLineStart->type
                             = commentMultiLine2->type = DENY;
    comment->type = SKIP;

    jerryStates->nextState['/']->nextState['/'] = commentInLineStart;
    jerryStates->nextState['/']->nextState['*'] = commentMultiLineStart;

    for(s = 0; s < (1 << 8); ++s) {
        commentInLineStart->nextState[s] = commentInLineStart;
        commentMultiLineStart->nextState[s] = commentMultiLineStart;
        commentMultiLine2->nextState[s] = commentMultiLineStart;
    }
    commentInLineStart->nextState['\n'] = comment;
    commentMultiLineStart->nextState['*'] = commentMultiLine2;
    commentMultiLine2->nextState['*'] = commentMultiLine2;
    commentMultiLine2->nextState['/'] = comment;

    identifier = jerryStates + (++stateNr);
    identifier->type = IDENT;
    for(s = 'a'; s <= 'z'; ++s) {
        jerryStates->nextState[s] = identifier;
        identifier->nextState[s] = identifier;
    }
    for(s = 'A'; s <= 'Z'; ++s) {
        jerryStates->nextState[s] = identifier;
        identifier->nextState[s] = identifier;
    }
    jerryStates->nextState['_'] = identifier->nextState['_'] = identifier;

    integer = jerryStates + (++stateNr);
    realnum = jerryStates + (++stateNr);
    integer->type = INTEGER;
    realnum->type = REAL;
    for(s = '0'; s <= '9'; ++s) {
        jerryStates->nextState[s] = integer;
        integer->nextState[s] = integer;
        realnum->nextState[s] = realnum;
        identifier->nextState[s] = identifier;
    }
    jerryStates->nextState['.'] = realnum;
    integer->nextState['.'] = realnum;

    space = jerryStates + (++stateNr);
    space->type = SKIP;
    jerryStates->nextState[' '] = space;
    jerryStates->nextState['\t'] = space;
    jerryStates->nextState['\r'] = space;
    jerryStates->nextState['\n'] = space;
    space->nextState[' '] = space;
    space->nextState['\t'] = space;
    space->nextState['\r'] = space;
    space->nextState['\n'] = space;

//    printf("--INFO-- DFA built. %d\n", stateNr);
    return jerryStates;
}
#undef NR_STATES

The commented-out lines are all debug output statements.
