[Let's Write an Interpreter] 2 Lexical Analysis

有-点-甜

Tags: C, literate programming, programming languages, virtual machines, interpreters

Article link: https://blog.csdn.net/you_dian_tian/article/details/27981455

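2.1 Character Set

Every programming language has its own character set, which specifies the characters the language recognizes. Fish's character set is a subset of ASCII: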

0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z + - * / % = > < & | ' " [ ] { } . ( ) \n \t space

\n denotes the newline character, \t denotes the tab character, and space is the space character. Any character not listed above is illegal in Fish.
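The membership rule above can be sketched as a small predicate. This is only an illustration: the names fish_is_legal_char and fish_first_illegal are hypothetical and are not part of the lexer developed in this article.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the original lexer: returns 1 if ch
 * belongs to the Fish character set listed above, 0 otherwise. */
int fish_is_legal_char(int ch)
{
    /* Punctuation and whitespace taken from the list above. */
    static const char others[] = "+-*/%=><&|'\"[]{}.() \n\t";

    if (ch >= '0' && ch <= '9')
        return 1;
    /* Range tests on letters are valid because Fish assumes ASCII. */
    if ((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z'))
        return 1;
    /* strchr would also match the terminating '\0', so reject ch <= 0. */
    return ch > 0 && ch < 128 && strchr(others, ch) != NULL;
}

/* Returns a pointer to the first illegal character in s, or NULL if
 * every character of s is legal. */
const char *fish_first_illegal(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (!fish_is_legal_char((unsigned char)*s))
            return s;
    return NULL;
}
```

For example, fish_first_illegal("a # b") would point at the '#', while a string made only of listed characters yields NULL.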

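2.2 Lexical Grammar

The syntax and lexical structure of a programming language are usually described in BNF (Backus-Naur Form); for an introduction to BNF, see the book Compilers: Principles, Techniques & Tools. In the rules below, braces { } enclose a part that may be repeated zero or more times. First, here is the lexical grammar for identifiers and integers: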

<digit> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

<letter> → a | b | ... | z | A | B | ... | Z

<identifier> → <letter> { <letter> | <digit> }

<integer> → <digit> { <digit> }

The lexer will be written to follow these definitions strictly.

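2.3 The Lexer

Before writing the lexer, two concepts need to be clear: the character stream and the token stream. The former consists of the characters read one at a time from the source program; the latter is the lexer's output after it has analyzed the character stream. The lexer can be viewed as a filter between the source program and the parser: it transforms the character stream into a token stream. For every token the lexer outputs, we need a way to say which kind of token it is. The following are all tokens, for example: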

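Identifiers: the names of variables, such as a, var, foo, bar, X, Y;
Operators: such as >, -, +;
Integers: such as 2014, 4, 23;
...

We use an enum type to represent the different kinds of token: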

⟨token type⟩≡
enum {
    ⟨type list⟩

    TEOF = -1            /* end of input */
};

⟨type list⟩≡
    TADD = '+',            /* + */
    TSUB = '-',            /* - */
    TMUL = '*',            /* * */
    TDIV = '/',            /* / */
    TMOD = '%',            /* % */
    TASG = '=',            /* = */
    TNOT = '!',            /* ! */
    TLB  = '[',            /* [ */
    TRB  = ']',            /* ] */
    TLP  = '(',            /* ( */
    TRP  = ')',            /* ) */
    TLBR = '{',            /* { */
    TRBR = '}',            /* } */
    TL   = '<',            /* < */
    TG   = '>',            /* > */
    TSQ  = '\'',           /* ' */
    TDQ  = '\"',           /* " */

    TINT = 257,            /* integer */
    TID  = 258,            /* identifier */

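The code above assigns a code to each kind of token. Most codes are simply character values, because those characters are tokens by themselves. You do not have to do it this way: you could instead give every token type its own unique integer. Codes for tokens that are not single characters are best chosen greater than 256, so that they cannot collide with any character value.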
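The reasoning behind the "greater than 256" advice can be sketched as follows. The enum is a reduced copy of the token type for illustration only, the helper names token_is_char and check_token_codes are hypothetical, and the sketch assumes an 8-bit char, so that UCHAR_MAX is 255.

```c
#include <assert.h>
#include <limits.h>

/* Reduced copy of the token type, for illustration only. */
enum { TADD = '+', TSUB = '-', TINT = 257, TID = 258, TEOF = -1 };

/* Hypothetical helper: returns 1 if tok is a token whose code is the
 * character itself, 0 for TINT, TID and TEOF. */
int token_is_char(int tok)
{
    return tok > 0 && tok <= UCHAR_MAX;
}

/* Checks that the codes of multi-character tokens lie above every
 * possible character value (assumes UCHAR_MAX == 255). */
void check_token_codes(void)
{
    assert(TINT > UCHAR_MAX && TID > UCHAR_MAX);
}
```

Because the two ranges are disjoint, a single int return value is enough to carry either a character token or a token such as TINT.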

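How do we implement the lexer? First, be clear about the lexer's job and its relationship to the parser. The lexer is called by the parser, and each call returns the next token in the character stream. The lexer is therefore responsible for scanning the input character stream, recognizing the different kinds of token, and returning them to the parser. In addition, if the lexer needs some internal initialization, or needs to release resources when it finishes, it should offer interfaces for the parser to call. We put all of these function interfaces in the header file lexer.h: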

⟨lexer.h⟩≡
#ifndef LEXER_INCLUDED
#define LEXER_INCLUDED

#include <stdio.h>
#define MAX_ID    64    /* maximum length of an identifier */

⟨lexer typedefs⟩
⟨lexer functions⟩

#endif /* LEXER_INCLUDED */

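This code shows that lexer.h has two main parts: ⟨lexer typedefs⟩ holds the type definitions related to the lexer, and ⟨lexer functions⟩ holds the declarations of the lexer's interface functions. The type definitions consist of the token enumeration type shown above: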

⟨lexer typedefs⟩≡
⟨token type⟩

The function declarations specify the external interface our lexer provides:

⟨lexer functions⟩≡
extern void Lexer_init(FILE* in);
extern void Lexer_deinit(void);
extern int Lexer_token(void);

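Lexer_token returns the next token in the current input stream. The function is simple to implement; even so, for clarity and for later extension, we create a separate file lexer.c that holds the implementation of lexer.h: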

⟨lexer.c⟩≡
⟨lexer includes⟩
⟨lexer variables⟩

int Lexer_token(void)
{
       ⟨skip blanks⟩
       if (isalpha(c))
              ⟨read identifier and return TID⟩
       else if (isdigit(c))
              ⟨read integer and return TINT⟩
       else
              ⟨return operators⟩
}

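The variable c holds the character currently being examined. Lexer_token first skips whitespace, that is, spaces, tabs, and newlines, which is straightforward (the variable lexsrc is the input file stream):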

⟨skip blanks⟩≡
   while (c == ' ' || c == '\t' || c == '\n')
          c = getc(lexsrc);

Next, Lexer_token examines the first non-whitespace character. If it is a letter, it must be the start of an identifier, so we store the identifier and return the token type TID:

⟨read identifier and return TID⟩≡
{
        char *p = idbuf;

        ⟨read identifier⟩

        return TID;
}

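⟨read identifier⟩≡
do {
        if (p < idbuf + MAX_ID)    /* keep at most MAX_ID characters; */
                *p++ = c;          /* the rest are dropped so idbuf cannot overflow */
        c = getc(lexsrc);
} while(isalnum(c));
*p = '\0';                 /* null terminated string */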

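The idbuf above stores the identifier's name. If the first non-whitespace character is a digit, it must be the start of an integer, so we store the integer and return the token type TINT: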

⟨read integer and return TINT⟩≡
{
        number = 0;

        ⟨read integer⟩

        return TINT;
}

⟨read integer⟩≡
do {
        number = number * 10 + c - '0';
        c = getc(lexsrc);
} while(isdigit(c));

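Likewise, number stores the value of the integer just read. From now on, when a variable appears for the first time like this and its purpose can be inferred from context, we will not explain it; its definition will be given later. Note that the code above does not check for integer overflow. Next, if the first non-whitespace character is neither a letter nor a digit, we treat it as an operator: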

⟨return operators⟩≡
{
        int rc = c;
        c = getc(lexsrc);

        return rc;
}

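We do not return c directly; instead we first pre-read the next character, because the next time the parser calls Lexer_token, Lexer_token assumes that c already holds the next unread character. Finally, the lexer must be initialized correctly, which is done in Lexer_init: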

⟨lexer.c⟩+≡
void Lexer_init(FILE *in)
{
        lexsrc = in;
        c = getc(lexsrc); /* pre-read a character */
}
void Lexer_deinit(void)
{
        ⟨free resources⟩
}
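The pre-read in Lexer_init and the pre-read before each return together maintain one invariant: whenever the lexer is entered, c holds the next unconsumed character. The following self-contained sketch illustrates that invariant on a string instead of a FILE*. It is not the article's lexer: the names scan_init, scan_token, next_char, T_INT and T_ID are hypothetical, and identifiers and integers are only recognized, not stored.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>   /* for EOF */

enum { T_INT = 257, T_ID = 258 };   /* same codes as TINT and TID */

static const char *src;   /* assumption: input is a string, not a FILE* */
static int c;             /* lookahead: the next character not yet consumed */

/* Returns the next character of the string, or EOF at its end. */
static int next_char(void)
{
    return *src != '\0' ? (unsigned char)*src++ : EOF;
}

void scan_init(const char *text)
{
    assert(text != NULL);
    src = text;
    c = next_char();      /* pre-read one character, as Lexer_init does */
}

int scan_token(void)
{
    int rc;

    while (c == ' ' || c == '\t' || c == '\n')
        c = next_char();
    if (isalpha(c)) {
        do {
            c = next_char();
        } while (isalnum(c));
        return T_ID;      /* c already holds the character after the name */
    }
    if (isdigit(c)) {
        do {
            c = next_char();
        } while (isdigit(c));
        return T_INT;
    }
    rc = c;               /* single-character token, or EOF */
    c = next_char();      /* restore the invariant before returning */
    return rc;
}
```

With the input "ab+12", which contains no whitespace at all, successive calls to scan_token yield T_ID, '+', T_INT and then EOF, because the character that ends one token is kept in c and becomes the start of the next.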

For now Lexer_deinit has no resources to release, so its body is empty:

⟨free resources⟩≡

The code above uses several variables that have not yet been defined, such as c, idbuf, number, and lexsrc. We define them now:

⟨lexer variables⟩≡
static int c;
static int number;
static char idbuf[MAX_ID+1];
static FILE *lexsrc = NULL;

Do not forget to have lexer.c include the right header files:

⟨lexer includes⟩≡
#include <stdio.h>
#include <ctype.h>
#include "lexer.h"

With that, a first version of our lexer is complete. Next we need to test whether it is correct, so we write a simple test driver, driver1.c:

⟨driver1.c⟩≡
#include <stdio.h>
#include "lexer.h"

int main(int argc, char *argv[])
{
    int token;
    char *s;

    Lexer_init(stdin);
    while ((token = Lexer_token()) > 0) {
            switch(token) {
            case TADD: s = "+"; break;        /* + */
            case TSUB: s = "-"; break;        /* - */
            case TMUL: s = "*"; break;        /* * */
            case TDIV: s = "/"; break;        /* / */
            case TMOD: s = "%"; break;        /* % */
            case TASG: s = "="; break;        /* = */
            case TNOT: s = "!"; break;        /* ! */
            case TLB : s = "["; break;        /* [ */
            case TRB : s = "]"; break;        /* ] */
            case TLP : s = "("; break;        /* ( */
            case TRP : s = ")"; break;        /* ) */
            case TL  : s = "<"; break;        /* < */
            case TG  : s = ">"; break;        /* > */
            case TLBR: s = "{"; break;        /* { */
            case TRBR: s = "}"; break;        /* } */
            case TSQ : s = "\'"; break;       /* ' */
            case TDQ : s = "\""; break;       /* " */
            case TINT: s = "number"; break;   /* number */
            case TID : s = "identifier"; break;   /* identifier */
            default:
                        s = "illegal character";
                        break;
            }
            puts(s);
    }

    Lexer_deinit();
    return 0;
}

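Naturally, a test program needs something to test. Preparing test data matters too: it will accompany the whole development process. Test data should really be prepared before any code is written, and we will do so from now on, because it keeps us clear about what we are doing. Every time the program changes, all earlier test data should be run again to make sure the change has not introduced new errors.

At this point the lexer is still very basic, and we only need to check that it works at all, so we prepare a test file, test1.fish, as input for the lexer: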

⟨test1.fish⟩≡
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s t u v w x y
z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z + - * / % = >
< & | ' " [ ] { } . ( )
id id1 identifier1 xyz89xyz
12345 1 0 89 -1 -32768 65535 -2147483647 4294967295

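By convention, every file with the suffix .fish is a source program written in Fish. Save test1.fish in the test directory. Now we need to compile the source. The build environment is:

CentOS 6.5 (64-bit), kernel 2.6.32, GCC 4.8.2

The compiler options are: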

coptions='-Wall -lm -O2'

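All code in this article conforms to the C89 standard, so it should compile with any C89-conforming compiler, whether the operating system is 64-bit or 32-bit, Linux or Windows. Now hold your breath, compile, and run: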

$ cd src
$ gcc $coptions driver1.c lexer.c -o driver1
$ ./driver1 < ../test/test1.fish

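Recall that we have been assuming the current directory is fish. From now on the 'cd src' command will not be shown; we assume you have already changed to the src directory. Running the commands above produces the following output: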

number
number
number
number
number
number
number
number
number
number
identifier
... (middle omitted)
identifier
+
-
*
/
%
=
>
<
illegal character
illegal character
'
"
[
]
{
}
illegal character
(
)
identifier
identifier
identifier
identifier
number
number
number
number
-
number
-
number
number
-
number
number

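Note that 'illegal character' appears three times above. This is expected: for now we do not treat a lone &, |, or . as a legal token, so they are reported as illegal characters; their purpose will become clear later. In addition, three '-' tokens appear near the end, because our lexer cannot yet handle signed numbers. Great, have a rest, and until next time... :)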
