How to Write a Simple Interpreter, Part 1

Original article · last modified 2024-10-31 17:13:08

License: CC 4.0 BY-SA

First published 2019-01-03 23:23:20

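This article walks you step by step through writing a simple interpreter, starting with a Python calculator that adds two single-digit integers as the first step toward a Pascal interpreter. Along the way you will see how an interpreter works, what lexical analysis is, and how input is turned into something that can be executed. It covers how to process the input string, break it into tokens, and perform the addition.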

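Note on originality: this series is a translation of https://ruslanspivak.com. The illustrations come from the original posts, and I have removed some explanations that I considered unnecessary.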

"If you don't know how compilers work, then you don't know how computers work. If you're not 100% sure whether you know how compilers work, then you don't know how they work." (Steve Yegge)

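Don't worry. Stick with me through the whole tutorial and you will eventually learn how to write an interpreter and a compiler. You will also come away more confident, or at least I hope so. Why study this at all? Here are three reasons.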

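1. To write an interpreter or a compiler you have to learn a good number of techniques and use them together. That practice makes you better at applying them and makes you a better programmer. What you gain is the ability to build good software in general, not just interpreters and compilers.
2. You genuinely want to know how computers work inside. Interpreters and compilers often look like magic, and you should not be comfortable with that magic. You want to pull back the curtain, see what is behind it, and understand how it all fits together.
3. You want to create your own programming language or domain-specific language. If you do, you will also need to build an interpreter or a compiler for it. This has become popular recently, and new languages keep appearing: Elixir, Go, and Rust, to name a few.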

OK, so what are interpreters and compilers?

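The goal of an interpreter or a compiler is to translate a source program written in some high-level language into some other form. Pretty vague, isn't it? Bear with me: by the end of this series you will know exactly what the source program gets translated into.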

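At this point you may be wondering what the difference between an interpreter and a compiler is. The original post illustrates it with a figure; in short, a compiler translates the source program into machine language, whereas an interpreter processes and executes the source program without first translating it into machine language.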

Let's get hands-on and write an interpreter for Pascal. To keep things simple, we will write it in Python.

Here is a classic Pascal program that computes factorials.

program factorial;

function factorial(n: integer): longint;
begin
    if n = 0 then
        factorial := 1
    else
        factorial := n * factorial(n - 1);
end;

var
    n: integer;

begin
    for n := 0 to 16 do
        writeln(n, '! = ', factorial(n));
end.

Writing a complete interpreter in one go is not realistic. So let's begin in Python with a first step: an addition calculator. Simple enough, right?

# Token types
#
# EOF (end-of-file) token is used to indicate that
# there is no more input left for lexical analysis
INTEGER, PLUS, EOF = 'INTEGER', 'PLUS', 'EOF'


class Token(object):
    def __init__(self, type, value):
        # token type: INTEGER, PLUS, or EOF
        self.type = type
        # token value: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, '+', or None
        self.value = value

    def __str__(self):
        """String representation of the class instance.

        Examples:
            Token(INTEGER, 3)
            Token(PLUS, '+')
        """
        return 'Token({type}, {value})'.format(
            type=self.type,
            value=repr(self.value)
        )

    def __repr__(self):
        return self.__str__()


class Interpreter(object):
    def __init__(self, text):
        # client string input, e.g. "3+5"
        self.text = text
        # self.pos is an index into self.text
        self.pos = 0
        # current token instance
        self.current_token = None

    def error(self):
        raise Exception('Error parsing input')

    def get_next_token(self):
        """Lexical analyzer (also known as scanner or tokenizer)

        This method is responsible for breaking a sentence
        apart into tokens. One token at a time.
        """
        text = self.text

        # is self.pos index past the end of the self.text ?
        # if so, then return EOF token because there is no more
        # input left to convert into tokens
        if self.pos > len(text) - 1:
            return Token(EOF, None)

        # get a character at the position self.pos and decide
        # what token to create based on the single character
        current_char = text[self.pos]

        # if the character is a digit then convert it to
        # integer, create an INTEGER token, increment self.pos
        # index to point to the next character after the digit,
        # and return the INTEGER token
        if current_char.isdigit():
            token = Token(INTEGER, int(current_char))
            self.pos += 1
            return token

        if current_char == '+':
            token = Token(PLUS, current_char)
            self.pos += 1
            return token

        # any other character is not supported
        self.error()

    def eat(self, token_type):
        # compare the current token type with the passed token
        # type and if they match then "eat" the current token
        # and assign the next token to the self.current_token,
        # otherwise raise an exception.
        if self.current_token.type == token_type:
            self.current_token = self.get_next_token()
        else:
            self.error()

    def expr(self):
        """expr -> INTEGER PLUS INTEGER"""
        # set current token to the first token taken from the input
        self.current_token = self.get_next_token()

        # we expect the current token to be a single-digit integer
        left = self.current_token
        self.eat(INTEGER)

        # we expect the current token to be a '+' token
        op = self.current_token
        self.eat(PLUS)

        # we expect the current token to be a single-digit integer
        right = self.current_token
        self.eat(INTEGER)
        # after the above call the self.current_token is set to
        # EOF token

        # at this point INTEGER PLUS INTEGER sequence of tokens
        # has been successfully found and the method can just
        # return the result of adding two integers, thus
        # effectively interpreting client input
        result = left.value + right.value
        return result


def main():
    while True:
        try:
            # Python 3: input() reads one line from stdin.
            # Under Python 2, use raw_input() instead.
            text = input('calc> ')
        except EOFError:
            break
        if not text:
            continue
        interpreter = Interpreter(text)
        result = interpreter.expr()
        print(result)


if __name__ == '__main__':
    main()

Run it:

$ python calc1.py
calc> 3+4
7
calc> 3+5
8
calc> 3+9
12
calc>

For this about-as-simple-as-it-gets interpreter to work without raising an exception, your input has to follow these rules:

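- Only single-digit integers are allowed
- The only supported operation is addition
- No whitespace characters are allowed anywhere in the input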
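The rules above can be written down as a small pre-check. This is only a sketch: the name is_supported and the regular expression are assumptions made for illustration, they are not part of the calculator, and the check is deliberately strict (exactly one digit, a plus sign, one digit).

```python
import re

# Hypothetical pre-check, not part of the article's calculator:
# exactly one ASCII digit, a '+', and one more ASCII digit.
SUPPORTED = re.compile(r'[0-9]\+[0-9]')


def is_supported(text):
    """Return True if text has the shape the calculator is meant to handle."""
    return SUPPORTED.fullmatch(text) is not None


for sample in ['3+5', '12+3', '3-5', '3 + 5']:
    print(repr(sample), is_supported(sample))
# '3+5' True
# '12+3' False
# '3-5' False
# '3 + 5' False
```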

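When you type 3+5, your interpreter receives the string "3+5". To actually understand what to do with it, the interpreter first has to break that string into a series of tokens. A token is an object that has a type and a value. For example, for the string "3" the token's type is INTEGER and its value is the integer 3.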
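As a minimal sketch of the idea, a token can be modeled as a plain (type, value) pair. The snippet uses collections.namedtuple instead of the Token class from the listing, and the helper make_token is a hypothetical name introduced only for this illustration.

```python
from collections import namedtuple

# Illustrative stand-in for the Token class: a (type, value) pair.
Token = namedtuple('Token', ['type', 'value'])

INTEGER, PLUS = 'INTEGER', 'PLUS'


def make_token(char):
    """Hypothetical helper: map one input character to a token."""
    if char.isdigit():
        # The character '3' becomes the integer 3, not the string '3'.
        return Token(INTEGER, int(char))
    if char == '+':
        return Token(PLUS, char)
    raise ValueError('unexpected character: {!r}'.format(char))


print(make_token('3'))  # Token(type='INTEGER', value=3)
print(make_token('+'))  # Token(type='PLUS', value='+')
```

Note that the value of an INTEGER token is already a number, which is what lets the interpreter add two token values directly later on.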

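The process of breaking the input string into tokens is called lexical analysis, and the component that does it is called a lexical analyzer, also known as a scanner or tokenizer. It turns the characters you typed into a stream of tokens.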

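get_next_token is a method of the Interpreter class, and it is the lexical analyzer. Each call returns the next token from the input. Look closely: the interpreter stores the input string in text and keeps pos, an index into that string. pos starts at 0 and points at the character '3'. The method first checks whether the character at pos is a digit. If it is, the method increments pos and returns an INTEGER token whose value is the integer 3.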

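Now pos points at the '+' character in text. On the next call, the method finds that the current character is not a digit but a plus sign, so it increments pos and returns a PLUS token whose value is '+'.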

The 5 is handled the same way as the 3: the method returns an INTEGER token whose value is the integer 5.

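pos is now past the end of the string, which means the whole input has been processed. From here on, every call to the method returns the EOF token.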
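The walk through "3+5" can be traced with a short sketch. It follows the same rules as get_next_token (single digits and '+') but is written as a standalone generator; the name tokenize and the (pos, type, value) tuples are assumptions made for illustration, not code from the listing.

```python
INTEGER, PLUS, EOF = 'INTEGER', 'PLUS', 'EOF'


def tokenize(text):
    """Yield (pos, type, value) for each token, ending with EOF.

    Illustrative sketch only: single digits and '+' are recognized,
    anything else raises ValueError.
    """
    pos = 0
    while pos < len(text):
        char = text[pos]
        if char.isdigit():
            yield pos, INTEGER, int(char)
        elif char == '+':
            yield pos, PLUS, char
        else:
            raise ValueError(
                'cannot tokenize {!r} at position {}'.format(char, pos))
        pos += 1  # move to the character after the token
    # pos is now past the last character: no input left
    yield pos, EOF, None


for step in tokenize('3+5'):
    print(step)
# (0, 'INTEGER', 3)
# (1, 'PLUS', '+')
# (2, 'INTEGER', 5)
# (3, 'EOF', None)
```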

You can verify the whole process in an interactive Python session:

>>> from calc1 import Interpreter
>>> interpreter = Interpreter('3+5')
>>> interpreter.get_next_token()
Token(INTEGER, 3)
>>>
>>> interpreter.get_next_token()
Token(PLUS, '+')
>>>
>>> interpreter.get_next_token()
Token(INTEGER, 5)
>>>
>>> interpreter.get_next_token()
Token(EOF, None)
>>>

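So the token sequence is INTEGER -> PLUS -> INTEGER. The interpreter knows that this is the sequence it has to find: an integer, a plus sign, and another integer. The method responsible for finding it is expr. It verifies that the tokens arrive in the expected order, and when they do, it computes the result by adding the values of the two INTEGER tokens.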

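expr uses the helper method eat to verify that the type of current_token matches the expected token type. On a match, eat fetches the next token and assigns it to current_token; on a mismatch, it raises an exception.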
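The interplay between expr and eat can be sketched on a ready-made token list. This is a simplified illustration under stated assumptions: parse_addition and its list of (type, value) pairs are hypothetical stand-ins for the Interpreter state, and the error type is ValueError rather than the generic Exception used in the listing.

```python
def parse_addition(tokens):
    """Check the sequence INTEGER PLUS INTEGER and return the sum.

    tokens is a list of (type, value) pairs ending with ('EOF', None).
    """
    position = 0

    def eat(expected_type):
        # Verify the current token's type, then advance to the next token.
        nonlocal position
        token_type, value = tokens[position]
        if token_type != expected_type:
            raise ValueError(
                'expected {}, got {}'.format(expected_type, token_type))
        position += 1
        return value

    left = eat('INTEGER')
    eat('PLUS')
    right = eat('INTEGER')
    return left + right


good = [('INTEGER', 3), ('PLUS', '+'), ('INTEGER', 5), ('EOF', None)]
print(parse_addition(good))  # 8

bad = [('INTEGER', 3), ('INTEGER', 5), ('EOF', None)]
try:
    parse_addition(bad)
except ValueError as exc:
    print(exc)  # expected PLUS, got INTEGER
```

Keeping the type check and the advance in one helper means the parsing code reads almost like the rule it implements: INTEGER, then PLUS, then INTEGER.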

Congratulations, you have just finished your first simple interpreter!