[Interpreter] Let's Build a Simple Interpreter (Part 2)

唐茂

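This is a quick translation made for easy reference. My skills are limited, so if you like the material, please read the original!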

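In "The 5 Elements of Effective Thinking", the authors Burger and Starbird share a story about watching Tony Plog, an internationally acclaimed trumpet virtuoso, conduct a master class for up-and-coming trumpet players. The students first played complex musical phrases, and they played them very well. Then they were asked to play very basic, simple notes. When they played those notes, they sounded childish compared to the complex phrases played earlier. After they finished, the master played the same notes, but when he played them they did not sound childish at all. The difference was stunning. Tony explained that mastering the performance of simple notes allows one to play complex pieces with greater control. The lesson was clear: to build true virtuosity, one must focus on mastering simple, basic ideas.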

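The lesson of the story clearly applies not only to music but also to software development. It reminds all of us not to lose sight of how much deep work on simple, basic ideas matters for complex work, even when it sometimes feels like a step back. While it is important to be proficient with a tool or framework, it is just as important to understand the principles behind it. As Ralph Waldo Emerson said: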

“If you learn only methods, you’ll be tied to your methods. But if you learn principles, you can devise your own methods.”

On that note, let's dive into interpreters and compilers again.

Today I will show you a new version of the calculator from Part 1 that can:

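1. Handle whitespace characters anywhere in the input string
2. Consume multi-digit integers from the input
3. Subtract two integers (so far it could only add)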

Here is the source code of the new calculator, which can do all of the above:

# Token types
# EOF (end-of-file) token is used to indicate that
# there is no more input left for lexical analysis
INTEGER, PLUS, MINUS, EOF = 'INTEGER', 'PLUS', 'MINUS', 'EOF'


class Token(object):
    def __init__(self, type, value):
        # token type: INTEGER, PLUS, MINUS, or EOF
        self.type = type
        # token value: non-negative integer value, '+', '-', or None
        self.value = value

    def __str__(self):
        """String representation of the class instance.

        Examples:
            Token(INTEGER, 3)
            Token(PLUS, '+')
        """
        return 'Token({type}, {value})'.format(
            type=self.type,
            value=repr(self.value)
        )

    def __repr__(self):
        return self.__str__()


class Interpreter(object):
    def __init__(self, text):
        # client string input, e.g. "3 + 5", "12 - 5", etc
        self.text = text
        # self.pos is an index into self.text
        self.pos = 0
        # current token instance
        self.current_token = None
        self.current_char = self.text[self.pos]

    def error(self):
        raise Exception('Error parsing input')

    def advance(self):
        """Advance the 'pos' pointer and set the 'current_char' variable."""
        self.pos += 1
        if self.pos > len(self.text) - 1:
            self.current_char = None  # Indicates end of input
        else:
            self.current_char = self.text[self.pos]

    def skip_whitespace(self):
        while self.current_char is not None and self.current_char.isspace():
            self.advance()

    def integer(self):
        """Return a (multidigit) integer consumed from the input."""
        result = ''
        while self.current_char is not None and self.current_char.isdigit():
            result += self.current_char
            self.advance()
        return int(result)

    def get_next_token(self):
        """Lexical analyzer (also known as scanner or tokenizer)

        This method is responsible for breaking a sentence
        apart into tokens.
        """
        while self.current_char is not None:

            if self.current_char.isspace():
                self.skip_whitespace()
                continue

            if self.current_char.isdigit():
                return Token(INTEGER, self.integer())

            if self.current_char == '+':
                self.advance()
                return Token(PLUS, '+')

            if self.current_char == '-':
                self.advance()
                return Token(MINUS, '-')

            self.error()

        return Token(EOF, None)

    def eat(self, token_type):
        # compare the current token type with the passed token
        # type and if they match then "eat" the current token
        # and assign the next token to the self.current_token,
        # otherwise raise an exception.
        if self.current_token.type == token_type:
            self.current_token = self.get_next_token()
        else:
            self.error()

    def expr(self):
        """Parser / Interpreter

        expr -> INTEGER PLUS INTEGER
        expr -> INTEGER MINUS INTEGER
        """
        # set current token to the first token taken from the input
        self.current_token = self.get_next_token()

        # we expect the current token to be an integer
        left = self.current_token
        self.eat(INTEGER)

        # we expect the current token to be either a '+' or '-'
        op = self.current_token
        if op.type == PLUS:
            self.eat(PLUS)
        else:
            self.eat(MINUS)

        # we expect the current token to be an integer
        right = self.current_token
        self.eat(INTEGER)
        # after the above call the self.current_token is set to
        # EOF token

        # at this point either the INTEGER PLUS INTEGER or
        # the INTEGER MINUS INTEGER sequence of tokens
        # has been successfully found and the method can just
        # return the result of adding or subtracting two integers,
        # thus effectively interpreting client input
        if op.type == PLUS:
            result = left.value + right.value
        else:
            result = left.value - right.value
        return result


def main():
    while True:
        try:
            # Python 3: 'input' replaces Python 2's 'raw_input'
            text = input('calc> ')
        except EOFError:
            break
        if not text:
            continue
        interpreter = Interpreter(text)
        result = interpreter.expr()
        print(result)


if __name__ == '__main__':
    main()

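Save the above code into a file named calc2.py, or download it directly from GitHub. Try it out and see for yourself that it works as expected: it handles whitespace characters anywhere in the input, it accepts multi-digit integers, and it can subtract two integers as well as add them.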

Here is a sample session that I ran on my laptop:

$ python calc2.py
calc> 27 + 3
30
calc> 27 - 7
20
calc>

Compared with the version from Part 1, the major code changes are:

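1. The get_next_token method was refactored a bit. The logic that increments the pos pointer was factored out into a separate method, advance.
2. Two new methods were added: skip_whitespace to ignore whitespace characters, and integer to handle multi-digit integers in the input.
3. The expr method was modified to recognize the INTEGER -> MINUS -> INTEGER phrase in addition to the INTEGER -> PLUS -> INTEGER phrase. After successfully recognizing the corresponding phrase, the method now interprets both addition and subtraction.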
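
The first two changes can be seen in isolation in the following minimal sketch. It assumes a hypothetical standalone function, tokenize, that is not part of calc2.py. It mirrors the scanning logic of get_next_token, skip_whitespace, and integer, but returns plain (type, value) tuples instead of Token objects.

```python
# Hypothetical standalone helper, not part of calc2.py: it mirrors the
# scanning logic of Interpreter.get_next_token as a plain function.
INTEGER, PLUS, MINUS, EOF = 'INTEGER', 'PLUS', 'MINUS', 'EOF'


def tokenize(text):
    """Return a list of (type, value) pairs for the whole input."""
    tokens = []
    pos = 0
    while pos < len(text):
        char = text[pos]
        if char.isspace():
            # same job as skip_whitespace: ignore the character
            pos += 1
        elif char.isdigit():
            # same job as integer: collect consecutive digits
            start = pos
            while pos < len(text) and text[pos].isdigit():
                pos += 1
            tokens.append((INTEGER, int(text[start:pos])))
        elif char == '+':
            tokens.append((PLUS, '+'))
            pos += 1
        elif char == '-':
            tokens.append((MINUS, '-'))
            pos += 1
        else:
            raise Exception('Error parsing input')
    # the EOF token signals that no input is left
    tokens.append((EOF, None))
    return tokens


print(tokenize('  27   -7 '))
# [('INTEGER', 27), ('MINUS', '-'), ('INTEGER', 7), ('EOF', None)]
```

Whitespace never produces a token, and the digits 2 and 7 end up in a single INTEGER token with the value 27.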

In Part 1 you learned two important concepts: the token and the lexical analyzer. Today I would like to talk a little about lexemes, parsing, and parsers.

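You already know about tokens. But to round out the discussion of tokens, I need to mention lexemes. What is a lexeme? A lexeme is a sequence of characters that forms a token. In the figure below you can see some examples of tokens and lexemes, which hopefully makes the relationship between them clear: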
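
The same relationship can be sketched in code. This is only an illustration: the lexemes function and the regular expression below are assumptions and do not appear in calc2.py. The function pairs each token type with the raw characters (the lexeme) that produced it.

```python
import re

# Hypothetical helper for illustration only; the pattern and the names
# are assumptions, not code from calc2.py.
TOKEN_PATTERN = re.compile(
    r'\s*(?:(?P<INTEGER>\d+)|(?P<PLUS>\+)|(?P<MINUS>-))'
)


def lexemes(text):
    """Return (token type, lexeme) pairs; the lexeme is the raw text."""
    pairs = []
    pos = 0
    text = text.rstrip()  # trailing whitespace forms no token
    while pos < len(text):
        match = TOKEN_PATTERN.match(text, pos)
        if match is None:
            raise Exception('Error parsing input')
        # lastgroup is the name of the alternative that matched
        pairs.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return pairs


for token_type, lexeme in lexemes('342 + 7 - 007'):
    print(token_type, repr(lexeme))
# INTEGER '342'
# PLUS '+'
# INTEGER '7'
# MINUS '-'
# INTEGER '007'
```

The lexemes '7' and '007' are different character sequences, yet both form an INTEGER token, and calling int on either one gives the token value 7.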

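[Figure: lsbasi_part2_lexemes (examples of tokens and their lexemes)]

Remember the expr method? I said earlier that this is where the arithmetic expression is actually interpreted. But before you can interpret an expression, you first need to recognize what kind of expression it is, for example whether it is an addition or a subtraction. That is essentially what the expr method does: it finds the structure in the stream of tokens produced by get_next_token, and then it interprets the phrase it has recognized, generating the result of the arithmetic expression.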

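The process of finding the structure in a stream of tokens, or put differently, the process of recognizing a phrase in a stream of tokens, is called parsing. The part of an interpreter or compiler that performs that job is called a parser.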

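So now you know that the expr method is the part of the interpreter where both parsing and interpreting happen. The expr method first tries to recognize (parse) the INTEGER -> PLUS -> INTEGER phrase or the INTEGER -> MINUS -> INTEGER phrase in the stream of tokens. After it has successfully recognized (parsed) one of those phrases, it interprets the phrase and returns the result of the addition or subtraction to the caller.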
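
To make the two responsibilities easier to tell apart, here is a minimal sketch that splits them into two functions. The names parse and interpret are hypothetical, and tokens are plain (type, value) tuples rather than Token objects. Unlike expr in calc2.py, this sketch also insists that EOF follows the second integer.

```python
# Hypothetical sketch: 'parse' and 'interpret' are illustrative names,
# and tokens are plain (type, value) tuples rather than Token objects.
INTEGER, PLUS, MINUS, EOF = 'INTEGER', 'PLUS', 'MINUS', 'EOF'


def parse(tokens):
    """Recognize INTEGER (PLUS | MINUS) INTEGER followed by EOF.

    Returns (left value, operator type, right value) on success.
    """
    types = [token_type for token_type, _ in tokens]
    if types not in ([INTEGER, PLUS, INTEGER, EOF],
                     [INTEGER, MINUS, INTEGER, EOF]):
        raise Exception('Error parsing input')
    return tokens[0][1], tokens[1][0], tokens[2][1]


def interpret(phrase):
    """Compute the result of an already recognized phrase."""
    left, op_type, right = phrase
    return left + right if op_type == PLUS else left - right


tokens = [(INTEGER, 27), (MINUS, '-'), (INTEGER, 7), (EOF, None)]
print(interpret(parse(tokens)))  # 20
```

In calc2.py both steps live inside expr; separating them here is only meant to show where recognition ends and evaluation begins.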

Now it's time for exercises again.

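1. Extend the calculator to handle multiplication of two integers
2. Extend the calculator to handle division of two integers
3. Modify the code to interpret expressions containing an arbitrary number of additions and subtractions, for example "9 - 5 + 3 + 11"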

Check your understanding:

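1. What is a lexeme?
2. What is the name of the process that finds the structure in a stream of tokens, or put differently, what is the name of the process that recognizes a certain phrase in that stream of tokens?
3. What is the name of the part of the interpreter (compiler) that does the parsing?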

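I hope you liked today's material. In the next article of the series, you will extend the calculator to handle more complex arithmetic expressions. Stay tuned.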

Here is a list of books I recommend that will help you in your study of interpreters and compilers:

Language Implementation Patterns: Create Your Own Domain-Specific and General Programming Languages (Pragmatic Programmers)
Writing Compilers and Interpreters: A Software Engineering Approach
Modern Compiler Implementation in Java
Modern Compiler Design
Compilers: Principles, Techniques, and Tools (2nd Edition)

Original article: Let’s Build A Simple Interpreter. Part 2.

Author's blog: Ruslan’s Blog

——2019-01-03——