Let’s Build A Simple Interpreter 1

Latest recommended article published 2021-08-02 11:55:36

jinchengwu3344

Article link: https://blog.csdn.net/longlongqin/article/details/105111564

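Why should you study interpreters and compilers? Here are three reasons.

Writing an interpreter or a compiler makes you use many technical skills together. Building one helps you sharpen those skills and become a better software developer. The skills also carry over to any software you write, not just interpreters and compilers.
You really want to know how computers work. Interpreters and compilers often look like magic, and you should not be comfortable with that magic. You want to demystify them, understand how they work, and be in control of the whole process.
You want to create your own programming language or domain-specific language. If you do, you will also need to build an interpreter or a compiler for it. Creating new languages has become popular again recently, and a new language seems to appear almost every day: Elixir, Go, Rust, and so on.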

Original article: https://ruslanspivak.com/lsbasi-part1/

OK, but what are interpreters and compilers?

Interpreters and compilers

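Interpreters and compilers are both "translators between a high-level language and the machine". Both take source code written in a high-level language and get the machine to carry it out. They differ in when the translation happens and in how the program is run.

The differences are:

Compiler: translates the whole program first, and the result is then executed in one go. For example, C code is compiled into binary code (an exe program) that runs on Windows.
Interpreter: translates one statement and immediately hands it to the machine to execute, that is, it interprets and executes as it goes. PHP, PostScript, and JavaScript are typical interpreted languages.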

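A down-to-earth analogy: we go to a restaurant and order eight dishes and a soup. The compiler approach is for the chef to cook every dish first and serve them all together; where and how you eat them is up to you. The interpreter approach is for the chef to serve each dish as soon as it is ready; you eat that dish, and you have to eat it in the restaurant.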
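The contrast can be sketched in a few lines of Python. This is only an illustration under loud assumptions: Python's built-in compile() and exec() stand in for a real compiler and interpreter, the three-line program is made up, and real interpreters (Python included) usually parse a whole file before running it. The point is just that the "whole program first" style rejects the broken program before anything runs, while the "statement by statement" style has already executed the first two lines when it hits the error.

```python
# Illustration only: compile() and exec() stand in for a real compiler and
# interpreter. PROGRAM is a made-up example, not code from the article.
PROGRAM = ["x = 1", "y = x + 2", "z = = 3"]  # the last line has a syntax error


def run_whole_program(lines):
    """Compiler style: translate everything first, then run it in one go."""
    env = {}
    try:
        code = compile("\n".join(lines), "<program>", "exec")
    except SyntaxError:
        return env, "rejected before running"
    exec(code, env)
    return env, "ran to completion"


def run_line_by_line(lines):
    """Interpreter style: translate one statement, run it, move on."""
    env = {}
    for number, line in enumerate(lines, start=1):
        try:
            exec(compile(line, "<line>", "exec"), env)
        except SyntaxError:
            return env, "stopped at line {}".format(number)
    return env, "ran to completion"


whole_env, whole_status = run_whole_program(PROGRAM)
line_env, line_status = run_line_by_line(PROGRAM)
print(whole_status, "x" in whole_env)  # nothing was executed
print(line_status, line_env["y"])      # the first two lines already ran
```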

[Figure: how the workflows of a compiler and an interpreter differ]

[Figure: the characteristics of compilers and of interpreters]

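Building the interpreter, V1.0

The author of this series uses Python to write an interpreter for the Pascal language.

The first version, V1.0, is a calculator with many restrictions. For example:

Only single-digit integers are allowed in the input
Only addition is supported at this stage
No whitespace is allowed in the input

These constraints keep the calculator simple to build. The code is as follows: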

# Token types:
# the EOF (end-of-file) token indicates that there is
# no more input left for lexical analysis
INTEGER, PLUS, EOF = 'INTEGER', 'PLUS', 'EOF'


class Token(object):
    def __init__(self, type, value):
        # token type: INTEGER, PLUS, or EOF
        self.type = type
        # token value: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, '+', or None
        self.value = value

    def __str__(self):
        """String representation of the class instance.

        Examples:
            Token(INTEGER, 3)
            Token(PLUS, '+')
        """
        return 'Token({type}, {value})'.format(
            type=self.type,
            value=repr(self.value)
        )

    def __repr__(self):
        return self.__str__()


class Interpreter(object):
    def __init__(self, text):
        # client string input, e.g. "3+5"
        self.text = text
        # self.pos is an index into self.text
        self.pos = 0
        # current token instance
        self.current_token = None

    def error(self):
        raise Exception('Error parsing input')

    def get_next_token(self):
        """Lexical analyzer (also known as scanner or tokenizer)

        This method is responsible for breaking a sentence
        apart into tokens. One token at a time.
        """
        text = self.text

        # is the self.pos index past the end of self.text?
        # if so, return the EOF token because there is no more
        # input left to convert into tokens
        if self.pos > len(text) - 1:
            return Token(EOF, None)

        # get the character at position self.pos and decide
        # what token to create based on that single character
        current_char = text[self.pos]

        # if the character is a digit then convert it to
        # integer, create an INTEGER token, increment self.pos
        # index to point to the next character after the digit,
        # and return the INTEGER token
        if current_char.isdigit():
            token = Token(INTEGER, int(current_char))
            self.pos += 1
            return token

        # a '+' character becomes a PLUS token in the same way
        if current_char == '+':
            token = Token(PLUS, current_char)
            self.pos += 1
            return token

        # any other character is not supported in this version
        self.error()

    def eat(self, token_type):
        # compare the current token type with the passed token
        # type and if they match then "eat" the current token
        # and assign the next token to self.current_token,
        # otherwise raise an exception.
        if self.current_token.type == token_type:
            self.current_token = self.get_next_token()
        else:
            self.error()

    def expr(self):
        """expr -> INTEGER PLUS INTEGER"""
        # set current token to the first token taken from the input
        self.current_token = self.get_next_token()

        # we expect the current token to be a single-digit integer
        left = self.current_token
        self.eat(INTEGER)

        # we expect the current token to be a '+' token
        op = self.current_token
        self.eat(PLUS)

        # we expect the current token to be a single-digit integer
        right = self.current_token
        self.eat(INTEGER)
        # after the above call self.current_token is set to
        # the EOF token

        # at this point the INTEGER PLUS INTEGER sequence of tokens
        # has been successfully found and the method can just
        # return the result of adding two integers, thus
        # effectively interpreting client input
        result = left.value + right.value
        return result


def main():
    while True:
        try:
            # input() is the Python 3 call; to run under
            # Python 2 replace it with raw_input()
            text = input('calc> ')
        except EOFError:
            break
        if not text:
            continue
        interpreter = Interpreter(text)
        result = interpreter.expr()
        print(result)


if __name__ == '__main__':
    main()

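Save the code above in a file named calc1.py, or download it directly from GitHub. Before you start studying the code closely, run the calculator on the command line and see it in action. Play with it! Here is a sample session on my laptop (the listing above calls input, so it runs under Python 3 as is; to run it under Python 2, replace input with raw_input):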

$ python calc1.py
calc> 3+4
7
calc> 3+5
8
calc> 3+9
12
calc>

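Code walkthrough

Suppose we type the expression "3+5" on the command line. Your interpreter receives the string "3+5". For the interpreter to actually understand what to do with that string, it first has to break the input "3+5" into components called tokens.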

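Lexical analysis (the component that performs it is called a lexer, also known as a scanner or tokenizer)

Lexical analysis is also called tokenization. In this phase the compiler scans the source file from left to right and splits its character stream into individual words, called tokens from here on.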

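Token:

A token is a run of characters in the source file that cannot be split any further, much like a word in English or in Chinese.

Here a token is an object that has a type and a value (that is, the token also records what kind of value it holds). For example, for the string "3" the token type is INTEGER and the corresponding value is the integer 3.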
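A token as a typed value can be sketched on its own, apart from the full listing. The namedtuple and the make_token helper below are hypothetical stand-ins introduced for illustration, not the article's Token class; the sketch assumes the same three token types and uses None to signal the end of input.

```python
from collections import namedtuple

# Hypothetical stand-in for the article's Token class: a (type, value) pair.
Token = namedtuple("Token", ["type", "value"])

INTEGER, PLUS, EOF = "INTEGER", "PLUS", "EOF"


def make_token(char):
    """Map one input character to a token; None stands for end of input."""
    if char is None:
        return Token(EOF, None)
    if char.isdigit():
        # The token stores the integer 3, not the one-character string "3".
        return Token(INTEGER, int(char))
    if char == "+":
        return Token(PLUS, char)
    raise ValueError("unexpected character: {!r}".format(char))


three = make_token("3")
plus = make_token("+")
print(three.type, three.value)
print(plus.type, plus.value)
```

Keeping the type next to the value is what lets later stages ask "is this an INTEGER?" without inspecting the raw characters again.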

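The first thing the Interpreter has to do is read the input string and convert it into a stream of tokens. The part of the interpreter that does this job is called the lexical analyzer, or lexer for short. You may also see it called a scanner or a tokenizer. They all mean the same thing: the part of an interpreter or compiler that turns the input string into a stream of tokens.

So how is the input converted into a token stream?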

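The get_next_token method of the Interpreter class is your lexical analyzer. Every time you call it, you get the next token from the characters passed to the interpreter. Let's look closely at the method and see how it turns characters into tokens. The input is stored in the variable text, which holds the input string, and pos is an index into that string (think of the string as an array of characters). pos starts at 0 and points to the character '3'. The method first checks whether that character is a digit. It is, so the method increments pos and returns a token of type INTEGER whose value is the integer 3.

Now pos points to the character '+' in text. The next time you call the method, it first tests whether the character at pos is a digit and then whether it is a plus sign, which it is. So the method increments pos and returns a token of type PLUS whose value is '+'.

Now pos points to the character '5'. When you call get_next_token again, it checks whether the character at pos is a digit, which it is, so it increments pos and returns a token of type INTEGER whose value is the integer 5.

The index pos is now past the end of the string "3+5", so every further call to get_next_token returns the EOF token.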
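The walk described above can be replayed as a small trace. This is a standalone sketch, not the article's method: trace_lexer is a hypothetical helper that applies the same three rules (digit, plus sign, end of input) and records the value of pos at which each token was found.

```python
def trace_lexer(text):
    """Return one (pos, char, token_type, value) row per token, ending with EOF."""
    rows = []
    pos = 0
    while True:
        # Same end-of-input test as get_next_token in the listing.
        if pos > len(text) - 1:
            rows.append((pos, None, "EOF", None))
            return rows
        char = text[pos]
        if char.isdigit():
            rows.append((pos, char, "INTEGER", int(char)))
        elif char == "+":
            rows.append((pos, char, "PLUS", char))
        else:
            raise ValueError("unexpected character at position {}".format(pos))
        pos += 1


for row in trace_lexer("3+5"):
    print(row)
# (0, '3', 'INTEGER', 3)
# (1, '+', 'PLUS', '+')
# (2, '5', 'INTEGER', 5)
# (3, None, 'EOF', None)
```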

Try it yourself and see how the lexer component of your calculator works:

>>> from calc1 import Interpreter
>>>
>>> interpreter = Interpreter('3+5')
>>> interpreter.get_next_token()
Token(INTEGER, 3)
>>>
>>> interpreter.get_next_token()
Token(PLUS, '+')
>>>
>>> interpreter.get_next_token()
Token(INTEGER, 5)
>>>
>>> interpreter.get_next_token()
Token(EOF, None)
>>>

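Your interpreter can now get a stream of tokens out of the input characters, and it has to do something with that stream: it needs to find structure in the token stream it obtains from the lexer, get_next_token. Your interpreter expects to find the following structure in the stream: INTEGER -> PLUS -> INTEGER. That is, it tries to find a token sequence made of an integer, followed by a plus sign, followed by an integer.

The method responsible for finding and interpreting this structure is expr. It verifies that the sequence of tokens matches the expected sequence, INTEGER -> PLUS -> INTEGER. Once the structure is confirmed, it adds the value of the token to the left of PLUS to the value of the token to its right and produces the result, thereby interpreting the arithmetic expression you passed to the interpreter.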
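Stripped of the lexer, the check that expr performs amounts to comparing token types against a fixed pattern. A minimal sketch, assuming the tokens have already been collected into a list of (type, value) pairs; the interpret function is hypothetical, and like the listing it looks only at the first three tokens.

```python
def interpret(tokens):
    """Check for INTEGER PLUS INTEGER at the front and return the sum."""
    expected = ["INTEGER", "PLUS", "INTEGER"]
    # Like the article's expr, only the first three tokens are examined.
    actual = [token_type for token_type, _ in tokens[:3]]
    if actual != expected:
        raise ValueError("expected {}, got {}".format(expected, actual))
    left_value = tokens[0][1]
    right_value = tokens[2][1]
    return left_value + right_value


tokens = [("INTEGER", 3), ("PLUS", "+"), ("INTEGER", 5), ("EOF", None)]
print(interpret(tokens))  # 8
```

The article's version reaches the same result one token at a time, which is why it never needs the whole list in memory.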

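The expr method uses the helper method eat to verify that the token type passed to eat matches the type of the current token. After a match, eat fetches the next token and assigns it to the variable current_token, which in effect "eats" the matched token and advances the imaginary pointer in the token stream. If the structure in the token stream does not follow the expected INTEGER PLUS INTEGER sequence, eat raises an exception.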
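The "match, then advance" behaviour of eat can be isolated in a short sketch. TokenStream is a hypothetical helper that works on a pre-lexed list of (type, value) pairs, and its eat returns the consumed token, which the article's eat does not do; the matching rule itself is the same.

```python
class TokenStream(object):
    """Hypothetical helper: a pre-lexed token list plus a 'current' pointer."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.index = 0

    @property
    def current(self):
        # Past the end, keep reporting EOF, as the article's lexer does.
        if self.index >= len(self.tokens):
            return ("EOF", None)
        return self.tokens[self.index]

    def eat(self, token_type):
        """Consume the current token if its type matches, otherwise raise."""
        token = self.current
        if token[0] != token_type:
            raise ValueError("expected {}, got {}".format(token_type, token[0]))
        self.index += 1
        return token


def expr(stream):
    """expr -> INTEGER PLUS INTEGER"""
    left = stream.eat("INTEGER")
    stream.eat("PLUS")
    right = stream.eat("INTEGER")
    return left[1] + right[1]


good = TokenStream([("INTEGER", 3), ("PLUS", "+"), ("INTEGER", 5)])
print(expr(good))  # 8

bad = TokenStream([("INTEGER", 3), ("INTEGER", 5)])  # e.g. the input "35"
try:
    expr(bad)
except ValueError as exc:
    print("rejected:", exc)  # rejected: expected PLUS, got INTEGER
```

On a mismatch the pointer stays where it was, so the error message can name the token that broke the expected sequence.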