Abstract Syntax Trees in Python

Requirement: All examples are compatible with at least Python v3.6, except for calls to ast.dump() with the indent= parameter, which was added in Python v3.9.

What is an Abstract Syntax Tree (AST)?

An Abstract Syntax Tree (AST) is a tree data structure that represents the source code of a program according to the grammar of its programming language, so that tools can reason about the instructions the code contains.

From source code to binary

For instance, compilers use ASTs when transforming source code into binary code:

  1. Given some text source code, the compiler first tokenizes the text to identify programming language keywords, variables, literals, etc. Each token represents an “atom” of an instruction.
  2. Tokens are then arranged into an AST, a tree whose nodes are the “atoms” of the instructions and whose edges are the relationships between the atoms according to the programming language grammar. For instance, the AST makes explicit the presence of a function call, the related input arguments, the instructions composing the function, etc.
  3. The compiler can then apply multiple optimizations to the AST, and ultimately converts it into binary code.
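The three steps above can be approximated in Python itself. This is only a minimal sketch: it uses the standard tokenize module for illustration (the interpreter has its own internal tokenizer), and CPython compiles to bytecode rather than native binary code. The variable names are chosen for this example only.

```python
import ast
import io
import tokenize

source = "one_plus_two = 1 + 2"

# Step 1: split the text into tokens, kept here as (type name, text) pairs.
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]

# Step 2: build the AST from the source text.
tree = ast.parse(source)

# Step 3: compile the AST into a code object (bytecode) and execute it.
code_object = compile(tree, filename="<example>", mode="exec")
namespace = {}
exec(code_object, namespace)

print(tokens[:3])
print(type(tree).__name__)
print(namespace["one_plus_two"])
```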

Despite their role in compilers, ASTs are useful for a much broader set of use cases. Let’s discuss this in more detail.

The ast Python module and its use

The ast module in the Python standard library can be used to create, visit, and modify ASTs of Python source code. It was introduced in Python 2.6 and has evolved since then alongside the Python grammar.

Even though it has long been part of the standard library, it is not common to use it directly. Rather, you might have used it indirectly, as popular tools use it under the hood:

  • code testing: mutpy is a mutation testing tool that alters the code under test to broaden the set of tests in an automated fashion. In practice, a mutation is an artificial modification of the AST generated from the code under test. To see how PyBites uses mutpy, check out this article.
  • code coverage: vulture is a static code analyzer that studies an AST to identify portions of the code that are never used.
  • code vulnerabilities: bandit uses the AST representation to identify security vulnerabilities.
  • code autocompletion: jedi is an autocompletion tool for IDEs and text editors that relies on the ast module functionality to safely evaluate expressions.
  • code formatting and style: black (a code formatter) and flake8 (a style checker that reports violations rather than rewriting code) are two popular tools for enforcing a consistent code style, and both rely on an AST representation of the source code.

Using the ast module to investigate the PyBites Bite exercises

Still not convinced of the relevance of an AST? Fair enough: let’s consider a more practical use case, one closer to the PyBites Platform.

The PyBites Platform currently offers 300+ Bite exercises, and the number is constantly increasing. Given that the (semi)hidden intention of the platform is to offer a varied set of challenges covering different Python modules and functionalities, it is becoming more and more challenging to identify what is covered by the available exercises, and what is instead left to explore.

This is where we can take advantage of the ast module. Specifically, we can process the source code of the exercise solutions (as provided by the authors of the challenges) and recover some statistics about their content, for instance which modules and builtin functions are the most popular.

Here are some of the results. To follow along, check out this Jupyter notebook.

Builtins popularity

Pybites exercises - builtins popularity

The histogram above shows the Python builtin calls sorted by their popularity. In other words, using the ast module one can detect when a function call is made and whether it refers to the builtins module. Three colors are used to visually distinguish between exception types, the creation of base types (int, float, bool, list, and set), and other functions. The histogram is a normalized frequency count, i.e., the occurrences of each element are summed across all exercises and divided by the total number of occurrences of all elements across all exercises.

A few observations:

  • The distribution is heavy-tailed, with len() representing 13.4% of all builtin calls, while dir() is used only once.
  • All five base types are used, but bool() is used in only 1 challenge.
  • Only 5 of the standard exceptions are used, with ValueError being the most common.
  • Most of the builtin functions are already used by the exercises, but looking at the functional programming calls you can notice that map() appears while filter() does not (indeed, the common practice is to prefer list comprehensions).
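The detection of builtin calls described above could be sketched roughly as follows. This is not the code from the notebook: count_builtin_calls and the sample snippet are hypothetical names made up for illustration, and the check is deliberately naive (it does not notice when a builtin name is shadowed by a local definition).

```python
import ast
import builtins
from collections import Counter


def count_builtin_calls(source: str) -> Counter:
    """Count calls whose callee is a bare name defined in the builtins module."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        # A call such as len(x) is a Call node whose func is a Name node.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if hasattr(builtins, node.func.id):
                counts[node.func.id] += 1
    return counts


sample = """
values = list(range(3))
if len(values) != 3:
    raise ValueError("unexpected length")
print(len(values))
"""

counts = count_builtin_calls(sample)
total = sum(counts.values())

# Normalized frequency: occurrences of each builtin divided by all occurrences.
for name, n in counts.most_common():
    print(f"{name}: {n / total:.1%}")
```

Running the same function over every solution file and summing the counters would give a normalized histogram of the kind discussed above.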

Modules popularity

Pybites exercises - modules popularity

The histogram above shows the ranking for modules. For simplicity, we report on root modules only. If submodules are used, their frequencies are added to the frequency of the respective root module.
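Folding submodules into their root module can be sketched by inspecting Import and ImportFrom nodes. As before, this is an illustrative sketch rather than the notebook's actual code: count_root_modules and the sample are hypothetical, and relative imports are simply skipped.

```python
import ast
from collections import Counter


def count_root_modules(source: str) -> Counter:
    """Count imported modules, attributing submodules to their root module."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import os.path" is counted under the root module "os".
            for alias in node.names:
                counts[alias.name.split(".")[0]] += 1
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # "from os import path" is counted under "os"; relative imports are ignored.
            counts[node.module.split(".")[0]] += 1
    return counts


sample = """
import os.path
import collections
from collections import Counter
from os import path
"""

module_counts = count_root_modules(sample)
print(module_counts)
```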

As before, the histogram is heavy-tailed, a testament to the fact that the PyBites Bite exercises try to “cover a little bit of everything”.

We can observe the presence of non-standard modules, such as pandas and pytest, as well as more ad-hoc modules, such as zodiac and fibonacci, that are created for the purpose of the challenges themselves.

One can easily expand the analysis to understand the functions used in each module/submodule, as well as dive into more specific analyses. What is relevant to highlight is that the results reported here are generated with about 50 lines of Python code using the ast module. Processing the 300+ source code files with tools like awk, grep, or anything else would have been significantly harder.

Hopefully these examples gave you a rough idea of what you can achieve with an AST. The next step is to understand how to create such data structures, and investigate their composition.

Dissecting an assignment instruction using the ast module

To start getting familiar with the ast module, let’s see what happens when we analyze a single instruction: one_plus_two = 1+2

>>> import ast
>>> code = "one_plus_two = 1+2"
>>> tree = ast.parse(code)
>>> print(ast.dump(tree, indent=4))

This will output:

Module(
    body=[
        Assign(
            targets=[
                Name(id='one_plus_two', ctx=Store())],
            value=BinOp(
                left=Constant(value=1),
                op=Add(),
                right=Constant(value=2)))],
    type_ignores=[])

It might not be obvious at first, but the output generated by ast.dump() is actually a tree:

  • The words starting with a capital letter are nodes of the tree.
  • The attributes of the nodes are either edges of the tree or metadata.
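One way to see the nodes and their attributes without reading the dump is to walk the tree programmatically. The sketch below uses ast.walk() and ast.iter_fields() from the standard library; the exact attribute names printed for each node can vary slightly between Python versions.

```python
import ast

tree = ast.parse("one_plus_two = 1+2")

# ast.walk() yields every node of the tree, starting from the root.
node_types = []
for node in ast.walk(tree):
    # ast.iter_fields() yields the (attribute name, value) pairs of a node.
    field_names = [name for name, _ in ast.iter_fields(node)]
    node_types.append(type(node).__name__)
    print(type(node).__name__, field_names)
```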

Let’s rework the output into a diagram with the following conventions:

  • One rectangle for each node, with the related node type marked in bold.
  • Node attributes collecting metadata are reported in blue.
  • Other node attributes are annotated with their type.
  • Nodes are connected based on their attributes.

AST sketch

With this visualization at hand we can observe a few things.

The root of the tree is a Module node. In fact, even if our example is a single-line program, it is still a true Python module. The node contains two attributes, body and type_ignores. Let’s put type_ignores aside for a moment and focus on body.

As a Python module contains a series of instructions, the Module.body attribute is a list of nodes, one for each instruction in the program. Our example consists of a single assignment operation, hence Module.body contains only one Assign node.
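To illustrate the one-node-per-instruction rule, here is a small sketch with a three-line program (the program itself is made up for this example):

```python
import ast

tree = ast.parse("a = 1\nb = 2\nprint(a + b)")

# Module.body holds one node per top-level instruction.
statement_types = [type(stmt).__name__ for stmt in tree.body]
print(len(tree.body))
print(statement_types)
```

The last instruction is reported as an Expr node because a bare function call is an expression used as a statement.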

An assignment operation has a right-hand side specifying the operation to perform, and a left-hand side specifying the destination of the operation. The two sides are associated with the Assign.value and Assign.targets attributes of the Assign node, respectively.

Considering the right-hand side, the Assign.value attribute is a BinOp node, since the instruction is a binary operation between two operands. It is fully specified by three attributes:

  • BinOp.op is an Add node, given that we are performing an addition.
  • BinOp.left and BinOp.right are the addition operands and consist of Constant nodes, each holding the raw value in the Constant.value attribute.
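The attributes just described can be accessed directly on the parsed tree. A minimal sketch, assuming Python 3.8 or later, where ast.parse() represents literals as Constant nodes:

```python
import ast

tree = ast.parse("one_plus_two = 1+2")

assign = tree.body[0]   # the single Assign node
binop = assign.value    # right-hand side of the assignment

print(type(binop).__name__)                  # BinOp
print(type(binop.op).__name__)               # Add
print(binop.left.value, binop.right.value)   # 1 2
```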

Considering the left-hand side, as Python supports multiple assignments and tuple unpacking, the Assign.targets attribute is a list collecting the different destinations of the operation. In our case the assignment is to a single variable, so a single Name node is used. In turn, the Name node has 2 attributes:

  • Name.id stores the name of the variable used in the program ("one_plus_two").
  • Name.ctx specifies how the variable reference is used in the program. This can only be one of the types ast.Load, ast.Del or ast.Store, and those are always empty nodes.
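The following sketch shows how the left-hand side changes with the form of the assignment, and how Name.ctx reflects the way a name is used. The helper describe_names is a hypothetical function written for this example.

```python
import ast


def describe_names(source: str):
    """Return (identifier, context type) for every Name node in the source."""
    return [
        (node.id, type(node.ctx).__name__)
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Name)
    ]


# A name is stored on the left-hand side and loaded on the right-hand side.
print(describe_names("x = y"))   # [('x', 'Store'), ('y', 'Load')]
# Deleting a name uses the Del context.
print(describe_names("del x"))   # [('x', 'Del')]

# A chained assignment has one target per destination...
chained = ast.parse("a = b = 1").body[0]
print(len(chained.targets))      # 2

# ...while tuple unpacking has a single Tuple target containing the names.
unpack = ast.parse("a, b = 1, 2").body[0]
print(type(unpack.targets[0]).__name__)   # Tuple
```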

The Module.type_ignores attribute and type comments

In the vast majority of cases, the Module.type_ignores attribute is going to be an empty list. This is why it is colored in blue in the sketch. To understand why this is the case and what the actual purpose of the attribute is, we need to make a digression.

Python 3.0 introduced annotations, and a few years later those were expanded into type hints. If you are not familiar with those concepts, check this Real Python tutorial and the official doc.
