Paper Translation: AST-Transformer: Encoding Abstract Syntax Trees Efficiently for Code Summarization



Preface




Published in November 2021 at ASE (the International Conference on Automated Software Engineering), a CCF-A ranked conference.

If you want to learn more about this paper, you can visit the following links:

arXiv: https://arxiv.org/abs/2112.01184
https://arxiv.org/pdf/2112.01184.pdf

1. Abstract

Abstract—Code summarization aims to generate brief natural language descriptions for source code. As source code is highly structured and follows strict programming language grammars, its Abstract Syntax Tree (AST) is often leveraged to provide the encoder with structural information. However, ASTs are usually much longer than the source code. Current approaches ignore this size limit and simply feed the whole linearized AST into the encoder. To address this problem, we propose AST-Transformer to efficiently encode tree-structured ASTs. Experiments show that AST-Transformer outperforms the state of the art by a substantial margin while reducing the computational complexity of the encoding process by 90~95%.
Index Terms—tree-based neural network, source code summarization


2. Introduction

The summary of a piece of source code is a brief natural language description explaining the purpose of the code [1]. The code to be summarized can come in different units; in this work, we focus on summarizing the subroutines or defined methods in a program. Current state-of-the-art approaches all follow the encoder-decoder architecture [2]–[4] and can be trained end-to-end with code-summary pairs. Since source code is highly structured and follows rigid programming language grammars, a common practice is to also leverage the Abstract Syntax Tree (AST) to help the encoder digest the structural information. The AST is usually linearized by different algorithms such as pre-order traversal [5], structure-based traversal (SBT) [6] and path decomposition [7], and then fed into the encoder. Several works have also proposed architectures specific to tree encoding, such as tree-LSTM [8], [9].


However, the linearized ASTs, since they contain additional structural information, are much longer than their corresponding source code sequences. Some linearization algorithms further increase the length; for example, linearizing with SBT usually doubles the size of the original AST. This makes it extremely difficult for the model to accurately detect useful dependency relations in the overlong input sequence. Moreover, it brings significant computational overhead, especially for state-of-the-art Transformer-based models, where the number of self-attention operations grows quadratically with the sequence length. Encoding ASTs with tree-based models like tree-LSTM incurs extra complexity, as they need to traverse the whole tree to obtain the state of each node.


In this work, we argue that it is unnecessary to model the dependency between every single node pair. Our intuition is that the state of a node in the AST is affected most by (1) its ancestor-descendant nodes, which represent the hierarchical relationship within one operation, and (2) its sibling nodes, which represent the temporal relationship across different operations. Based on this intuition, we propose AST-Transformer, a simple variant of the Transformer model that efficiently handles the tree-structured AST.


3. Approach

This section details our proposed AST-Transformer, i.e., a simple yet effective Transformer variant for dealing with the tree-structured AST. The overall architecture of AST-Transformer has two main parts, namely the AST Encoder and the Decoder. The particularity of the proposed AST-Transformer lies in three special components of the Encoder: AST Linearization, Relation Matrices, and Tree-Structure Attention. First, subsection §II-A introduces three different linearization methods for transforming the input AST into a sequence. Then, subsection §II-B defines the two matrices that encode the ancestor-descendant and sibling relationships in the tree, as well as concrete methods for constructing them. Finally, subsection §II-C illustrates the proposed self-attention mechanism based on the relation matrices, which is used for generating code summaries.


A. AST Linearization
In order to make the tree-shaped AST suitable as input to a neural network model, it first needs to be converted into a sequence with a linearization method. Technically, the proposed AST-Transformer is orthogonal to the linearization and can be built upon any concrete approach. In this paper, the three most representative methods are selected for experiments, to test which one achieves the best effect in combination with the self-attention based on the relation matrices: Pre-order Traversal (POT), Structure-based Traversal (SBT) [6] and Path Decomposition (PD) [7]. We find through experiments that the performance of SBT and PD in AST-Transformer differs little from that of POT. However, generating POT for the entire dataset saves almost 90~95% of the time cost compared with generating SBT or PD. Fortunately, using the simplest POT is already enough to achieve state-of-the-art performance.

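As a concrete illustration of POT (not code from the paper, which works on Java methods), here is a minimal sketch that linearizes a Python AST with the standard `ast` module; the example snippet and the use of node type names as tokens are assumptions for illustration:

```python
import ast

def preorder_linearize(node):
    """Pre-order traversal (POT): emit the current node's type name,
    then recurse into its children from left to right."""
    tokens = [type(node).__name__]
    for child in ast.iter_child_nodes(node):
        tokens.extend(preorder_linearize(child))
    return tokens

# Linearize the AST of a small function definition.
code = "def add(a, b):\n    return a + b\n"
tree = ast.parse(code)
print(preorder_linearize(tree))
# -> ['Module', 'FunctionDef', 'arguments', 'arg', 'arg',
#     'Return', 'BinOp', 'Name', 'Load', 'Add', 'Name', 'Load']
```

Even for this tiny function, the linearized AST is longer than the token sequence of the source code, which is exactly the length blow-up discussed in the introduction.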

B. Relationship Matrices
We define two kinds of relationships between nodes in the tree that we care about: the ancestor-descendant relationship and the sibling relationship. The former represents the hierarchical information within one operation, and the latter represents the temporal information across different operations. Specifically, two nodes have the ancestor-descendant relationship if there exists a directed path from the root node that traverses through both of them. Two nodes have the sibling relationship if they share the same parent node. We use two matrices of size N×N, denoted A and S, to represent the ancestor-descendant and sibling relationships respectively, where N is the total number of nodes. We denote the i-th node in the linearized AST as n_i. A_{i,j} is the length of the shortest path between n_i and n_j in the AST, and S_{i,j} is the horizontal sibling distance between n_i and n_j if they satisfy the sibling relationship. If a relationship is not satisfied, the corresponding entry in the matrix is set to infinity. By taking advantage of these two matrices, the model can find related nodes efficiently and in parallel by scanning the matrices instead of traversing the original tree.

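For illustration, a minimal sketch of how A and S could be built, assuming the tree of the linearized AST is given as a parent-index list plus per-node ordered child lists (input formats assumed here, not specified by the paper):

```python
import math

def build_relation_matrices(parent, children):
    """Build the ancestor-descendant matrix A and the sibling matrix S.

    parent:   parent[i] is the index of node i's parent (-1 for the root)
    children: children[i] is the ordered list of node i's children
    Entries stay at infinity when the corresponding relationship does not hold.
    """
    n = len(parent)
    A = [[math.inf] * n for _ in range(n)]
    S = [[math.inf] * n for _ in range(n)]

    # Ancestor-descendant: walk from each node up to the root and record
    # the path length to every ancestor (symmetric in A).
    for i in range(n):
        A[i][i] = 0
        dist, anc = 0, i
        while parent[anc] != -1:
            anc = parent[anc]
            dist += 1
            A[i][anc] = A[anc][i] = dist

    # Sibling: nodes sharing the same parent; the distance is the
    # horizontal offset between their positions in the child list.
    for childs in children:
        for a, u in enumerate(childs):
            for b, v in enumerate(childs):
                S[u][v] = abs(a - b)

    return A, S
```

Because both diagonals are zero, every node is related to at least itself, which the attention sketch below relies on; all other lookups reduce to scanning matrix entries rather than re-traversing the tree.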

C. Tree-Structure Attention
To incorporate the relationship matrices into self-attention, we combine the practices of Shaw et al. [14] and He et al. [15]. Similar to the vanilla Transformer, we use multi-head attention to jointly attend to information from the different relationship matrices. The outputs of the self-attention with the ancestor-descendant and the sibling relationship matrices are then concatenated and projected once more, producing the final values.

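The excerpt does not give the exact equations, so the following is only a plausible single-head sketch (written in PyTorch, an assumption) of attention that is biased and restricted by the two relation matrices, with the two outputs concatenated and projected as described above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TreeStructureAttention(nn.Module):
    """Single-head sketch: one attention pass per relation matrix, where the
    (clipped) tree distance indexes a learned scalar bias added to the scores
    and unrelated node pairs are masked out. The two resulting outputs are
    concatenated and projected back to the model dimension. The paper uses
    multi-head attention; one head is shown here for brevity."""

    def __init__(self, d_model, max_dist=8):
        super().__init__()
        self.d_model = d_model
        self.max_dist = max_dist
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # one distance-bias table per relation (ancestor-descendant, sibling)
        self.rel_bias = nn.ModuleList([nn.Embedding(max_dist + 1, 1) for _ in range(2)])
        self.out = nn.Linear(2 * d_model, d_model)

    def _attend(self, x, dist, bias_table):
        # dist: (N, N) float tensor of tree distances, inf where unrelated.
        # Assumes dist[i][i] == 0 so every node can attend at least to itself.
        unrelated = torch.isinf(dist)
        idx = dist.clamp(max=self.max_dist).long()          # inf clamps to max_dist
        bias = bias_table(idx).squeeze(-1)                  # (N, N) learned bias
        scores = self.q(x) @ self.k(x).transpose(-2, -1) / self.d_model ** 0.5
        scores = (scores + bias).masked_fill(unrelated, float("-inf"))
        return F.softmax(scores, dim=-1) @ self.v(x)        # (N, d_model)

    def forward(self, x, A, S):
        # x: (N, d_model) embeddings of the linearized AST nodes
        out_a = self._attend(x, A, self.rel_bias[0])
        out_s = self._attend(x, S, self.rel_bias[1])
        return self.out(torch.cat([out_a, out_s], dim=-1))

# Hypothetical usage:
# x = torch.randn(num_nodes, 256); A, S are (num_nodes, num_nodes) float tensors
# layer = TreeStructureAttention(256); h = layer(x, A, S)
```

Masking unrelated pairs is what removes the quadratic all-pairs attention cost in practice: each node only scores the nodes it is actually related to.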

4. Experiments

The overall results of AST-Transformer and the baselines are presented in Table I. They show that AST-Transformer clearly outperforms all the baselines on all three metrics. AST-Transformer beats the nearest baseline, which uses the code token sequence as input, by 2.06 and 1.3 BLEU, 2.04 and 2.19 METEOR, and 0.45 and 0.41 ROUGE-L on the Java and Python datasets respectively, and the improvement is even more pronounced compared with baselines that use the AST or a linearized AST as input.

We see two main reasons for the improvement of AST-Transformer. First, the top two approaches (AST-Transformer and Transformer(Code)) both use the Transformer architecture. As code or ASTs are much longer than natural language, the self-attention mechanism helps the model catch meaningful long-distance word pairs or node pairs and thus learn features related to the code's functionality. Second, although the AST contains more information than the code token sequence, since it carries both semantic information (stored in the leaf nodes) and structural information (stored in the non-leaf nodes), most AST-based approaches perform worse than Transformer(Code), a model that only uses the code token sequence as input. This may be explained by the many very general structures in the AST, such as MethodDeclare → MethodBody. These structures occur in essentially every piece of code and are noise for the model, much like filler words in natural language. This is exactly why Transformer(SBT) hardly improves over DeepCom, as SBT produces around 4 times more nodes than the AST. In AST-Transformer, we only allow a node to exchange information with other nodes that are no more than K away from it, which effectively preserves the specificity of each node and keeps it from being assimilated by the overall structure.

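For illustration only, the K-neighborhood restriction mentioned above can be expressed as a boolean mask derived from either relation matrix (a hypothetical helper, compatible with the relation matrices sketched earlier):

```python
def relation_mask(M, K):
    """Node i may attend to node j only if the relationship holds
    (i.e. the entry is finite) and the tree distance is at most K."""
    n = len(M)
    return [[M[i][j] <= K for j in range(n)] for i in range(n)]
```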


5. Conclusion

In this paper, we have presented a new Transformer-based model that encodes ASTs effectively. By using two relationship matrices, AST-Transformer can encode the AST without suffering from prohibitive computational complexity. Comprehensive experiments show that AST-Transformer outperforms other competitive baselines and achieves state-of-the-art performance on several automatic metrics.

