3. CodeT5
3.2 Pre-training Tasks
We now introduce our proposed pre-training tasks that enable CodeT5 to learn useful patterns from either PL-only or NL-PL bimodal data.
Identifier-aware Denoising Pre-training. Denoising Sequence-to-Sequence (Seq2Seq) pre-training has been shown to be quite effective in a broad set of NLP tasks (Song et al., 2019; Raffel et al., 2020; Lewis et al., 2020). This denoising objective typically first corrupts the source sequence with some noising functions and then requires the decoder to recover the original texts. In this work, we utilize a span masking objective similar to T5 (Raffel et al., 2020) that randomly masks spans with arbitrary lengths and then predicts these masked spans, combined with some sentinel tokens, at the decoder. We refer to this task as Masked Span Prediction (MSP), as illustrated in Figure 2 (a).
Specifically, we employ the same 15% corruption rate as T5 and ensure an average span length of 3 by uniformly sampling spans of 1 to 5 tokens. Moreover, we employ whole-word masking by sampling spans before subword tokenization, which avoids masking partial subtokens and has been shown to be helpful (Sun et al., 2019). Notably, we pre-train a shared model for various PLs to learn robust cross-lingual representations. We describe the masked span prediction loss as:

$$\mathcal{L}_{MSP}(\theta) = \sum_{t=1}^{k} -\log P_\theta\!\left(x_t^{\text{mask}} \mid \mathbf{x}^{\setminus \text{mask}}, \mathbf{x}_{<t}^{\text{mask}}\right), \quad (1)$$

where $\theta$ are the model parameters, $\mathbf{x}^{\setminus \text{mask}}$ is the masked input, $\mathbf{x}^{\text{mask}}$ is the masked sequence to predict from the decoder with $k$ denoting the number of tokens in $\mathbf{x}^{\text{mask}}$, and $\mathbf{x}_{<t}^{\text{mask}}$ is the span sequence generated so far.
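As a concrete illustration of this corruption scheme, the Python sketch below masks whole-word spans of 1 to 5 tokens at a roughly 15% rate and pairs each span with a T5-style sentinel in the target. The `<extra_id_k>` sentinel names and the simple budget heuristic are assumptions for illustration, not the paper's exact preprocessing code.

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, min_span=1, max_span=5, seed=None):
    """Minimal sketch of T5-style span corruption as described above.

    Spans of 1-5 whole-word tokens (average length 3) are replaced in the source
    by sentinel tokens; the target lists each sentinel followed by the tokens it
    hides. Sentinel naming follows T5's <extra_id_k> convention (an assumption).
    """
    rng = random.Random(seed)
    budget = max(1, int(len(tokens) * corruption_rate))  # number of tokens to mask
    source, target = [], []
    i, sentinel_id, masked = 0, 0, 0
    while i < len(tokens):
        # Start a span with probability ~ corruption_rate / average span length.
        if masked < budget and rng.random() < corruption_rate / 3:
            span_len = rng.randint(min_span, max_span)
            span = tokens[i:i + span_len]
            sentinel = f"<extra_id_{sentinel_id}>"
            source.append(sentinel)          # sentinel replaces the span in the source
            target.append(sentinel)          # target: sentinel followed by the hidden span
            target.extend(span)
            sentinel_id += 1
            masked += len(span)
            i += span_len
        else:
            source.append(tokens[i])
            i += 1
    return source, target

# Example: mask a tiny snippet (whitespace pre-tokenized for illustration).
src, tgt = span_corrupt("def add ( a , b ) : return a + b".split(), seed=0)
```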
To fuse more code-specific structural information (the identifier node type in AST) into the model, we propose two additional tasks: Identifier Tagging (IT) and Masked Identifier Prediction (MIP) to complement the denoising pre-training.
• Identifier Tagging (IT). It aims to notify the model of whether a code token is an identifier or not, which shares a similar spirit with syntax highlighting in some developer-aided tools. As shown in Figure 2 (b), we map the final hidden states of the PL segment at the CodeT5 encoder into a sequence of probabilities $p = (p_1, \ldots, p_m)$, and compute a binary cross-entropy loss for sequence labeling:

$$\mathcal{L}_{IT}(\theta_e) = \sum_{i=1}^{m} -\left[y_i \log p_i + (1 - y_i) \log(1 - p_i)\right], \quad (2)$$

where $\theta_e$ are the encoder parameters. Note that by casting the task as a sequence labeling problem, the model is expected to capture the code syntax and the data flow structures of the code. A minimal sketch of this objective, together with MIP below, follows the list.
• Masked Identifier Prediction (MIP). Different from the random span masking in MSP, we mask all identifiers in the PL segment and employ a unique sentinel token for all occurrences of one specific identifier. In the field of software engineering, this is called obfuscation, where changing identifier names does not impact the code semantics. Inspired by Rozière et al. (2021), we arrange the unique identifiers with the sentinel tokens into a target sequence $I$ as shown in Figure 2 (c). We then predict it in an auto-regressive manner:

$$\mathcal{L}_{MIP}(\theta) = \sum_{j=1}^{|I|} -\log P_\theta\!\left(I_j \mid \mathbf{x}^{\setminus I}, I_{<j}\right), \quad (3)$$

where $\mathbf{x}^{\setminus I}$ is the masked input. Note that deobfuscation is a more challenging task that requires the model to comprehend the code semantics based on obfuscated code and link the occurrences of the same identifiers together.
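The following sketch illustrates the two auxiliary objectives in PyTorch: a binary tagging head over the encoder states for IT (Eq. 2), and the construction of an obfuscated source plus identifier target sequence for MIP (Eq. 3). The linear head, hidden size, and sentinel naming are assumptions; only the loss forms come from the text.

```python
import torch
import torch.nn as nn

# --- Identifier Tagging (IT): a binary classification head over encoder states.
# d_model and the single linear head are assumptions; the text only specifies
# the sequence-labeling cross-entropy of Eq. (2).
d_model = 768
it_head = nn.Linear(d_model, 1)

def identifier_tagging_loss(encoder_states, identifier_labels):
    """encoder_states: (batch, seq_len, d_model); identifier_labels: (batch, seq_len) in {0, 1}."""
    logits = it_head(encoder_states).squeeze(-1)
    # Logits-based BCE is numerically equivalent to Eq. (2) on probabilities.
    return nn.functional.binary_cross_entropy_with_logits(logits, identifier_labels.float())

# --- Masked Identifier Prediction (MIP): obfuscate the code by giving every
# distinct identifier its own sentinel, then ask the decoder to recover the
# mapping autoregressively. Sentinel naming is an assumption.
def build_mip_example(code_tokens, identifier_set):
    sentinel_of, source = {}, []
    for tok in code_tokens:
        if tok in identifier_set:
            if tok not in sentinel_of:
                sentinel_of[tok] = f"<extra_id_{len(sentinel_of)}>"
            source.append(sentinel_of[tok])   # same sentinel for every occurrence
        else:
            source.append(tok)
    # Target I: each sentinel followed by its original identifier, in order of first use.
    target = [t for name, s in sentinel_of.items() for t in (s, name)]
    return source, target
```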
We alternately optimize these three losses with an equal probability, which constitutes our proposed identifier-aware denoising pre-training.
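A minimal sketch of this alternating schedule, assuming the three loss functions above are available as callables:

```python
import random

def training_step(batch, compute_msp, compute_it, compute_mip):
    """One pre-training step under the alternating schedule described above:
    sample one of the three objectives with equal probability and return its
    loss. The compute_* callables stand for the losses sketched earlier."""
    objective = random.choice([compute_msp, compute_it, compute_mip])
    return objective(batch)  # back-propagate only this loss at this step
```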
Bimodal Dual Generation. In the pre-training phase, the decoder only sees discrete masked spans and identifiers, which is disparate from the downstream tasks where the decoder needs to generate either fluent NL texts or syntactically correct code snippets. To close the gap between pre-training and fine-tuning, we propose to leverage the NL-PL bimodal data to train the model for a bidirectional conversion, as shown in Figure 2 (d). Specifically, we regard NL→PL generation and PL→NL generation as dual tasks and simultaneously optimize the model on them. For each NL-PL bimodal datapoint, we construct two training instances with reverse directions and add language ids (e.g., <java> and <en> for Java PL and English NL, respectively). This operation can also be seen as a special case of T5's span masking, by masking either the full NL or the full PL segment of the bimodal input. This task aims to improve the alignment between the NL and PL counterparts.
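The sketch below shows how the two reverse-direction instances might be constructed from one NL-PL pair; the exact placement and surface form of the language-id tokens are assumptions beyond the <java>/<en> examples given in the text.

```python
def dual_generation_instances(nl_text, pl_code, nl_tag="<en>", pl_tag="<java>"):
    """Build the two reverse-direction training instances described above.
    Prepending the language tag to both source and target is an assumption
    about the exact input format."""
    nl_to_pl = {"source": f"{nl_tag} {nl_text}", "target": f"{pl_tag} {pl_code}"}
    pl_to_nl = {"source": f"{pl_tag} {pl_code}", "target": f"{nl_tag} {nl_text}"}
    return nl_to_pl, pl_to_nl
```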
3.3 Fine-tuning CodeT5
After pre-training on large-scale unlabeled data, we adapt CodeT5 to downstream tasks via either task-specific transfer learning or multi-task learning.
Task-specific Transfer Learning: Generation vs. Understanding Tasks. Code-related tasks can be categorized into generation and understanding tasks. For the former, CodeT5 can be naturally adapted with its Seq2Seq framework. For understanding tasks, we investigate two ways: either generating the label as a unigram target sequence (Raffel et al., 2020), or predicting it from the vocabulary of class labels based on the last decoder hidden state, following Lewis et al. (2020).
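A hedged sketch of the first option, casting a classification task as label generation; the "true"/"false" verbalizer and the prompt string are illustrative assumptions (the decoder-state variant is sketched later, alongside the understanding-task results).

```python
# Verbalize the class label and emit it as a one-token target sequence,
# so that classification fine-tuning reuses the same Seq2Seq interface.
LABEL_WORDS = {1: "true", 0: "false"}   # assumed verbalizer

def to_seq2seq_example(source_code: str, label: int, prompt: str = "Defect: "):
    """Turn a labeled snippet into a (source, target) pair for generation-style
    fine-tuning; the prompt string is an assumption."""
    return {"source": prompt + source_code, "target": LABEL_WORDS[label]}
```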
Multi-task Learning. We also explore a multi-task learning setting by training a shared model on multiple tasks at a time. Multi-task learning reduces computation cost by reusing most of the model weights across many tasks and has been shown to improve model generalization in NL pre-training (Liu et al., 2019a). We follow Raffel et al. (2020) in employing the same unified model for all tasks without adding any task-specific networks, but allow selecting different best checkpoints for different tasks. To notify the model which task it is dealing with, we design a unified format of task control codes and prepend it to the source inputs, as shown in Figure 1. For instance, we employ "Translate Java to CSharp:" as the source prompt for the code-to-code translation task from Java to CSharp.
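A minimal sketch of prepending task control codes; only the Java-to-CSharp prompt is quoted in the text, so the other prompt strings below are assumed wordings.

```python
# Task control codes prepended to the source input in the multi-task setting.
TASK_PROMPTS = {
    "translate_java_cs": "Translate Java to CSharp: ",   # quoted in the text
    "summarize_python": "Summarize Python: ",            # assumed wording
    "refine_java_small": "Refine Java small: ",          # assumed wording
}

def build_multitask_input(task: str, source_text: str) -> str:
    """Prepend the task's control code so one shared model can route by task."""
    return TASK_PROMPTS[task] + source_text
```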
4. Experimental Setup
4.1 Pre-training Dataset
We follow Feng et al. (2020) and employ CodeSearchNet (Husain et al., 2019) to pre-train CodeT5, which consists of six PLs with both unimodal and bimodal data. Apart from that, we additionally collect two datasets of C/CSharp from BigQuery to ensure that all downstream tasks have PLs that overlap with the pre-training data. In total, we employ around 8.35 million instances for pre-training. Table 1 shows some basic statistics. To obtain the identifier labels from code, we leverage tree-sitter to convert the PL into an abstract syntax tree and then extract its node type information. We filter out reserved keywords for each PL from its identifier list. We observe that PLs have different identifier rates: Go has the lowest rate at 19% and Ruby has the highest at 32%.
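The identifier-extraction step might look as follows with the py-tree-sitter bindings (pre-0.22 API style); the grammar paths and the Python-only example are assumptions, not the paper's exact pipeline.

```python
import keyword
from tree_sitter import Language, Parser  # py-tree-sitter, pre-0.22 API style

# Build/load a grammar; the paths below assume a locally vendored
# tree-sitter-python grammar (an assumption for illustration).
Language.build_library("build/langs.so", ["vendor/tree-sitter-python"])
PY_LANGUAGE = Language("build/langs.so", "python")

def identifier_tokens(code: str):
    """Parse code into an AST and return identifier strings with reserved
    keywords filtered out, mirroring the labeling procedure described above."""
    parser = Parser()
    parser.set_language(PY_LANGUAGE)
    src = code.encode("utf8")
    tree = parser.parse(src)
    found, stack = [], [tree.root_node]
    while stack:
        node = stack.pop()
        if node.type == "identifier":
            name = src[node.start_byte:node.end_byte].decode("utf8")
            if not keyword.iskeyword(name):   # drop reserved keywords
                found.append(name)
        stack.extend(node.children)
    return found
```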
4.2 Code-specific Tokenizer
Tokenization is a key ingredient for the success of pre-trained language models like BERT and GPT. They often employ a Byte-Pair Encoding (BPE) tokenizer (Sennrich et al., 2016) to alleviate the Out-of-Vocabulary (OoV) issue. Specifically, we train a Byte-level BPE tokenizer following Radford et al. (2019) and set the vocabulary size to 32,000, as in T5. We add additional special tokens ([PAD], [CLS], [SEP], [MASK0], …, [MASK99]). This tokenizer is trained on all of our pre-training data, with non-printable characters and low-frequency tokens (occurring < 3 times) filtered out. We compare it with T5's default tokenizer and find that our tokenizer largely reduces the length of tokenized code sequences by 30% - 45% on downstream tasks. This accelerates training and especially benefits generation tasks due to the shorter sequences to predict. We also spot a severe problem when applying T5's default tokenizer to source code: it encodes some common code tokens such as the brackets ['{', '}'] into unknown tokens.
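A hedged sketch of training such a tokenizer with the Huggingface tokenizers library; the corpus file path is an assumption, while the vocabulary size, frequency cutoff, and special tokens follow the text.

```python
from tokenizers import ByteLevelBPETokenizer

# Special tokens as listed in the text: [PAD], [CLS], [SEP], [MASK0]..[MASK99].
special_tokens = ["[PAD]", "[CLS]", "[SEP]"] + [f"[MASK{i}]" for i in range(100)]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["pretrain_corpus.txt"],   # assumed: pre-processed code/NL lines, one per row
    vocab_size=32000,
    min_frequency=3,                 # drops tokens occurring fewer than 3 times
    special_tokens=special_tokens,
)
tokenizer.save_model("codet5_tokenizer")
```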
4.3 Downstream Tasks and Metrics
We cover most generation and understanding tasks in the CodeXGLUE benchmark (Lu et al., 2021) and employ the public datasets and the same data splits it provides for all these tasks.
We first consider two cross-modal generation tasks. Code summarization aims to summarize a function-level code snippet into an English description. The dataset consists of six PLs, including Ruby, JavaScript, Go, Python, Java, and PHP, from CodeSearchNet (Husain et al., 2019). We employ smoothed BLEU-4 (Lin and Och, 2004) to evaluate this task. Code generation is the task of generating a code snippet from an NL description. We employ the Concode data (Iyer et al., 2018) in Java, where the input contains both NL texts and class environment contexts, and the output is a function. We evaluate it with BLEU-4, exact match (EM) accuracy, and CodeBLEU (Ren et al., 2020), which considers syntactic and semantic matches based on the code structure in addition to the n-gram match.
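For reference, the sketch below approximates the two string-level metrics with NLTK; it is not the official CodeXGLUE scoring script, which should be used for exact comparability.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def smoothed_bleu4(reference: str, hypothesis: str) -> float:
    """Approximate smoothed BLEU-4; NLTK's smoothing is a stand-in for the
    Lin and Och (2004) smoothing used by the benchmark scripts."""
    smooth = SmoothingFunction().method2
    return sentence_bleu([reference.split()], hypothesis.split(),
                         weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=smooth)

def exact_match(references, hypotheses) -> float:
    """Fraction of predictions that match the reference string exactly."""
    return sum(r.strip() == h.strip() for r, h in zip(references, hypotheses)) / len(references)
```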
Besides, we consider two code-to-code generation tasks. Code translation aims to migrate legacy software from one PL to another, where we focus on translating functions from Java to CSharp and vice versa. Code refinement aims to convert a buggy function into a correct one. We employ two Java datasets provided by Tufano et al. (2019) with various function lengths: small (fewer than 50 tokens) and medium (50-100 tokens). We use BLEU-4 and exact match to evaluate them.
We also investigate how CodeT5 performs on two understanding-based tasks. The first one is defect detection, which aims to predict whether a piece of code is vulnerable to software systems or not. We use the C dataset provided by Zhou et al. (2019) for experiments. The second task is clone detection, which aims to measure the similarity between two code snippets and predict whether they have the same functionality. We experiment with the Java data provided by Wang et al. (2020). We employ F1 score and accuracy for evaluating these two tasks, respectively. In total, our CodeT5 supports six tasks and fourteen sub-tasks in CodeXGLUE with a unified encoder-decoder model.
4.4 Comparison Models
We compare CodeT5 with state-of-the-art (SOTA) pre-trained models that can be categorized into three types: encoder-only, decoder-only, and encoder-decoder models. As encoder-only models, we consider RoBERTa (Liu et al., 2019b), RoBERTa (code) trained with masked language modeling (MLM) on code, CodeBERT (Feng et al., 2020) trained with both MLM and replaced token detection (Clark et al., 2020), GraphCodeBERT (Guo et al., 2021) using data flow from code, and DOBF (Rozière et al., 2021) trained with the identifier deobfuscation objective. Note that although DOBF employs a Seq2Seq model during pre-training, it only aims to train a better encoder for downstream tasks without exploring the potential benefit of the pre-trained decoder.
For decoder-only models, we compare GPT-2 (Radford et al., 2019) and its adaptations to the code domain, CodeGPT-2 and CodeGPT-adapted. The difference is that the latter utilizes a GPT-2 checkpoint for model initialization while the former is trained from scratch. As encoder-decoder models, the current SOTA model for the CodeXGLUE benchmark is PLBART (Ahmad et al., 2021), based on the BART (Lewis et al., 2020) architecture. For pre-training data, most of these models employ CodeSearchNet (Husain et al., 2019), except DOBF and PLBART. DOBF is pre-trained on 7.9M Java and 3.6M Python files from BigQuery, while PLBART employs much larger data with 470M Python and 210M Java functions, and 47M NL posts from StackOverflow.
4.5 Model Configurations
We build CodeT5 based on Huggingface's T5 (Raffel et al., 2020) PyTorch implementation and employ two sizes: CodeT5-small (60M) and CodeT5-base (220M). We set the maximum source and target sequence lengths to 512 and 256, respectively. We use FP16 mixed precision to accelerate the pre-training. We set the batch size to 1024 and employ a peak learning rate of 2e-4 with linear decay. We pre-train the model with the denoising objective for 100 epochs and with bimodal dual training for a further 50 epochs on a cluster of 16 NVIDIA A100 GPUs with 40GB memory. The total training time for CodeT5-small and CodeT5-base is 5 and 12 days, respectively.
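A hedged sketch of instantiating a model of the stated "base" size with Huggingface Transformers, with the reported optimization settings collected in a plain dictionary; the T5Config field values beyond the rough parameter count, and the vocabulary total, are assumptions.

```python
from transformers import T5Config, T5ForConditionalGeneration

# Roughly T5-base sized (~220M parameters); exact architectural details for
# CodeT5-base beyond this are assumptions.
config = T5Config(
    vocab_size=32100,        # 32,000 BPE vocabulary plus special tokens (assumed total)
    d_model=768, d_ff=3072, num_layers=12, num_decoder_layers=12, num_heads=12,
)
model = T5ForConditionalGeneration(config)

# Optimization settings as reported in the text, gathered for reference.
training_settings = dict(
    max_source_length=512, max_target_length=256,
    fp16=True, batch_size=1024,
    peak_lr=2e-4, lr_schedule="linear_decay",
    denoising_epochs=100, dual_gen_epochs=50,
)
```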
In the fine-tuning phase, we find that the tasks in CodeXGLUE (Lu et al., 2021) are quite sensitive to some hyperparameters such as learning rate, training steps, and batch size. We conduct a grid search and select the best parameters based on the validation set. In multi-task learning, we cover all downstream tasks except clone detection.
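A minimal sketch of such a grid search, assuming an evaluate() callable that fine-tunes with the given hyperparameters and returns a validation score; the candidate values are illustrative assumptions.

```python
from itertools import product

def grid_search(evaluate, lrs=(1e-5, 2e-5, 5e-5), batch_sizes=(16, 32), steps=(30000, 50000)):
    """Try every combination of the candidate hyperparameters and keep the one
    with the best validation score returned by evaluate()."""
    best_score, best_cfg = float("-inf"), None
    for lr, bs, n in product(lrs, batch_sizes, steps):
        score = evaluate(lr=lr, batch_size=bs, train_steps=n)
        if score > best_score:
            best_score, best_cfg = score, dict(lr=lr, batch_size=bs, train_steps=n)
    return best_cfg, best_score
```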
5. Results and Analysis
In this section, we compare CodeT5 with SOTA models on a broad set of CodeXGLUE downstream tasks (§5.1), and investigate the effects of our bimodal dual generation and multi-task learning (§5.2), followed by a detailed analysis on the proposed identifier-aware pre-training (§5.3).
5.1 CodeXGLUE Downstream Tasks
We evaluate two sizes of our model: CodeT5-small and CodeT5-base, both pre-trained with identifier-aware denoising. In addition, we consider the model that continues training with bimodal dual generation (dual-gen) and show its results with multi-task fine-tuning. The results of all comparison models are taken from their original papers and from the CodeXGLUE paper (Lu et al., 2021).
Code Summarization. We show code summarization results in smoothed BLEU-4 on six PL datasets in Table 2. We observe that all our model variants significantly outperform prior work with either an encoder-only (RoBERTa, CodeBERT, DOBF) or encoder-decoder framework (PLBART). Moreover, the salient performance gap between these two groups of models confirms that encoder-only frameworks are suboptimal for generation tasks. Compared to the SOTA encoder-decoder model PLBART, we find that even our CodeT5-small yields better overall scores (also on Python and Java), given that our model is much smaller (60M vs. 140M) and PLBART is pre-trained with much larger Python and Java data (> 100 times). We attribute such improvement to our identifier-aware denoising pre-training and better employment of bimodal training data. By increasing the model size, our CodeT5-base boosts the overall performance by over 1.2 absolute points over PLBART.
Code Generation. We compare CodeT5 with GPT-style models and PLBART in Table 3. Our CodeT5-small outperforms all decoder-only models and also the SOTA PLBART, which again confirms the superiority of encoder-decoder models at generating code snippets. Moreover, our CodeT5-base further significantly pushes the SOTA results across three metrics. Particularly, it achieves around 4.7 points improvement on CodeBLEU over PLBART, indicating our CodeT5 can better comprehend the code syntax and semantics with the help of identifier-aware pre-training.
Code-to-Code Generation Tasks. We compare two code-to-code generation tasks: code translation and code refinement in Table 4 and further consider one naive copy baseline by copying the source input as the target prediction. In the code translation task, our CodeT5-small outperforms most of baselines and obtains comparable results with PLBART, which shows the advantages of encoder-decoder models in the code-to-code generation setting. Our CodeT5-base further achieves consistent improvements over PLBART across various metrics for translating from Java to C# and vice versa.
Here we show one of CodeT5's outputs when translating C# to Java in Figure 3. In this case, despite the poor BLEU score, CodeT5 is able to generate a function that preserves the same functionality and even has better readability compared to the ground truth. This reveals that CodeT5 has a good generalization ability instead of memorizing and repeating what it has seen before. On the other hand, it also suggests that BLEU score is not a perfect evaluation metric for code generation tasks, where a higher score can sometimes instead reflect the problematic copy behaviour of neural models.
Another code-to-code generation task is code refinement, a challenging task that requires the model to detect which parts of the code are buggy and fix them by generating a bug-free code sequence. Due to the large overlap between source and target code, even the naive copy approach yields very high BLEU scores but zero exact matches. Therefore, we focus on the exact match (EM) metric to evaluate this task. We observe that EM scores on the small data are consistently higher than on the medium data, indicating that it is harder to fix bugs in longer code snippets. Our CodeT5-base significantly outperforms all baselines on EM and especially boosts the score by over 4.8 points on the more challenging medium task (13.96 vs. GraphCodeBERT's 9.10), reflecting its strong code understanding capability.
Understanding Tasks. We compare on two understanding tasks, defect detection and clone detection, in Table 5. Specifically, we generate the binary labels as a unigram sequence from the decoder for the defect detection task, while for the clone detection task, we first obtain the sequence embedding of each code snippet using the last decoder state, following Lewis et al. (2020).
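A hedged sketch of the clone-detection setup described here: each snippet is embedded with the final decoder hidden state and the two embeddings are compared. Feeding the same tokens to the decoder and using a thresholded cosine similarity are assumptions about implementation details the text does not spell out.

```python
import torch

@torch.no_grad()
def sequence_embedding(model, tokenizer, code: str) -> torch.Tensor:
    """Encode a snippet with the encoder-decoder and take the final decoder
    hidden state at the last position as its sequence embedding (BART-style
    setup; re-feeding the source tokens to the decoder is an assumption)."""
    enc = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    out = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"],
                decoder_input_ids=enc["input_ids"], output_hidden_states=True)
    return out.decoder_hidden_states[-1][:, -1, :]          # shape: (1, d_model)

def is_clone(model, tokenizer, code_a: str, code_b: str, threshold: float = 0.9) -> bool:
    """Assumed comparison head: cosine similarity of the two snippet embeddings."""
    emb_a = sequence_embedding(model, tokenizer, code_a)
    emb_b = sequence_embedding(model, tokenizer, code_b)
    return torch.cosine_similarity(emb_a, emb_b).item() > threshold
```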
5.2 Effects of Bimodal Dual Generation and Multi-task Learning
We study the effects of our proposed bimodal dual generation (dual-gen) and multi-task learning (multi-task) on CodeT5. We find that for the cross-modal tasks of code summarization and code generation, dual-gen improves CodeT5's performance, indicating that it strengthens the alignment between NL and PL. However, for code-to-code tasks, dual-gen does not bring significant gains, possibly because these tasks do not require strong cross-modal alignment. For multi-task learning, we find that it improves or maintains CodeT5's performance on most tasks, especially those with less data, suggesting that it enhances CodeT5's generalization ability. However, on some tasks with more data or higher difficulty, multi-task learning can cause a slight performance drop, possibly due to interference or competition among tasks.
5.3 Effects of Identifier-aware Pre-training

We analyze the effect of our proposed identifier-aware pre-training on CodeT5. We conduct ablation experiments that remove each of the three pre-training tasks, masked span prediction (MSP), identifier tagging (IT), and masked identifier prediction (MIP), and evaluate on four selected downstream tasks. We find that removing any one of the pre-training tasks degrades CodeT5's downstream performance, indicating that all three tasks contribute positively. Among them, MSP is the most important pre-training task, as it allows CodeT5 to learn general language representations and generation ability. IT and MIP are our proposed code-specific pre-training tasks; they enable CodeT5 to better exploit the identifier information in code and improve the quality of code understanding and generation.
We provide an ablation study to examine the contribution of each component in the identifier-aware objective. Specifically, we compare the performance of CodeT5-small on four selected tasks after removing each of the three objectives: masked span prediction (MSP), identifier tagging (IT), and masked identifier prediction (MIP). As shown in Table 6, removing any one of the objectives generally degrades performance on all tasks, indicating that all of them contribute to better code understanding in CodeT5. However, the effect of each objective differs across tasks. Specifically, removing MSP largely degrades performance on all generation tasks but instead improves defect detection, showing that masked span prediction is more crucial for capturing the syntactic information needed by generation tasks. In contrast, removing MIP hurts the defect detection task the most, indicating that it may focus more on code semantic understanding.

6. Conclusion

We have presented CodeT5, a unified pre-trained encoder-decoder model built on the T5 architecture that better leverages the code semantics conveyed by developer-assigned identifiers. Our model adopts a unified framework that seamlessly supports both code understanding and generation tasks and allows for multi-task learning. We propose a novel identifier-aware pre-training objective that enables the model to distinguish which code tokens are identifiers and to recover them when they are masked. In addition, we propose to exploit user-written code comments for bimodal dual generation to achieve better NL-PL alignment. Comprehensive experiments show that CodeT5 significantly outperforms prior methods on code understanding tasks such as defect detection and clone detection, and on generation tasks in the PL-NL, NL-PL, and PL-PL directions. Further analysis reveals that our model can better capture semantic information from code. In short, CodeT5 is a pre-trained model that can both understand and generate code; it exploits identifier information and code comments during pre-training; and it surpasses prior models on multiple code-related tasks.