Paper Notes | Code Pretraining (code pretraining series)

ttliu_kiwi

Article link: https://blog.csdn.net/ting0922/article/details/113895780

Pre-trained contextual embedding of source code

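Rejected from ICLR 2020; later retitled "Learning and Evaluating Contextual Embedding of Source Code" and published at ICML 2020
Google Brain
CuBERT: link to the open-source code and data
The model is the same as BERT; only the corpus and the fine-tuning tasks are replaced.

This paper does code pretraining and proposes CuBERT (Code Understanding BERT), trained on a corpus of 6.6M Python files. The data comes from public GitHub repositories hosted on Google's BigQuery platform.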

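Pretraining tasks

Masked language modeling (MLM)
Predict which token was masked.
Next Sentence Prediction (NSP)
Predict whether two logical lines of code are consecutive, i.e. whether the second one directly follows the first.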
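
To make the NSP setup on code concrete, here is a minimal sketch of how logical lines could be extracted from Python source and paired into positive and negative examples. It is not the authors' pipeline: the helper names `logical_lines` and `nsp_pairs` are hypothetical, and drawing negatives from the same file is a simplifying assumption.

```python
import io
import random
import tokenize


def logical_lines(source: str) -> list[str]:
    """Split Python source into logical lines (a statement may span several physical lines)."""
    skip = (tokenize.NL, tokenize.COMMENT, tokenize.INDENT,
            tokenize.DEDENT, tokenize.ENDMARKER)
    lines, current = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NEWLINE:  # NEWLINE ends a logical line
            if current:
                lines.append(" ".join(current))
            current = []
        elif tok.type not in skip:
            current.append(tok.string)
    return lines


def nsp_pairs(lines: list[str], rng: random.Random) -> list[tuple[str, str, int]]:
    """Build (first, second, label) triples; label 1 means `second` really follows `first`."""
    pairs = []
    for i in range(len(lines) - 1):
        # Assumption: negatives are other lines of the same file.
        negatives = [j for j in range(len(lines)) if j not in (i, i + 1)]
        if negatives and rng.random() < 0.5:
            pairs.append((lines[i], lines[rng.choice(negatives)], 0))
        else:
            pairs.append((lines[i], lines[i + 1], 1))
    return pairs


source = "x = foo(1,\n        2)\ny = x + 1\nprint(y)\n"
lines = logical_lines(source)
pairs = nsp_pairs(lines, random.Random(0))
```

Note that the call to `foo` spans two physical lines but counts as one logical line, which is why the split uses the tokenizer rather than `str.splitlines`.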

Fine-tuning tasks
Five classification tasks and one other task

Variable Misuse Classification
Wrong Binary Operator
Swapped Operand
Function-Docstring Mismatch
Exception type
Variable Misuse localization and repair
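
Several of these tasks are built by injecting a synthetic bug into otherwise correct code and asking the model to tell buggy from original. As a rough illustration of that idea (my own sketch with the standard `ast` module, not the authors' data-generation code), the transformers below corrupt one binary operator or swap one pair of operands. The class and function names are hypothetical, and `ast.unparse` requires Python 3.9+.

```python
import ast


class ReplaceOperator(ast.NodeTransformer):
    """Replace the first eligible binary operator with a different one."""
    SWAPS = {ast.Add: ast.Sub, ast.Sub: ast.Add, ast.Mult: ast.Div, ast.Div: ast.Mult}

    def __init__(self):
        self.done = False

    def visit_BinOp(self, node):
        self.generic_visit(node)  # visit inner expressions first
        new_op = self.SWAPS.get(type(node.op))
        if not self.done and new_op is not None:
            node.op = new_op()
            self.done = True
        return node


class SwapOperands(ast.NodeTransformer):
    """Swap the operands of the first non-commutative binary operation."""
    NON_COMMUTATIVE = (ast.Sub, ast.Div, ast.FloorDiv, ast.Mod, ast.Pow)

    def __init__(self):
        self.done = False

    def visit_BinOp(self, node):
        self.generic_visit(node)
        if not self.done and isinstance(node.op, self.NON_COMMUTATIVE):
            node.left, node.right = node.right, node.left
            self.done = True
        return node


def corrupt(source: str, transformer: ast.NodeTransformer) -> str:
    """Return the source with one synthetic bug injected by `transformer`."""
    tree = transformer.visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))


original = "def area(w, h):\n    return w * h - 1\n"
wrong_operator = corrupt(original, ReplaceOperator())  # label: buggy
swapped_operand = corrupt(original, SwapOperands())    # label: buggy
```

A classification example is then a pair such as `(wrong_operator, 1)` versus `(original, 0)`.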
The fine-tuning datasets are: [table image not available]

CodeBERT: A Pre-trained model for programming and natural languages

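Findings of EMNLP 2020
Zhangyin Feng, Daya Guo, Duyu Tang …
Harbin Institute of Technology, Sun Yat-sen University, Microsoft Research Asia
Open-source code and data
Takeaways: the pretrained model is available for reuse; fine-tuning on each language separately works better than fine-tuning on all 6 languages together.

This paper does code pretraining and proposes CodeBERT (a first large bimodal pre-trained model for natural language and programming language).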

Model architecture
Following BERT and RoBERTa, CodeBERT uses a multi-layer bidirectional Transformer. The architecture is identical to RoBERTa-base, with 125 million parameters.
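
As a sanity check on the 125M figure, the parameter count can be reproduced with simple arithmetic. The configuration values below (vocabulary 50265, hidden size 768, 12 layers, feed-forward size 3072, 514 position slots, 1 token type) are the commonly published RoBERTa-base settings and are assumptions here, since the post does not list them. The function name is hypothetical, and the count excludes the MLM head, whose output matrix is usually tied to the word embeddings.

```python
def roberta_base_param_count(vocab=50265, hidden=768, layers=12,
                             ffn=3072, max_pos=514, type_vocab=1) -> int:
    """Count encoder parameters for a RoBERTa-base style Transformer."""
    layer_norm = 2 * hidden  # scale + bias
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden  # dense layer on top of the first token
    return embeddings + layers * per_layer + pooler


total = roberta_base_param_count()
print(f"{total:,} parameters, about {total / 1e6:.0f}M")
```

Roughly a third of the parameters sit in the embedding table, and the rest is spread over the 12 Transformer layers.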

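input: $[CLS], w_1, w_2, \cdots, w_n, [SEP], c_1, c_2, \cdots, c_m, [EOS]$
where $w_i$ is a natural-language token and $c_i$ is a code token
output: the contextual representation of every token, plus the representation of [CLS], which serves as the aggregated representation of the whole sequence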
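
The input layout can be sketched as a small helper that concatenates the two segments with the special tokens. The name `build_input`, the 512-token limit, and the policy of truncating the code segment first are assumptions for illustration; the post does not say how over-long pairs are handled.

```python
def build_input(nl_tokens, code_tokens, max_len=512):
    """Concatenate NL and code tokens as [CLS] NL [SEP] code [EOS]."""
    if max_len < 3:
        raise ValueError("max_len must leave room for the three special tokens")
    budget = max_len - 3  # room for [CLS], [SEP], [EOS]
    nl = list(nl_tokens)[:budget]
    code = list(code_tokens)[: budget - len(nl)]  # assumption: code is truncated first
    return ["[CLS]", *nl, "[SEP]", *code, "[EOS]"]


sequence = build_input(["return", "the", "max"],
                       ["def", "f", "(", "a", ",", "b", ")", ":"])
```

The encoder output at position 0 (the `[CLS]` slot) is what gets used as the aggregated sequence representation.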

Pretraining data

The dataset comes from CodeSearchNet. It contains 2.1M bimodal datapoints of the form <individual function, documentation>, and 6.4M unimodal codes, which are functions without paired documentation.
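
To illustrate the bimodal/unimodal distinction, the sketch below splits the functions in a Python file by whether they carry a docstring. This is only an illustration with the standard `ast` module, not how CodeSearchNet was actually built. The name `split_functions` is hypothetical, and for simplicity the function text in a bimodal pair still contains its docstring.

```python
import ast


def split_functions(source: str):
    """Return (bimodal, unimodal): functions with and without documentation."""
    bimodal, unimodal = [], []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            code = ast.get_source_segment(source, node)
            doc = ast.get_docstring(node)
            if doc:
                bimodal.append((code, doc))  # <function, documentation> pair
            else:
                unimodal.append(code)        # code only
    return bimodal, unimodal


source = (
    "def add(a, b):\n"
    '    """Return the sum of a and b."""\n'
    "    return a + b\n"
    "\n"
    "def sub(a, b):\n"
    "    return a - b\n"
)
bimodal, unimodal = split_functions(source)
```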

Pretraining tasks

Masked Language Modeling (MLM)
The input is an NL-PL pair, in which some positions are randomly selected and masked.
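
A minimal sketch of the masking step on an NL-PL pair follows. The 15% masking rate is the usual BERT default and is an assumption here, as is the choice to always substitute `[MASK]` (BERT additionally keeps or randomizes some of the chosen tokens). The name `mask_tokens` is hypothetical.

```python
import random

SPECIAL = {"[CLS]", "[SEP]", "[EOS]"}


def mask_tokens(tokens, rng, mask_prob=0.15, mask_token="[MASK]"):
    """Mask a random subset of non-special tokens; return (masked, labels)."""
    candidates = [i for i, tok in enumerate(tokens) if tok not in SPECIAL]
    n_mask = max(1, round(len(candidates) * mask_prob)) if candidates else 0
    chosen = set(rng.sample(candidates, n_mask))
    masked = [mask_token if i in chosen else tok for i, tok in enumerate(tokens)]
    # Labels hold the original token at masked positions and None elsewhere.
    labels = [tok if i in chosen else None for i, tok in enumerate(tokens)]
    return masked, labels


tokens = ["[CLS]", "return", "the", "max", "[SEP]",
          "def", "f", "(", "a", ",", "b", ")", ":", "[EOS]"]
masked, labels = mask_tokens(tokens, random.Random(0))
```

Because positions are sampled across the whole pair, a masked code token can be predicted from the natural-language side and vice versa.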
