(1) CPT pretrain
CPT's parameters are not randomly initialized; they are initialized from the parameters of a Chinese RoBERTa checkpoint.
roberta_zh/: place the checkpoint of Chinese RoBERTa here, as CPT initializes its encoder from that checkpoint.
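A minimal sketch of this initialization step, assuming a BERT/RoBERTa-compatible encoder and the hfl/chinese-roberta-wwm-ext checkpoint (both are illustrative choices, not the official CPT code):

```python
# Sketch (not the official CPT init code): copy a Chinese RoBERTa checkpoint's
# weights into an encoder before pretraining. The checkpoint name and the
# target model below are assumptions for illustration.
import torch
from transformers import BertModel

roberta = BertModel.from_pretrained("hfl/chinese-roberta-wwm-ext")
roberta_state = roberta.state_dict()

# Suppose `encoder` is the CPT encoder with a RoBERTa-compatible layout; load the
# matching tensors and leave the rest (e.g., the decoder) randomly initialized.
encoder = BertModel(roberta.config)  # placeholder standing in for the CPT encoder
missing, unexpected = encoder.load_state_dict(roberta_state, strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)
```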
CPT's pretraining is based on Megatron-LM (officially documented on GitHub); the relevant code lives under the pretrain directory of the repository.
This makes it possible to split the model across multiple GPUs for training. Model parallelism can place different layers of the model on different GPUs (pipeline parallelism), or split a layer's tensor computation across different GPUs (tensor parallelism); see the sketch below.
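A toy sketch of the two model-parallel styles (not Megatron-LM code), assuming two GPUs are available:

```python
# Pipeline parallelism places different layers on different GPUs; tensor
# parallelism splits one layer's weight matrix across GPUs. Illustrative only.
import torch
import torch.nn as nn

x = torch.randn(8, 1024, device="cuda:0")

# Pipeline parallelism: layer 0 on GPU 0, layer 1 on GPU 1.
layer0 = nn.Linear(1024, 1024).to("cuda:0")
layer1 = nn.Linear(1024, 1024).to("cuda:1")
y = layer1(layer0(x).to("cuda:1"))

# Tensor parallelism: split one 1024 -> 2048 projection column-wise into two
# 1024 -> 1024 shards, one per GPU, then concatenate the partial outputs.
shard0 = nn.Linear(1024, 1024).to("cuda:0")
shard1 = nn.Linear(1024, 1024).to("cuda:1")
out = torch.cat([shard0(x), shard1(x.to("cuda:1")).to("cuda:0")], dim=-1)
```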
Megatron-LM is a package: a toolkit for training pre-trained language models (PLMs).
The whole CPT pipeline follows the Megatron-LM workflow (data preprocessing, pretraining, fine-tuning, downstream-task evaluation); the CPT introduction notes that the process follows Megatron-LM.
The whole pretraining process of the CPT model: https://github.com/fastnlp/CPT/blob/master/pretrain/README.md
(2) Restructured PLM pretraining process
Data format: JSON (not plain text). Data tool: DataLab, for loading the data. Example records follow; a loading sketch appears after them.
https://github.com/ExpressAI/DataLab/tree/main/datasets
{"text": "Parker was selected to represent Queensland as an interchange for games II and III of the 2004 State of Origin series and game III of the 2005 State of Origin series.", "entities": ["Queensland", "2004 State of Origin", "2005 State of Origin"]}
{"text": "The image is only a small portion of the commercial product.", "entities": []}
{"text": "Rock of Ages is a stylised English version of the thirteenth century Hebrew Hanukkah hymn Ma'oz Tzur.", "entities": ["Hanukkah", "Ma'oz Tzur"]}
{"text": "Ok, nevermind, I'm apparently clueless as to what's been going on with these boxes. If Cide could look into it, I know he has a lot more experience with this than I do. z4ns4tsu\talk 19:22, 12 September 2006 (UTC)", "entities": []}
{"text": "I hope the line height will be reduced soon, if not to the density that I prefer.", "entities": []}
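A hedged sketch of loading such JSON-lines records; the file name corpus.jsonl is an assumption, and in practice DataLab (or another loader) could be used instead:

```python
# Read one JSON object per line, each with "text" and "entities" fields,
# matching the example records shown above.
import json

def read_examples(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            obj = json.loads(line)
            yield obj["text"], obj.get("entities", [])

for text, entities in read_examples("corpus.jsonl"):
    print(len(entities), text[:50])
```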
The publicly released models differ somewhat in the types of tasks they can handle; the released model structures correspond to the task types each can handle.
Model details
The model has roughly 11 billion parameters (its position can be checked against the existing PLM list at https://openbmb.github.io/BMList/; this parameter count is acceptable).
Question 1: what data format is used for the PLM?
Question 2: is the same template used for every task?
Answer:
Signal structure:
- general signal:
source: {corrupted text}
target: {corrupted position1}{target span1}{corrupted position2}{target span2}…
For example, the corrupted source could be "Thank you <X> me to your party <Y> week." and the corresponding target would be "<X> for inviting <Y> last <Z>".
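A minimal sketch of how such source/target pairs can be built (T5-style span corruption with sentinel tokens; the span positions are hand-picked here, whereas a real pipeline samples them):

```python
# Mask chosen spans in the input with sentinel tokens and move the masked
# spans to the target, mirroring the <X>/<Y>/<Z> example above.
def corrupt(tokens, spans):
    """spans: sorted, non-overlapping (start, end) index pairs to mask."""
    source, target, cursor = [], [], 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        source.extend(tokens[cursor:start])
        source.append(sentinel)
        target.append(sentinel)
        target.extend(tokens[start:end])
        cursor = end
    source.extend(tokens[cursor:])
    target.append(f"<extra_id_{len(spans)}>")  # closing sentinel, like <Z> above
    return " ".join(source), " ".join(target)

tokens = "Thank you for inviting me to your party last week .".split()
src, tgt = corrupt(tokens, [(2, 4), (8, 9)])
# src: "Thank you <extra_id_0> me to your party <extra_id_1> week ."
# tgt: "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"
```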
- task-related signal:
• multiple-choice format
• generation format
A multiple-choice format prompt for the sentiment classification task could be the following: "I like this movie. Is this text 'positive' or 'negative'?", while a generation format prompt could be the following: "I like this movie. What's the sentiment of the previous text?". We use two special markers, "TEXT:" and "QUERY:", to separate the general context and the intended task to be completed.
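A small sketch of how the two prompt formats could be assembled with the "TEXT:" and "QUERY:" markers; the exact template wording is an assumption for illustration:

```python
# Build the two task-related prompt formats described above.
def multiple_choice_prompt(text, question, options):
    opts = " or ".join(f'"{o}"' for o in options)
    return f"TEXT: {text} QUERY: {question} {opts}?"

def generation_prompt(text, question):
    return f"TEXT: {text} QUERY: {question}"

print(multiple_choice_prompt("I like this movie.", "Is this text", ["positive", "negative"]))
# TEXT: I like this movie. QUERY: Is this text "positive" or "negative"?
print(generation_prompt("I like this movie.", "What's the sentiment of the previous text?"))
# TEXT: I like this movie. QUERY: What's the sentiment of the previous text?
```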
For each type of signal, we construct multiple prompts so that the model can learn various query forms. We design a total of 1124 prompts for the 30 signals.