Reproducing Pretrained Language Models: CPT (Part 1) & reStructured Pre-training

(1) CPT pretraining

CPT's parameters are not randomly initialized: the encoder inherits the parameters of Chinese RoBERTa.
From the repo's README: roberta_zh/: Place the checkpoint of Chinese RoBERTa, as CPT initializes the encoder from that checkpoint.
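A minimal sketch of that initialization idea, assuming a hypothetical `cpt_encoder` module that shares the BERT layout of the Chinese RoBERTa checkpoint (this is not the repo's actual loading code):

```python
from transformers import BertModel

# Chinese RoBERTa checkpoints use the BERT architecture, so BertModel can load them.
roberta = BertModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

def init_encoder_from_roberta(cpt_encoder, roberta_model):
    """Copy every parameter whose name and shape match into the CPT encoder."""
    src = roberta_model.state_dict()
    dst = cpt_encoder.state_dict()
    copied = {k: v for k, v in src.items() if k in dst and v.shape == dst[k].shape}
    dst.update(copied)
    cpt_encoder.load_state_dict(dst)
    print(f"initialized {len(copied)}/{len(dst)} tensors from RoBERTa")
```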

CPT's pretraining is based on Megatron-LM (officially documented on GitHub); the relevant code lives under the repo's pretrain directory.
Megatron-LM implements model parallelism, i.e. training a model that is split across multiple GPUs. Model parallelism can mean placing different layers of the model on different GPUs (pipeline parallelism) or splitting a layer's tensor computation across GPUs (tensor parallelism); the sketch below illustrates the latter.
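A toy illustration of the tensor-parallel case, simulated on CPU with two weight shards standing in for two GPUs (Megatron-LM's real implementation also handles the backward pass and inter-device communication):

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 8)            # a batch of activations
w = torch.randn(8, 16)           # the full weight of one linear layer

w0, w1 = w.chunk(2, dim=1)       # column shards: w0 -> "GPU 0", w1 -> "GPU 1"
y0 = x @ w0                      # each device computes its partial output...
y1 = x @ w1
y = torch.cat([y0, y1], dim=1)   # ...and an all-gather concatenates them

assert torch.allclose(y, x @ w)  # identical to the unsharded computation
```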

Megatron-LM itself is a package: a toolkit for PLM training.

CPT's whole pipeline references Megatron-LM's process (data preprocessing, pretraining, fine-tuning, downstream task evaluation); the CPT introduction states that the process follows Megatron-LM.
The whole pretraining process of the CPT model: https://github.com/fastnlp/CPT/blob/master/pretrain/README.md

(2) reStructured Pre-training: the PLM process

Data format: JSON (not plain text). Data tooling: DataLab, used to load the data.
https://github.com/ExpressAI/DataLab/tree/main/datasets

{"text": "Parker was selected to represent Queensland as an interchange for games II and III of the 2004 State of Origin series and game III of the 2005 State of Origin series.", "entities": ["Queensland", "2004 State of Origin", "2005 State of Origin"]}
{"text": "The image is only a small portion of the commercial product.", "entities": []}
{"text": "Rock of Ages is a stylised English version of the thirteenth century Hebrew Hanukkah hymn Ma'oz Tzur.", "entities": ["Hanukkah", "Ma'oz Tzur"]}
{"text": "Ok, nevermind, I'm apparently clueless as to what's been going on with these boxes. If Cide could look into it, I know he has a lot more experience with this than I do. z4ns4tsu\talk 19:22, 12 September 2006 (UTC)", "entities": []}
{"text": "I hope the line height will be reduced soon, if not to the density that I prefer.", "entities": []}
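The examples above are JSON lines: one object per line, each with a "text" field and an "entities" list. A minimal loader (the file name is a placeholder):

```python
import json

def load_jsonl(path):
    """Yield one parsed example per non-empty line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

for example in load_jsonl("train.jsonl"):
    print(example["text"][:40], example["entities"])
```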

The publicly released models differ somewhat in the task types they are suited to; the model variants are released according to the task types they can handle.
Model details:

The model has roughly 11 billion parameters; to see where it sits, compare against the existing PLM list at https://openbmb.github.io/BMList/ (this parameter count is quite reasonable).
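A quick, generic way to check where any PyTorch model lands on that scale (a minimal sketch, independent of the specific checkpoint):

```python
import torch.nn as nn

def count_params(model: nn.Module) -> float:
    """Trainable parameter count in billions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e9
```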


Question 1: What data format is used for the PLM?
Question 2: Is the same template used for every task?

Answer:

Signal structure:
  1. General signal:

source: {corrupted text}
target: {corrupted position1}{target span1}{corrupted position2}{target span2}…

For example, the corrupted source could be "Thank you <X> me to your party <Y> week." and the corresponding target would be "<X> for inviting <Y> last <Z>" (a toy implementation follows after this list).
  2. Task-related signal:
    • multiple-choice format
    • generation format
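As referenced above, a toy implementation of the general-signal corruption (the `corrupt` helper and the hand-picked spans are invented for illustration; a real pipeline samples the spans randomly):

```python
SENTINELS = ["<X>", "<Y>", "<Z>"]

def corrupt(tokens, spans):
    """spans: list of (start, length) to mask out. Returns (source, target)."""
    starts = {start: length for start, length in spans}
    source, target = [], []
    i, k = 0, 0
    while i < len(tokens):
        if i in starts:
            sentinel = SENTINELS[k]; k += 1
            source.append(sentinel)                # source keeps a placeholder...
            target.append(sentinel)                # ...target restores the span
            target.extend(tokens[i:i + starts[i]])
            i += starts[i]
        else:
            source.append(tokens[i]); i += 1
    target.append(SENTINELS[k])                    # final sentinel closes the target
    return " ".join(source), " ".join(target)

tokens = "Thank you for inviting me to your party last week .".split()
src, tgt = corrupt(tokens, spans=[(2, 2), (8, 1)])
print(src)  # Thank you <X> me to your party <Y> week .
print(tgt)  # <X> for inviting <Y> last <Z>
```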

A multiple-choice format prompt for the sentiment classification task could be the following: I like this movie. Is this text "positive" or "negative"? A generation format prompt could be the following: I like this movie. What's the sentiment of the previous text? Two special markers, "TEXT:" and "QUERY:", separate the general context from the intended task to be completed.
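A minimal sketch of assembling the two prompt formats with the "TEXT:" / "QUERY:" markers (the helper functions are hypothetical, not from the paper's codebase):

```python
def multiple_choice_prompt(text, options):
    """Multiple-choice format: the query enumerates the candidate answers."""
    choices = " or ".join(f'"{o}"' for o in options)
    return f"TEXT: {text} QUERY: Is this text {choices}?"

def generation_prompt(text):
    """Generation format: the query asks an open question."""
    return f"TEXT: {text} QUERY: What's the sentiment of the previous text?"

print(multiple_choice_prompt("I like this movie.", ["positive", "negative"]))
print(generation_prompt("I like this movie."))
```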

For each type of signal, multiple prompts are constructed so that the model can learn various query forms; the paper designs a total of 1,124 prompts for the 30 signals.
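One way to organize several prompts per signal is a template pool sampled at construction time (the templates below are invented examples, not the paper's actual 1,124 prompts):

```python
import random

TEMPLATES = {
    "sentiment": [
        'TEXT: {text} QUERY: Is this text "positive" or "negative"?',
        "TEXT: {text} QUERY: What's the sentiment of the previous text?",
    ],
    "ner": [
        "TEXT: {text} QUERY: List the named entities in the previous text.",
        "TEXT: {text} QUERY: Which entities are mentioned above?",
    ],
}

def make_prompt(signal, text):
    """Pick one of the signal's templates so the model sees varied query forms."""
    return random.choice(TEMPLATES[signal]).format(text=text)

print(make_prompt("sentiment", "I like this movie."))
```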