CPT code: https://github.com/fastnlp/CPT
CPT paper: https://arxiv.org/pdf/2109.05729.pdf
Data Preprocessing
- https://zhuanlan.zhihu.com/p/388830967 — a detailed walkthrough of Megatron-LM's preprocess_data.py; in the input JSON, the only key that really matters is `text`, which just needs to hold a value.
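As a minimal sketch of that input format (the filename and sample sentences here are illustrative, not from the original notes), each line of the JSON file is one object whose `text` key carries the raw document text:

```python
import json

# One JSON object per line (JSON Lines); preprocess_data.py only needs "text".
samples = [
    {"text": "今天天气不错。"},
    {"text": "CPT 是一个中文预训练模型。"},
]
with open("sample.json", "w", encoding="utf-8") as f:
    for s in samples:
        f.write(json.dumps(s, ensure_ascii=False) + "\n")
```

A file produced this way can then be passed to `--input` in the command below.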
- Prepare the training dataset with the following command:
```shell
jsonfile="/Users/phoenixbai/workspace/github/CPT/tmp/eight.files3.json"
vocabfile="/Users/phoenixbai/workspace/github/CPT/finetune/generation/output/adgen/2/vocab.txt"
prefix="test"
python ../pretrain/tools/preprocess_data.py \
    --input $jsonfile \
    --output-prefix $prefix \
    --vocab $vocabfile \
    --dataset-impl mmap \
    --tokenizer-type BertWordPieceCase
```
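A quick way to confirm the run succeeded is to look for the binary data file and its index. This is a sketch assuming Megatron-LM's default output naming (`{prefix}_{key}_document.bin/.idx` for the default JSON key `text`); verify the exact names against your Megatron-LM version:

```python
import os

# Assumed naming convention: --output-prefix test + default key "text"
# should yield test_text_document.bin and test_text_document.idx.
prefix = "test"
for ext in ("bin", "idx"):
    path = f"{prefix}_text_document.{ext}"
    print(path, "found" if os.path.exists(path) else "missing")
```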
Environment Setup
- You need a machine with a GPU card; how to upgrade the GPU driver will be covered in a separate article.
- To continue pretraining from the already-trained cpt-base checkpoint, a small code change is needed: https://github.com/fastnlp/CPT/issues/30
```python
# model_path = 'roberta_zh'
model_name = "fnlp/cpt-base"
# self.language_model = HFBartModel(config, encoder_config)
# self.language_model = HFBartModel(config)
# encoder_state = torch.load(model_path + '/pytorch_model.bin', map_location='cpu')
# self.language_model.model.encoder.load_state_dict(encoder_state)
self.language_model = HFBartModel.from_pretrained(model_name)
```