CPT code: https://github.com/fastnlp/CPT
CPT paper: https://arxiv.org/pdf/2109.05729.pdf
Data Preprocessing
- https://zhuanlan.zhihu.com/p/388830967 — a detailed walkthrough of Megatron-LM's preprocess_data.py; in the input JSON, the only key that really matters is `text`, which just needs to hold a value.
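As a minimal sketch of that input format (the filename and sample sentences here are illustrative, not from the original notes), each line of the JSON file is one object whose `text` key carries the raw document text:

```python
import json

# One JSON object per line (JSON Lines); preprocess_data.py only needs "text".
samples = [
    {"text": "今天天气不错。"},
    {"text": "CPT 是一个中文预训练模型。"},
]
with open("sample.json", "w", encoding="utf-8") as f:
    for s in samples:
        f.write(json.dumps(s, ensure_ascii=False) + "\n")
```

A file produced this way can then be passed to `--input` in the command below.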
- Prepare the training dataset with the following command:
```shell
jsonfile="/Users/phoenixbai/workspace/github/CPT/tmp/eight.files3.json"
vocabfile="/Users/phoenixbai/workspace/github/CPT/finetune/generation/output/adgen/2/vocab.txt"
prefix="test"
python ../pretrain/tools/preprocess_data.py \
    --input $jsonfile \
    --output-prefix $prefix \
    --vocab $vocabfile \
    --dataset-impl mmap \
    --tokenizer-type BertWordPieceCase
```
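A quick way to confirm the run succeeded is to look for the binary data file and its index. This is a sketch assuming Megatron-LM's default output naming (`{prefix}_{key}_document.bin/.idx` for the default JSON key `text`); verify the exact names against your Megatron-LM version:

```python
import os

# Assumed naming convention: --output-prefix test + default key "text"
# should yield test_text_document.bin and test_text_document.idx.
prefix = "test"
for ext in ("bin", "idx"):
    path = f"{prefix}_text_document.{ext}"
    print(path, "found" if os.path.exists(path) else "missing")
```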
Environment Setup
- You need a machine with a GPU card; how to upgrade the GPU driver will be covered in a separate article.
- To continue pretraining from the already-trained cpt-base checkpoint, a small code change is needed: https://github.com/fastnlp/CPT/issues/30
```python
# model_path = 'roberta_zh'
model_name = "fnlp/cpt-base"
# self.language_model = HFBartModel(config, encoder_config)
# self.language_model = HFBartModel(config)
# encoder_state = torch.load(model_path + '/pytorch_model.bin', map_location='cpu')
# self.language_model.model.encoder.load_state_dict(encoder_state)
self.language_model = HFBartModel.from_pretrained(model_name)
```