Fine-tuning the CSANMT Model

ZZZZyh00000

Last modified: 2023-12-19 09:26:17


First published: 2023-12-01 15:18:19

Article link: https://blog.csdn.net/ZZZZyh00000/article/details/134735683


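1. Word segmentation

There are two approaches: word-level tokenization and BPE. The BPE commands in 1.2 take the tokenized .tok files from 1.1 as input.

1.1 Tokenization

The English (en) file is tokenized with mosesdecoder. The following command processes the file and writes train.en.tok to the current directory: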

perl tokenizer.perl -l en < train.en > train.en.tok

Chinese is segmented with jieba, which writes a train.zh.tok file to the current directory:

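import jieba

# Segment train.zh line by line and write space-separated tokens to train.zh.tok.
# Working one line at a time keeps the output line-aligned with train.en.tok.
with open('train.zh', 'r', encoding='UTF-8') as fR, \
        open('train.zh.tok', 'w', encoding='UTF-8') as fW:
    for sent in fR:
        # jieba.cut returns a generator of tokens; drop whitespace-only tokens
        sent_list = [tok for tok in jieba.cut(sent.strip()) if tok.strip()]
        fW.write(' '.join(sent_list) + '\n')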

Notes:

English tokenization must use the same tool that the official model uses. It can be downloaded from: https://github.com/moses-smt/mosesdecoder

The command is:

~~~
perl tokenizer.perl -l en < train.en > train.en.tok
~~~

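The script is located at /Yourpath/mosesdecoder-master/scripts/tokenizer/tokenizer.perl

However, the script loads /Yourpath/mosesdecoder-master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en through a path relative to its own location.

When tokenizing English, keep this relative layout intact, for example by running tokenizer.perl from inside the repository rather than copying it elsewhere on its own.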
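
One way to sidestep the relative-path issue is to always call the script by its full path inside the unpacked repository. The following is a minimal sketch that wraps the command above in Python. The helper names and the `moses_root` parameter are hypothetical, and it assumes `perl` is available on the PATH.

```python
import subprocess
from pathlib import Path


def build_tokenizer_cmd(moses_root, lang="en"):
    """Build the tokenizer command, pointing at the script inside the repo."""
    script = Path(moses_root) / "scripts" / "tokenizer" / "tokenizer.perl"
    return ["perl", str(script), "-l", lang]


def tokenize_file(moses_root, src_path, dst_path, lang="en"):
    """Equivalent of: perl tokenizer.perl -l en < src_path > dst_path"""
    cmd = build_tokenizer_cmd(moses_root, lang)
    with open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
        # check=True raises CalledProcessError if the script exits with an error
        subprocess.run(cmd, stdin=fin, stdout=fout, check=True)


if __name__ == "__main__":
    # /Yourpath is the placeholder used in this article; replace it with your own path.
    tokenize_file("/Yourpath/mosesdecoder-master", "train.en", "train.en.tok")
```

Because the script stays in its original directory, its lookup of the nonbreaking_prefixes files should resolve regardless of the working directory.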

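1.2 Byte-Pair Encoding (BPE)

If you use BPE, apply it to the tokenized files as follows, where bpe.en and bpe.zh are the BPE codes files passed with -c: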

subword-nmt apply-bpe -c bpe.en < train.en.tok > train.en.tok.bpe

subword-nmt apply-bpe -c bpe.zh < train.zh.tok > train.zh.tok.bpe
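
To illustrate what apply-bpe does to each token, here is a toy sketch of the merge procedure. It is not the subword-nmt implementation: the merge list is made up for the example (real merges come from the codes files such as bpe.en), and the function names are hypothetical. It follows the subword-nmt conventions of an end-of-word marker `</w>` during merging and `@@ ` as the separator in the output.

```python
def apply_bpe_to_word(word, merges, separator="@@"):
    """Split one token into subword units using an ordered list of merges."""
    if not word:
        return word
    # Earlier merges have higher priority (lower rank)
    rank = {pair: i for i, pair in enumerate(merges)}
    # Start from characters; the last one carries the end-of-word marker
    symbols = list(word[:-1]) + [word[-1] + "</w>"]
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda p: rank.get(p, len(rank)))
        if best not in rank:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    # Remove the end-of-word marker before printing
    symbols[-1] = symbols[-1][: -len("</w>")]
    return (separator + " ").join(symbols)


def apply_bpe_to_line(line, merges):
    """Apply the merges to every whitespace-separated token of a line."""
    return " ".join(apply_bpe_to_word(tok, merges) for tok in line.split())


if __name__ == "__main__":
    toy_merges = [("l", "o"), ("lo", "w</w>")]  # made-up merges for illustration
    print(apply_bpe_to_line("low lot", toy_merges))  # prints: low lo@@ t
```

This is why BPE is run on the .tok files: it operates on whitespace-separated tokens, so the text has to be tokenized first.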

Notes:

For BPE, zh and en use the same tool. It can be downloaded from: https://github.com/rsennrich/subword-nmt

The commands are:

~~~
subword-nmt apply-bpe -c bpe.en < train.en.tok > train.en.tok.bpe

subword-nmt apply-bpe -c bpe.zh < train.zh.tok > train.zh.tok.bpe
~~~
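
Since the source and target files form a parallel corpus, line i of one file is expected to correspond to line i of the other. Before moving on, it may be worth confirming that preprocessing did not break this alignment. The check below is a minimal sketch; the function name is hypothetical and the file names are the ones produced in this section.

```python
from itertools import zip_longest


def check_parallel(src_path, trg_path):
    """Return (n_src, n_trg, empty_pairs) for two parallel text files.

    empty_pairs lists the 1-based line numbers where either side is blank.
    """
    n_src = n_trg = 0
    empty_pairs = []
    with open(src_path, encoding="UTF-8") as fs, \
            open(trg_path, encoding="UTF-8") as ft:
        for lineno, (s, t) in enumerate(zip_longest(fs, ft), start=1):
            if s is not None:
                n_src += 1
            if t is not None:
                n_trg += 1
            if s is not None and t is not None and (not s.strip() or not t.strip()):
                empty_pairs.append(lineno)
    return n_src, n_trg, empty_pairs


if __name__ == "__main__":
    n_src, n_trg, empty = check_parallel("train.en.tok.bpe", "train.zh.tok.bpe")
    print(f"source lines: {n_src}, target lines: {n_trg}, blank pairs: {len(empty)}")
    if n_src != n_trg:
        print("Line counts differ; the files are no longer aligned.")
```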

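2. Modify the configuration file

Edit the model's configuration.json as needed for fine-tuning; the main changes are the training set paths and the vocabulary files. The # comments below are annotations only: JSON does not allow comments, so do not copy them into the actual file.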

"train": {
        "num_gpus": 0,                                           # number of GPUs; 0 means run on CPU
        "warmup_steps": 4000,                                    # number of warm-up steps; default 4000
        "update_cycle": 1,                                       # accumulate gradients over update_cycle steps before each parameter update; default 1
        "keep_checkpoint_max": 1,                                # number of checkpoints kept during training
        "confidence": 0.9,                                       # label smoothing weight is 1 - confidence
        "optimizer": "adam",
        "adam_beta1": 0.9,
        "adam_beta2": 0.98,
        "adam_epsilon": 1e-9,
        "gradient_clip_norm": 0.0,
        "learning_rate_decay": "linear_warmup_rsqrt_decay",      # learning-rate decay strategy; options: [none, linear_warmup_rsqrt_decay, piecewise_constant]
        "initializer": "uniform_unit_scaling",                   # parameter initialization strategy; options: [uniform, normal, normal_unit_scaling, uniform_unit_scaling]
        "initializer_scale": 0.1,
        "learning_rate": 1.0,                                    # learning-rate scaling factor: once the rate is determined from the step count, it is scaled again according to model size
        "train_batch_size_words": 1024,                          # number of tokens per training batch
        "scale_l1": 0.0,
        "scale_l2": 0.0,
        "train_max_len": 100,                                    # training sentences are limited to length 100 by default; adjust as needed
        "num_of_epochs": 2,                                      # maximum number of epochs
        "save_checkpoints_steps": 1000,                          # save a checkpoint every this many steps
        "num_of_samples": 4,                                     # number of samples for continuous semantic sampling
        "eta": 0.6
    },
"dataset": {
        "train_src": "train.en",                                 # source-language data file
        "train_trg": "train.zh",                                 # target-language data file
        "src_vocab": {
            "file": "src_vocab.txt"                              # source-language vocabulary
        },
        "trg_vocab": {
            "file": "trg_vocab.txt"                              # target-language vocabulary
        }
    }
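
Editing the file by hand works, but the same changes can also be scripted, which helps when switching between several data sets. The sketch below assumes that "train" and "dataset" are top-level keys laid out as in the excerpt above. The function name and the example config path are hypothetical.

```python
import json


def update_config(config_path, train_src, train_trg, src_vocab, trg_vocab,
                  **train_overrides):
    """Rewrite the dataset paths and any training options in a JSON config."""
    with open(config_path, "r", encoding="UTF-8") as f:
        cfg = json.load(f)

    dataset = cfg.setdefault("dataset", {})
    dataset["train_src"] = train_src
    dataset["train_trg"] = train_trg
    dataset.setdefault("src_vocab", {})["file"] = src_vocab
    dataset.setdefault("trg_vocab", {})["file"] = trg_vocab

    # Keyword arguments overwrite entries of the "train" block, e.g. num_of_epochs=5
    cfg.setdefault("train", {}).update(train_overrides)

    with open(config_path, "w", encoding="UTF-8") as f:
        json.dump(cfg, f, ensure_ascii=False, indent=4)
    return cfg


if __name__ == "__main__":
    # Hypothetical path: point this at the config file inside your model directory.
    update_config(
        "./model/damo/nlp_csanmt_translation_en2zh/configuration.json",
        train_src="train.en",
        train_trg="train.zh",
        src_vocab="src_vocab.txt",
        trg_vocab="trg_vocab.txt",
        num_of_epochs=2,
        save_checkpoints_steps=1000,
    )
```

Keys not mentioned in the call are left untouched, so the rest of the configuration is preserved.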

3. Training

Run the following script:

from modelscope.trainers.nlp import CsanmtTranslationTrainer
# model points to the directory of the model to be trained
trainer = CsanmtTranslationTrainer(model="./model/damo/nlp_csanmt_translation_en2zh")
trainer.train()

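4. Translating with the fine-tuned model

Training saves weight files, mainly four of them, which correspond to the files in the original model's tf_ckpts directory. Because the model is still loaded through ModelScope's wrapper functions, the fine-tuned weight files must be renamed to match the original file names when they replace the originals; otherwise loading fails. A simple approach is to cp the model directory (called myCSANMT in the code below) and replace the weight files in the copy.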
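
The copy-and-rename step can be scripted. The sketch below is only an illustration: the function name is hypothetical, and the prefixes `model.ckpt-1000` and `ckpt-0` as well as the `./train_output` directory are placeholders, so check the actual file names in your training output and in the original tf_ckpts directory before using it.

```python
import shutil
from pathlib import Path


def install_checkpoint(train_dir, src_prefix, model_dir, dst_prefix):
    """Copy files named <src_prefix>.* from train_dir into model_dir/tf_ckpts,
    renaming them to <dst_prefix>.* so they match the original file names."""
    dst_dir = Path(model_dir) / "tf_ckpts"
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(Path(train_dir).iterdir()):
        if not src.is_file() or not src.name.startswith(src_prefix + "."):
            continue
        suffix = src.name[len(src_prefix):]  # for example ".index"
        dst = dst_dir / (dst_prefix + suffix)
        shutil.copyfile(src, dst)
        copied.append(dst.name)
    return copied


if __name__ == "__main__":
    # Placeholder names: adjust both prefixes and the training directory to your files.
    names = install_checkpoint(
        train_dir="./train_output",
        src_prefix="model.ckpt-1000",
        model_dir="/root/autodl-tmp/trans/model/damo/myCSANMT",
        dst_prefix="ckpt-0",
    )
    print(names)
```

If the directory also contains a plain-text `checkpoint` index file, it refers to checkpoints by prefix, so reusing the original prefix and leaving that file from the copied model in place should keep the two consistent.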

Translation code:

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

input_sequence = 'The failure of the therapeutic measures, such as chemotherapy, targeted therapy and immunotherapy, has been linked to the specific immune microenvironment of pancreatic cancer.'

pipeline_ins = pipeline(task=Tasks.translation, model="/root/autodl-tmp/trans/model/damo/myCSANMT")
outputs = pipeline_ins(input=input_sequence)

print(outputs['translation'])
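
To translate a whole file rather than a single sentence, the same pipeline call can be wrapped in a loop. This is a minimal sketch: `translate_lines` and the input file `test.en` are hypothetical, and sentences are translated one at a time for simplicity.

```python
def translate_lines(lines, translate_fn):
    """Translate non-empty lines one at a time; blank lines stay blank
    so the output remains line-aligned with the input."""
    results = []
    for line in lines:
        text = line.strip()
        results.append(translate_fn(text) if text else "")
    return results


if __name__ == "__main__":
    from modelscope.pipelines import pipeline
    from modelscope.utils.constant import Tasks

    pipeline_ins = pipeline(task=Tasks.translation,
                            model="/root/autodl-tmp/trans/model/damo/myCSANMT")

    def translate_fn(text):
        # Same call and output key as in the single-sentence example
        return pipeline_ins(input=text)['translation']

    with open("test.en", encoding="UTF-8") as f:  # hypothetical input file
        for zh in translate_lines(f, translate_fn):
            print(zh)
```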

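5. Model conversion

The fine-tuned model can also be converted. The conversion code is the same as before, with the model path replaced by the local path. As in section 4, the model is loaded from the local directory: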

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

input_sequence = 'The failure of the therapeutic measures, such as chemotherapy, targeted therapy and immunotherapy, has been linked to the specific immune microenvironment of pancreatic cancer.'

pipeline_ins = pipeline(task=Tasks.translation, model="/root/autodl-tmp/trans/model/damo/myCSANMT")
outputs = pipeline_ins(input=input_sequence)

print(outputs['translation'])