Fine-tuning BART for Chinese-English Translation (PyTorch Code)

BART stands for Bidirectional and Auto-Regressive Transformers. As the name suggests, it is a Transformer that combines bidirectional context modeling with auto-regressive generation. It was proposed by Facebook in 2019, and its architecture is shown in the figure below.

(Figure: BART model architecture)

BART keeps the full Transformer encoder-decoder architecture and can handle a wide range of tasks, translation being one of them. This article focuses on the code for fine-tuning BART on an English-to-Chinese translation task.

Exploring the Data

The dataset used here is the Chinese-English parallel data from the News Commentary corpus of the WMT news translation task.

# Load the data
import pandas as pd

# Load into a dataframe
total_data_df = pd.read_csv('data/news-commentary-v15.en-zh.tsv/news-commentary-v15.en-zh.tsv', sep='\t', header=None)
new_columns = ['src', 'trg']
total_data_df = total_data_df.rename(columns=dict(zip(total_data_df.columns, new_columns)))
# Drop invalid rows
total_data_df = total_data_df.dropna()
total_data_df.shape
# (312268, 2)

Analyzing the data shows that the English sentences are between 0 and 400 characters long, the Chinese sentences between 0 and 150, and there are quite a few purely numeric samples such as ('-2.6', '-2.6').

(Figures: length distributions of the English and Chinese sentences)
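
The analysis code is not shown in the original post; below is a minimal sketch (not the original analysis, using only the total_data_df built above) of how the length statistics and the purely numeric rows can be inspected.

# Illustrative sketch: character-length statistics of both sides
src_len = total_data_df['src'].str.len()   # English side
trg_len = total_data_df['trg'].str.len()   # Chinese side
print(src_len.describe())   # roughly 0~400 characters
print(trg_len.describe())   # roughly 0~150 characters

# Purely numeric samples such as ('-2.6', '-2.6')
numeric_mask = total_data_df['src'].str.fullmatch(r'[-+0-9.,%\s]+')
print(numeric_mask.sum())
print(total_data_df[numeric_mask].head())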

Since the full dataset is quite large and this article is mainly about the code workflow, we downsample the data.

# Sampling strategy: shuffle first, then take 7,000 rows as the training set and 1,000 rows as the test set
total_data_df = total_data_df.sample(frac=1, random_state=42)
test_df = total_data_df.head(1000)
train_df = total_data_df.iloc[1000:8000]
test_df.shape, train_df.shape
# ((1000, 2), (7000, 2))

The Pre-trained Model

Loading the Model

The pre-trained language model used here is facebook/mbart-large-50-many-to-many-mmt. mbart-large-50 can translate between any pair of its 50 supported languages, including Chinese and English.

%%time
# Additional dependencies
# pip install sentencepiece
# pip install protobuf
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Local path to the pre-trained model
PRETRAIN_MODEL_PATH = '/home/yzw/plm/facebook-mbart-large-50-many-to-many-mmt'

# Loading the model takes a relatively long time, about 22 s
model = MBartForConditionalGeneration.from_pretrained(PRETRAIN_MODEL_PATH)
tokenizer = MBart50TokenizerFast.from_pretrained(PRETRAIN_MODEL_PATH, src_lang="en_XX", tgt_lang="zh_CN")

# Quick translation test
article_en = ["Calm down. You can do it well.", "Through life is hard, we should try to be happy every day."]
encoded_en = tokenizer(article_en, padding=True, max_length=50, truncation=True, return_tensors="pt")
generated_tokens = model.generate(
    **encoded_en,
    forced_bos_token_id=tokenizer.lang_code_to_id["zh_CN"]
)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# The translations look decent
# ['冷静下来,你做得很好。', '人生是艰难的,我们应该每天都努力快乐。']

Evaluating the Initial Performance

from tqdm import tqdm
import torch
import evaluate

# Load the metric
# Some metrics in Hugging Face's evaluate library are loaded lazily.
# If the server has no internet access, download https://github.com/huggingface/evaluate and load the metric from a local path
metric = evaluate.load("./metrics/sacrebleu")
%%time

# Collect the source sentences and reference translations
sent_inputs = []
real_labels = []
for index, row in tqdm(test_df.iterrows(), total=test_df.shape[0]):
    src, trg = row['src'], row['trg']
    sent_inputs.append(src)
    real_labels.append(trg)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Predict in batches
model_outputs = []
batch_size = 16
for i in tqdm(range(0, len(sent_inputs), batch_size)):
    batch_inputs = sent_inputs[i: i+batch_size]
    model_inputs = tokenizer(batch_inputs, padding=True, max_length=400, truncation=True, return_tensors='pt')
    model_inputs = model_inputs.to(device)
    generated_tokens = model.generate(
        **model_inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id["zh_CN"]
    )

    decode_outputs = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
    model_outputs.extend(decode_outputs)

metric.compute(predictions=model_outputs, references=real_labels)
"""
{'score': 0.43534510117021935,
 'counts': [278, 110, 21, 12],
 'totals': [13575, 12575, 11667, 10772],
 'precisions': [2.0478821362799264,
  0.8747514910536779,
  0.17999485728979173,
  0.11139992573338285],
 'bp': 1.0,
 'sys_len': 13575,
 'ref_len': 1844}
 """

Training

Tokenize the data and convert it into inputs the model can accept.

from datasets import Dataset
# Hugging Face's datasets package
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)
# train_dataset
"""
Dataset({
    features: ['src', 'trg', '__index_level_0__'],
    num_rows: 7000
})
"""

# Custom preprocessing function that tokenizes each batch of examples
def preprocess_function(examples):
    inputs = [src for src in examples['src']]
    targets = [trg for trg in examples['trg']]
    model_inputs = tokenizer(inputs, text_target=targets)
    return model_inputs

tokenized_train_ds = train_dataset.map(preprocess_function, batched=True, remove_columns=['src', 'trg', '__index_level_0__'])
tokenized_test_ds = test_dataset.map(preprocess_function, batched=True, remove_columns=['src', 'trg', '__index_level_0__'])

# tokenized_train_ds
"""
Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 7000
})
"""

# Dynamic padding: pad every sequence in a batch to the length of the longest sequence in that batch
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=PRETRAIN_MODEL_PATH)
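
As a quick sanity check (an illustrative sketch, assuming the tokenized_train_ds built above), you can collate a couple of examples and inspect the padded tensors; padded label positions are filled with -100 so they are ignored by the loss.

# Illustrative only: collate two tokenized examples and look at the padded shapes.
# input_ids / attention_mask are padded with the pad token, labels with -100.
sample_batch = data_collator([tokenized_train_ds[i] for i in range(2)])
print({k: v.shape for k, v in sample_batch.items()})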

Define the evaluation function, which is used to evaluate the model after each training epoch.

import numpy as np
import evaluate
# metric = evaluate.load("./metrics/sacrebleu")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 (positions ignored by the loss) with the pad token id so the labels can be decoded
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

Use Seq2SeqTrainingArguments and Seq2SeqTrainer to train the model.

from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

# model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

training_args = Seq2SeqTrainingArguments(
    output_dir="my_awesome_english2chinese_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3, # keep at most 3 checkpoints
    num_train_epochs=2,
    predict_with_generate=True,
    fp16=True
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_ds,
    eval_dataset=tokenized_test_ds,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

By default the Trainer uses every GPU on the machine for training. Even with batch_size reduced to 4, it still ran out of GPU memory, and that was on eight RTX 3090s. 😦

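The original post stops at the out-of-memory error, but a few standard mitigations can be sketched (these are assumptions about what would help, not part of the original run): truncate long sequences during tokenization, enable gradient checkpointing, trade per-device batch size for gradient accumulation, and restrict the visible GPUs.

# Illustrative memory-saving variant (assumptions, not the original configuration)

# 1. Truncate long sequences when tokenizing (hypothetical max length)
def preprocess_function_truncated(examples):
    return tokenizer(examples['src'], text_target=examples['trg'], max_length=256, truncation=True)

# 2. Lower per-step memory: smaller per-device batch, gradient accumulation, gradient checkpointing
training_args = Seq2SeqTrainingArguments(
    output_dir="my_awesome_english2chinese_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,   # effective per-device batch size of 4
    gradient_checkpointing=True,     # recompute activations instead of storing them
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2,
    predict_with_generate=True,
    fp16=True
)

# 3. Optionally restrict training to a subset of GPUs before launching Python:
#    CUDA_VISIBLE_DEVICES=0,1 python train.py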

References

  1. Hugging Face Transformers official translation tutorial
  2. BERT in Practice (5): Generation Tasks - Machine Translation