Fine-tuning BART for English-Chinese Translation (PyTorch Code)
BART stands for Bidirectional and Auto-Regressive Transformers. As the name suggests, it is a Transformer that combines a bidirectional, context-aware encoder with an auto-regressive decoder. It was proposed by Facebook in 2019, and its architecture is shown in the figure below.
BART keeps the Transformer's full encoder-decoder architecture and can handle a wide range of tasks, translation being one of them. This post focuses on the code for fine-tuning BART on an English-to-Chinese translation task.
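As a quick check of that encoder-decoder structure, the configuration of the mBART-50 checkpoint used later in this post can be inspected (a minimal sketch of my own; the attribute names follow MBartConfig in transformers, and the printed values are what I would expect for mbart-large-50 rather than numbers from the original post).
from transformers import AutoConfig
# Load only the configuration (no weights) of the checkpoint used later on
# (the local PRETRAIN_MODEL_PATH defined below would work here as well)
config = AutoConfig.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
# A bidirectional encoder stack plus an auto-regressive decoder stack, as in the Transformer
print(config.encoder_layers, config.decoder_layers, config.d_model)
# Expected: 12 encoder layers, 12 decoder layers, hidden size 1024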
Getting to Know the Data
The dataset is the English-Chinese translation data from the News Commentary corpus of the WMT news translation task.
# Load the data
import pandas as pd
# Read the TSV into a DataFrame
total_data_df = pd.read_csv('data/news-commentary-v15.en-zh.tsv/news-commentary-v15.en-zh.tsv', sep='\t', header=None)
new_columns = ['src', 'trg']
total_data_df = total_data_df.rename(columns=dict(zip(total_data_df.columns, new_columns)))
# Drop invalid rows
total_data_df = total_data_df.dropna()
total_data_df.shape
# (312268, 2)
Analyzing the data shows that the English sentences are between 0 and 400 characters long, the Chinese sentences between 0 and 150, and that there are quite a few purely numeric samples such as ('-2.6', '-2.6').
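These statistics can be reproduced with a few lines of pandas (a minimal sketch of my own; the regular expression for numeric-only rows is an assumption, not the original analysis script).
# Rough length statistics for both sides
en_len = total_data_df['src'].str.len()
zh_len = total_data_df['trg'].str.len()
print(en_len.describe())
print(zh_len.describe())
# Count rows whose source side is just a number, e.g. ('-2.6', '-2.6')
numeric_mask = total_data_df['src'].str.fullmatch(r'[-+]?\d[\d.,%]*')
print(numeric_mask.sum())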
Since the full corpus is large and this post cares more about the code workflow than about the final score, the data is downsampled.
# Sampling strategy: shuffle, then take 1000 rows as the test set and the next 7000 rows as the training set
total_data_df = total_data_df.sample(frac=1, random_state=42)
test_df = total_data_df.head(1000)
train_df = total_data_df.iloc[1000:8000]
test_df.shape, train_df.shape
# ((1000, 2), (7000, 2))
Pretrained Model
Loading the Model
The pretrained language model used here is facebook/mbart-large-50-many-to-many-mmt; mbart-large-50 can translate between any pair of 50 languages, including English and Chinese.
%%time
# Extra dependencies:
# pip install sentencepiece
# pip install protobuf
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
# Local path to the checkpoint
PRETRAIN_MODEL_PATH = '/home/yzw/plm/facebook-mbart-large-50-many-to-many-mmt'
# Loading the model takes a while (about 22 s)
model = MBartForConditionalGeneration.from_pretrained(PRETRAIN_MODEL_PATH)
tokenizer = MBart50TokenizerFast.from_pretrained(PRETRAIN_MODEL_PATH, src_lang="en_XX", tgt_lang="zh_CN")
# Quick translation test
article_en = ["Calm down. You can do it well.", "Though life is hard, we should try to be happy every day."]
encoded_en = tokenizer(article_en, padding=True, max_length=50, truncation=True, return_tensors="pt")
generated_tokens = model.generate(
    **encoded_en,
    forced_bos_token_id=tokenizer.lang_code_to_id["zh_CN"]
)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# The translations look reasonably good
# ['冷静下来,你做得很好。', '人生是艰难的,我们应该每天都努力快乐。']
Evaluating the Initial Performance
from tqdm import tqdm
import torch
import evaluate
# Load the evaluation metric
# Some metrics in HuggingFace's evaluate library are loaded lazily (fetched on first use).
# Since the server has no internet access, the metric scripts can be downloaded from https://github.com/huggingface/evaluate and loaded from a local path.
metric = evaluate.load("./metrics/sacrebleu")
%%time
# Collect the raw sentences
sent_inputs = []
real_labels = []
for index, row in tqdm(test_df.iterrows(), total=test_df.shape[0]):
    src, trg = row['src'], row['trg']
    sent_inputs.append(src)
    real_labels.append(trg)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)
# Predict in batches
model_outputs = []
batch_size = 16
for i in tqdm(range(0, len(sent_inputs), batch_size)):
    batch_inputs = sent_inputs[i: i+batch_size]
    model_inputs = tokenizer(batch_inputs, padding=True, max_length=400, truncation=True, return_tensors='pt')
    model_inputs = model_inputs.to(device)
    generated_tokens = model.generate(
        **model_inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id["zh_CN"]
    )
    decode_outputs = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
    model_outputs.extend(decode_outputs)
metric.compute(predictions=model_outputs, references=real_labels)
"""
{'score': 0.43534510117021935,
'counts': [278, 110, 21, 12],
'totals': [13575, 12575, 11667, 10772],
'precisions': [2.0478821362799264,
0.8747514910536779,
0.17999485728979173,
0.11139992573338285],
'bp': 1.0,
'sys_len': 13575,
'ref_len': 1844}
"""
Training
Tokenize the data and convert it into inputs the model can accept.
from datasets import Dataset
# HuggingFace's datasets package
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)
# train_dataset
"""
Dataset({
features: ['src', 'trg', '__index_level_0__'],
num_rows: 7000
})
"""
# Custom preprocessing function applied to each batch of examples
def preprocess_function(examples):
    inputs = [src for src in examples['src']]
    targets = [trg for trg in examples['trg']]
    model_inputs = tokenizer(inputs, text_target=targets)
    return model_inputs
tokenized_train_ds = train_dataset.map(preprocess_function, batched=True, remove_columns=['src', 'trg', '__index_level_0__'])
tokenized_test_ds = test_dataset.map(preprocess_function, batched=True, remove_columns=['src', 'trg', '__index_level_0__'])
# tokenized_train_ds
"""
Dataset({
features: ['input_ids', 'attention_mask', 'labels'],
num_rows: 7000
})
"""
# Dynamic padding: pad each batch to the length of its longest sentence
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=PRETRAIN_MODEL_PATH)
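To see the dynamic padding in action, here is a small check of my own (not from the original post) that collates two tokenized training examples and looks at the padded shapes.
# Collate two tokenized examples into one padded batch
sample_batch = data_collator([tokenized_train_ds[i] for i in range(2)])
print(sample_batch['input_ids'].shape)  # (2, longest input in this batch)
print(sample_batch['labels'].shape)     # (2, longest label in this batch); padded label positions are -100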
Define the evaluation function; after each training epoch it is used to measure the model's performance.
import numpy as np
import evaluate
# metric = evaluate.load("./metrics/sacrebleu")
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
Training is driven by Seq2SeqTrainingArguments and Seq2SeqTrainer.
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
# model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
training_args = Seq2SeqTrainingArguments(
    output_dir="my_awesome_english2chinese_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,  # keep at most 3 checkpoints
    num_train_epochs=2,
    predict_with_generate=True,
    fp16=True
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_ds,
    eval_dataset=tokenized_test_ds,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
By default the Trainer uses every GPU on the machine. Even with the per-device batch size lowered to 4, training still ran out of GPU memory, and that was on eight RTX 3090s 😦
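A few knobs that usually help with this kind of out-of-memory error (a hedged sketch of my own, not something the original post verified): restrict the visible GPUs when launching, and trade compute for memory with gradient accumulation and gradient checkpointing.
# Restrict training to specific GPUs; must be set before CUDA is initialized, e.g. when launching:
#   CUDA_VISIBLE_DEVICES=0,1 python train.py
# Memory-saving options in Seq2SeqTrainingArguments (the values below are illustrative)
training_args = Seq2SeqTrainingArguments(
    output_dir="my_awesome_english2chinese_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,   # same effective batch size as 4, with much less activation memory per step
    gradient_checkpointing=True,     # recompute activations in the backward pass to save memory
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2,
    predict_with_generate=True,
    fp16=True
)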