Fine-tuning T5 to Build an English Summarization Model

Basic Introduction

T5, short for Text-to-Text Transfer Transformer, is a Transformer-based pre-trained model proposed by the Google Brain team in 2019. It performs strongly on natural language processing (NLP) tasks and has achieved leading results on several public datasets. Compared with earlier pre-trained models such as BERT and GPT, T5 offers a more flexible framework. Its core idea is "text-to-text transfer": every NLP task is cast as a text-to-text problem, so all tasks can be handled and trained in one unified way.

T5 Task Prefixes

The prefix of a T5 model can be customized to match the task. Here are prefixes for some common tasks:

1. Summarization prefix:

"summarize: ": marks the text that should be summarized

2. Translation prefixes:

"translate English to French: ": translate English into French

"translate English to Spanish: ": translate English into Spanish

In code, the prefix is simply prepended to the input before tokenizing it with the T5Tokenizer from the Transformers library. The code below uses summarization as the example; other tasks work the same way.
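As a quick illustration, here is a minimal sketch of switching tasks just by changing the prefix; it assumes the same locally cached t5-small checkpoint that is used in the next section.

from transformers import T5Tokenizer, T5ForConditionalGeneration

cache_dir = 'model/t5_small'  # hypothetical local path; point it at your t5-small files
tokenizer = T5Tokenizer.from_pretrained(cache_dir)
model = T5ForConditionalGeneration.from_pretrained(cache_dir)

# Same model, different task: a translation prefix instead of "summarize: "
input_ids = tokenizer("translate English to French: The house is wonderful.", return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_length=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))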

Model Inference Code

Here we use the t5-small pre-trained model from Hugging Face; it can be downloaded from t5-small · Hugging Face.

cache_dir: specifies the cache directory for the model and tokenizer files. If they have already been downloaded, they are read straight from this directory instead of being downloaded again. Given how unreliable access can be from within China, it is best to download the model files in advance.

T5Tokenizer.from_pretrained: loads the tokenizer for the pre-trained model and returns a tokenizer object, used the same way as any other Hugging Face tokenizer; here it is the tokenizer of the t5-small model.

T5ForConditionalGeneration.from_pretrained: loads the pre-trained T5 model and returns a model object that can be used for inference and generation.

from transformers import T5Tokenizer, T5ForConditionalGeneration, pipeline

# Load the T5 model and tokenizer
cache_dir = 'model/t5_small'
tokenizer = T5Tokenizer.from_pretrained(cache_dir)
model = T5ForConditionalGeneration.from_pretrained(cache_dir)
print(model.config)  # print the model configuration
print(model)  # print the model architecture


text = '''The Eiffel Tower is one of the most famous landmarks in Paris, France, and one of the most famous buildings in the world. It was designed and built by the famous French engineer Gustave Eiffel. The Paris Tower was built for the 1889 Paris World Exposition to celebrate the 100th anniversary of the French Revolution.
The Paris Tower is located on Champs de Mars in the seventh arrondissement of Paris, France. It is a steel structure tower that reaches a height of 324 meters (approximately 1063 feet). At its completion, it was the world's tallest man-made building until it was surpassed by the Chrysler Building in New York in 1930.
The iron tower is divided into three observation platforms: the first layer, the second layer, and the third layer. You can take the elevator or climb the stairs to different observation platforms and enjoy the magnificent city scenery of Paris. The top level observation deck is the most popular, where tourists can overlook the entire city of Paris, including museums, churches, and the famous Seine River around the Eiffel Tower'''

input_ids = tokenizer("summarization:"+text, return_tensors="pt").input_ids
outputs = model.generate(input_ids, num_beams=4, max_length=128, early_stopping=True)
summary_text=tokenizer.decode(outputs[0], skip_special_tokens=True)
print(summary_text)
#  output:The Eiffel Tower was built for the 1889 Paris World Exposition to celebrate the 100th anniversary of the French Revolution. It is a steel structure tower that reaches a height of 324 meters (approximately 1063 feet) at its completion, it was the world's tallest man-made building until it was surpassed by the Chrysler Building in New York in 1930.


# Alternatively, use a pipeline
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, framework="pt")
result = summarizer(
    text,
    min_length=5,
    max_length=128,
)
print(result)
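If everything is set up correctly, the pipeline returns a list with one dict per input, and the generated text sits under the "summary_text" key:

# result is a list of dicts, e.g. [{'summary_text': '...'}]
print(result[0]["summary_text"])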

Model Fine-tuning Code

Next we fine-tune the Hugging Face T5 model on the summarization task using the xsum dataset, which can be downloaded through the datasets library.

# Download the xsum dataset
from datasets import load_dataset
raw_datasets = load_dataset("xsum")
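It can help to glance at the structure first; xsum ships with train/validation/test splits and "document"/"summary" text columns. The snippet below is just a quick sanity check.

# Peek at the splits and one example
print(raw_datasets)
sample = raw_datasets["train"][0]
print(sample["document"][:200])  # the full news article
print(sample["summary"])         # the one-sentence reference summary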

Training is evaluated with ROUGE, a summarization metric that measures the similarity between the generated summary and the reference summary.

The evaluation metric can be downloaded with the following code:

from datasets import load_metric
metric = load_metric('rouge')

Because I had already downloaded it, the code below adds a cache path.
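To get a feel for what the metric returns, here is a small sanity check; the field names below (mid.fmeasure and friends) follow the older datasets load_metric ROUGE wrapper, which is also what the fine-tuning script relies on.

# Compare a toy candidate against a toy reference
scores = metric.compute(
    predictions=["The tower is located in Paris."],
    references=["The Eiffel Tower is located in Paris, France."],
)
# Each key (rouge1, rouge2, rougeL, rougeLsum) holds aggregated precision/recall/F1;
# the mid F1 value is what compute_metrics reports during training
print(scores["rouge1"].mid.fmeasure)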

The main fine-tuning code follows. Note that the hyperparameters in it have not been carefully tuned.

from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer, T5Tokenizer, T5ForConditionalGeneration
from transformers import AutoTokenizer
import torch
import numpy as np
import nltk  # sentence splitting for ROUGE; requires the "punkt" data (nltk.download('punkt'))
from datasets import load_metric

cache_dir = 'model/t5_small'
tokenizer = T5Tokenizer.from_pretrained(cache_dir)
model = T5ForConditionalGeneration.from_pretrained(cache_dir)
print(model.config)  # print the model configuration
print(model)  # print the model architecture

device = torch.device('cuda')
model.to(device)

# Set the parameters
MAX_INPUT_LENGTH = 1024  # maximum input length for the model
MIN_TARGET_LENGTH = 5  # minimum output length
MAX_TARGET_LENGTH = 64  # maximum output length


# Fine-tuning
batch_size = 16
args = Seq2SeqTrainingArguments(  # uses the AdamW optimizer by default
    output_dir="model/t5_small/test-summarization",
    evaluation_strategy="epoch",  # run a validation evaluation at the end of every epoch
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=10,
    predict_with_generate=True,  # must be True to compute generation metrics
    fp16=True,
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Evaluation function
metric = load_metric('rouge', cache_dir='model/metrics/rouge1')

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # ROUGE expects each sentence on its own line
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}

    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}


# Training

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics  # pass in the metric function
)
trainer.train()

# Save the model
save_dir = "model/t5/trained_t5"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

The example code does not show the data preprocessing step: prepare train_dataset and val_dataset however suits your data and feed them to the trainer, and don't forget to prepend the task prefix. A sketch of one way to do this follows.
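For reference, here is a minimal preprocessing sketch under the assumptions above: xsum's "document"/"summary" columns, the "summarize: " prefix, and the length limits defined earlier. It belongs before the Seq2SeqTrainer is built; adapt it to your own data as needed.

prefix = "summarize: "

def preprocess_function(examples):
    # Prepend the task prefix and tokenize the articles
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)

    # Tokenize the reference summaries as labels
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=MAX_TARGET_LENGTH, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
train_dataset = tokenized_datasets["train"]
val_dataset = tokenized_datasets["validation"]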


                