Text Summarization with Seq2SeqTrainer

苏云南雁

Last modified: 2024-03-29 10:59:29

Tags: machine learning, artificial intelligence

First published: 2024-03-19 11:29:48

Article link: https://blog.csdn.net/qq_22059611/article/details/136835877

I. Data Preparation

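The training data here consists of product reviews from an online store, stored in a CSV file named train.csv with three columns: id, data (the review text), and label (the target summary). An example follows; the review says roughly "A terrible pair of shoes, don't buy them! The sole is too hard and there is no midsole, so there is no cushioning or support.", and its label means "quality problem":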

id,data,label
1,很差的一款鞋，不要买！鞋底太硬，没有中底，所以没有缓震或支撑。,质量问题

Build the test.csv dataset in the same way.
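As a minimal sketch of how such a file could be produced, the snippet below writes two rows in the id,data,label layout with Python's csv module and reads them back. Only the first row comes from the example above; the second row and the helper names write_reviews and read_reviews are made up for illustration.

```python
import csv
import os
import tempfile

# Rows in the id,data,label layout; the second row is a made-up example.
ROWS = [
    (1, "很差的一款鞋，不要买！鞋底太硬，没有中底，所以没有缓震或支撑。", "质量问题"),
    (2, "物流太慢了，等了两个星期才到。", "物流问题"),
]

def write_reviews(path, rows):
    """Write rows to a CSV file with the header id,data,label."""
    # utf-8 keeps the Chinese text intact; newline="" is what the csv module expects.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "data", "label"])
        writer.writerows(rows)

def read_reviews(path):
    """Read the file back as a list of dicts keyed by the header."""
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f))

# Written to a temporary directory so the sketch does not touch real data.
csv_path = os.path.join(tempfile.mkdtemp(), "train.csv")
write_reviews(csv_path, ROWS)
loaded = read_reviews(csv_path)
print(loaded[0]["label"])
```

The reviews use full-width commas (，), which are different characters from the ASCII comma that separates the fields, so each row still parses into exactly three columns.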

II. Importing Modules

from transformers import AutoTokenizer
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM
from transformers import Seq2SeqTrainingArguments
from transformers import DataCollatorForSeq2Seq
from transformers import Seq2SeqTrainer
from datasets import load_metric

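What each import does:

transformers: a library developed by Hugging Face that provides pretrained Transformer models along with a rich set of utilities for tokenization, training, and inference.

datasets: a Python library developed by Hugging Face for accessing and processing a large number of natural language processing (NLP) datasets. Each dataset is exposed as an efficient, easy-to-use, extensible object, and the library ships many tools for manipulating datasets.

AutoTokenizer: "This is a generic tokenizer class that will be instantiated as one of the tokenizer classes of the library when created with the [`AutoTokenizer.from_pretrained`] class method." In other words, a generic tokenizer that is instantiated with from_pretrained.

load_dataset: "Load a dataset from the Hugging Face Hub, or a local dataset." It loads data either from the Hub or from local files.

AutoModelForSeq2SeqLM: "This is a generic model class that will be instantiated as one of the model classes of the library (with a sequence-to-sequence language modeling head) when created with the from_pretrained() class method or the from_config() class method." A generic sequence-to-sequence language model, instantiated with from_pretrained in the same way as AutoTokenizer.

Seq2SeqTrainingArguments: the class that holds the training parameters.

DataCollatorForSeq2Seq: "Data collator that will dynamically pad the inputs received, as well as the labels." It pads the inputs and the labels of each batch on the fly.

Seq2SeqTrainer: the trainer.

load_metric: loads an evaluation metric. Note that load_metric has been deprecated in datasets and removed in newer releases, where the separate evaluate library (evaluate.load) takes over this role.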

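III. Data Processing and Model Training

1. Loading the Pretrained Model and the Evaluation Metric

load_metric loads the ROUGE metric, which is recall-oriented (recall = number of correctly predicted items / number of items actually present in the reference). The tokenizer comes from google/mt5-small; mT5 is a text-to-text pretrained model that supports many languages. Because of network connectivity problems, loading it straight from huggingface.co raised an error, so the model was obtained through the Hugging Face mirror site HF-Mirror and is loaded from a local directory instead. The workaround is described in the CSDN blog post "OSError: We couldn't connect to 'https://huggingface.co' to load this file".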

rouge_score = load_metric("rouge")
# model_checkpoint = "google/mt5-small"
model_checkpoint = 'D:/download/model/mt5s'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
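To make the recall orientation mentioned above concrete, here is a hand-rolled ROUGE-1 computation on two short token lists. It is only a sketch of the idea, not the implementation behind load_metric("rouge"); the function name rouge1 and the example tokens are assumptions for illustration.

```python
from collections import Counter

def rouge1(candidate_tokens, reference_tokens):
    """Return (recall, precision, f1) based on clipped unigram overlap."""
    cand = Counter(candidate_tokens)
    ref = Counter(reference_tokens)
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    # Recall divides by the reference length, precision by the candidate length.
    recall = overlap / max(len(reference_tokens), 1)
    precision = overlap / max(len(candidate_tokens), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return recall, precision, f1

reference = ["the", "sole", "is", "too", "hard"]
candidate = ["the", "sole", "is", "hard", "and", "thin"]
recall, precision, f1 = rouge1(candidate, reference)
print(recall, precision, f1)
```

Four of the five reference tokens appear in the candidate, so recall is 4/5, while precision is 4/6 because the candidate has six tokens.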

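2. Loading the Data

load_dataset imports the data; it is given the file format ("csv") and the local path.

train_size and test_size set the training and test splits to 15 and 5 examples respectively, so that everything runs on a very small amount of data. train_test_split then separates the data into the train and test splits.

max_input_length and max_target_length are the maximum input length and the maximum target length: inputs are limited to 512 tokens and outputs to at most 30 tokens. Since this is a summarization task, the output should not be long.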

train_dataset = load_dataset("csv", data_files="train.csv")

# Sizes of the training and test splits
train_size = 15
test_size = 5

train_val = train_dataset["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)

print(train_val)
max_input_length = 512
max_target_length = 30

Printing train_val shows the structure and the number of rows of the train and test splits:

DatasetDict({
    train: Dataset({
        features: ['id', 'data', 'label'],
        num_rows: 15
    })
    test: Dataset({
        features: ['id', 'data', 'label'],
        num_rows: 5
    })
})
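The split step can be pictured with a small plain-Python imitation. The helper split_rows below is hypothetical and written only for illustration; datasets uses its own shuffling internally, so the same seed will not select the same rows as train_test_split.

```python
import random

def split_rows(rows, train_size, test_size, seed=42):
    """Shuffle rows reproducibly and carve out fixed-size train and test lists."""
    if train_size + test_size > len(rows):
        raise ValueError("train_size + test_size exceeds the number of rows")
    shuffled = rows[:]                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # same seed gives the same order
    return {
        "train": shuffled[:train_size],
        "test": shuffled[train_size:train_size + test_size],
    }

rows = [{"id": i} for i in range(1, 31)]    # 30 made-up rows
splits = split_rows(rows, train_size=15, test_size=5)
print(len(splits["train"]), len(splits["test"]))
```

The two splits never share a row, and the source file needs at least train_size + test_size rows for the split to succeed.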

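3. Text Preprocessing

The tokenizer call takes three relevant parameters: max_length caps the sequence length, padding controls whether shorter sequences are padded, and truncation controls whether sequences longer than max_length are cut off at max_length.

The processing has two parts, and preprocess_function follows the template from the official examples. First, the text in the "data" column is tokenized to form the model inputs (input_ids and attention_mask), which are stored in model_inputs; this part handles the source text. Second, the "label" column, which holds the summaries, is tokenized and the resulting token ids are stored under the "labels" key of model_inputs; this part handles the summaries.

tokenizer.as_target_tokenizer is described in the official documentation as follows: "Temporarily sets the tokenizer for encoding the targets. Useful for tokenizer associated to sequence-to-sequence models that need a slightly different processing for the labels." Stepping into it did not show me much of the source, so readers who want to dig deeper will need to look up further material. Newer transformers releases deprecate this context manager in favor of passing the targets through the tokenizer's text_target argument.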

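def preprocess_function(examples):
    # Tokenize the review text to build the model inputs
    model_inputs = tokenizer(
        examples["data"], max_length=max_input_length, padding=True, truncation=True
    )
    # Tokenize the summaries as targets, using the helper the library provides.
    # Caveat: padding=True pads labels with the pad token id rather than -100,
    # so padded label positions are not ignored by the loss.
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["label"], max_length=max_target_length, padding=True, truncation=True
        )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs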

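train_val.map: "Apply a function to all the examples in the table (individually or in batches) and update the table. If your function returns a column that already exists, then it overwrites it." Here map runs preprocess_function over every example in train_val and returns the updated dataset.

The batched parameter: when it is False, examples are processed one at a time; when it is True, examples are processed in batches whose size is set by a second parameter, batch_size, which defaults to 1000.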

- If batched is `False`, then the function takes 1 example in and should return 1 example. An example is a dictionary, e.g. `{"text": "Hello there !"}`.
- If batched is `True` and `batch_size` is 1, then the function takes a batch of 1 example as input and can return a batch with 1 or more examples. A batch is a dictionary, e.g. a batch of 1 example is `{"text": ["Hello there !"]}`.
- If batched is `True` and `batch_size` is `n > 1`, then the function takes a batch of `n` examples as input and can return a batch with `n` examples, or with an arbitrary number of examples. Note that the last batch may have less than `n` examples. A batch is a dictionary, e.g. a batch of `n` examples is `{"text": ["Hello there !"] * n}`.
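The difference between the modes can be imitated in plain Python. The helper map_batched below is a hypothetical stand-in written for this post, not the datasets implementation; it only shows that a batched function receives a dict of lists and that the last batch may be shorter than batch_size.

```python
def map_batched(rows, fn, batch_size=1000):
    """Apply fn to slices of rows, passing each slice as a dict of lists."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Column-oriented view of the chunk: {"text": [...], ...}
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = fn(batch)
        # Convert the returned dict of lists back into one dict per row
        n = len(next(iter(result.values())))
        out.extend({key: result[key][i] for key in result} for i in range(n))
    return out

seen_batch_sizes = []

def add_length(batch):
    """Batched function: receives lists and returns lists."""
    seen_batch_sizes.append(len(batch["text"]))
    return {"text": batch["text"], "length": [len(t) for t in batch["text"]]}

rows = [{"text": "a" * i} for i in range(1, 6)]  # 5 made-up rows
mapped = map_batched(rows, add_length, batch_size=2)
print(seen_batch_sizes)
```

With 5 rows and a batch size of 2, the function is called three times and the last call sees a single row.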

tokenized_datasets = train_val.map(preprocess_function, batched=True)
print(tokenized_datasets)
# print(tokenized_datasets['train']['input_ids'])

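The datasets printed after mapping. The listing below shows only the original columns; with the preprocess_function above, map would also be expected to add the input_ids, attention_mask, and labels columns: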

DatasetDict({
    train: Dataset({
        features: ['id', 'data', 'label'],
        num_rows: 15
    })
    test: Dataset({
        features: ['id', 'data', 'label'],
        num_rows: 5
    })
})

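4. Model Training

Before training, install one more package: pip install sentencepiece. The mT5 tokenizer is built on SentencePiece, so the package is needed for the steps that follow. Then load the model and set the basic parameters, such as the batch size and the number of training epochs.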

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
batch_size = 3
num_train_epochs = 1000
# Log once per epoch
logging_steps = len(tokenized_datasets["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

Define the compute_metrics function:

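def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    def decode(batch):
        # -100 marks padding the tokenizer cannot decode, so map it to the pad token first
        ids = [[int(t) if t != -100 else tokenizer.pad_token_id for t in seq] for seq in batch]
        return tokenizer.batch_decode(ids, skip_special_tokens=True)

    # ROUGE compares text, so decode both the generated ids and the label ids.
    # Its default tokenizer targets Latin text; Chinese output may need segmenting first.
    result = rouge_score.compute(
        predictions=decode(predictions), references=decode(labels), use_stemmer=True
    )
    # Report the mid F-measure of each ROUGE variant as a percentage
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    return {k: round(v, 4) for k, v in result.items()}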

Remove the columns that are no longer needed:

tokenized_datasets = tokenized_datasets.remove_columns(
    train_dataset["train"].column_names
)

Dynamic padding:

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding=True)
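Conceptually, dynamic padding means each batch is padded only to its own longest sequence, and label padding uses -100 so that those positions are ignored by the loss. The sketch below imitates that in plain Python; pad_batch is a hypothetical helper, the pad id 0 and the token ids are assumptions, and the real DataCollatorForSeq2Seq does more (for example, it returns tensors).

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad input_ids and labels to the longest sequence in this batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        in_pad = max_in - len(f["input_ids"])
        lab_pad = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * in_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * in_pad)
        # -100 is the index that PyTorch's cross-entropy loss ignores by default
        batch["labels"].append(f["labels"] + [label_pad_token_id] * lab_pad)
    return batch

# Made-up token ids for two examples of different lengths
features = [
    {"input_ids": [11, 12, 13], "labels": [7, 8]},
    {"input_ids": [21], "labels": [9]},
]
padded = pad_batch(features)
print(padded)
```

Padding per batch rather than to a global maximum keeps short batches small, which matters when max_input_length is as large as 512.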

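Set the training arguments: output_dir is where the trained model is saved; evaluation_strategy is when evaluation runs (here, at the end of every epoch); learning_rate is the learning rate; per_device_train_batch_size is the training batch size per device; per_device_eval_batch_size is the evaluation batch size per device; weight_decay is the weight decay coefficient applied by the optimizer (a regularization term, not a learning-rate decay); num_train_epochs is the number of training epochs, after which training stops.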

args = Seq2SeqTrainingArguments(
    output_dir=f"{model_name}-finetuned-amazon-en-es",
    evaluation_strategy="epoch",
    learning_rate=5.6e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,  # Evaluation needs generated output
    logging_steps=logging_steps,
    save_strategy='steps',
    save_steps=2000,
)
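With these settings the checkpoint numbering can be worked out by hand. The sketch below assumes a single device and no gradient accumulation (neither is stated above), so one optimizer step consumes one batch of 3 examples.

```python
import math

train_examples = 15      # size of the training split
batch_size = 3           # per_device_train_batch_size
num_train_epochs = 1000
save_steps = 2000

# Assumption: one device, gradient_accumulation_steps = 1
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Steps at which a periodic checkpoint would be written during training
checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))
print(steps_per_epoch, total_steps, checkpoint_steps)
```

Under these assumptions there are 5 steps per epoch and 5000 steps in total, so the last periodic checkpoint is checkpoint-4000, which is the directory name used in the prediction example. save_total_limit=1 asks the trainer to keep only the most recent checkpoint.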

Construct the trainer by passing in the model, the arguments, the datasets, and the other components:

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Train the model:

trainer.train()

IV. Prediction Example

from transformers import pipeline

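# The path follows output_dir; adjust the prefix if model_name differs (the local path above yields "mt5s")
summarizer = pipeline("summarization", model='./mt5-small-finetuned-amazon-en-es/checkpoint-4000')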

print(summarizer('垃圾东西，一用就坏了，根本没法使用退钱'))

Printed output: