I. Data preparation
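This walkthrough uses e-commerce product reviews as training data. The training set is stored as a CSV file, train.csv, with three columns: id, data (the review text) and label (the target summary). For example:
id,data,label
1,很差的一款鞋,不要买!鞋底太硬,没有中底,所以没有缓震或支撑。,质量问题
(The review reads roughly "A terrible pair of shoes, don't buy them! The sole is too hard and there is no midsole, so there is no cushioning or support", and the label means "quality issue".)
The test set, test.csv, is built the same way.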
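As a minimal sketch of how such a file could be produced, the snippet below writes one row with Python's standard csv module and reads it back. The temporary directory and the single-row list are assumptions made for illustration; the original post does not say how train.csv was created.

```python
import csv
import os
import tempfile

# Hypothetical sample rows; the first one is the example row shown above.
rows = [
    {
        "id": 1,
        "data": "很差的一款鞋,不要买!鞋底太硬,没有中底,所以没有缓震或支撑。",
        "label": "质量问题",
    },
]

# Write into a temporary directory so no existing train.csv is overwritten.
csv_path = os.path.join(tempfile.mkdtemp(), "train.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "data", "label"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm each row still has exactly three fields.
with open(csv_path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(loaded[0]["label"])
```

The review text uses full-width punctuation, so it does not collide with the ASCII comma that separates the columns. If a review did contain ASCII commas, csv.DictWriter would quote the field automatically.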
II. Importing modules
from transformers import AutoTokenizer
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM
from transformers import Seq2SeqTrainingArguments
from transformers import DataCollatorForSeq2Seq
from transformers import Seq2SeqTrainer
from datasets import load_metric
Notes on each module:
transformers: Hugging Face's library of pretrained Transformer models, with a rich set of supporting classes and functions.
datasets: a Python library from Hugging Face for accessing and processing a large number of natural language processing (NLP) datasets. Each dataset is exposed as an efficient, easy-to-use, extensible object, and the library ships many utilities for manipulating datasets.
AutoTokenizer: "This is a generic tokenizer class that will be instantiated as one of the tokenizer classes of the library when created with the [`AutoTokenizer.from_pretrained`] class method." In other words, a generic tokenizer that is instantiated with from_pretrained.
load_dataset: "Load a dataset from the Hugging Face Hub, or a local dataset." It loads data either from the Hub or from local files.
AutoModelForSeq2SeqLM: "This is a generic model class that will be instantiated as one of the model classes of the library (with a sequence-to-sequence language modeling head) when created with the from_pretrained() class method or the from_config() class method." A generic sequence-to-sequence language model, instantiated with from_pretrained in the same way as AutoTokenizer.
Seq2SeqTrainingArguments: the class that holds the training parameters.
DataCollatorForSeq2Seq: "Data collator that will dynamically pad the inputs received, as well as the labels."
Seq2SeqTrainer: the trainer.
load_metric: loads an evaluation metric. Recent releases of datasets deprecate it in favor of the separate evaluate library.
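III. Data processing and model training
1. Loading the pretrained model and the evaluation metric
load_metric loads the ROUGE metric, which is recall-oriented (recall = correctly predicted positives / actual positives). The tokenizer is meant to be google/mt5-small; mT5 is a text-to-text pretrained model that supports many languages. If a network problem prevents the download from huggingface.co and raises an error, use the Hugging Face mirror site HF-Mirror; the workaround is described in the CSDN blog post "OSError: We couldn't connect to 'https://huggingface.co' to load this file". The code here loads the tokenizer from a local copy of the model instead.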
rouge_score = load_metric("rouge")
# model_checkpoint = "google/mt5-small"
model_checkpoint = 'D:/download/model/mt5s'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
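To make the recall orientation of ROUGE concrete, here is a simplified unigram-overlap score in plain Python. It is only a sketch of the idea behind ROUGE-1, not the implementation that load_metric("rouge") wraps; the function name rouge1 and the character-level splitting of Chinese text are assumptions for illustration.

```python
from collections import Counter

def rouge1(prediction_tokens, reference_tokens):
    """Unigram-overlap precision, recall and F1 (a simplified ROUGE-1)."""
    pred_counts = Counter(prediction_tokens)
    ref_counts = Counter(reference_tokens)
    # Clipped overlap: a token counts at most as often as it occurs in both.
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / len(prediction_tokens) if prediction_tokens else 0.0
    recall = overlap / len(reference_tokens) if reference_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

# Assumption: split Chinese text into characters; the real metric
# applies its own tokenization.
precision, recall, f1 = rouge1(list("质量问题严重"), list("质量问题"))
print(precision, recall, f1)
```

In this example the prediction covers every character of the reference, so recall is 1.0, while the two extra characters lower precision to 4/6.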
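2. Loading the data
load_dataset imports the data given a format ("csv") and a local path.
train_size and test_size set the training and test sets to 15 and 5 examples respectively, so that the run uses only a small amount of data; train_test_split then separates the train and test sets.
max_input_length and max_target_length are the maximum input and target lengths, both in tokens: inputs are limited to 512 tokens and outputs to at most 30. This is a summarization task, so the output should not be long.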
train_dataset = load_dataset("csv", data_files="train.csv")
# Sizes of the training and test sets (absolute numbers of examples)
train_size = 15
test_size = 5
train_val = train_dataset["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
print(train_val)
max_input_length = 512
max_target_length = 30
Printing train_val shows the structure and size of the train and test sets:
DatasetDict({
    train: Dataset({
        features: ['id', 'data', 'label'],
        num_rows: 15
    })
    test: Dataset({
        features: ['id', 'data', 'label'],
        num_rows: 5
    })
})
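The effect of passing integer sizes to the split can be pictured with a small standard-library sketch: shuffle the rows with a fixed seed, then take two disjoint slices of the requested lengths. The helper split_by_count is hypothetical, and datasets uses its own shuffling, so the rows it selects will differ from this sketch even with the same seed.

```python
import random

def split_by_count(items, train_size, test_size, seed):
    """Shuffle a copy of items and take disjoint train/test slices by count."""
    if train_size + test_size > len(items):
        raise ValueError("train_size + test_size exceeds the number of rows")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:train_size]
    test = shuffled[train_size:train_size + test_size]
    return train, test

rows = list(range(100))  # stand-in for 100 CSV rows
train_rows, test_rows = split_by_count(rows, train_size=15, test_size=5, seed=42)
print(len(train_rows), len(test_rows))
```

Rows that fall in neither slice are simply left out, which is why a larger CSV still yields only 15 training and 5 test examples.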
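3. Text preprocessing
The tokenizer call takes several parameters: max_length caps the sequence length, padding controls whether sequences are padded, and truncation controls whether sequences longer than max_length are cut off at that length.
The preprocessing has two parts. preprocess_function follows the officially provided template. It first tokenizes the text in the "data" column to produce the model inputs (input_ids and attention_mask); this handles the original text. It then tokenizes the "label" column, which holds the summaries, and stores the resulting token ids under the "labels" key of the model inputs.
tokenizer.as_target_tokenizer is documented as: "Temporarily sets the tokenizer for encoding the targets. Useful for tokenizer associated to sequence-to-sequence models that need a slightly different processing for the labels." Stepping into it in the editor does not show the source, so readers who want the details will need to look them up. Newer versions of transformers deprecate this context manager in favor of passing the labels through the tokenizer's text_target argument.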
def preprocess_function(examples):
    # Tokenize the review text to build the model inputs
    model_inputs = tokenizer(
        examples["data"], max_length=max_input_length, padding=True, truncation=True
    )
    # Tokenize the labels with the target tokenizer, as in the official template
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["label"], max_length=max_target_length, padding=True, truncation=True
        )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
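What padding=True and truncation=True do to a batch of token ids can be sketched without the library. The helper pad_and_truncate and the pad id of 0 are assumptions for illustration; a real tokenizer also keeps special tokens such as the end-of-sequence marker when it truncates, which this sketch ignores.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask is 1 for real tokens and 0 for padding.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[5, 6, 7, 8, 9, 10], [11, 12], [13, 14, 15]]
out = pad_and_truncate(batch, max_length=4)
print(out["input_ids"])
print(out["attention_mask"])
```

The first sequence is cut to four ids and the other two are padded up to that length, so every row in the batch ends up with the same shape.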
train_val.map: "Apply a function to all the examples in the table (individually or in batches) and update the table. If your function returns a column that already exists, then it overwrites it." Here map runs preprocess_function over all of train_val and returns the updated dataset.
The batched parameter: when False, examples are processed one at a time; when True, each call receives a batch whose size is set by batch_size, which defaults to 1000.
- If batched is `False`, then the function takes 1 example in and should return 1 example. An example is a dictionary, e.g. `{"text": "Hello there !"}`.
- If batched is `True` and `batch_size` is 1, then the function takes a batch of 1 example as input and can return a batch with 1 or more examples. A batch is a dictionary, e.g. a batch of 1 example is `{"text": ["Hello there !"]}`.
- If batched is `True` and `batch_size` is `n > 1`, then the function takes a batch of `n` examples as input and can return a batch with `n` examples, or with an arbitrary number of examples. Note that the last batch may have less than `n` examples. A batch is a dictionary, e.g. a batch of `n` examples is `{"text": ["Hello there !"] * n}`.
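The batched calling convention described above can be imitated in a few lines of plain Python. This is only a sketch under simplifying assumptions: map_batched and add_length are hypothetical names, rows are held as a list of dicts, and the function is assumed to return as many values as it received, whereas the real map may also change the number of examples.

```python
def map_batched(rows, function, batch_size=1000):
    """Apply function to column-wise batches and merge the returned columns."""
    out_rows = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # A batch is a dict of lists: one list per column.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        # Returned columns are added; columns with an existing name are overwritten.
        batch.update(function(batch))
        size = len(next(iter(batch.values())))
        for i in range(size):
            out_rows.append({key: values[i] for key, values in batch.items()})
    return out_rows

def add_length(batch):
    # Receives {"text": [...]} and returns one new column of the same length.
    return {"length": [len(text) for text in batch["text"]]}

rows = [{"text": "Hello there !"}, {"text": "Hi"}, {"text": "Hey"}]
result = map_batched(rows, add_length, batch_size=2)
print(result)
```

With batch_size=2 the function is called twice, once with two examples and once with the single remaining example, which mirrors the note that the last batch may be smaller.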
tokenized_datasets = train_val.map(preprocess_function, batched=True)
print(tokenized_datasets)
# print(tokenized_datasets['train']['input_ids'])
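The datasets printed afterwards are shown here. Only the original columns are listed; the mapped datasets additionally contain the input_ids, attention_mask and labels columns produced by preprocess_function: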
DatasetDict({
    train: Dataset({
        features: ['id', 'data', 'label'],
        num_rows: 15
    })
    test: Dataset({
        features: ['id', 'data', 'label'],
        num_rows: 5
    })
})
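4. Model training
Before training, install the sentencepiece package (pip install sentencepiece), which the mT5 tokenizer relies on. Then load the model and set parameters such as the number of training epochs.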
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
batch_size = 3
num_train_epochs = 1000
# Log once per epoch
logging_steps = len(tokenized_datasets["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]
Define the compute_metrics function:
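import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # ROUGE compares strings, so decode the token ids and drop special tokens
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # -100 marks ignored label positions; swap in the pad id before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge_score.compute(
        predictions=decoded_preds, references=decoded_labels, use_stemmer=True
    )
    # Keep the mid F-measure of each ROUGE score, as a percentage
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    return {k: round(v, 4) for k, v in result.items()}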
Remove the columns that are no longer needed:
tokenized_datasets = tokenized_datasets.remove_columns(
    train_dataset["train"].column_names
)
Automatic padding:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding=True)
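The dynamic padding performed by the collator can be sketched as follows: each batch is padded only to its own longest sequence, and label padding uses -100 so that the loss skips those positions. The function collate is a hypothetical stand-in rather than the library code, and pad_id=0 is an assumption; -100 is the default label padding value of DataCollatorForSeq2Seq.

```python
def collate(features, pad_id=0, label_pad_id=-100):
    """Pad inputs and labels to the longest sequence in this batch only."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        in_pad = max_in - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * in_pad)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * in_pad)
        # Label padding uses -100 so these positions are ignored by the loss.
        lab_pad = max_lab - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_id] * lab_pad)
    return batch

features = [
    {"input_ids": [5, 6, 7], "labels": [21, 22]},
    {"input_ids": [8], "labels": [23]},
]
batch = collate(features)
print(batch)
```

Because padding happens per batch, short batches stay short, which saves computation compared with padding every example to a global maximum length.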
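Set the training arguments: output_dir is where the trained model is saved; evaluation_strategy is when evaluation runs; learning_rate is the learning rate; per_device_train_batch_size is the training batch size per device; per_device_eval_batch_size is the evaluation batch size per device; weight_decay is the weight decay coefficient applied by the optimizer (regularization on the weights; the learning-rate schedule is configured separately); num_train_epochs is the number of training epochs, after which training stops.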
args = Seq2SeqTrainingArguments(
    output_dir=f"{model_name}-finetuned-amazon-en-es",
    evaluation_strategy="epoch",
    learning_rate=5.6e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,  # Generate text during evaluation so ROUGE can be computed
    logging_steps=logging_steps,
    save_strategy='steps',
    save_steps=2000,
)
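The interaction between the batch size, the epoch count and save_steps comes down to a little arithmetic, worked through here. The calculation assumes a single device and no gradient accumulation, neither of which the original post states.

```python
import math

num_train_examples = 15
batch_size = 3
num_train_epochs = 1000
save_steps = 2000

# One optimizer step per batch (assuming one device, no gradient accumulation).
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs
# Checkpoints are written every save_steps steps.
checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))
print(steps_per_epoch, total_steps, checkpoint_steps)
```

Under these assumptions there are 5 steps per epoch and 5000 steps in total, so checkpoints fall at steps 2000 and 4000. With save_total_limit=1 only the most recent one is kept, which is consistent with the checkpoint-4000 directory used in the prediction example.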
Build the trainer by passing in the model, the arguments, the datasets and so on:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
Train the model:
trainer.train()
IV. Prediction example
from transformers import pipeline
# Adjust this path to your own output_dir and saved checkpoint step
summarizer = pipeline("summarization", model='./mt5-small-finetuned-amazon-en-es/checkpoint-4000')
print(summarizer('垃圾东西,一用就坏了,根本没法使用退钱'))
Printed output:
[{'summary_text': '质量问题'}]
Pretty fun!