Using the transformers library (Part 3): fine-tuning a pretrained model

桉夏与猫

Category: transformers. Tags: deep learning, neural networks, pytorch

Article link: https://blog.csdn.net/qq_28790663/article/details/120703495

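This post shows how to fine-tune a pretrained Hugging Face model with the Transformers library for sentiment analysis on the IMDB dataset. It first loads the IMDB dataset with the Datasets library and splits it, then creates a tokenizer with AutoTokenizer to preprocess the data. Next, it defines a small training set and a small validation set and instantiates a BertForSequenceClassification model. The training parameters are set through TrainingArguments, and a Trainer object is created to train the model. Finally, it defines an accuracy metric so that the model's performance can be reported.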

1. Preparing a dataset

Here we use the Datasets library to download and prepare the IMDB dataset.

First, download the dataset with the load_dataset function:

from datasets import load_dataset
raw_datasets = load_dataset("imdb")

The raw_datasets object created here is a dictionary with three keys: train, test, and unsupervised.

We will use train for training and test for validation.

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

To process the dataset we need a tokenizer.

Create the tokenizer:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

Next, use the map method to process every split:

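def tokenize_function(examples):
    # Pad every example to the maximum length and truncate anything longer
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# batched=True passes batches of examples to tokenize_function
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
print(tokenized_datasets)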

After processing, the dataset looks like this:

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'text', 'token_type_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'text', 'token_type_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'text', 'token_type_ids'],
        num_rows: 50000
    })
})
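To make the effect of padding='max_length' and truncation=True more concrete, here is a toy sketch in plain Python. The helper pad_or_truncate, the pad id 0, and the token ids are made up for illustration and are not the tokenizer's actual implementation. The real tokenizer also takes care of special tokens when truncating, and with no max_length given it pads to the model's maximum input length (512 for BERT, as far as I know).

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy illustration: return fixed-length input_ids and the matching attention_mask."""
    # Truncation: drop everything beyond max_length
    kept = token_ids[:max_length]
    # Padding: fill the remaining positions with pad_id
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # The mask is 1 for real tokens and 0 for padding positions
    attention_mask = [1] * len(kept) + [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# A sequence shorter than max_length gets padded
short = pad_or_truncate([101, 7592, 102], max_length=5)
# A sequence longer than max_length gets cut
long = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short)
print(long)
```

Every example ends up with the same length, which is what lets the examples be stacked into batches later.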

Next, carve out a small training set and a small validation set so that training runs faster:

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
full_train_dataset = tokenized_datasets["train"]
full_eval_dataset = tokenized_datasets["test"]
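The idea behind shuffle(seed=42).select(range(1000)) is a reproducible random subset: shuffle with a fixed seed, then keep the first 1000 rows. The plain-Python sketch below shows the same idea on a list of integers. The helper shuffle_and_select is hypothetical, and the Datasets library uses its own random generator, so the actual indices it picks will differ from these.

```python
import random


def shuffle_and_select(rows, seed, n):
    """Shuffle a copy of rows with a fixed seed and keep the first n."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n]


rows = list(range(25000))  # stand-in for the 25,000 training examples
subset_a = shuffle_and_select(rows, seed=42, n=1000)
subset_b = shuffle_and_select(rows, seed=42, n=1000)

# Same seed, same subset: the small dataset is reproducible across runs
print(len(subset_a), subset_a == subset_b)
```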

2. Fine-tuning in PyTorch with the Trainer API

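Because PyTorch does not ship a ready-made training loop, the Transformers library provides the Trainer API, which is optimized for Transformers models and offers a wide range of training options and built-in features.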
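For context, a hand-written training loop has roughly the shape sketched below: loop over epochs, split the data into batches, compute the loss gradient, and update the parameters. This is a toy in plain Python that fits a single weight w to y = 2x. The function train and all of its hyperparameters are made up for illustration and say nothing about how Trainer is implemented internally.

```python
import random


def train(examples, epochs=20, batch_size=4, lr=0.5, seed=0):
    """Toy mini-batch gradient descent on the model y_hat = w * x."""
    rng = random.Random(seed)
    examples = list(examples)
    w = 0.0
    losses = []
    for _ in range(epochs):
        rng.shuffle(examples)  # new example order every epoch
        epoch_loss = 0.0
        for start in range(0, len(examples), batch_size):
            batch = examples[start:start + batch_size]
            # Accumulate the squared error of this batch for reporting
            epoch_loss += sum((w * x - y) ** 2 for x, y in batch)
            # Gradient of the mean squared error with respect to w
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            # Parameter update (plain SGD step)
            w -= lr * grad
        losses.append(epoch_loss / len(examples))
    return w, losses


examples = [(x / 10, 2 * x / 10) for x in range(1, 11)]  # points on y = 2x
w, losses = train(examples)
print(round(w, 3), losses[0] > losses[-1])
```

Trainer takes this bookkeeping off our hands, which is why the rest of this post only has to configure it.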

Next, define the model:

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased",num_labels=2)

Note that this prints warnings: some of the pretrained weights are not used, and some weights are randomly initialized.

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

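This is because we discard the pretraining head of the BERT model and replace it with a randomly initialized classification head.

We will fine-tune this model on our task, transferring the knowledge of the pretrained model to it.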
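One way to read the warning is as a comparison between two sets of parameter names: those stored in the checkpoint and those the new model expects. The sketch below reproduces that comparison with plain Python sets. It is a simplification for illustration, not the library's loading code. The cls.* and classifier.* names come from the warning above, while bert.embeddings.word_embeddings.weight is added here as one representative encoder weight.

```python
# Parameter names stored in the pretrained checkpoint (abridged)
checkpoint_keys = {
    "bert.embeddings.word_embeddings.weight",  # representative encoder weight
    "cls.predictions.bias",                    # pretraining head
    "cls.seq_relationship.weight",             # pretraining head
    "cls.seq_relationship.bias",               # pretraining head
}

# Parameter names expected by the sequence classification model (abridged)
model_keys = {
    "bert.embeddings.word_embeddings.weight",
    "classifier.weight",  # new classification head
    "classifier.bias",    # new classification head
}

loaded = sorted(checkpoint_keys & model_keys)             # copied from the checkpoint
unused = sorted(checkpoint_keys - model_keys)             # "were not used"
newly_initialized = sorted(model_keys - checkpoint_keys)  # "newly initialized"

print("loaded:", loaded)
print("unused:", unused)
print("newly initialized:", newly_initialized)
```

The encoder weights are shared by both sets and get loaded; only the head differs, which is why the warning is expected here.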

To create a Trainer, we first need to instantiate TrainingArguments.

This class holds the hyperparameters used by the Trainer.

from transformers import TrainingArguments

training_args = TrainingArguments("test_trainer")

Next, instantiate a Trainer:

from transformers import Trainer
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset= small_train_dataset,
    eval_dataset=small_eval_dataset
)

To fine-tune the model, simply call trainer.train():

trainer.train()

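After calling train, a progress bar shows how training is advancing.

However, it does not tell you how well the model performs: by default, no evaluation happens during training. So how do we evaluate the model while it trains?

For the Trainer to compute and report metrics, it needs a metric function that receives the predictions and the true labels.

The Datasets library offers a simple way to get one: the load_metric function, which gives access to a number of common metrics.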

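import numpy as np
from datasets import load_metric

# Load the accuracy metric (newer releases moved this to evaluate.load("accuracy"))
metric = load_metric("accuracy")

def compute_metric(eval_pred):
    # eval_pred unpacks into the model logits and the reference labels
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metric,  # the Trainer keyword is compute_metrics (plural)
)
trainer.evaluate()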

Note that compute_metric takes a single argument, a tuple that unpacks into logits and labels, and returns a dictionary with string keys.
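To make that contract concrete, here is the same computation on a tiny hand-made example, written in plain Python without NumPy or the metric object. The names argmax and accuracy_from_logits and the numbers are made up for illustration.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy_from_logits(logits, labels):
    """Turn logits into class predictions and return a string-keyed dict."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Four examples, two classes; the third example is misclassified
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
labels = [0, 1, 0, 0]
result = accuracy_from_logits(logits, labels)
print(result)  # {'accuracy': 0.75}
```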

The function is called at the end of each evaluation phase on the complete arrays of predictions and labels.
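As for running evaluation during training rather than only through an explicit call to evaluate(): in the Transformers releases I am aware of, this is switched on through TrainingArguments. The argument was called evaluation_strategy in older releases and eval_strategy in newer ones, so the sketch below picks whichever name the installed version accepts. This step is not part of the original walkthrough, so treat it as an assumption to check against the documentation of your version.

```python
import inspect

from transformers import TrainingArguments

# Pick whichever argument name the installed transformers version accepts
params = inspect.signature(TrainingArguments.__init__).parameters
strategy_key = "eval_strategy" if "eval_strategy" in params else "evaluation_strategy"

# Run an evaluation pass on eval_dataset at the end of every epoch
training_args = TrainingArguments(output_dir="test_trainer", **{strategy_key: "epoch"})
```

A Trainer built with these arguments and a metric function should then report the metric after each epoch while trainer.train() runs.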