ViT: Fine-Tuning a Vision Transformer with Hugging Face and PyTorch in Practice

Latest recommended article published on 2024-06-25 18:27:48

小北的北


Article link: https://blog.csdn.net/weixin_38739735/article/details/137064991


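Exploring CIFAR-10 Image Classification

Introduction

You have surely heard of "Attention Is All You Need". Transformers started out in text and are now everywhere, including images, through the Vision Transformer (ViT), first introduced in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". This is not just another flashy trend; ViTs have proven to be strong contenders that can rival traditional models such as convolutional neural networks (CNNs).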

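A brief overview of ViT:

Split the image into patches, then pass the patches through a fully connected (FC) network, or an FC+CNN, to obtain the input embedding vectors.
Add positional information.
Feed the result into a standard Transformer encoder and attach an FC layer at the end.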
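
The first two steps of the overview can be sketched in a few lines of PyTorch. This is a minimal illustration, not the Hugging Face implementation: the class name `PatchEmbedding` is hypothetical, and the learnable class token prepended to the sequence is an assumption taken from the original ViT design rather than from this article.

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Illustrative ViT-style input embedding (hypothetical helper)."""

    def __init__(self, image_size=224, patch_size=16, channels=3, dim=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A convolution with kernel == stride == patch_size is equivalent to
        # cutting the image into non-overlapping patches and applying one
        # shared linear (FC) layer to every flattened patch.
        self.proj = nn.Conv2d(channels, dim, kernel_size=patch_size, stride=patch_size)
        # Assumed: a learnable class token and learnable position embeddings.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):
        x = self.proj(x)                   # (batch, dim, H/patch, W/patch)
        x = x.flatten(2).transpose(1, 2)   # (batch, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)     # (batch, num_patches + 1, dim)
        return x + self.pos_embed          # add positional information


embed = PatchEmbedding()
tokens = embed(torch.randn(2, 3, 224, 224))
print(embed.num_patches)  # 196
print(tokens.shape)       # torch.Size([2, 197, 768])
```

With a 224x224 input and 16x16 patches there are 14 x 14 = 196 patches, so the encoder sees a sequence of 197 tokens once the class token is included.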

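ViT architecture

This story is not about understanding the details of ViT; it is a guide to fine-tuning a pretrained ViT image classification model with Hugging Face and PyTorch and using it for your own task.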

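Problem Statement

Our goal is to use a pretrained Vision Transformer model for image classification on the CIFAR-10 dataset. The challenge is that the dataset the model was trained on and the target dataset differ in image size and in the number of output classes. To bridge this gap, we use fine-tuning.

The model we will use is google/vit-base-patch16-224 (any dataset or model can be used with appropriate adjustments). It was pretrained on ImageNet-21k (14 million images, 21,843 classes) and fine-tuned on ImageNet-1k (1 million images, 1,000 classes). It uses a 16x16 patch size and processes images of size 3x224x224.

We will fine-tune it further on CIFAR-10, which has only 10 output classes and images of size 3x32x32. This tutorial can serve as a starting point for fine-tuning any ViT available in the Hugging Face library for a variety of tasks.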

Setting Up the Environment

You can use Jupyter or Google Colab. Install and import the necessary libraries and frameworks.

!pip install torch torchvision
!pip install transformers datasets
!pip install transformers[torch]

# PyTorch
import torch
import torchvision
from torchvision.transforms import Normalize, Resize, ToTensor, Compose
# Displaying images
from PIL import Image
import matplotlib.pyplot as plt
from torchvision.transforms import ToPILImage
# Loading the dataset
from datasets import load_dataset
# Transformers
from transformers import ViTImageProcessor, ViTForImageClassification
from transformers import TrainingArguments, Trainer
# Array operations
import numpy as np
# Evaluation
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

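Data Preprocessing

For this demonstration we use only a small portion of the dataset. Split the data into training, validation, and test sets: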

trainds, testds = load_dataset("cifar10", split=["train[:5000]","test[:1000]"])
splits = trainds.train_test_split(test_size=0.1)
trainds = splits['train']
valds = splits['test']
trainds, valds, testds

# Output
(Dataset({
     features: ['img', 'label'],
     num_rows: 4500
 }),
 Dataset({
     features: ['img', 'label'],
     num_rows: 500
 }),
 Dataset({
     features: ['img', 'label'],
     num_rows: 1000
 }))

If you are not familiar with the datasets package, here is how to access its items:

trainds.features, trainds.num_rows, trainds[0]

# Output
({'img': Image(decode=True, id=None),
  'label': ClassLabel(names=['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'], id=None)},
 4500,
 {'img': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=32x32>,
  'label': 0})

Now let's map integer labels to string labels and vice versa.

itos = dict((k,v) for k,v in enumerate(trainds.features['label'].names))
stoi = dict((v,k) for k,v in enumerate(trainds.features['label'].names))
itos

# Output
{0: 'airplane',
  1: 'automobile',
  2: 'bird',
  3: 'cat',
  4: 'deer',
  5: 'dog',
  6: 'frog',
  7: 'horse',
  8: 'ship',
  9: 'truck'}

Now let's display an image from the dataset together with its label.

index = 0
img, lab = trainds[index]['img'], itos[trainds[index]['label']]
print(lab)
img

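Airplane: a 3x32x32 image

Now let's do some image processing with Hugging Face and PyTorch. We load ViTImageProcessor (the image counterpart of a tokenizer) to obtain the normalization statistics and input size the pretrained model expects.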

model_name = "google/vit-base-patch16-224"
processor = ViTImageProcessor.from_pretrained(model_name) 


mu, sigma = processor.image_mean, processor.image_std #get default mu,sigma
size = processor.size

We use a TorchVision transforms pipeline. Other transforms can be added to suit your data.

norm = Normalize(mean=mu, std=sigma) #normalize image pixels range to [-1,1]


# resize 3x32x32 to 3x224x224 -> convert to Pytorch tensor -> normalize
_transf = Compose([
    Resize(size['height']),
    ToTensor(),
    norm
]) 


# apply transforms to PIL Image and store it to 'pixels' key
def transf(arg):
    arg['pixels'] = [_transf(image.convert('RGB')) for image in arg['img']]
    return arg

Apply the transform to each dataset.

trainds.set_transform(transf)
valds.set_transform(transf)
testds.set_transform(transf)

To view a transformed image, run the following snippet:

idx = 0
ex = trainds[idx]['pixels']
ex = (ex+1)/2 #imshow requires image pixels to be in the range [0,1]
exi = ToPILImage()(ex)
plt.imshow(exi)
plt.show()
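
The (ex+1)/2 step works here because this processor's mean and standard deviation are both 0.5 per channel. For other checkpoints the statistics may differ, so a more general way to undo the normalization is to multiply by the standard deviation and add the mean. A minimal sketch, where denormalize is a hypothetical helper name:

```python
import torch


def denormalize(pixels, mean, std):
    """Invert Normalize(mean, std) on a (C, H, W) tensor and clamp to [0, 1]."""
    mean = torch.tensor(mean).view(-1, 1, 1)
    std = torch.tensor(std).view(-1, 1, 1)
    return (pixels * std + mean).clamp(0.0, 1.0)


# Round trip on a random "image" with mean = std = 0.5
mean, std = [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]
original = torch.rand(3, 4, 4)
normalized = (original - 0.5) / 0.5        # values now lie in [-1, 1]
restored = denormalize(normalized, mean, std)

print(torch.allclose(restored, original, atol=1e-6))  # True
```

In the tutorial this would be called as denormalize(trainds[idx]['pixels'], mu, sigma).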

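Transformed airplane: 3x224x224

Fine-Tuning the Model

We use Hugging Face's ViTForImageClassification, which takes images as input and outputs class predictions. Let's first look at the classifier of the original model.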

model_name = "google/vit-base-patch16-224"
model = ViTForImageClassification.from_pretrained(model_name)
print(model.classifier)

# Output
Linear(in_features=768, out_features=1000, bias=True)

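It outputs scores (logits) for 1,000 classes because it was originally fine-tuned on ImageNet-1k.

We can adapt it to output 10 classes with the following arguments: num_labels sets the number of nodes in the final linear layer; ignore_mismatched_sizes is needed because the checkpoint's head has 1,000 output nodes while ours has only 10; id2label and label2id provide the mapping between label indices and label strings.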

model = ViTForImageClassification.from_pretrained(model_name, num_labels=10, ignore_mismatched_sizes=True, id2label=itos, label2id=stoi)
print(model.classifier)

# Output
Linear(in_features=768, out_features=10, bias=True)
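
To see what the model consumes and returns without downloading any weights, one option is to build a very small, randomly initialized ViT from a config. The sizes below are arbitrary assumptions chosen only to keep the example fast; they do not match google/vit-base-patch16-224.

```python
import torch
from transformers import ViTConfig, ViTForImageClassification

# Tiny random-weight model (illustrative sizes, not the real checkpoint)
config = ViTConfig(
    image_size=32,
    patch_size=16,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=10,
)
tiny = ViTForImageClassification(config)

batch = torch.randn(4, 3, 32, 32)
labels = torch.tensor([0, 1, 2, 3])

# The forward pass takes `pixel_values` and, optionally, `labels`;
# when labels are given, a classification loss is returned as well.
out = tiny(pixel_values=batch, labels=labels)

print(tiny.classifier)    # Linear(in_features=32, out_features=10, bias=True)
print(out.logits.shape)   # torch.Size([4, 10])
print(out.loss.item())
```

The keyword names pixel_values and labels are the same ones the data collator has to produce later.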

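Hugging Face Trainer

The Trainer provides a high-level abstraction that simplifies training and evaluation.

Let's start with the training arguments, where you define hyperparameters, logging, metrics, and so on.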

args = TrainingArguments(
    f"test-cifar-10",
    save_strategy="epoch",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=10,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    logging_dir='logs',
    remove_unused_columns=False,
)

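Next we need a collate function for data loading. It stacks the pixel values into a tensor and creates a tensor for the labels. The model expects pixel_values and labels in each input batch, so do not rename these keys.

We also need a function to compute metrics. In our case we use accuracy. I recommend passing sample inputs to these functions and printing the values to understand them better.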

def collate_fn(examples):
    pixels = torch.stack([example["pixels"] for example in examples])
    labels = torch.tensor([example["label"] for example in examples])
    return {"pixel_values": pixels, "labels": labels}


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return dict(accuracy=accuracy_score(predictions, labels))
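
Following that suggestion, here is one way to exercise both functions on dummy inputs. The two functions are repeated so the snippet runs on its own; the zero-filled tensors and hand-written logits are made-up sample data.

```python
import numpy as np
import torch
from sklearn.metrics import accuracy_score


def collate_fn(examples):
    pixels = torch.stack([example["pixels"] for example in examples])
    labels = torch.tensor([example["label"] for example in examples])
    return {"pixel_values": pixels, "labels": labels}


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return dict(accuracy=accuracy_score(predictions, labels))


# Four fake examples shaped like the transformed dataset items
examples = [{"pixels": torch.zeros(3, 224, 224), "label": i} for i in range(4)]
batch = collate_fn(examples)
print(batch["pixel_values"].shape)  # torch.Size([4, 3, 224, 224])
print(batch["labels"])              # tensor([0, 1, 2, 3])

# Fake logits for 4 samples and 3 classes; argmax gives [0, 1, 0, 2]
logits = np.array([[2.0, 0.1, 0.3],
                   [0.2, 1.5, 0.1],
                   [0.9, 0.2, 0.4],
                   [0.1, 0.2, 3.0]])
true_labels = np.array([0, 1, 2, 2])
metrics = compute_metrics((logits, true_labels))
print(metrics)  # accuracy is 0.75: three of the four predictions match
```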

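Now pass the model, the training arguments, the datasets, the collate function, the metric function, and the image processor we defined earlier to the Trainer: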

trainer = Trainer(
    model,
    args, 
    train_dataset=trainds,
    eval_dataset=valds,
    data_collator=collate_fn,
    compute_metrics=compute_metrics,
    tokenizer=processor,
)

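Training the Model

Call the following command to start fine-tuning. Note that, as configured above, no layers are frozen, so this updates all of the model's weights, not only the newly initialized final layer; training the last layer alone would require freezing the other parameters first.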

trainer.train()
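
Calling trainer.train() on its own updates every parameter of the model. If the intent is to train only the new classification head, the backbone has to be frozen before the Trainer is created. A hedged sketch of how that could look, using a tiny randomly initialized model (arbitrary, assumed sizes) so it runs without downloading the checkpoint:

```python
from transformers import ViTConfig, ViTForImageClassification

# Tiny stand-in model; in the tutorial this would be the `model`
# loaded with from_pretrained(...)
config = ViTConfig(
    image_size=32,
    patch_size=16,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=10,
)
tiny_model = ViTForImageClassification(config)

# Freeze the ViT backbone; the classifier head stays trainable
for param in tiny_model.vit.parameters():
    param.requires_grad = False

trainable = [name for name, p in tiny_model.named_parameters() if p.requires_grad]
print(trainable)  # ['classifier.weight', 'classifier.bias']
```

Whether to freeze the backbone is a design choice: head-only training is cheaper, while full fine-tuning, which is what the tutorial's code actually does, lets the whole network adapt to the new data.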

When training finishes, you will see logs and output like the following:

# Output
TrainOutput(global_step=675, training_loss=0.22329048227380824, metrics={'train_runtime': 1357.9833, 'train_samples_per_second': 9.941, 'train_steps_per_second': 0.497, 'total_flos': 1.046216869705728e+18, 'train_loss': 0.22329048227380824, 'epoch': 3.0})
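
The step count in this log can be sanity-checked with a little arithmetic. The sketch below is an approximation, assuming steps per epoch equal the number of examples divided by the effective batch size, rounded up; total_steps is a hypothetical helper, and the number of devices is an assumption because the article does not state the hardware used.

```python
import math


def total_steps(num_examples, per_device_batch, num_devices, epochs, grad_accum=1):
    """Approximate number of optimizer steps for a full training run."""
    effective_batch = per_device_batch * num_devices * grad_accum
    return math.ceil(num_examples / effective_batch) * epochs


# 4,500 training images, per-device batch size 10, 3 epochs
print(total_steps(4500, 10, num_devices=1, epochs=3))  # 1350
print(total_steps(4500, 10, num_devices=2, epochs=3))  # 675

# Samples per step implied by the logged throughput figures
print(round(9.941 / 0.497))  # 20
```

The logged 675 steps, together with roughly 20 samples per step, are consistent with an effective batch size of 20, for example two devices with 10 samples each, although the source does not confirm the setup.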

Evaluation

outputs = trainer.predict(testds)
print(outputs.metrics)

# Output
{'test_loss': 0.07223748415708542, 'test_accuracy': 0.973, 'test_runtime': 28.5169, 'test_samples_per_second': 35.067, 'test_steps_per_second': 4.383}

Here is how to access the outputs:

itos[np.argmax(outputs.predictions[0])], itos[outputs.label_ids[0]]
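
The values in outputs.predictions are raw logits, not probabilities. If a confidence score per class is wanted, a softmax can be applied. A small sketch on made-up logits, where softmax is a hand-written helper, not a Trainer feature:

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Made-up logits for 2 samples and 3 classes
logits = np.array([[4.0, 1.0, 0.5],
                   [0.2, 0.1, 2.5]])
probs = softmax(logits)

print(probs.sum(axis=1))     # each row sums to 1
print(probs.argmax(axis=1))  # [0 2], same ranking as the logits
```

In the tutorial, softmax(outputs.predictions)[0] would give the ten class probabilities for the first test image.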

To plot the confusion matrix, use the following code:

y_true = outputs.label_ids
y_pred = outputs.predictions.argmax(1)


labels = trainds.features['label'].names
cm = confusion_matrix(y_true, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot(xticks_rotation=45)

Confusion matrix
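
Beyond the plot, the confusion matrix can also be reduced to a per-class accuracy (recall) by dividing its diagonal by the row sums, which shows which classes the model struggles with. A sketch on made-up labels for three classes:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions (illustrative only)
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(per_class_recall)  # 0.5, 1.0 and about 0.667
```

With the tutorial's variables, the same two lines applied to y_true and y_pred give one value per CIFAR-10 class.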
