ViT (Vision Transformer): Principles, and Building and Training a Vision Transformer (ViTransformer) with PyTorch

scient

scient is a Python package of scientific-computing algorithms, with modules for natural language, images, neural networks, optimization, machine learning, graph computation, and more.

The source code and binary installers for the latest released version are available at the Python Package Index:

https://pypi.org/project/scient

You can install scient with pip:

pip install scient

Or install via setup.py: in the scient directory, execute:

python setup.py install

scient.neuralnet

Neural-network algorithms, including the attention, transformer, bert, lstm, resnet, crf, dataset, fit, and other modules.

scient.neuralnet.transformer

The transformer module implements transformer models built on multi-head attention, including Transformer, ViTransformer, Former, Encoder, and Decoder.

scient.neuralnet.transformer.ViTransformer

Vision Transformer, a transformer model applied to images.

scient.neuralnet.transformer.ViTransformer(n_class, patch_size=16, in_channels=3, in_shape=(224,224), embed_size=768,
                 n_head=8, n_encode_layer=6, n_decode_layer=6, ff_size=2048, dropout=0.1, quantile=None,
                 eps: float = 1e-9)

Parameters

n_class=1000 int, number of classes; ViTransformer's pre-training task is image classification
patch_size=16 int or (int, int), patch size
in_channels=3 int, number of image channels
in_shape=(224,224) image height and width
embed_size=768 int, length of the embedding vectors
The following are transformer parameters:
n_head: int=8 number of heads in multi-head attention
n_encode_layer: int=6 number of encoder layers
n_decode_layer: int=6 number of decoder layers
ff_size: int=2048 feed-forward layer size
dropout: float=0.1 dropout rate
quantile=None sparse-attention quantile; in multi-head attention, attention scores below this quantile value are ignored
eps: float=1e-9 small constant for numerical stability
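
For example, the model used in the training script later in this article is built as follows; with a 160x160 input and 16x16 patches, the encoder sees (160 // 16) * (160 // 16) = 100 patches plus the BOS token, i.e. a sequence of 101 tokens:

from scient.neuralnet import transformer

# 10-class classifier on 160x160 RGB images
model = transformer.ViTransformer(n_class=10, patch_size=16, in_channels=3,
                                  in_shape=(160, 160), embed_size=768)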

scient.neuralnet.transformer.ViTransformer.forward(source,pad_mask)

Parameters

source torch.Tensor, the input sequence
pad_mask torch.Tensor, mask marking the pad-filled positions

Returns

torch.Tensor, shape batch_size * n_class

scient.neuralnet.transformer.ViTransformer.encode(source,pad_mask)

Parameters

Same as forward.

Returns

torch.Tensor, shape batch_size * (n_patch + 1) * embed_size
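
A hypothetical usage sketch of forward and encode based on the signatures above. The pad_mask construction is an assumption here (an all-False mask meaning "no positions are padded"); check scient's documentation for the exact convention:

import torch
from scient.neuralnet import transformer

model = transformer.ViTransformer(n_class=1000, patch_size=16, in_channels=3,
                                  in_shape=(224, 224), embed_size=768)
images = torch.randn(2, 3, 224, 224)   # a batch of 2 RGB images

# assumed pad_mask: batch_size x (n_patch + 1), all False = nothing padded
pad_mask = torch.zeros(2, (224 // 16) ** 2 + 1, dtype=torch.bool)

logits = model(images, pad_mask)           # (2, 1000) = batch_size x n_class
features = model.encode(images, pad_mask)  # (2, 197, 768) = batch_size x (n_patch + 1) x embed_size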

Algorithms

ViTransformer (ViT), short for Vision Transformer, is a deep learning model that applies the Transformer architecture to computer-vision tasks.

The biggest difference between ViTransformer and Transformer is that the image is mapped to vectors with a PatchEmbedding and a PositionEmbedding, after which only the Transformer's Encoder module is used. The figure below shows the Transformer architecture: the green box marks the parts ViTransformer reuses, and the red box marks where ViTransformer differs from Transformer.

[Figure: Transformer architecture; green box = parts reused by ViTransformer, red box = parts replaced]

ViTransformer replaces the red-boxed part of the figure above with the PatchEmbedding + PositionEmbedding shown below:

[Figure: PatchEmbedding + PositionEmbedding]

  • PatchEmbedding
    Mirroring how the Transformer is applied to text, ViTransformer first splits the image into patches of equal size, then treats each patch as a token, in order; the flattened patches form the input sequence. The splitting is implemented with a convolution: setting the kernel size and stride controls the patch size and the number of patches.
import torch
from torch import nn

class PatchEmbedding(nn.Module):
    def __init__(self, patch_size=16, in_channels=3, in_shape=[224, 224], embed_size=768):
        super().__init__()
        self.n_patch = (in_shape[0] // patch_size) * (in_shape[1] // patch_size)  # n_patch = 196

        # a conv with kernel_size = stride = patch_size splits the image into patches
        self.projection = nn.Conv2d(in_channels, embed_size, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.projection(x)  # convolve first: [1,3,224,224] -> [1,768,14,14]
        x = x.flatten(2)        # flatten the spatial dims: [1,768,196]
        x = x.transpose(1, 2)   # patches as a sequence: [1,196,768]
        return x

n_patch is the number of patches the image is split into, and projection performs the convolutional splitting of the input image. With the code above, a 3 * 224 * 224 image is converted into a 196 * 768 matrix, where 196 is the sequence length n_patch and 768 is embed_size.
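To make the shape bookkeeping concrete, here is a quick check of the PatchEmbedding class above on a random image:

import torch

patch_embed = PatchEmbedding(patch_size=16, in_channels=3, in_shape=[224, 224], embed_size=768)
x = torch.randn(1, 3, 224, 224)   # one 3-channel 224x224 image
out = patch_embed(x)

print(patch_embed.n_patch)  # 196 = (224 // 16) * (224 // 16)
print(out.shape)            # torch.Size([1, 196, 768])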

  • PositionEmbedding
    The PositionEmbedding is set up as an (n_patch + 1) * embed_size matrix; why it is n_patch + 1 is explained below.
PositionEmbedding = nn.Parameter(torch.zeros(1, n_patch + 1, embed_size))
  • BOS token
    BOS stands for Begin of Sequence. When preparing text corpora for a Transformer, a placeholder token is added at the start of each sentence, sometimes also at the end or between the sentences of a pair; in text training these placeholders serve specific purposes. ViTransformer prepends such a placeholder to the patch sequence, and after the sequence passes through the Encoder, the output vector at that position is used as the image feature for computing the prediction loss, as shown below:

[Figure: the BOS token prepended to the patch sequence; its encoder output serves as the image feature]

The bos below is prepended to the PatchEmbedding output as its first token, so that the PatchEmbedding and PositionEmbedding have the same size; adding the two gives the image's embedding.

bos = nn.Parameter(torch.zeros(1, 1, embed_size))
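
A minimal sketch (illustrative only, not scient's exact internals) of how the three pieces combine: expand bos across the batch, prepend it to the patch sequence, then add the PositionEmbedding:

import torch
from torch import nn

batch_size, n_patch, embed_size = 1, 196, 768

patches = torch.randn(batch_size, n_patch, embed_size)                     # PatchEmbedding output
bos = nn.Parameter(torch.zeros(1, 1, embed_size))                          # learnable BOS token
PositionEmbedding = nn.Parameter(torch.zeros(1, n_patch + 1, embed_size))  # position embedding

x = torch.cat([bos.expand(batch_size, -1, -1), patches], dim=1)  # [1, 197, 768]
x = x + PositionEmbedding                                        # image embedding, still [1, 197, 768]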

Below is the complete code for building a ViTransformer model with scient and training it on a dataset:

import torch
import torchvision.transforms as tt
from torchvision.datasets import ImageFolder
from scient.neuralnet import transformer,fit

data_path='imagewoof2-160'

#%% Data
train_tfms = tt.Compose([tt.RandomCrop(160, padding=4, padding_mode='reflect'), 
                         tt.RandomHorizontalFlip(), 
                         tt.ToTensor(), 
                         tt.Normalize(mean=(0.4914, 0.4822, 0.4465),
                                      std=(0.2023, 0.1994, 0.2010),
                                      inplace=True)])
valid_tfms = tt.Compose([tt.Resize([160,160]),
                         tt.ToTensor(), 
                         tt.Normalize(mean=(0.4914, 0.4822, 0.4465),
                                      std=(0.2023, 0.1994, 0.2010))])

# Create ImageFolder datasets
data_train = ImageFolder(data_path+'/train', train_tfms)
data_eval = ImageFolder(data_path+'/val', valid_tfms)

# Set the batch size
batch_size = 32

# Create data loaders for the training and validation sets
train_loader = torch.utils.data.DataLoader(data_train, batch_size=batch_size, shuffle=True)
eval_loader = torch.utils.data.DataLoader(data_eval, batch_size=batch_size, shuffle=False)
#%% Model
model=transformer.ViTransformer(n_class=10,patch_size=16,in_shape=(160,160),embed_size=768)

#%% Training
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

max_lr = 0.0001
weight_decay = 1e-4
n_iter=10

optimizer = torch.optim.Adam(model.parameters(), max_lr, weight_decay=weight_decay)
# Set up one-cycle learning rate scheduler
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr, epochs=n_iter, 
                                            steps_per_epoch=len(train_loader))
loss_func=torch.nn.CrossEntropyLoss()

# perform_func computes top-1 accuracy over the accumulated batch outputs
def perform_func(y_hat,y):
    y_hat,y=torch.cat(y_hat),torch.cat(y)  # concatenate per-batch outputs
    _,y_hat=y_hat.max(dim=1)               # predicted class = argmax over logits
    return round((y_hat==y).sum().item()/len(y),5)

# grad_func clips gradient values element-wise to [-0.1, 0.1]
def grad_func(p):
    torch.nn.utils.clip_grad_value_(p,0.1)
    
model=fit.set(model,optimizer=optimizer,scheduler=scheduler,grad_func=grad_func,loss_func=loss_func,perform_func=perform_func,n_iter=n_iter,device=device)
model.fit(train_loader,eval_loader,mode=('input','target'))

The dataset can be downloaded here:
imagewoof dataset download

As the training log shows, accuracy on the validation set rises from 0.15169 to 0.41511 after only 10 iterations, indicating that the ViTransformer model learns image features effectively.

train iter 0: avg_batch_loss=2.26922 perform=0.15711: 100%|██████████| 283/283 [06:01<00:00,  1.28s/it]
eval iter 0: avg_batch_loss=2.29831 perform=0.15169: 100%|██████████| 123/123 [01:02<00:00,  1.96it/s]
train iter 1: avg_batch_loss=2.21478 perform=0.18072: 100%|██████████| 283/283 [04:19<00:00,  1.09it/s]
eval iter 1: avg_batch_loss=2.30398 perform=0.16899: 100%|██████████| 123/123 [01:00<00:00,  2.03it/s]
train iter 2: avg_batch_loss=2.09721 perform=0.22504: 100%|██████████| 283/283 [04:16<00:00,  1.10it/s]
eval iter 2: avg_batch_loss=2.35796 perform=0.16772: 100%|██████████| 123/123 [01:01<00:00,  2.01it/s]
train iter 3: avg_batch_loss=1.98703 perform=0.25939: 100%|██████████| 283/283 [04:43<00:00,  1.00s/it]
eval iter 3: avg_batch_loss=2.01205 perform=0.27895: 100%|██████████| 123/123 [01:08<00:00,  1.80it/s]
train iter 4: avg_batch_loss=1.89123 perform=0.30603: 100%|██████████| 283/283 [04:45<00:00,  1.01s/it]
eval iter 4: avg_batch_loss=1.96699 perform=0.28607: 100%|██████████| 123/123 [01:02<00:00,  1.96it/s]
train iter 5: avg_batch_loss=1.82591 perform=0.32565: 100%|██████████| 283/283 [04:16<00:00,  1.10it/s]
eval iter 5: avg_batch_loss=1.94186 perform=0.29294: 100%|██████████| 123/123 [00:59<00:00,  2.07it/s]
train iter 6: avg_batch_loss=1.74809 perform=0.36332: 100%|██████████| 283/283 [04:19<00:00,  1.09it/s]
eval iter 6: avg_batch_loss=1.75617 perform=0.36777: 100%|██████████| 123/123 [01:03<00:00,  1.94it/s]
train iter 7: avg_batch_loss=1.64716 perform=0.40698: 100%|██████████| 283/283 [04:20<00:00,  1.09it/s]
eval iter 7: avg_batch_loss=1.74583 perform=0.38763: 100%|██████████| 123/123 [01:00<00:00,  2.04it/s]
train iter 8: avg_batch_loss=1.55075 perform=0.44110: 100%|██████████| 283/283 [04:32<00:00,  1.04it/s]
eval iter 8: avg_batch_loss=1.67609 perform=0.40773: 100%|██████████| 123/123 [01:00<00:00,  2.03it/s]
train iter 9: avg_batch_loss=1.49680 perform=0.46880: 100%|██████████| 283/283 [04:17<00:00,  1.10it/s]
eval iter 9: avg_batch_loss=1.65652 perform=0.41511: 100%|██████████| 123/123 [01:01<00:00,  2.01it/s]