Vision Transformer Source Code Explained


Preface

This article walks through a PyTorch implementation of the Vision Transformer (ViT) and the implementation details behind the code.

I. Model Architecture

(Figure: overall Vision Transformer model architecture)

The overall idea is to convert the image into a sequence of patch tokens, prepend a classification token (class_token), and add positional information. The sequence then passes through a stack of Transformer Encoder layers, during which the class_token aggregates the features of all image patches; finally, the class_token is fed to a multi-layer perceptron (MLP) head, which outputs the classification result.
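Before looking at the code, here is a minimal shape trace of that pipeline (a sketch, assuming the default hyper-parameters used throughout this article: 224x224 RGB input, 16x16 patches, embedding dimension 768, 1000 classes):

image_size, patch_size, embed_dim, num_classes = 224, 16, 768, 1000

n_patches = (image_size // patch_size) ** 2   # 196 patches of 16 x 16 pixels
seq_len = n_patches + 1                       # 197 tokens after prepending the class_token

print((1, 3, image_size, image_size))   # input image:            (1, 3, 224, 224)
print((1, seq_len, embed_dim))          # encoder input / output: (1, 197, 768)
print((1, num_classes))                 # classifier output:      (1, 1000)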

II. Full Code

import numpy as np
import torch
import torch.nn as nn


class Vit(nn.Module):
    def __init__(self,
                 batch_size=1,
                 image_size=224,
                 patch_size=16,
                 in_channels=3,
                 embed_dim=768,
                 num_classes=1000,
                 depth=12,
                 num_heads=12,
                 mlp_ratio=4,
                 dropout=0,
                 ):
        super(Vit, self).__init__()

        self.patch_embedding = PatchEmbedding(batch_size, image_size, patch_size, in_channels, embed_dim, dropout)

        self.encoder = Encoder(batch_size, embed_dim, num_heads, mlp_ratio, dropout, depth)

        self.classifier = Classification(embed_dim,num_classes,dropout)

    def forward(self, x):
        x = self.patch_embedding(x)
        x = self.encoder(x)
        x = self.classifier(x)
        return x


class PatchEmbedding(nn.Module):
    def __init__(self, batch_size, image_size, patch_size, in_channels, embed_dim, dropout):
        super(PatchEmbedding, self).__init__()
        n_patches = (image_size // patch_size) ** 2
        self.conv1 = nn.Conv2d(in_channels, embed_dim, patch_size, patch_size)
        self.dropout = nn.Dropout(dropout)
        # register the class token and position embedding as learnable parameters,
        # so they are trained and moved to the right device together with the model
        self.class_token = nn.Parameter(torch.randn(batch_size, 1, embed_dim))
        self.position = nn.Parameter(torch.randn(batch_size, n_patches + 1, embed_dim))

    def forward(self, x):
        x = self.conv1(x)      # (batch, in_channels, H, W) -> (batch, embed_dim, H/patch_size, W/patch_size), e.g. (1, 768, 14, 14)
        x = x.flatten(2)       # (batch, embed_dim, n_patches), e.g. (1, 768, 196)
        x = x.transpose(1, 2)  # (batch, n_patches, embed_dim), e.g. (1, 196, 768)
        x = torch.concat((self.class_token, x), dim=1)  # prepend the class token -> (1, 197, 768)
        x = x + self.position  # add the position embedding
        x = self.dropout(x)
        return x


class Encoder(nn.Module):
    def __init__(self,
                 batch_size,
                 embed_dim,
                 num_heads,
                 mlp_ratio,
                 dropout,
                 depth):
        super(Encoder, self).__init__()
        layer_list = []
        for i in range(depth):
            encoder_layer = EncoderLayer(batch_size,
                                         embed_dim,
                                         num_heads,
                                         mlp_ratio,
                                         dropout,
                                         )
            layer_list.append(encoder_layer)
        self.layer = nn.Sequential(*layer_list)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        for layer in self.layer:
            x = layer(x)
        x = self.norm(x)
        return x


class EncoderLayer(nn.Module):
    def __init__(self,
                 batch_size,
                 embed_dim,
                 num_heads,
                 mlp_ratio,
                 dropout,
                 ):
        super(EncoderLayer, self).__init__()

        self.attn_norm = nn.LayerNorm(embed_dim, eps=1e-6)
        self.attn = Attention(batch_size,
                              embed_dim,
                              num_heads,
                              )
        self.mlp_norm = nn.LayerNorm(embed_dim, eps=1e-6)
        self.mlp = Mlp(embed_dim, mlp_ratio, dropout)

    def forward(self, x):
        h = x
        x = self.attn_norm(x)
        x = self.attn(x)
        x = x + h

        h = x
        x = self.mlp_norm(x)
        x = self.mlp(x)
        x = x + h
        return x


class Attention(nn.Module):
    def __init__(self,
                 batch_size,
                 embed_dim,
                 num_heads,
                 ):
        super(Attention, self).__init__()
        self.qkv = embed_dim // num_heads  # dimension of each attention head
        self.batch_size = batch_size
        self.num_heads = num_heads
        self.W_Q = nn.Linear(embed_dim, embed_dim)
        self.W_K = nn.Linear(embed_dim, embed_dim)
        self.W_V = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        Q = self.W_Q(x).view(self.batch_size, -1, self.num_heads, self.qkv).transpose(1, 2)
        K = self.W_K(x).view(self.batch_size, -1, self.num_heads, self.qkv).transpose(1, 2)  # (1,12,197,64)
        V = self.W_V(x).view(self.batch_size, -1, self.num_heads, self.qkv).transpose(1, 2)  # (batch, num_heads, length, head_dim)
        att_result = CalculationAttention()(Q, K, V, self.qkv)  # (batch,num_heads,length,qkv)
        att_result = att_result.transpose(1, 2).flatten(2)  # (1,197,768)
        return att_result


class CalculationAttention(nn.Module):
    def __init__(self,
                 ):
        super(CalculationAttention, self).__init__()

    def forward(self, Q, K, V, qkv):
        score = torch.matmul(Q, K.transpose(2, 3)) / (np.sqrt(qkv))
        score = nn.Softmax(dim=-1)(score)
        score = torch.matmul(score, V)
        return score


class Mlp(nn.Module):
    def __init__(self,
                 embed_dim,
                 mlp_ratio,
                 dropout):
        super(Mlp, self).__init__()
        self.fc1 = nn.Linear(embed_dim,embed_dim*mlp_ratio)
        self.fc2 = nn.Linear(embed_dim*mlp_ratio,embed_dim)
        self.actlayer = nn.GELU()
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self,x):
        x = self.fc1(x)
        x = self.actlayer(x)
        x = self.dropout1(x)
        x = self.fc2(x)
        x = self.dropout2(x)
        return x


class Classification(nn.Module):
    def __init__(self,embed_dim,num_class,dropout):
        super(Classification, self).__init__()
        self.fc1 = nn.Linear(embed_dim,embed_dim)
        self.fc2 = nn.Linear(embed_dim,num_class)
        self.relu = nn.ReLU(True)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self,x):
        x = x[:,0]
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout1(x)
        x = self.fc2(x)
        x = self.dropout2(x)
        return x

def main():
    ins = torch.randn((1, 3, 224, 224))
    vitmodel = Vit()
    out = vitmodel(ins)
    print(out.shape)


if __name__ == '__main__':
    main()

III. Module-by-Module Code Walkthrough

1. The Vit() class

class Vit(nn.Module):
    def __init__(self,
                 batch_size=1,     # batch size
                 image_size=224,   # input image size
                 patch_size=16,    # patch (and convolution kernel) size; each patch_size x patch_size block becomes one token of the sequence
                 in_channels=3,    # number of input channels
                 embed_dim=768,    # embedding dimension, i.e. the number of convolution kernels
                 num_classes=1000, # number of output classes
                 depth=12,         # number of stacked EncoderLayer blocks
                 num_heads=12,     # number of heads in multi-head self-attention
                 mlp_ratio=4,      # expansion ratio of the hidden layer in the MLP
                 dropout=0,        # dropout probability
                 ):
        super(Vit, self).__init__()

        self.patch_embedding = PatchEmbedding(batch_size, image_size, patch_size, in_channels, embed_dim, dropout)

        self.encoder = Encoder(batch_size, embed_dim, num_heads, mlp_ratio, dropout, depth)

        self.classifier = Classification(embed_dim,num_classes,dropout)

    def forward(self, x):
        x = self.patch_embedding(x)
        x = self.encoder(x)
        x = self.classifier(x)
        return x

The Vision Transformer consists of three parts: a PatchEmbedding layer, a stack of Transformer Encoder layers, and a classification head (Classifier).
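As a quick sanity check (a sketch, assuming the classes defined in this article are available in the current scope), the model can be built with its defaults, its parameter count inspected, and a dummy forward pass run:

import torch

model = Vit()  # default ViT-Base-style configuration used in this article

# total number of trainable parameters
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.1f}M")

# one dummy forward pass
out = model(torch.randn(1, 3, 224, 224))
print(out.shape)  # expected: torch.Size([1, 1000])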

2. The PatchEmbedding() class

class PatchEmbedding(nn.Module):
    def __init__(self, batch_size, image_size, patch_size, in_channels, embed_dim, dropout):
        super(PatchEmbedding, self).__init__()
        n_patches = (image_size // patch_size) ** 2
        self.conv1 = nn.Conv2d(in_channels, embed_dim, patch_size, patch_size)
        self.dropout = nn.Dropout(dropout)
        # register the class token and position embedding as learnable parameters,
        # so they are trained and moved to the right device together with the model
        self.class_token = nn.Parameter(torch.randn(batch_size, 1, embed_dim))
        self.position = nn.Parameter(torch.randn(batch_size, n_patches + 1, embed_dim))

    def forward(self, x):
        x = self.conv1(x)      # (batch, in_channels, H, W) -> (batch, embed_dim, H/patch_size, W/patch_size), e.g. (1, 768, 14, 14)
        x = x.flatten(2)       # (batch, embed_dim, n_patches), e.g. (1, 768, 196)
        x = x.transpose(1, 2)  # (batch, n_patches, embed_dim), e.g. (1, 196, 768)
        x = torch.concat((self.class_token, x), dim=1)  # prepend the class token -> (1, 197, 768)
        x = x + self.position  # add the position embedding -> (1, 197, 768)
        x = self.dropout(x)    # (1, 197, 768)
        return x

PatchEmbedding uses 768 convolution kernels of size 16x16 with stride 16 to turn the input [1, 3, 224, 224] into [1, 768, 14, 14]. flatten() then flattens the last two dimensions to give [1, 768, 196], transpose() swaps the last two dimensions to [1, 196, 768], and concat() prepends the class_token to give [1, 197, 768]. Finally, the randomly initialized (and learnable) position embedding is added.
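The shape bookkeeping above can be verified with a small standalone test (a sketch, assuming the PatchEmbedding class above is defined):

import torch

patch_embed = PatchEmbedding(batch_size=1, image_size=224, patch_size=16,
                             in_channels=3, embed_dim=768, dropout=0)
img = torch.randn(1, 3, 224, 224)      # one dummy RGB image
tokens = patch_embed(img)
print(tokens.shape)                    # expected: torch.Size([1, 197, 768])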

3. The Encoder() class

class Encoder(nn.Module):
    def __init__(self,
                 batch_size,
                 embed_dim,
                 num_heads,
                 mlp_ratio,
                 dropout,
                 depth):
        super(Encoder, self).__init__()
        layer_list = []
        for i in range(depth):
            encoder_layer = EncoderLayer(batch_size,
                                         embed_dim,
                                         num_heads,
                                         mlp_ratio,
                                         dropout,
                                         )
            layer_list.append(encoder_layer)
        self.layer = nn.Sequential(*layer_list)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        for layer in self.layer:
            x = layer(x)
        x = self.norm(x)
        return x


class EncoderLayer(nn.Module):
    def __init__(self,
                 batch_size,  
                 embed_dim,   
                 num_heads,   
                 mlp_ratio,
                 dropout,
                 ):
        super(EncoderLayer, self).__init__()

        self.attn_norm = nn.LayerNorm(embed_dim, eps=1e-6)
        self.attn = Attention(batch_size,
                              embed_dim,
                              num_heads,
                              )
        self.mlp_norm = nn.LayerNorm(embed_dim, eps=1e-6)
        self.mlp = Mlp(embed_dim, mlp_ratio, dropout)

    def forward(self, x):
        residual = x       # keep the input for the residual connection
        x = self.attn_norm(x)
        x = self.attn(x)
        x = x + residual

        residual = x       # keep the input for the residual connection
        x = self.mlp_norm(x)
        x = self.mlp(x)
        x = x + residual
        return x

nn.Sequential(*layer_list) unpacks layer_list so that each EncoderLayer is passed as a separate argument and registered as a submodule; iterating over self.layer in forward() then applies the depth encoder layers one after another.
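A toy example (not part of the ViT code above) of what the * unpacking does; an nn.ModuleList would also work here, since Encoder.forward() iterates over the layers manually:

import torch.nn as nn

layers = [nn.Linear(8, 8) for _ in range(3)]

seq_a = nn.Sequential(*layers)                            # unpack the list into separate arguments
seq_b = nn.Sequential(layers[0], layers[1], layers[2])    # equivalent, written out by hand
mlist = nn.ModuleList(layers)                             # alternative container for manual iteration

print(len(seq_a), len(mlist))  # 3 3 -- all three layers are registered as submodules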

4. The Attention() class

class Attention(nn.Module):
    def __init__(self,
                 batch_size,
                 embed_dim,
                 num_heads,
                 ):
        super(Attention, self).__init__()
        self.qkv = embed_dim // num_heads  # dimension of each attention head
        self.batch_size = batch_size
        self.num_heads = num_heads
        self.W_Q = nn.Linear(embed_dim, embed_dim)
        self.W_K = nn.Linear(embed_dim, embed_dim)
        self.W_V = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        Q = self.W_Q(x).view(self.batch_size, -1, self.num_heads, self.qkv).transpose(1, 2)
        K = self.W_K(x).view(self.batch_size, -1, self.num_heads, self.qkv).transpose(1, 2)  # (1,12,197,64)
        V = self.W_V(x).view(self.batch_size, -1, self.num_heads, self.qkv).transpose(1, 2)  # (batch, num_heads, length, head_dim)
        att_result = CalculationAttention()(Q, K, V, self.qkv)  # (batch,num_heads,length,qkv)
        att_result = att_result.transpose(1, 2).flatten(2)  # (1,197,768)
        return att_result


class CalculationAttention(nn.Module):
    def __init__(self,
                 ):
        super(CalculationAttention, self).__init__()

    def forward(self, Q, K, V, qkv):
        score = torch.matmul(Q, K.transpose(2, 3)) / (np.sqrt(qkv))
        score = nn.Softmax(dim=-1)(score)
        score = torch.matmul(score, V)
        return score

The Attention() class produces the Q, K, and V matrices, and the CalculationAttention() class performs the actual attention computation. Q, K, and V are obtained by multiplying x with the learnable projection matrices W_Q, W_K, and W_V, each implemented as an nn.Linear() layer, and then splitting the result across num_heads heads. Note that this version omits the output projection that the reference ViT applies after the heads are concatenated back together.
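For readers on PyTorch 2.0 or newer, the hand-written computation can be cross-checked against the built-in scaled dot-product attention (a sketch, assuming the CalculationAttention class above is defined; shapes match the (1, 12, 197, 64) example in the comments):

import torch
import torch.nn.functional as F

Q = torch.randn(1, 12, 197, 64)   # (batch, num_heads, length, head_dim)
K = torch.randn(1, 12, 197, 64)
V = torch.randn(1, 12, 197, 64)

manual = CalculationAttention()(Q, K, V, qkv=64)          # softmax(QK^T / sqrt(64)) V
builtin = F.scaled_dot_product_attention(Q, K, V)         # requires PyTorch >= 2.0
print(torch.allclose(manual, builtin, atol=1e-5))         # expected: True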

5. The Mlp() class

class Mlp(nn.Module):
    def __init__(self,
                 embed_dim,
                 mlp_ratio,
                 dropout):
        super(Mlp, self).__init__()
        self.fc1 = nn.Linear(embed_dim,embed_dim*mlp_ratio)
        self.fc2 = nn.Linear(embed_dim*mlp_ratio,embed_dim)
        self.actlayer = nn.GELU()  # GELU activation, the standard choice in Transformer MLP blocks
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self,x):
        x = self.fc1(x)
        x = self.actlayer(x)
        x = self.dropout1(x)
        x = self.fc2(x)
        x = self.dropout2(x)
        return x

The multi-layer perceptron is a stack of linear mappings: GELU() adds non-linearity, and Dropout() helps prevent overfitting.
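A quick check (a sketch, assuming the Mlp class above is defined) confirms that the hidden layer expands the embedding by mlp_ratio and that the output shape matches the input shape, so the block fits inside every encoder layer:

import torch

mlp = Mlp(embed_dim=768, mlp_ratio=4, dropout=0)
x = torch.randn(1, 197, 768)

print(mlp.fc1.out_features)   # 3072 = 768 * 4, the expanded hidden dimension
print(mlp(x).shape)           # expected: torch.Size([1, 197, 768])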

6. The Classification() class

class Classification(nn.Module):
    def __init__(self,embed_dim,num_class,dropout):
        super(Classification, self).__init__()
        self.fc1 = nn.Linear(embed_dim,embed_dim)
        self.fc2 = nn.Linear(embed_dim,num_class)
        self.relu = nn.ReLU(True)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self,x):
        x = x[:,0]        # take the class_token (position 0) as the input to the final classifier
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout1(x)
        x = self.fc2(x)
        x = self.dropout2(x)
        return x

The classification head is essentially another multi-layer perceptron, similar to the Mlp block. The key point is that, in the forward pass, only the class_token that was prepended at the very beginning (x[:, 0]) is used for the final classification decision.
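The class-token selection is easy to see on toy tensors (a sketch, assuming the Classification class above is defined):

import torch

x = torch.randn(1, 197, 768)             # output of the encoder
print(x[:, 0].shape)                     # torch.Size([1, 768]) -- the class_token only

head = Classification(embed_dim=768, num_class=1000, dropout=0)
print(head(x).shape)                     # expected: torch.Size([1, 1000]) -- class logits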

7. Data-flow diagram

(Figure: data-flow diagram of the tensor shapes through the model)

IV. Summary

This article focused on a PyTorch implementation of the Vision Transformer. Next, I plan to reproduce more advanced Vision Transformer variants. Questions and ideas are welcome.

