Implementing Vision Transformer with PyTorch

Table of Contents

1. Overall Network Architecture

1.1 Linear Projection of Flattened Patches

1.2 Transformer Encoder

1.2.1 Self-attention

1.2.2 Multi-Head Attention

1.3 MLP Head

2. Implementing Vision Transformer with PyTorch

2.1 Linear Projection of Flattened Patches

2.2 Attention Module

2.3 MLP Block

2.4 Transformer Encoder

2.5 Vision Transformer Network Architecture

3. Training Results


1. Overall Network Architecture

  1. Split the input image into patches. For an input of shape (224, 224, 3) divided into 16×16 patches, this gives (224/16)*(224/16) = 196 patches, each of shape (16, 16, 3).
  2. Feed each patch into the Embedding layer (Linear Projection of Flattened Patches) to obtain one output vector per patch, called a token: (16, 16, 3) -> (768).
  3. Prepend (concat) a new learnable [class] token used for classification (following BERT): cat((1, 768), (196, 768)) -> (197, 768).
  4. Add a Position Embedding to every token (a trainable parameter, corresponding to 0~9 in the figure). Since this is an element-wise add, the Position Embedding has the same shape as the tokens: tokens.shape (197, 768) = Position Embedding.shape (197, 768).
  5. Feed the tokens (including the [class] token) plus the Position Embedding into the Transformer Encoder, take only the output corresponding to the [class] token, and pass it through the MLP Head to obtain the classification result.

In summary, the whole model consists of three parts:

  • Linear Projection of Flattened Patches (the Embedding layer)
  • Transformer Encoder
  • MLP Head (the final layer used for classification)

1.1 Linear Projection of Flattened Patches

A standard Transformer module expects a token sequence as input, i.e. (num_token, token_dim). In code this is implemented with a convolution layer (kernel_size 16×16, stride 16, 768 kernels); the H and W dimensions are then flattened, so the shape changes as (224, 224, 3) -> (14, 14, 768) -> (196, 768).

A trainable parameter is then concatenated (cat) as the [class] token: cat((1, 768), (196, 768)) -> (197, 768).

Finally the Position Embedding is added (element-wise): (197, 768) -> (197, 768).
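
As a quick sanity check of these shape transformations, here is a minimal sketch using toy tensors (an illustration only; the real module in Section 2.1 uses nn.Parameter for the [class] token and the Position Embedding):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)                      # input image, [B, C, H, W]
proj = nn.Conv2d(3, 768, kernel_size=16, stride=16)  # 16×16 conv, stride 16, 768 kernels
patches = proj(x)                                    # [1, 768, 14, 14]
tokens = patches.flatten(2).transpose(1, 2)          # [1, 196, 768]

cls_token = torch.zeros(1, 1, 768)                   # stands in for the trainable [class] token
tokens = torch.cat((cls_token, tokens), dim=1)       # [1, 197, 768]

pos_embed = torch.zeros(1, 197, 768)                 # stands in for the trainable Position Embedding
tokens = tokens + pos_embed                          # element-wise add, shape stays [1, 197, 768]
print(tokens.shape)                                  # torch.Size([1, 197, 768])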


1.2 Transformer Encoder

The input first passes through a Layer Norm and then through a Multi-Head Attention layer. Multi-Head Attention works as follows:


1.2.1 Self-attention

Suppose the inputs are x^{1}–x^{4} in the figure. Each input is first multiplied by a matrix W to obtain the vectors a^{1}–a^{4}. Each of these vectors is then multiplied by W_{q}, W_{k}, W_{v}; taking a^{1} as an example, this gives three different vectors q^{1}, k^{1}, v^{1}.

Next, each q^{i} attends to every k^{j} (i.e. measures how close the two vectors are). Taking q^{1} and k^{1} as an example, the following formula is used:

\alpha_{i,j}=\frac{q^{i}\cdot k^{j}}{\sqrt{d}}

This yields \alpha_{1,1}, and the remaining \alpha_{1,j} are computed in the same way. Here d is the dimension of q and k, and dividing by \sqrt{d} acts as a normalization. The computed \alpha_{1,j} are then passed through a softmax:

\hat{\alpha}_{1,j}=\frac{e^{\alpha_{1,j}}}{\sum_{j}e^{\alpha_{1,j}}}

The softmax yields \hat{\alpha}_{1,j}; each one is multiplied by the corresponding v^{j}, and the four results are summed to obtain b^{1}:

b^{1}=\sum_{j}\hat{\alpha}_{1,j}v^{j}

In the same way, b^{2}, b^{3}, b^{4} can be computed.

The process above can be written in matrix form:

With the input matrix I=\left[ a^{1},a^{2},a^{3},a^{4} \right], multiplying it by W^{q}, W^{k}, W^{v} gives Q, K, V, each column of which is one q, k, v vector. The matrix form of Self-attention then computes all the \alpha_{i,j} in a single product between Q and K, applies softmax, and multiplies the result by V.
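
A minimal PyTorch sketch of this matrix computation, with toy dimensions and with q, k, v laid out as rows of Q, K, V (the transpose of the column layout above); the names d, I, Wq, Wk, Wv are illustrative only:

import torch

d = 64                                    # dimension of q and k
I = torch.randn(4, d)                     # input matrix, one row per a^i (4 tokens)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

Q, K, V = I @ Wq, I @ Wk, I @ Wv          # each row is one q / k / v vector
A = (Q @ K.T) / d ** 0.5                  # alpha_{i,j} = q^i · k^j / sqrt(d)
A_hat = A.softmax(dim=-1)                 # softmax over j for every row i
B = A_hat @ V                             # row i of B is b^i = sum_j alpha_hat_{i,j} v^j
print(B.shape)                            # torch.Size([4, 64])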


1.2.2 Multi-Head Attention

The figure above shows the two-head case: the q^{i} generated from a^{i} is further multiplied by two projection matrices to give q^{i,1} and q^{i,2}, and the same is done for the other q as well as for k and v. Then q^{i,1} attends first to k^{i,1} and then to k^{j,1}; the two resulting \alpha are passed through a softmax to give two \hat{\alpha}, which are multiplied by v^{i,1} and v^{j,1} respectively and summed to obtain b^{i,1}. In the same way b^{i,2} is obtained.

b^{i,1} and b^{i,2} are then concatenated and projected back to the original dimension with a transformation matrix W^{O}.
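
A minimal sketch of the head splitting and the final W^{O} projection, with toy dimensions and 2 heads (the tensor names are illustrative only; Section 2.2 shows the full module):

import torch
import torch.nn as nn

num_heads, head_dim = 2, 32
dim = num_heads * head_dim                # total embedding dim
tokens = torch.randn(1, 197, dim)         # [B, N, dim]

qkv = nn.Linear(dim, dim * 3)(tokens)                             # q, k, v in one shot
q, k, v = qkv.reshape(1, 197, 3, num_heads, head_dim).permute(2, 0, 3, 1, 4)
attn = (q @ k.transpose(-2, -1) / head_dim ** 0.5).softmax(dim=-1)
b = (attn @ v).transpose(1, 2).reshape(1, 197, dim)               # concat the per-head outputs
out = nn.Linear(dim, dim)(b)                                      # multiply by W_O
print(out.shape)                                                  # torch.Size([1, 197, 64])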


The result then passes through a Dropout (or DropPath) layer and is added to the shortcut branch; it then goes through another Layer Norm, the MLP Block, another Dropout (or DropPath) layer, and is finally added to the shortcut branch again to produce the output. The MLP Block (see Section 2.3) consists of a Linear layer that expands the dimension by a factor of 4, a GELU activation, Dropout, a Linear layer that restores the original dimension, and another Dropout.


1.3 MLP Head

When the dataset is very large, this module consists of Linear + tanh activation + Linear; with a smaller dataset a single Linear layer is enough. Note that only the output corresponding to the [class] token is extracted and fed into the MLP Head.
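
A minimal sketch of the two variants (the names pre_logits and head mirror the code in Section 2.5):

import torch.nn as nn

embed_dim, representation_size, num_classes = 768, 768, 1000

# large dataset: Linear + tanh (Pre-Logits) followed by the classification Linear
pre_logits = nn.Sequential(nn.Linear(embed_dim, representation_size), nn.Tanh())
head = nn.Linear(representation_size, num_classes)

# small dataset: a single Linear is enough
head_small = nn.Linear(embed_dim, num_classes)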

Taking ViT-B/16 as an example, the overall network stacks the Embedding layer, 12 Transformer Encoder blocks, and the MLP Head described above.

The Pre-Logits part is used only when the dataset is large; it consists of a Linear layer followed by a tanh activation.


2. Implementing Vision Transformer with PyTorch

2.1 Linear Projection of Flattened Patches

from collections import OrderedDict
from functools import partial

import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """
    2D Image to Patch Embedding
    """
    # img_size: input image size; patch_size: patch size; in_c: input channels;
    # embed_dim: token dimension (16 × 16 × 3 = 768)
    def __init__(self, img_size=224, patch_size=16, in_c=3, embed_dim=768, norm_layer=None):
        super().__init__()
        img_size = (img_size, img_size)
        patch_size = (patch_size, patch_size)
        self.img_size = img_size
        self.patch_size = patch_size
        self.grid_size = (img_size[0] // patch_size[0], img_size[1] // patch_size[1])
        self.num_patches = self.grid_size[0] * self.grid_size[1]

        self.proj = nn.Conv2d(in_c, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()

    def forward(self, x):
        B, C, H, W = x.shape
        # raise an error if the input height/width does not match the preset size
        assert H == self.img_size[0] and W == self.img_size[1], \
            f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})."

        # flatten: [B, C, H, W] -> [B, C, HW]
        # transpose: [B, C, HW] -> [B, HW, C]
        x = self.proj(x).flatten(2).transpose(1, 2)
        x = self.norm(x)
        return x
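
A quick shape check of this module (usage sketch):

patch_embed = PatchEmbed(img_size=224, patch_size=16, in_c=3, embed_dim=768)
x = torch.randn(2, 3, 224, 224)
print(patch_embed(x).shape)  # torch.Size([2, 196, 768])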

2.2 Attention Module

class Attention(nn.Module):
    def __init__(self,
                 dim,  # dim of the input tokens
                 num_heads=8,  # number of attention heads
                 qkv_bias=False,
                 qk_scale=None,
                 attn_drop_ratio=0.,
                 proj_drop_ratio=0.):
        super(Attention, self).__init__()
        self.num_heads = num_heads
        # after q, k, v are produced, they are split according to num_heads; each head has dim // num_heads channels
        head_dim = dim // num_heads
        # scaling factor 1/sqrt(d), where d is the per-head dimension of q and k
        self.scale = qk_scale or head_dim ** -0.5
        # generate q, k, v with a single fully connected layer
        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop_ratio)
        # the output projection matrix W_O
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop_ratio)

    def forward(self, x):
        # [batch_size, num_patches + 1, total_embed_dim]
        B, N, C = x.shape

        # qkv(): -> [batch_size, num_patches + 1, 3 * total_embed_dim]
        # reshape: -> [batch_size, num_patches + 1, 3, num_heads, embed_dim_per_head]
        # permute: -> [3, batch_size, num_heads, num_patches + 1, embed_dim_per_head]
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        # [batch_size, num_heads, num_patches + 1, embed_dim_per_head]
        q, k, v = qkv[0], qkv[1], qkv[2]  # make torchscript happy (cannot use tensor as tuple)

        # transpose: -> [batch_size, num_heads, embed_dim_per_head, num_patches + 1]
        # @: multiply -> [batch_size, num_heads, num_patches + 1, num_patches + 1]
        # @ is matrix multiplication; multiplying by self.scale divides by sqrt(d)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        # @: multiply -> [batch_size, num_heads, num_patches + 1, embed_dim_per_head]
        # transpose: -> [batch_size, num_patches + 1, num_heads, embed_dim_per_head]
        # reshape: -> [batch_size, num_patches + 1, total_embed_dim]
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        # multiply by the output projection W_O
        x = self.proj(x)
        x = self.proj_drop(x)
        return x
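
A quick shape check (usage sketch); the module leaves the token shape unchanged:

attn = Attention(dim=768, num_heads=12)
tokens = torch.randn(2, 197, 768)  # [batch_size, num_patches + 1, total_embed_dim]
print(attn(tokens).shape)          # torch.Size([2, 197, 768])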

2.3 MLP Block

class Mlp(nn.Module):
    """
    MLP as used in Vision Transformer, MLP-Mixer and related networks
    """
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x
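
In ViT-B/16 the hidden dimension is 4 × 768 = 3072, so the block maps 768 -> 3072 -> 768. A quick check (usage sketch):

mlp = Mlp(in_features=768, hidden_features=3072)
print(mlp(torch.randn(2, 197, 768)).shape)  # torch.Size([2, 197, 768])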

2.4 Transformer Encoder

class Block(nn.Module):
    def __init__(self,
                 dim,
                 num_heads,
                 mlp_ratio=4.,  # expansion ratio of the first Linear layer in the MLP block
                 qkv_bias=False,
                 qk_scale=None,
                 drop_ratio=0.,
                 attn_drop_ratio=0.,
                 drop_path_ratio=0.,
                 act_layer=nn.GELU,
                 norm_layer=nn.LayerNorm):
        super(Block, self).__init__()
        self.norm1 = norm_layer(dim)
        self.attn = Attention(dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale,
                              attn_drop_ratio=attn_drop_ratio, proj_drop_ratio=drop_ratio)
        # NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
        self.drop_path = DropPath(drop_path_ratio) if drop_path_ratio > 0. else nn.Identity()
        self.norm2 = norm_layer(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop_ratio)

    def forward(self, x):
        x = x + self.drop_path(self.attn(self.norm1(x)))
        x = x + self.drop_path(self.mlp(self.norm2(x)))
        return x
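
The Block above relies on a DropPath module (stochastic depth) that is not defined in this post; below is a minimal sketch following the commonly used timm-style implementation (in a complete file it would sit above the Block class):

def drop_path(x, drop_prob: float = 0., training: bool = False):
    # stochastic depth: randomly zero the whole residual branch for some samples in the batch
    if drop_prob == 0. or not training:
        return x
    keep_prob = 1 - drop_prob
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)  # broadcast over all dims except batch
    random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
    random_tensor.floor_()  # binarize to 0 / 1
    return x.div(keep_prob) * random_tensor


class DropPath(nn.Module):
    def __init__(self, drop_prob=None):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        return drop_path(x, self.drop_prob, self.training)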

2.5 Vision Transformer Network Architecture

class VisionTransformer(nn.Module):
    # depth: number of stacked encoder blocks
    def __init__(self, img_size=224, patch_size=16, in_c=3, num_classes=1000,
                 embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0, qkv_bias=True,
                 qk_scale=None, representation_size=None, distilled=False, drop_ratio=0.,
                 attn_drop_ratio=0., drop_path_ratio=0., embed_layer=PatchEmbed, norm_layer=None,
                 act_layer=None):
        super(VisionTransformer, self).__init__()
        self.num_classes = num_classes
        # num_features is overwritten below if a Pre-Logits (representation) layer is used
        self.num_features = self.embed_dim = embed_dim
        # distilled (DeiT-style) adds a second distillation token; plain ViT uses only the [class] token
        self.num_tokens = 2 if distilled else 1
        norm_layer = norm_layer or partial(nn.LayerNorm, eps=1e-6)
        act_layer = act_layer or nn.GELU

        self.patch_embed = embed_layer(img_size=img_size, patch_size=patch_size, in_c=in_c, embed_dim=embed_dim)
        num_patches = self.patch_embed.num_patches

        # trainable [class] token, shape [1, 1, 768] (expanded over the batch in forward)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim)) if distilled else None
        # position embedding, shape [1, 196 + num_tokens, 768]
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + self.num_tokens, embed_dim))
        self.pos_drop = nn.Dropout(p=drop_ratio)

        # the drop_path_ratio differs per encoder block, increasing linearly with depth
        dpr = [x.item() for x in torch.linspace(0, drop_path_ratio, depth)]  # stochastic depth decay rule
        self.blocks = nn.Sequential(*[
            Block(dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias, qk_scale=qk_scale,
                  drop_ratio=drop_ratio, attn_drop_ratio=attn_drop_ratio, drop_path_ratio=dpr[i],
                  norm_layer=norm_layer, act_layer=act_layer)
            for i in range(depth)
        ])
        self.norm = norm_layer(embed_dim)

        # Representation layer: decide whether to add the Pre-Logits block
        if representation_size:
            self.has_logits = True
            self.num_features = representation_size
            self.pre_logits = nn.Sequential(OrderedDict([
                ("fc", nn.Linear(embed_dim, representation_size)),
                ("act", nn.Tanh())
            ]))
        else:
            self.has_logits = False
            self.pre_logits = nn.Identity()

        # Classifier head(s)
        self.head = nn.Linear(self.num_features, num_classes) if num_classes > 0 else nn.Identity()
        self.head_dist = None
        if distilled:
            self.head_dist = nn.Linear(self.embed_dim, self.num_classes) if num_classes > 0 else nn.Identity()

        # Weight init
        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        if self.dist_token is not None:
            nn.init.trunc_normal_(self.dist_token, std=0.02)

        nn.init.trunc_normal_(self.cls_token, std=0.02)
        self.apply(_init_vit_weights)

    def forward_features(self, x):
        # [B, C, H, W] -> [B, num_patches, embed_dim]
        x = self.patch_embed(x)  # [B, 196, 768]
        # [1, 1, 768] -> [B, 1, 768]
        cls_token = self.cls_token.expand(x.shape[0], -1, -1)
        if self.dist_token is None:
            x = torch.cat((cls_token, x), dim=1)  # [B, 197, 768]
        else:
            x = torch.cat((cls_token, self.dist_token.expand(x.shape[0], -1, -1), x), dim=1)

        x = self.pos_drop(x + self.pos_embed)
        x = self.blocks(x)
        x = self.norm(x)
        if self.dist_token is None:
            return self.pre_logits(x[:, 0])  # take only the [class] token output
        else:
            return x[:, 0], x[:, 1]

    def forward(self, x):
        x = self.forward_features(x)
        if self.head_dist is not None:
            x, x_dist = self.head(x[0]), self.head_dist(x[1])
            if self.training and not torch.jit.is_scripting():
                # during training, return both classifier predictions
                return x, x_dist
            else:
                # during inference, return the average of both classifier predictions
                return (x + x_dist) / 2
        else:
            x = self.head(x)
        return x
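
The class also calls _init_vit_weights, which is not shown in the post; below is a minimal sketch of a typical ViT weight-initialization helper, plus a hypothetical vit_base_patch16_224 factory and a quick forward shape check:

def _init_vit_weights(m):
    # ViT-style init: truncated normal for Linear, kaiming for Conv2d, standard LayerNorm init
    if isinstance(m, nn.Linear):
        nn.init.trunc_normal_(m.weight, std=.01)
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode="fan_out")
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.LayerNorm):
        nn.init.zeros_(m.bias)
        nn.init.ones_(m.weight)


def vit_base_patch16_224(num_classes=1000, has_logits=False):
    # ViT-B/16: embed_dim 768, depth 12, num_heads 12; Pre-Logits only when has_logits is True
    return VisionTransformer(img_size=224, patch_size=16, embed_dim=768, depth=12, num_heads=12,
                             representation_size=768 if has_logits else None,
                             num_classes=num_classes)


model = vit_base_patch16_224(num_classes=5)
out = model(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 5])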

3. Training Results

Using the base model with pre-trained weights on a self-made dataset, trained for 30 epochs, the results are as follows:

[train epoch 27] loss: 0.109, acc: 0.962: 100%|██████████| 308/308 [02:00<00:00,  2.55it/s]
[valid epoch 27] loss: 0.091, acc: 0.974: 100%|██████████| 77/77 [00:25<00:00,  3.02it/s]
[train epoch 28] loss: 0.109, acc: 0.963: 100%|██████████| 308/308 [02:03<00:00,  2.50it/s]
[valid epoch 28] loss: 0.091, acc: 0.973: 100%|██████████| 77/77 [00:26<00:00,  2.95it/s]
[train epoch 29] loss: 0.113, acc: 0.962: 100%|██████████| 308/308 [02:06<00:00,  2.44it/s]
[valid epoch 29] loss: 0.091, acc: 0.975: 100%|██████████| 77/77 [00:26<00:00,  2.93it/s]