Transformer Architecture Explained Alongside the Source Code

Chenql716

Article link: https://blog.csdn.net/Chenql716/article/details/123820936

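1. PatchEmbedding

This is the bottom part of Figure 1.

import torch.nn as nn

# The patch embedding at the very bottom of the architecture: a convolution layer splits the large image
# into small patches, which are then fed in as a sequence, much like tokens in text processing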
class PatchEmbed(nn.Module):
    """
    2D Image to Patch Embedding
    """
    # ViT-B uses the default embedding dim of 768
    def __init__(self, img_size=224, patch_size=16, in_c=3, embed_dim=768, norm_layer=None):
        super().__init__()
        img_size = (img_size, img_size)
        patch_size = (patch_size, patch_size)
        self.img_size = img_size
        self.patch_size = patch_size

        # 224 // 16 = 14 patches per side
        self.grid_size = (img_size[0] // patch_size[0], img_size[1] // patch_size[1])
        self.num_patches = self.grid_size[0] * self.grid_size[1]

        self.proj = nn.Conv2d(in_c, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()

    def forward(self, x):
        B, C, H, W = x.shape
        assert H == self.img_size[0] and W == self.img_size[1], \
            f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})."

        # flatten: [B, C, H, W] -> [B, C, HW]  (flatten the spatial dims; after proj, C is embed_dim)
        # transpose: [B, C, HW] -> [B, HW, C]
        x = self.proj(x).flatten(2).transpose(1, 2)
        x = self.norm(x)
        return x
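
To make the shape bookkeeping concrete, here is a minimal sketch in plain Python (no PyTorch needed) of what the strided convolution does geometrically. Because kernel_size equals stride, the image is cut into non-overlapping patches and each patch becomes one token. The helper names patch_grid and split_into_patches are hypothetical and exist only for this illustration; the learned projection to embed_dim is deliberately left out.

```python
def patch_grid(img_size=224, patch_size=16):
    """Return (patches per side, total number of patches) for a square image."""
    per_side = img_size // patch_size
    return per_side, per_side * per_side


def split_into_patches(image, patch_size):
    """Cut a 2D list (H x W) into non-overlapping patches, each flattened row-major.

    This mirrors the receptive fields of a Conv2d whose kernel_size equals its
    stride; the learned linear projection applied to each patch is omitted.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patch = [image[top + i][left + j]
                     for i in range(patch_size)
                     for j in range(patch_size)]
            patches.append(patch)
    return patches


# Defaults from PatchEmbed: 224x224 image, 16x16 patches
per_side, num_patches = patch_grid(224, 16)
print(per_side, num_patches)

# Toy 4x4 "image" with patch_size=2 gives 4 patches of 4 pixels each
toy = [[0, 1, 2, 3],
       [4, 5, 6, 7],
       [8, 9, 10, 11],
       [12, 13, 14, 15]]
toy_patches = split_into_patches(toy, 2)
print(toy_patches)
```

The patches come out row by row across the grid, which is the same order that flatten(2) produces in PatchEmbed.forward.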

2. Multi-head Attention

This is the multi-head attention in the encoder of Figure 1, shown in detail in Figure 2.

# Module implementing multi-head self-attention
class Attention(nn.Module):
    def __init__(self,
                 dim,   # dimension of each input token
                 num_heads=8,
                 qkv_bias=False,
                 qk_scale=None,
                 attn_drop_ratio=0.,
                 proj_drop_ratio=0.):
        super(Attention, self).__init__()
        self.num_heads = num_heads

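        # With multi-head attention, each head gets dim // num_heads channels
        head_dim = dim // num_heads
        # The 1 / sqrt(d_k) scaling factor from the attention formula
        self.scale = qk_scale or head_dim ** -0.5
        # q, k and v are all produced by a single fully connected layer
        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        # Dropout applied to the attention weights
        self.attn_drop = nn.Dropout(attn_drop_ratio)
        # W_O in the attention formula (the output projection)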
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop_ratio)

    def forward(self, x):
        # x.shape = [batch_size, num_patches + 1, total_embed_dim]
        # num_patches = 14 * 14 = 196, plus 1 class token, so N = 197
        B, N, C = x.shape

        # qkv(): -> [batch_size, num_patches + 1, 3 * total_embed_dim]
        # reshape: -> [batch_size, num_patches + 1, 3, num_heads, embed_dim_per_head]
        # permute: -> [3, batch_size, num_heads, num_patches + 1, embed_dim_per_head]
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        # [batch_size, num_heads, num_patches + 1, embed_dim_per_head]
        q, k, v = qkv[0], qkv[1], qkv[2]  # make torchscript happy (cannot use tensor as tuple)

        # transpose: -> [batch_size, num_heads, embed_dim_per_head, num_patches + 1]
        # @: multiply -> [batch_size, num_heads, num_patches + 1, num_patches + 1]
        attn = (q @ k.transpose(-2, -1)) * self.scale

        # Apply softmax along the last dimension, so each query's weights over all keys sum to 1
        attn = attn.softmax(dim=-1)
        # Apply dropout to the attention weights
        attn = self.attn_drop(attn)

        # @: multiply -> [batch_size, num_heads, num_patches + 1, embed_dim_per_head]
        # transpose: -> [batch_size, num_patches + 1, num_heads, embed_dim_per_head]
        # reshape: -> [batch_size, num_patches + 1, total_embed_dim]
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x
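
The core of the forward pass above is scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, computed independently per head. As a rough illustration, the sketch below works through a single head in plain Python on tiny lists. The function names softmax and single_head_attention and the toy inputs are made up for this example; dropout, the qkv projection and the output projection are not included.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(q, k, v):
    """Compute softmax(q @ k^T / sqrt(d)) @ v for inputs shaped [N, d]."""
    d = len(q[0])
    scale = d ** -0.5  # same role as self.scale in the Attention module
    weights = []
    for q_row in q:
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        weights.append(softmax(scores))
    out = [[sum(w * v_row[j] for w, v_row in zip(w_row, v))
            for j in range(len(v[0]))]
           for w_row in weights]
    return out, weights


# Three tokens with head_dim = 2. All keys are identical, so every query
# attends uniformly and each output row is the mean of the value rows.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = single_head_attention(q, k, v)
print(weights)
print(out)
```

In the module itself the same computation runs for all heads at once, which is why q, k and v carry an extra num_heads dimension after the reshape and permute.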

3. MLP module

This is the MLP at the top of Figure 2; Figure 3 shows the structure of that MLP in detail.

class Mlp(nn.Module):
    """
    MLP as used in Vision Transformer, MLP-Mixer and related networks
    """
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x
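
The following is a minimal sketch, in plain Python, of the fc1 -> GELU -> fc2 path that Mlp.forward follows when dropout is inactive (drop=0. or eval mode). The helpers gelu, linear and mlp_forward and all weights are invented for illustration; gelu uses the exact erf-based definition, which is the default behaviour of nn.GELU.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """y = W x + b, with weight shaped [out_features][in_features]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def mlp_forward(x, w1, b1, w2, b2):
    """fc1 -> GELU -> fc2; dropout is skipped, as with drop=0. or eval mode."""
    hidden = [gelu(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


# Toy sizes: in_features=2, hidden_features=4, out_features=2 (made-up weights)
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
b2 = [0.0, 0.0]
y = mlp_forward([1.0, 0.0], w1, b1, w2, b2)
print(y)
```

Note that the hidden layer is wider than the input here, mirroring the usual expand-then-project shape of a Transformer MLP; in the Mlp class the width is whatever hidden_features is set to, and it falls back to in_features when omitted.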
