ViT Architecture

平丘月初

Tags: python, deep learning

Article link: https://blog.csdn.net/u011994454/article/details/120833034

Vision Transformer

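The input image batch has shape $[N, C, H, W]$, where $C$ is usually 3. To build the sequence input that a Transformer expects, each image is cut into $n = h \times w$ non-overlapping patches of size $p_h \times p_w \times C$, where $h = H / p_h$ and $w = W / p_w$.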
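As a quick worked example of the patch arithmetic above, the sketch below computes the grid size, patch count, and flattened patch length. The helper name `patch_grid` and the 224 x 224 image with 16 x 16 patches are illustrative assumptions, not values from the original post.

```python
def patch_grid(H, W, C, p_h, p_w):
    """Return (h, w, n, patch_dim) for an H x W x C image cut into p_h x p_w patches."""
    if H % p_h or W % p_w:
        raise ValueError("image size must be divisible by the patch size")
    h, w = H // p_h, W // p_w      # patches per column / per row
    n = h * w                      # total number of patches
    patch_dim = p_h * p_w * C      # length of one flattened patch
    return h, w, n, patch_dim


# Hypothetical sizes chosen only for illustration.
h, w, n, patch_dim = patch_grid(H=224, W=224, C=3, p_h=16, p_w=16)
print(h, w, n, patch_dim)  # 14 14 196 768
```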

# reshape and flatten
[N, C, H, W] => [N, h*w, p_h * p_w * C] => [N, h*w, dim]  # h = H // p_h, w = W // p_w; each flattened patch is fed to nn.Linear and mapped to dim dimensions.
# concat cls_tokens and add positional embedding
cls_token = nn.Parameter(torch.randn(1, 1, dim))
cls_tokens = repeat(cls_token, '() n d -> b n d', b=N)  # einops.repeat: one copy of the class token per image
pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
[N, n, dim] => [N, n + 1, dim] => [N, n + 1, dim]  # n = h * w = num_patches; prepend cls_tokens, then add pos_embedding.
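The shape transitions above can be sketched as a small PyTorch module. This is a minimal sketch under stated assumptions: the class name `PatchEmbedding` and the example sizes are hypothetical, and plain `reshape`/`permute`/`expand` stand in for the einops calls so that the block only depends on `torch`.

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """[N, C, H, W] -> [N, n + 1, dim], with n = (H // p_h) * (W // p_w)."""

    def __init__(self, image_size, patch_size, channels, dim):
        super().__init__()
        H, W = image_size
        self.p_h, self.p_w = patch_size
        assert H % self.p_h == 0 and W % self.p_w == 0, "image must divide into patches"
        num_patches = (H // self.p_h) * (W // self.p_w)
        patch_dim = self.p_h * self.p_w * channels
        self.proj = nn.Linear(patch_dim, dim)  # flattened patch -> dim
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))

    def forward(self, x):
        N, C, H, W = x.shape
        h, w = H // self.p_h, W // self.p_w
        # Split H into (h, p_h) and W into (w, p_w), then group each patch's pixels.
        x = x.reshape(N, C, h, self.p_h, w, self.p_w)
        x = x.permute(0, 2, 4, 3, 5, 1)                       # [N, h, w, p_h, p_w, C]
        x = x.reshape(N, h * w, self.p_h * self.p_w * C)      # [N, n, p_h*p_w*C]
        x = self.proj(x)                                      # [N, n, dim]
        cls_tokens = self.cls_token.expand(N, -1, -1)         # [N, 1, dim]
        x = torch.cat([cls_tokens, x], dim=1)                 # [N, n + 1, dim]
        return x + self.pos_embedding                         # broadcast over the batch


# Hypothetical sizes: 32x32 RGB images, 8x8 patches -> n = 16 patches.
embed = PatchEmbedding(image_size=(32, 32), patch_size=(8, 8), channels=3, dim=64)
tokens = embed(torch.randn(2, 3, 32, 32))
print(tokens.shape)  # torch.Size([2, 17, 64])
```

The class token sits at index 0 of the sequence, so later stages can read it back with `tokens[:, 0]`.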

After a Transformer built from several stacked encoding layers has extracted features, the resulting tokens are passed to the MLP head.

[N, n + 1, dim] => [N, num_classes]
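One way to realize the `[N, n + 1, dim] => [N, num_classes]` step is sketched below. The post does not say how the sequence is reduced to one vector per image; this sketch assumes the class token at index 0 is used and that a `LayerNorm` precedes the linear classifier, which is a common ViT choice. The name `MLPHead` and the sizes are hypothetical.

```python
import torch
import torch.nn as nn


class MLPHead(nn.Module):
    """[N, n + 1, dim] -> [N, num_classes], reading only the class token."""

    def __init__(self, dim, num_classes):
        super().__init__()
        self.norm = nn.LayerNorm(dim)          # assumption: normalize before classifying
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, x):
        cls = x[:, 0]                          # [N, dim], class token is at index 0
        return self.fc(self.norm(cls))         # [N, num_classes]


head = MLPHead(dim=64, num_classes=10)
logits = head(torch.randn(2, 17, 64))
print(logits.shape)  # torch.Size([2, 10])
```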

The structure of the Transformer encoding layer is as follows:

encoding layer = MSA + MLP
MSA: Multi-headed Self-Attention
MLP: Multi-Layer Perceptron
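A minimal sketch of one encoding layer, combining MSA and MLP as listed above. The residual connections, the pre-norm `LayerNorm` placement, and the GELU activation are assumptions taken from the usual ViT design, since the post only names the two sub-blocks. `EncodingLayer` and the sizes are hypothetical names and values.

```python
import torch
import torch.nn as nn


class EncodingLayer(nn.Module):
    """One encoder block: MSA followed by MLP, each with a residual connection."""

    def __init__(self, dim, heads, mlp_dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim),
            nn.GELU(),
            nn.Linear(mlp_dim, dim),
        )

    def forward(self, x):
        y = self.norm1(x)
        attn_out, _ = self.msa(y, y, y, need_weights=False)  # self-attention: q = k = v
        x = x + attn_out                                     # residual around MSA
        x = x + self.mlp(self.norm2(x))                      # residual around MLP
        return x                                             # shape unchanged: [N, n + 1, dim]


# Stack a few layers; the depth of 3 is an arbitrary illustrative choice.
encoder = nn.Sequential(*[EncodingLayer(dim=64, heads=4, mlp_dim=128) for _ in range(3)])
out = encoder(torch.randn(2, 17, 64))
print(out.shape)  # torch.Size([2, 17, 64])
```

Because each layer preserves the `[N, n + 1, dim]` shape, any number of them can be chained before the MLP head.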

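The attention module:

Multi-head attention runs several single attention modules (heads) side by side, each extracting its own information, and then concatenates their outputs.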
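The concatenation of heads can be sketched from scratch as follows. Each head is assumed to be scaled dot-product attention, $\mathrm{softmax}(QK^\top / \sqrt{d_k})V$, with a final linear projection after the concat; these details follow the standard Transformer formulation and are not spelled out in the post. The class name `MultiHeadSelfAttention` and the sizes are hypothetical.

```python
import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    """Run `heads` attention modules in parallel and concatenate their outputs."""

    def __init__(self, dim, heads):
        super().__init__()
        assert dim % heads == 0, "dim must be divisible by the number of heads"
        self.heads = heads
        self.head_dim = dim // heads
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)  # q, k, v for all heads at once
        self.to_out = nn.Linear(dim, dim)                  # mixes the concatenated heads

    def forward(self, x):
        N, seq, dim = x.shape
        qkv = self.to_qkv(x).reshape(N, seq, 3, self.heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)               # each [N, heads, seq, head_dim]
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5  # [N, heads, seq, seq]
        attn = scores.softmax(dim=-1)                      # each row sums to 1
        out = attn @ v                                     # [N, heads, seq, head_dim]
        out = out.transpose(1, 2).reshape(N, seq, dim)     # concat heads along features
        return self.to_out(out)


msa = MultiHeadSelfAttention(dim=64, heads=4)
y = msa(torch.randn(2, 17, 64))
print(y.shape)  # torch.Size([2, 17, 64])
```

Computing all heads with one batched matrix product is equivalent to running four separate 16-dimensional attention modules and concatenating their results.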