GitHub: vit-pytorch Study Notes - Vision Classification - Part 1

GitHub - lucidrains/vit-pytorch: Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch

An implementation of the Vision Transformer, which achieves SOTA in vision classification with only a single transformer encoder.

It doesn't involve much code, so using it as a base for experiments can help accelerate the attention revolution. (It feels a bit like a bundled toolkit?)

For experiments based on pretrained models, see here!

1. Installing vit-pytorch

pip install vit-pytorch

2. Usage

import torch
from vit_pytorch import ViT

v = ViT(
    image_size = 256,     
    patch_size = 32,      
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

img = torch.randn(1, 3, 256, 256)

preds = v(img) # (1, 1000)
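
Since the model is a regular torch.nn.Module, a minimal training step looks like the sketch below (the optimizer, learning rate, and random batch are illustrative stand-ins, not from the repo):

import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(v.parameters(), lr = 3e-4)  # hypothetical learning rate

imgs = torch.randn(8, 3, 256, 256)      # stand-in batch of 8 images
labels = torch.randint(0, 1000, (8,))   # stand-in integer class labels

logits = v(imgs)                        # (8, 1000)
loss = F.cross_entropy(logits, labels)  # standard classification loss
loss.backward()
optimizer.step()
optimizer.zero_grad()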

3. Parameters

  • image_size: int. If the image is rectangular, make sure this is the maximum of the width and height.

  • patch_size: int. Size of each patch. image_size must be divisible by patch_size; the number of patches is n = (image_size // patch_size) ** 2, and n must be greater than 16 (see the quick check after this list).

  • num_classes: int. Number of classes to classify. (Note to self: pay attention to this parameter.)

  • dim: int. Last dimension of the output tensor after the linear transformation nn.Linear(..., dim).

  • depth: int. Number of Transformer blocks. (Q: what exactly is a Transformer block?)

  • heads: int. Number of heads in the multi-head attention layers.

  • mlp_dim: int. Dimension of the MLP (feed-forward) layer.

  • channels: int, default 3 (RGB). Number of image channels.

  • dropout: float in [0, 1], default 0. Dropout rate.

  • emb_dropout: float in [0, 1], default 0. Embedding dropout rate.

  • pool: string, either 'cls' (CLS token pooling) or 'mean' (mean pooling).
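
As a quick check of the patch-count rule above (a standalone sketch, not repo code):

image_size, patch_size = 256, 32
assert image_size % patch_size == 0   # the image must divide evenly into patches
n = (image_size // patch_size) ** 2   # 8 * 8 = 64 patches
assert n > 16                         # the README requires more than 16 patches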

4. SimpleViT

SimpleViT uses a 2-D sinusoidal positional embedding, global average pooling (no CLS token), no dropout, a batch size of 1024 rather than 4096, and RandAugment and MixUp augmentations. They also show that a simple linear head at the end performs no differently from the original MLP head.

Paper

import torch
from vit_pytorch import SimpleViT

v = SimpleViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048
)

img = torch.randn(1, 3, 256, 256)

preds = v(img) # (1, 1000)
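
Dropping the CLS token just means the head consumes the mean over patch tokens; a short sketch of that pooling:

import torch

tokens = torch.randn(1, 64, 1024)   # (batch, num_patches, dim)
pooled = tokens.mean(dim = 1)       # global average pooling -> (1, 1024)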

5. Distillation

Using a distillation token to distill knowledge from a convolutional network into a vision transformer can yield small and efficient vision transformers. This repository offers an easy way to do distillation.

e.g. distilling from ResNet50 (or any teacher) into a vision transformer:

import torch
from torchvision.models import resnet50

from vit_pytorch.distill import DistillableViT, DistillWrapper

teacher = resnet50(pretrained = True)

v = DistillableViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 8,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

distiller = DistillWrapper(
    student = v,
    teacher = teacher,
    temperature = 3,           # temperature of distillation
    alpha = 0.5,               # trade between main loss and distillation loss
    hard = False               # whether to use soft or hard distillation
)

img = torch.randn(2, 3, 256, 256)
labels = torch.randint(0, 1000, (2,))

loss = distiller(img, labels)
loss.backward()

# after lots of training above ...

pred = v(img) # (2, 1000)

The DistillableViT class is identical to ViT except for how the forward pass is handled, so you should be able to load the parameters back into a ViT once distillation training is done.
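
For example, a sketch of moving the weights over via state_dict (the file path is hypothetical; per the note above the parameters should match):

torch.save(v.state_dict(), './distilled-vit.pt')   # hypothetical path

from vit_pytorch import ViT

vit = ViT(
    image_size = 256, patch_size = 32, num_classes = 1000,
    dim = 1024, depth = 6, heads = 8, mlp_dim = 2048,
    dropout = 0.1, emb_dropout = 0.1
)
vit.load_state_dict(torch.load('./distilled-vit.pt'))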

You can also use the handy .to_vit method on a DistillableViT instance to get back a ViT instance:

v = v.to_vit()
type(v) # <class 'vit_pytorch.vit_pytorch.ViT'>

6. DeepViT

This paper notes that ViT struggles to attend at greater depths (past 12 layers) and proposes mixing the attention of each head post-softmax as a solution, dubbed Re-attention. The results are consistent with the Talking Heads paper from NLP.

import torch
from vit_pytorch.deepvit import DeepViT

v = DeepViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

img = torch.randn(1, 3, 256, 256)

preds = v(img) # (1, 1000)
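
For intuition, Re-attention mixes the per-head attention maps post-softmax with a learned matrix; a minimal sketch (an illustration, not the repo's exact module):

import torch
import torch.nn as nn

heads, n = 16, 64
theta = nn.Parameter(torch.randn(heads, heads))               # learned head-mixing matrix

attn = torch.randn(1, heads, n, n).softmax(dim = -1)          # per-head attention maps
reattn = torch.einsum('h g, b g i j -> b h i j', theta, attn) # mix heads after the softmax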

7. CaiT

This paper also points out the difficulty of training vision transformers at greater depths and proposes two solutions. First, it does per-channel multiplication of the output of each residual block. Second, it has the patches attend only to one another, and allows the CLS token to attend to the patches only in the last few layers. They also add Talking Heads, noting improvements.

import torch
from vit_pytorch.cait import CaiT

v = CaiT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 12,             # depth of transformer for patch to patch attention only
    cls_depth = 2,          # depth of cross attention of CLS tokens to patch
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1,
    layer_dropout = 0.05    # randomly dropout 5% of the layers
)

img = torch.randn(1, 3, 256, 256)

preds = v(img) # (1, 1000)
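
For intuition, the first fix (per-channel multiplication of the residual branch output) can be sketched as follows; the 1e-4 init is an assumption for illustration:

import torch
import torch.nn as nn

dim = 1024
gamma = nn.Parameter(1e-4 * torch.ones(dim))   # small learned per-channel scale

x = torch.randn(1, 64, dim)                    # (batch, tokens, dim)
block_out = torch.randn(1, 64, dim)            # stand-in for a residual branch output
x = x + gamma * block_out                      # per-channel multiplication before the add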

8. Token-to-Token ViT

This paper proposes that the first couple of layers downsample the image sequence by unfolding, so that the image data in each token overlaps (as shown in the figure in the original repo).

import torch
from vit_pytorch.t2t import T2TViT

v = T2TViT(
    dim = 512,
    image_size = 224,
    depth = 5,
    heads = 8,
    mlp_dim = 512,
    num_classes = 1000,
    t2t_layers = ((7, 4), (3, 2), (3, 2)) # tuples of the kernel size and stride of each consecutive layer of the initial token-to-token module
)

img = torch.randn(1, 3, 224, 224)

preds = v(img) # (1, 1000)
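
The unfolding step can be seen directly with torch.nn.Unfold; the kernel and stride below mirror the first t2t layer above, while the padding is an assumption for illustration:

import torch
import torch.nn as nn

unfold = nn.Unfold(kernel_size = 7, stride = 4, padding = 3)

img = torch.randn(1, 3, 224, 224)
tokens = unfold(img)   # (1, 3*7*7, 56*56): neighbouring tokens share pixels
print(tokens.shape)    # torch.Size([1, 147, 3136])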

9. CCT

CCT proposes compact transformers that use convolutions instead of patching and perform sequence pooling. This allows CCT to achieve high accuracy with a low number of parameters.

import torch
from vit_pytorch.cct import CCT

cct = CCT(
    img_size = (224, 448),
    embedding_dim = 384,
    n_conv_layers = 2,
    kernel_size = 7,
    stride = 2,
    padding = 3,
    pooling_kernel_size = 3,
    pooling_stride = 2,
    pooling_padding = 1,
    num_layers = 14,
    num_heads = 6,
    mlp_ratio = 3.,
    num_classes = 1000,
    positional_embedding = 'learnable', # ['sine', 'learnable', 'none']
)

img = torch.randn(1, 3, 224, 448)
pred = cct(img) # (1, 1000)
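
Sequence pooling replaces the CLS token with a learned, attention-weighted average over all tokens; a minimal sketch (an illustration, not the repo's exact code):

import torch
import torch.nn as nn

dim = 384
score = nn.Linear(dim, 1)             # scores each token

x = torch.randn(1, 196, dim)          # (batch, tokens, dim)
weights = score(x).softmax(dim = 1)   # (1, 196, 1), sums to 1 over tokens
pooled = (weights * x).sum(dim = 1)   # (1, dim), fed to the classifier head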

Alternatively, you can use one of several predefined models [2, 4, 6, 7, 8, 14, 16], which preset the number of layers, number of attention heads, MLP ratio, and embedding dimension.

import torch
from vit_pytorch.cct import cct_14

cct = cct_14(
    img_size = 224,
    n_conv_layers = 1,
    kernel_size = 7,
    stride = 2,
    padding = 3,
    pooling_kernel_size = 3,
    pooling_stride = 2,
    pooling_padding = 1,
    num_classes = 1000,
    positional_embedding = 'learnable', # ['sine', 'learnable', 'none']
)
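
A forward pass then works exactly as in the previous example:

img = torch.randn(1, 3, 224, 224)
pred = cct(img) # (1, 1000)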

10. CrossViT

This paper proposes having two vision transformers process the image at different scales, cross-attending to one another every so often. They show improvements over the base vision transformer.

import torch
from vit_pytorch.cross_vit import CrossViT

v = CrossViT(
    image_size = 256,
    num_classes = 1000,
    depth = 4,               # number of multi-scale encoding blocks
    sm_dim = 192,            # high res dimension
    sm_patch_size = 16,      # high res patch size (should be smaller than lg_patch_size)
    sm_enc_depth = 2,        # high res depth
    sm_enc_heads = 8,        # high res heads
    sm_enc_mlp_dim = 2048,   # high res feedforward dimension
    lg_dim = 384,            # low res dimension
    lg_patch_size = 64,      # low res patch size
    lg_enc_depth = 3,        # low res depth
    lg_enc_heads = 8,        # low res heads
    lg_enc_mlp_dim = 2048,   # low res feedforward dimensions
    cross_attn_depth = 2,    # cross attention rounds
    cross_attn_heads = 8,    # cross attention heads
    dropout = 0.1,
    emb_dropout = 0.1
)

img = torch.randn(1, 3, 256, 256)

pred = v(img) # (1, 1000)
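
The cross attention itself can be sketched with a stock attention layer: the CLS token of one branch queries the patch tokens of the other. Dimensions here are illustrative, and the real module also projects between the two branch dimensions:

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim = 192, num_heads = 8, batch_first = True)

sm_cls = torch.randn(1, 1, 192)      # CLS token of the small-patch branch
lg_tokens = torch.randn(1, 17, 192)  # other branch's tokens, projected to 192
fused, _ = attn(sm_cls, lg_tokens, lg_tokens)  # (1, 1, 192)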

11. PiT

This paper proposes downsampling the tokens through a pooling procedure that uses depth-wise convolutions.

import torch
from vit_pytorch.pit import PiT

v = PiT(
    image_size = 224,
    patch_size = 14,
    dim = 256,
    num_classes = 1000,
    depth = (3, 3, 3),     # list of depths, indicating the number of rounds of each stage before a downsample
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)


img = torch.randn(1, 3, 224, 224)

preds = v(img) # (1, 1000)
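
The pooling stage can be sketched with a strided depth-wise convolution over tokens laid back out on a 2-D grid (an illustration, not PiT's exact module):

import torch
import torch.nn as nn

dim = 256
# groups = dim makes the convolution depth-wise; stride 2 halves the grid
pool = nn.Conv2d(dim, dim * 2, kernel_size = 3, stride = 2, padding = 1, groups = dim)

grid = torch.randn(1, dim, 16, 16)   # tokens reshaped to a spatial grid
print(pool(grid).shape)              # torch.Size([1, 512, 8, 8])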

To be continued...
