Semantic Segmentation Series 26: ViT + SETR, or how the Transformer architecture shines in semantic segmentation

yumaomi

Last modified: 2022-07-08 18:43:31

First published: 2022-07-08 15:54:46

Article link: https://blog.csdn.net/yumaomi/article/details/125676811

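SETR: "Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers"

It rethinks the semantic segmentation paradigm and performs segmentation with a Transformer.

Paper link: SETR

ViT: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"

It introduced the use of a pure Transformer for image classification.

Why introduce the Vision Transformer alongside SETR? Because SETR essentially uses ViT as the backbone to extract features and a conventional decoder to perform the semantic segmentation task.

This article covers:

How the Vision Transformer applies the Transformer to images, and why it works so well in vision.
How Self-Attention and Multi-Head Self-Attention (MSA) are implemented, and why MSA works well.
How SETR builds on the Vision Transformer to perform semantic segmentation.
Code implementations of the Vision Transformer and SETR.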

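Paper overview

Introduction

The Transformer was proposed in 2017 for tasks such as machine translation, and the paper "Attention Is All You Need" made a strong impression at the time. Since then, Transformers and attention have been highly successful across many tasks. That success also inspired work in semantic segmentation: the Non-Local block, based on spatial attention; DANet, which combines channel attention and spatial attention; and many follow-ups that reduce the heavy cost of attention, such as Criss-Cross Attention, Interlaced Sparse Self-Attention, and EM Attention. These methods still rely on a conventional CNN structure, however; none of them applies a pure Transformer to images.

That changed with the Vision Transformer ("An Image is Worth 16x16 Words"). ViT brought a pure Transformer structure to image classification: the image is split into patches, the patches are embedded and processed by a Transformer, and an MLP head produces the class prediction. It achieved excellent results on ImageNet.

ViT in turn inspired semantic segmentation. A CNN has to stack convolutions to enlarge its receptive field, whereas an attention-based Transformer has a global receptive field at every layer and can model global dependencies directly. In addition, CNN backbones usually downsample the input resolution by 8x or even 32x, which loses some information, while the Transformer encoder does no further downsampling after the initial patch embedding, so the feature resolution is kept through all layers.

SETR therefore adopts ViT as the encoder of an encoder-decoder segmentation architecture and uses it to extract image features. SETR ranked first on the ADE20K leaderboard on the day the paper was submitted, which shows that it can achieve very strong results on semantic segmentation.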

Vision Transformer

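The Transformer was originally designed for machine translation, where the data is a sequence of tokens. To apply it to an image, the image must also be turned into a sequence. This is the "Linear Projection of Flattened Patches" step in the figure, where one image is split into 9 patches. To preserve the order of the patches, each patch is tagged with its index, which is the Position Embedding. In ViT, this position information is added to each patch embedding, so every patch embedding carries its own position.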
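As a minimal sketch of the patch-splitting step, the NumPy snippet below cuts an image into non-overlapping patches, flattens each patch, and adds a position embedding. The function name `patchify`, the toy sizes, and the random position table are assumptions for illustration; ViT learns the position embedding and applies a linear projection to each flattened patch before adding it.

```python
import numpy as np

def patchify(img, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C), ordered
    row by row, left to right.
    """
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c) -> (gh*gw, patch*patch*c)
    x = img.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

rng = np.random.default_rng(0)
img = rng.standard_normal((48, 48, 3))      # toy image (assumed size)
tokens = patchify(img, patch=16)            # 3 x 3 = 9 patches, as in the figure
pos = rng.standard_normal(tokens.shape)     # stand-in for a learned position embedding
tokens_with_pos = tokens + pos              # position info is added, not concatenated
print(tokens.shape)                         # (9, 768)
```

Because the position table is added element-wise, the sequence length and channel size are unchanged; only the content of each token now depends on where the patch came from.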

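The embedded image patches are fed into the Transformer Encoder (right side of the figure). Each encoder layer first applies Layer Norm, then Multi-Head Self-Attention (MSA), followed by a residual connection; the result goes through another Layer Norm and an MLP, again with a residual connection. After L such layers we obtain the final output, which is passed to an MLP head to produce the prediction.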
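The order of operations inside one encoder layer can be sketched as below. This is only an illustration under stated assumptions: `toy_attn` and `toy_mlp` are placeholder sub-layers standing in for MSA and the MLP, and `layer_norm` omits the learnable scale and shift that `nn.LayerNorm` has.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token (last axis) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encoder_layer(x, attn_fn, mlp_fn):
    """One pre-norm encoder layer: x + MSA(LN(x)), then x + MLP(LN(x))."""
    x = x + attn_fn(layer_norm(x))
    x = x + mlp_fn(layer_norm(x))
    return x

# Placeholder sub-layers (assumptions): real ones would be MSA and a 2-layer MLP.
def toy_attn(t):
    return 0.5 * t

def toy_mlp(t):
    return np.tanh(t)

x = np.random.default_rng(0).standard_normal((9, 32))   # 9 tokens, 32 channels
for _ in range(4):                                        # L = 4 stacked layers
    x = encoder_layer(x, toy_attn, toy_mlp)
print(x.shape)                                            # (9, 32)
```

The token count and channel size stay the same from layer to layer, which is why the output of any layer can later be reshaped into a feature map.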

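MSA is the most important part of the Transformer, and much of the Transformer's effectiveness is attributed to it. As the name suggests, MSA consists of several Self-Attention heads computed in parallel, hence multi-head self-attention.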

For Self-Attention:

$\large Attention(Q, K, V) = Softmax(\frac{QK^T}{\sqrt{d_k}})V$
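Read with ordinary matrix products, the formula can be sketched in NumPy as follows. The projection matrices `w_q`, `w_k`, `w_v` and all sizes are assumptions chosen for the example; in a trained model they are learned parameters.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence x of shape (n, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)           # (n, n) similarity between tokens
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 4
x = rng.standard_normal((n, d_model))
w_q, w_k, w_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)               # (5, 4) (5, 5)
```

Each output row is a weighted average of all value vectors, which is where the global receptive field comes from.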

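MSA can be understood as computing several Self-Attention results and concatenating them to obtain the MSA output.

A natural question is: why compute several Self-Attentions?

Computing Self-Attention on the full feature dimension is expensive, and learning features in a high-dimensional space is also harder. MSA therefore reduces the dimension for each head (one could call it "truncation"): each head only works on a lower-dimensional projection of the original data, so the attention computed by a single head is necessarily incomplete. Computing several such reduced-dimension Self-Attentions in parallel, with no parameter sharing between heads, keeps the result comprehensive while remaining efficient. There is a further advantage: the heads compute attention in different feature subspaces, which makes the result richer and gives the model a deeper understanding of the data.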
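The split into heads and the final concatenation can be made concrete with a small NumPy sketch. It follows the same layout as the PyTorch listing in this article (one fused projection for q, k, v, then a reshape into heads), but the function name, weights, and sizes here are assumptions for illustration, and dropout is left out.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_qkv, w_out, heads):
    """x: (n, dim); w_qkv: (dim, 3 * heads * d_head); w_out: (heads * d_head, dim)."""
    n = x.shape[0]
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)                    # each (n, heads * d_head)
    # give every head its own d_head-dimensional slice: (heads, n, d_head)
    q, k, v = (t.reshape(n, heads, -1).transpose(1, 0, 2) for t in (q, k, v))
    d_head = q.shape[-1]
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))   # (heads, n, n)
    out = attn @ v                                               # (heads, n, d_head)
    out = out.transpose(1, 0, 2).reshape(n, heads * d_head)      # concatenate heads
    return out @ w_out, attn

rng = np.random.default_rng(0)
n, dim, heads, d_head = 6, 16, 4, 8
x = rng.standard_normal((n, dim))
w_qkv = rng.standard_normal((dim, 3 * heads * d_head)) * 0.1
w_out = rng.standard_normal((heads * d_head, dim)) * 0.1
y, attn = multi_head_self_attention(x, w_qkv, w_out, heads)
print(y.shape, attn.shape)     # (6, 16) (4, 6, 6)
```

Every head produces its own (n, n) attention map over a `d_head`-dimensional subspace, and the output projection mixes the concatenated heads back to `dim` channels.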

import torch
from torch import nn

from einops import rearrange, repeat
from einops.layers.torch import Rearrange

# Multi-Head Self-Attention
class Attention(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.):
        super().__init__()
        inner_dim = dim_head *  heads
        project_out = not (heads == 1 and dim_head == dim)

        self.heads = heads
        self.scale = dim_head ** -0.5

        self.attend = nn.Softmax(dim = -1)
        self.dropout = nn.Dropout(dropout)

        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)

        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, dim),
            nn.Dropout(dropout)
        ) if project_out else nn.Identity()

    def forward(self, x):
        qkv = self.to_qkv(x).chunk(3, dim = -1)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)

        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale

        attn = self.attend(dots)
        attn = self.dropout(attn)

        out = torch.matmul(attn, v)
        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)

if __name__ == "__main__":
    MSA = Attention(dim=256, heads = 8, dim_head = 64, dropout = 0.).cpu()
    # 1 is batch, 1024 is data, 256 is channels
    img = torch.randn(1, 1024, 256).cpu()
    preds = MSA(img)
    print(preds.shape)

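The ViT implementation used as the segmentation encoder is given below. It outputs four tensors of shape [batch, channels, h, w]. The first three are outputs of intermediate Transformer layers and are used for auxiliary losses; the last one is the final output and is used to produce the segmentation result.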

import torch
from torch import nn

from einops import rearrange, repeat
from einops.layers.torch import Rearrange

# helpers
def pair(t):
    return t if isinstance(t, tuple) else (t, t)

# classes 
class PreNorm(nn.Module):
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn
    def forward(self, x, **kwargs):
        return self.fn(self.norm(x), **kwargs)

class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, dropout = 0.):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout)
        )
    def forward(self, x):
        return self.net(x)

class Attention(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.):
        super().__init__()
        inner_dim = dim_head *  heads
        project_out = not (heads == 1 and dim_head == dim)

        self.heads = heads
        self.scale = dim_head ** -0.5

        self.attend = nn.Softmax(dim = -1)
        self.dropout = nn.Dropout(dropout)

        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)

        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, dim),
            nn.Dropout(dropout)
        ) if project_out else nn.Identity()

    def forward(self, x):
        qkv = self.to_qkv(x).chunk(3, dim = -1)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)

        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale

        attn = self.attend(dots)
        attn = self.dropout(attn)

        out = torch.matmul(attn, v)
        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)

class Transformer(nn.Module):
    def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout = 0., out_indices = (9, 14, 19, 23)):
        super().__init__()
        self.out_indices = out_indices
        assert self.out_indices[-1] == depth - 1

        self.layers = nn.ModuleList([])
        
        for _ in range(depth):
            self.layers.append(nn.ModuleList([
                PreNorm(dim, Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout)),
                PreNorm(dim, FeedForward(dim, mlp_dim, dropout = dropout))
            ]))

    def forward(self, x):
        out = []
        for index, (attn, ff) in enumerate(self.layers):
            x = attn(x) + x
            x = ff(x) + x

            if index in self.out_indices:
                out.append(x)

        return out

class ViT(nn.Module):
    def __init__(self, *, image_size, patch_size, dim, depth, heads, mlp_dim, channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0., out_indices = (9, 14, 19, 23)):
        super().__init__()
        image_height, image_width = pair(image_size)
        patch_height, patch_width = pair(patch_size)

        assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'

        num_patches = (image_height // patch_height) * (image_width // patch_width)
        patch_dim = channels * patch_height * patch_width

        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),
            nn.Linear(patch_dim, dim),
        )

        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        self.dropout = nn.Dropout(emb_dropout)

        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout, out_indices=out_indices)

        self.out = Rearrange("b (h w) c->b c h w", h=image_height//patch_height, w=image_width//patch_width)

    def forward(self, img):
        x = self.to_patch_embedding(img)
        b, n, _ = x.shape
        cls_tokens = repeat(self.cls_token, '1 1 d -> b 1 d', b = b)
        x = torch.cat((cls_tokens, x), dim=1)
        x += self.pos_embedding[:, :(n + 1)]
        x = self.dropout(x)

        out = self.transformer(x)

        for index, transformer_out in enumerate(out):
            # delete cls_tokens and transform output to [b, c, h, w]
            out[index] = self.out(transformer_out[:,1:,:])

        return out


import torch
if __name__ == "__main__":
    v = ViT(image_size = (256, 256), patch_size = 256//16, dim = 1024, depth = 24, heads = 16, mlp_dim = 2048, dropout = 0.1, emb_dropout = 0.1, out_indices = (9, 14, 19, 23)).cpu()
    img = torch.randn(1, 3, 256, 256).cpu()
    preds = v(img)
    for output in preds:
        print(output.size())

    """
    output:
        transformer layer 9:     torch.Size([1, 1024, 16, 16])  for aux loss
        transformer layer 14:    torch.Size([1, 1024, 16, 16])  for aux loss
        transformer layer 19:    torch.Size([1, 1024, 16, 16])  for aux loss
        transformer layer 23:    torch.Size([1, 1024, 16, 16])  for segmentation
    """

SETR

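Back to the main topic: the success of ViT inspired SETR.

In Figure 3, part (a) uses the ViT structure. Suppose the input image x has size H×W×3. If it were serialized pixel by pixel, the sequence would have length H×W (3HW values in total), which is too expensive given the quadratic complexity of the Transformer. The authors therefore split the image into a grid of H/16×W/16 patches, each covering 16×16 pixels, which gives a sequence of length HW/256. After embedding the sequence and encoding positions, the final input is E = {e1 + p1, e2 + p2, · · · , eL + pL}, where e is the patch embedding, p is the position embedding, and L is the sequence length.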
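A quick calculation shows how much the 16×16 patching saves. The helper names below are hypothetical and the 256×256 input size is taken from the code in this article; the count is for a single attention matrix of a single head.

```python
def sequence_length(height, width, patch=1):
    """Number of tokens when the image is cut into patch x patch tiles."""
    assert height % patch == 0 and width % patch == 0
    return (height // patch) * (width // patch)

def attention_entries(seq_len):
    """Size of one attention matrix: every token attends to every token."""
    return seq_len * seq_len

H = W = 256
per_pixel = sequence_length(H, W, patch=1)      # one token per pixel
per_patch = sequence_length(H, W, patch=16)     # one token per 16x16 patch, i.e. H*W/256
print(per_pixel, attention_entries(per_pixel))  # 65536 4294967296
print(per_patch, attention_entries(per_patch))  # 256 65536
```

Reducing the sequence length by a factor of 256 shrinks each attention matrix by a factor of 256², which is what makes the encoder tractable.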

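As mentioned above, the ViT output has size [1, 1024, 16, 16], that is [batch, channels, h, w], the same form as the encoder output of a conventional segmentation model. It therefore still has to be upsampled.

The authors design three decoder modes to do this:

Naive mode: directly upsample by 16x (16 -> 256).
Progressive UPsampling (PUP): alternate convolution and 2x upsampling (conv -> 2x upsample -> conv -> 2x upsample ...) to upsample step by step (Figure 3 b).
Multi-Level feature Aggregation (MLA): take the outputs of intermediate Transformer layers, aggregate them, and upsample by 4x.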
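The difference between the Naive and PUP modes is only in how the 16x factor is reached. The small sketch below traces the spatial size through each mode; the function names are hypothetical, and the four 2x stages mirror the `PUPHead` used later in this article.

```python
def pup_sizes(start, stages=4, factor=2):
    """Spatial size after each progressive upsampling stage."""
    sizes = [start]
    for _ in range(stages):
        sizes.append(sizes[-1] * factor)
    return sizes

def naive_size(start, factor=16):
    """Naive mode: a single upsampling step by the full factor."""
    return start * factor

print(pup_sizes(16))    # [16, 32, 64, 128, 256]
print(naive_size(16))   # 256
```

Both routes end at the input resolution; PUP simply inserts a convolution before every 2x step so that the features can be refined at each scale.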

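The authors also compare the decoders on ADE20K, Pascal Context, and other datasets. PUP and MLA outperform Naive, and the gap between the two is small.

Across the datasets, SETR also outperforms the FCN baseline, which shows that SETR handles semantic segmentation well.

The authors also visualize the features of intermediate layers, which shows that the Transformer does learn meaningful information about the image.

Note: the SETR model code is given in the CamVid test below.

Testing on the CamVid dataset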

SETR model

import torch
from torch import nn

from einops import rearrange, repeat
from einops.layers.torch import Rearrange

# helpers
def pair(t):
    return t if isinstance(t, tuple) else (t, t)

# classes 
class PreNorm(nn.Module):
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn
    def forward(self, x, **kwargs):
        return self.fn(self.norm(x), **kwargs)

class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, dropout = 0.):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout)
        )
    def forward(self, x):
        return self.net(x)

class Attention(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.):
        super().__init__()
        inner_dim = dim_head *  heads
        project_out = not (heads == 1 and dim_head == dim)

        self.heads = heads
        self.scale = dim_head ** -0.5

        self.attend = nn.Softmax(dim = -1)
        self.dropout = nn.Dropout(dropout)

        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)

        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, dim),
            nn.Dropout(dropout)
        ) if project_out else nn.Identity()

    def forward(self, x):
        qkv = self.to_qkv(x).chunk(3, dim = -1)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)

        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale

        attn = self.attend(dots)
        attn = self.dropout(attn)

        out = torch.matmul(attn, v)
        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)

class Transformer(nn.Module):
    def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout = 0., out_indices = (9, 14, 19, 23)):
        super().__init__()
        self.out_indices = out_indices
        assert self.out_indices[-1] == depth - 1

        self.layers = nn.ModuleList([])
        
        for _ in range(depth):
            self.layers.append(nn.ModuleList([
                PreNorm(dim, Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout)),
                PreNorm(dim, FeedForward(dim, mlp_dim, dropout = dropout))
            ]))

    def forward(self, x):
        out = []
        for index, (attn, ff) in enumerate(self.layers):
            x = attn(x) + x
            x = ff(x) + x

            if index in self.out_indices:
                out.append(x)

        return out

class ViT(nn.Module):
    def __init__(self, *, image_size, patch_size, dim, depth, heads, mlp_dim, channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0., out_indices = (9, 14, 19, 23)):
        super().__init__()
        image_height, image_width = pair(image_size)
        patch_height, patch_width = pair(patch_size)

        assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'

        num_patches = (image_height // patch_height) * (image_width // patch_width)
        patch_dim = channels * patch_height * patch_width

        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),
            nn.Linear(patch_dim, dim),
        )

        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        self.dropout = nn.Dropout(emb_dropout)

        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout, out_indices=out_indices)

        self.out = Rearrange("b (h w) c->b c h w", h=image_height//patch_height, w=image_width//patch_width)

    def forward(self, img):
        x = self.to_patch_embedding(img)
        b, n, _ = x.shape
        cls_tokens = repeat(self.cls_token, '1 1 d -> b 1 d', b = b)
        x = torch.cat((cls_tokens, x), dim=1)
        x += self.pos_embedding[:, :(n + 1)]
        x = self.dropout(x)

        out = self.transformer(x)

        for index, transformer_out in enumerate(out):
            # delete cls_tokens and transform output to [b, c, h, w]
            out[index] = self.out(transformer_out[:,1:,:])

        return out


class PUPHead(nn.Module):
    def __init__(self, num_classes):
        super(PUPHead, self).__init__()
        
        self.UP_stage_1 = nn.Sequential(
            nn.Conv2d(1024, 256, 3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True)
        )        
        self.UP_stage_2 = nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True)
        )        
        self.UP_stage_3= nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True)
        )        
        self.UP_stage_4= nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True)
        )
    
        self.cls_seg = nn.Conv2d(256, num_classes, 3, padding=1)

    def forward(self, x):
        x = self.UP_stage_1(x)
        x = self.UP_stage_2(x)
        x = self.UP_stage_3(x)
        x = self.UP_stage_4(x)
        x = self.cls_seg(x)
        return x

class SETR(nn.Module):
    def __init__(self, num_classes, image_size, patch_size, dim, depth, heads, mlp_dim, channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0., out_indices = (9, 14, 19, 23)):
        super(SETR, self).__init__()
        self.out_indices = out_indices
        self.num_classes = num_classes
        self.VIT = ViT( image_size=image_size, patch_size=patch_size, dim=dim, depth=depth, heads=heads, mlp_dim=mlp_dim, 
                        channels = channels, dim_head = dim_head, dropout = dropout, emb_dropout = emb_dropout, out_indices = out_indices)

        
        self.Head = nn.ModuleDict()

        for index, indices in enumerate(self.out_indices):
            self.Head["Head"+str(indices)] = PUPHead(num_classes)
        
    def forward(self, x):
        VIT_OUT = self.VIT(x)

        out = []
        for index, indices in enumerate(self.out_indices):
            # the last entry is the output of the final layer
            out.append(self.Head["Head"+str(indices)](VIT_OUT[index]))
        return out


if __name__ == "__main__":
    # ViT-Large configuration with patch size 16 (a 16x16 grid of patches for a 256x256 input)
    SETRNet = SETR(num_classes=3, image_size=256, patch_size=256//16, dim=1024, depth = 24, heads = 16, mlp_dim = 2048, out_indices = (9, 14, 19, 23)).cpu()
    img = torch.randn(1, 3, 256, 256).cpu()
    preds = SETRNet(img)
    for output in preds:
        print("output: ",output.size())

Dataset

# imports
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
from torch.utils.data import Dataset, DataLoader, random_split
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np
import albumentations as A
from albumentations.pytorch.transforms import ToTensorV2
 
torch.manual_seed(17)
# custom dataset: CamVidDataset
class CamVidDataset(torch.utils.data.Dataset):
    def __init__(self, images_dir, masks_dir):
        self.transform = A.Compose([
            A.Resize(256, 256),
            A.HorizontalFlip(),
            A.VerticalFlip(),
            A.Normalize(),
            ToTensorV2(),
        ]) 
        self.ids = os.listdir(images_dir)
        self.images_fps = [os.path.join(images_dir, image_id) for image_id in self.ids]
        self.masks_fps = [os.path.join(masks_dir, image_id) for image_id in self.ids]
 
    
    def __getitem__(self, i):
        # read data
        image = np.array(Image.open(self.images_fps[i]).convert('RGB'))
        mask = np.array( Image.open(self.masks_fps[i]).convert('RGB'))
        image = self.transform(image=image,mask=mask)
        
        return image['image'], image['mask'][:,:,0]
        
    def __len__(self):
        return len(self.ids)
    
    
# dataset paths
DATA_DIR = r'../database/camvid/camvid/' # set this to your own path
x_train_dir = os.path.join(DATA_DIR, 'train_images')
y_train_dir = os.path.join(DATA_DIR, 'train_labels')
x_valid_dir = os.path.join(DATA_DIR, 'valid_images')
y_valid_dir = os.path.join(DATA_DIR, 'valid_labels')
    
train_dataset = CamVidDataset(
    x_train_dir, 
    y_train_dir, 
)
val_dataset = CamVidDataset(
    x_valid_dir, 
    y_valid_dir, 
)
 
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True,drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=4, shuffle=True,drop_last=True)

Train

# loading pretrained ViT weights before training is recommended
model = SETR(num_classes=32, image_size=256, patch_size=256//16, dim=1024, depth = 24, heads = 16, mlp_dim = 2048).cuda()


from d2l import torch as d2l
from tqdm import tqdm
import pandas as pd
import monai
from torchcontrib.optim import SWA
# training loop 100 epochs
epochs_num = 100
# train with the SGD optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
schedule = monai.optimizers.LinearLR(optimizer, end_lr=0.05, num_iter=int(epochs_num*0.75))
# use SWA (stochastic weight averaging) to improve on plain SGD
steps_per_epoch = int(len(train_loader.dataset) / train_loader.batch_size)
swa_start = int(epochs_num*0.75)
optimizer = SWA(optimizer, swa_start=swa_start*steps_per_epoch, swa_freq=steps_per_epoch, swa_lr=0.05)

# multi-class cross-entropy loss
lossf = nn.CrossEntropyLoss(ignore_index=255)


def evaluate_accuracy_gpu(net, data_iter, device=None):
    if isinstance(net, nn.Module):
        net.eval()  # Set the model to evaluation mode
        if not device:
            device = next(iter(net.parameters())).device
    # No. of correct predictions, no. of predictions
    metric = d2l.Accumulator(2)

    with torch.no_grad():
        for X, y in data_iter:
            if isinstance(X, list):
                # Required for BERT Fine-tuning (to be covered later)
                X = [x.to(device) for x in X]
            else:
                X = X.to(device)
            y = y.to(device)
            output = net(X)
            pred = output[-1]
            metric.add(d2l.accuracy(pred, y), d2l.size(y))
    return metric[0] / metric[1]


# training function
def train_ch13(net, train_iter, test_iter, loss, optimizer, num_epochs, schedule, swa_start=swa_start, devices=d2l.try_all_gpus()):
    timer, num_batches = d2l.Timer(), len(train_iter)
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs], ylim=[0, 1], legend=['train loss', 'train acc', 'test acc'])
    net = nn.DataParallel(net, device_ids=devices).to(devices[0])
    # lists for recording training statistics

    loss_list = []
    train_acc_list = []
    test_acc_list = []
    epochs_list = []
    time_list = []
    lr_list = []
    

    for epoch in range(num_epochs):
        # Sum of training loss, sum of training accuracy, no. of examples,
        # no. of predictions
        metric = d2l.Accumulator(4)
        for i, (X, labels) in enumerate(train_iter):
            timer.start()

            if isinstance(X, list):
                X = [x.to(devices[0]) for x in X]
            else:
                X = X.to(devices[0])
            gt = labels.long().to(devices[0])

            net.train()
            optimizer.zero_grad()
            result = net(X)
            
            pred = result[-1]
            seg_loss = loss(result[-1], gt)

            aux_loss_1 = loss(result[0], gt)
            aux_loss_2 = loss(result[1], gt)
            aux_loss_3 = loss(result[2], gt)

            loss_sum = seg_loss + 0.2*aux_loss_1 + 0.3*aux_loss_2 + 0.4*aux_loss_3

            l = loss_sum
            loss_sum.sum().backward()
            optimizer.step()

            acc = d2l.accuracy(pred, gt)
            metric.add(l, acc, labels.shape[0], labels.numel())

            timer.stop()
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,(metric[0] / metric[2], metric[1] / metric[3], None))
                
        if optimizer.state_dict()['param_groups'][0]['lr']>0.05:
            schedule.step()

        test_acc = evaluate_accuracy_gpu(net, test_iter)
        
        if (epoch + 1) >= swa_start:
            if epoch == 0 or epoch % 5 == 5 - 1 or epoch == num_epochs - 1:
                # Batchnorm update
                optimizer._reset_lr_to_swa()
                optimizer.swap_swa_sgd()
                optimizer.bn_update(train_iter, net, device='cuda')
                test_acc = evaluate_accuracy_gpu(net, test_iter)
                optimizer.swap_swa_sgd()
        
        animator.add(epoch + 1, (None, None, test_acc))

        print(f"epoch {epoch+1}/{num_epochs} --- loss {metric[0] / metric[2]:.3f} --- train acc {metric[1] / metric[3]:.3f} --- test acc {test_acc:.3f} --- lr {optimizer.state_dict()['param_groups'][0]['lr']} --- cost time {timer.sum()}")
        
        #--------- save training statistics ---------------
        df = pd.DataFrame()
        loss_list.append(metric[0] / metric[2])
        train_acc_list.append(metric[1] / metric[3])
        test_acc_list.append(test_acc)
        epochs_list.append(epoch+1)
        time_list.append(timer.sum())
        lr_list.append(optimizer.state_dict()['param_groups'][0]['lr'])
        
        df['epoch'] = epochs_list
        df['loss'] = loss_list
        df['train_acc'] = train_acc_list
        df['test_acc'] = test_acc_list
        df["lr"] = lr_list
        df['time'] = time_list
        
        df.to_excel("../blork_file/savefile/test.xlsx")
        #---------------- save model ------------------- 
        if np.mod(epoch+1, 5) == 0:
            torch.save(net, f'../blork_file/checkpoints/test{epoch+1}.pth')

    # save the final model
    torch.save(net, f'../blork_file/checkpoints/test.pth')


train_ch13(model, train_loader, val_loader, lossf, optimizer, epochs_num, schedule=schedule)

Result

Transformers are fairly hard to train; I only ran half of the planned training here.

Inference

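I also implemented a simple inference script. The results turned out to be mediocre. Optimizing an mIoU-based loss instead of cross-entropy might work somewhat better, as might simplifying the task by dropping some of the classes.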

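import torch
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
import torch.nn as nn
import albumentations as A
from albumentations.pytorch.transforms import ToTensorV2

# build a trimmed model from the trained checkpoint
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# drop the auxiliary decoders and keep only the final head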
class prune_model(nn.Module):
    def __init__(self, encode, decode):
        super(prune_model, self).__init__()
        self.encode = encode
        self.decode = decode

    def forward(self, x):
        x = self.encode(x)
        x = self.decode(x[-1])
        return x


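model = torch.load(r"../checkpoints/test_last.pth", map_location=device)
# the training script saves the nn.DataParallel wrapper, so unwrap it if present
if isinstance(model, nn.DataParallel):
    model = model.module
# build the new model
new_model = prune_model(model.VIT, model.Head.Head23).to(device)
model = new_model#.cpu()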

NCLASSES = 32
Cam_COLORMAP = [[128, 0, 0], [0, 128, 0], [128, 128, 0],
                [0, 0, 128], [128, 0, 128], [0, 128, 128], [128, 128, 128],
                [64, 0, 0], [192, 0, 0], [64, 128, 0], [192, 128, 0],
                [64, 0, 128], [192, 0, 128], [64, 128, 128], [192, 128, 128],
                [0, 64, 0], [128, 64, 0], [0, 192, 0], [128, 192, 0],
                [0, 64, 128], [0, 32, 128],[0, 16, 128],[0, 64, 64],[0, 64, 32],
                [0, 64, 16],[64, 64, 128],[0, 32, 16],[32,32,32],[16,16,16],[32,16,128],
                [192,16,16],[32,32,196],
               ]

# 32 classes
Cam_CLASSES = ['Animal','Archway','Bicyclist','Bridge','Building','Car','CartLuggagePram','Child',
               'Column_Pole','Fence','LaneMkgsDriv','LaneMkgsNonDriv','Misc_Text','MotorcycleScooter',
               'OtherMoving','ParkingBlock','Pedestrian','Road','RoadShoulder','Sidewalk','SignSymbol',
               'Sky', 'SUVPickupTruck','TrafficCone','TrafficLight', 'Train','Tree','Truck_Bus', 'Tunnel',
               'VegetationMisc', 'Void','Wall']
               
assert len(Cam_COLORMAP) == len(Cam_CLASSES) == 32


image = Image.open("../database/camvid/camvid/train_images/0001TP_008100.png").convert("RGB")
image = image.resize((256, 256))
temp = image

image = np.array(image)

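# deterministic preprocessing only: a random flip here would misalign the prediction with the image it is blended onto
Transform = A.Compose([
    A.Resize(256, 256),
    A.Normalize(),
    ToTensorV2(),
]) 

image = Transform(image=image)
image = image['image'].view(1, 3, 256, 256)

image = image.to(device)

# inference mode: disable dropout and gradient tracking
model.eval()
with torch.no_grad():
    preds = model(image)
# class index per pixel, shape (256, 256)
out = preds.argmax(1)[0].cpu().numpy()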

seg_img = np.zeros((256, 256, 3))

colors = Cam_COLORMAP
colors[0][0] = 0
colors[0][1] = 0
colors[0][2] = 0

for c in range(NCLASSES):
    seg_img[:,:,0] += ((out[:,:] == c )*( colors[c][0] )).astype('uint8')
    seg_img[:,:,1] += ((out[:,:] == c )*( colors[c][1] )).astype('uint8')
    seg_img[:,:,2] += ((out[:,:] == c )*( colors[c][2] )).astype('uint8')

seg_img = Image.fromarray(np.uint8(seg_img))
image = Image.blend(temp,seg_img,0.5)
image = image.resize((960, 720), Image.BILINEAR)

plt.figure(figsize=(10, 8))
plt.imshow(image)
plt.show()

The result is rather rough, although most regions are indeed predicted correctly.
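To put a number on a qualitative result like this, mean IoU could be computed from the predicted and ground-truth label maps. The sketch below is one common way to do it with a confusion matrix; the function names and the tiny label maps are assumptions for illustration, and `ignore_index=255` simply mirrors the value passed to `nn.CrossEntropyLoss` in the training code. Note that this is an evaluation metric, not a differentiable loss.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Rows are ground-truth classes, columns are predicted classes."""
    pred, target = np.asarray(pred).ravel(), np.asarray(target).ravel()
    keep = target != ignore_index
    idx = target[keep].astype(np.int64) * num_classes + pred[keep].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_iou(pred, target, num_classes, ignore_index=255):
    cm = confusion_matrix(pred, target, num_classes, ignore_index)
    inter = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    present = union > 0                      # skip classes absent from both maps
    return float((inter[present] / union[present]).mean())

target = np.array([[0, 0, 1], [1, 2, 255]])
pred   = np.array([[0, 1, 1], [1, 2, 0]])
print(mean_iou(pred, target, num_classes=3))   # (1/2 + 2/3 + 1) / 3, about 0.722
```

Unlike pixel accuracy, mean IoU weights every class equally, so small classes that are missed entirely pull the score down noticeably.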

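Summary

This article introduced the Vision Transformer and SETR and how they are applied to semantic segmentation.
While introducing the Vision Transformer, it also explained how Multi-Head Self-Attention is implemented and why MSA is more effective than a single Self-Attention.
SETR is essentially a ViT + decoder structure, with three decoder designs: Naive, PUP, and MLA. This article only implements PUP.
I completed part of the training and inference with SETR on the CamVid dataset (our inference results are mediocre, and the Transformer model is indeed hard to train). The paper's authors, however, obtain good results on many datasets, which shows that applying the Transformer structure to semantic segmentation is feasible.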