CNN Plugin: Turning the Encoder from YOLOF into a PyTorch Plugin

This post covers the DilatedEncoder proposed in the YOLOF paper, which serves as a replacement for FPN in object detection networks. The DilatedEncoder consists of an FPN-style lateral connection followed by dilated residual blocks, so that its output covers objects at different scales. A pure-PyTorch implementation is provided so the module can be reused in other network architectures.

Contents

1. Purpose

2. About the Dilated Encoder

3. PyTorch Code


1. Purpose

A previous post covered the YOLOF paper, which proposes a Dilated Encoder as a replacement for FPN. Can we extract it and use it as a standalone plugin?

That idea is what this post is about.

2. About the Dilated Encoder

YOLOF proposes a SiSo (single-in, single-out) module that replaces the conventional MiMo (multi-in, multi-out) FPN; its structure is illustrated in the figure from the paper.

The design is straightforward. First, following FPN, two projection layers (a 1×1 convolution and a 3×3 convolution) are appended to the backbone to produce a 512-channel feature map. Then, so that the encoder's output features can cover objects at all scales, the paper adds residual blocks of three consecutive convolutions: a 1×1 convolution that reduces the channel dimension by a factor of 4, a 3×3 dilated convolution that enlarges the receptive field, and a final 1×1 convolution that restores the channel dimension.

In effect, the module keeps the lateral connection from FPN (the Projector in the figure): the C5 feature passes through the Projector and then through four consecutive dilated residual blocks to produce the P5 feature.
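As a rough sanity check (my own back-of-the-envelope estimate, not a number quoted from the paper), the receptive field that the projector plus the four dilated 3×3 convolutions build up on the C5 feature map can be computed like this:

# Rough receptive-field estimate on C5 (stride 32 w.r.t. the input image).
# A 3x3 conv with dilation d has an effective kernel size of 2*d + 1 and,
# at stride 1, grows the receptive field by 2*d; the 1x1 convs do not change it.
dilations = [2, 4, 6, 8]
rf = 1 + 2                               # starting point + the 3x3 fpn_conv
rf += sum(2 * d for d in dilations)      # the four dilated 3x3 convs
print(rf)                                # 43 on C5, i.e. about 43 * 32 input pixels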

3. PyTorch Code

The official YOLOF implementation of the encoder relies on detectron2 and fvcore APIs, which is inconvenient for anyone working in plain PyTorch. I therefore replaced or removed everything that goes beyond PyTorch, yielding a pure-PyTorch encoder that can be dropped as a plugin into any other network architecture.

Without further ado, here is the code:

"""
The define of Dilated Encoder from YOLOF:
No detectron2, only Pytorch.

"""

import torch
import torch.nn as nn


class DilatedEncoder(nn.Module):
    """
    Dilated Encoder for YOLOF.

    This module contains two types of components:
        - the original FPN lateral convolution layer and fpn convolution layer,
          which are 1x1 conv + 3x3 conv
        - the dilated residual block
    """

    def __init__(self,
                 in_channels=2048,
                 encoder_channels=512,
                 block_mid_channels=128,
                 num_residual_blocks=4,
                 block_dilations=[2, 4, 6, 8]
                 ):
        super(DilatedEncoder, self).__init__()
        self.in_channels = in_channels
        self.encoder_channels = encoder_channels
        self.block_mid_channels = block_mid_channels
        self.num_residual_blocks = num_residual_blocks
        self.block_dilations = block_dilations

        assert len(self.block_dilations) == self.num_residual_blocks

        # init
        self._init_layers()
        self._init_weight()

    def _init_layers(self):
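        # Projector (the FPN-style lateral connection): a 1x1 conv to reduce
        # channels, followed by a 3x3 conv, each with batch norm and no activation.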
        self.lateral_conv = nn.Conv2d(self.in_channels,
                                      self.encoder_channels,
                                      kernel_size=1)
        self.lateral_norm = nn.BatchNorm2d(self.encoder_channels)
        self.fpn_conv = nn.Conv2d(self.encoder_channels,
                                  self.encoder_channels,
                                  kernel_size=3,
                                  padding=1)
        self.fpn_norm = nn.BatchNorm2d(self.encoder_channels)
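        # Stack the dilated residual blocks, one per entry in block_dilations.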
        encoder_blocks = []
        for i in range(self.num_residual_blocks):
            dilation = self.block_dilations[i]
            encoder_blocks.append(
                Bottleneck(
                    self.encoder_channels,
                    self.block_mid_channels,
                    dilation=dilation
                )
            )
        self.dilated_encoder_blocks = nn.Sequential(*encoder_blocks)

    def xavier_init(self, layer):
        # Xavier (uniform) initialization for the projector convolutions.
        if isinstance(layer, nn.Conv2d):
            nn.init.xavier_uniform_(layer.weight, gain=1)

    def _init_weight(self):
        self.xavier_init(self.lateral_conv)
        self.xavier_init(self.fpn_conv)
        for m in [self.lateral_norm, self.fpn_norm]:
            nn.init.constant_(m.weight, 1)
            nn.init.constant_(m.bias, 0)
        for m in self.dilated_encoder_blocks.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.normal_(m.weight, mean=0, std=0.01)
                if hasattr(m, 'bias') and m.bias is not None:
                    nn.init.constant_(m.bias, 0)

            if isinstance(m, (nn.GroupNorm, nn.BatchNorm2d, nn.SyncBatchNorm)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def forward(self, feature: torch.Tensor) -> torch.Tensor:
        out = self.lateral_norm(self.lateral_conv(feature))
        out = self.fpn_norm(self.fpn_conv(out))
        return self.dilated_encoder_blocks(out)


class Bottleneck(nn.Module):
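    """
    Dilated residual block: a 1x1 conv reduces the channel dimension, a 3x3
    dilated conv enlarges the receptive field, and a 1x1 conv restores the
    channels; the block input is added back through a skip connection.
    """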

    def __init__(self,
                 in_channels: int = 512,
                 mid_channels: int = 128,
                 dilation: int = 1):
        super(Bottleneck, self).__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=1, padding=0),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True)
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(mid_channels, mid_channels,
                      kernel_size=3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True)
        )
        self.conv3 = nn.Sequential(
            nn.Conv2d(mid_channels, in_channels, kernel_size=1, padding=0),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x
        out = self.conv1(x)
        out = self.conv2(out)
        out = self.conv3(out)
        out = out + identity
        return out


if __name__ == '__main__':
    encoder = DilatedEncoder()
    print(encoder)

    x = torch.rand(1, 2048, 32, 32)
    y = encoder(x)
    print(y.shape)
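
To use the encoder as a plugin behind an existing backbone, it is enough to feed it the backbone's last feature map. A minimal sketch, assuming torchvision's resnet50 as the backbone (any backbone whose final feature map has 2048 channels works the same way):

import torch
import torchvision

# Keep everything up to and including layer4, i.e. the C5 feature map (stride 32).
backbone = torchvision.models.resnet50()
c5_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

encoder = DilatedEncoder(in_channels=2048, encoder_channels=512)

img = torch.rand(1, 3, 512, 512)
c5 = c5_extractor(img)   # (1, 2048, 16, 16)
p5 = encoder(c5)         # (1, 512, 16, 16)
print(c5.shape, p5.shape)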

 
