RetinaNet Walkthrough
1. RetinaNet
A one-stage detector.
- Key contributions: the RetinaNet network itself, plus Focal Loss to handle the extreme positive/negative sample imbalance
- Structure: backbone + FPN + head (bbox & class)
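The Focal Loss mentioned above down-weights easy examples so that the huge number of easy negatives does not dominate training. A minimal pure-Python sketch of the binary form, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t); the helper below is illustrative only, not mmdet's implementation:

```python
import math

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for a single prediction.

    p: predicted foreground probability, target: 1 (positive) or 0 (negative).
    alpha/gamma are the defaults from the RetinaNet paper.
    """
    pt = p if target == 1 else 1 - p          # probability of the true class
    at = alpha if target == 1 else 1 - alpha  # class-balancing weight
    return -at * (1 - pt) ** gamma * math.log(pt)

# An easy, confidently classified negative contributes almost nothing,
# while a hard misclassified example keeps a large loss.
easy = focal_loss(0.05, 0)  # pt = 0.95 -> tiny loss
hard = focal_loss(0.05, 1)  # pt = 0.05 -> large loss
```

The `(1 - pt) ** gamma` factor is what suppresses well-classified examples; with `gamma=0` the expression reduces to ordinary (alpha-weighted) cross-entropy.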
2. Configuration File
retinanet_r50_fpn.py is analyzed as follows:
Backbone
- Config:
model = dict(
    type='RetinaNet',  # model name
    backbone=dict(
        type='ResNet',  # backbone type: ResNet
        depth=50,  # ResNet-50
        num_stages=4,  # ResNet follows the stem + 4-stage design; 4 is the number of stages used
        out_indices=(0, 1, 2, 3),  # the backbone outputs 4 feature maps, with indices (0, 1, 2, 3), strides (4, 8, 16, 32) and channel counts (256, 512, 1024, 2048)
        frozen_stages=1,  # freeze the weights of the stem and the first stage (not trained)
        norm_cfg=dict(type='BN', requires_grad=True),  # whether the norm parameters receive gradient updates
        norm_eval=True,  # put every norm layer in the backbone into eval mode: the running mean and variance come from the pretrained weights and are not updated
        style='pytorch',
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
        # the backbone is initialized from torchvision's ImageNet-pretrained ResNet-50 weights
out_indices: classification models generally follow a stem + n_stage + fc_head layout; ResNet consists of three parts, stem + 4 stages + fc. The stem's output stride is 4, and the 4 stages have strides 4, 8, 16, 32 relative to the input; e.g. out_indices=(0,) would output only the stride-4 feature map. Since the backbone is followed by an FPN here, all 4 feature maps are needed.
frozen_stages: -1 (freeze nothing), 0 (stem only), 1 (stem + stage 1), 2 (stem + stages 1-2), and so on.
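The mapping above can be pinned down with a tiny helper (a hypothetical function just to make the semantics explicit; in mmdet this logic lives inside `ResNet._freeze_stages`):

```python
def stages_to_freeze(frozen_stages):
    """Names of the blocks whose weights are frozen (illustrative only)."""
    frozen = []
    if frozen_stages >= 0:
        frozen.append('stem')  # the stem is frozen from 0 upward
    frozen += [f'layer{i}' for i in range(1, frozen_stages + 1)]
    return frozen

stages_to_freeze(-1)  # [] - nothing frozen
stages_to_freeze(1)   # ['stem', 'layer1'] - the RetinaNet default
```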
See mmdet/models/backbones/resnet.py for how the ResNet backbone is constructed.
A quick refresher on the output size of a convolution/pooling layer: $W' = \left\lfloor \frac{W - F + 2P}{S} \right\rfloor + 1$; for dilated convolution, $W' = \left\lfloor \frac{W - d(F-1) - 1 + 2P}{S} \right\rfloor + 1$ (rounded down). The effective kernel size is $d(F-1)+1$, where $d$ is the dilation rate and $(d-1)$ zeros are inserted between adjacent kernel elements.
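The two formulas can be checked with a few lines of Python (the helper name is mine):

```python
def out_size(W, F, S, P, d=1):
    """Output size of a conv/pool layer; d > 1 gives the dilated case."""
    eff = d * (F - 1) + 1  # effective kernel size
    return (W - eff + 2 * P) // S + 1

# ResNet stem on a 224x224 input:
out_size(224, 7, 2, 3)      # 7x7 conv, stride 2   -> 112
out_size(112, 3, 2, 1)      # 3x3 maxpool, stride 2 -> 56 (stride 4 overall)
# dilated 3x3 conv, d=2: effective kernel size 5
out_size(56, 3, 1, 2, d=2)  # padding 2 keeps the size -> 56
```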
- stem: stride 4 relative to the input image, 64 output channels. The stem consists of conv (+ norm + relu) + maxpool; depending on whether there is one conv+norm+relu block or three, it is called stem or deep_stem. The feature map shrinks 4x after the conv + maxpool (each stride 2), i.e. the stem's overall stride is 4.
def _make_stem_layer(self, in_channels, stem_channels):
    if self.deep_stem:  # three 3x3 conv + norm + relu blocks (maxpool is applied in forward)
        self.stem = nn.Sequential(
            build_conv_layer(
                self.conv_cfg,
                in_channels,
                stem_channels // 2,
                kernel_size=3,
                stride=2,
                padding=1,
                bias=False),
            build_norm_layer(self.norm_cfg, stem_channels // 2)[1],
            nn.ReLU(inplace=True),
            build_conv_layer(
                self.conv_cfg,
                stem_channels // 2,
                stem_channels // 2,
                kernel_size=3,
                stride=1,
                padding=1,
                bias=False),
            build_norm_layer(self.norm_cfg, stem_channels // 2)[1],
            nn.ReLU(inplace=True),
            build_conv_layer(
                self.conv_cfg,
                stem_channels // 2,
                stem_channels,
                kernel_size=3,
                stride=1,
                padding=1,
                bias=False),
            build_norm_layer(self.norm_cfg, stem_channels)[1],
            nn.ReLU(inplace=True))
    else:  # a single 7x7 conv + norm + relu
        self.conv1 = build_conv_layer(
            self.conv_cfg,
            in_channels,
            stem_channels,
            kernel_size=7,
            stride=2,
            padding=3,
            bias=False)
        self.norm1_name, norm1 = build_norm_layer(
            self.norm_cfg, stem_channels, postfix=1)
        self.add_module(self.norm1_name, norm1)
        self.relu = nn.ReLU(inplace=True)
    self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
- stages: the per-stage strides are (1, 2, 2, 2), i.e. the factor by which each stage shrinks its input; relative to the original image, the output strides are 4, 8, 16, 32 and the channel counts are 256, 512, 1024, 2048. The construction details are omitted here.
- ResNet forward pass: only the feature maps whose stage index is in self.out_indices are returned; here all 4 stage outputs are emitted and fed to the neck.
def forward(self, x):
    """Forward function."""
    if self.deep_stem:
        x = self.stem(x)
    else:
        x = self.conv1(x)
        x = self.norm1(x)
        x = self.relu(x)
    x = self.maxpool(x)
    outs = []
    for i, layer_name in enumerate(self.res_layers):
        res_layer = getattr(self, layer_name)
        x = res_layer(x)
        if i in self.out_indices:
            outs.append(x)
    return tuple(outs)
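As a sanity check on the strides and channel counts above, here are the shapes the four outputs would have for an 800x800 input (plain arithmetic, no framework needed):

```python
strides = (4, 8, 16, 32)
channels = (256, 512, 1024, 2048)
# (C, H, W) of each backbone output for an 800x800 image
shapes = [(c, 800 // s, 800 // s) for c, s in zip(channels, strides)]
# [(256, 200, 200), (512, 100, 100), (1024, 50, 50), (2048, 25, 25)]
```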
Neck
That is, the FPN.
- Structure: (original figure omitted)
- Config:
neck=dict(
    type='FPN',  # use an FPN as the neck for feature fusion
    in_channels=[256, 512, 1024, 2048],  # channel counts of the 4 backbone feature maps, i.e. the FPN takes 4 feature maps as input
    out_channels=256,  # output channels of every FPN feature map
    start_level=1,  # build the pyramid starting from the feature map with index 1 (512 channels), i.e. the FPN only uses the last three backbone outputs
    add_extra_convs='on_input',  # the 2 extra feature maps are generated from the backbone output
    num_outs=5),  # the FPN finally outputs 5 feature maps, each with 256 channels
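Putting the config together: with start_level=1 and num_outs=5, the pyramid levels are P3..P7 with strides 8 through 128. For a 640x640 input the idealized level sizes are (ignoring the rounding of the stride-2 extra convs, discussed later):

```python
num_outs = 5
# P3..P7: each level halves the previous one, starting at stride 8
strides = [8 * 2 ** i for i in range(num_outs)]  # [8, 16, 32, 64, 128]
sizes = [640 // s for s in strides]              # [80, 40, 20, 10, 5]
```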
- Code:
The FPN output consists of two parts: (1) the backbone feature maps (start_level=1, i.e. the last three), fused via lateral connections + upsampling; (2) extra feature maps generated directly from the backbone output.
Part 1: lateral connections + top-down upsampling fusion of the last three backbone feature maps
(1) The last three feature maps c3, c4, c5 each pass through a lateral connection (self.lateral_convs, 1x1 convolutions) that unifies the channel count to 256, giving m3, m4, m5;
(2) starting from the smallest map m5, a 2x nearest-neighbour upsampling of m5 is added to m4 to form the new m4, which is in turn upsampled 2x and added to m3 to form the new m3;
(3) the unchanged m5 and the newly fused m4, m3 pass through self.fpn_convs (3x3 convolutions) to produce the final three output feature maps P3, P4, P5.
def forward(self, inputs):
    """Forward function."""
    assert len(inputs) == len(self.in_channels)
    # build laterals
    # 1. pass the last three feature maps through the 1x1 lateral convs
    #    to unify the channel count to 256
    laterals = [
        lateral_conv(inputs[i + self.start_level])
        for i, lateral_conv in enumerate(self.lateral_convs)
    ]
    # build top-down path
    # 2. starting from the smallest map, fuse top-down:
    #    2x nearest-neighbour upsampling followed by element-wise addition
    used_backbone_levels = len(laterals)
    for i in range(used_backbone_levels - 1, 0, -1):
        # In some cases, fixing `scale factor` (e.g. 2) is preferred, but
        # it cannot co-exist with `size` in `F.interpolate`.
        if 'scale_factor' in self.upsample_cfg:
            laterals[i - 1] += F.interpolate(laterals[i],
                                             **self.upsample_cfg)
        else:
            # fall back to matching the lateral's spatial size
            prev_shape = laterals[i - 1].shape[2:]
            laterals[i - 1] += F.interpolate(
                laterals[i], size=prev_shape, **self.upsample_cfg)
    # build outputs
    # part 1: from original levels
    # 3. pass the fused maps through self.fpn_convs (3x3 convs)
    #    to get the final three feature maps
    outs = [
        self.fpn_convs[i](laterals[i]) for i in range(used_backbone_levels)
    ]
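Step 2 above (2x nearest-neighbour upsampling plus element-wise addition) can be illustrated framework-free on nested lists; these are toy helpers, not mmdet code:

```python
def upsample2x(fm):
    """2x nearest-neighbour upsampling of a 2D grid."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in (0, 1)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                   # repeat each row
    return out

def fuse(top, lateral):
    """Upsample the smaller (higher) level and add it to the lateral map."""
    up = upsample2x(top)
    return [[a + b for a, b in zip(r_up, r_lat)]
            for r_up, r_lat in zip(up, lateral)]

m5 = [[1]]              # 1x1 "m5"
m4 = [[1, 2], [3, 4]]   # 2x2 "m4"
fuse(m5, m4)            # [[2, 3], [4, 5]]
```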
The lateral connections self.lateral_convs (1x1 convs) and self.fpn_convs (3x3 convs) are built as follows:
self.lateral_convs = nn.ModuleList()
self.fpn_convs = nn.ModuleList()
for i in range(self.start_level, self.backbone_end_level):
    l_conv = ConvModule(
        in_channels[i],
        out_channels,
        1,
        conv_cfg=conv_cfg,
        norm_cfg=norm_cfg if not self.no_norm_on_lateral else None,
        act_cfg=act_cfg,
        inplace=False)
    fpn_conv = ConvModule(
        out_channels,
        out_channels,
        3,
        padding=1,
        conv_cfg=conv_cfg,
        norm_cfg=norm_cfg,
        act_cfg=act_cfg,
        inplace=False)
    self.lateral_convs.append(l_conv)
    self.fpn_convs.append(fpn_conv)
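With this config (start_level=1, backbone_end_level=4), the loop above creates exactly three lateral/fpn conv pairs; the convs for the extra levels are created in a separate loop not shown here:

```python
start_level, backbone_end_level = 1, 4
in_channels = [256, 512, 1024, 2048]
# input channels of the lateral 1x1 convs: each maps down to 256
lateral_in = [in_channels[i] for i in range(start_level, backbone_end_level)]
# -> [512, 1024, 2048], i.e. three pairs of lateral + fpn convs
```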
Part 2: two extra feature maps generated from the backbone output
(1) The extra feature maps come from inputs[3], i.e. the last backbone feature map c5 (add_extra_convs='on_input');
(2) c5 passes through a 3x3, stride=2, padding=1 convolution to form the fourth feature map P6 (halving the spatial size, so the strides become 8, 16, 32, 64);
(3) P6 then passes through another 3x3, stride=2, padding=1 convolution to form the fifth feature map P7 (stride 128).
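One subtlety worth checking with the conv-size formula from earlier: the extra levels come from real stride-2 convs, so their sizes follow floor((W - 3 + 2) / 2) + 1 rather than an exact division by 64 or 128. For an 800x800 input (c5 is 25x25):

```python
def out_size(W, F=3, S=2, P=1):
    """Output size of a 3x3, stride-2, padding-1 conv."""
    return (W - F + 2 * P) // S + 1

c5 = 800 // 32     # 25
p6 = out_size(c5)  # 13 (not 800 // 64 == 12)
p7 = out_size(p6)  # 7  (not 800 // 128 == 6)
```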