In Part 2, we analyzed the structure of ResNet18. When an image passes through ResNet18, it produces Feature Maps. These feature maps are then fed into the RPN (Region Proposal Network) to generate RoIs. The RPN used in this article adopts the FPN structure.
References:
Detect-and-Track source code: [link]
Generic Object Detection survey paper: [link]
FPN paper: [link]
I. A Quick Review of RPN
First, let's look at where the RPN comes from. In the survey paper, the evolution of the R-CNN family includes this part:
Although Fast R-CNN improved on R-CNN's speed, region proposal computation (Selective Search) became Fast R-CNN's bottleneck. Faster R-CNN replaces Selective Search with a CNN (the RPN) to generate proposals, yielding an order-of-magnitude speedup. The main steps of the RPN are:
★ Convolution: the feature map produced by the backbone first goes through one more convolution, which serves two purposes: extracting features once more and reshaping the tensor dimensions.
★ The red-boxed part, "For Each Spatial Location": centered at every position of the feature map, a set of Anchors is generated, typically with 3 scales and 3 aspect ratios. The result looks like the figure below.
★ Sibling FC layers: the anchors then feed two "sibling FC layers", Cls and Reg: classification and regression. Classification is a foreground/background decision, essentially a binary classification that determines whether an anchor is background. For foreground anchors, the network regresses corrections based on the Ground Truth (human-annotated boxes); the corrected result looks like this:
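The per-location anchor set described above can be sketched as follows. This is a minimal illustration, not the repo's `generate_anchors`; `base_size`, `scales`, and `ratios` are assumed values chosen for the common 3-scales × 3-ratios setup:

```python
import numpy as np

def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Generate the 9 anchors (3 scales x 3 aspect ratios) centered on one
    spatial location, as (x1, y1, x2, y2) offsets around the cell center."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            # Keep the area fixed per scale; the ratio only reshapes w/h.
            w = base_size * scale * np.sqrt(1.0 / ratio)
            h = base_size * scale * np.sqrt(ratio)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

anchors = make_anchors()
print(anchors.shape)  # (9, 4)
```

Sliding this set over every spatial location of the feature map produces H × W × 9 candidate boxes, which is exactly what Cls and Reg then score and refine.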
II. A Quick Review of FPN
FPN was proposed to make the network more sensitive to object scale. Earlier members of the R-CNN family analyzed a feature map at a single scale, so small objects became even smaller after convolution, which made them harder to recognize. FPN instead adopts a pyramid structure, letting the network detect objects on feature maps at different levels, improving accuracy on small objects. The idea is illustrated by this diagram:
The left side is the backbone convolving the image: as convolution proceeds, the feature maps get deeper and their resolution shrinks. Previously we used the feature map of one particular level for recognition; FPN extends this:
★ Backbone: usually a ResNet. From bottom to top, the left-side levels are named C1, C2, C3, C4, where C1 has the highest resolution and C4 the lowest. Adjacent levels differ by 2× in width/height, i.e. 4× in area.
★ Top-down upsampling on the right: the smallest map, C4, is carried over directly to become P4. P4 is then upsampled so its size matches C3; after the "lateral conn" (lateral connection), it is added to C3 to produce P3, and P2 follows the same pattern.
★ Lateral conn: each P level is computed from two inputs: the upsampled output of the P level above it, summed with the corresponding C level.
Note: in the left-side structure, the blue boxes get thicker as the C level rises. The thickness represents "semantic strength", i.e. the degree of abstraction. As convolution proceeds, the lower the resolution, the stronger the semantics; the higher the resolution, the weaker the semantics. Through the lateral connections, every P level on the right is built from two parts: one from upsampling the semantically strong stream, the other from the high-resolution stream. So every P level carries rich semantics, which is the elegance of FPN.
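The upsample-plus-lateral merge described above can be sketched in plain NumPy. This is a shapes-only sketch: a fixed random projection stands in for the learned 1×1 lateral conv, and nearest-neighbor repetition stands in for the upsampling:

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbor 2x upsampling of an (H, W, C) feature map.
    return x.repeat(2, axis=0).repeat(2, axis=1)

def lateral(x, out_channels=256):
    # Stand-in for the 1x1 lateral conv: a fixed random channel projection.
    rng = np.random.default_rng(0)
    w = rng.standard_normal((x.shape[-1], out_channels)) * 0.01
    return x @ w

# Toy C-levels: resolution halves and channels grow going up the backbone.
C3 = np.zeros((8, 8, 128))
C4 = np.zeros((4, 4, 256))

P4 = lateral(C4)                   # top level: lateral conv only
P3 = upsample2x(P4) + lateral(C3)  # upsample + lateral, then elementwise add
print(P3.shape)  # (8, 8, 256)
```

The elementwise add is why the upsampled P level and the laterally projected C level must match in both spatial size and channel count; the 1×1 lateral conv exists precisely to equalize the channels.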
III. Building RPN & FPN
1. model_builder.py
The function build_generic_fast_rcnn_model() in this file builds the whole network. The RPN & FPN portion is the code below, which has three parts:
★ First, select the FPN library (the 3D version here).
★ [Key] Then build the RPN by calling add_fpn_rpn_outputs(), which constructs an RPN with FPN.
★ Finally, extract the per-level FPN blobs.
# Select the FPN lib, based on whether the head is 3D or 2D
if cfg.MODEL.VIDEO_ON and cfg.VIDEO.BODY_HEAD_LINK == '':
    FPN_lib = FPN3D
    head_3d = True
    out_time_dim = cfg.VIDEO.NUM_FRAMES_MID  # 3
else:
    FPN_lib = FPN
    head_3d = False
    out_time_dim = 1
# ★ Add the RPN branch
if cfg.MODEL.FASTER_RCNN:  # True: we are using Faster R-CNN
    if cfg.FPN.FPN_ON:  # FPN (feature pyramid) enabled
        FPN_lib.add_fpn_rpn_outputs(  # build the RPN with FPN
            model, blob_conv, dim_conv, spatial_scale_conv,
            time_dim=out_time_dim)
        model.CollectAndDistributeFpnRpnProposals()
    else:
        # build a plain, single-level RPN
        add_rpn_outputs(model, blob_conv, dim_conv, spatial_scale_conv,
                        nd=head_3d, time_dim=out_time_dim)
if cfg.FPN.FPN_ON:
    # Code only supports case when RPN and ROI min levels are the same
    assert cfg.FPN.RPN_MIN_LEVEL == cfg.FPN.ROI_MIN_LEVEL  # both are 2
    # FPN RPN max level might be > FPN ROI max level in which case we
    # need to discard some leading conv blobs (blobs are ordered from
    # max level to min level)
    num_roi_levels = cfg.FPN.ROI_MAX_LEVEL - cfg.FPN.ROI_MIN_LEVEL + 1  # 5-2+1 = 4
    blob_conv = blob_conv[-num_roi_levels:]  # keep the last four: levels 2, 3, 4, 5
    spatial_scale_conv = spatial_scale_conv[-num_roi_levels:]
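The slicing at the end is easier to see with the level ordering written out. A small sketch (the blob names here are illustrative, not the repo's actual blob names):

```python
# Blobs come out ordered from the coarsest to the finest level (6 -> 2).
rpn_blobs = ['fpn6', 'fpn5', 'fpn4', 'fpn3', 'fpn2']

ROI_MAX_LEVEL, ROI_MIN_LEVEL = 5, 2
num_roi_levels = ROI_MAX_LEVEL - ROI_MIN_LEVEL + 1  # 4

# Taking the last four drops the leading (coarsest) RPN-only level 6
# and keeps exactly the ROI levels 5, 4, 3, 2.
roi_blobs = rpn_blobs[-num_roi_levels:]
print(roi_blobs)  # ['fpn5', 'fpn4', 'fpn3', 'fpn2']
```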
2. FPN3D.py
add_fpn_rpn_outputs() is the RPN construction itself. A for loop walks over the levels; inside it, the if branch handles FPN level 2 on its own, while the else branch handles levels 3, 4, 5 and 6.
def add_fpn_rpn_outputs(model, blobs_in, dim_in, spatial_scales, time_dim):
    # What exactly is blobs_in, and why is it per-level? Were the per-level
    # feats already stored when building the ResNet?
    num_anchors = len(cfg.FPN.RPN_ASPECT_RATIOS)  # 3 ratios: (0.5, 1, 2)
    dim_out = dim_in
    raise NotImplementedError('Redo bbox_targets like in model_builder.py')
    if cfg.VIDEO.DEBUG_USE_RPN_GT:
        raise NotImplementedError('Need to implement this similar to non-FPN')
    k_max = cfg.FPN.RPN_MAX_LEVEL  # coarsest level of the pyramid: 6
    k_min = cfg.FPN.RPN_MIN_LEVEL  # finest level of the pyramid: 2
    assert len(blobs_in) == k_max - k_min + 1  # 5 levels in total
    for lvl in range(k_min, k_max + 1):  # iterate over levels 2 to 6
        bl_in = blobs_in[k_max - lvl]  # blobs_in is in reversed order
        sc = spatial_scales[k_max - lvl]  # in reversed order
        slvl = str(lvl)
        # build the RPN for pyramid level 2
        if lvl == k_min:
            # ★ Create conv ops with randomly initialized weights and
            # zeroed biases for the first FPN level; these will be shared by
            # all other FPN levels
            # RPN hidden representation
            conv_rpn_fpn = model.ConvNd(
                bl_in, 'conv_rpn_fpn' + slvl,  # conv_rpn_fpn2
                dim_in, dim_out, [cfg.VIDEO.TIME_KERNEL_DIM.HEAD_RPN, 3, 3],
                pads=2 * [cfg.VIDEO.TIME_KERNEL_DIM.HEAD_RPN // 2, 1, 1],
                strides=[1, 1, 1],
                weight_init=('GaussianFill', {'std': 0.01}),  # randomly initialized weights
                bias_init=('ConstantFill', {'value': 0.}))  # zeroed biases
            model.Relu(conv_rpn_fpn, conv_rpn_fpn)
            # ★ Convert to 2D. Earlier was averaging in time, now moving to
            # channel: the time dimension is folded into the channels
            conv_rpn_fpn_timepool = model.MoveTimeToChannelDim(
                conv_rpn_fpn, 'conv_rpn_timepooled_fpn' + slvl)  # conv_rpn_timepooled_fpn2
            # ★ Proposal classification scores: the pooled feat turns into
            # logits (pre-probability scores) here
            rpn_cls_logits_fpn = model.Conv(
                conv_rpn_fpn_timepool, 'rpn_cls_logits_fpn' + slvl,  # rpn_cls_logits_fpn2
                dim_out * time_dim, num_anchors, 1, pads=0, stride=1,  # one score per anchor
                weight_init=('GaussianFill', {'std': 0.01}),
                bias_init=('ConstantFill', {'value': 0.}))
            # ★ Proposal bbox regression deltas
            rpn_bbox_pred_fpn = model.Conv(
                conv_rpn_fpn_timepool, 'rpn_bbox_pred_fpn' + slvl,  # rpn_bbox_pred_fpn2
                dim_out * time_dim, 4 * time_dim * num_anchors, 1, pad=0,
                stride=1,  # 4 coordinates per anchor, across the 3 frames
                weight_init=('GaussianFill', {'std': 0.01}),
                bias_init=('ConstantFill', {'value': 0.}))
            # Proposal visibility classification scores (not used yet)
            # TODO(rgirdhar): Need to use this in future
            # rpn_vis_cls_logits_fpn =
            model.Conv(
                conv_rpn_fpn_timepool, 'rpn_vis_cls_logits_fpn' + slvl,
                dim_out * time_dim, num_anchors * time_dim, 1, pads=0,
                stride=1,  # total anchor count over the 3 frames
                weight_init=('GaussianFill', {'std': 0.01}),
                bias_init=('ConstantFill', {'value': 0.}))
        # levels 3, 4, 5 and 6 are all handled here
        else:
            # Share weights and biases
            sk_min = str(k_min)  # this is 2
            # ★ RPN hidden representation (the feat goes through one more conv)
            conv_rpn_fpn = model.ConvShared(  # add a conv op that shares weights and biases with another conv op
                bl_in, 'conv_rpn_fpn' + slvl,
                dim_in, dim_out, [3, 3, 3],
                pads=2 * [1, 1, 1], strides=[1, 1, 1],
                nd=True,
                weight='conv_rpn_fpn' + sk_min + '_w',
                bias='conv_rpn_fpn' + sk_min + '_b')
            model.Relu(conv_rpn_fpn, conv_rpn_fpn)
            # ★ Convert to 2D. Earlier was averaging in time, now moving to
            # channel: this folds the time dimension away
            conv_rpn_fpn_timepool = model.MoveTimeToChannelDim(
                conv_rpn_fpn, 'conv_rpn_timepooled_fpn' + slvl)
            # ★ Proposal classification scores
            rpn_cls_logits_fpn = model.ConvShared(
                conv_rpn_fpn_timepool, 'rpn_cls_logits_fpn' + slvl,
                dim_out * time_dim, num_anchors, 1, pad=0, stride=1,  # num_anchors output channels: per-anchor classification scores
                weight='rpn_cls_logits_fpn' + sk_min + '_w',
                bias='rpn_cls_logits_fpn' + sk_min + '_b')
            # ★ Proposal bbox regression deltas
            rpn_bbox_pred_fpn = model.ConvShared(
                conv_rpn_fpn_timepool, 'rpn_bbox_pred_fpn' + slvl,
                dim_out * time_dim, 4 * time_dim * num_anchors,  # 4 coordinates per anchor in each of the 3 frames
                1, pad=0, stride=1,
                weight='rpn_bbox_pred_fpn' + sk_min + '_w',
                bias='rpn_bbox_pred_fpn' + sk_min + '_b')
            # Proposal visibility classification scores
            # TODO(rgirdhar): Need to use this in future
            # rpn_vis_cls_logits_fpn =
            model.ConvShared(
                conv_rpn_fpn_timepool, 'rpn_vis_cls_logits_fpn' + slvl,
                dim_out * time_dim, num_anchors * time_dim, 1, pad=0, stride=1,
                weight='rpn_vis_cls_logits_fpn' + sk_min + '_w',
                bias='rpn_vis_cls_logits_fpn' + sk_min + '_b')
        # runs at inference time, and always when Faster R-CNN is enabled,
        # to generate the Proposals
        if not model.train or cfg.MODEL.FASTER_RCNN:
            # Add op that generates RPN proposals
            # The proposals are needed when generating proposals from an
            # RPN-only model at inference time, but *not* when training it
            lvl_anchors = generate_anchors(
                stride=2. ** lvl,
                sizes=(cfg.FPN.RPN_ANCHOR_START_SIZE * 2. ** (lvl - k_min), ),
                aspect_ratios=cfg.FPN.RPN_ASPECT_RATIOS,
                time_dim=time_dim)
            rpn_cls_probs_fpn = model.net.Sigmoid(
                rpn_cls_logits_fpn, 'rpn_cls_probs_fpn' + slvl)
            # Need to use this in future
            # rpn_vis_cls_probs_fpn = model.net.Sigmoid(
            #     rpn_cls_logits_fpn, 'rpn_vis_cls_probs_fpn' + slvl)
            model.GenerateProposals(
                [rpn_cls_probs_fpn, rpn_bbox_pred_fpn, 'im_info'],
                ['rpn_rois_fpn' + slvl, 'rpn_roi_probs_fpn' + slvl],
                anchors=lvl_anchors,
                spatial_scale=sc)
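Plugging numbers into the per-level formulas passed to generate_anchors (stride = 2**lvl, size = RPN_ANCHOR_START_SIZE * 2**(lvl - k_min)) makes the pyramid concrete. A minimal sketch, assuming RPN_ANCHOR_START_SIZE is 32 (an illustrative value, not necessarily this repo's config):

```python
RPN_MIN_LEVEL, RPN_MAX_LEVEL = 2, 6
RPN_ANCHOR_START_SIZE = 32  # illustrative; check cfg.FPN.RPN_ANCHOR_START_SIZE

levels = {}
for lvl in range(RPN_MIN_LEVEL, RPN_MAX_LEVEL + 1):
    stride = 2 ** lvl                                          # stride=2.**lvl
    size = RPN_ANCHOR_START_SIZE * 2 ** (lvl - RPN_MIN_LEVEL)  # sizes=(...,)
    levels[lvl] = (stride, size)

print(levels)  # {2: (4, 32), 3: (8, 64), 4: (16, 128), 5: (32, 256), 6: (64, 512)}
```

So each pyramid level gets a single anchor size matched to its stride: finer levels detect small objects with small anchors, coarser levels detect large objects with large anchors, while the 3 aspect ratios are shared across all levels.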
IV. Open Questions
1. I don't see a lateral-connection step here; Cls and Reg are applied right after the convolution. Does this code not use that structure?
2. When reading the ResNet construction, I didn't see the per-level C feature maps being stored, yet blobs_in is passed straight into the RPN builder. When were the C levels stored?
3. What does the "Shared" in model.ConvShared mean? Which parameters are shared, and where?
Comments welcome!
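On question 3, one plausible reading, sketched with toy stand-ins rather than the actual Caffe2 API: the level-2 ops (ConvNd/Conv) create the parameter blobs under their own names, and ConvShared at levels 3-6 merely references those blobs via its weight=/bias= arguments, so a single set of parameters serves every pyramid level:

```python
# Toy registry standing in for the workspace's parameter blobs.
params = {}

def conv_new(name):
    # Level 2 creates fresh weight and bias blobs under its own name.
    params[name + '_w'] = 'weights'
    params[name + '_b'] = 'biases'
    return name

def conv_shared(name, weight, bias):
    # Levels 3-6 pass weight='conv_rpn_fpn2_w' etc. and reuse the existing
    # blobs: no new parameters are created for these levels.
    assert weight in params and bias in params
    return name

conv_new('conv_rpn_fpn2')
for lvl in (3, 4, 5, 6):
    conv_shared('conv_rpn_fpn' + str(lvl),
                weight='conv_rpn_fpn2_w', bias='conv_rpn_fpn2_b')

print(len(params))  # 2 -- one weight blob and one bias blob for all 5 levels
```

This matches the source comment that level 2's conv ops "will be shared by all other FPN levels", and mirrors the FPN paper's design choice of sharing the RPN head across pyramid levels.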