Understanding FPN

大郎拱白菜

Original article: https://zhuanlan.zhihu.com/p/35854548

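This section is excerpted from the Zhihu article "从代码细节理解 FPN" (Understanding FPN from the Code Details), whose author uses the Mask R-CNN source code to help explain the FPN structure. For the project, see MRCNN.

1. How is upsampling done?

How are high-level features upsampled and fused with the features of the level below? The code shows: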

P5 = KL.Conv2D(256, (1, 1), name='fpn_c5p5')(C5)

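C5 is the output of the topmost ResNet stage. It first passes through a 1x1 convolution that converts its channel count to 256, which produces P5, the top level of the FPN.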

KL.UpSampling2D(size=(2, 2), name="fpn_p5upsampled")(P5)

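The Keras API documentation describes UpSampling2D as repeating the rows and columns of the data by size[0] and size[1] respectively.

In other words, this implementation uses the simplest form of upsampling: no linear interpolation and no transposed convolution (deconvolution), just direct copying of values.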
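The copy-based upsampling described above can be sketched in plain NumPy. This is only an illustration of the idea, not the Keras implementation; the function name `upsample_nearest_2x` and the tiny 2x2 input are assumptions made for the example.

```python
import numpy as np


def upsample_nearest_2x(feature_map: np.ndarray) -> np.ndarray:
    """Repeat every row and every column twice (copy-based upsampling).

    feature_map has shape (height, width, channels).
    """
    doubled_rows = np.repeat(feature_map, 2, axis=0)
    return np.repeat(doubled_rows, 2, axis=1)


# A hypothetical 2x2 single-channel "P5" so the copying is easy to see.
p5 = np.arange(4, dtype=np.float32).reshape(2, 2, 1)
p5_up = upsample_nearest_2x(p5)
print(p5_up[:, :, 0])
```

Each input value ends up as a 2x2 block in the output, so no new values are computed, which is why this is cheaper than interpolation or a learned transposed convolution.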

2. How are the lateral connections built?

P4 = KL.Add(name="fpn_p4add")([
    KL.UpSampling2D(size=(2, 2), name="fpn_p5upsampled")(P5),
    KL.Conv2D(256, (1, 1), name='fpn_c4p4')(C4)])

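It is clear here that P4 is the upsampled P5 plus C4 after a 1x1 convolution. The lateral connection is simply an element-wise (per-pixel) addition: P5 is upsampled to the spatial size of C4, C4 is projected to the same 256 channels, and the two are added directly.

Note that the feature maps extracted from ResNet go through a 1x1 convolution:

In my view the 1x1 convolution serves three purposes: it reduces the corresponding bottom-up layer to 256 channels; it acts as a buffer that keeps gradients from directly affecting the bottom-up backbone, which makes training more stable; and it combines features across channels.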
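To make the lateral connection concrete, here is a minimal NumPy sketch. A 1x1 convolution is the same linear map applied to the channel vector at every pixel, so it can be written as a matrix product. The helper names, the random weights and the small channel counts are assumptions for illustration; in the code above the output has 256 channels.

```python
import numpy as np

rng = np.random.default_rng(0)


def conv1x1(feature_map, weights, bias):
    """1x1 convolution: one linear map shared by all pixels.

    feature_map: (H, W, C_in), weights: (C_in, C_out), bias: (C_out,)
    """
    return feature_map @ weights + bias


def upsample_nearest_2x(feature_map):
    """Copy-based 2x upsampling along height and width."""
    return feature_map.repeat(2, axis=0).repeat(2, axis=1)


# Hypothetical small tensors standing in for the ResNet outputs C5 and C4.
c5 = rng.standard_normal((4, 4, 32))
c4 = rng.standard_normal((8, 8, 16))
out_channels = 8  # stands in for 256

w5, b5 = rng.standard_normal((32, out_channels)), np.zeros(out_channels)
w4, b4 = rng.standard_normal((16, out_channels)), np.zeros(out_channels)

p5 = conv1x1(c5, w5, b5)                            # channels: 32 -> 8
p4 = upsample_nearest_2x(p5) + conv1x1(c4, w4, b4)  # same shape, so plain addition works
print(p5.shape, p4.shape)
```

The addition is only possible because both operands have been brought to the same shape: upsampling fixes the spatial size, and the 1x1 convolution fixes the channel count.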

3. How is the FPN top-down pathway implemented in code?

# First extract the feature maps C2-C5 from four different ResNet stages.
_, C2, C3, C4, C5 = resnet_graph(input_image, config.BACKBONE,
                                 stage5=True, train_bn=config.TRAIN_BN)
# Top-down layers: build the top-down pathway.
# Start from C5: a 1x1 convolution first converts its channel count to 256.
P5 = KL.Conv2D(256, (1, 1), name='fpn_c5p5')(C5)
# P4 is the upsampled P5 added element-wise to the 1x1-convolved C4;
# the remaining levels follow the same pattern.
P4 = KL.Add(name="fpn_p4add")([
    KL.UpSampling2D(size=(2, 2), name="fpn_p5upsampled")(P5),
    KL.Conv2D(256, (1, 1), name='fpn_c4p4')(C4)])
P3 = KL.Add(name="fpn_p3add")([
    KL.UpSampling2D(size=(2, 2), name="fpn_p4upsampled")(P4),
    KL.Conv2D(256, (1, 1), name='fpn_c3p3')(C3)])
P2 = KL.Add(name="fpn_p2add")([
    KL.UpSampling2D(size=(2, 2), name="fpn_p3upsampled")(P3),
    KL.Conv2D(256, (1, 1), name='fpn_c2p2')(C2)])
# Attach 3x3 conv to all P layers to get the final feature maps.
# This final 3x3 convolution reduces the aliasing effect of upsampling.
P2 = KL.Conv2D(256, (3, 3), padding="SAME", name="fpn_p2")(P2)
P3 = KL.Conv2D(256, (3, 3), padding="SAME", name="fpn_p3")(P3)
P4 = KL.Conv2D(256, (3, 3), padding="SAME", name="fpn_p4")(P4)
P5 = KL.Conv2D(256, (3, 3), padding="SAME", name="fpn_p5")(P5)
# P6 is used for the 5th anchor scale in RPN. Generated by
# subsampling from P5 with stride of 2.
P6 = KL.MaxPooling2D(pool_size=(1, 1), strides=2, name="fpn_p6")(P5)
# Note that P6 is used in RPN, but not in the classifier heads.
rpn_feature_maps = [P2, P3, P4, P5, P6]

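The result is a list of five feature maps that fuse features from different levels.

Note that P6 is used only in the RPN (region proposal network), not in the classifier heads.

In addition, P2-P5 each go through one more 3x3 convolution at the end, whose purpose is to reduce the aliasing effect introduced by upsampling.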
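As a rough sanity check on the pyramid built above, the sketch below computes the spatial size of each level and shows that max pooling with a 1x1 window and stride 2 (how P6 is produced) amounts to keeping every other pixel. The strides 4, 8, 16, 32, 64 for P2-P6 and the 1024x1024 input are assumptions for the example, as are the helper names.

```python
import numpy as np


def pyramid_shapes(image_size, strides=(4, 8, 16, 32, 64)):
    """Spatial size of P2..P6 for a square input, given each level's stride."""
    return {f"P{level}": image_size // stride
            for level, stride in zip(range(2, 7), strides)}


def subsample_stride2(feature_map):
    """A 1x1 max pool with stride 2 just keeps every other row and column."""
    return feature_map[::2, ::2, :]


shapes = pyramid_shapes(1024)
p5 = np.zeros((shapes["P5"], shapes["P5"], 256), dtype=np.float32)
p6 = subsample_stride2(p5)
print(shapes)
print(p6.shape)
```

Because the pooling window covers a single pixel, there is no actual "max" being taken; P6 is a plain 2x subsampling of P5.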

4. How is the feature map level used for ROI pooling chosen for a given ROI?

See the code:

# Assign each ROI to a level in the pyramid based on the ROI area.
# boxes holds the ROI boxes; split the coordinates to get each ROI's height and width.
y1, x1, y2, x2 = tf.split(boxes, 4, axis=2)
h = y2 - y1
w = x2 - x1
# Use shape of first image. Images in a batch must have the same size.
# This gives the size of the input image, from which its area is computed.
image_shape = parse_image_meta_graph(image_meta)['image_shape'][0]
# Equation 1 in the Feature Pyramid Networks paper. Account for
# the fact that our coordinates are normalized here.
# e.g. a 224x224 ROI (in pixels) maps to P4
# Area of the input image
image_area = tf.cast(image_shape[0] * image_shape[1], tf.float32)
# Compute in two steps which pyramid level each ROI should be pooled from.
roi_level = log2_graph(tf.sqrt(h * w) / (224.0 / tf.sqrt(image_area)))
roi_level = tf.minimum(5, tf.maximum(
    2, 4 + tf.cast(tf.round(roi_level), tf.int32)))

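ROIs of different scales use different feature levels as the input to the ROI pooling layer: a large ROI uses a later (coarser) pyramid level such as P5, while a small ROI uses an earlier (finer) level such as P4. So how is the level for a given ROI decided? The paper computes the level k with the formula below; the code changes it slightly and names the result roi_level:

k = floor(k0 + log2(sqrt(w * h) / 224))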

# The code replaces the computation with the following:
roi_level = min(5, max(2, 4 + round(log2(sqrt(w * h) / (224 / sqrt(image_area))))))

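224 is the standard ImageNet input size. k0 is the base level; the paper sets it to 4, which matches the constant 4 in the code, so an ROI of 224 x 224 pixels is assigned to P4. w and h are the width and height of the ROI, and image_area is the width of the input image times its height, i.e. its area. Because the ROI coordinates in the code are normalized, 224 is divided by sqrt(image_area) to express it in the same units. For example, a 112 x 112 ROI gives k = k0 - 1 = 4 - 1 = 3, so that ROI uses the P3 feature level. The value of k is rounded so that the result is always an integer, and it is clamped to the range 2 to 5.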
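The level assignment can be tried out with a small pure-Python sketch that mirrors the TensorFlow code above for a single ROI. The function name `roi_level`, its parameter names and the 1024x1024 image size are assumptions for illustration; the base level 4 and the clamp to 2..5 are taken from the code.

```python
import math


def roi_level(h, w, image_height, image_width,
              base_level=4, min_level=2, max_level=5):
    """Pyramid level for one ROI whose height h and width w are normalized to [0, 1]."""
    image_area = float(image_height * image_width)
    # 224 pixels expressed in the same normalized units as h and w.
    canonical = 224.0 / math.sqrt(image_area)
    offset = round(math.log2(math.sqrt(h * w) / canonical))
    return min(max_level, max(min_level, base_level + offset))


side = 1024  # hypothetical square input image
for pixels in (32, 112, 224, 448, 1024):
    level = roi_level(pixels / side, pixels / side, side, side)
    print(f"{pixels}x{pixels} ROI -> P{level}")
```

Halving the side of the ROI lowers the level by one and doubling it raises the level by one, until the result hits the clamp at P2 or P5.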

5. How are the five fused feature maps obtained above used?

As the code shows, only the four feature maps P2-P5 are used here:

for i, level in enumerate(range(2, 6)):
    # First find the ROIs that must be pooled from this level.
    ix = tf.where(tf.equal(roi_level, level))
    level_boxes = tf.gather_nd(boxes, ix)
    # Box indices for crop_and_resize.
    box_indices = tf.cast(ix[:, 0], tf.int32)
    # Keep track of which box is mapped to which level
    box_to_level.append(ix)
    # Stop gradient propagation to ROI proposals
    level_boxes = tf.stop_gradient(level_boxes)
    box_indices = tf.stop_gradient(box_indices)
    # Crop and Resize
    # From Mask R-CNN paper: "We sample four regular locations, so
    # that we can evaluate either max or average pooling. In fact,
    # interpolating only a single value at each bin center (without
    # pooling) is nearly as effective."
    #
    # Here we use the simplified approach of a single value per bin,
    # which is how it's done in tf.crop_and_resize()
    # Result: [batch * num_boxes, pool_height, pool_width, channels]
    # Use tf.image.crop_and_resize to perform the ROI pooling.
    pooled.append(tf.image.crop_and_resize(
        feature_maps[i], level_boxes, box_indices, self.pool_shape,
        method="bilinear"))

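For each level, the boxes assigned to that level are cropped from that level's feature map, and the results are collected into one large feature list, pooled.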
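The grouping logic of the loop can be sketched with NumPy alone. This is a simplified stand-in: instead of actually cropping feature maps, each box is tagged with the level it was routed to, and the final `argsort` shows one way to bring the level-grouped results back into the original box order. The sample levels and boxes are made up for the example.

```python
import numpy as np

roi_levels = np.array([4, 2, 5, 2, 3])                 # hypothetical level per box
boxes = np.arange(20, dtype=np.float32).reshape(5, 4)  # hypothetical (y1, x1, y2, x2) rows

pooled, box_to_level = [], []
for level in range(2, 6):
    ix = np.where(roi_levels == level)[0]  # boxes assigned to this level
    # Stand-in for crop_and_resize on this level's feature map:
    # append the level as an extra column so the routing is visible.
    pooled.append(np.column_stack([boxes[ix], np.full(len(ix), level)]))
    box_to_level.append(ix)

pooled = np.concatenate(pooled, axis=0)      # grouped by level, not by box
box_to_level = np.concatenate(box_to_level)  # original index of each pooled row

# Undo the grouping so that row i again belongs to box i.
restore = np.argsort(box_to_level)
pooled_in_order = pooled[restore]
print(pooled_in_order)
```

Keeping `box_to_level` is what makes the reordering possible, which is why the loop in the original code records it for every level.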

6. What does it mean that all pyramid levels share the classification layers?

First, the code:

# ROI Pooling
# Shape: [batch, num_boxes, pool_height, pool_width, channels]
# The list of features obtained after ROI pooling
x = PyramidROIAlign([pool_size, pool_size],
                    name="roi_align_classifier")([rois, image_meta] + feature_maps)
# Feed these features through two 1024-channel convolution layers, each followed by a ReLU.
# Two 1024 FC layers (implemented with Conv2D for consistency)
x = KL.TimeDistributed(KL.Conv2D(1024, (pool_size, pool_size), padding="valid"),
                       name="mrcnn_class_conv1")(x)
x = KL.TimeDistributed(BatchNorm(), name='mrcnn_class_bn1')(x, training=train_bn)
x = KL.Activation('relu')(x)
x = KL.TimeDistributed(KL.Conv2D(1024, (1, 1)),
                       name="mrcnn_class_conv2")(x)
x = KL.TimeDistributed(BatchNorm(), name='mrcnn_class_bn2')(x, training=train_bn)
x = KL.Activation('relu')(x)
shared = KL.Lambda(lambda x: K.squeeze(K.squeeze(x, 3), 2),
                   name="pool_squeeze")(x)
# Classifier head
mrcnn_class_logits = KL.TimeDistributed(KL.Dense(num_classes),
                                        name='mrcnn_class_logits')(shared)
mrcnn_probs = KL.TimeDistributed(KL.Activation("softmax"),
                                 name="mrcnn_class")(mrcnn_class_logits)
# BBox head: regression of the bounding-box offsets
# [batch, boxes, num_classes * (dy, dx, log(dh), log(dw))]
x = KL.TimeDistributed(KL.Dense(num_classes * 4, activation='linear'),
                       name='mrcnn_bbox_fc')(shared)
# Reshape to [batch, boxes, num_classes, (dy, dx, log(dh), log(dw))]
s = K.int_shape(x)
mrcnn_bbox = KL.Reshape((s[1], num_classes, 4), name="mrcnn_bbox")(x)

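The x returned by PyramidROIAlign is the list of features extracted from the per-level feature maps in the previous step. It first passes through two 1024-channel convolution layers and is then fed to the classification layer and the regression layer to produce the final results.

In other words, every ROI gets its feature from exactly one of the levels P2-P5 and is then sent through the same classification and regression network to produce the final result.

The head parameters are shared across all FPN levels. The authors argue that because sharing parameters works well, all levels of the pyramid must have similar semantics.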
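The idea of a shared head can be illustrated with a deliberately simplified NumPy sketch: one set of weights is created once and applied to ROI features regardless of which pyramid level they were pooled from. The sizes, the random weights and the function name `classify` are assumptions for illustration; the batch normalization, the second 1024-unit layer and the bounding-box branch of the real head are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
pool_size, channels, hidden, num_classes = 7, 8, 16, 3  # hypothetical small sizes

# One set of head weights, created once and reused for every ROI.
w_fc = rng.standard_normal((pool_size * pool_size * channels, hidden))
w_cls = rng.standard_normal((hidden, num_classes))


def classify(roi_features):
    """roi_features: (num_rois, pool_size, pool_size, channels) -> class probabilities."""
    flat = roi_features.reshape(len(roi_features), -1)
    hidden_act = np.maximum(flat @ w_fc, 0.0)  # ReLU
    logits = hidden_act @ w_cls
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    return exp / exp.sum(axis=1, keepdims=True)


# ROI features pooled from different levels have the same shape after ROI pooling,
# so they can be stacked and pushed through the same head.
rois_from_p2 = rng.standard_normal((2, pool_size, pool_size, channels))
rois_from_p5 = rng.standard_normal((3, pool_size, pool_size, channels))
probs = classify(np.concatenate([rois_from_p2, rois_from_p5]))
print(probs.shape)
```

This works only because every level outputs the same number of channels and ROI pooling produces a fixed spatial size, so the head never needs to know which level a feature came from.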

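7. What is the underlying idea?

Pass the high-level features down to supplement the semantics of the lower levels. This yields features that are both high-resolution and semantically strong, which helps the detection of small objects.

8. What role do the lateral connections play?

If the features are not fused (that is, all the 1x1 lateral connections are removed), the resolution is in theory unchanged and the semantics are still strengthened, yet AR drops by about 10 percentage points. The authors argue that these features have been downsampled and upsampled too many times, which makes them poorly suited to localization. The bottom-up features carry more precise location information.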
