The core innovations of Faster R-CNN are the RPN network and its downstream classifier network. Structurally, both networks are actually very simple; the novelty is mostly conceptual: anchors are defined on the original image, while the ROIs are taken from the feature map. The hard part is implementing that idea.
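To make the "anchors live on the original image" idea concrete, here is a minimal NumPy sketch (the helper name and the scale/ratio values are illustrative; the 16× stride matches VGG's downsampling) that enumerates the 9 anchors centred at one feature-map cell:

```python
import numpy as np

def anchors_for_cell(fx, fy, stride=16,
                     scales=(128, 256, 512),
                     ratios=(0.5, 1.0, 2.0)):
    """Return the 9 anchors (x1, y1, x2, y2) in original-image
    coordinates for the feature-map cell (fx, fy)."""
    cx, cy = fx * stride, fy * stride  # map the cell centre back to the image
    boxes = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)   # keep the anchor area ~ s*s while varying aspect ratio
            h = s / np.sqrt(r)
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)

print(anchors_for_cell(18, 18).shape)  # (9, 4)
```

Each of the 37×37 cells gets the same 9 anchor shapes, which is why the RPN below predicts 9 scores and 9×4 regression offsets per position.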
The RPN and classifier code lives inside the corresponding base-network file. Taking VGG as the example, both networks are defined in keras_frcnn/vgg.py. The annotated code follows:
# Input: base_layers (37, 37, 512), num_anchors = 9; the 37 comes from 600/16 ≈ 37
# Output: [x_class, x_regr, base_layers]
#   x_class: (37, 37, 9), binary (object vs. background) score for each anchor at each position
#   x_regr: (37, 37, 36 = 9*4), linear activation, regression parameters for each anchor
#   base_layers: (37, 37, 512), the VGG feature-map output
def rpn(base_layers, num_anchors):
    # a 3x3 conv first, producing (37, 37, 256)
    x = Conv2D(256, (3, 3), padding='same', activation='relu', kernel_initializer='normal', name='rpn_conv1')(base_layers)
    # 9 (1, 1) filters over (37, 37, 256) give (37, 37, 9): a binary score per anchor per position
    x_class = Conv2D(num_anchors, (1, 1), activation='sigmoid', kernel_initializer='uniform', name='rpn_out_class')(x)
    # 36 (1, 1) filters over (37, 37, 256) give (37, 37, 36 = 9*4): linear activation, regression parameters per anchor
    x_regr = Conv2D(num_anchors * 4, (1, 1), activation='linear', kernel_initializer='zero', name='rpn_out_regress')(x)
    return [x_class, x_regr, base_layers]
# Input:
#   base_layers: the VGG feature map, (37, 37, 512)
#   input_rois: the ROIs for one image, i.e. the proposals on the feature map
#   num_rois: number of ROIs processed at a time
#   nb_classes: total number of annotated classes, 20 foreground classes + 1 background class
# Output: [out_class, out_regr]
#   out_class: class prediction for each ROI
#   out_regr: regression parameters for each ROI
# Classification and regression share the same network; both take the ROIs as input.
# The regression here runs a second bounding-box regression on the proposals to obtain more precise boxes.
def classifier(base_layers, input_rois, num_rois, nb_classes=21, trainable=False):
    # compile times on theano tend to be very high, so we use smaller ROI pooling regions to work around it
    if K.backend() == 'tensorflow':
        pooling_regions = 7
        input_shape = (num_rois, 7, 7, 512)
    elif K.backend() == 'theano':
        pooling_regions = 7
        input_shape = (num_rois, 512, 7, 7)
    # RoiPoolingConv is a custom Keras layer: much like a pooling layer shrinking (4, 4) to (1, 1),
    # it rescales the ROIs on the feature map so that ROIs of different sizes all become (7, 7)
    out_roi_pool = RoiPoolingConv(pooling_regions, num_rois)([base_layers, input_rois])
    # out_roi_pool has shape (1, self.num_rois, self.pool_size, self.pool_size, self.nb_channels),
    # so the TimeDistributed layers below run on each pooled ROI in turn, i.e. the per-slice
    # input shape is (self.pool_size, self.pool_size, self.nb_channels) = (7, 7, 512)
    out = TimeDistributed(Flatten(name='flatten'))(out_roi_pool)  # TimeDistributed applies a layer to every temporal slice of an input
    out = TimeDistributed(Dense(4096, activation='relu', name='fc1'))(out)  # two fully connected layers produce the 4096-d vector fed to the softmax
    out = TimeDistributed(Dense(4096, activation='relu', name='fc2'))(out)
    out_class = TimeDistributed(Dense(nb_classes, activation='softmax', kernel_initializer='zero'), name='dense_class_{}'.format(nb_classes))(out)
    # note: no regression target for the background class
    out_regr = TimeDistributed(Dense(4 * (nb_classes-1), activation='linear', kernel_initializer='zero'), name='dense_regress_{}'.format(nb_classes))(out)
    return [out_class, out_regr]
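The essential job of RoiPoolingConv is to turn variable-sized ROI crops of the feature map into a fixed (7, 7, channels) tensor. A much-simplified NumPy stand-in (hypothetical helper name; integer ROI coordinates in feature-map cells, max-pooling over a 7×7 grid of bins) shows the idea:

```python
import numpy as np

def roi_pool(feature_map, roi, pool_size=7):
    """Simplified stand-in for RoiPoolingConv: crop the ROI
    (x, y, w, h in feature-map cells) and max-pool it to
    (pool_size, pool_size, channels)."""
    x, y, w, h = roi
    crop = feature_map[y:y + h, x:x + w, :]
    out = np.zeros((pool_size, pool_size, feature_map.shape[2]))
    # split the crop into a pool_size x pool_size grid of bins
    ys = np.linspace(0, h, pool_size + 1).astype(int)
    xs = np.linspace(0, w, pool_size + 1).astype(int)
    for i in range(pool_size):
        for j in range(pool_size):
            bin_ = crop[ys[i]:max(ys[i + 1], ys[i] + 1),
                        xs[j]:max(xs[j + 1], xs[j] + 1), :]
            out[i, j] = bin_.max(axis=(0, 1))  # max-pool each bin
    return out

fmap = np.random.rand(37, 37, 512)
print(roi_pool(fmap, (5, 5, 20, 14)).shape)  # (7, 7, 512)
```

The actual keras-frcnn layer does the equivalent with a backend resize inside the graph, but the output contract is the same: every ROI comes out as (7, 7, 512), ready for the shared fully connected layers above.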