Update:
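I recommend first reading the Zhihu article 从编程实现角度学习Faster R-CNN ("Learning Faster R-CNN from the Perspective of Programming Implementation"), which is more intuitive. The notes here look rather messy because the source code is highly abstracted.
In faster_rcnn_meta_arch.py, the following two correspond to the 3*3 and 1*1 convolutions that the RPN contains in the Zhihu article: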
rpn_box_predictor_features = slim.conv2d(rpn_features_to_crop
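self._first_stage_box_predictor=box_predictor.ConvolutionalBoxPredictor
The AnchorTargetCreator in the Zhihu article uses IoU to select 256 anchors out of more than 20000 candidate anchors for classification and box regression (computing the RPN loss). It corresponds to: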
target_assigner.batch_assign_targets;
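self._first_stage_sampler=sampler.BalancedPositiveNegativeSampler, which operates on first_stage_minibatch_size;
Here 20000 is determined by the size of the feature map fed into the RPN and the number of anchor types, and 256 corresponds to first_stage_minibatch_size (see protos/faster_rcnn.proto);
In short, all of this happens in def _loss_rpn. (proposal=2000)
The ProposalCreator in the Zhihu article: in the RPN, out of more than ten thousand anchors, select a certain number (e.g. 12000/6000) by probability, adjust their size and position, apply NMS, and keep the 2000/300 with the highest probability to generate RoIs. It corresponds to: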
def _postprocess_rpn
first_stage_max_proposals=300
The ProposalTargetCreator in the Zhihu article selects a subset (e.g. 128) of the 2000/300 candidates to be pooled for training Fast R-CNN. It corresponds to:
hard_example_miner is not used
_unpad_proposals_and_sample_box_classifier_batch
second_stage_batch_size=64
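The first-stage sampling step mentioned in this update (choosing a minibatch of 256 anchors, at most half of them positive) can be sketched in plain Python. This is only an illustration under assumptions: balanced_sample, its arguments, and the toy label counts are hypothetical and are not the library's BalancedPositiveNegativeSampler, which works on tensors.

```python
import random


def balanced_sample(labels, batch_size=256, positive_fraction=0.5, seed=0):
    """Pick up to batch_size anchors, at most positive_fraction of them positive.

    labels: one entry per anchor, 1 (positive), 0 (negative) or -1 (ignored).
    Returns (positive_indices, negative_indices).
    """
    rng = random.Random(seed)
    positives = [i for i, lab in enumerate(labels) if lab == 1]
    negatives = [i for i, lab in enumerate(labels) if lab == 0]
    max_pos = int(batch_size * positive_fraction)
    num_pos = min(len(positives), max_pos)
    # Negatives fill whatever part of the batch the positives leave unused.
    num_neg = min(len(negatives), batch_size - num_pos)
    return rng.sample(positives, num_pos), rng.sample(negatives, num_neg)


# Toy input (made-up counts): 20000 anchors, 40 positive, 15000 negative.
labels = [1] * 40 + [0] * 15000 + [-1] * 4960
pos_idx, neg_idx = balanced_sample(labels)
print(len(pos_idx), len(neg_idx))  # 40 216
```

When there are fewer than 128 positives, the remaining slots go to negatives, so the minibatch still contains 256 anchors.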
Old:
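'''RPN overview. Note that the analysis often uses the terminology of the original paper directly, which differs from the names used in the code.
'''
FasterRCNNFeatureExtractor.extract_proposal_features actually calls
FasterRCNNResnetV1FeatureExtractor._extract_proposal_features
to generate the first stage RPN features, which are the input to the RPN.
_extract_rpn_feature_map of class FasterRCNNMetaArch(model.DetectionModel):
calls _extract_proposal_features of the feature extractor above and returns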
rpn_box_predictor_features: A 4-D float32 tensor with shape
[batch, height, width, depth] to be used for predicting proposal boxes
and corresponding objectness scores. '''the intermediate layer produced by the sliding window'''
rpn_features_to_crop: A 4-D float32 tensor with shape
[batch, height, width, depth] representing image features to crop using
the proposals boxes. '''This is simply the feature map produced by the feature extractor above.'''
anchors: A BoxList representing anchors (for the RPN) in
absolute coordinates.
'''grid_anchor_generator.GridAnchorGenerator is used here to generate 9 anchor boxes per feature map location (3 different scales and 3 aspect ratios). See the analysis below for details.
'''
anchors = self._first_stage_anchor_generator.generate(
    [(feature_map_shape[1], feature_map_shape[2])])
'''The sliding window is applied to the conv feature map to produce the intermediate layer. first_stage_box_predictor_kernel_size: Kernel size to use for the convolution op just prior to RPN box predictions.
'''
with slim.arg_scope(self._first_stage_box_predictor_arg_scope):
  kernel_size = self._first_stage_box_predictor_kernel_size
  rpn_box_predictor_features = slim.conv2d(
      rpn_features_to_crop,
      self._first_stage_box_predictor_depth,
      kernel_size=[kernel_size, kernel_size],
      rate=self._first_stage_atrous_rate,
      activation_fn=tf.nn.relu6)
'''According to the paper, the intermediate layer should next be fed into the cls and reg layers.
'''
def _predict_rpn_proposals(self, rpn_box_predictor_features):
calls into
self._first_stage_box_predictor.predict
self._first_stage_box_predictor = box_predictor.ConvolutionalBoxPredictor
'''Box predictors are classes that take a high level image feature map as input and produce two predictions, (1) a tensor encoding box locations, and (2) a tensor encoding classes for each box. See class ConvolutionalBoxPredictor(BoxPredictor) below for details.
'''
'''The next step is the loss. This predict function returns a single prediction_dict covering both stages, which is then passed to loss.
'''
def predict(self, preprocessed_inputs)
def loss(self, prediction_dict, scope=None):
'''loss calls the first-stage loss computation, which is examined in detail below.
'''
def _loss_rpn
'''Anchor generation
object_detection/anchor_generators/grid_anchor_generator.py
'''
def _generate  # invoked through the generate function of the parent class in core/anchor_generator.py
  grid_height, grid_width = feature_map_shape_list[0]
  # Multidimensional analog of numpy.meshgrid
  scales_grid, aspect_ratios_grid = ops.meshgrid(self._scales,
                                                 self._aspect_ratios)
  scales_grid = tf.reshape(scales_grid, [-1])
  aspect_ratios_grid = tf.reshape(aspect_ratios_grid, [-1])
  return tile_anchors(grid_height,
                      grid_width,
                      scales_grid,
                      aspect_ratios_grid,
                      self._base_anchor_size,
                      self._anchor_stride,
                      self._anchor_offset)
'''Verify this by hand against the test script.
base_anchor_size = [10, 10]#default=[256, 256]
anchor_stride = [19, 19]#default=[16, 16]
anchor_offset = [0, 0]
scales = [0.5, 1.0, 2.0]
aspect_ratios = [1.0]
exp_anchor_corners = [[-2.5, -2.5, 2.5, 2.5], [-5., -5., 5., 5.],
[-10., -10., 10., 10.], [-2.5, 16.5, 2.5, 21.5],
[-5., 14., 5, 24], [-10., 9., 10, 29],
[16.5, -2.5, 21.5, 2.5], [14., -5., 24, 5],
[9., -10., 29, 10], [16.5, 16.5, 21.5, 21.5],
[14., 14., 24, 24], [9., 9., 29, 29]]
feature_map_shape_list=[(2, 2)] #asks for anchors that correspond to a 2x2 layer
grid_height, grid_width = 2,2
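scales_grid, aspect_ratios_grid: omitted. There are three combinations, so the whole feature map gets 2*2*3=12 anchors in total.
The height and width of an anchor are determined by scales, aspect_ratio and base_anchor_size; this part is simple.
The center of an anchor is determined by range(grid), anchor_stride and anchor_offset, where grid means grid_height and grid_width. For example, the first center is clearly 0 and the second is 19. Once grid is understood as the grid points on the feature map at which anchors are generated, the rest is straightforward. Question: how are parameters such as base_anchor_size and anchor_stride configured?
How does the anchor center coincide with the center of the sliding window? The answer is that anchors are generated at every cell of the input feature map: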
feature_map_shape = tf.shape(rpn_features_to_crop)
anchors = self._first_stage_anchor_generator.generate(
[(feature_map_shape[1], feature_map_shape[2])])
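From this, the anchor stride should be 1×16=16, because adjacent cells of the feature map are 16 pixels apart in the original image (the feature extractor downsamples by a factor of 16). This matches expectations. The basic idea is that a feature-map cell, mapped back to the original image, has a large receptive field with a single fixed shape, so k anchor boxes of different sizes and shapes are introduced at each cell to frame objects in the original image better. This idea is further improved and extended in papers such as yolo2 and ssd.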
'''
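The hand calculation above can be reproduced with a few lines of plain Python. This is a sketch of the tiling logic as described in these notes, not the library's tile_anchors: the function name tile_anchors_sketch is hypothetical, boxes are plain lists in [ymin, xmin, ymax, xmax] order, and the ordering of (scale, aspect ratio) pairs is an assumption that only matters when there is more than one aspect ratio.

```python
import math


def tile_anchors_sketch(grid_height, grid_width, scales, aspect_ratios,
                        base_anchor_size, anchor_stride, anchor_offset):
    """Return anchors as [ymin, xmin, ymax, xmax], one per (cell, shape)."""
    # Every (scale, aspect_ratio) pair gives one anchor shape (height, width).
    shapes = []
    for ratio in aspect_ratios:
        for scale in scales:
            height = scale / math.sqrt(ratio) * base_anchor_size[0]
            width = scale * math.sqrt(ratio) * base_anchor_size[1]
            shapes.append((height, width))

    anchors = []
    for gy in range(grid_height):
        for gx in range(grid_width):
            # The center depends only on the grid index, stride and offset.
            yc = gy * anchor_stride[0] + anchor_offset[0]
            xc = gx * anchor_stride[1] + anchor_offset[1]
            for h, w in shapes:
                anchors.append([yc - h / 2, xc - w / 2, yc + h / 2, xc + w / 2])
    return anchors


anchors = tile_anchors_sketch(2, 2, [0.5, 1.0, 2.0], [1.0],
                              [10, 10], [19, 19], [0, 0])
print(len(anchors))  # 12
print(anchors[3])    # [-2.5, 16.5, 2.5, 21.5]
```

With the test-script values this gives the same 12 boxes as exp_anchor_corners: three shapes at each of the four centers (0, 0), (0, 19), (19, 0) and (19, 19).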
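def tile_anchors
"""Generating locations and classes
object_detection/core/box_predictor.py.
Operates on the intermediate layer produced by the sliding window; the output for each anchor is a tensor encoding box locations and a tensor encoding classes for each box.
Additional convolutional layers may be introduced.
There is nothing special about learning the locations either; it is just a convolution: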
box_encodings = slim.conv2d(
    net, num_predictions_per_location * self._box_code_size,
    [self._kernel_size, self._kernel_size],
    scope='BoxEncodingPredictor')
num_predictions_per_location is the number of anchors per location. Class prediction works the same way:
class_predictions_with_background = slim.conv2d(
    net, num_predictions_per_location * num_class_slots,
    [self._kernel_size, self._kernel_size], scope='ClassPredictor',
    biases_initializer=tf.constant_initializer(
        self._class_prediction_bias_init))
Question: what exactly is the box location learned here? It is the 'rpn_box_encodings' obtained when the predict function calls _predict_rpn_proposals.
self._first_stage_box_predictor = box_predictor.ConvolutionalBoxPredictor
It is later used to compute the loss. How does it relate to the location of the anchor itself? It is the coordinate offset between the predicted box and the anchor box, as described in the paper.
"""
class ConvolutionalBoxPredictor(BoxPredictor)
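To make the channel arithmetic of the two convolutions above concrete, here is a small shape calculation. It is a sketch under assumptions: rpn_head_shapes is a hypothetical helper, box_code_size=4 and num_class_slots=2 (object plus background) are assumed values for the RPN, the 38x50 feature map is a made-up example, and the real tensors in the library may carry extra batch or singleton dimensions.

```python
def rpn_head_shapes(height, width, anchors_per_location,
                    box_code_size=4, num_class_slots=2):
    """Shapes of the two RPN head outputs for one image (batch axis omitted)."""
    num_anchors = height * width * anchors_per_location
    return {
        # Output channels of each convolution: anchors per cell times code size.
        "box_conv": (height, width, anchors_per_location * box_code_size),
        "class_conv": (height, width, anchors_per_location * num_class_slots),
        # After flattening the spatial grid: one row per anchor.
        "box_encodings": (num_anchors, box_code_size),
        "class_predictions": (num_anchors, num_class_slots),
    }


shapes = rpn_head_shapes(height=38, width=50, anchors_per_location=9)
print(shapes["box_conv"])       # (38, 50, 36)
print(shapes["box_encodings"])  # (17100, 4)
```

Under these assumed sizes the anchor count is 38*50*9=17100, the same order of magnitude as the "more than 20000" candidate anchors mentioned in the update at the top.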
"""cls, reg loss
object_detection/meta_architectures/faster_rcnn_meta_arch.py
Here rpn_box_encodings is used directly to compute the loss. My guess is that batch_reg_targets is the offset between the ground truth and the anchor box. Checking the code behind target_assigner.batch_assign_targets:
def assign(self, anchors, groundtruth_boxes, groundtruth_labels=None,
           **params):
  reg_targets = self._create_regression_targets(anchors,
                                                groundtruth_boxes,
                                                match)
_create_regression_targets
  matched_reg_targets = self._box_coder.encode(matched_gt_boxes,
                                               matched_anchors)
box_coders/faster_rcnn_box_coder.py
  tx = (xcenter - xcenter_a) / wa
  ty = (ycenter - ycenter_a) / ha
  tw = tf.log(w / wa)
  th = tf.log(h / ha)
Following the trail leads to this encoding, which matches the paper.
"""
def _loss_rpn
  (batch_cls_targets, batch_cls_weights, batch_reg_targets,
   batch_reg_weights, _) = target_assigner.batch_assign_targets(
       self._proposal_target_assigner, box_list.BoxList(anchors),
       groundtruth_boxlists, len(groundtruth_boxlists)*[None])
  batch_cls_targets = tf.squeeze(batch_cls_targets, axis=2)
  localization_losses = self._first_stage_localization_loss(
      rpn_box_encodings, batch_reg_targets, weights=sampled_reg_indices)
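The regression targets traced above (tx, ty, tw, th from faster_rcnn_box_coder.py) can be checked with a scalar version of the encoding and its inverse. This is a sketch, not the library code: encode and decode here are hypothetical helpers on (ycenter, xcenter, h, w) tuples, the example boxes are made up, and applying the scale factors [10.0, 10.0, 5.0, 5.0] in (ty, tx, th, tw) order is an assumption about how FasterRcnnBoxCoder uses them.

```python
import math

SCALE_FACTORS = (10.0, 10.0, 5.0, 5.0)  # assumed order: ty, tx, th, tw


def encode(box, anchor, scale_factors=SCALE_FACTORS):
    """Encode a box relative to an anchor; both are (ycenter, xcenter, h, w)."""
    yc, xc, h, w = box
    yc_a, xc_a, ha, wa = anchor
    ty = (yc - yc_a) / ha
    tx = (xc - xc_a) / wa
    th = math.log(h / ha)
    tw = math.log(w / wa)
    return tuple(t * s for t, s in zip((ty, tx, th, tw), scale_factors))


def decode(code, anchor, scale_factors=SCALE_FACTORS):
    """Invert encode: recover (ycenter, xcenter, h, w) from a code and its anchor."""
    ty, tx, th, tw = (c / s for c, s in zip(code, scale_factors))
    yc_a, xc_a, ha, wa = anchor
    return (ty * ha + yc_a, tx * wa + xc_a, math.exp(th) * ha, math.exp(tw) * wa)


anchor = (100.0, 100.0, 64.0, 64.0)
gt_box = (108.0, 92.0, 128.0, 32.0)
code = encode(gt_box, anchor)
print(code[0], code[1])      # 1.25 -1.25
print(decode(code, anchor))  # approximately the original gt_box
```

The same decode step is what turns the predicted rpn_box_encodings back into boxes, which is why the prediction and the target can be compared directly in the localization loss.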
Next, look at how the loss formula is actually implemented.
"""The ground-truth label is 1 if the anchor is positive, and is 0 if the anchor is negative.
An anchor is labeled as positive if:
(a) the anchor is the one with highest IoU overlap with a ground-truth box
(b) the anchor has an IoU overlap with a ground-truth box higher than 0.7
Negative labels are assigned to anchors with IoU lower than 0.3 for all ground-truth
boxes.
50%/50% ratio of positive/negative anchors in a minibatch.
"""
Based on the analysis above, the corresponding code should be
(batch_cls_targets, batch_cls_weights, batch_reg_targets,
 batch_reg_weights, _) = target_assigner.batch_assign_targets(
     self._proposal_target_assigner, box_list.BoxList(anchors),
     groundtruth_boxlists, len(groundtruth_boxlists)*[None])
The target_assigner object called here is constructed as follows:
self._proposal_target_assigner = target_assigner.create_target_assigner(
    'FasterRCNN', 'proposal')
Entering the create_target_assigner function in core/target_assigner.py:
elif reference == 'FasterRCNN' and stage == 'proposal':
  similarity_calc = sim_calc.IouSimilarity()
  matcher = argmax_matcher.ArgMaxMatcher(matched_threshold=0.7,
                                         unmatched_threshold=0.3,
                                         force_match_for_each_row=True)
  box_coder = faster_rcnn_box_coder.FasterRcnnBoxCoder(
      scale_factors=[10.0, 10.0, 5.0, 5.0])
The actual implementation is in:
from object_detection.matchers import argmax_matcher
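The labeling rule quoted from the paper, together with the thresholds passed to ArgMaxMatcher above (0.7, 0.3, and a forced match for each ground-truth box), can be sketched in plain Python. This is an illustration only: iou and label_anchors are hypothetical helpers, the boxes are made up, and treating exactly 0.7 as positive is an assumption rather than a statement about the library's comparison operators.

```python
def iou(a, b):
    """IoU of two boxes given as [ymin, xmin, ymax, xmax]."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_boxes, matched=0.7, unmatched=0.3):
    """Return 1 (positive), 0 (negative) or -1 (ignored) for each anchor."""
    labels = []
    for anchor in anchors:
        best = max(iou(anchor, gt) for gt in gt_boxes)
        if best >= matched:
            labels.append(1)    # rule (b): IoU above the matched threshold
        elif best < unmatched:
            labels.append(0)    # negative: low IoU with every ground-truth box
        else:
            labels.append(-1)   # in between: ignored during training
    # Rule (a): every ground-truth box claims its highest-IoU anchor.
    for gt in gt_boxes:
        ious = [iou(anchor, gt) for anchor in anchors]
        labels[ious.index(max(ious))] = 1
    return labels


anchors = [[0, 0, 10, 10], [0, 0, 10, 20], [20, 20, 30, 30], [40, 40, 50, 56]]
gt_boxes = [[0, 0, 10, 10], [40, 40, 50, 50]]
print(label_anchors(anchors, gt_boxes))  # [1, -1, 0, 1]
```

The last anchor has an IoU of only 0.625 with the second ground-truth box, so rule (b) alone would ignore it; it becomes positive because it is the best match for that box, which is the role of force_match_for_each_row=True.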