Update:
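I recommend first reading "Learning Faster R-CNN from an Implementation Perspective" (从编程实现角度学习Faster R-CNN), which is more intuitive. The source code covered here is highly abstracted, so these notes come out somewhat disorganized.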
Old:
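I previously went through the overall code architecture of the detection API, the RPN part, and several basic classes. This time I take a close look at the Fast R-CNN part and tie together what I read before.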
"""object_detection/meta_architectures/faster_rcnn_meta_arch.py"""
The class FasterRCNNFeatureExtractor(object) declares abstract functions such as _extract_proposal_features and _extract_box_classifier_features.
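First, though, look at restore_from_classification_checkpoint_fn. It returns the variables_to_restore dict, and the restore_map function further down makes its purpose clear: it builds a mapping between variable names in the checkpoint and variables in the model, and it distinguishes a classification checkpoint (used for initialization) from a detection checkpoint.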
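The name-mapping idea can be sketched in plain Python. This is a minimal illustration, not the API's implementation: `build_restore_map` is a hypothetical helper, variables are represented by plain strings instead of tf.Variable objects, and the variable names below are made up for the example.

```python
def build_restore_map(model_var_names, scope_prefixes):
    """Map checkpoint variable names to model variable names.

    Hypothetical helper for illustration only. A classification
    checkpoint does not know about the detection model's scopes, so the
    scope prefix is stripped to obtain the checkpoint-side name.
    """
    restore_map = {}
    for name in model_var_names:
        for prefix in scope_prefixes:
            if name.startswith(prefix + "/"):
                # Drop "<prefix>/" so the key matches the checkpoint name.
                restore_map[name[len(prefix) + 1:]] = name
    return restore_map


# Illustrative variable names (assumed, not read from a real model).
model_vars = [
    "FirstStageFeatureExtractor/resnet_v1_101/conv1/weights",
    "SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/weights",
    "FirstStageBoxPredictor/BoxEncodingPredictor/weights",
]
restore_map = build_restore_map(
    model_vars,
    ["FirstStageFeatureExtractor", "SecondStageFeatureExtractor"])
```

Only the two feature-extractor variables end up in the map; the box predictor has no counterpart in a classification checkpoint, so it is left to be initialized from scratch.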
class FasterRCNNMetaArch(model.DetectionModel)
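Note the many uses of @property, which lets a method be accessed like an attribute. Decorating a function with @property in a class definition lets callers write short code while still running any necessary checks on the value. It can also define a read-only attribute: a property without a setter is read-only.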
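A small sketch of both uses of @property, assuming a hypothetical `ProposalConfig` class that is not part of the API: one property is read-only because it has no setter, the other validates the value in its setter.

```python
class ProposalConfig:
    """Hypothetical class used only to illustrate @property."""

    def __init__(self, max_num_proposals):
        self._max_num_proposals = max_num_proposals
        self._score_threshold = 0.0

    @property
    def max_num_proposals(self):
        # No setter is defined, so this attribute is read-only.
        return self._max_num_proposals

    @property
    def score_threshold(self):
        return self._score_threshold

    @score_threshold.setter
    def score_threshold(self, value):
        # The setter is where the value check happens.
        if not 0.0 <= value <= 1.0:
            raise ValueError("score_threshold must be in [0, 1]")
        self._score_threshold = value


config = ProposalConfig(max_num_proposals=300)
config.score_threshold = 0.5     # goes through the setter and its check
print(config.max_num_proposals)  # accessed like an attribute, no parentheses
```

Assigning to `config.max_num_proposals` raises AttributeError, and assigning an out-of-range `score_threshold` raises ValueError.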
"""这个max_num_proposals为什么这样设定?,暂时不太明白,后续研究"""
def max_num_proposals(self):
Max number of proposals (to pad to) for each image in the input batch.
At training time, this is set to be the `second_stage_batch_size` if hard
example miner is not configured, else it is set to
`first_stage_max_proposals`. At inference time, this is always set to
`first_stage_max_proposals`.
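The rule stated in this docstring can be written as a small standalone function. This is a sketch that follows the docstring only; the free function and its parameters are assumptions for illustration (in the model these are instance attributes), and the numbers 64 and 300 are example values.

```python
def max_num_proposals(is_training, hard_example_miner,
                      second_stage_batch_size, first_stage_max_proposals):
    """Number of proposals each image is padded to, per the docstring."""
    if is_training and hard_example_miner is None:
        # Training without hard example mining: only a sampled minibatch
        # of proposals is passed to the second stage.
        return second_stage_batch_size
    # Inference, or training with a hard example miner configured.
    return first_stage_max_proposals


train_size = max_num_proposals(True, None, 64, 300)
infer_size = max_num_proposals(False, None, 64, 300)
```

One plausible reading: without a hard example miner, the second stage trains on a sampled batch of proposals, so padding beyond `second_stage_batch_size` would be wasted; with a miner, or at inference time, all RPN proposals have to be kept.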
"""这个函数调用了image_resizer_fn,把归一化等其他预处理都留给feature_extractor来做"""
def preprocess(self, inputs)
"""这个函数比较复杂,首先是一些特殊的处理需要注意:
+ Anchor pruning vs. clipping: following the recommendation of the Faster
R-CNN paper, we prune anchors that venture outside the image window at
training time and clip anchors to the image window at inference time.
+ Proposal padding: as described at the top of the file, proposals are
padded to self._max_num_proposals and flattened so that proposals from all
images within the input batch are arranged along the same batch dimension.
"""
def predict(self, preprocessed_inputs)
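The difference between pruning and clipping anchors, mentioned in the note above, can be sketched in plain Python. The two functions below are simplified stand-ins that operate on tuples; boxes are assumed to be (ymin, xmin, ymax, xmax) in absolute coordinates, and the anchor values are invented for the example.

```python
def prune_outside_window(boxes, window):
    """Training-time behaviour: drop boxes not fully inside the window."""
    wymin, wxmin, wymax, wxmax = window
    return [b for b in boxes
            if b[0] >= wymin and b[1] >= wxmin
            and b[2] <= wymax and b[3] <= wxmax]


def clip_to_window(boxes, window):
    """Inference-time behaviour: keep every box, clamp its coordinates."""
    wymin, wxmin, wymax, wxmax = window
    return [(min(max(ymin, wymin), wymax), min(max(xmin, wxmin), wxmax),
             min(max(ymax, wymin), wymax), min(max(xmax, wxmin), wxmax))
            for ymin, xmin, ymax, xmax in boxes]


window = (0, 0, 100, 100)
anchors = [(10, 10, 50, 50), (-20, 30, 40, 90), (60, 60, 120, 130)]
pruned = prune_outside_window(anchors, window)   # 1 anchor survives
clipped = clip_to_window(anchors, window)        # still 3 anchors
```

Pruning changes the number of anchors while clipping does not, which is consistent with the docstring remark that `num_anchors` can differ between training and inference mode.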
The returned prediction_dict contains 11 entries.
1) rpn_box_predictor_features: A 4-D float32 tensor with shape
[batch_size, height, width, depth] to be used for predicting proposal
boxes and corresponding objectness scores.
# The intermediate layer produced by the sliding window, fed directly to a box predictor. The code also calls it the RPN feature map.
2) rpn_features_to_crop: A 4-D float32 tensor with shape
[batch_size, height, width, depth] representing image features to crop
using the proposal boxes predicted by the RPN.
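# This is simply the feature map produced by the feature extractor, i.e. the block3 activations at the truncation point. Passing it through one convolution yields rpn_box_predictor_features.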
3) image_shape: a 1-D tensor of shape [4] representing the input
image shape.
6) anchors: A 2-D tensor of shape [num_anchors, 4] representing anchors
for the first stage RPN (in absolute coordinates). Note that
`num_anchors` can differ depending on whether the model is created in
training or inference mode.
# The four entries above come from self._extract_rpn_feature_maps(preprocessed_inputs), which calls FasterRCNNResnetV1FeatureExtractor._extract_proposal_features.
4) rpn_box_encodings: 3-D float tensor of shape
[batch_size, num_anchors, self._box_coder.code_size] containing
predicted boxes.
5) rpn_objectness_predictions_with_background: 3-D float tensor of shape
[batch_size, num_anchors, 2] containing class
predictions (logits) for each of the anchors. Note that this
tensor *includes* background class predictions (at class index 0).
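# The two entries above come from self._predict_rpn_proposals(rpn_box_predictor_features). The actual implementation lives in box_predictor, e.g. class ConvolutionalBoxPredictor(BoxPredictor). There is nothing special about it: it essentially uses convolutions to regress locations and predict classes. Together with what I read earlier, the RPN stage now feels fairly familiar (although I have not yet looked at how external parameters are fed into the model; that is probably in the train script). Note: num_anchors_per_location=self._first_stage_anchor_generator.num_anchors_per_location() is a list, [len(self._scales) * len(self._aspect_ratios)], typically 3*3=9, meaning that 9 anchors are placed at each location of rpn_features_to_crop. The list has a single element because anchors are placed on only one feature map (for anchors on multiple feature maps, see the SSD paper; we will read the SSD code later as well).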
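The anchor count that appears as `num_anchors` in the shapes above follows directly from this list. A quick arithmetic sketch, where the scales, aspect ratios, and the 38x50 feature map size are example values rather than values read from a config:

```python
scales = [0.5, 1.0, 2.0]          # example values
aspect_ratios = [0.5, 1.0, 2.0]   # example values

# One list entry per feature map; Faster R-CNN here uses a single map.
num_anchors_per_location = [len(scales) * len(aspect_ratios)]

# Assumed spatial size of rpn_features_to_crop for one image.
feature_map_shapes = [(38, 50)]

num_anchors = sum(
    height * width * per_location
    for (height, width), per_location
    in zip(feature_map_shapes, num_anchors_per_location))

print(num_anchors_per_location)  # [9]
print(num_anchors)               # 17100
```

With these numbers, rpn_box_encodings would have shape [batch_size, 17100, code_size] at inference time, before any anchors are pruned.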
(and if first_stage_only=False):
7) refined_box_encodings: a 3-D tensor with shape
[total_num_proposals, num_classes, 4] representing predicted
(final) refined box encodings, where
total_num_proposals=batch_size*self._max_num_proposals
8) class_predictions_with_background: a 3-D tensor with shape
[total_num_proposals, num_classes + 1] containing class
predictions (logits) for each of the anchors, where
total_num_proposals=batch_size*self._max_num_proposals.
Note that this tensor *includes* background class predictions
(at class index 0).
9) num_proposals: An int32 tensor of shape [batch_size] representing the
number of proposals generated by the RPN. `num_proposals` allows us
to keep track of which entries are to be treated as zero paddings and
which are not since we always pad the number of proposals to be
`self.max_num_proposals` for each image.
10) proposal_boxes: A float32 tensor of shape
[batch_size, self.max_num_proposals, 4] representing
decoded proposal bounding boxes in absolute coordinates.
11) mask_predictions: (optional) a 4-D tensor with shape
[total_num_padded_proposals, num_classes, mask_height, mask_width]
containing instance mask predictions.
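The relationship between `num_proposals`, `proposal_boxes`, and `total_num_proposals` in items 7 to 10 can be sketched with plain lists. `pad_proposals` is a hypothetical helper and the boxes are invented; the point is only the bookkeeping: pad every image to the same length, remember the true counts, then flatten along the batch dimension.

```python
def pad_proposals(boxes_per_image, max_num_proposals):
    """Pad each image's proposals with zero boxes to a fixed length."""
    padded, num_proposals = [], []
    for boxes in boxes_per_image:
        boxes = list(boxes[:max_num_proposals])
        num_proposals.append(len(boxes))  # true count before padding
        padding = [(0.0, 0.0, 0.0, 0.0)] * (max_num_proposals - len(boxes))
        padded.append(boxes + padding)
    return padded, num_proposals


batch = [
    [(0.1, 0.1, 0.5, 0.5), (0.2, 0.2, 0.9, 0.9)],  # image 0: 2 proposals
    [(0.0, 0.0, 1.0, 1.0)],                        # image 1: 1 proposal
]
proposal_boxes, num_proposals = pad_proposals(batch, max_num_proposals=3)

# Flatten so that proposals from all images share one batch dimension:
# total_num_proposals = batch_size * max_num_proposals.
flattened = [box for image in proposal_boxes for box in image]
```

Here `num_proposals` is [2, 1] and `flattened` has 2 * 3 = 6 entries, three of which are zero padding that later stages must ignore.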
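# With the paper and the code side by side, the first stage is now fairly clear. The second-stage results all come from self._predict_second_stage. Note its inputs: rpn_box_encodings, rpn_objectness_predictions_with_background, rpn_features_to_crop, anchors, image_shape. The logic is simple: post-process the RPN output, then extract features. A few key points follow.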
"""self._postprocess_rpn: decodes the raw RPN predictions, runs non-max suppression."""
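proposal_boxes_normalized, _, num_proposals = self._postprocess_rpn(
    rpn_box_encodings, rpn_objectness_predictions_with_background,
    anchors, image_shape)
"""ROI pooling is implemented in _compute_second_stage_input_feature_maps. Note that self._maxpool_kernel_size is set in the config and passed to the model through model_builder. Unlike the Faster R-CNN paper, each proposal's crop of the feature map is first resized to a fixed size and then pooled with a fixed-size kernel, instead of using an adaptive kernel size. This is described in the paper by the API's authors.
"""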
flattened_proposal_feature_maps = (
    self._compute_second_stage_input_feature_maps(
        rpn_features_to_crop, proposal_boxes_normalized))
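The "resize to a fixed size, then pool with a fixed kernel" scheme can be sketched in plain Python on a single-channel feature map. This is a simplified illustration, not the API's code: both functions are stand-ins written for this example, the box is in normalized (ymin, xmin, ymax, xmax) coordinates, the crop size is assumed to be at least 2, and the pooling stride is assumed equal to the kernel size.

```python
def crop_and_resize(feature_map, box, crop_size):
    """Bilinearly sample a crop_size x crop_size grid inside the box."""
    height, width = len(feature_map), len(feature_map[0])
    ymin, xmin, ymax, xmax = box
    crop = []
    for i in range(crop_size):
        # Fractional input row for output row i.
        y = (ymin + (ymax - ymin) * i / (crop_size - 1)) * (height - 1)
        y0 = min(int(y), height - 2)
        dy = y - y0
        row = []
        for j in range(crop_size):
            x = (xmin + (xmax - xmin) * j / (crop_size - 1)) * (width - 1)
            x0 = min(int(x), width - 2)
            dx = x - x0
            top = (feature_map[y0][x0] * (1 - dx)
                   + feature_map[y0][x0 + 1] * dx)
            bottom = (feature_map[y0 + 1][x0] * (1 - dx)
                      + feature_map[y0 + 1][x0 + 1] * dx)
            row.append(top * (1 - dy) + bottom * dy)
        crop.append(row)
    return crop


def max_pool(grid, kernel_size):
    """Max pooling with a fixed kernel; stride equals the kernel size."""
    size = len(grid) // kernel_size
    return [[max(grid[i * kernel_size + a][j * kernel_size + b]
                 for a in range(kernel_size) for b in range(kernel_size))
             for j in range(size)]
            for i in range(size)]


# 4x4 feature map with values 0..15; the box covers the whole map.
feature_map = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
crop = crop_and_resize(feature_map, (0.0, 0.0, 1.0, 1.0), crop_size=4)
pooled = max_pool(crop, kernel_size=2)  # 2x2 output for every proposal
```

Because every crop is resized to the same `crop_size` first, the pooling kernel never has to adapt to the proposal's size, and all proposals produce outputs of identical shape.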
"""使用resnet的block4.
"""
box_classifier_features = (
    self._feature_extractor.extract_box_classifier_features(
        flattened_proposal_feature_maps,
        scope=self.second_stage_feature_extractor_scope))
"""class MaskRCNNBoxPredictor(BoxPredictor) raises ValueError if num_predictions_per_location is not 1, which probably explains why the anchor (proposal) dimension was merged into the batch dimension earlier. The mask prediction head is based on the Mask RCNN paper with the following modifications: We replace the deconvolution layer with a bilinear resize and a convolution. Classification and localization each use one fc layer. Compared with the Fast R-CNN paper, the two fc layers after ROI pooling are missing, probably because the original paper used VGG as its feature extractor and therefore had two extra fc layers.
"""
box_predictions = self._mask_rcnn_box_predictor.predict(
    box_classifier_features,
    num_predictions_per_location=1,
    scope=self.second_stage_box_predictor_scope)
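The "one fc layer each for classification and localization" structure, and the output shapes listed in items 7 and 8, can be sketched with plain lists. This is an illustration under assumptions: the helper names are hypothetical, the per-proposal features are assumed to be already flattened into vectors, and the weights are constant example values.

```python
def fully_connected(inputs, weights, biases):
    """Dense layer: inputs [n, d], weights [d, k], biases [k] -> [n, k]."""
    return [[sum(x * w_row[k] for x, w_row in zip(row, weights)) + biases[k]
             for k in range(len(biases))]
            for row in inputs]


def predict_boxes_and_classes(features, num_classes, box_params, class_params):
    """Two independent fc heads on the same per-proposal features."""
    box_flat = fully_connected(features, *box_params)        # [n, classes*4]
    class_logits = fully_connected(features, *class_params)  # [n, classes+1]
    # Reshape each flat row into [num_classes, 4]: one box per class.
    refined = [[row[c * 4:(c + 1) * 4] for c in range(num_classes)]
               for row in box_flat]
    return refined, class_logits


num_classes, depth = 2, 3
features = [[1.0, 2.0, 3.0], [0.5, 0.0, -1.0]]  # 2 proposals, depth 3
box_params = ([[0.1] * (num_classes * 4) for _ in range(depth)],
              [0.0] * (num_classes * 4))
class_params = ([[0.2] * (num_classes + 1) for _ in range(depth)],
                [0.0] * (num_classes + 1))

refined_box_encodings, class_predictions_with_background = (
    predict_boxes_and_classes(features, num_classes, box_params, class_params))
```

The results have shapes [2, 2, 4] and [2, 3], matching [total_num_proposals, num_classes, 4] and [total_num_proposals, num_classes + 1]; the extra class column is the background class.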
"""最后,代码中提到了Mask r-cnn,等先读完paper再研究。"""
if self._predict_keypoints:
  raise ValueError('Keypoint prediction is unimplemented.')
if self._predict_instance_masks:
  with slim.arg_scope(self._conv_hyperparams):
    upsampled_features = tf.image.resize_bilinear(
        image_features,
        [self._mask_height, self._mask_width],
        align_corners=True)
    upsampled_features = slim.conv2d(
        upsampled_features,
        num_outputs=self._mask_prediction_conv_depth,
        kernel_size=[2, 2])
    mask_predictions = slim.conv2d(upsampled_features,
                                   num_outputs=self.num_classes,
                                   activation_fn=None,
                                   kernel_size=[3, 3])
    instance_masks = tf.expand_dims(tf.transpose(mask_predictions,
                                                 perm=[0, 3, 1, 2]),
                                    axis=1,
                                    name='MaskPredictor')
  predictions_dict[MASK_PREDICTIONS] = instance_masks
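The effect of the final transpose and expand_dims on the tensor shape can be traced without TensorFlow. The two helpers below are hypothetical and work on shape lists only; the sizes (8 proposals, 15x15 masks, 90 classes) are example values, and the conv output is assumed to keep the resized spatial size.

```python
def transpose_shape(shape, perm):
    """Shape after a transpose with the given permutation."""
    return [shape[axis] for axis in perm]


def expand_dims_shape(shape, axis):
    """Shape after inserting a size-1 dimension at `axis`."""
    return shape[:axis] + [1] + shape[axis:]


num_proposals, mask_height, mask_width, num_classes = 8, 15, 15, 90

# Output of the last conv: channels-last, one channel per class.
mask_predictions_shape = [num_proposals, mask_height, mask_width, num_classes]

transposed = transpose_shape(mask_predictions_shape, perm=[0, 3, 1, 2])
instance_masks_shape = expand_dims_shape(transposed, axis=1)

print(transposed)            # [8, 90, 15, 15]
print(instance_masks_shape)  # [8, 1, 90, 15, 15]
```

Under these assumptions the stored tensor has one more singleton axis than the 4-D shape given in item 11 of the docstring, presumably corresponding to num_predictions_per_location=1.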