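This post walks through the Faster RCNN pipeline, based mainly on this Keras implementation: https://github.com/dishen12/keras_frcnn (the original author deleted the repository; this is someone else's fork). The TensorFlow implementation at https://github.com/endernewton/tf-faster-rcnn is also fairly clear (though I personally find it less simple than the Keras version) and is worth reading side by side.
Data processing
The backbone is VGG16, whose output feature_map has a stride of 16 relative to the network input. Each point on the feature_map corresponds to 9 anchors (3 ratios * 3 scales), so a feature_map of width w and height h yields w*h*9 anchors in total. The function calc_rpn computes the IoU between every anchor and every gt box to decide which anchors are positive samples for classification (foreground containing an object, regardless of its class), which are negative, which are ignored, and which take part in bounding-box regression.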
import random

import numpy as np

# iou(a, b) is an IoU helper defined elsewhere in the repository (not shown here)


def calc_rpn(C, img_data, width, height, resized_width, resized_height, img_length_calc_function):
    downscale = float(C.rpn_stride)  # 16
    anchor_sizes = C.anchor_box_scales  # [128, 256, 512]
    anchor_ratios = C.anchor_box_ratios  # [[1, 1], [1./math.sqrt(2), 2./math.sqrt(2)], [2./math.sqrt(2), 1./math.sqrt(2)]]
    num_anchors = len(anchor_sizes) * len(anchor_ratios)  # 9

    # calculate the output map size based on the network architecture
    (output_width, output_height) = img_length_calc_function(resized_width, resized_height)  # //16

    n_anchratios = len(anchor_ratios)  # 3

    # initialize empty output objectives
    y_rpn_overlap = np.zeros((output_height, output_width, num_anchors))
    y_is_box_valid = np.zeros((output_height, output_width, num_anchors))
    y_rpn_regr = np.zeros((output_height, output_width, num_anchors * 4))

    num_bboxes = len(img_data['bboxes'])  # assume 2 for this walkthrough

    num_anchors_for_bbox = np.zeros(num_bboxes).astype(int)
    best_anchor_for_bbox = -1 * np.ones((num_bboxes, 4)).astype(int)
    best_iou_for_bbox = np.zeros(num_bboxes).astype(np.float32)
    best_x_for_bbox = np.zeros((num_bboxes, 4)).astype(int)
    best_dx_for_bbox = np.zeros((num_bboxes, 4)).astype(np.float32)

    # get the GT box coordinates, and resize to account for image resizing
    # (note the column order: x1, x2, y1, y2)
    gta = np.zeros((num_bboxes, 4))
    for bbox_num, bbox in enumerate(img_data['bboxes']):
        gta[bbox_num, 0] = bbox['x1'] * (resized_width / float(width))
        gta[bbox_num, 1] = bbox['x2'] * (resized_width / float(width))
        gta[bbox_num, 2] = bbox['y1'] * (resized_height / float(height))
        gta[bbox_num, 3] = bbox['y2'] * (resized_height / float(height))

    # rpn ground truth
    for anchor_size_idx in range(len(anchor_sizes)):  # 3
        for anchor_ratio_idx in range(n_anchratios):  # 3
            anchor_x = anchor_sizes[anchor_size_idx] * anchor_ratios[anchor_ratio_idx][0]
            anchor_y = anchor_sizes[anchor_size_idx] * anchor_ratios[anchor_ratio_idx][1]

            for ix in range(output_width):
                # x-coordinates of the current anchor box
                x1_anc = downscale * (ix + 0.5) - anchor_x / 2
                x2_anc = downscale * (ix + 0.5) + anchor_x / 2

                # ignore boxes that go across image boundaries
                if x1_anc < 0 or x2_anc > resized_width:
                    continue

                for jy in range(output_height):
                    # y-coordinates of the current anchor box
                    y1_anc = downscale * (jy + 0.5) - anchor_y / 2
                    y2_anc = downscale * (jy + 0.5) + anchor_y / 2

                    # ignore boxes that go across image boundaries
                    if y1_anc < 0 or y2_anc > resized_height:
                        continue

                    # bbox_type indicates whether an anchor should be a target
                    bbox_type = 'neg'

                    # this is the best IOU for the (x,y) coord and the current anchor
                    # note that this is different from the best IOU for a GT bbox
                    best_iou_for_loc = 0.0  # one of two

                    for bbox_num in range(num_bboxes):
                        # get IOU of the current GT box and the current anchor box
                        curr_iou = iou([gta[bbox_num, 0], gta[bbox_num, 2], gta[bbox_num, 1], gta[bbox_num, 3]],
                                       [x1_anc, y1_anc, x2_anc, y2_anc])

                        # calculate the regression targets if they will be needed
                        if curr_iou > best_iou_for_bbox[bbox_num] or curr_iou > C.rpn_max_overlap:
                            cx = (gta[bbox_num, 0] + gta[bbox_num, 1]) / 2.0
                            cy = (gta[bbox_num, 2] + gta[bbox_num, 3]) / 2.0
                            cxa = (x1_anc + x2_anc) / 2.0
                            cya = (y1_anc + y2_anc) / 2.0

                            tx = (cx - cxa) / (x2_anc - x1_anc)
                            ty = (cy - cya) / (y2_anc - y1_anc)
                            tw = np.log((gta[bbox_num, 1] - gta[bbox_num, 0]) / (x2_anc - x1_anc))
                            th = np.log((gta[bbox_num, 3] - gta[bbox_num, 2]) / (y2_anc - y1_anc))

                        if img_data['bboxes'][bbox_num]['class'] != 'bg':
                            # all GT boxes should be mapped to an anchor box,
                            # so we keep track of which anchor box was best
                            if curr_iou > best_iou_for_bbox[bbox_num]:
                                # these indices are enough to recover the anchor coordinates
                                best_anchor_for_bbox[bbox_num] = [jy, ix, anchor_ratio_idx, anchor_size_idx]
                                best_iou_for_bbox[bbox_num] = curr_iou
                                # anchor coordinates; this array does not seem to be used afterwards
                                best_x_for_bbox[bbox_num, :] = [x1_anc, x2_anc, y1_anc, y2_anc]
                                best_dx_for_bbox[bbox_num, :] = [tx, ty, tw, th]

                            # we set the anchor to positive if the IOU is > 0.7
                            # (it does not matter if there was another better box, it just indicates overlap)
                            if curr_iou > C.rpn_max_overlap:
                                bbox_type = 'pos'
                                num_anchors_for_bbox[bbox_num] += 1
                                # we update the regression layer target if this IOU
                                # is the best for the current (x,y) and anchor position
                                if curr_iou > best_iou_for_loc:
                                    best_iou_for_loc = curr_iou
                                    best_regr = (tx, ty, tw, th)

                            # if the IOU is > 0.3 and < 0.7, it is ambiguous and not included in the objective
                            if C.rpn_min_overlap < curr_iou < C.rpn_max_overlap:
                                # gray zone between neg and pos
                                if bbox_type != 'pos':
                                    bbox_type = 'neutral'

                    # turn on or off outputs depending on IOUs
                    if bbox_type == 'neg':
                        y_is_box_valid[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 1
                        y_rpn_overlap[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 0
                    elif bbox_type == 'neutral':
                        y_is_box_valid[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 0
                        y_rpn_overlap[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 0
                    elif bbox_type == 'pos':
                        y_is_box_valid[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 1
                        y_rpn_overlap[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 1
                        start = 4 * (anchor_ratio_idx + n_anchratios * anchor_size_idx)
                        y_rpn_regr[jy, ix, start:start + 4] = best_regr

    # we ensure that every bbox has at least one positive RPN region
    for idx in range(num_anchors_for_bbox.shape[0]):
        if num_anchors_for_bbox[idx] == 0:
            # no box with an IOU greater than zero ...
            if best_anchor_for_bbox[idx, 0] == -1:
                continue
            y_is_box_valid[best_anchor_for_bbox[idx, 0], best_anchor_for_bbox[idx, 1],
                           best_anchor_for_bbox[idx, 2] + n_anchratios * best_anchor_for_bbox[idx, 3]] = 1
            y_rpn_overlap[best_anchor_for_bbox[idx, 0], best_anchor_for_bbox[idx, 1],
                          best_anchor_for_bbox[idx, 2] + n_anchratios * best_anchor_for_bbox[idx, 3]] = 1
            start = 4 * (best_anchor_for_bbox[idx, 2] + n_anchratios * best_anchor_for_bbox[idx, 3])
            y_rpn_regr[best_anchor_for_bbox[idx, 0], best_anchor_for_bbox[idx, 1], start:start + 4] \
                = best_dx_for_bbox[idx, :]

    y_rpn_overlap = np.transpose(y_rpn_overlap, (2, 0, 1))
    y_rpn_overlap = np.expand_dims(y_rpn_overlap, axis=0)

    y_is_box_valid = np.transpose(y_is_box_valid, (2, 0, 1))
    y_is_box_valid = np.expand_dims(y_is_box_valid, axis=0)

    y_rpn_regr = np.transpose(y_rpn_regr, (2, 0, 1))
    y_rpn_regr = np.expand_dims(y_rpn_regr, axis=0)

    pos_locs = np.where(np.logical_and(y_rpn_overlap[0, :, :, :] == 1, y_is_box_valid[0, :, :, :] == 1))
    neg_locs = np.where(np.logical_and(y_rpn_overlap[0, :, :, :] == 0, y_is_box_valid[0, :, :, :] == 1))

    num_pos = len(pos_locs[0])

    # one issue is that the RPN has many more negative than positive regions, so we turn off some of the negative
    # regions. We also limit it to 256 regions.
    num_regions = 256

    # integer division (//) keeps the sample sizes integers under Python 3
    if len(pos_locs[0]) > num_regions // 2:
        val_locs = random.sample(range(len(pos_locs[0])), len(pos_locs[0]) - num_regions // 2)
        y_is_box_valid[0, pos_locs[0][val_locs], pos_locs[1][val_locs], pos_locs[2][val_locs]] = 0
        num_pos = num_regions // 2

    if len(neg_locs[0]) + num_pos > num_regions:
        val_locs = random.sample(range(len(neg_locs[0])), len(neg_locs[0]) - num_pos)
        y_is_box_valid[0, neg_locs[0][val_locs], neg_locs[1][val_locs], neg_locs[2][val_locs]] = 0

    y_rpn_cls = np.concatenate([y_is_box_valid, y_rpn_overlap], axis=1)
    y_rpn_regr = np.concatenate([np.repeat(y_rpn_overlap, 4, axis=1), y_rpn_regr], axis=1)

    return np.copy(y_rpn_cls), np.copy(y_rpn_regr)
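calc_rpn calls an iou helper that is not shown in this post. As a rough illustration of what such a helper computes, here is a minimal sketch; the name box_iou is hypothetical, the boxes are assumed to be [x1, y1, x2, y2] (the order used in the call inside calc_rpn), and the repository's own helper may differ in detail.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    # A box with no area cannot overlap anything.
    if a[0] >= a[2] or a[1] >= a[3] or b[0] >= b[2] or b[1] >= b[3]:
        return 0.0

    # Width and height of the intersection rectangle (negative if disjoint).
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0

    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


if __name__ == "__main__":
    # Two 2x2 boxes offset by one pixel share a 1x1 region: 1 / (4 + 4 - 1).
    print(box_iou([0, 0, 2, 2], [1, 1, 3, 3]))
```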
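Conceptually, the function visits every point on the output feature_map, maps it back to the original input with the stride of 16 to get the anchor center, and considers the 9 anchors of different sizes and aspect ratios at that point (the code iterates over anchor shapes in the outer loops and over positions in the inner loops, which is equivalent). Each anchor's IoU with every gt box is computed, and the anchor's label is decided by comparing the IoU against preset thresholds. The rules are:
- Anchors that extend beyond the image boundary are ignored.
- Classification: anchors with iou>0.7 are positive, anchors with 0.3<iou<0.7 are ignored, and anchors with iou<0.3 are negative.
- Regression: only anchors with iou>0.7 take part in regression. If an anchor has iou>0.7 with several gt boxes, the gt box with the largest iou is used as its regression target. If a gt box has iou below 0.7 with every anchor, the anchor with the largest iou is marked positive and used for regression (unless the iou with every anchor is 0).
- Because negatives far outnumber positives, the paper limits the total number of samples to 256. If there are more than 128 positives, 128 of them are kept along with 128 negatives, and the rest are ignored. If there are fewer than 128 positives, all of them are kept and the paper fills the remainder with 256 minus that many negatives, ignoring the rest. Note that the code above differs slightly in this second case: when negatives need to be subsampled, it disables all but num_pos of them, so it keeps as many negatives as positives.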
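The rules above can be condensed into a short vectorized sketch. It assumes the IoU matrix between in-image anchors and gt boxes has already been computed; the function names label_anchors and subsample are hypothetical, and subsample follows the paper-style rule (fill up to 256 with negatives) rather than the exact behavior of calc_rpn.

```python
import numpy as np


def label_anchors(ious, min_overlap=0.3, max_overlap=0.7):
    """ious: (num_anchors, num_gt) IoU matrix, num_anchors > 0.

    Returns one label per anchor: 1 = positive, 0 = negative, -1 = ignored.
    """
    ious = np.asarray(ious, dtype=float)
    pos = (ious > max_overlap).any(axis=1)
    gray = ((ious > min_overlap) & (ious < max_overlap)).any(axis=1)

    labels = np.zeros(ious.shape[0], dtype=int)  # negative by default
    labels[gray] = -1                            # ambiguous overlap: ignored
    labels[pos] = 1                              # positive wins over ignored

    # Fallback: a gt box without any anchor above the threshold still gets
    # its best-matching anchor, unless it overlaps no anchor at all.
    for g in range(ious.shape[1]):
        col = ious[:, g]
        if not (col > max_overlap).any() and col.max() > 0:
            labels[col.argmax()] = 1
    return labels


def subsample(labels, num_regions=256, rng=None):
    """Keep at most num_regions // 2 positives and fill the rest with negatives."""
    rng = rng if rng is not None else np.random.default_rng(0)
    labels = labels.copy()
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)

    max_pos = num_regions // 2
    if pos.size > max_pos:
        labels[rng.choice(pos, pos.size - max_pos, replace=False)] = -1
    num_neg = num_regions - min(pos.size, max_pos)
    if neg.size > num_neg:
        labels[rng.choice(neg, neg.size - num_neg, replace=False)] = -1
    return labels


if __name__ == "__main__":
    ious = np.array([[0.8, 0.1], [0.5, 0.2], [0.1, 0.4], [0.0, 0.05]])
    print(label_anchors(ious))
```

In the small example, the third anchor only reaches an IoU of 0.4 with the second gt box, but it is still marked positive because no other anchor matches that box better.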
RPN
Suppose the network input has shape (1,900,600,3), i.e. (batch_size, w, h, channel); the feature_map output by the VGG16 backbone then has shape (1,56,38,512).
def rpn(base_layers, num_anchors):
    x = Conv2D(512, (3, 3), padding='same', activation='relu', kernel_initializer='normal', name='rpn_conv1')(base_layers)
    x_class = Conv2D(num_anchors, (1, 1), activation='sigmoid', kernel_initializer='uniform', name='rpn_out_class')(x)
    x_regr = Conv2D(num_anchors * 4, (1, 1), activation='linear', kernel_initializer='zero', name='rpn_out_regress')(x)
    return [x_class, x_regr, base_layers]
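In the code, base_layers is the feature_map output by the backbone and num_anchors=9. The RPN output x_class has shape (1,56,38,9) and x_regr has shape (1,56,38,36): the class score and the box regression values for each anchor. The job of the RPN is to pick a small number of anchors that are likely to contain an object, the proposals, out of the very large set of anchors (56*38*9=19152). Note that the classification here only separates foreground from background and does not assign a specific class. This is the end of the network's first stage.
During training, two things happen after the RPN. First, the loss is computed against the output of calc_rpn, using a cross-entropy loss for classification and a smooth L1 loss for regression. Second, proposals are selected from all anchors: the regression values output by the RPN are applied to the anchors to obtain the predicted box coordinates, the results are clipped so that the boxes stay inside the image, and unreasonable boxes are removed (for example, those whose top-left x coordinate is greater than the bottom-right one). NMS is then run on the classification scores to pick the 300 candidate boxes most likely to contain an object (the result may contain fewer than 300).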
R = roi_helpers.rpn_to_roi(P_rpn[0], P_rpn[1], C, K.image_dim_ordering(), use_regr=True, overlap_thresh=0.7, max_boxes=300) # _proposal_layer
# first apply the regression values P_rpn[1] to the anchors on the feature map, then run NMS using the scores P_rpn[0]
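To make the proposal step more concrete, here is a minimal NumPy sketch of the three operations described above: decoding the regression values, clipping to the image, and greedy NMS. It is not the repository's rpn_to_roi. The function names are hypothetical, boxes are assumed to be [x1, y1, x2, y2] with positive area, and the decoding simply inverts the (tx, ty, tw, th) encoding used in calc_rpn.

```python
import numpy as np


def apply_deltas(anchors, deltas):
    """Invert the (tx, ty, tw, th) encoding: anchors and deltas are (N, 4)."""
    w = anchors[:, 2] - anchors[:, 0]
    h = anchors[:, 3] - anchors[:, 1]
    cx = anchors[:, 0] + w / 2.0 + deltas[:, 0] * w
    cy = anchors[:, 1] + h / 2.0 + deltas[:, 1] * h
    new_w = np.exp(deltas[:, 2]) * w
    new_h = np.exp(deltas[:, 3]) * h
    return np.stack([cx - new_w / 2.0, cy - new_h / 2.0,
                     cx + new_w / 2.0, cy + new_h / 2.0], axis=1)


def clip_boxes(boxes, width, height):
    """Clamp box corners to the image extent."""
    clipped = boxes.copy()
    clipped[:, [0, 2]] = np.clip(clipped[:, [0, 2]], 0, width)
    clipped[:, [1, 3]] = np.clip(clipped[:, [1, 3]], 0, height)
    return clipped


def nms(boxes, scores, overlap_thresh=0.7, max_boxes=300):
    """Greedy NMS: keep the best box, drop boxes overlapping it too much, repeat."""
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0 and len(keep) < max_boxes:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        overlap = inter / (areas[i] + areas[rest] - inter)
        order = rest[overlap <= overlap_thresh]
    return keep


if __name__ == "__main__":
    boxes = np.array([[0., 0., 10., 10.], [1., 1., 11., 11.], [20., 20., 30., 30.]])
    scores = np.array([0.9, 0.8, 0.7])
    print(nms(boxes, scores, overlap_thresh=0.5))
```

Because NMS stops as soon as no candidates remain, the number of returned boxes can be smaller than max_boxes, which matches the remark that the result may contain fewer than 300 proposals.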
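The resulting 300 proposals are then compared against the gt boxes by IoU. Proposals with iou<0.1, i.e. background that is easy to classify, are discarded; those with 0.1<iou<0.5 are labeled as background; and those with iou>0.5 are labeled with the specific object class (unlike the RPN ground truth, which only marks positive and negative samples). The offsets to the gt box are also computed once more. These class labels and box offsets are the classification and regression targets for the final outputs of the second stage.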
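X2, Y1, Y2, IouS = roi_helpers.calc_iou(R, img_data, C, class_mapping) # proposal_target_layer
# drop easy background, i.e. proposals whose IoU with the GT is below 0.1, and compute the regression offsets against the GT once more
# e.g. shapes (1,245,4), (1,245,21), (1,245,160)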
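The labeling rule for proposals can be written as a tiny function. This is only a sketch of the rule as described in the text, not the body of calc_iou; the name label_proposal is hypothetical, and the handling of IoU values exactly equal to a threshold is an assumption.

```python
def label_proposal(best_iou, gt_class, min_overlap=0.1, max_overlap=0.5):
    """Return the training label for one proposal, or None if it is discarded.

    best_iou is the largest IoU between the proposal and any gt box,
    gt_class is the class name of that best-matching gt box.
    """
    if best_iou < min_overlap:
        return None      # easy background: not used for training
    if best_iou < max_overlap:
        return 'bg'      # hard background
    return gt_class      # foreground with a specific class


if __name__ == "__main__":
    for value in (0.05, 0.3, 0.8):
        print(value, label_proposal(value, 'cat'))
```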
ROI Pooling
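This is where the second stage begins. Before ROI Pooling, the network once again limits the number of rois (proposals) that take part in the computation. The code sets this to 32, and the sampling works as in the first stage: if there are more than 16 positives, 16 are picked at random; if there are fewer, all of them are used; negatives are picked at random to fill the remaining 32 minus the number of positives. Because the final classification and regression use fully connected layers, which require a fixed-size input, the role of roi pooling is to bring the selected rois of different sizes to one fixed size. Concretely, the region of the output feature_map corresponding to each of the 32 rois is divided evenly into pool_w*pool_h blocks (pool_w=pool_h=7 in the paper), and max_pooling is applied to each block. Rois of any size thus become 7*7, and fully connected layers then produce the final class predictions and box regression predictions.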
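As an illustration of the pooling step, here is a minimal NumPy sketch of max pooling a single roi to a fixed grid. It follows the description above and is not necessarily how the repository's pooling layer is implemented; the name roi_max_pool is hypothetical, and the roi is assumed to be (x, y, w, h) in feature-map cells and to lie fully inside the feature map.

```python
import numpy as np


def roi_max_pool(feature_map, roi, pool_size=7):
    """feature_map: (H, W, C); roi: (x, y, w, h) in feature-map cells.

    Returns an array of shape (pool_size, pool_size, C).
    """
    x, y, w, h = roi
    channels = feature_map.shape[2]
    out = np.zeros((pool_size, pool_size, channels), dtype=feature_map.dtype)

    # Bin edges that split the roi evenly along each axis.
    xs = np.linspace(x, x + w, pool_size + 1)
    ys = np.linspace(y, y + h, pool_size + 1)

    for j in range(pool_size):
        y1 = int(np.floor(ys[j]))
        y2 = max(int(np.ceil(ys[j + 1])), y1 + 1)  # every bin covers >= 1 cell
        for i in range(pool_size):
            x1 = int(np.floor(xs[i]))
            x2 = max(int(np.ceil(xs[i + 1])), x1 + 1)
            out[j, i] = feature_map[y1:y2, x1:x2].max(axis=(0, 1))
    return out


if __name__ == "__main__":
    fmap = np.arange(14 * 14, dtype=float).reshape(14, 14, 1)
    pooled = roi_max_pool(fmap, (0, 0, 14, 14))
    print(pooled.shape)
```

For rois smaller than the 7*7 grid, neighboring bins overlap and reuse the same feature-map cells, so the output size stays fixed regardless of the roi size.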