The proposal layer combines all of the predicted regression deltas with the foreground anchors to compute refined, accurate proposals, which are then fed into the subsequent RoI Pooling layer.
The Proposal Layer's forward function processes its inputs in the following order (a short usage sketch follows the list):
1. Generate anchors and apply the bbox regression deltas to all of them (anchor generation here is exactly the same as during training).
2. Sort the anchors by the input foreground softmax scores in descending order and keep the top pre_nms_topN (e.g. 6000), i.e. the position-corrected foreground anchors.
3. Clip foreground anchors that extend beyond the image boundary to the boundary (so that a proposal never falls outside the image during the later RoI pooling).
4. Discard very small foreground anchors (width < threshold or height < threshold).
5. Apply non-maximum suppression.
6. Sort the remaining foreground anchors by their softmax scores in descending order once more and keep the top post_nms_topN (e.g. 2000 at training time, 300 at test time) as the output proposals.
7. The output tensor has shape [batch_size, post_nms_topN, 5]. Along the third dimension, column 0 holds the index of the image within the batch that the region proposal belongs to, and columns 1-4 hold the proposal's coordinates [xmin, ymin, xmax, ymax] on the (rescaled) image resolution actually fed to the network.
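Before walking through the code, here is a minimal usage sketch of the layer defined below. The tensor shapes and example values are illustrative assumptions, not taken from the original implementation:

import torch

# One 600x800 input image with feat_stride = 16, so the feature map is roughly 38x50.
B, A, H, W = 1, 9, 38, 50
rpn_cls_prob = torch.rand(B, 2 * A, H, W)    # per-anchor background/foreground softmax scores
rpn_bbox_pred = torch.rand(B, 4 * A, H, W)   # per-anchor box regression deltas
im_info = torch.tensor([[600., 800., 1.]])   # (height, width, scale) for each image

proposal_layer = _ProposalLayer(feat_stride=16, scales=[8, 16, 32], ratios=[0.5, 1, 2])
rois = proposal_layer((rpn_cls_prob, rpn_bbox_pred, im_info, 'TRAIN'))
# rois: [B, post_nms_topN, 5]; column 0 is the image index within the batch,
# columns 1-4 are [xmin, ymin, xmax, ymax] on the network-input resolution.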
# Imports needed by the snippet below; the exact module paths are an assumption based on
# the layout of the faster-rcnn.pytorch style code this layer comes from.
import numpy as np
import torch
import torch.nn as nn

from model.utils.config import cfg
from model.rpn.generate_anchors import generate_anchors
from model.rpn.bbox_transform import bbox_transform_inv, clip_boxes
from model.nms.nms_wrapper import nms

class _ProposalLayer(nn.Module):
"""
Outputs object detection proposals by applying estimated bounding-box
transformations to a set of regular boxes (called "anchors").
"""
def __init__(self, feat_stride, scales, ratios):
super(_ProposalLayer, self).__init__()
self.feat_stride = feat_stride
self._anchors = torch.from_numpy(generate_anchors(scales=np.array(scales),
ratios=np.array(ratios))).float()
self._num_anchors = self._anchors.size(0)
def forward(self, input):
# Slice the RPN classification output along the channel dimension to keep only the
# foreground scores. Note that of the 18 channels, the first 9 are the probability that
# an anchor belongs to the background and the last 9 the probability that it belongs
# to the foreground.
# the shape of scores: [batch_size, 9, H, W]
# H = M / 16, W = N / 16 for an M x N network-input image
scores = input[0][:, self._num_anchors:, :, :]
# bbox_deltas holds the per-anchor box regression deltas predicted by the RPN
bbox_deltas = input[1]
im_info = input[2]
cfg_key = input[3]
pre_nms_topN = cfg[cfg_key].RPN_PRE_NMS_TOP_N
post_nms_topN = cfg[cfg_key].RPN_POST_NMS_TOP_N
nms_thresh = cfg[cfg_key].RPN_NMS_THRESH
min_size = cfg[cfg_key].RPN_MIN_SIZE
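# Typical default values (an assumption, following standard Faster R-CNN configs):
# TRAIN: pre_nms_topN = 12000, post_nms_topN = 2000; TEST: pre_nms_topN = 6000,
# post_nms_topN = 300; nms_thresh = 0.7; min_size is measured on the network-input scale.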
batch_size = bbox_deltas.size(0)
# 1. Generate proposals from bbox deltas and shifted anchors
feat_height, feat_width = scores.size(2), scores.size(3)
# Enumerate all shifts
# [0, 16, 32, 48...]
shift_x = np.arange(0, feat_width) * self.feat_stride
# [0, 16, 32, 48...]
shift_y = np.arange(0, feat_height) * self.feat_stride
# generating grid on the feature map
# shift_x shape: [height, width], shift_y shape: [height, width]
# The grid covers the image actually fed into the network, i.e. the original image
# after the dataset's rescaling and padding operations.
shift_x, shift_y = np.meshgrid(shift_x, shift_y)
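# e.g. with feat_width = 3 and feat_height = 2 (stride 16):
# shift_x = [[0, 16, 32], [0, 16, 32]], shift_y = [[0, 0, 0], [16, 16, 16]]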
# shifts shape:[height * width, 4]
shifts = torch.from_numpy(np.vstack((shift_x.ravel(), shift_y.ravel(),
shift_x.ravel(), shift_y.ravel())).transpose())
shifts = shifts.contiguous().type_as(scores).float()
A = self._num_anchors  # A = 9
# K is the number of grid cells on the feature map, i.e. K = H * W
K = shifts.size(0)
# now the anchors shape:[batch_size, K * A, 4]
# K*A is the number of all anchor boxes in an input image
self._anchors = self._anchors.type_as(scores)
anchors = self._anchors.view(1, A, 4) + shifts.view(K, 1, 4)
anchors = anchors.view(1, K * A, 4).expand(batch_size, K * A, 4)
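# Broadcasting (1, A, 4) + (K, 1, 4) gives (K, A, 4): each of the K grid cells gets its
# own copy of the A base anchors, shifted to that cell's position on the input image.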
# Transpose and reshape predicted bbox transformations to get them
# into the same order as the anchors:
bbox_deltas = bbox_deltas.permute(0, 2, 3, 1).contiguous()
bbox_deltas = bbox_deltas.view(batch_size, -1, 4)
# Same story for the scores:
scores = scores.permute(0, 2, 3, 1).contiguous()
scores = scores.view(batch_size, -1)
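# After reshaping, bbox_deltas has shape [batch_size, H*W*A, 4] and scores has shape
# [batch_size, H*W*A], matching the anchor ordering built above.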
# Convert anchors into proposals via bbox transformations.
# The regression deltas predicted by the RPN are decoded against the anchors to obtain
# absolute box coordinates on the image resolution actually fed to the network; the
# returned proposals are the RPN's predicted boxes for all anchors.
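# (Decoding sketch, assuming the standard Faster R-CNN box parameterization:
#  pred_ctr_x = dx * w_a + ctr_x_a,  pred_ctr_y = dy * h_a + ctr_y_a,
#  pred_w = exp(dw) * w_a,           pred_h = exp(dh) * h_a,
#  then converted back to corner format [xmin, ymin, xmax, ymax].)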
proposals = bbox_transform_inv(anchors, bbox_deltas, batch_size) # [xmin,ymin,xmax,ymax]
# 2. clip predicted boxes to image
proposals = clip_boxes(proposals, im_info, batch_size)
# proposals = clip_boxes_batch(proposals, im_info, batch_size)
scores_keep = scores  # shape: [batch_size, H*W*9]
proposals_keep = proposals
# For each image in the batch, sort its H*W*9 region proposals in descending order of the
# foreground score predicted by the RPN. The first value returned by torch.sort holds the
# scores in descending order and the second holds the corresponding position indices.
_, order = torch.sort(scores_keep, 1, True)
# output shape [batch_size, post_nms_topN, 5]
# tensor.new() creates a tensor with the same dtype (and device) as scores. At training
# time post_nms_topN = 2000, i.e. 2000 region proposals per image are finally passed on to
# the Fast R-CNN head. This is exactly the role that selective search plays for R-CNN and
# Fast R-CNN, which also hand roughly 2000 proposals to the detector; the difference is
# that selective search is a fixed algorithm with no training process, while the RPN learns
# to produce its region proposals/RoIs. SSD and YOLO, by contrast, drop the proposal stage
# altogether and predict directly from anchor boxes laid out in sliding-window fashion,
# which is why they are the ones regarded as truly dense detectors.
output = scores.new(batch_size, post_nms_topN, 5).zero_()
for i in range(batch_size):
# 3. remove predicted boxes with either height or width < threshold
# (NOTE: convert min_size to input image scale stored in im_info[2])
# (In this snippet the min-size filter, self._filter_boxes, is left disabled.)
proposals_single = proposals_keep[i]
scores_single = scores_keep[i]
# # 4. sort all (proposal, score) pairs by score from highest to lowest
# # 5. take top pre_nms_topN (e.g. 6000)
order_single = order[i]
if pre_nms_topN > 0 and pre_nms_topN < scores_keep.numel():
order_single = order_single[:pre_nms_topN]
proposals_single = proposals_single[order_single, :]
scores_single = scores_single[order_single].view(-1, 1)
# 6. apply nms (e.g. threshold = 0.7)
# 7. take after_nms_topN (e.g. 300)
# 8. return the top proposals (-> RoIs top)
keep_idx_i = nms(torch.cat((proposals_single, scores_single), 1), nms_thresh, force_cpu=not cfg.USE_GPU_NMS)
keep_idx_i = keep_idx_i.long().view(-1)
if post_nms_topN > 0:
keep_idx_i = keep_idx_i[:post_nms_topN]
proposals_single = proposals_single[keep_idx_i, :]
scores_single = scores_single[keep_idx_i, :]
# padding 0 at the end.
num_proposal = proposals_single.size(0)
output[i, :, 0] = i
output[i, :num_proposal, 1:] = proposals_single
# output shape: [batch_size, post_nms_topN, 5]
# In the third dimension, column 0 indicates which image in the batch the region proposal
# belongs to, and columns 1-4 hold the proposal's coordinates [xmin, ymin, xmax, ymax]
# on the input image resolution after rescaling.
return output
def backward(self, top, propagate_down, bottom):
"""This layer does not propagate gradients."""
pass
def reshape(self, bottom, top):
"""Reshaping happens during the call to forward."""
pass
def _filter_boxes(self, boxes, min_size):
"""Remove all boxes with any side smaller than min_size."""
ws = boxes[:, :, 2] - boxes[:, :, 0] + 1
hs = boxes[:, :, 3] - boxes[:, :, 1] + 1
keep = ((ws >= min_size.view(-1, 1).expand_as(ws)) & (hs >= min_size.view(-1, 1).expand_as(hs)))
return keep
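The nms(...) call above dispatches to the repository's compiled CPU/GPU routine. For reference, greedy IoU-based non-maximum suppression can be sketched in plain PyTorch as follows; this is a minimal illustration of the standard algorithm, not the implementation actually used by the code above:

import torch

def nms_reference(dets, thresh):
    """Greedy non-maximum suppression.
    dets: [N, 5] tensor of (xmin, ymin, xmax, ymax, score); returns kept row indices."""
    x1, y1, x2, y2, scores = dets[:, 0], dets[:, 1], dets[:, 2], dets[:, 3], dets[:, 4]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.sort(descending=True)[1]
    keep = []
    while order.numel() > 0:
        i = order[0].item()
        keep.append(i)
        if order.numel() == 1:
            break
        # Intersection of the highest-scoring remaining box with all the others
        xx1 = torch.clamp(x1[order[1:]], min=x1[i].item())
        yy1 = torch.clamp(y1[order[1:]], min=y1[i].item())
        xx2 = torch.clamp(x2[order[1:]], max=x2[i].item())
        yy2 = torch.clamp(y2[order[1:]], max=y2[i].item())
        w = torch.clamp(xx2 - xx1 + 1, min=0.0)
        h = torch.clamp(yy2 - yy1 + 1, min=0.0)
        inter = w * h
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop every box that overlaps the kept box by more than the threshold
        order = order[1:][iou <= thresh]
    return torch.tensor(keep, dtype=torch.long)

The compiled routine called in the forward pass consumes the same [N, 5] (box, score) layout, plus a force_cpu flag as shown above.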