Paper-info
- title : Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks [NIPS 2015]
- author : Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun.
- Github : fasterrcnn-pytorch
Motivation
Before introducing Faster R-CNN, let us first quickly review the main stages of R-CNN and Fast R-CNN:
# R-CNN
ROIs = region_proposal(image) # Selective Search etc...
for ROI in ROIs:
    patch = get_patch(image, ROI)   # crop and warp each region from the image
    results = detector(patch)       # run the full CNN + classifier once per region
# Fast R-CNN
feature_maps = process(image) # extract feature for image only once!
ROIs = region_proposal(image) # time consuming !!! [ Selective Search etc...]
for ROI in ROIs:
    patch = roi_pooling(feature_maps, ROI)
    results = detector2(patch)
Below are the network architecture diagrams of R-CNN and Fast R-CNN:
- R-CNN
- Fast R-CNN
Fast R-CNN's test-time speed depends on how the region proposals are generated (e.g., Selective Search), and these methods are slow even on a GPU machine because they run on the CPU: the whole inference pipeline takes about 2.3 s per image, of which generating the ~2000 ROIs alone takes roughly 2 s! From the review above, we know that Fast R-CNN merges R-CNN's multiple stages into a single network and thereby enables end-to-end training, but its inference remains slow (about 0.5 FPS, mainly because of the proposal-generation step), which greatly limits the model's practical applications. To address this weakness, Faster R-CNN creatively proposes the anchor-based RPN (Region Proposal Network), which solves Fast R-CNN's pain point elegantly and efficiently; it became the culmination of the R-CNN series and remains one of the most widely used detection models today.
Idea
First, a quick look at how Selective Search [2] roughly works: it is essentially a bottom-up, dynamic clustering of image regions from small to large, and the bounding boxes of all regions produced along the way are output as proposals. The schematic from [2] illustrates this grouping process.
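As a toy, runnable illustration of this bottom-up merging (not the actual Selective Search algorithm of [2], which starts from a graph-based over-segmentation and merges neighbouring regions by colour/texture/size/fill similarity), the sketch below models regions as sets of pixel coordinates, greedily merges the smallest pair, and emits one proposal per region formed along the way; all names here are ours, for illustration only:

from itertools import combinations

def bbox(region):
    # axis-aligned bounding box of a set of (x, y) pixels
    xs = [p[0] for p in region]
    ys = [p[1] for p in region]
    return (min(xs), min(ys), max(xs), max(ys))

def hierarchical_grouping(regions):
    # regions: list of frozensets of (x, y) pixels from an initial over-segmentation
    proposals = [bbox(r) for r in regions]
    while len(regions) > 1:
        # greedily merge the "most similar" pair; real Selective Search scores pairs by
        # colour/texture/size/fill similarity, here we simply pick the smallest union
        r1, r2 = min(combinations(regions, 2), key=lambda pair: len(pair[0] | pair[1]))
        regions = [r for r in regions if r not in (r1, r2)] + [r1 | r2]
        proposals.append(bbox(regions[-1]))
    return proposals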
To speed up proposal generation, Faster R-CNN creatively introduces the RPN (a shallow network built into the detector); with the anchor mechanism it can produce diverse proposals. Concretely:
step - 1. For every pixel of the feature_map, generate k diverse base anchors (k = [number of sizes] x [number of aspect_ratios]; the base anchors are identical for every pixel), then, using the stride between the feature_map and the input image, map the base anchors back to coordinates of the original image.
step - 2. The RPN has two sibling branches that, for every anchor, output an objectness score and four box offsets.
[Why does the paper predict 2k scores for k anchors? Its cls layer is implemented as a 2-way softmax (object vs. background) per anchor, which yields 2k outputs; the paper notes that a single logistic output per anchor (k scores) works as well, and that is what torchvision's RPNHead below does.]
step - 3. Refine each anchor with its predicted offsets. The offsets use the parameterization from R-CNN/Fast R-CNN:
t_x = (x - x_a) / w_a, t_y = (y - y_a) / h_a, t_w = log(w / w_a), t_h = log(h / h_a)
where (x, y, w, h) are the center and size of the predicted box and (x_a, y_a, w_a, h_a) those of the anchor (the loss function built on these terms is given below). Decoding offsets back into boxes in torchvision looks like this:
# torchvision BoxCoder.decode_single (models/detection/_utils.py)
def decode_single(self, rel_codes, boxes):
    """
    From a set of original boxes and encoded relative box offsets,
    get the decoded boxes.
    Arguments:
        rel_codes (Tensor): bbox offsets (t_x, t_y, t_w, t_h) predicted by the RPN head
        boxes (Tensor): anchor boxes in (x1, y1, x2, y2) format
    """
    boxes = boxes.to(rel_codes.dtype)
    # anchors: corner format -> center / width / height
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    ctr_x = boxes[:, 0] + 0.5 * widths
    ctr_y = boxes[:, 1] + 0.5 * heights
    wx, wy, ww, wh = self.weights  # (1.0, 1.0, 1.0, 1.0) for the RPN box coder
    dx = rel_codes[:, 0::4] / wx
    dy = rel_codes[:, 1::4] / wy
    dw = rel_codes[:, 2::4] / ww
    dh = rel_codes[:, 3::4] / wh
    # Prevent sending too large values into torch.exp()
    dw = torch.clamp(dw, max=self.bbox_xform_clip)
    dh = torch.clamp(dh, max=self.bbox_xform_clip)
    # invert the parameterization: x = dx * w_a + x_a, w = exp(dw) * w_a, ...
    pred_ctr_x = dx * widths[:, None] + ctr_x[:, None]
    pred_ctr_y = dy * heights[:, None] + ctr_y[:, None]
    pred_w = torch.exp(dw) * widths[:, None]
    pred_h = torch.exp(dh) * heights[:, None]
    # back to (x1, y1, x2, y2)
    pred_boxes = torch.zeros_like(rel_codes)
    pred_boxes[:, 0::4] = pred_ctr_x - 0.5 * pred_w
    pred_boxes[:, 1::4] = pred_ctr_y - 0.5 * pred_h
    pred_boxes[:, 2::4] = pred_ctr_x + 0.5 * pred_w
    pred_boxes[:, 3::4] = pred_ctr_y + 0.5 * pred_h
    return pred_boxes
step - 4. Filter the refined anchors using their objectness scores and NMS; the surviving boxes are the proposals the RPN outputs (a minimal sketch of this filtering step follows).
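A minimal sketch of this step, assuming the proposals of a single image and a single feature level have already been decoded to (x1, y1, x2, y2); torchvision's RegionProposalNetwork.filter_proposals additionally clips boxes to the image, removes tiny and low-scoring boxes, and keeps the top boxes per FPN level. The thresholds below are illustrative defaults, not the library's exact configuration:

import torch
from torchvision.ops import nms

def filter_proposals_sketch(proposals, objectness, pre_nms_top_n=2000,
                            nms_thresh=0.7, post_nms_top_n=1000):
    # proposals: [N, 4] decoded boxes; objectness: [N] raw logits from the RPN head
    scores = objectness.sigmoid()
    keep_pre = scores.topk(min(pre_nms_top_n, scores.numel())).indices  # top-k before NMS
    boxes, scores = proposals[keep_pre], scores[keep_pre]
    keep = nms(boxes, scores, nms_thresh)[:post_nms_top_n]              # suppress overlaps
    return boxes[keep], scores[keep]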
loss function
The RPN is trained with the multi-task loss from the paper (N_cls is the mini-batch size, i.e. 256; N_reg is the number of anchor locations, about 2400; lambda = 10 balances the two terms):
L({p_i}, {t_i}) = (1 / N_cls) * sum_i L_cls(p_i, p_star_i) + lambda * (1 / N_reg) * sum_i p_star_i * L_reg(t_i, t_star_i)
where:
p_i : the predicted probability that anchor i is an object;
p_star_i : the ground-truth label (binary label, 1: positive anchor; 0: negative anchor);
t_i : the predicted bbox parameters {t_x, t_y, t_w, t_h};
t_star_i : the ground-truth bbox parameters {t_star_x, t_star_y, t_star_w, t_star_h}, computed w.r.t. the same anchor;
L_cls(p_i, p_star_i) : the classification loss, a log loss over the two classes (object vs. not object);
L_reg(t_i, t_star_i) : the regression loss, the smooth L1 loss defined in Fast R-CNN (the p_star_i factor makes it active only for positive anchors):
smooth_L1(x) = 0.5 * x^2 if |x| < 1, and |x| - 0.5 otherwise
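The same computation in code, as a minimal sketch close in spirit to torchvision's RegionProposalNetwork.compute_loss (assuming a recent PyTorch; the sampling of 256 anchors per image with a 1:1 positive/negative ratio is assumed to have happened already and to be encoded in labels):

import torch
import torch.nn.functional as F

def rpn_loss_sketch(objectness, pred_bbox_deltas, labels, regression_targets):
    # labels: 1 for positive anchors, 0 for sampled negatives, -1 for ignored anchors
    sampled = torch.where(labels >= 0)[0]
    positive = torch.where(labels == 1)[0]
    # L_cls: binary log loss on the sampled anchors (torchvision uses one sigmoid
    # logit per anchor instead of the paper's 2-way softmax)
    loss_objectness = F.binary_cross_entropy_with_logits(
        objectness[sampled].flatten(), labels[sampled].float())
    # L_reg: smooth L1 on the box offsets, only for positive anchors (the p_star_i factor)
    loss_rpn_box_reg = F.smooth_l1_loss(
        pred_bbox_deltas[positive], regression_targets[positive],
        beta=1.0 / 9, reduction="sum") / max(sampled.numel(), 1)
    return loss_objectness, loss_rpn_box_reg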
Outline
feature_maps = process(image)
ROIs = RPN(feature_maps)   # Faster! proposals are computed on the shared feature maps
for ROI in ROIs:
    patch = roi_pooling(feature_maps, ROI)
    results = detector2(patch)
torchvision source-code analysis
Anchor generation
class AnchorGenerator(nn.Module):
"""
Module that generates anchors for a set of feature maps and image sizes.
The module supports computing anchors at multiple sizes and aspect ratios per feature map.
sizes and aspect_ratios should have the same number of elements, and that number should
correspond to the number of feature maps.
sizes[i] and aspect_ratios[i] can have an arbitrary number of elements,
and AnchorGenerator will output a set of sizes[i] * aspect_ratios[i] anchors
per spatial location for feature map i.
Arguments:
sizes (Tuple[Tuple[int]]):
aspect_ratios (Tuple[Tuple[float]]):
"""
def __init__(self, sizes=(128, 256, 512), aspect_ratios=(0.5, 1.0, 2.0)):
super(AnchorGenerator, self).__init__()
...
self.sizes = sizes
self.aspect_ratios = aspect_ratios
self.cell_anchors = None
self._cache = {}
@staticmethod
def generate_anchors(scales, aspect_ratios, device="cpu"):
'''
Generate the base_anchor
'''
scales = torch.as_tensor(scales, dtype=torch.float32, device=device)
aspect_ratios = torch.as_tensor(aspect_ratios, dtype=torch.float32, device=device)
h_ratios = torch.sqrt(aspect_ratios)
w_ratios = 1 / h_ratios
ws = (w_ratios[:, None] * scales[None, :]).view(-1)
hs = (h_ratios[:, None] * scales[None, :]).view(-1)
base_anchors = torch.stack([-ws, -hs, ws, hs], dim=1) / 2  # (x1, y1, x2, y2) centered at the origin
return base_anchors.round()
def set_cell_anchors(self, device):
''' Generate the anchors '''
if self.cell_anchors is not None:
return self.cell_anchors
cell_anchors = [
self.generate_anchors(sizes, aspect_ratios, device)
for sizes, aspect_ratios in zip(self.sizes, self.aspect_ratios)
]
self.cell_anchors = cell_anchors
def num_anchors_per_location(self):
return [len(s) * len(a) for s, a in zip(self.sizes, self.aspect_ratios)]
def grid_anchors(self, grid_sizes, strides):
'''
Generate the anchors according to base_anchor and stride
grid_size : size of feature_map
stride : map_stride between img and feature_map
'''
anchors = list()
for size, stride, base_anchors in zip(grid_sizes, strides, self.cell_anchors):
grid_height, grid_width = size
stride_height, stride_width = stride
device = base_anchors.device
shifts_x = torch.arange(0, grid_width, dtype=torch.float32, device=device) * stride_width
shifts_y = torch.arange(0, grid_height,dtype=torch.float32, device=device) * stride_height
shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x)
shift_x = shift_x.reshape(-1)
shift_y = shift_y.reshape(-1)
shifts = torch.stack((shift_x, shift_y, shift_x, shift_y), dim=1)
anchors.append((shifts.view(-1, 1, 4) + base_anchors.view(1, -1, 4)).reshape(-1, 4))
return anchors
def cached_grid_anchors(self, grid_sizes, strides):
key = tuple(grid_sizes) + tuple(strides)
if key in self._cache:
return self._cache[key]
anchors = self.grid_anchors(grid_sizes, strides)
self._cache[key] = anchors
return anchors
def forward(self, image_list, feature_maps):
'''
step - 1. generate the base_anchor according to scales and aspect_ratios
step - 2. generate the anchors over all feature maps for each image in the batch
NOTE : there may be multi-scale feature_maps (e.g. with an FPN backbone)
'''
# step - 1
self.set_cell_anchors(feature_maps[0].device)
# step - 2
grid_sizes = tuple([feature_map.shape[-2:] for feature_map in feature_maps])
image_size = image_list.tensors.shape[-2:]
# calculate the map-stride between img_size and feature_map
strides = tuple((image_size[0] / g[0], image_size[1] / g[1]) for g in grid_sizes)
anchors_over_all_feature_maps = self.cached_grid_anchors(grid_sizes, strides)
anchors = []
# deal with each instance in batch
for i, (image_height, image_width) in enumerate(image_list.image_sizes):
anchors_in_image = []
for anchors_per_feature_map in anchors_over_all_feature_maps:
anchors_in_image.append(anchors_per_feature_map)
anchors.append(anchors_in_image)
anchors = [torch.cat(anchors_per_image) for anchors_per_image in anchors]
return anchors
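A quick sanity check of the generator (a sketch; in recent torchvision AnchorGenerator is importable from torchvision.models.detection.anchor_utils, older versions define it in torchvision.models.detection.rpn). With 3 sizes x 3 aspect_ratios we get k = 9 anchors per location; a simulated stride-16 feature map of a 600x1000 input gives roughly the ~20k anchors quoted in the paper:

import torch
from torchvision.models.detection.anchor_utils import AnchorGenerator
from torchvision.models.detection.image_list import ImageList

images = torch.rand(1, 3, 600, 1000)
features = [torch.rand(1, 256, 37, 62)]                 # simulated stride-16 feature map
image_list = ImageList(images, [(600, 1000)])
anchor_gen = AnchorGenerator(sizes=((128, 256, 512),),  # one tuple per feature map
                             aspect_ratios=((0.5, 1.0, 2.0),))
anchors = anchor_gen(image_list, features)              # list: one Tensor per image
print(anchors[0].shape)                                 # torch.Size([20646, 4]) = 37*62*9 anchors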
- RPN Head
class RPNHead(nn.Module):
"""
Adds a simple RPN Head with classification and regression heads
Arguments:
in_channels (int): number of channels of the input feature
num_anchors (int): number of anchors to be predicted
"""
def __init__(self, in_channels, num_anchors):
super(RPNHead, self).__init__()
self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=1, padding=1)
self.cls_logits = nn.Conv2d(in_channels, num_anchors, kernel_size=1, stride=1)  # one objectness logit per anchor (k scores, not 2k)
self.bbox_pred = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1, stride=1)
# initialize every conv with N(0, 0.01) weights and zero bias
for l in self.children():
torch.nn.init.normal_(l.weight, std=0.01)
torch.nn.init.constant_(l.bias, 0)
def forward(self, x):
logits, bbox_reg = [], []
for feature in x:
t = F.relu(self.conv(feature)) # 3x3 conv: spatial size unchanged, out_channels == in_channels
logits.append(self.cls_logits(t)) # objectness score for each anchor
bbox_reg.append(self.bbox_pred(t)) # bbox offsets for each anchor
return logits, bbox_reg
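A quick shape check (assuming the torchvision implementation, importable from torchvision.models.detection.rpn; in fasterrcnn_resnet50_fpn each FPN level uses a single size with three aspect ratios, so num_anchors = 3 per location):

import torch
from torchvision.models.detection.rpn import RPNHead

head = RPNHead(in_channels=256, num_anchors=3)
feats = [torch.rand(2, 256, 25, 25), torch.rand(2, 256, 13, 13)]  # two FPN levels, batch of 2
logits, bbox_reg = head(feats)
print(logits[0].shape)    # torch.Size([2, 3, 25, 25])  -> k objectness maps per level
print(bbox_reg[0].shape)  # torch.Size([2, 12, 25, 25]) -> 4k offset maps per level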
- Region Proposal Networks
class RegionProposalNetwork(torch.nn.Module):
...
def forward(self, images, features, targets=None):
"""
Arguments:
images (ImageList): images for which we want to compute the predictions
features (List[Tensor]): features computed from the images that are
used for computing the predictions. Each tensor in the list
corresponds to a different feature level
targets (List[Dict[Tensor]]): ground-truth boxes present in the image (optional).
If provided, each element in the dict should contain a field `boxes`,
with the locations of the ground-truth boxes.
Returns:
boxes (List[Tensor]): the predicted boxes from the RPN, one Tensor per
image.
losses (Dict[Tensor]): the losses for the model during training. During
testing, it is an empty dict.
"""
# feature maps extracted by the backbone
features = list(features.values())
# generate anchors on top of these feature maps
anchors = self.anchor_generator(images, features)
# RPNHead predicts the objectness score and bbox offsets for every anchor
objectness, pred_bbox_deltas = self.head(features)
# the feature maps come from the FPN, so there are multiple scales; concatenate the per-level predictions
num_images = len(anchors)
num_anchors_per_level = [obj[0].numel() for obj in objectness]
objectness, pred_bbox_deltas = concat_box_prediction_layers(objectness, pred_bbox_deltas)
# refine the anchors with the predicted bbox offsets to obtain the initial proposals
proposals = self.box_coder.decode(pred_bbox_deltas.detach(), anchors)
proposals = proposals.view(num_images, -1, 4)
# filter out unreliable boxes using their objectness scores and NMS
boxes, scores = self.filter_proposals(proposals, objectness, images.image_sizes, \
num_anchors_per_level)
losses = {}
if self.training:
labels, matched_gt_boxes = self.assign_targets_to_anchors(anchors, targets)
regression_targets = self.box_coder.encode(matched_gt_boxes, anchors)
loss_objectness, loss_rpn_box_reg = self.compute_loss(objectness, pred_bbox_deltas, \
labels, regression_targets)
losses = {
"loss_objectness": loss_objectness,
"loss_rpn_box_reg": loss_rpn_box_reg,
}
return boxes, losses
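Finally, a sketch of how the pieces fit together in torchvision: GeneralizedRCNN runs backbone -> rpn -> roi_heads, and model.rpn is exactly the RegionProposalNetwork discussed above (no pretrained weights are downloaded here):

import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn()  # randomly initialized by default
model.eval()
with torch.no_grad():
    predictions = model([torch.rand(3, 600, 800)])
print(type(model.rpn).__name__)   # RegionProposalNetwork
print(predictions[0].keys())      # dict_keys(['boxes', 'labels', 'scores'])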
Experiment
Conclusion
- RPN cls scores account for the accuracy of the highest-ranked proposals;
- High-quality proposals are mainly due to the RPN-regressed box positions; anchor boxes alone are not sufficient for accurate detection.
Reference
[1]. S. Ren, K. He, R. Girshick, J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.
[2]. J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, A. W. M. Smeulders. Selective Search for Object Recognition. IJCV 2013.