Fast RCNN has integrated feature extraction, proposal extraction, bounding box expression (rect refine) and classification into a network, which greatly improves the comprehensive performance, especially in the detection speed.
Fast RCNN has 4 layers
1.Conv layers。 As a CNN network target detection method, fast RCNN first uses a set of basic conv + relu + pooling layer to extract feature maps of image. The feature maps are shared for subsequent RPN layers and full connection layers
2.Region Proposal Networks。 RPN network is used to generate region proposals. In this layer, softmax is used to judge whether anchors belong to positive or negative, and then bounding box regression is used to modify anchors to obtain accurate proposals
3.Roi Pooling。 This layer collects the input feature maps and proposals, extracts the proposed feature maps after synthesizing these information, and sends them to the subsequent full connection layer to determine the target category
4.Classification。 Using the proposal feature maps to calculate the category of the proposal, and at the same time, the final exact position of the detection box is obtained by the bounding box expression again
Let’s take a look at the code implementation
Start with data
def inverse_normalize(img):
2 if opt.caffe_pretrain:
3 img = img + (np.array([122.7717, 115.9465, 102.9801]).reshape(3, 1, 1))
4 return img[::-1, :, :]
5 # approximate un-normalize for visualize
6 return (img * 0.225 + 0.45).clip(min=0, max=1) * 255
The function first reads opt.caffe’u Pretrain to determine whether to use caffe’u Pretrain for pre-training. If so, inverse regularization of the image is to process the image into the format required by Caffe model
def pytorch_normalze(img):
"""
https://github.com/pytorch/vision/issues/223
return appr -1~1 RGB
"""
normalize = tvtsf.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
img = normalize(t.from_numpy(img))
return img.numpy()
pytorch_normalize
The function first sets the normalization parameter to normalize = tvtsf. Normalize (mean = [0.485,0.456,0.406], STD = [0.229,0.224,0.225]) and then normalizes the image img = normalize
def caffe_normalize(img):
"""
return appr -125-125 BGR
"""
img = img[[2, 1, 0], :, :] # RGB-BGR
img = img * 255
mean = np.array([122.7717, 115.9465, 102.9801]).reshape(3, 1, 1)
img = (img - mean).astype(np.float32, copy=True)
return img
caffe_normalize(img)
Caffe’s picture format is BGR, so it needs img [[2,1,0],:,:] to convert RGB into BGR format, then img = img * 255, mean = NP. Array ([122.7717115.9465102.9801]). Reshape (3,1,1) to set the picture mean value
Then, the image is subtracted from the mean value to complete the normalization of Caffe form
And then let’s talk about the preprocessing model
def loc2bbox(src_bbox, loc):
"""Decode bounding boxes from bounding box offsets and scales.
Given bounding box offsets and scales computed by
:meth:`bbox2loc`, this function decodes the representation to
coordinates in 2D image coordinates.
Given scales and offsets :math:`t_y, t_x, t_h, t_w` and a bounding
box whose center is :math:`(y, x) = p_y, p_x` and size :math:`p_h, p_w`,
the decoded bounding box's center :math:`\\hat{g}_y`, :math:`\\hat{g}_x`
and size :math:`\\hat{g}_h`, :math:`\\hat{g}_w` are calculated
by the following formulas.
* :math:`\\hat{g}_y = p_h t_y + p_y`
* :math:`\\hat{g}_x = p_w t_x + p_x`
* :math:`\\hat{g}_h = p_h \\exp(t_h)`
* :math:`\\hat{g}_w = p_w \\exp(t_w)`
The decoding formulas are used in works such as R-CNN [#]_.
The output is same type as the type of the inputs.
.. [#] Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. \
Rich feature hierarchies for accurate object detection and semantic \
segmentation. CVPR 2014.
Args:
src_bbox (array): A coordinates of bounding boxes.
Its shape is :math:`(R, 4)`. These coordinates are
:math:`p_{ymin}, p_{xmin}, p_{ymax}, p_{xmax}`.
loc (array): An array with offsets and scales.
The shapes of :obj:`src_bbox` and :obj:`loc` should be same.
This contains values :math:`t_y, t_x, t_h, t_w`.
Returns:
array:
Decoded bounding box coordinates. Its shape is :math:`(R, 4)`. \
The second axis contains four values \
:math:`\\hat{g}_{ymin}, \\hat{g}_{xmin},
\\hat{g}_{ymax}, \\hat{g}_{xmax}`.
"""
if src_bbox.shape[0] == 0:
return xp.zeros((0, 4), dtype=loc.dtype)
src_bbox = src_bbox.astype(src_bbox.dtype, copy=False)
src_height = src_bbox[:, 2] - src_bbox[:, 0]
src_width = src_bbox[:, 3] - src_bbox[:, 1]
src_ctr_y = src_bbox[:, 0] + 0.5 * src_height
src_ctr_x = src_bbox[:, 1] + 0.5 * src_width
dy = loc[:, 0::4]
dx = loc[:, 1::4]
dh = loc[:, 2::4]
dw = loc[:, 3::4]
ctr_y = dy * src_height[:, xp.newaxis] + src_ctr_y[:, xp.newaxis]
ctr_x = dx * src_width[:, xp.newaxis] + src_ctr_x[:, xp.newaxis]
h = xp.exp(dh) * src_height[:, xp.newaxis]
w = xp.exp(dw) * src_width[:, xp.newaxis]
dst_bbox = xp.zeros(loc.shape, dtype=loc.dtype)
dst_bbox[:, 0::4] = ctr_y - 0.5 * h
dst_bbox[:, 1::4] = ctr_x - 0.5 * w
dst_bbox[:, 2::4] = ctr_y + 0.5 * h
dst_bbox[:, 3::4] = ctr_x + 0.5 * w
return dst_bbox
loc2bbox()
class AnchorTargetCreator(object):
"""Assign the ground truth bounding boxes to anchors.
Assigns the ground truth bounding boxes to anchors for training Region
Proposal Networks introduced in Faster R-CNN [#]_.
Offsets and scales to match anchors to the ground truth are
calculated using the encoding scheme of
:func:`model.utils.bbox_tools.bbox2loc`.
.. [#] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. \
Faster R-CNN: Towards Real-Time Object Detection with \
Region Proposal Networks. NIPS 2015.
Args:
n_sample (int): The number of regions to produce.
pos_iou_thresh (float): Anchors with IoU above this
threshold will be assigned as positive.
neg_iou_thresh (float): Anchors with IoU below this
threshold will be assigned as negative.
pos_ratio (float): Ratio of positive regions in the
sampled regions.
"""
def __init__(self,
n_sample=256,
pos_iou_thresh=0.7, neg_iou_thresh=0.3,
pos_ratio=0.5):
self.n_sample = n_sample
self.pos_iou_thresh = pos_iou_thresh
self.neg_iou_thresh = neg_iou_thresh
self.pos_ratio = pos_ratio
def __call__(self, bbox, anchor, img_size):
"""Assign ground truth supervision to sampled subset of anchors.
Types of input arrays and output arrays are same.
Here are notations.
* :math:`S` is the number of anchors.
* :math:`R` is the number of bounding boxes.
Args:
bbox (array): Coordinates of bounding boxes. Its shape is
:math:`(R, 4)`.
anchor (array): Coordinates of anchors. Its shape is
:math:`(S, 4)`.
img_size (tuple of ints): A tuple :obj:`H, W`, which
is a tuple of height and width of an image.
Returns:
(array, array):
#NOTE: it's scale not only offset
* **loc**: Offsets and scales to match the anchors to \
the ground truth bounding boxes. Its shape is :math:`(S, 4)`.
* **label**: Labels of anchors with values \
:obj:`(1=positive, 0=negative, -1=ignore)`. Its shape \
is :math:`(S,)`.
"""
img_H, img_W = img_size
n_anchor = len(anchor)
inside_index = _get_inside_index(anchor, img_H, img_W)
anchor = anchor[inside_index]
argmax_ious, label = self._create_label(
inside_index, anchor, bbox)
# compute bounding box regression targets
loc = bbox2loc(anchor, bbox[argmax_ious])
# map up to original set of anchors
label = _unmap(label, n_anchor, inside_index, fill=-1)
loc = _unmap(loc, n_anchor, inside_index, fill=0)
return loc, label
AnchorTargeCreator
Again, the function of training
def train(**kwargs):
opt._parse(kwargs)
dataset = Dataset(opt)
print('load data')
dataloader = data_.DataLoader(dataset, \
batch_size=1, \
shuffle=True, \
# pin_memory=True,
num_workers=opt.num_workers)
testset = TestDataset(opt)
test_dataloader = data_.DataLoader(testset,
batch_size=1,
num_workers=opt.test_num_workers,
shuffle=False, \
pin_memory=True
)
faster_rcnn = FasterRCNNVGG16()
print('model construct completed')
trainer = FasterRCNNTrainer(faster_rcnn).cuda()
if opt.load_path:
trainer.load(opt.load_path)
print('load pretrained model from %s' % opt.load_path)
trainer.vis.text(dataset.db.label_names, win='labels')
best_map = 0
lr_ = opt.lr
for epoch in range(opt.epoch):
trainer.reset_meters()
for ii, (img, bbox_, label_, scale) in tqdm(enumerate(dataloader)):
scale = at.scalar(scale)
img, bbox, label = img.cuda().float(), bbox_.cuda(), label_.cuda()
trainer.train_step(img, bbox, label, scale)#进行一次参数优化过程
if (ii + 1) % opt.plot_every == 0:
if os.path.exists(opt.debug_file):
ipdb.set_trace()
# plot loss
trainer.vis.plot_many(trainer.get_meter_data())
# plot groud truth bboxes
ori_img_ = inverse_normalize(at.tonumpy(img[0]))
gt_img = visdom_bbox(ori_img_,
at.tonumpy(bbox_[0]),
at.tonumpy(label_[0]))
trainer.vis.img('gt_img', gt_img)
# plot predicti bboxes
_bboxes, _labels, _scores = trainer.faster_rcnn.predict([ori_img_], visualize=True)
pred_img = visdom_bbox(ori_img_,
at.tonumpy(_bboxes[0]),
at.tonumpy(_labels[0]).reshape(-1),
at.tonumpy(_scores[0]))
trainer.vis.img('pred_img', pred_img)
# rpn confusion matrix(meter)
trainer.vis.text(str(trainer.rpn_cm.value().tolist()), win='rpn_cm')
# roi confusion matrix
trainer.vis.img('roi_cm', at.totensor(trainer.roi_cm.conf, False).float())
eval_result = eval(test_dataloader, faster_rcnn, test_num=opt.test_num)
trainer.vis.plot('test_map', eval_result['map'])
lr_ = trainer.faster_rcnn.optimizer.param_groups[0]['lr']
log_info = 'lr:{}, map:{},loss:{}'.format(str(lr_),
str(eval_result['map']),
str(trainer.get_meter_data()))
trainer.vis.log(log_info)
if eval_result['map'] > best_map:
best_map = eval_result['map']
best_path = trainer.save(best_map=best_map)
if epoch == 9:
trainer.load(best_path)
trainer.faster_rcnn.scale_lr(opt.lr_decay)
lr_ = lr_ * opt.lr_decay
if epoch == 13:
break
train(**kwargs)
Reference articles
https://www.cnblogs.com/kerwins-AC/p/9734381.html
This is for reference to other people’s articles as learning reports, not for business purposes