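I was fortunate to take part in the object detection team-study program organized by DataWhale, and I learned a great deal from it.
Each day I write down the gaps in my knowledge that I uncovered, so that I can review them regularly.
Open-source project: https://datawhalechina.github.io/dive-into-cv-pytorch/#/chapter03_object_detection_introduction/3_5
SSD paper: https://arxiv.org/pdf/1512.02325.pdf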
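I. Loss Functions
1. Basic concepts
Three kinds of functions matter a great deal in machine learning and deep learning: activation functions, optimization functions, and loss functions.
Activation function: also called the response function. Its main job is to pass each neuron's output through a non-linear function, so that the neural network as a whole is no longer a linear model. That non-linear function is the activation function. Several common activation functions are shown in Figure 1.
Figure 1. Several common activation functions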
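To make the point about non-linearity concrete, here is a minimal sketch in plain Python. The helper names (`relu`, `leaky_relu`, `sigmoid`, `linear_net`, `relu_net`) and the toy weights 2.0 and 3.0 are illustrative assumptions, not part of the project code.

```python
import math


def relu(x: float) -> float:
    """ReLU: pass positive inputs through, clamp negatives to 0."""
    return max(0.0, x)


def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    """Leaky ReLU: like ReLU, but negative inputs keep a small slope."""
    return x if x > 0 else negative_slope * x


def sigmoid(x: float) -> float:
    """Sigmoid: squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def linear_net(x: float) -> float:
    """Two stacked linear layers without activation: still one linear map."""
    return 3.0 * (2.0 * x)


def relu_net(x: float) -> float:
    """The same two layers with a ReLU in between (hypothetical toy network)."""
    return 3.0 * relu(2.0 * x)


# A linear map satisfies f(a + b) == f(a) + f(b); the ReLU network does not.
a, b = 1.0, -2.0
print(linear_net(a + b) == linear_net(a) + linear_net(b))  # True
print(relu_net(a + b) == relu_net(a) + relu_net(b))  # False
```

Note that `leaky_relu(x, negative_slope=0.0)` gives the same values as `relu(x)`.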
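Optimization function: the optimization method (optimizer) that updates the parameters from the gradients computed by backpropagation. The choice of optimizer directly affects how fast the model converges. Several common optimizers are shown in Figure 2.
Figure 2. Several common optimization functions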
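As a rough illustration of how the optimizer affects convergence speed, the sketch below compares plain gradient descent with a momentum variant on the toy objective f(w) = w². The objective, the learning rate, the step count, and the function names are all assumptions chosen for the demonstration.

```python
def grad(w: float) -> float:
    """Gradient of the toy objective f(w) = w ** 2."""
    return 2.0 * w


def sgd(w: float, lr: float, steps: int) -> float:
    """Plain gradient descent: step against the current gradient."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w


def sgd_momentum(w: float, lr: float, steps: int, momentum: float = 0.9) -> float:
    """Gradient descent with momentum: step along a running sum of past gradients."""
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum * velocity + grad(w)
        w -= lr * velocity
    return w


# Same start, same learning rate, same number of steps; the minimum is at w = 0.
w_plain = sgd(5.0, lr=0.01, steps=50)
w_momentum = sgd_momentum(5.0, lr=0.01, steps=50)
print(abs(w_plain), abs(w_momentum))
```

With these particular settings the momentum variant ends up closer to the minimum; with other settings the ranking can differ, which is why hand tuning still matters.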
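Loss function: the much-discussed loss, which measures how far the model's predictions are from the ground truth. The smaller the loss, the better the model fits the data it is evaluated on. Several common loss functions are shown in Figure 3.
Figure 3. Several common loss functions
Looking across these three, activation functions have developed slowly in recent years: deep networks mostly use ReLU and its variants such as Leaky ReLU (ReLU can be seen as the special case of Leaky ReLU with a negative slope of 0). Optimizers are largely dominated by Adam (experts still tune mini-batch SGD + Momentum by hand). Only the loss function retains real vitality. In papers at the top CV conferences, whenever the authors propose a new algorithm, the activation may well be the well-worn ReLU and the optimizer the all-in-one Adam, but the loss function is often designed by the authors themselves. How well the loss is designed also reflects how deeply the authors understand their own algorithm.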
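2. Matching Strategy
We lay out a large number of prior bboxes and want each of them to predict a class and box coordinates. To do that, we first need to know which object each prior bbox corresponds to; only then can we judge whether its prediction is accurate and keep training.
Different methods match ground truth boxes to prior bboxes in broadly similar ways, though the details differ. Here we adopt the matching strategy from SSD, as follows:
First principle: starting from the ground truth boxes, for each ground truth box find the prior bbox with the largest jaccard overlap. This guarantees that every ground truth box is matched to at least one prior bbox (jaccard overlap is simply IOU, shown in Figure 4 and introduced earlier). Conversely, a prior bbox that is not matched to any ground truth can only be matched to the background, which makes it a negative sample.
Figure 4. IOU
The IOU code appeared in an earlier post; here it is once more.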
```
1. Assumed way the boxes intersect
Assume the two boxes intersect as follows:
------------------------------------------------------> x axis
|
|  +-------------------+
|  |                   |
|  |  bbox1            |
|  |                   |
|  |     bbox2:(x1,y1) |
|  |        +----------+--------------+
|  |        |          |              |
|  |        |          |              |
|  |        |          |              |
|  +--------+----------+              |
|           |    bbox1:(x2,y2)        |
|           |                         |
|           |              bbox2      |
|           |                         |
|           +-------------------------+
|
v
y axis
2. Formula for IOU, the object detection evaluation metric
                      area of the intersection rectangle
IOU = --------------------------------------------------------------------
       area of box 1 + area of box 2 - area of the intersection rectangle
```
```python
def get_iou(bb1, bb2):
    """Return the IOU of two boxes given as dicts with keys 'x1', 'y1', 'x2', 'y2'."""
    # Sanity checks: execution continues only if each box has positive width and height
    assert bb1['x1'] < bb1['x2']
    assert bb1['y1'] < bb1['y2']
    assert bb2['x1'] < bb2['x2']
    assert bb2['y1'] < bb2['y2']

    # Top-left corner of the intersection: the larger of the two top-left coordinates
    # (in the diagram above this is bbox2's (x1, y1))
    x_left = max(bb1['x1'], bb2['x1'])
    y_top = max(bb1['y1'], bb2['y1'])
    # Bottom-right corner of the intersection: the smaller of the two bottom-right
    # coordinates (in the diagram above this is bbox1's (x2, y2))
    x_right = min(bb1['x2'], bb2['x2'])
    y_bottom = min(bb1['y2'], bb2['y2'])

    if x_right < x_left or y_bottom < y_top:  # the boxes do not intersect
        return 0.0

    intersection_area = (x_right - x_left) * (y_bottom - y_top)
    bb1_area = (bb1['x2'] - bb1['x1']) * (bb1['y2'] - bb1['y1'])  # area of box 1
    bb2_area = (bb2['x2'] - bb2['x1']) * (bb2['y2'] - bb2['y1'])  # area of box 2

    # IOU = intersection / union
    iou = intersection_area / float(bb1_area + bb2_area - intersection_area)
    assert iou >= 0.0
    assert iou <= 1.0
    return iou
```
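Matching needs the IOU between every ground truth box and every prior bbox, not just one pair. The sketch below builds that matrix in plain Python. The tuple layout `(x1, y1, x2, y2)` and the names `Box`, `iou` and `pairwise_iou` are assumptions for illustration; the project's own `find_jaccard_overlap` works on tensors and may differ in detail.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """IOU of two axis-aligned boxes in corner coordinates."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:  # no overlap
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def pairwise_iou(gts: List[Box], priors: List[Box]) -> List[List[float]]:
    """Return a matrix where entry [j][i] is the IOU of ground truth j and prior i."""
    return [[iou(g, p) for p in priors] for g in gts]


gts = [(0.0, 0.0, 2.0, 2.0)]
priors = [(1.0, 1.0, 3.0, 3.0), (0.0, 0.0, 2.0, 2.0), (5.0, 5.0, 6.0, 6.0)]
print(pairwise_iou(gts, priors))
```

The first prior overlaps the ground truth in a 1 x 1 square, so its IOU is 1 / (4 + 4 - 1) = 1/7; the second is identical to the ground truth and the third does not touch it.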
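An image contains very few ground truths but a great many prior bboxes. If we matched by the first principle alone, the vast majority of prior bboxes would be negative samples, and positives and negatives would be extremely imbalanced, so a second principle is needed.
Second principle: starting from the prior bboxes, try to pair each still-unmatched prior bbox with the ground truth boxes. As long as the jaccard overlap between the two exceeds a threshold (usually 0.5), the prior bbox is matched to that ground truth as well. This means one ground truth may be matched by several prior bboxes, which is fine. The reverse is not allowed: a prior bbox can match only one ground truth. If several ground truths have an IOU above the threshold with the same prior bbox, the prior bbox is matched only to the ground truth with the largest IOU.
Note: the second principle is applied only after the first, and the first-principle matches take priority. Consider this case: the best prior bbox for ground truth A has an IOU with A that is below the threshold, while that same prior bbox has an IOU above the threshold with another ground truth B. Which one should the prior bbox match? The answer is A, because we must first make sure that every ground truth has at least one prior bbox matched to it.
Here is an example that illustrates the matching principles above:
Figure 5
In Figure 5, the 7 red boxes are the prior boxes and the yellow boxes are the ground truths; this image contains three real objects. Following the steps listed above produces the following matches:
Figure 6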
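The two matching principles can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: `match_priors` and the overlap values are hypothetical, `overlap[j][i]` is assumed to hold the IOU of ground truth j and prior i, and -1 is used here to mark background.

```python
from typing import List


def match_priors(overlap: List[List[float]], threshold: float = 0.5) -> List[int]:
    """Return, for each prior, the index of its matched ground truth or -1 for background."""
    n_objects = len(overlap)
    n_priors = len(overlap[0])
    matches = [-1] * n_priors

    # Second principle: each prior takes its best ground truth if the IOU exceeds the threshold.
    for i in range(n_priors):
        best_j = max(range(n_objects), key=lambda j: overlap[j][i])
        if overlap[best_j][i] > threshold:
            matches[i] = best_j

    # First principle, written last so that it overrides the assignments above:
    # every ground truth claims the prior it overlaps most, whatever the IOU.
    for j in range(n_objects):
        best_i = max(range(n_priors), key=lambda i: overlap[j][i])
        matches[best_i] = j

    return matches


# Ground truth 0 overlaps its best prior (prior 0) with IOU 0.4 only,
# while prior 0 also overlaps ground truth 1 with IOU 0.6.
overlap = [
    [0.4, 0.1, 0.0],
    [0.6, 0.7, 0.2],
]
print(match_priors(overlap))  # [0, 1, -1]
```

Prior 0 goes to ground truth 0 even though its IOU with ground truth 1 is higher, which is exactly the conflict case discussed in the note. If two ground truths share the same best prior, this simple version lets the later one win.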
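3. Loss function
Next, let's look at how the loss function is designed.
Since this project is object detection, and object detection = localization + classification, the overall objective is defined as a weighted sum of the localization loss (loc) and the confidence loss (conf):
$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$
Here N is the number of prior bboxes matched to a GT (ground truth); if N = 0, the loss is set to 0. The parameter α adjusts the balance between the confidence loss and the location loss, and defaults to α = 1.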
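A small numeric sketch of the weighted sum, assuming the two terms are summed over priors and then normalized by N as in the SSD paper. The function name `total_loss` and the example numbers are made up for illustration.

```python
def total_loss(conf_loss_sum: float, loc_loss_sum: float,
               n_matched: int, alpha: float = 1.0) -> float:
    """Combine summed confidence and localization losses into one scalar."""
    if n_matched == 0:
        # No prior matched any ground truth: the loss is defined as 0.
        return 0.0
    return (conf_loss_sum + alpha * loc_loss_sum) / n_matched


# Hypothetical sums over one image with 4 matched priors.
print(total_loss(conf_loss_sum=6.0, loc_loss_sum=2.0, n_matched=4))  # 2.0
print(total_loss(conf_loss_sum=6.0, loc_loss_sum=2.0, n_matched=0))  # 0.0
```

Raising `alpha` above 1 puts more weight on box regression, lowering it favors classification.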
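The confidence loss is the softmax loss over the multi-class confidences (c):
$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0}), \quad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}$$
Here i indexes the prior bboxes, j indexes the ground truth boxes, and p indexes the classes, with p = 0 denoting the background. The indicator $x_{ij}^{p} \in \{0, 1\}$ equals 1 when the i-th prior bbox is matched to the j-th GT box and that GT box has class p. $\hat{c}_i^{p}$ is the predicted probability that the i-th prior bbox belongs to class p. One point deserves attention: the first half of the formula is the loss over positive samples (Pos), that is, the loss for being classified as some object class (background excluded), while the second half is the loss over negative samples (Neg), that is, the loss for the background class.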
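The softmax loss described above can be written out in plain Python as follows. This is a sketch, not the project's implementation: `softmax` and `confidence_loss` are illustrative names, and the caller is assumed to pass only the priors that should count (the positives plus the selected negatives), with label 0 meaning background.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Turn raw class scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_loss(scores: List[List[float]], labels: List[int]) -> float:
    """Sum of -log(probability of the target class) over the given priors."""
    loss = 0.0
    for prior_scores, label in zip(scores, labels):
        loss -= math.log(softmax(prior_scores)[label])
    return loss


# Two priors, three classes (0 = background). The first is a positive with class 2,
# the second is a negative whose target is the background class.
scores = [[0.1, 0.2, 3.0], [2.5, 0.3, 0.1]]
labels = [2, 0]
print(confidence_loss(scores, labels))
```

A prior whose scores are all equal over K classes contributes log(K), which is a handy sanity check for an untrained network.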
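The location loss (box regression) is a typical smooth L1 loss.
Before going further, let's look at what smooth L1 loss is. See Figure 7.
Figure 7. Definition and plot of smooth L1 loss
Put simply, for -1 < x < 1 it is a halved L2 loss, 0.5x², and everywhere else it is the L1 loss shifted down by 0.5, |x| - 0.5. The two pieces agree in both value and slope at x = -1 and x = 1, so smooth L1 loss is continuous and differentiable everywhere, which makes it easy to take its gradient.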
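The piecewise definition translates directly into code. A minimal scalar sketch (the function name `smooth_l1` is chosen here for illustration):

```python
def smooth_l1(x: float) -> float:
    """Smooth L1: 0.5 * x^2 inside (-1, 1), |x| - 0.5 elsewhere."""
    abs_x = abs(x)
    if abs_x < 1.0:
        return 0.5 * x * x
    return abs_x - 0.5


for x in (-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0):
    print(x, smooth_l1(x))
```

At x = 1 both branches give 0.5, which is the continuity property mentioned above.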
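So what does smooth L1 loss offer over the plain L1 and L2 losses?
The learning rate (learning_rate) is usually scheduled from large to small. With L2 loss, the gradient grows in proportion to the error, so the large errors of early training combined with a large initial learning rate produce very large updates and make training unstable. With L1 loss, the absolute value of the gradient is always 1. Late in training, when predictions are already very close to the ground truth, the derivative of the L1 loss with respect to the prediction still has magnitude 1, so unless the learning rate is reduced the loss oscillates around its stable value and struggles to converge to higher accuracy.
Smooth L1 loss avoids both weaknesses. Early in training a larger learning rate can be used, because the magnitude of the gradient is capped at 1. Late in training, when the loss is small, the error falls into the L2 region: as the learning rate is gradually reduced, the gradient of the loss also shrinks along with the error.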
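The gradient argument is easy to check numerically. The sketch below compares the derivatives of the three losses with respect to the error x, taking L1 as |x| and L2 as x²; the function names are illustrative assumptions.

```python
def sign(x: float) -> float:
    """Return -1.0, 0.0 or 1.0 depending on the sign of x."""
    return float((x > 0) - (x < 0))


def l1_grad(x: float) -> float:
    """d|x|/dx: magnitude 1 no matter how small the error is."""
    return sign(x)


def l2_grad(x: float) -> float:
    """d(x^2)/dx: grows without bound as the error grows."""
    return 2.0 * x


def smooth_l1_grad(x: float) -> float:
    """Derivative of smooth L1: x inside (-1, 1), sign(x) outside."""
    return x if abs(x) < 1.0 else sign(x)


print("error   L1    L2    smooth L1")
for x in (10.0, 1.0, 0.1, 0.01):
    print(x, l1_grad(x), l2_grad(x), smooth_l1_grad(x))
```

For a large error the smooth L1 gradient stays at 1 while the L2 gradient is 20; for a tiny error it shrinks with the error while the L1 gradient is still 1.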
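Because of these properties, both Faster R-CNN and SSD use this loss function.
In this project the location loss (box regression) is defined as follows:
$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\left(l_i^{m} - \hat{g}_j^{m}\right)$$
$$\hat{g}_j^{cx} = \frac{g_j^{cx} - d_i^{cx}}{d_i^{w}}, \quad \hat{g}_j^{cy} = \frac{g_j^{cy} - d_i^{cy}}{d_i^{h}}, \quad \hat{g}_j^{w} = \log\frac{g_j^{w}}{d_i^{w}}, \quad \hat{g}_j^{h} = \log\frac{g_j^{h}}{d_i^{h}}$$
Here l is the predicted box and g is the ground truth. The regression targets are offsets relative to the default box d, whose center is (cx, cy) and whose width and height are (w, h). A more detailed explanation is given in the accompanying figure.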
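As a rough numeric illustration of regressing to offsets, the sketch below encodes a ground truth box relative to a default box and decodes it back. It assumes the plain encoding from the SSD paper (center shift divided by the default box size, log of the size ratio); `encode_offsets` and `decode_offsets` are hypothetical names, and the project's `cxcy_to_gcxgcy` may apply additional scaling factors.

```python
import math
from typing import Tuple

CxCyWH = Tuple[float, float, float, float]  # (cx, cy, w, h)


def encode_offsets(g: CxCyWH, d: CxCyWH) -> CxCyWH:
    """Encode ground truth box g as offsets relative to default box d."""
    g_cx, g_cy, g_w, g_h = g
    d_cx, d_cy, d_w, d_h = d
    return (
        (g_cx - d_cx) / d_w,   # center shift in units of the default width
        (g_cy - d_cy) / d_h,   # center shift in units of the default height
        math.log(g_w / d_w),   # log of the width ratio
        math.log(g_h / d_h),   # log of the height ratio
    )


def decode_offsets(t: CxCyWH, d: CxCyWH) -> CxCyWH:
    """Invert encode_offsets: recover the box from offsets and the default box."""
    t_cx, t_cy, t_w, t_h = t
    d_cx, d_cy, d_w, d_h = d
    return (
        t_cx * d_w + d_cx,
        t_cy * d_h + d_cy,
        math.exp(t_w) * d_w,
        math.exp(t_h) * d_h,
    )


default_box = (0.5, 0.5, 0.2, 0.2)
gt_box = (0.6, 0.5, 0.4, 0.2)
offsets = encode_offsets(gt_box, default_box)
print(offsets)
print(decode_offsets(offsets, default_box))
```

The ground truth here is shifted right by half a default width and is twice as wide, so the offsets are roughly (0.5, 0, log 2, 0). A perfectly aligned box encodes to all zeros, which keeps the regression targets small and centered.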
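4. Hard negative mining
Note that in general the number of negative prior bboxes >> the number of positive prior bboxes. Training directly on all of them makes the network pay too much attention to negatives, and predictions suffer. Faster R-CNN keeps positives and negatives balanced by limiting the number of samples of each kind, whereas SSD does so by fixing the ratio of positives to negatives. To keep the two as balanced as possible, we use the online hard example mining strategy from SSD (hard negative mining): sort the negative prior bboxes by confidence loss, train only on those with the highest confidence loss, and hold the ratio at positive:negative = 1:3. In essence, only the hard negatives, the ones most easily misclassified, are used for training, which keeps positives and negatives balanced and training effective.
For example: suppose that among the 441 prior bboxes, matching yields P positive priors and 441 − P negative priors. Sort the negative prior bboxes by confidence loss in descending order and select the top M. M is determined by the chosen positive-to-negative ratio: with a ratio of 1:3 we take M = 3P. These M highest-loss hard negatives are the ones that actually take part in the loss computation; the remaining negatives do not contribute to the classification loss.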
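The selection step in this example can be sketched for a single image in plain Python. The function name `hard_negative_mining` and the loss values are made up for illustration; the project code does the same thing per image with tensor operations.

```python
from typing import List


def hard_negative_mining(conf_loss: List[float], is_positive: List[bool],
                         neg_pos_ratio: int = 3) -> List[int]:
    """Return the indices of the negatives kept for the classification loss."""
    n_positives = sum(is_positive)
    # Collect negative priors and rank them by confidence loss, hardest first.
    negatives = [i for i, positive in enumerate(is_positive) if not positive]
    negatives.sort(key=lambda i: conf_loss[i], reverse=True)
    # Keep M = neg_pos_ratio * P of them (fewer if there are not enough negatives).
    return negatives[: neg_pos_ratio * n_positives]


# 7 priors: prior 0 is the only positive, so M = 3 * 1 = 3 negatives are kept.
conf_loss = [0.2, 1.5, 0.1, 0.9, 0.05, 0.7, 0.3]
is_positive = [True, False, False, False, False, False, False]
print(hard_negative_mining(conf_loss, is_positive))  # [1, 3, 5]
```

Priors 2, 4 and 6 are easy negatives with low loss, so they are dropped from the classification loss.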
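5. Summary
This section covered how training is carried out, in three main parts:
- Matching prior boxes to GT boxes
- Computing the loss function
- Hard example mining
These three parts need to be understood together, so let's walk through the full loss computation once more.
1) Match prior boxes to GT boxes
Following the scheme described above, assign a class to every prior box, which determines whether it is a positive or a negative sample.
2) Compute the loss
Compute the classification loss and the box regression loss using the loss function defined above.
Negative samples do not contribute to the box regression loss.
3) Hard example mining
The classification part of the loss computed above is not yet the final loss.
Because there are too many negative prior boxes, we keep only the negatives with the highest loss according to a preset ratio, usually 1:3, ignore the rest, and recompute the classification loss.
The complete loss computation is implemented in the MultiBoxLoss class in model.py, shown below.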
```python
import torch
import torch.nn as nn

# cxcy_to_xy, xy_to_cxcy, cxcy_to_gcxgcy and find_jaccard_overlap are helper functions
# defined elsewhere in the project; they are not reproduced here.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class MultiBoxLoss(nn.Module):
    """
    The loss function for object detection.
    The loss computation follows the SSD definition exactly, i.e. the MultiBox Loss.

    This is a combination of:
    (1) a localization loss for the predicted locations of the boxes.
    (2) a confidence loss for the predicted class scores.
    """

    def __init__(self, priors_cxcy, threshold=0.5, neg_pos_ratio=3, alpha=1.):
        super(MultiBoxLoss, self).__init__()
        self.priors_cxcy = priors_cxcy
        self.priors_xy = cxcy_to_xy(priors_cxcy)
        self.threshold = threshold
        self.neg_pos_ratio = neg_pos_ratio
        self.alpha = alpha

        # Smooth L1 loss for box regression, as described in the text
        self.smooth_l1 = nn.SmoothL1Loss()
        # reduction='none' keeps one loss value per prior, which hard negative mining needs
        self.cross_entropy = nn.CrossEntropyLoss(reduction='none')

    def forward(self, predicted_locs, predicted_scores, boxes, labels):
        """
        Forward propagation.

        :param predicted_locs: predicted locations/boxes w.r.t the 441 prior boxes, a tensor of dimensions (N, 441, 4)
        :param predicted_scores: class scores for each of the encoded locations/boxes, a tensor of dimensions (N, 441, n_classes)
        :param boxes: true object bounding boxes in boundary coordinates, a list of N tensors
        :param labels: true object labels, a list of N tensors
        :return: multibox loss, a scalar
        """
        batch_size = predicted_locs.size(0)
        n_priors = self.priors_cxcy.size(0)
        n_classes = predicted_scores.size(2)

        assert n_priors == predicted_locs.size(1) == predicted_scores.size(1)

        true_locs = torch.zeros((batch_size, n_priors, 4), dtype=torch.float).to(device)  # (N, 441, 4)
        true_classes = torch.zeros((batch_size, n_priors), dtype=torch.long).to(device)  # (N, 441)

        # For each image
        for i in range(batch_size):
            n_objects = boxes[i].size(0)

            overlap = find_jaccard_overlap(boxes[i], self.priors_xy)  # (n_objects, 441)

            # For each prior, find the object that has the maximum overlap
            overlap_for_each_prior, object_for_each_prior = overlap.max(dim=0)  # (441)

            # We don't want a situation where an object is not represented in our positive (non-background) priors -
            # 1. An object might not be the best object for all priors, and is therefore not in object_for_each_prior.
            # 2. All priors with the object may be assigned as background based on the threshold (0.5).

            # To remedy this -
            # First, find the prior that has the maximum overlap for each object.
            _, prior_for_each_object = overlap.max(dim=1)  # (N_o)

            # Then, assign each object to the corresponding maximum-overlap-prior. (This fixes 1.)
            object_for_each_prior[prior_for_each_object] = torch.arange(n_objects, device=device)

            # To ensure these priors qualify, artificially give them an overlap of greater than 0.5. (This fixes 2.)
            overlap_for_each_prior[prior_for_each_object] = 1.

            # Labels for each prior
            label_for_each_prior = labels[i][object_for_each_prior]  # (441)
            # Set priors whose overlaps with objects are less than the threshold to be background (no object)
            label_for_each_prior[overlap_for_each_prior < self.threshold] = 0  # (441)

            # Store
            true_classes[i] = label_for_each_prior

            # Encode center-size object coordinates into the form we regressed predicted boxes to
            true_locs[i] = cxcy_to_gcxgcy(xy_to_cxcy(boxes[i][object_for_each_prior]), self.priors_cxcy)  # (441, 4)

        # Identify priors that are positive (object/non-background)
        positive_priors = true_classes != 0  # (N, 441)

        # LOCALIZATION LOSS

        # Localization loss is computed only over positive (non-background) priors
        loc_loss = self.smooth_l1(predicted_locs[positive_priors], true_locs[positive_priors])  # (), scalar

        # Note: indexing with a boolean mask flattens the tensor when indexing is across multiple dimensions (N & 441)
        # So, if predicted_locs has the shape (N, 441, 4), predicted_locs[positive_priors] will have (total positives, 4)

        # CONFIDENCE LOSS

        # Confidence loss is computed over positive priors and the most difficult (hardest) negative priors in each image
        # That is, FOR EACH IMAGE,
        # we will take the hardest (neg_pos_ratio * n_positives) negative priors, i.e where there is maximum loss
        # This is called Hard Negative Mining - it concentrates on hardest negatives in each image, and also minimizes pos/neg imbalance

        # Number of positive and hard-negative priors per image
        n_positives = positive_priors.sum(dim=1)  # (N)
        n_hard_negatives = self.neg_pos_ratio * n_positives  # (N)

        # First, find the loss for all priors
        conf_loss_all = self.cross_entropy(predicted_scores.view(-1, n_classes), true_classes.view(-1))  # (N * 441)
        conf_loss_all = conf_loss_all.view(batch_size, n_priors)  # (N, 441)

        # We already know which priors are positive
        conf_loss_pos = conf_loss_all[positive_priors]  # (sum(n_positives))

        # Next, find which priors are hard-negative
        # To do this, sort ONLY negative priors in each image in order of decreasing loss and take top n_hard_negatives
        conf_loss_neg = conf_loss_all.clone()  # (N, 441)
        conf_loss_neg[positive_priors] = 0.  # (N, 441), positive priors are ignored (never in top n_hard_negatives)
        conf_loss_neg, _ = conf_loss_neg.sort(dim=1, descending=True)  # (N, 441), sorted by decreasing hardness
        hardness_ranks = torch.arange(n_priors, device=device).unsqueeze(0).expand_as(conf_loss_neg)  # (N, 441)
        hard_negatives = hardness_ranks < n_hard_negatives.unsqueeze(1)  # (N, 441)
        conf_loss_hard_neg = conf_loss_neg[hard_negatives]  # (sum(n_hard_negatives))

        # As in the paper, averaged over positive priors only, although computed over both positive and hard-negative priors
        conf_loss = (conf_loss_hard_neg.sum() + conf_loss_pos.sum()) / n_positives.sum().float()  # (), scalar

        # Return TOTAL LOSS
        return conf_loss + self.alpha * loc_loss
```
Learning really is a joy~