Code download: https://github.com/pakaqiu/yolov3_simple
Video link: https://www.bilibili.com/video/BV1MK4y1X74Q?p=1
The loss functions from YOLOv1 through v3 evolved step by step; in v4 and v5 in particular, changes to the loss function brought sizable gains in detection performance. Of course, v3 already improves considerably over v1 and v2, and those gains come not only from the loss function but also from changes and optimizations to the network architecture. This article focuses on the design of the v3 loss function; first, a quick review of v1 and v2:
v1 loss function:
v2 loss function:
v2 only changes how the box width/height loss is computed relative to v1, removing the square roots on w and h:
The biggest change in v3 relative to v2 is that the classification loss and the box confidence loss become binary cross-entropy:
In the formula above, S is the size of the network's output grid and B is the number of anchors. The network outputs an S×S feature map, i.e. an S×S grid; each grid cell has B anchors, so we get S×S×B bounding boxes in total. With that many bounding boxes, how does the loss function perform the regression? Terms (a) through (e) of the v3 loss are analyzed one by one below:
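As a quick sanity check of the S×S×B bookkeeping, here is a minimal sketch (assuming a 416×416 input, which gives the standard 13/26/52 output grids with 3 anchors each):

```python
# Number of candidate bounding boxes produced by one YOLOv3 output scale.
S = 13  # grid size of one output scale
B = 3   # anchors per grid cell

num_boxes = S * S * B  # 507 boxes from this scale alone

# Across the three scales (13, 26, 52 at a 416x416 input) the total is 10647.
total = sum(s * s * B for s in (13, 26, 52))
print(num_boxes, total)
```

Only a small fraction of these boxes is ever "responsible" for an object, which is exactly why the loss needs the indicator masks discussed next.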
$I_{ij}^{obj}$ refers to the j-th anchor of the i-th grid cell: if the j-th anchor is responsible for the object, then $I_{ij}^{obj} = 1$; otherwise it is 0. Each grid cell has B anchors, and the responsible one is the anchor among those B whose IOU with the ground-truth box is largest.
$I_{ij}^{noobj}$ indicates that the j-th anchor of the i-th grid cell is not responsible for the object.
Term (a) of the loss above is the error on the object's center coordinates. During training, what is regressed is the offset of the center within its grid cell; this is worth studying carefully alongside the code. Each cell has B anchors, and during training only the anchor with the largest IOU is responsible for the regression in that cell.
Term (b) is the error on the object's width and height. Training does not regress the raw width and height directly; instead, the width/height error is regressed relative to the best-matching anchor of the cell.
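The encodings in (a) and (b) form a round trip: the targets tx, ty are the fractional part of the grid-scaled center, and tw, th are log-ratios to the matched anchor; decoding adds the cell offset back and scales the anchor by exp. A minimal sketch with made-up numbers (the box and anchor values here are hypothetical):

```python
import math

# Hypothetical ground-truth box in grid units and a matched anchor (also in grid units).
gx, gy = 6.3, 4.8                    # object center
gw, gh = 3.2, 2.1                    # object width / height
anchor_w, anchor_h = 3.625, 2.8125   # assumed scaled-anchor values

# Targets the loss actually regresses: cell-relative offsets and log-ratios.
gi, gj = int(gx), int(gy)
tx, ty = gx - gi, gy - gj            # in (0, 1), matching the sigmoid output range
tw = math.log(gw / anchor_w)
th = math.log(gh / anchor_h)

# Decoding (what pred_boxes does in the code below): invert both transforms.
assert abs((tx + gi) - gx) < 1e-9
assert abs(math.exp(tw) * anchor_w - gw) < 1e-9
assert abs(math.exp(th) * anchor_h - gh) < 1e-9
```

Keeping tx, ty in (0, 1) is why the network's raw center outputs go through a sigmoid, and the log-ratio keeps tw, th near zero when the anchor already matches the object's shape.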
Terms (c) and (d) are the object confidence errors, computed with binary cross-entropy. Note that the confidence error is computed whether or not a cell is responsible for some object; since most of an input image contains no objects and only a small portion does, a weight must be applied to constrain the no-object confidence loss. Here $\tilde{C}_{j}^{i}$ denotes the ground-truth value, determined by whether the cell's bounding box is responsible for predicting some object: if responsible, $\tilde{C}_{j}^{i} = 1$, otherwise $\tilde{C}_{j}^{i} = 0$. $C_{j}^{i}$ denotes the predicted value.
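The weighted confidence loss can be sketched in a few lines. The confidence values and the obj/noobj weights below are made-up illustration numbers (the reference implementation used in this article defaults to obj_scale=1, noobj_scale=100, but treat those as assumptions):

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

# Hypothetical predicted confidences at 2 responsible and 4 empty locations.
pred_conf_obj = torch.tensor([0.9, 0.7])
pred_conf_noobj = torch.tensor([0.2, 0.1, 0.05, 0.3])

# Targets: 1 where an anchor is responsible for an object, 0 elsewhere.
loss_conf_obj = bce(pred_conf_obj, torch.ones_like(pred_conf_obj))
loss_conf_noobj = bce(pred_conf_noobj, torch.zeros_like(pred_conf_noobj))

# Weighted sum, constraining the dominant no-object term (assumed weights).
obj_scale, noobj_scale = 1.0, 100.0
loss_conf = obj_scale * loss_conf_obj + noobj_scale * loss_conf_noobj
```

Without the weighting, the thousands of empty locations would contribute on the same footing as the handful of responsible ones, and the network could drift toward predicting zero confidence everywhere.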
Term (e) is the object classification error, also computed with binary cross-entropy. Only when the j-th anchor of the i-th grid cell is responsible for some ground-truth object does the bounding box produced by that anchor contribute to the classification loss.
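Because v3 applies an independent sigmoid plus binary cross-entropy per class instead of a softmax, one box can legitimately carry several labels at once (e.g. both "person" and "woman"), which softmax cannot express. A minimal sketch with hypothetical values:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

# Hypothetical per-class probabilities for one responsible box over 4 classes,
# already passed through sigmoid, so they need not sum to 1.
pred_cls = torch.tensor([0.85, 0.10, 0.72, 0.05])

# Multi-hot target: classes 0 and 2 are both "on" -- impossible with softmax.
tcls = torch.tensor([1.0, 0.0, 1.0, 0.0])

loss_cls = bce(pred_cls, tcls)
```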
Concrete code implementation:
def forward(self, x, targets=None, img_dim=None):
    # Tensors for cuda support
    FloatTensor = torch.cuda.FloatTensor if x.is_cuda else torch.FloatTensor
    LongTensor = torch.cuda.LongTensor if x.is_cuda else torch.LongTensor
    ByteTensor = torch.cuda.ByteTensor if x.is_cuda else torch.ByteTensor

    self.img_dim = img_dim
    num_samples = x.size(0)
    grid_size = x.size(2)

    prediction = (
        x.view(num_samples, self.num_anchors, self.num_classes + 5, grid_size, grid_size)
        .permute(0, 1, 3, 4, 2)
        .contiguous()
    )

    # Get outputs
    x = torch.sigmoid(prediction[..., 0])  # Center x
    y = torch.sigmoid(prediction[..., 1])  # Center y
    w = prediction[..., 2]  # Width
    h = prediction[..., 3]  # Height
    pred_conf = torch.sigmoid(prediction[..., 4])  # Conf
    pred_cls = torch.sigmoid(prediction[..., 5:])  # Cls pred.

    # If grid size does not match current we compute new offsets
    if grid_size != self.grid_size:
        self.compute_grid_offsets(grid_size, cuda=x.is_cuda)

    # Add offset and scale with anchors
    pred_boxes = FloatTensor(prediction[..., :4].shape)
    pred_boxes[..., 0] = x.data + self.grid_x
    pred_boxes[..., 1] = y.data + self.grid_y
    pred_boxes[..., 2] = torch.exp(w.data) * self.anchor_w
    pred_boxes[..., 3] = torch.exp(h.data) * self.anchor_h

    output = torch.cat(
        (
            pred_boxes.view(num_samples, -1, 4) * self.stride,
            pred_conf.view(num_samples, -1, 1),
            pred_cls.view(num_samples, -1, self.num_classes),
        ),
        -1,
    )

    if targets is None:
        return output, 0
    else:
        iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tcls, tconf = build_targets(
            pred_boxes=pred_boxes,
            pred_cls=pred_cls,
            target=targets,
            anchors=self.scaled_anchors,
            ignore_thres=self.ignore_thres,
        )

        # Loss : Mask outputs to ignore non-existing objects (except with conf. loss)
        loss_x = self.mse_loss(x[obj_mask], tx[obj_mask])
        loss_y = self.mse_loss(y[obj_mask], ty[obj_mask])
        loss_w = self.mse_loss(w[obj_mask], tw[obj_mask])
        loss_h = self.mse_loss(h[obj_mask], th[obj_mask])
        loss_conf_obj = self.bce_loss(pred_conf[obj_mask], tconf[obj_mask])
        loss_conf_noobj = self.bce_loss(pred_conf[noobj_mask], tconf[noobj_mask])
        loss_conf = self.obj_scale * loss_conf_obj + self.noobj_scale * loss_conf_noobj
        loss_cls = self.bce_loss(pred_cls[obj_mask], tcls[obj_mask])
        total_loss = loss_x + loss_y + loss_w + loss_h + loss_conf + loss_cls

        # Metrics
        cls_acc = 100 * class_mask[obj_mask].mean()
        conf_obj = pred_conf[obj_mask].mean()
        conf_noobj = pred_conf[noobj_mask].mean()
        conf50 = (pred_conf > 0.5).float()
        iou50 = (iou_scores > 0.5).float()
        iou75 = (iou_scores > 0.75).float()
        detected_mask = conf50 * class_mask * tconf
        precision = torch.sum(iou50 * detected_mask) / (conf50.sum() + 1e-16)
        recall50 = torch.sum(iou50 * detected_mask) / (obj_mask.sum() + 1e-16)
        recall75 = torch.sum(iou75 * detected_mask) / (obj_mask.sum() + 1e-16)

        self.metrics = {
            "loss": to_cpu(total_loss).item(),
            "x": to_cpu(loss_x).item(),
            "y": to_cpu(loss_y).item(),
            "w": to_cpu(loss_w).item(),
            "h": to_cpu(loss_h).item(),
            "conf": to_cpu(loss_conf).item(),
            "cls": to_cpu(loss_cls).item(),
            "cls_acc": to_cpu(cls_acc).item(),
            "recall50": to_cpu(recall50).item(),
            "recall75": to_cpu(recall75).item(),
            "precision": to_cpu(precision).item(),
            "conf_obj": to_cpu(conf_obj).item(),
            "conf_noobj": to_cpu(conf_noobj).item(),
            "grid_size": grid_size,
        }

        return output, total_loss
def build_targets(pred_boxes, pred_cls, target, anchors, ignore_thres):
    ByteTensor = torch.cuda.ByteTensor if pred_boxes.is_cuda else torch.ByteTensor
    FloatTensor = torch.cuda.FloatTensor if pred_boxes.is_cuda else torch.FloatTensor

    nB = pred_boxes.size(0)
    nA = pred_boxes.size(1)
    nC = pred_cls.size(-1)
    nG = pred_boxes.size(2)

    # Output tensors
    obj_mask = ByteTensor(nB, nA, nG, nG).fill_(0)
    noobj_mask = ByteTensor(nB, nA, nG, nG).fill_(1)
    class_mask = FloatTensor(nB, nA, nG, nG).fill_(0)
    iou_scores = FloatTensor(nB, nA, nG, nG).fill_(0)
    tx = FloatTensor(nB, nA, nG, nG).fill_(0)
    ty = FloatTensor(nB, nA, nG, nG).fill_(0)
    tw = FloatTensor(nB, nA, nG, nG).fill_(0)
    th = FloatTensor(nB, nA, nG, nG).fill_(0)
    tcls = FloatTensor(nB, nA, nG, nG, nC).fill_(0)

    # Convert to position relative to box
    target_boxes = target[:, 2:6] * nG
    gxy = target_boxes[:, :2]
    gwh = target_boxes[:, 2:]

    # Get anchors with best iou
    ious = torch.stack([bbox_wh_iou(anchor, gwh) for anchor in anchors])
    best_ious, best_n = ious.max(0)  # take the maximum IOU over anchors

    # Separate target values
    b, target_labels = target[:, :2].long().t()
    gx, gy = gxy.t()
    gw, gh = gwh.t()
    gi, gj = gxy.long().t()

    # Set masks
    obj_mask[b, best_n, gj, gi] = 1    # set obj_mask to 1 at the best anchor's position
    noobj_mask[b, best_n, gj, gi] = 0  # set noobj_mask to 0 at the best anchor's position

    # Set noobj mask to zero where iou exceeds ignore threshold
    for i, anchor_ious in enumerate(ious.t()):
        # Anchors whose IOU with the ground truth exceeds ignore_thres also get
        # noobj_mask = 0, i.e. such anchors are simply ignored by the loss
        noobj_mask[b[i], anchor_ious > ignore_thres, gj[i], gi[i]] = 0

    # Coordinates
    tx[b, best_n, gj, gi] = gx - gx.floor()  # convert center coordinates to in-cell offsets
    ty[b, best_n, gj, gi] = gy - gy.floor()

    # Width and height
    tw[b, best_n, gj, gi] = torch.log(gw / anchors[best_n][:, 0] + 1e-16)  # convert width/height to log-ratios vs. the anchor
    th[b, best_n, gj, gi] = torch.log(gh / anchors[best_n][:, 1] + 1e-16)

    # One-hot encoding of label
    tcls[b, best_n, gj, gi, target_labels] = 1

    # Compute label correctness and iou at best anchor
    class_mask[b, best_n, gj, gi] = (pred_cls[b, best_n, gj, gi].argmax(-1) == target_labels).float()
    iou_scores[b, best_n, gj, gi] = bbox_iou(pred_boxes[b, best_n, gj, gi], target_boxes, x1y1x2y2=False)

    tconf = obj_mask.float()
    return iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tcls, tconf
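build_targets relies on two helpers not shown above, bbox_wh_iou and bbox_iou. The following is a sketch of what bbox_wh_iou computes, assuming the usual anchor-matching convention: an IOU over widths and heights only, as if both boxes shared the same corner, so that anchor selection depends on shape rather than position:

```python
import torch

def bbox_wh_iou(wh1, wh2):
    # IOU using only width/height, i.e. both boxes aligned at a common corner.
    # wh1: shape (2,) anchor size; wh2: shape (N, 2) ground-truth sizes.
    wh2 = wh2.t()
    w1, h1 = wh1[0], wh1[1]
    w2, h2 = wh2[0], wh2[1]
    inter_area = torch.min(w1, w2) * torch.min(h1, h2)
    union_area = (w1 * h1 + 1e-16) + w2 * h2 - inter_area
    return inter_area / union_area
```

For example, an anchor with the exact shape of the ground truth scores 1.0, while an anchor twice as wide and twice as tall scores 0.25, so the best-shaped anchor wins the `ious.max(0)` above.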
My knowledge here is limited; corrections are welcome. Thank you!