This post is my own understanding after carefully reading loss.py; if anything here is wrong, corrections are welcome, thanks~
Throughout the post, "cell" means one of the S x S grid cells the image is divided into.
Full code on GitHub: https://github.com/motokimura/yolo_v1_pytorch.git
The loss function is defined as in:
https://towardsdatascience.com/yolov1-you-only-look-once-object-detection-e1f3ffec8a89
The loss consists of four parts: (x, y), (w, h), confidence C, and class probabilities. More precisely, it splits into the (x, y), (w, h), and C losses for cells that contain an object, the C loss for cells that contain no object, and the class loss.
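For reference, the loss from the YOLOv1 paper, which the code below implements term by term (written out in LaTeX):

\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right] \\
&+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}(C_i-\hat{C}_i)^2
 + \lambda_{noobj} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} (p_i(c)-\hat{p}_i(c))^2
\end{aligned}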
To implement the formula above, we define a Loss class that inherits from nn.Module, and then walk through each of its methods:
1. def __init__
class Loss(nn.Module):

    def __init__(self, feature_size=7, num_bboxes=2, num_classes=20, lambda_coord=5.0, lambda_noobj=0.5):
        """ Constructor.
        Args:
            feature_size: (int) size of input feature map.
            num_bboxes: (int) number of bboxes per each cell.
            num_classes: (int) number of the object classes.
            lambda_coord: (float) weight for bbox location/size losses.
            lambda_noobj: (float) weight for no-objectness loss.
        """
        super(Loss, self).__init__()
        self.S = feature_size
        self.B = num_bboxes
        self.C = num_classes
        self.lambda_coord = lambda_coord # weight on the terms for cells that contain an object
        self.lambda_noobj = lambda_noobj # weight on the term for cells that contain no object
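A quick sanity check of the constructor and the overall call (a sketch of mine, not from the repo; the random tensors merely stand in for a real network output and an encoded target, and a CUDA device is assumed since the code uses torch.cuda.ByteTensor):

import torch

criterion = Loss(feature_size=7, num_bboxes=2, num_classes=20)

# Dummy prediction and target, both sized [n_batch, S, S, Bx5+C] = [2, 7, 7, 30].
pred = torch.rand(2, 7, 7, 30).cuda()
target = torch.rand(2, 7, 7, 30).cuda()

loss = criterion(pred, target) # invokes forward(), which is walked through next
print(loss.item())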
2. def forward(self, pred_tensor, target_tensor)
In this method, pred_tensor and target_tensor are both of size batch_size x 7 x 7 x 30; target_tensor holds the ground truth. Which cells contain an object is decided by the confidence values in the last dimension of 30 numbers: 30 = box1 + box2 + classes = (5 + 5 + 20). Only box1's confidence (index 4) is checked here, because during data preparation box2 was filled with the same values as box1, so checking box1 alone is sufficient. target_tensor[:, :, :, 4] == 0 yields a mask of shape [n_batch, S, S]; after unsqueeze(-1).expand_as(target_tensor) it becomes [n_batch, S, S, 30], with all 30 values along the last dimension identical.
Cells whose confidence is greater than 0 contain an object; cells whose confidence equals 0 do not. Indexing with these masks and reshaping yields, for the object cells, the box coordinates and confidences plus the class probabilities, and for the no-object cells, the corresponding confidences; each loss is then computed separately, as sketched below.
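A minimal sketch of these mask mechanics (toy values of mine, not code from loss.py):

import torch

target_tensor = torch.rand(2, 7, 7, 30)                        # [n_batch, S, S, 30]
noobj_mask = target_tensor[:, :, :, 4] == 0                    # [2, 7, 7]: True where box1's confidence is 0
noobj_mask = noobj_mask.unsqueeze(-1).expand_as(target_tensor) # [2, 7, 7, 30], identical along the last dim
# Boolean indexing flattens to 1-D, so reshape to recover whole 30-vectors per selected cell:
noobj_cells = target_tensor[noobj_mask].view(-1, 30)           # [n_noobj, 30]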
2.1 First, the confidence loss C for the cells that contain no object
At the positions where target_tensor's confidence == 0, we take pred_tensor's confidences at the same positions; the resulting noobj_pred_conf and noobj_target_conf are both 1-D tensors.
# Compute loss for the cells with no object bbox.
noobj_pred = pred_tensor[noobj_mask].view(-1, N)     # pred tensor on the cells which do not contain objects. [n_noobj, N]
                                                     # n_noobj: number of the cells which do not contain objects.
noobj_target = target_tensor[noobj_mask].view(-1, N) # target tensor on the cells which do not contain objects. [n_noobj, N]
noobj_conf_mask = torch.cuda.ByteTensor(noobj_pred.size()).fill_(0) # [n_noobj, N]
for b in range(B):
    noobj_conf_mask[:, 4 + b*5] = 1 # noobj_conf_mask[:, 4] = 1; noobj_conf_mask[:, 9] = 1
noobj_pred_conf = noobj_pred[noobj_conf_mask]     # [n_noobj x 2], 2=len([conf1, conf2])
noobj_target_conf = noobj_target[noobj_conf_mask] # [n_noobj x 2], 2=len([conf1, conf2])
loss_noobj = F.mse_loss(noobj_pred_conf, noobj_target_conf, reduction='sum') # F.mse_loss is the mean-squared-error loss: https://blog.csdn.net/hao5335156/article/details/81029791
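For intuition, F.mse_loss with reduction='sum' is simply the summed squared difference; a tiny check (my own example, not from loss.py):

import torch
import torch.nn.functional as F

a = torch.tensor([0.9, 0.2]) # e.g. predicted confidences
b = torch.tensor([1.0, 0.0]) # e.g. target confidences
# (0.9 - 1.0)^2 + (0.2 - 0.0)^2 = 0.01 + 0.04 = 0.05
assert torch.isclose(F.mse_loss(a, b, reduction='sum'), ((a - b) ** 2).sum())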
2.2 Loss for the cells that contain an object
2.2.1 Computing the IoU
First find the corner coordinates of the boxes. Both target and pred boxes are stored as [x_cell_center, y_cell_center, w, h]: x_cell_center and y_cell_center are offsets relative to the cell, while w and h are relative to the whole image's Width and Height. To put everything on the same scale, (x, y) is divided by S (the data preprocessing multiplied by S beforehand), which converts the centers to coordinates relative to the whole image, so pred and target become directly comparable. Since the IoU computation needs the top-left and bottom-right corners, we subtract 0.5*w, 0.5*h from the center to get the top-left corner and add 0.5*w, 0.5*h to get the bottom-right corner; a standalone sketch of this conversion follows.
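As an illustration, a minimal sketch of this conversion (the function name and shapes are mine, not from loss.py):

import torch

def cell_xywh_to_xyxy(boxes, S=7):
    # boxes: [N, 4] = [x_cell_center, y_cell_center, w, h], with (x, y) normalized
    # by cell size and (w, h) normalized by image size.
    xy = boxes[:, :2] / float(S) # rescale the centers to image-relative units
    wh = boxes[:, 2:4]
    # Note: the cell's own grid offset is never added back; since the pred and target
    # boxes live in the same cell, this identical shift does not change the IoU.
    return torch.cat([xy - 0.5 * wh,  # top-left corner (x1, y1)
                      xy + 0.5 * wh], # bottom-right corner (x2, y2)
                     dim=1)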
With the corners in hand, the IoU is computed:
# Compute loss for the cells with objects.
coord_response_mask = torch.cuda.ByteTensor(bbox_target.size()).fill_(0)      # [n_coord x B, 5]
# coord_not_response_mask = torch.cuda.ByteTensor(bbox_target.size()).fill_(1) # [n_coord x B, 5]
bbox_target_iou = torch.zeros(bbox_target.size()).cuda() # [n_coord x B, 5], only the last 1=(conf,) is used

# Choose the predicted bbox having the highest IoU for each target bbox.
for i in range(0, bbox_target.size(0), B):
    pred = bbox_pred[i:i+B] # predicted bboxes at i-th cell, [B, 5=len([x, y, w, h, conf])]
    pred_xyxy = Variable(torch.FloatTensor(pred.size())) # [B, 5=len([x1, y1, x2, y2, conf])]
    # Because (center_x, center_y)=pred[:, :2] and (w, h)=pred[:, 2:4] are normalized for cell-size and image-size respectively,
    # rescale (center_x, center_y) for the image-size to compute IoU correctly.
    pred_xyxy[:, :2] = pred[:, :2]/float(S) - 0.5 * pred[:, 2:4]
    pred_xyxy[:, 2:4] = pred[:, :2]/float(S) + 0.5 * pred[:, 2:4]
    # Because the target boxes contained by each cell are identical in the current implementation, extracting the first one is enough.
    target = bbox_target[i].view(-1, 5) # target bbox at i-th cell, [1, 5=len([x, y, w, h, conf])]
    target_xyxy = Variable(torch.FloatTensor(target.size())) # [1, 5=len([x1, y1, x2, y2, conf])]
    # Because (center_x, center_y)=target[:, :2] and (w, h)=target[:, 2:4] are normalized for cell-size and image-size respectively,
    # rescale (center_x, center_y) for the image-size to compute IoU correctly.
    target_xyxy[:, :2] = target[:, :2]/float(S) - 0.5 * target[:, 2:4]
    target_xyxy[:, 2:4] = target[:, :2]/float(S) + 0.5 * target[:, 2:4]
    iou = self.compute_iou(pred_xyxy[:, :4], target_xyxy[:, :4]) # [B, 1]
    max_iou, max_index = iou.max(0)
    max_index = max_index.data.cuda()
    coord_response_mask[i+max_index] = 1
    # coord_not_response_mask[i+max_index] = 0
    # "we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth"
    # -- from the original YOLO paper.
    bbox_target_iou[i+max_index, torch.LongTensor([4]).cuda()] = (max_iou).data.cuda()
bbox_target_iou = Variable(bbox_target_iou).cuda()
The line coord_response_mask = torch.cuda.ByteTensor(bbox_target.size()).fill_(0) could equally be written as coord_response_mask = torch.zeros(bbox_target.size(), dtype=torch.uint8).cuda(); the dtype matters, since plain torch.zeros returns a FloatTensor, which cannot be used for boolean mask indexing.
As for why Variable is used: Variable wraps a Tensor so that autograd can record operations for gradient computation. (Since PyTorch 0.4, Variable has been merged into Tensor and is no longer needed; Tensors themselves run on GPU and support autograd directly. See the sketch below.)
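For readers on current PyTorch, a sketch of the modern equivalents of these two deprecated constructs (my own translation, assuming PyTorch >= 0.4):

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# torch.cuda.ByteTensor(size).fill_(0)  ->  an explicit boolean mask:
coord_response_mask = torch.zeros(98, 5, dtype=torch.bool, device=device) # 98 stands in for n_coord x B

# Variable(tensor)  ->  just a Tensor; opt into gradient tracking when needed:
pred_xyxy = torch.zeros(2, 5, device=device, requires_grad=True)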
With the steps above we now have the (x, y) coordinates and (w, h) on a common scale, so they can enter the loss computation.
Next comes the corresponding confidence C. The target confidence is set equal to the IoU; this puzzled me at first, but checking the YOLOv1 paper, it states: "Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth." So the confidence target really is the IoU between the predicted box and the ground truth, and the confidence loss compares the predicted confidence against this IoU. (One question I still have: the ground truth is already known here, so why can't its confidence simply be 1, instead of bbox_target_iou[i+max_index, torch.LongTensor([4]).cuda()] = (max_iou).data.cuda()? A common reading is that the IoU target makes the predicted confidence reflect localization quality, not merely object presence.)
# BBox location/size and objectness loss for the response bboxes.
bbox_pred_response = bbox_pred[coord_response_mask].view(-1, 5) # [n_response, 5]
bbox_target_response = bbox_target[coord_response_mask].view(-1, 5) # [n_response, 5], only the first 4=(x, y, w, h) are used
target_iou = bbox_target_iou[coord_response_mask].view(-1, 5) # [n_response, 5], only the last 1=(conf,) is used
loss_xy = F.mse_loss(bbox_pred_response[:, :2], bbox_target_response[:, :2], reduction='sum')
loss_wh = F.mse_loss(torch.sqrt(bbox_pred_response[:, 2:4]), torch.sqrt(bbox_target_response[:, 2:4]), reduction='sum')
loss_obj = F.mse_loss(bbox_pred_response[:, 4], target_iou[:, 4], reduction='sum')
# Class probability loss for the cells which contain objects.
loss_class = F.mse_loss(class_pred, class_target, reduction='sum')
# Total loss
loss = self.lambda_coord * (loss_xy + loss_wh) + loss_obj + self.lambda_noobj * loss_noobj + loss_class
loss = loss / float(batch_size)
def forward(self, pred_tensor, target_tensor):
    """ Compute loss for YOLO training.
    Args:
        pred_tensor: (Tensor) predictions, sized [n_batch, S, S, Bx5+C], 5=len([x, y, w, h, conf]).
        target_tensor: (Tensor) targets, sized [n_batch, S, S, Bx5+C].
    Returns:
        (Tensor): loss, sized [1, ].
    """
    # TODO: Remove redundant dimensions for some Tensors.
    S, B, C = self.S, self.B, self.C
    N = 5 * B + C    # 5=len([x, y, w, h, conf])

    batch_size = pred_tensor.size(0)
    coord_mask = target_tensor[:, :, :, 4] > 0  # mask for the cells which contain objects. [n_batch, S, S]
    noobj_mask = target_tensor[:, :, :, 4] == 0 # mask for the cells which do not contain objects. [n_batch, S, S]
    # Indexing the last dimension with 4 drops only that dimension, so each mask is [n_batch, S, S].
    coord_mask = coord_mask.unsqueeze(-1).expand_as(target_tensor) # [n_batch, S, S] -> [n_batch, S, S, N]
    noobj_mask = noobj_mask.unsqueeze(-1).expand_as(target_tensor) # [n_batch, S, S] -> [n_batch, S, S, N]

    coord_pred = pred_tensor[coord_mask].view(-1, N)         # pred tensor on the cells which contain objects. [n_coord, N]
                                                             # n_coord: number of the cells which contain objects.
    bbox_pred = coord_pred[:, :5*B].contiguous().view(-1, 5) # reshaped to rows of 5; each box is [x, y, w, h, conf]. [n_coord x B, 5]
    class_pred = coord_pred[:, 5*B:]                         # [n_coord, C]

    coord_target = target_tensor[coord_mask].view(-1, N)     # target tensor on the cells which contain objects. [n_coord, N]
    bbox_target = coord_target[:, :5*B].contiguous().view(-1, 5) # [n_coord x B, 5=len([x, y, w, h, conf])]
    class_target = coord_target[:, 5*B:]                     # [n_coord, C]

    # Compute loss for the cells with no object bbox.
    noobj_pred = pred_tensor[noobj_mask].view(-1, N)     # pred tensor on the cells which do not contain objects. [n_noobj, N]
                                                         # n_noobj: number of the cells which do not contain objects.
    noobj_target = target_tensor[noobj_mask].view(-1, N) # target tensor on the cells which do not contain objects. [n_noobj, N]
    noobj_conf_mask = torch.cuda.ByteTensor(noobj_pred.size()).fill_(0) # [n_noobj, N]
    for b in range(B):
        noobj_conf_mask[:, 4 + b*5] = 1 # noobj_conf_mask[:, 4] = 1; noobj_conf_mask[:, 9] = 1
    noobj_pred_conf = noobj_pred[noobj_conf_mask]     # [n_noobj x 2], 2=len([conf1, conf2])
    noobj_target_conf = noobj_target[noobj_conf_mask] # [n_noobj x 2], 2=len([conf1, conf2])
    loss_noobj = F.mse_loss(noobj_pred_conf, noobj_target_conf, reduction='sum')

    # Compute loss for the cells with objects.
    coord_response_mask = torch.cuda.ByteTensor(bbox_target.size()).fill_(0)      # [n_coord x B, 5]
    # coord_not_response_mask = torch.cuda.ByteTensor(bbox_target.size()).fill_(1) # [n_coord x B, 5]
    bbox_target_iou = torch.zeros(bbox_target.size()).cuda() # [n_coord x B, 5], only the last 1=(conf,) is used

    # Choose the predicted bbox having the highest IoU for each target bbox.
    for i in range(0, bbox_target.size(0), B):
        pred = bbox_pred[i:i+B] # predicted bboxes at i-th cell, [B, 5=len([x, y, w, h, conf])]
        pred_xyxy = Variable(torch.FloatTensor(pred.size())) # [B, 5=len([x1, y1, x2, y2, conf])]
        # Because (center_x, center_y)=pred[:, :2] and (w, h)=pred[:, 2:4] are normalized for cell-size and image-size respectively,
        # rescale (center_x, center_y) for the image-size to compute IoU correctly.
        pred_xyxy[:, :2] = pred[:, :2]/float(S) - 0.5 * pred[:, 2:4]
        pred_xyxy[:, 2:4] = pred[:, :2]/float(S) + 0.5 * pred[:, 2:4]
        # Because the target boxes contained by each cell are identical in the current implementation, extracting the first one is enough.
        target = bbox_target[i].view(-1, 5) # target bbox at i-th cell, [1, 5=len([x, y, w, h, conf])]
        target_xyxy = Variable(torch.FloatTensor(target.size())) # [1, 5=len([x1, y1, x2, y2, conf])]
        # Because (center_x, center_y)=target[:, :2] and (w, h)=target[:, 2:4] are normalized for cell-size and image-size respectively,
        # rescale (center_x, center_y) for the image-size to compute IoU correctly.
        target_xyxy[:, :2] = target[:, :2]/float(S) - 0.5 * target[:, 2:4]
        target_xyxy[:, 2:4] = target[:, :2]/float(S) + 0.5 * target[:, 2:4]
        iou = self.compute_iou(pred_xyxy[:, :4], target_xyxy[:, :4]) # [B, 1]
        max_iou, max_index = iou.max(0)
        max_index = max_index.data.cuda()
        coord_response_mask[i+max_index] = 1
        # coord_not_response_mask[i+max_index] = 0
        # "we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth"
        # -- from the original YOLO paper.
        bbox_target_iou[i+max_index, torch.LongTensor([4]).cuda()] = (max_iou).data.cuda()
    bbox_target_iou = Variable(bbox_target_iou).cuda()

    # BBox location/size and objectness loss for the response bboxes.
    bbox_pred_response = bbox_pred[coord_response_mask].view(-1, 5)     # [n_response, 5]
    bbox_target_response = bbox_target[coord_response_mask].view(-1, 5) # [n_response, 5], only the first 4=(x, y, w, h) are used
    target_iou = bbox_target_iou[coord_response_mask].view(-1, 5)       # [n_response, 5], only the last 1=(conf,) is used
    loss_xy = F.mse_loss(bbox_pred_response[:, :2], bbox_target_response[:, :2], reduction='sum')
    loss_wh = F.mse_loss(torch.sqrt(bbox_pred_response[:, 2:4]), torch.sqrt(bbox_target_response[:, 2:4]), reduction='sum')
    loss_obj = F.mse_loss(bbox_pred_response[:, 4], target_iou[:, 4], reduction='sum')

    # Class probability loss for the cells which contain objects.
    loss_class = F.mse_loss(class_pred, class_target, reduction='sum')

    # Total loss
    loss = self.lambda_coord * (loss_xy + loss_wh) + loss_obj + self.lambda_noobj * loss_noobj + loss_class
    loss = loss / float(batch_size)
    return loss
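forward relies on self.compute_iou, which this post does not quote. For completeness, here is a minimal reconstruction of a pairwise IoU between two sets of corner-format boxes (my own sketch; the repo's version may differ in detail):

import torch

def compute_iou(bbox1, bbox2):
    """ IoU between two sets of corner-format boxes.
    Args:
        bbox1: (Tensor) boxes, sized [N, 4]=len([x1, y1, x2, y2]).
        bbox2: (Tensor) boxes, sized [M, 4].
    Returns:
        (Tensor) IoU matrix, sized [N, M].
    """
    # Corners of the pairwise intersections, broadcast to [N, M, 2].
    lt = torch.max(bbox1[:, None, :2], bbox2[None, :, :2]) # top-left
    rb = torch.min(bbox1[:, None, 2:], bbox2[None, :, 2:]) # bottom-right
    wh = (rb - lt).clamp(min=0)                            # zero width/height for non-overlapping pairs
    inter = wh[..., 0] * wh[..., 1]                        # [N, M]
    area1 = (bbox1[:, 2] - bbox1[:, 0]) * (bbox1[:, 3] - bbox1[:, 1]) # [N]
    area2 = (bbox2[:, 2] - bbox2[:, 0]) * (bbox2[:, 3] - bbox2[:, 1]) # [M]
    union = area1[:, None] + area2[None, :] - inter
    return inter / union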