Understanding the Yolo-v3 Test-Time Code

哞哞哞咩咩咩

Category: cv. Tags: deep learning, computer vision

Article link: https://blog.csdn.net/weixin_42341040/article/details/106075781

util.py

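The predict_transform function:

It rescales the output of each yolo layer. The forward function of the Darknet class calls it.
It implements the following four formulas:
bx = sigmoid(tx) + cx
by = sigmoid(ty) + cy
bw = pw * exp(tw)
bh = ph * exp(th)
Here (cx, cy) is the top-left corner of the grid cell and (pw, ph) is the anchor size.
(1) Read the layer information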

batch_size = prediction.size(0)
# stride is the total downsampling factor of the network: the input image size divided by the size of the
# feature map fed to the yolo layer. The input image is square, so dividing by the height is enough.
stride = inp_dim // prediction.size(2)
# Number of cells along each side of the feature map, e.g. 416 // 32 = 13
grid_size = inp_dim // stride
# Number of attributes per box: 5 + number of classes (85 for COCO)
bbox_attrs = 5 + num_classes
# Number of anchors: 3
num_anchors = len(anchors)
# (b, 85*3, 13*13)
prediction = prediction.view(batch_size, bbox_attrs * num_anchors, grid_size * grid_size)
prediction = prediction.transpose(1, 2).contiguous()
# (b, 13*13*3, 85)
prediction = prediction.view(batch_size, grid_size * grid_size * num_anchors, bbox_attrs)
# The anchors are given in input-image pixels; dividing by stride expresses them in feature-map units
anchors = [(a[0] / stride, a[1] / stride) for a in anchors]
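The shapes quoted in these comments follow from simple arithmetic. As a rough sketch, assuming the usual YOLO-v3 setup of a 416x416 input, three detection scales with strides 32, 16 and 8, and 3 anchors per scale (the function name `yolo_layer_shapes` is made up for illustration), the per-scale grid sizes and the total of 10647 boxes can be checked like this:

```python
def yolo_layer_shapes(inp_dim=416, strides=(32, 16, 8), num_anchors=3, num_classes=80):
    """Return the attribute count, (grid_size, boxes) per stride, and the total box count."""
    bbox_attrs = 5 + num_classes          # x, y, w, h, objectness + class scores
    per_scale = []
    for stride in strides:
        grid_size = inp_dim // stride     # cells along one side of the feature map
        boxes = grid_size * grid_size * num_anchors
        per_scale.append((grid_size, boxes))
    total = sum(boxes for _, boxes in per_scale)
    return bbox_attrs, per_scale, total


bbox_attrs, per_scale, total = yolo_layer_shapes()
print(bbox_attrs)   # 85
print(per_scale)    # [(13, 507), (26, 2028), (52, 8112)]
print(total)        # 10647
```

The total of 10647 is the second dimension of the prediction tensor that write_results receives.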

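(2) Apply the scale transform (the 4 box-decoding formulas)
Along the third dimension of prediction, the first 4 entries are the box centre x, centre y, width and height (the raw tx, ty, tw, th before the transform); entry 4 is the objectness score.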

# Sigmoid the centre_X, centre_Y and object confidence
prediction[:, :, 0] = torch.sigmoid(prediction[:, :, 0])
prediction[:, :, 1] = torch.sigmoid(prediction[:, :, 1])
prediction[:, :, 4] = torch.sigmoid(prediction[:, :, 4])

# Every grid cell has size 1, so the grid coordinates range over [0, 12] (for a 13*13 feature map)
grid = np.arange(grid_size)
a, b = np.meshgrid(grid, grid)
# x_offset is cx and y_offset is cy: the top-left coordinates of each cell
# (13*13, 1)
x_offset = torch.FloatTensor(a).view(-1, 1)
y_offset = torch.FloatTensor(b).view(-1, 1)

if CUDA:
    x_offset = x_offset.cuda()
    y_offset = y_offset.cuda()

# (169,2)->(169,6)->(507,2)->(1,507,2)
x_y_offset = torch.cat((x_offset, y_offset), 1).repeat(1, num_anchors).view(-1, 2).unsqueeze(0)
# Adjust the centre coordinates
# bx=sigmoid(tx)+cx, by=sigmoid(ty)+cy
# Predicted centre = top-left corner of the cell + offset inside the cell (between 0 and 1)
prediction[:, :, :2] += x_y_offset

# log space transform height and the width
anchors = torch.FloatTensor(anchors)

if CUDA:
   anchors = anchors.cuda()

# (1, 13*13*3,2)
anchors = anchors.repeat(grid_size * grid_size, 1).unsqueeze(0)
# Adjust the width and height
# This gives the box width and height in feature-map units
prediction[:, :, 2:4] = torch.exp(prediction[:, :, 2:4]) * anchors
# Per-class scores for every anchor box: apply sigmoid() to each raw class score predicted by the network
prediction[:, :, 5: 5 + num_classes] = torch.sigmoid((prediction[:, :, 5: 5 + num_classes]))
# Map the box coordinates and sizes from feature-map units back to the network input image (416x416) by multiplying by the stride
prediction[:, :, :4] *= stride
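To make the four formulas concrete, here is a minimal pure-Python sketch that decodes a single raw prediction. It is not the tensor code itself: the helper name `decode_box` and the example values (cell (6, 6), a 116x90 anchor, stride 32) are assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, anchor, stride):
    """Decode one raw prediction (tx, ty, tw, th) into input-image pixels.

    cell   -- (cx, cy): top-left corner of the grid cell, in feature-map units
    anchor -- (pw, ph): anchor size in input-image pixels
    """
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = anchor[0] / stride, anchor[1] / stride   # anchor in feature-map units
    bx = sigmoid(tx) + cx                             # centre x on the feature map
    by = sigmoid(ty) + cy                             # centre y on the feature map
    bw = pw * math.exp(tw)                            # width on the feature map
    bh = ph * math.exp(th)                            # height on the feature map
    # Multiplying by the stride maps everything back to input-image pixels
    return tuple(v * stride for v in (bx, by, bw, bh))


# All-zero raw outputs in cell (6, 6) of a 13x13 map with a 116x90 anchor:
box = decode_box((0.0, 0.0, 0.0, 0.0), cell=(6, 6), anchor=(116, 90), stride=32)
print(box)  # (208.0, 208.0, 116.0, 90.0)
```

With all raw outputs at zero, the sigmoid places the centre in the middle of the cell and the exponential leaves the anchor size unchanged, so the box is simply the anchor centred on the cell.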

The write_results function:

Inputs:
prediction: (b, 10647, 85); the first 5 columns of each row are (cx, cy, w, h, score)
confidence: objectness score threshold
num_classes: 80
nms_conf: IoU threshold for NMS, 0.4

(1) Rows whose objectness score is below the threshold confidence are set entirely to 0; the rest are left unchanged

conf_mask = (prediction[:, :, 4] > confidence).float().unsqueeze(2)  # (b, 10647, 1)
prediction = prediction * conf_mask     # (b, 10647, 85)

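(2) The first five values of each row in prediction are (Cx, Cy, w, h, score). We convert each box from (centre x, centre y, width, height) to (top-left x, top-left y, bottom-right x, bottom-right y), because the IoU of two boxes is easier to compute from their two diagonal corners.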

box_corner = prediction.new(prediction.shape)
  
box_corner[:, :, 0] = (prediction[:, :, 0] - prediction[:, :, 2] / 2)   # x1 = Cx - w/2
box_corner[:, :, 1] = (prediction[:, :, 1] - prediction[:, :, 3] / 2)   # y1 = Cy - h/2
box_corner[:, :, 2] = (prediction[:, :, 0] + prediction[:, :, 2] / 2)   # x2 = Cx + w/2
box_corner[:, :, 3] = (prediction[:, :, 1] + prediction[:, :, 3] / 2)   # y2 = Cy + h/2
prediction[:, :, :4] = box_corner[:, :, :4]                             # copy the converted coordinates back
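The same conversion for a single box, written as a small pure-Python sketch (the helper name `center_to_corners` is hypothetical):

```python
def center_to_corners(cx, cy, w, h):
    """Convert (centre x, centre y, width, height) to (x1, y1, x2, y2)."""
    half_w, half_h = w / 2, h / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


corners = center_to_corners(208.0, 208.0, 116.0, 90.0)
print(corners)  # (150.0, 163.0, 266.0, 253.0)
```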

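(3.1) Loop over the images in the batch and run NMS on each one
Preparation: for each box, find its label (which of the 80 classes it belongs to) and keep only the boxes that passed the objectness threshold
Assume here that 15 boxes exceed the threshold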

image_pred = prediction[ind]  # predictions for image ind of the batch, 10647x85
# We only care about the class with the highest score; image_pred[:, 5:] holds the per-class scores
max_conf, max_conf_score = torch.max(image_pred[:, 5:5 + num_classes], 1)
# max_conf (the maximum score) becomes a 2-D tensor of size 10647x1
max_conf = max_conf.float().unsqueeze(1)
# max_conf_score (despite its name, the index of the maximum) becomes a 2-D tensor of size 10647x1
max_conf_score = max_conf_score.float().unsqueeze(1)
# Drop the 80 class scores of each row: keep the 4 bbox coordinates and the objectness score, and append the best class score and its index.
seq = (image_pred[:, :5], max_conf, max_conf_score)
# Concatenate (x1,y1,x2,y2,s) of each box with the best class score s_cls (max_conf) and its class index index_cls (max_conf_score) along the column dimension,
# i.e. join the 10647x5, 10647x1 and 10647x1 tensors into one 10647x7 tensor (x1,y1,x2,y2,s,s_cls,index_cls).
image_pred = torch.cat(seq, 1)  # shape=(10647, 5+1+1=7)
# image_pred[:, 4] is a 1-D tensor of length 10647 holding the objectness score. If 15 boxes have a non-zero score, nonzero returns a 15x1 tensor
non_zero_ind = (torch.nonzero(image_pred[:, 4]))    # torch.nonzero returns indices, so non_zero_ind is a 2-D tensor: 15x1
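What happens to one row can be sketched in plain Python. This is an illustration under simplifying assumptions: the helper `summarize_row` is hypothetical, it uses 3 classes instead of 80, and it drops a filtered row directly, whereas the tensor code zeroes the row first and removes it later through torch.nonzero.

```python
def summarize_row(row, confidence, num_classes):
    """Reduce one (5 + num_classes)-long row to 7 values, or None if filtered out."""
    objectness = row[4]
    if objectness <= confidence:
        return None                                   # fails the objectness threshold
    class_scores = row[5:5 + num_classes]
    best_index = max(range(num_classes), key=lambda k: class_scores[k])
    # (x1, y1, x2, y2, s, s_cls, index_cls)
    return list(row[:5]) + [class_scores[best_index], float(best_index)]


row = [150.0, 163.0, 266.0, 253.0, 0.9, 0.1, 0.7, 0.2]   # 3 classes for brevity
kept = summarize_row(row, confidence=0.5, num_classes=3)
dropped = summarize_row(row, confidence=0.95, num_classes=3)
print(kept)     # [150.0, 163.0, 266.0, 253.0, 0.9, 0.7, 1.0]
print(dropped)  # None
```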

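(3.2) Read out the 15 boxes that may contain an object; the try-except handles the case where not a single box is left

try:
    # The try-except block handles the case of no detections
    image_pred_ = image_pred[non_zero_ind.squeeze(), :].view(-1, 7)  # 15x7
except:
    continue
# When no object is detected, continue skips the rest of the loop body for this image and moves on to the next one.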

(3.3) Collect all the classes that occur among the 15 boxes

img_classes = unique(image_pred_[:, -1])

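(3.4) Run NMS separately for each class

1. Rows predicted as class cls stay unchanged; all other rows are set to 0
2. Pick the rows with a non-zero class score, i.e. the boxes of class cls; assume there are 4
3. Run NMS on these 4 boxes, sorted in descending order of objectness score
4. Repeat for every class to complete NMS over the 15 boxes; assume 3 valid boxes remain
5. Prepend one column to every box holding ind (the index of this image within the batch)
6. The returned output has shape (3, 8): (ind,x1,y1,x2,y2,s,s_cls,index_cls)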

for cls in img_classes:
    # The 7 columns are (x1,y1,x2,y2,s,s_cls,index_cls)
    # image_pred_: (15,7)
    # cls_mask: (15,7)
    cls_mask = image_pred_ * (image_pred_[:, -1] == cls).float().unsqueeze(1)
    # Suppose class_mask_ind: (4,)
    class_mask_ind = torch.nonzero(cls_mask[:, -2]).squeeze()  # cls_mask[:, -2] is the second-to-last column, the class score.

    # Take all detections of class cls; they are the input to the NMS step below.
    # image_pred_class: (4,7)
    image_pred_class = image_pred_[class_mask_ind].view(-1, 7)
    # At this point image_pred_class holds the data that NMS will run on
    # Start NMS
    # [0] is the sorted values, [1] is the indices of the sort
    conf_sort_index = torch.sort(image_pred_class[:, 4], descending=True)[1]
    # Reorder the box coordinates and scores by the sorted indices; still a 4x7 tensor
    image_pred_class = image_pred_class[conf_sort_index]
    idx = image_pred_class.size(0)  # Number of detections
    # Run non-maximum suppression
    for i in range(idx):
        # Get the IOUs of all boxes that come after the one we are looking at
        # in the loop

        try:
            # Why unsqueeze(0)? image_pred_class is a 4x7 tensor, so image_pred_class[i] is a 1-D tensor of length 7;
            # adding a dimension at position 0 turns it into a 1x7 tensor
            ious = bbox_iou(image_pred_class[i].unsqueeze(0), image_pred_class[i + 1:])
        except ValueError:
            # Boxes (rows of image_pred_class) are removed inside this loop, so image_pred_class shrinks
            # and a later index i can run past its last row, which raises ValueError.
            # When that happens, break out of the loop: there are no more boxes left to suppress.
            break

        except IndexError:
            break

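        # Zero out all the detections that have IoU > threshold
        # Work out which entries to keep (boxes with ious < nms_conf). The comparison yields a torch.uint8 tensor (torch.bool in newer PyTorch versions); float() converts it to float.
        # iou_mask has one entry per remaining box; unsqueeze(1) makes it n x 1 so that it broadcasts across the 7 columns of image_pred_class[i + 1:]
        iou_mask = (ious < nms_conf).float().unsqueeze(1)
        # Multiply iou_mask with the rows after index i: rows whose IoU exceeds the threshold become all zeros, leaving the boxes to keep
        image_pred_class[i + 1:] *= iou_mask

        # Remove the suppressed entries
        # torch.nonzero returns indices as a 2-D tensor: the indices of the boxes whose objectness score is still non-zero after masking.
        # After squeeze, non_zero_ind is a 1-D tensor of those indices
        non_zero_ind = torch.nonzero(image_pred_class[:, 4]).squeeze()
        # Keep the rows (x1, y1, x2, y2, s, s_class, index_cls) with a non-zero objectness score, an n x 7 tensor
        image_pred_class = image_pred_class[non_zero_ind].view(-1, 7)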

    
    batch_ind = image_pred_class.new(image_pred_class.size(0), 1).fill_(
        ind)  # Repeat the batch_id for as many detections of the class cls in the image
    seq = batch_ind, image_pred_class
    if not write:
        output = torch.cat(seq, 1)
        write = True
    else:
        out = torch.cat(seq, 1)
        output = torch.cat((output, out))
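The per-class suppression loop is easier to follow without the tensor bookkeeping. Below is a minimal pure-Python sketch of greedy NMS for one class. The helper names `iou` and `nms_one_class` are hypothetical, and the IoU here uses plain continuous areas, so it may differ in small details from the repository's bbox_iou.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2); assumes boxes with positive area."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms_one_class(dets, nms_conf):
    """Greedy NMS. dets: list of (x1, y1, x2, y2, score) for a single class."""
    remaining = sorted(dets, key=lambda d: d[4], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)          # highest-scoring box still alive
        kept.append(best)
        # Keep only the boxes that overlap the chosen box by less than the threshold
        remaining = [d for d in remaining if iou(best[:4], d[:4]) < nms_conf]
    return kept


dets = [
    (0.0, 0.0, 10.0, 10.0, 0.9),
    (1.0, 1.0, 11.0, 11.0, 0.8),    # overlaps the first box heavily (IoU = 81/119)
    (20.0, 20.0, 30.0, 30.0, 0.7),  # far away from both
]
kept = nms_one_class(dets, nms_conf=0.4)
print([d[4] for d in kept])  # [0.9, 0.7]
```

The list-based version removes suppressed boxes directly, which is why it needs no try/except: the tensor version shrinks image_pred_class while still iterating over the original range, so it has to catch the out-of-range index.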

detect.py

After write_results has been called, the 11 test images have produced 34 bboxes in total
Map the boxes back to the original images

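# im_dim_list has shape 34x4
# Why 34? Because 34 objects were detected across the 11 test images.
im_dim_list = torch.index_select(im_dim_list, 0, output[:, 0].long())  # torch.index_select(data, dim, indices) picks rows by index
# scaling_factor = torch.min(416/im_dim_list,1)[0].view(-1,1)  # this is the original code; the line below is my modified version
scaling_factor = torch.min(int(args.reso) / im_dim_list, 1)[0].view(-1, 1)
# scaling_factor*img_w and scaling_factor*img_h give the image size after resizing with the aspect ratio preserved, e.g. a 768x576 image is scaled so that its long side becomes 416, giving 416x312.
# After the subtraction the coordinates are still absolute pixels at network-input (416x416) scale, but they are measured from the top-left corner of the 416x312 image region rather than from the (0,0) origin of the padded input.
# inp_dim = 416; im_dim_list: original image sizes
output[:, [1, 3]] -= (inp_dim - scaling_factor * im_dim_list[:, 0].view(-1, 1)) / 2  # x1 = x1 - (416 - scaling_factor*img_w)/2, x2 = x2 - (416 - scaling_factor*img_w)/2

output[:, [2, 4]] -= (inp_dim - scaling_factor * im_dim_list[:, 1].view(-1, 1)) / 2  # y1 = y1 - (416 - scaling_factor*img_h)/2, y2 = y2 - (416 - scaling_factor*img_h)/2

# Map the box coordinates (x1,y1,x2,y2) to the original image size by dividing by the scaling factor. output[:, 1:5] has shape 34x4 and
# scaling_factor has shape 34x1; broadcasting expands scaling_factor to a 34x4 tensor for the division
output[:, 1:5] /= scaling_factor

# If a mapped coordinate falls outside the original image, clamp x1,x2 to [0, img_w], where img_w is the original image width: values below 0.0 become 0.0 and values above the width become the width.
# Likewise y1,y2 are clamped to [0, img_h], where img_h is the original image height. clamp() restricts its first argument to the interval given by the next two arguments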

for i in range(output.shape[0]):
    output[i, [1, 3]] = torch.clamp(output[i, [1, 3]], 0.0, im_dim_list[i, 0])
    output[i, [2, 4]] = torch.clamp(output[i, [2, 4]], 0.0, im_dim_list[i, 1])
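The whole mapping (remove the padding, undo the scaling, clamp) can be sketched for a single box in pure Python. The helper name `map_to_original` is hypothetical, and the sketch assumes the image was resized with its aspect ratio preserved and centred in the 416x416 input, as the comments describe.

```python
def map_to_original(box, img_w, img_h, inp_dim=416):
    """Map (x1, y1, x2, y2) from the padded network input back to the original image."""
    scale = min(inp_dim / img_w, inp_dim / img_h)
    pad_x = (inp_dim - scale * img_w) / 2      # padding left and right of the resized image
    pad_y = (inp_dim - scale * img_h) / 2      # padding above and below the resized image
    x1, y1, x2, y2 = box
    xs = [(x - pad_x) / scale for x in (x1, x2)]
    ys = [(y - pad_y) / scale for y in (y1, y2)]
    xs = [min(max(x, 0.0), img_w) for x in xs]  # clamp to [0, img_w]
    ys = [min(max(y, 0.0), img_h) for y in ys]  # clamp to [0, img_h]
    return (xs[0], ys[0], xs[1], ys[1])


# A 768x576 image is resized to 416x312 and padded by 52 px above and below.
# The box covering exactly that region maps back to (approximately) the full image.
mapped = map_to_original((0.0, 52.0, 416.0, 364.0), img_w=768, img_h=576)
print(mapped)  # approximately (0.0, 0.0, 768.0, 576.0), up to floating-point rounding
```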