A Detailed Walkthrough of Implementing YOLOv3 in Code

Article link: https://blog.csdn.net/qq_41617884/article/details/125810817

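I. Parameter settings
To study a network, you first need to understand how its parameters are set, because both training and testing depend on these hyperparameters. Besides the hyperparameters, the Ultralytics version of YOLOv3 also puts the network's structural parameters in the config file, which makes it harder to read. I chose to put the network parameters directly in the file that defines the network, so the config file holds only hyperparameters.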

# custom
cfg.annotations_path       = "./data/annotations/annotations.txt"   # path to the annotation file
cfg.class_path             = "./data/data.names"                    # file of class names
cfg.image_path             = "./data/images/"                       # directory of images
cfg.mean_and_val           = "./data/mean_and_val.txt"              # dataset mean and variance
cfg.tensorboard_path       = "./log/"                               # TensorBoard log output
cfg.checkpoint_save_path   = "./checkpoint/"                        # saved training checkpoints
cfg.num_classes            = 1                                      # number of classes
cfg.strides                = [8,16,32]                              # ratio of input size to each branch's output size
cfg.device                 = "cuda"                                 # or "cpu"
cfg.anchors                = [[[1.25,1.625],[2.0,3.75],[4.125,2.875]],
                              [[1.875,3.8125],[3.875,2.8125],[3.6875,7.4375]],
                              [[3.625,2.8125],[4.875,6.1875],[11.65625,10.1875]]]

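# train
cfg.batch_size             = 2        # batch size for training
cfg.input_sizes            = [320,352,384,416,448,480,512,544,576,608]   # input image sizes to choose from at random
cfg.max_boxes_per_scale    = 150      # maximum number of label boxes per scale
cfg.if_pad                 = True     # whether to pad when resizing the input
cfg.random_horizontal_flip = True     # random horizontal flip
cfg.random_crop            = True     # random crop
cfg.max_epoch              = 300      # maximum number of training epochs
cfg.lr_start               = 1e-4     # initial learning rate
cfg.lr_end                 = 1e-6     # final learning rate
cfg.warmup                 = 200      # number of initial iterations that use warmup
cfg.momentum               = 0.9      # momentum
cfg.weight_decay           = 0.0005   # weight decay regularization to reduce overfitting
cfg.iou_thresh             = 0.225    # IoU threshold used when computing the loss
cfg.focal_gamma            = 2        # gamma of the focal loss used for the conf loss
cfg.focal_alpha            = 0.5      # alpha of the focal loss used for the conf loss

# test
cfg.input_size             = 416      # input size
cfg.conf_thresh            = 0.3      # confidence threshold
cfg.cls_thresh             = 0.5      # class score threshold
cfg.nms_thresh             = 0.5      # NMS threshold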

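For each of YOLOv3's three branches, three preset anchor sizes are provided. The anchors are set by hand here; they can also be obtained by running k-means on the box sizes in the labels. The values above are the COCO anchors, expressed in units of each branch's stride (grid cells). To train on your own dataset you should ideally compute anchors that fit it, although the COCO defaults can also be used.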
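As a quick sanity check on the anchor values, the sketch below multiplies each anchor by the stride of its branch, which gives the anchor sizes in pixels. The helper name `anchors_in_pixels` is made up for illustration and is not part of the original code.

```python
# Anchors and strides copied from the config above.
anchors = [[[1.25, 1.625], [2.0, 3.75], [4.125, 2.875]],
           [[1.875, 3.8125], [3.875, 2.8125], [3.6875, 7.4375]],
           [[3.625, 2.8125], [4.875, 6.1875], [11.65625, 10.1875]]]
strides = [8, 16, 32]


def anchors_in_pixels(anchors, strides):
    """Multiply every (w, h) anchor by the stride of its branch (hypothetical helper)."""
    return [[(w * s, h * s) for w, h in branch]
            for branch, s in zip(anchors, strides)]


pixel_anchors = anchors_in_pixels(anchors, strides)
for s, branch in zip(strides, pixel_anchors):
    print(f"stride {s}: {branch}")
```

The result is the familiar pixel-space anchor list that starts at (10, 13) and ends at (373, 326).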

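II. Network structure
A network structure diagram that took me a day to put together:
[Figure: YOLOv3 network structure diagram]
As the diagram shows, the final output of YOLOv3 has three branches, which predict objects at three scales: large, medium, and small. One dimension of each output has size 3*(5+num_classes); the 3 means that each scale provides three different anchors, which helps predict objects of varying size. To train on your own dataset you can also modify the YOLOv3 structure heavily: you can output four or more branches, and each branch does not have to be limited to three anchors. Changes like these go beyond tuning parameters and require a good understanding of the network structure.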
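The output sizes follow directly from the input size, the strides, and the number of classes. Here is a minimal sketch of that arithmetic; the function `head_output_shapes` is a hypothetical helper, not part of the original code.

```python
def head_output_shapes(input_size, num_classes, strides=(8, 16, 32), anchors_per_scale=3):
    """Return (channels, height, width) of each branch output, ordered by stride."""
    # Each anchor predicts 4 box values, 1 confidence, and num_classes class scores.
    channels = anchors_per_scale * (5 + num_classes)
    return [(channels, input_size // s, input_size // s) for s in strides]


print(head_output_shapes(416, 80))  # [(255, 52, 52), (255, 26, 26), (255, 13, 13)]
print(head_output_shapes(416, 1))   # [(18, 52, 52), (18, 26, 26), (18, 13, 13)]
```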

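III. Network code
I spent an afternoon rebuilding the YOLOv3 code in PyTorch, following the structure diagram I drew. The network structure in the official YOLOv3 code is rather hard to follow; my restructured version is easier to understand.
1. To improve code reuse, the convolution, pooling, and other operations of each layer are wrapped in the convolution class: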

import torch
import torch.nn as nn

class convolution(nn.Module):
    def __init__(self,in_channel,out_channel,kernel_size,stride,padding,if_bn,if_activity,if_pooling=False):
        super(convolution, self).__init__()
        self.if_bn=if_bn
        self.if_activity = if_activity
        self.if_pooling = if_pooling
        # the conv bias is redundant when batch norm follows, so it is disabled in that case
        self.conv = nn.Conv2d(in_channels=in_channel, out_channels=out_channel,
                              kernel_size=kernel_size, stride=stride,padding=padding,bias= not if_bn)
        if if_bn:
            # note: in PyTorch, momentum is the weight given to the new batch statistics (default 0.1)
            self.bn = nn.BatchNorm2d(num_features=out_channel,momentum=0.9,eps=1e-5)
        self.activity = nn.LeakyReLU(negative_slope=0.1)
        self.pooling = nn.MaxPool2d(kernel_size=2,stride=2)
    def forward(self,x):
        # optional 2x downsampling, then conv -> (bn) -> (activation)
        x = self.pooling(x) if self.if_pooling else x
        x = self.conv(x)
        if self.if_bn:
            x = self.bn(x)
        return self.activity(x) if self.if_activity else x
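The kernel, stride, and padding combinations used throughout the network are chosen so that convolutions keep the spatial size and only the pooling step halves it. This can be checked with the standard output-size formula, sketched below with a hypothetical helper `conv_out_size`.

```python
def conv_out_size(size, kernel_size, stride, padding):
    """Spatial output size of a convolution or pooling layer (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1


print(conv_out_size(416, 3, 1, 1))  # 416: 3x3 conv, stride 1, padding 1 keeps the size
print(conv_out_size(416, 1, 1, 0))  # 416: 1x1 conv, stride 1, padding 0 keeps the size
print(conv_out_size(416, 2, 2, 0))  # 208: 2x2 max pooling with stride 2 halves the size
```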

2. Residual block:

class residual_block(nn.Module):
    def __init__(self,channel):
        super(residual_block,self).__init__()
        self.conv_1x1 = convolution(channel,channel//2,1,1,0,True,True)
        self.conv_3x3 = convolution(channel//2,channel,3,1,1,True,True)
    def forward(self,x):
        res = self.conv_1x1(x)
        res = self.conv_3x3(res)
        return res+x

3. The darknet53 backbone, which makes heavy use of the convolution and residual_block classes:

class darknet53(nn.Module):
    def __init__(self):
        super(darknet53,self).__init__()
        self.first_stage = convolution(3,32,3,1,1,True,True)
        self.second_stage = convolution(32,64,3,1,1,True,True,True)
        self.third_stage = convolution(64,128,3,1,1,True,True,True)
        self.forth_stage = convolution(128,256,3,1,1,True,True,True)
        self.fifth_stage = convolution(256,512,3,1,1,True,True,True)
        self.sixth_stage = convolution(512,1024,3,1,1,True,True,True)

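        # nn.ModuleList registers the blocks as submodules, so their parameters
        # are seen by the optimizer and moved along with the model to the device
        self.first_residual = residual_block(64)
        self.second_residual = nn.ModuleList([residual_block(128) for i in range(2)])
        self.third_residual = nn.ModuleList([residual_block(256) for i in range(8)])
        self.forth_residual = nn.ModuleList([residual_block(512) for i in range(8)])
        self.fifth_residual = nn.ModuleList([residual_block(1024) for i in range(4)])   # forward uses 4 blocks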

    def forward(self,img):
        x = self.first_stage(img)
        x = self.second_stage(x)
        x = self.first_residual(x)
        x = self.third_stage(x)
        for i in range(2):
            x = self.second_residual[i](x)
        x = self.forth_stage(x)
        for i in range(8):
            x = self.third_residual[i](x)
        out1 = x
        x = self.fifth_stage(x)
        for i in range(8):
            x = self.forth_residual[i](x)
        out2 = x
        x = self.sixth_stage(x)
        for i in range(4):
            x = self.fifth_residual[i](x)
        return out1,out2,x
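To see where the three returned feature maps come from, the sketch below traces channel counts and spatial sizes through the six stages of the class above. The stage table is transcribed from the constructor; `trace_backbone` is a hypothetical helper. The residual blocks are left out because they change neither channels nor size.

```python
# (name, out_channels, pooled_first) for the six stages of the darknet53 class above
STAGES = [("first_stage", 32, False), ("second_stage", 64, True),
          ("third_stage", 128, True), ("forth_stage", 256, True),
          ("fifth_stage", 512, True), ("sixth_stage", 1024, True)]


def trace_backbone(input_size):
    """Return (stage name, channels, spatial size) after every stage."""
    size, trace = input_size, []
    for name, channels, pooled in STAGES:
        if pooled:          # each pooled stage halves the spatial size
            size //= 2
        trace.append((name, channels, size))
    return trace


trace = trace_backbone(416)
for row in trace:
    print(row)
# The last three rows correspond to out1, out2, and x.
```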

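4. yolov3: the outputs are the prediction tensors for boxes at the three scales. For a 416x416 input and 80 classes their shapes are (n,255,13,13), (n,255,26,26), and (n,255,52,52). Note that the 80 is hard-coded in the output layers below; for a custom dataset it should match cfg.num_classes: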

class yolov3(nn.Module):
    def __init__(self):
        super(yolov3,self).__init__()
        self.darknet53 = darknet53()
        self.bobj_stage = nn.Sequential(
            convolution(1024 ,512, 1, 1, 0, True, True),
            convolution(512, 1024, 3, 1, 1, True, True),
            convolution(1024, 512, 1, 1, 0, True, True),
            convolution(512, 1024, 3, 1, 1, True, True),
            convolution(1024, 512, 1, 1, 0, True, True)
        )
        self.bobj_out_stage = nn.Sequential(
            convolution(512 ,1024, 3, 1, 1, True, True),
            convolution(1024, 3*(5+80), 1, 1, 0, False, False)
        )
        self.mobj_stage = nn.Sequential(
            convolution(768, 256, 1, 1, 0, True, True),
            convolution(256, 512, 3, 1, 1, True, True),
            convolution(512, 256, 1, 1, 0, True, True),
            convolution(256, 512, 3, 1, 1, True, True),
            convolution(512, 256, 1, 1, 0, True, True)
        )
        self.mobj_stage_conv = convolution(512, 256, 1, 1, 0, True, True)
        self.mobj_out_stage = nn.Sequential(
            convolution(256, 512, 3, 1, 1, True, True),
            convolution(512, 3 * (5 + 80), 1, 1, 0, False, False)
        )
        self.sobj_stage = nn.Sequential(
            convolution(384, 128, 1, 1, 0, True, True),
            convolution(128, 256, 3, 1, 1, True, True),
            convolution(256, 128, 1, 1, 0, True, True),
            convolution(128, 256, 3, 1, 1, True, True),
            convolution(256, 128, 1, 1, 0, True, True)
        )
        self.sobj_stage_conv = convolution(256, 128, 1, 1, 0, True, True)
        self.sobj_out_stage = nn.Sequential(
            convolution(128, 256, 3, 1, 1, True, True),
            convolution(256, 3 * (5 + 80), 1, 1, 0, False, False)
        )

    def forward(self,img):
        route1,route2,x = self.darknet53(img)
        # big object
        x = self.bobj_stage(x)
        bobj_output = self.bobj_out_stage(x)
        # middle object
        x = self.mobj_stage_conv(x)
        x = nn.functional.interpolate(x,scale_factor=2)
        x = torch.cat((x,route2),dim=1)
        x = self.mobj_stage(x)
        mobj_output = self.mobj_out_stage(x)
        # small object
        x = self.sobj_stage_conv(x)
        x = nn.functional.interpolate(x, scale_factor=2)
        x = torch.cat((x, route1), dim=1)
        x = self.sobj_stage(x)
        sobj_output = self.sobj_out_stage(x)

        return bobj_output,mobj_output,sobj_output
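The input channel counts 768 and 384 of mobj_stage and sobj_stage come from concatenating an upsampled feature map with a backbone route. Below is a small shape-only sketch of that step; `upsample_and_concat` is a hypothetical helper that works on (channels, height, width) tuples rather than real tensors.

```python
def upsample_and_concat(x_shape, route_shape, scale_factor=2):
    """Mimic interpolate(scale_factor=2) followed by torch.cat(dim=1), on shapes only."""
    c, h, w = x_shape
    route_c, route_h, route_w = route_shape
    up_h, up_w = h * scale_factor, w * scale_factor
    if (up_h, up_w) != (route_h, route_w):
        raise ValueError("spatial sizes must match before concatenation")
    return (c + route_c, route_h, route_w)


mobj_in = upsample_and_concat((256, 13, 13), (512, 26, 26))
sobj_in = upsample_and_concat((128, 26, 26), (256, 52, 52))
print(mobj_in)  # (768, 26, 26)
print(sobj_in)  # (384, 52, 52)
```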

IV. The dataloader

This part borrows from the TensorFlow version of YOLOv3. Each call returns an image batch and its labels.

1. Hyperparameters needed by the dataloader.

def __init__(self):
    self.image_path = cfg.image_path                 # directory of images
    self.annotations_path = cfg.annotations_path     # path to the annotation file
    self.class_path = cfg.class_path                 # path to the class-name file
    self.class_names = self.get_class_names()        # class names
    self.num_classes = len(self.class_names)         # number of classes
    self.bacth_size = cfg.batch_size                 # batch size
    self.anchors = np.array(cfg.anchors)             # three anchors for each of three scales, nine in total
    self.annotations = self.get_annotations()        # image names and the bboxes that belong to each image
    self.num_annotations = len(self.annotations)     # number of samples
    self.num_batches = int(np.ceil(len(self.annotations)/self.bacth_size))     # number of batches per epoch
    self.input_sizes = cfg.input_sizes               # list from which the input image size is drawn at random
    self.output_size = [52,26,13]                # yolo output sizes, recomputed from the input size for every batch
    self.strides = cfg.strides
    self.max_boxes_per_scale = cfg.max_boxes_per_scale

    self.iter = 0                                    # current iteration

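2. Data augmentation and preprocessing applied before the data enters the network. Augmentation consists of random cropping and random flipping. Preprocessing resizes the image to a suitable size, which is also chosen at random but must be divisible by 32, and then normalizes it.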

# Normalize the image to [0,1]
def normalization(self,image):
    image = image/255.
    return image

# Resize the image to the network input size, with or without padding
def resize_image(self,image,bboxes,input_size):
    h, w, _ = image.shape    # (h,w,c)
    if not cfg.if_pad:    # plain resize, may distort the image
        new_image = cv2.resize(image,(input_size,input_size))
        bboxes[:,[0,2]] = bboxes[:,[0,2]]*input_size/w
        bboxes[:,[1,3]] = bboxes[:,[1,3]]*input_size/h
    else:                 # pad so that the image is not distorted
        scale = input_size/max(w,h)     # ratio of the input size to the longer side of the image
        w,h = int(scale*w),int(scale*h)   # resized dimensions, same aspect ratio as the original

        image = cv2.resize(image,(w,h))
        fill_value = 0        # pixel value used for the padded border
        new_image = np.ones((input_size,input_size,3)) *  fill_value     # canvas of the network input size
        dw,dh = (input_size-w)//2,(input_size-h)//2
        new_image[dh:dh+h,dw:dw+w,:] = image

        # scale the boxes and shift them by the padding offsets
        bboxes[:, 0] = bboxes[:, 0] * scale + dw
        bboxes[:, 2] = bboxes[:, 2] * scale + dw
        bboxes[:, 1] = bboxes[:, 1] * scale + dh
        bboxes[:, 3] = bboxes[:, 3] * scale + dh

    return new_image,bboxes

# Random horizontal flip
def random_horizontal_flip(self,image,bboxes):
    flip_image = np.copy(image)
    flip_bboxes = np.copy(bboxes)
    if random.random() < 0.5:
        _, w, _ = image.shape
        flip_image = image[:, ::-1, :]
        flip_bboxes[:,0] = w-bboxes[:,2]
        flip_bboxes[:,2] = w-bboxes[:,0]
    return flip_image,flip_bboxes

# Random crop that always keeps every bbox inside the cropped region
def random_crop(self,image,bboxes):
    if random.random() < 0.5:
        h, w, _ = image.shape
        # smallest rectangle that contains all bboxes
        max_bbox = np.concatenate([np.min(bboxes[:, 0:2], axis=0), np.max(bboxes[:, 2:4], axis=0)], axis=-1)

        # distance from that rectangle to each image border
        max_l_trans = max_bbox[0]
        max_u_trans = max_bbox[1]
        max_r_trans = w - max_bbox[2]
        max_d_trans = h - max_bbox[3]

        crop_xmin = max(0, int(max_bbox[0] - random.uniform(0, max_l_trans)))
        crop_ymin = max(0, int(max_bbox[1] - random.uniform(0, max_u_trans)))
        # min (not max) so that the right and bottom edges can actually be cropped
        crop_xmax = min(w, int(max_bbox[2] + random.uniform(0, max_r_trans)))
        crop_ymax = min(h, int(max_bbox[3] + random.uniform(0, max_d_trans)))

        image = image[crop_ymin: crop_ymax, crop_xmin: crop_xmax]

        bboxes[:, [0, 2]] = bboxes[:, [0, 2]] - crop_xmin
        bboxes[:, [1, 3]] = bboxes[:, [1, 3]] - crop_ymin

    return image, bboxes
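A worked example may help with the box arithmetic in the padded resize and the horizontal flip. The sketch below repeats the same formulas on a single (xmin, ymin, xmax, ymax) box, without touching any image data; `letterbox_box` and `flip_box` are hypothetical helper names.

```python
def letterbox_box(box, image_w, image_h, input_size):
    """Map a box through the padded resize: scale by the longer side, then shift by the padding."""
    scale = input_size / max(image_w, image_h)
    new_w, new_h = int(scale * image_w), int(scale * image_h)
    dw, dh = (input_size - new_w) // 2, (input_size - new_h) // 2
    xmin, ymin, xmax, ymax = box
    return (xmin * scale + dw, ymin * scale + dh, xmax * scale + dw, ymax * scale + dh)


def flip_box(box, image_w):
    """Map a box through a horizontal flip; xmin and xmax swap roles."""
    xmin, ymin, xmax, ymax = box
    return (image_w - xmax, ymin, image_w - xmin, ymax)


# An 832x624 image resized into a 416x416 input: scale 0.5, 52 pixels of padding above and below.
print(letterbox_box((100, 200, 300, 400), 832, 624, 416))  # (50.0, 152.0, 150.0, 252.0)
print(flip_box((100, 200, 300, 400), 832))                  # (532, 200, 732, 400)
```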

3. The operations performed each time a batch is fetched.

def __next__(self):
    input_size = random.choice(self.input_sizes)   # pick a random input image size for this batch
    self.output_size = [input_size//stride for stride in self.strides]    # yolo output sizes
    batch_images = np.zeros((self.bacth_size,input_size,input_size,3)).astype(np.float32)
    batch_mask_bboxes = [np.zeros((self.bacth_size,self.output_size[i],self.output_size[i],
                                  len(self.anchors[0]), 5 + self.num_classes)).astype(np.float32) for i in range(3)]
    batch_list_bboxes = [np.zeros((self.bacth_size,self.max_boxes_per_scale,4)).astype(np.float32) for _ in range(3)]
    annotation_count = 0    # number of annotations already processed in this batch
    if self.iter<self.num_batches:    # still inside the current epoch
        while annotation_count < self.bacth_size:    # the batch is not full yet
            index = self.iter*self.bacth_size + annotation_count   # index of the annotation
            index = index if(index < self.num_annotations) else (index-self.num_annotations)  # wrap around to the first sample when the index runs past the end

            image_and_labels = self.annotations[index]    # image name and labels
            image = self.get_image_array(image_and_labels[0])    # image -> np.array
            bboxes = self.get_bbox_array(image_and_labels[1:])   # str -> np.array

            image, bboxes = self.data_augmentation(image,bboxes)   # data augmentation
            image, bboxes = self.resize_image(image,bboxes,input_size)        # resize to the randomly chosen input size
            image = self.normalization(image)               # normalize to speed up convergence
            batch_images[annotation_count] = image          # put the preprocessed image into the batch
            label = self.extract_label(bboxes)              # encode the bboxes into labels
            batch_mask_bboxes[0][annotation_count] = label[0][0]
            batch_mask_bboxes[1][annotation_count] = label[0][1]
            batch_mask_bboxes[2][annotation_count] = label[0][2]
            batch_list_bboxes[0][annotation_count] = label[1][0]
            batch_list_bboxes[1][annotation_count] = label[1][1]
            batch_list_bboxes[2][annotation_count] = label[1][2]
            #self.show_image_and_bboxes(np.copy(image), np.copy(bboxes))    # visualize to check the augmentation
            annotation_count += 1      # one more sample processed in this batch
        self.iter += 1
        batch_images = batch_images.transpose([0,3,1,2])    # transpose to (n,c,h,w)
        batch_images = torch.from_numpy(batch_images)       # convert to tensor
        batch_mask_small_bboxes = torch.from_numpy(batch_mask_bboxes[0])
        batch_mask_middle_bboxes = torch.from_numpy(batch_mask_bboxes[1])
        batch_mask_big_bboxes = torch.from_numpy(batch_mask_bboxes[2])
        batch_list_small_bboxes = torch.from_numpy(batch_list_bboxes[0])
        batch_list_middle_bboxes = torch.from_numpy(batch_list_bboxes[1])
        batch_list_big_bboxes = torch.from_numpy(batch_list_bboxes[2])
        return batch_images,batch_mask_small_bboxes,batch_mask_middle_bboxes,batch_mask_big_bboxes,\
               batch_list_small_bboxes,batch_list_middle_bboxes,batch_list_big_bboxes

    else:
        self.iter = 0     # reset the iteration counter
        np.random.shuffle(self.annotations)   # shuffle the annotations for the next epoch
        raise StopIteration
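The index arithmetic above means the last batch of an epoch is topped up with samples from the start of the list when the dataset size is not a multiple of the batch size. Here is a minimal sketch of just that indexing, with a hypothetical helper `batch_indices`:

```python
import math


def batch_indices(num_annotations, batch_size):
    """Return the annotation indices used by every batch of one epoch."""
    num_batches = math.ceil(num_annotations / batch_size)
    batches = []
    for iteration in range(num_batches):
        batch = []
        for count in range(batch_size):
            index = iteration * batch_size + count
            if index >= num_annotations:   # wrap around to the first samples
                index -= num_annotations
            batch.append(index)
        batches.append(batch)
    return batches


print(batch_indices(5, 2))  # [[0, 1], [2, 3], [4, 0]]
```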

V. Computing the loss
1. Before computing the loss, the raw output of YOLOv3 has to be decoded. The decode step is as follows:

def decode(output,stride,anchors):
    # output: (n, size, size, num anchors, 5 + num classes)
    device = torch.device(cfg.device)
    batch_size,output_size = output.shape[0:2]
    anchors = anchors.to(device)

    output_xy = output[...,0:2]      # center x and y
    output_wh = output[...,2:4]      # w and h
    output_conf = output[...,4:5]    # confidence
    output_prob = output[...,5:]     # class probabilities

    y_offset = torch.arange(0, output_size).unsqueeze(1).repeat(1, output_size).to(torch.float32) # y offset of each grid cell
    x_offset = torch.arange(0, output_size).unsqueeze(0).repeat(output_size, 1).to(torch.float32) # x offset of each grid cell
    xy_offset = torch.stack([x_offset, y_offset], dim=-1)
    xy_offset = xy_offset.unsqueeze(0).unsqueeze(3).repeat(batch_size, 1, 1, 3, 1).to(device)

    output_xy = (torch.sigmoid(output_xy)+xy_offset)*stride    # add the cell offset to x and y, then multiply by the stride

    output_wh = (torch.exp(output_wh)*anchors)*stride     # scale the three anchors by exp(w), exp(h), then multiply by the stride
    output_conf = torch.sigmoid(output_conf)
    output_prob = torch.sigmoid(output_prob)

    pred = torch.cat((output_xy,output_wh,output_conf,output_prob),-1)
    return pred

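Here is an explanation of the box offset formulas:
pred_x = sigmoid(out_x) + offset_x
pred_y = sigmoid(out_y) + offset_y
pred_w = anchor_w * exp(out_w)
pred_h = anchor_h * exp(out_h)
Here, out_* are the raw xywh offsets produced by yolo, offset_x and offset_y are the distances from the top-left corner of the grid cell to the top-left corner of the feature map, and anchor_w and anchor_h are the width and height of the anchor for that cell. These values are in grid-cell units; the code multiplies them by the stride to get pixels.
Note that sigmoid is applied to out_x and out_y to keep the x and y offsets within (0,1). This guarantees that the predicted center stays inside its grid cell, which helps the model converge.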
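To make the formulas concrete, the sketch below decodes one prediction for a single cell and anchor with plain Python floats, including the multiplication by the stride that the code performs. `decode_cell` is a hypothetical helper, not the batched decode function itself.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_cell(out, cell_xy, anchor_wh, stride):
    """Decode raw (out_x, out_y, out_w, out_h) for one grid cell into pixel xywh."""
    out_x, out_y, out_w, out_h = out
    x = (sigmoid(out_x) + cell_xy[0]) * stride
    y = (sigmoid(out_y) + cell_xy[1]) * stride
    w = anchor_wh[0] * math.exp(out_w) * stride
    h = anchor_wh[1] * math.exp(out_h) * stride
    return x, y, w, h


# All-zero raw outputs in cell (3, 4) of the stride-8 branch with anchor (1.25, 1.625):
# the center falls in the middle of the cell and the box equals the anchor in pixels.
print(decode_cell((0.0, 0.0, 0.0, 0.0), (3, 4), (1.25, 1.625), 8))  # (28.0, 36.0, 10.0, 13.0)
```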

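2. box loss
The box loss in the YOLOv3 paper is defined as follows:
[Formula image: box loss]
Here, 1^obj indicates whether the grid cell contains an object: it is 1 if there is one and 0 otherwise.
For the offset between the predicted box and the ground-truth box, GIoU is now widely used. GIoU is defined as IoU - (area_c - union) / area_c, where IoU is the intersection over union of the two boxes, union is the area of their union, and area_c is the area of the smallest rectangle that encloses both boxes.
The code that computes this part of the loss is as follows: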

# giou loss
giou = box_iou(pred_xywh,mask_xywh,giou=True)   # GIoU between the decoded output and the label boxes, shape (n,cell size,cell size,num anchors,1)
# weight in [1,2] that grows as the label box gets smaller relative to the input
bbox_loss = 2.0 - 1.0 * mask_xywh[:, :, :, :, 2:3] * mask_xywh[:, :, :, :, 3:4] / (input_size ** 2)
giou_loss = mask_conf * bbox_loss * (1 - giou)
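The box_iou function itself is not shown in the post. As an illustration of the GIoU definition and of the box-size weight, here is a minimal scalar sketch; `giou_xywh` and `box_scale` are hypothetical names, boxes are assumed to be (center x, center y, w, h) with positive area, and the real implementation works on batched tensors.

```python
def giou_xywh(box_a, box_b):
    """GIoU of two (cx, cy, w, h) boxes with positive area."""
    def corners(box):
        cx, cy, w, h = box
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

    ax1, ay1, ax2, ay2 = corners(box_a)
    bx1, by1, bx2, by2 = corners(box_b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # area of the smallest rectangle that encloses both boxes
    area_c = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (area_c - union) / area_c


def box_scale(w, h, input_size):
    """Weight used in the loss above: close to 2 for tiny boxes, 1 for a full-image box."""
    return 2.0 - w * h / input_size ** 2


print(giou_xywh((1, 1, 2, 2), (1, 1, 2, 2)))  # 1.0 for identical boxes
print(giou_xywh((1, 1, 2, 2), (2, 2, 2, 2)))  # 1/7 - 2/9, slightly negative
print(box_scale(416, 416, 416))               # 1.0
```

Unlike plain IoU, GIoU stays informative for boxes that do not overlap: it becomes more negative the farther apart they are.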

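3. obj loss
The obj loss in the YOLOv3 paper is defined as follows:
[Formula image: obj loss]
Here, 1^noobj indicates that the grid cell contains no object: it is 1 if there is none and 0 otherwise. The loss between the predicted and the true confidence is computed with focal loss. The code that computes this part of the loss is as follows: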

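# conf loss
# IoU between every predicted box and every ground-truth box of the same image
iou = box_iou(pred_xywh.unsqueeze(4),label_xywh.unsqueeze(1).unsqueeze(1).unsqueeze(1),
              giou=False).squeeze(-1)    # (n,size,size,num anchors,150)
iou_max = iou.max(-1, keepdim=True)[0]   # (n,size,size,num anchors,1)
# negatives: no object assigned and best IoU with any ground-truth box below the threshold
label_noobj_mask = (1.0 - mask_conf) * (iou_max < cfg.iou_thresh).float()
# gamma and alpha are hard-coded here rather than read from cfg.focal_gamma and cfg.focal_alpha
conf_loss = (mask_conf * Focal_loss(input=output_conf,target=mask_conf,gamma=2,alpha=1) +
             label_noobj_mask * Focal_loss(input=output_conf,target=mask_conf,gamma=2,alpha=1))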

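Note that a predicted box is treated as a negative sample only if its IoU with every ground-truth box is below the threshold. This IoU threshold is commonly set to 0.3; the config in this post uses cfg.iou_thresh = 0.225.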
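The Focal_loss function is not shown in the post. One common form, assumed here purely for illustration, weights binary cross-entropy by alpha * |target - p|^gamma, so confident correct predictions contribute very little. The sketch below uses scalar probabilities; `focal_bce` and `is_negative` are hypothetical names and may differ from the author's implementation.

```python
import math


def focal_bce(p, target, gamma=2.0, alpha=1.0):
    """Focal-weighted binary cross-entropy for one probability p (assumed form)."""
    eps = 1e-12  # keeps log() finite at p = 0 or p = 1
    bce = -(target * math.log(p + eps) + (1 - target) * math.log(1 - p + eps))
    return alpha * abs(target - p) ** gamma * bce


def is_negative(mask_conf, iou_max, iou_thresh):
    """A prediction is a negative if no object is assigned and its best IoU is below the threshold."""
    return mask_conf == 0 and iou_max < iou_thresh


print(focal_bce(0.5, 1.0))          # uncertain positive: 0.25 * ln(2)
print(focal_bce(0.9, 1.0))          # confident positive: much smaller
print(is_negative(0, 0.1, 0.225))   # True
print(is_negative(0, 0.5, 0.225))   # False: overlaps a ground-truth box, so it is ignored
```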

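4. cls loss
The cls loss in the YOLOv3 paper is defined as follows:
[Formula image: cls loss]
There is not much to explain here. The code below uses BCE loss, which treats each class as an independent binary prediction (the YOLOv3 paper does the same); for multiple mutually exclusive classes, CrossEntropy loss is an alternative.
The code that computes this part of the loss is as follows: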

# cls loss
cls_loss = mask_conf * BCE_loss(output_cls,mask_cls)

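This completes the walkthrough of YOLOv3's loss function. Next comes backpropagation to update the parameters. The training code also uses a number of techniques that help the model train better, which are reflected in the training section.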

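VI. Training the network
With the preparation above in place, you can now train your own YOLOv3. There is really not much to say about training itself: the training code is much the same from one network to another, and the differences come down to the training strategy, the optimizer, the learning-rate schedule, and so on.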
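The post does not show the learning-rate schedule, but the config defines lr_start, lr_end, and warmup. One schedule consistent with those three values, assumed here only as an illustration, is a linear warmup followed by cosine decay. The function name `learning_rate` and the choice of cosine decay are assumptions, not the author's confirmed method.

```python
import math


def learning_rate(step, total_steps, lr_start=1e-4, lr_end=1e-6, warmup=200):
    """Linear warmup to lr_start, then cosine decay to lr_end (assumed schedule)."""
    if step < warmup:
        return lr_start * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return lr_end + 0.5 * (lr_start - lr_end) * (1 + math.cos(math.pi * progress))


# 300 epochs of 11 batches each (22 images with batch size 2) gives 3300 iterations.
total = 300 * 11
for step in (0, 100, 200, 1750, total):
    print(step, learning_rate(step, total))
```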
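The code is finally more or less complete, and over the past few days I tried training and testing it. Training on a dataset like VOC or COCO would take too long on my computer, so I built a small dataset of 22 images for face detection.
I trained with a batch size of 2, which needs about 3800 MB of GPU memory, so an ordinary graphics card can handle it. The training loss and learning rate are as follows:
[Figure: training loss and learning-rate curves]

Below are a few test results. Because the dataset is so small, the results still leave room for improvement.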