YOLOv1 from Head to Toe: A Hands-On Implementation

Introduction

  • References: the YOLOv1 paper, the dataset, the GitHub repository, and a reference video
  • Environment: Win10 + Python + PyTorch + GPU (1050) + VS Code
  • Key steps to understanding YOLO: understand the model structure + understand the loss function + implement it by hand + read blog write-ups
  • Note: YOLOv1 is an anchor-free method

Structure

Dataset

  • Dataset layout: the dataset consists of images + labels + test + train (anything else is similar or outside the scope of this post). The first two are folders; the latter two are CSV files listing the image and label filenames. Each line of a label file describes one box of one image: class, global x, global y, global width, global height (all normalized to the whole image).
  • dataset.py mainly defines a custom VOCDataset class with __init__(), __len__() and __getitem__()
  • __init__(self, csv_file, img_dir, label_dir, S=7, B=2, C=20, transform=None)
    • Takes the paths to the dataset's images and labels, and indexes the image/label pairs through the CSV file.
    • Splitting the image into cells needs the parameter S -> split_size; B -> the number of bounding boxes predicted per cell; C -> the number of classes.
    • A transform is also needed if you want image augmentation.
  • __len__(self):
    • Simply returns the number of image/label pairs in the CSV file, i.e. the total amount of data.
  • __getitem__(self, index):
    • Returns the (image, label) pair for the given index. The image path is built with the os module, the image is loaded with PIL's Image.open, and the transform is applied if one was given.
    • The label format follows YOLO's own convention: one-hot code + B*(p, x, y, w, h), where the class is one-hot encoded and the five following values are objectness, then coordinates and width/height relative to the cell. Global coordinates therefore have to be converted into cell-relative ones, as in the sketch after this list.
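  • To make that conversion concrete, here is a minimal sketch with made-up numbers, using the same arithmetic as __getitem__:

# hypothetical label values: class 11, box centered at (0.48, 0.63),
# 0.30 wide and 0.20 tall, all relative to the whole image
S = 7
class_label, x, y, w, h = 11, 0.48, 0.63, 0.30, 0.20

i, j = int(S * y), int(S * x)            # cell row/column -> (4, 3)
x_cell, y_cell = S * x - j, S * y - i    # offsets inside the cell -> (~0.36, ~0.41)
width_cell, height_cell = w * S, h * S   # size in cell units -> (2.1, 1.4)

The full dataset.py: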
import torch
from PIL import Image
import os
import pandas as pd

class VOCDataset(torch.utils.data.Dataset):
    def __init__(
        self, csv_file, img_dir, label_dir, S=7, B=2, C=20, transform=None):
        
        self.annotations = pd.read_csv(csv_file)
        self.img_dir = img_dir
        self.label_dir = label_dir
        self.transform = transform
        self.S = S
        self.B = B
        self.C = C

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, index):
        label_path = os.path.join(self.label_dir, self.annotations.iloc[index, 1])
        boxes = []
        with open(label_path) as f:
            for label in f.readlines():
                # class labels are stored as ints, coordinates as floats;
                # int(float(x)) also handles integral values written as "1.0"
                class_label, x, y, width, height = [
                    float(x) if float(x) != int(float(x)) else int(float(x))
                    for x in label.replace("\n", "").split()
                ]

                boxes.append([class_label, x, y, width, height])

        img_path = os.path.join(self.img_dir, self.annotations.iloc[index, 0])
        image = Image.open(img_path)
        boxes = torch.tensor(boxes)

        if self.transform:
            image, boxes = self.transform(image, boxes)

        label_matrix = torch.zeros((self.S, self.S, self.C+self.B*5))
        for box in boxes:
            class_label, x, y, width, height = box.tolist()
            class_label = int(class_label)

            i, j = int(self.S * y), int(self.S * x)
            x_cell, y_cell = self.S*x-j, self.S*y-i
            width_cell, height_cell = (
                width * self.S,
                height * self.S,
            )

            # only one object per cell: fill the cell only if it is still empty
            # (index 20 is the objectness slot, since the first C=20 entries
            # hold the one-hot class scores)
            if label_matrix[i, j, 20] == 0:
                label_matrix[i, j, 20] = 1

                box_coordinates = torch.tensor(
                    [x_cell, y_cell, width_cell, height_cell]
                )
                
                label_matrix[i, j, 21:25] = box_coordinates
                
                label_matrix[i, j, class_label] = 1

        return image, label_matrix
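  • A quick smoke test of the class (hypothetical paths; adjust them to your local layout):

dataset = VOCDataset("data/train.csv", img_dir="data/images", label_dir="data/labels")
image, label_matrix = dataset[0]
print(label_matrix.shape)  # torch.Size([7, 7, 30]) for S=7, B=2, C=20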

Model

  • Note: the model is defined by hand with the building blocks in torch.nn, while its configuration comes from architecture_config.
  • model.py mainly consists of architecture_config, CNNBlock and Yolov1:
  • architecture_config holds the model configuration: a tuple describes one conv block as (kernel_size, out_channels, stride, padding); a str stands for max-pooling; a list stands for a group of repeated conv blocks, with its last element giving the repeat count (see the sketch after this list).
  • CNNBlock: a conv-block class combining the convolution, BatchNorm and LeakyReLU operations.
  • Yolov1: assembles the whole model from the configuration and the conv blocks.
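  • For example, a list entry of the configuration expands as follows (a sketch of what _create_conv_layers does):

entry = [(1, 256, 1, 0), (3, 512, 1, 1), 4]  # two conv specs + a repeat count
conv1, conv2, num_repeats = entry
for _ in range(num_repeats):
    # each pass adds a 1x1 conv to 256 channels followed by a 3x3 conv
    # to 512 channels, so this entry contributes 8 CNNBlocks in total
    print(conv1, conv2)

The full model.py: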
import torch
import torch.nn as nn

# fully connected layers are not included here; they are added by _create_fcs
architecture_config = [
    # tuple: (kernel_size, out_channels, stride, padding)
    (7, 64, 2, 3),
    'M',    # maxpool
    (3, 192, 1, 1),
    'M',
    (1, 128, 1, 0),
    (3, 256, 1, 1),
    (1, 256, 1, 0),
    (3, 512, 1, 1),
    'M',
    # list: tuples*repeats
    [(1, 256, 1, 0), (3, 512, 1, 1), 4],
    (1, 512, 1, 0), 
    (3, 1024, 1, 1),
    'M',
    [(1, 512, 1, 0), (3, 1024, 1, 1), 2],
    (3, 1024, 1, 1),
    (3, 1024, 2, 1),
    (3, 1024, 1, 1),
    (3, 1024, 1, 1),
]

class CNNBlock(nn.Module):
    def __init__(self, in_channels, out_channels, **kwargs):
        super(CNNBlock, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, bias=False, **kwargs)
        self.batchnorm = nn.BatchNorm2d(out_channels)
        self.leakyrelu = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.leakyrelu(self.batchnorm(self.conv(x)))

class Yolov1(nn.Module):
    def __init__(self, in_channels=3, **kwargs):
        super(Yolov1, self).__init__()
        self.architecture = architecture_config
        self.in_channels = in_channels
        self.darknet = self._create_conv_layers(self.architecture)
        self.fcs = self._create_fcs(**kwargs)

    def forward(self, x):
        x = self.darknet(x)
        return self.fcs(torch.flatten(x, start_dim=1))

    def _create_conv_layers(self, architecture):
        layers = []
        in_channels = self.in_channels

        for x in architecture:
            if type(x) == tuple:
                layers += [
                    CNNBlock(
                        in_channels, x[1], kernel_size=x[0], stride=x[2], padding=x[3],
                    )
                ]
                in_channels = x[1]
            elif type(x) == str:
                layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
            elif type(x) == list:
                conv1 = x[0] # tuple
                conv2 = x[1] # tuple
                num_repeats = x[2]    # Integer
                
                for _ in range(num_repeats):
                    layers += [
                        CNNBlock(
                            in_channels,
                            conv1[1],
                            kernel_size=conv1[0],
                            stride=conv1[2],
                            padding=conv1[3],
                        )
                    ]

                    layers += [
                        CNNBlock(
                            conv1[1],
                            conv2[1],
                            kernel_size=conv2[0],
                            stride=conv2[2],
                            padding=conv2[3],
                        )
                    ]
                    
                    in_channels = conv2[1]
        
        return nn.Sequential(*layers)
        
    def _create_fcs(self, split_size, num_boxes, num_classes):
        S, B, C = split_size, num_boxes, num_classes
        return nn.Sequential(
            nn.Flatten(),
            nn.Linear(1024*S*S, 496),   # in the original paper this is 4096
            nn.Dropout(0.0),
            nn.LeakyReLU(0.1),
            nn.Linear(496, S*S*(C+B*5)),
        )

def test(S=7, B=2, C=20):
    # sanity check: a 448x448 input should yield S*S*(C + B*5) = 7*7*30 = 1470 outputs
    model = Yolov1(split_size=S, num_boxes=B, num_classes=C)
    x = torch.randn(2, 3, 448, 448)
    print(model(x).shape)  # expected: torch.Size([2, 1470])

# test()

Loss

  • The loss function of YOLOv1 is defined as follows:
    $$
    \begin{aligned}
    \mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
    &+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
    &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2 \\
    &+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
    &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
    \end{aligned}
    $$
  • To understand how the loss is computed, you have to relate it to the model and the training process: for every cell the model produces one output containing the class-probability encoding, the confidence of each bounding box, and each box's coordinates and size. The sources of loss follow directly: class, confidence, and coordinates/size. These terms also depend on each other: for a given cell, a bounding box's coordinate/size error only counts toward the loss when the cell contains an object and that box is the one responsible for the prediction (the one with the highest IoU).
  • The square roots of w and h are used in the loss because boxes of different sizes have different sensitivity to the same absolute error: an error of 1 matters far more on a box of size 10 than on a box of size 100 (see the numeric sketch after this list).
  • Different terms get different weights. The reasoning: if the class/confidence losses and the coordinate loss were weighted equally, then, since most cells contain no object, the empty cells would dominate the loss and push the model toward near-zero predictions everywhere, making training unstable and prone to divergence. The paper therefore sets λ_coord = 5 and λ_noobj = 0.5.
  • The loss function is likewise implemented as a custom class; when the object is called directly, __call__() automatically invokes forward().
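  • A quick numeric illustration of the square-root trick, with made-up numbers:

import torch

# the same absolute error of 1 cell unit, on a large box vs. a small one
large = (torch.sqrt(torch.tensor(10.0)) - torch.sqrt(torch.tensor(9.0))) ** 2
small = (torch.sqrt(torch.tensor(2.0)) - torch.sqrt(torch.tensor(1.0))) ** 2
print(large.item(), small.item())  # ~0.026 vs ~0.172: small boxes cost more

The full loss.py: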
import torch
import torch.nn as nn
from utils import intersection_over_union


class YoloLoss(nn.Module):
   """
   Calculate the loss for yolo (v1) model
   """

   def __init__(self, S=7, B=2, C=20):
       super(YoloLoss, self).__init__()
       self.mse = nn.MSELoss(reduction="sum")

       """
       S is split size of image (in paper 7),
       B is number of boxes (in paper 2),
       C is number of classes (20 in the paper and for the VOC dataset),
       """
       self.S = S
       self.B = B
       self.C = C

       # These weights are from the YOLO paper: how strongly to penalize
       # the no-object (noobj) term and the box-coordinate (coord) term
       self.lambda_noobj = 0.5
       self.lambda_coord = 5

   def forward(self, predictions, target):
       # predictions are shaped (BATCH_SIZE, S*S*(C+B*5)) when inputted
       predictions = predictions.reshape(-1, self.S, self.S, self.C + self.B * 5)

       # Calculate IoU for the two predicted bounding boxes with target bbox
       iou_b1 = intersection_over_union(predictions[..., 21:25], target[..., 21:25])
       iou_b2 = intersection_over_union(predictions[..., 26:30], target[..., 21:25])
       ious = torch.cat([iou_b1.unsqueeze(0), iou_b2.unsqueeze(0)], dim=0)

       # Take the box with highest IoU out of the two predictions
       # Note that bestbox will be indices of 0, 1 for which bbox was best
       iou_maxes, best_box = torch.max(ious, dim=0)
       exists_box = target[..., 20].unsqueeze(3)  # in paper this is Iobj_i

       # ======================== #
       #   FOR BOX COORDINATES    #
       # ======================== #

       # Set boxes with no object in them to 0. We only keep one of the two
       # predictions: the one with the highest IoU calculated previously.
       box_predictions = exists_box * (
           (
               best_box * predictions[..., 26:30]
               + (1 - best_box) * predictions[..., 21:25]
           )
       )

       box_targets = exists_box * target[..., 21:25]

       # Take sqrt of width and height so that errors on large boxes weigh
       # less than the same absolute errors on small boxes. sign/abs keep
       # the sqrt defined for negative raw predictions, and the epsilon
       # stabilizes the gradient of sqrt at zero.
       box_predictions[..., 2:4] = torch.sign(box_predictions[..., 2:4]) * torch.sqrt(
           torch.abs(box_predictions[..., 2:4]) + 1e-6
       )
       box_targets[..., 2:4] = torch.sqrt(box_targets[..., 2:4])

       box_loss = self.mse(
           torch.flatten(box_predictions, end_dim=-2),
           torch.flatten(box_targets, end_dim=-2),
       )

       # ==================== #
       #   FOR OBJECT LOSS    #
       # ==================== #

       # pred_box is the confidence score for the bbox with highest IoU
       pred_box = (
           best_box * predictions[..., 25:26] + (1 - best_box) * predictions[..., 20:21]
       )

       object_loss = self.mse(
           torch.flatten(exists_box * pred_box),
           torch.flatten(exists_box * target[..., 20:21]),
       )

       # ======================= #
       #   FOR NO OBJECT LOSS    #
       # ======================= #

       #max_no_obj = torch.max(predictions[..., 20:21], predictions[..., 25:26])
       #no_object_loss = self.mse(
       #    torch.flatten((1 - exists_box) * max_no_obj, start_dim=1),
       #    torch.flatten((1 - exists_box) * target[..., 20:21], start_dim=1),
       #)

       no_object_loss = self.mse(
           torch.flatten((1 - exists_box) * predictions[..., 20:21], start_dim=1),
           torch.flatten((1 - exists_box) * target[..., 20:21], start_dim=1),
       )

       no_object_loss += self.mse(
           torch.flatten((1 - exists_box) * predictions[..., 25:26], start_dim=1),
           torch.flatten((1 - exists_box) * target[..., 20:21], start_dim=1)
       )

       # ================== #
       #   FOR CLASS LOSS   #
       # ================== #

       class_loss = self.mse(
           torch.flatten(exists_box * predictions[..., :20], end_dim=-2,),
           torch.flatten(exists_box * target[..., :20], end_dim=-2,),
       )

       loss = (
           self.lambda_coord * box_loss  # first two rows in paper
           + object_loss  # third row in paper
           + self.lambda_noobj * no_object_loss  # fourth row
           + class_loss  # fifth row
       )

       return loss
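  • A minimal smoke test of YoloLoss (random tensors of the right shape, assuming intersection_over_union from the repo's utils is importable):

criterion = YoloLoss()
pred = torch.randn(2, 7 * 7 * 30)   # flattened model output for a batch of 2
target = torch.zeros(2, 7, 7, 30)   # empty target: only the no-object term fires
print(criterion(pred, target))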

Training

  • Model saving: load_checkpoint and save_checkpoint take care of loading and saving the trained model; see the sketch of the round trip below.
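  • A sketch of that round trip, assuming the two helpers wrap plain torch.save / load_state_dict calls (a tiny stand-in model keeps the example self-contained):

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)                      # stand-in for Yolov1
optimizer = optim.Adam(model.parameters(), lr=2e-5)

checkpoint = {
    "state_dict": model.state_dict(),
    "optimizer": optimizer.state_dict(),
}
torch.save(checkpoint, "overfit.pth.tar")    # what save_checkpoint presumably wraps
loaded = torch.load("overfit.pth.tar")
model.load_state_dict(loaded["state_dict"])  # what load_checkpoint presumably wraps
optimizer.load_state_dict(loaded["optimizer"])

The full training script: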
import torch
import torchvision.transforms as transforms
import torch.optim as optim
import torchvision.transforms.functional as FT
from torch.utils.data import DataLoader
from loss import YoloLoss
from model import Yolov1
from dataset import VOCDataset
from tqdm import tqdm
from utils import (
    intersection_over_union,
    non_max_suppression,
    mean_average_precision,
    cellboxes_to_boxes,
    get_bboxes,
    plot_image,
    save_checkpoint,
    load_checkpoint,
)

seed = 123
torch.manual_seed(seed)

# Hyperparameters etc.
LEARN_RATE = 2e-5
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
BATCH_SIZE = 4
WEIGHT_DECAY = 0
EPOCHS = 100
NUM_WORKERS = 2
PIN_MEMORY = True
LOAD_MODEL = False
LOAD_MODEL_FILE = 'overfit.pth.tar'
IMG_DIR = 'data/images'
LABEL_DIR = 'data/labels'

class Compose(object):
    def __init__(self, transforms):
        self.transforms = transforms
    
    def __call__(self, img, bboxes):
        # only the image is transformed; the boxes are relative (0-1)
        # coordinates, so a plain Resize leaves them unchanged
        for t in self.transforms:
            img, bboxes = t(img), bboxes

        return img, bboxes

transform = Compose([
    transforms.Resize((448, 448)), transforms.ToTensor()
])

def train_fn(train_loader, model, optimizer, loss_fn):
    loop = tqdm(train_loader, leave=True)
    mean_loss = []

    for batch_idx, (x, y) in enumerate(loop):
        x, y = x.to(DEVICE), y.to(DEVICE)
        
        out = model(x)
        loss = loss_fn(out, y)
        mean_loss.append(loss.item())
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Update the progress bar
        loop.set_postfix(loss = loss.item())

    print(f"Mean loss war {sum(mean_loss)/len(mean_loss)}")

def main():
    model = Yolov1(split_size=7, num_boxes=2, num_classes=20).to(DEVICE)
    optimizer = optim.Adam(
        model.parameters(), lr=LEARN_RATE, weight_decay=WEIGHT_DECAY
    )
    loss_fn = YoloLoss()

    if LOAD_MODEL:
        load_checkpoint(torch.load(LOAD_MODEL_FILE), model, optimizer)

    train_dataset = VOCDataset(
        "data/8examples.csv",
        transform=transform,
        img_dir=IMG_DIR,
        label_dir=LABEL_DIR,
    )

    test_dataset = VOCDataset(
        "data/test.csv",
        transform=transform,
        img_dir=IMG_DIR,
        label_dir=LABEL_DIR,
    )

    train_loader = DataLoader(
        dataset=train_dataset,
        batch_size=BATCH_SIZE,
        num_workers=NUM_WORKERS,
        pin_memory=PIN_MEMORY,
        shuffle=True,
        drop_last=False,
    )

    test_loader = DataLoader(
        dataset=test_dataset,
        batch_size=BATCH_SIZE,
        num_workers=NUM_WORKERS,
        pin_memory=PIN_MEMORY,
        shuffle=True,
        drop_last=True,
    )

    for epoch in range(EPOCHS):

        # for x, y in train_loader:
        #    x = x.to(DEVICE)
        #    for idx in range(8):
        #        bboxes = cellboxes_to_boxes(model(x))
        #        bboxes = non_max_suppression(bboxes[idx], iou_threshold=0.5, threshold=0.4, box_format="midpoint")
        #        plot_image(x[idx].permute(1,2,0).to("cpu"), bboxes)

        #    import sys
        #    sys.exit()
        
        pred_boxes, target_boxes = get_bboxes(
            train_loader, model, iou_threshold=0.5, threshold=0.4
        )

        mean_avg_prec = mean_average_precision(
            pred_boxes, target_boxes, iou_threshold=0.5, box_format='midpoint'
        )

        print(f"Train mAP: {mean_avg_prec}")

        #if mean_avg_prec > 0.9:
        #    checkpoint = {
        #        "state_dict": model.state_dict(),
        #        "optimizer": optimizer.state_dict(),
        #    }
        #    save_checkpoint(checkpoint, filename=LOAD_MODEL_FILE)
        #    import time
        #    time.sleep(10)

        train_fn(train_loader, model, optimizer, loss_fn)

# for testing the program
if __name__ == "__main__":
    main()

Results
