Team Study: Object Detection


尘归尘-北尘


Category: Beginner's Study Notes. Tags: machine learning, deep learning, object detection

Article link: https://blog.csdn.net/hu_hao/article/details/111188485


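1. Task1 Object detection basics and the VOC dataset

1.1.1 Object detection basics

Object detection means locating the objects of interest in an image and giving a class confidence for each. The basic approach is candidate boxes plus a classification model. The tutorial explains it as follows:

Object detection is an important task in computer vision. In recent years traditional detection methods have struggled to meet the accuracy people expect, and with the great progress deep learning has made on vision tasks, deep-learning-based detectors have become mainstream.

Compared with deep-learning-based image classification, object detection is the harder task.

The specific differences are shown in Figure 3-1.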

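After Alex Krizhevsky won the 2012 ImageNet image classification challenge with AlexNet, deep learning began to shine in image recognition, especially image classification, and attention returned to deep neural networks. Deeper and more complex networks followed, repeatedly breaking the ImageNet classification record.

People found that, with sensible design, neural networks could be applied to all kinds of practical prediction problems, so research on CNN-based object detection began. Further exploration showed, however, that CNNs do not seem to be good at predicting coordinates directly. The number of objects in an image is also not fixed, which makes the model awkward to design.

So the thinking went: if we knew that an object exists at some position in the image, we could feed that local region into a classification network, and we would then know the position and class of every object in the image.

But how do we find out where each object is? We cannot know in advance, but we can guess. Guessing here means using a sliding window to enumerate the possible regions in the image and trying them one by one: each region is fed to the classification network to get its class, and the current bounding box is fine-tuned at the same time. Every region then yields five attributes (class, x1, y1, x2, y2), and collecting them gives the classes and coordinates of the objects in the image.

To summarize the approach: first lay out many candidate boxes, then classify and fine-tune each of them.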
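To make the "guessing" step concrete, here is a minimal sketch of sliding-window enumeration. The function name `sliding_windows` and the window size and stride are illustrative assumptions, not part of the tutorial's code; a real pipeline would repeat this for several window sizes and pass every crop to a classifier.

```python
def sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Yield (x1, y1, x2, y2) for every window that fits inside the image."""
    for y1 in range(0, img_h - win_h + 1, stride):
        for x1 in range(0, img_w - win_w + 1, stride):
            yield (x1, y1, x1 + win_w, y1 + win_h)


# Hypothetical setting: 64x64 windows moved in steps of 32 over a 224x224 image.
windows = list(sliding_windows(224, 224, win_w=64, win_h=64, stride=32))
print(len(windows))              # candidate regions for this one window size
print(windows[0], windows[-1])   # first and last window
```

Even this coarse setting produces dozens of crops for a single window size, which is why running a classifier on each crop separately becomes slow.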


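1.1.2 How bounding boxes are defined

The training data for any image task has two parts, the image and its true label, usually called the ground truth (GT).

In image classification the label is just the class. In object detection the label must also contain the object's position, that is, its enclosing rectangle, the bounding box (bbox).

Two formats are commonly used for a bbox: (x1, y1, x2, y2), the top-left and bottom-right corners, and (c_x, c_y, w, h), the center point plus width and height, as shown in Figure 3-3.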
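Converting between the two formats is simple arithmetic. The sketch below works on plain tuples; the helper names are chosen for illustration and are not the tensor-based helpers in the tutorial's utils.py.

```python
def xy_to_cxcy(box):
    """(x1, y1, x2, y2) -> (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


def cxcy_to_xy(box):
    """(cx, cy, w, h) -> (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


corner = (50, 40, 150, 100)
center = xy_to_cxcy(corner)
print(center)               # center-size form of the same box
print(cxcy_to_xy(center))   # converting back recovers the corners
```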

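1.2 The VOC dataset

VOC is one of the most commonly used datasets for object detection; another is Microsoft's more recent COCO dataset. To use the dataset you need to build a dataloader for VOC, and mature tools already exist for that step.
The tutorial explains:
VOC is one of the most commonly used standard datasets in object detection. Almost every detection paper, such as Faster R-CNN, YOLO, and SSD, reports results trained and evaluated on VOC. Our tutorial therefore runs its experiments on VOC as well; specifically, we use the two most popular versions, VOC2007 and VOC2012, as training and test data.

Dataset classes

VOC groups its classes into 4 major categories and 20 subcategories, as shown in Figure 3-5.

Unpacking the downloaded archive yields the set of folders shown in Figure 3-9. Because VOC is used not only for object detection but also for tasks such as segmentation, it contains files for segmentation in addition to those needed for detection, for example SegmentationClass and SegmentationObject.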

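Task2 Network design

Having covered the general principle of object detection in the previous chapter, this chapter looks at how the algorithm actually works. Earlier we said to set up prior boxes, slide them across the image, and classify each one. How should the prior boxes be handled in the most principled way? Current practice is to downsample the image into a feature map, place several different prior boxes at each feature-map cell, classify each prior, and finally fine-tune the priors that match objects. The details follow.

1.1 Anchor boxes, or prior boxes

About prior boxes
Many classic detection models use the notion of prior boxes. Some papers (such as Faster R-CNN) call them anchors, others (such as SSD) call them prior bounding boxes; they are the same concept.

Why do we need prior boxes at all? In principle we feed the image to the model and the model returns the detections, so why have priors? To see their role, recall the initial solution to object detection from Section 2.1: enumerate every possible object box in the image, then classify and fine-tune those boxes to complete the detection task.

That description is vague, so you probably do not have a clear picture yet. The prior boxes introduced in this section answer the question of exactly which positions count as candidate boxes.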

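Three concepts need to be introduced:

Setting prior boxes at different scales
The correspondence between prior boxes and the feature map
Determining the class label of each prior box

Setting prior boxes at different scales

To cover more cases, we usually place several prior boxes of different scales at the same position in the image. "Different scales" here means not only size but also aspect ratio, as the accompanying diagram illustrates.
With priors of different scales, there is a higher chance that some prior matches the target object well (that is, has a high IoU).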

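The correspondence between prior boxes and the feature map

Besides varying the scale, we also have to spread the priors over different positions in the image.

Visiting every pixel of the original image, however, would create far too many priors, which is unnecessary, as Figure 3-13 shows. For a 224x224 image with 3 priors of different sizes per position, that is 224x224x3 = 150528 priors. What if we traverse a downsampled feature map instead of the original image? Taking a VGG16 backbone as an example, five downsampling steps give a 7x7 feature map, so only 7x7x3 = 147 priors are needed. This greatly reduces the number of priors while still covering most cases.
We therefore put the prior positions in one-to-one correspondence with the feature-map cells. With this mapping, the feature map can output the class and coordinate information for all priors in a single pass, instead of running a separate classification for every candidate box as described earlier, which would be far too slow. (The later sections make the meaning of this passage, and the importance of the one-to-one mapping, much clearer.)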

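Determining the class label of each prior box

Having laid out many priors, we must first assign each prior a class label, so that the model can learn to predict whether each prior corresponds to an object.

Many of these priors have no overlap, or only a very small overlap, with the objects we want to detect in the image.

Our approach is to set an IoU threshold, for example 0.5. Priors whose IoU with the objects in the image is below 0.5 are labeled background; priors with IoU >= 0.5 are labeled as object priors. This split produces the ground truth the model learns from, as shown in Figure 3-14:

Figure 3-14 Labeling the prior boxes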
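The thresholding rule can be sketched in a few lines. This is an illustration under simple assumptions: boxes are plain (x1, y1, x2, y2) tuples in normalized coordinates, there is a single ground-truth box, and the names `iou` and `label_priors` are invented for this example.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_priors(priors, gt_box, threshold=0.5):
    """Label each prior 'object' if its IoU with gt_box reaches the threshold."""
    return ["object" if iou(p, gt_box) >= threshold else "background"
            for p in priors]


gt = (0.2, 0.2, 0.6, 0.6)
priors = [
    (0.25, 0.25, 0.65, 0.65),  # overlaps the object heavily
    (0.5, 0.5, 0.9, 0.9),      # overlaps only at one corner
    (0.0, 0.0, 0.1, 0.1),      # no overlap at all
]
labels = label_priors(priors, gt)
print(labels)
```

Only the first prior clears the 0.5 threshold; the other two become background even though one of them touches the object.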
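3.3.2 Generating the prior boxes
Here we walk through the code that generates the priors. How the priors are used in detail, and training tricks such as filtering priors, are covered in later sections.

The model.py script defines a tiny_detector class, which is the detection network described in this chapter. It implements a create_prior_boxes function, which generates the priors.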

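"""
Setup details:

Grid size fmap_dims = 7: the final VGG16 feature map is 7*7.
The earlier example assumed three prior sizes and then iterated over positions. When generating priors, the sizes are fixed in advance.
This tutorial defines 9 candidate boxes of different sizes and shapes for every cell of the feature map (3 scales * 3 aspect ratios = 9).

Generation process:
0. cx, cy denote the center coordinates.
1. Iterate over every cell of the feature map. Adding 0.5 to i moves from the cell's corner to its center; dividing by fmap_dims normalizes the coordinate to the feature-map size.
2. At this point we could already generate one box per cell, but that is not what we want; we call it the base prior box (base_prior_bbox).
3. Starting from the 1:1 base box of each cell, combine the 3 scales in obj_scales with the 3 aspect ratios in aspect_ratios to get the 9 priors of that cell.
4. The result is stored in prior_boxes and returned.

Note that the priors obtained here are normalized to the feature-map size, so to compute IoU on the original image, or to draw them over it, you need:
img_prior_boxes = prior_boxes * image size
"""

from math import sqrt

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def create_prior_boxes():
    """
    Create the 441 prior (default) boxes for the network, as described in the tutorial.
    The final VGG16 feature map is 7*7.
    Every cell of the feature map gets 9 candidate boxes of different sizes and shapes
    (3 scales * 3 aspect ratios = 9),
    so the total number of candidate boxes is 7 * 7 * 9 = 441.
    :return: prior boxes in center-size coordinates, a tensor of dimensions (441, 4)
    """
    fmap_dims = 7
    obj_scales = [0.2, 0.4, 0.6]
    aspect_ratios = [1., 2., 0.5]

    prior_boxes = []
    for i in range(fmap_dims):
        for j in range(fmap_dims):
            # center of cell (row i, column j), normalized to [0, 1]
            cx = (j + 0.5) / fmap_dims
            cy = (i + 0.5) / fmap_dims

            for obj_scale in obj_scales:
                for ratio in aspect_ratios:
                    # width = scale * sqrt(ratio), height = scale / sqrt(ratio)
                    prior_boxes.append([cx, cy, obj_scale * sqrt(ratio), obj_scale / sqrt(ratio)])

    prior_boxes = torch.FloatTensor(prior_boxes).to(device)  # (441, 4)
    prior_boxes.clamp_(0, 1)  # (441, 4)

    return prior_boxes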

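With the code above we have the priors, so let us visualize them. To keep the picture readable, only the priors of the cell at the center of the feature map are shown.

For comparison, we use two sets of obj_scales values.

obj_scales = [0.1, 0.2, 0.3]
These values are normalized: 0.1 means the base anchor size is 0.1 of the original image's width/height.

Figure 3-15 Prior boxes for obj_scales = [0.1, 0.2, 0.3]
As you can see, we get priors of each scale and aspect ratio at the center of the image.

obj_scales = [0.2, 0.4, 0.6]

Figure 3-16 Prior boxes for obj_scales = [0.2, 0.4, 0.6]
The two settings are compared to point out a small issue that needs attention: boxes crossing the image boundary. In the second visualization some of the blue and green priors extend beyond the image. This happens very easily, and the closer a prior is to the edge of the image, the more likely it is to cross it. How is this handled? Usually the out-of-bounds priors are clipped to the image size. For example, a prior whose top-left corner is (-5, -9) is clipped to (0, 0), and a prior whose bottom-right corner is (324, 134) is clipped to (224, 134) when the image is (224, 224).

In the code this corresponds to the line prior_boxes.clamp_(0, 1); because the coordinates are normalized, the clipping range is 0 to 1. Note that the priors are stored in center-size form (cx, cy, w, h), so this call bounds those four values to [0, 1] rather than clipping corner coordinates.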
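The clipping example in pixel coordinates can be checked with a small sketch. The helper `clamp_box` is a name invented for this illustration; it clips corner coordinates, which is the operation the text describes.

```python
def clamp_box(box, img_w, img_h):
    """Clip an (x1, y1, x2, y2) box so that it lies inside the image."""
    x1, y1, x2, y2 = box
    return (min(max(x1, 0), img_w), min(max(y1, 0), img_h),
            min(max(x2, 0), img_w), min(max(y2, 0), img_h))


# The box from the text: top-left (-5, -9), bottom-right (324, 134), image 224x224.
clipped = clamp_box((-5, -9, 324, 134), 224, 224)
print(clipped)
```

To clip the normalized priors in the same corner-wise sense, one would first convert them from center-size form to corner form, clip to [0, 1], and convert back.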

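1.2 Model structure

The network presented in this chapter, which we call Tiny_Detector from here on, was designed specifically for this tutorial and is not one of the classic detection networks. If we must trace its origin: the code is adapted from an open-source SSD tutorial written abroad, so many details are close to SSD, and it can be regarded as a simplified version whose purpose is to help readers get started.

Let us now look at the structure of Tiny_Detector.

To keep the structure simple and easy to follow, we use VGG16 as the backbone: the feature extractor is the VGG16 architecture exactly, with only the fully connected layers fc6 and fc7 removed, as shown in Figure 3-17:

As for the input size: the ImageNet-pretrained VGG16 was trained on 224x224 images, so we fix our network input at 224x224 as well; matching the scale of the pretrained model lets it work at its best. Generally speaking this input size is on the small side for a detection network. After finishing this chapter, you may want to try enlarging the input and see whether the results improve.

The feature extractor is defined in the VGGBase class in model.py: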

import torch.nn as nn
import torch.nn.functional as F
import torchvision


class VGGBase(nn.Module):
    """
    VGG base convolutions to produce feature maps.
    Uses the VGG16 architecture as the feature extractor, dropping the fully connected layers fc6 and fc7.
    The ImageNet-pretrained VGG16 was trained on 224x224 images, so our network input is fixed at 224x224 too.
    """

    def __init__(self):
        super(VGGBase, self).__init__()

        # Standard convolutional layers in VGG16
        self.conv1_1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # stride = 1, by default
        self.conv1_2 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)    # 224->112

        self.conv2_1 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.conv2_2 = nn.Conv2d(128, 128, kernel_size=3, padding=1)
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)    # 112->56

        self.conv3_1 = nn.Conv2d(128, 256, kernel_size=3, padding=1)
        self.conv3_2 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
        self.conv3_3 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
        self.pool3 = nn.MaxPool2d(kernel_size=2, stride=2)    # 56->28

        self.conv4_1 = nn.Conv2d(256, 512, kernel_size=3, padding=1)
        self.conv4_2 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv4_3 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.pool4 = nn.MaxPool2d(kernel_size=2, stride=2)    # 28->14

        self.conv5_1 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv5_2 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv5_3 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.pool5 = nn.MaxPool2d(kernel_size=2, stride=2)    # 14->7

        # Load pretrained weights on ImageNet
        self.load_pretrained_layers()

    def forward(self, image):
        """
        Forward propagation.

        :param image: images, a tensor of dimensions (N, 3, 224, 224)
        :return: feature maps pool5
        """
        out = F.relu(self.conv1_1(image))  # (N, 64, 224, 224)
        out = F.relu(self.conv1_2(out))  # (N, 64, 224, 224)
        out = self.pool1(out)  # (N, 64, 112, 112)

        out = F.relu(self.conv2_1(out))  # (N, 128, 112, 112)
        out = F.relu(self.conv2_2(out))  # (N, 128, 112, 112)
        out = self.pool2(out)  # (N, 128, 56, 56)

        out = F.relu(self.conv3_1(out))  # (N, 256, 56, 56)
        out = F.relu(self.conv3_2(out))  # (N, 256, 56, 56)
        out = F.relu(self.conv3_3(out))  # (N, 256, 56, 56)
        out = self.pool3(out)  # (N, 256, 28, 28)

        out = F.relu(self.conv4_1(out))  # (N, 512, 28, 28)
        out = F.relu(self.conv4_2(out))  # (N, 512, 28, 28)
        out = F.relu(self.conv4_3(out))  # (N, 512, 28, 28)
        out = self.pool4(out)  # (N, 512, 14, 14)

        out = F.relu(self.conv5_1(out))  # (N, 512, 14, 14)
        out = F.relu(self.conv5_2(out))  # (N, 512, 14, 14)
        out = F.relu(self.conv5_3(out))  # (N, 512, 14, 14)
        out = self.pool5(out)  # (N, 512, 7, 7)

        # return 7*7 feature map
        return out

    def load_pretrained_layers(self):
        """
        we use a VGG-16 pretrained on the ImageNet task as the base network.
        There's one available in PyTorch, see https://pytorch.org/docs/stable/torchvision/models.html#torchvision.models.vgg16
        We copy these parameters into our network. It's straightforward for conv1 to conv5.
        """
        # Current state of base
        state_dict = self.state_dict()
        param_names = list(state_dict.keys())

        # Pretrained VGG base
        pretrained_state_dict = torchvision.models.vgg16(pretrained=True).state_dict()
        pretrained_param_names = list(pretrained_state_dict.keys())

        # Transfer conv. parameters from pretrained model to current model
        # (the i-th parameter here lines up with the i-th pretrained parameter)
        for i, param in enumerate(param_names):
            state_dict[param] = pretrained_state_dict[pretrained_param_names[i]]

        self.load_state_dict(state_dict)
        print("\nLoaded base model.\n")

The feature extractor of our Tiny_Detector therefore outputs a 7x7 feature map. Next we place the corresponding prior boxes, or anchors, on this feature map.
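The 224 -> 7 reduction can be verified with the standard output-size formula for convolution and pooling layers. The sketch below is pure arithmetic and does not need PyTorch; `conv_out` is a helper name invented for this check, and the per-block convolution counts (2, 2, 3, 3, 3) are read off the VGGBase definition.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


size = 224
sizes = [size]
for n_convs in (2, 2, 3, 3, 3):          # conv layers in blocks 1 to 5
    for _ in range(n_convs):
        size = conv_out(size, kernel=3, stride=1, padding=1)  # 3x3, pad 1: size unchanged
    size = conv_out(size, kernel=2, stride=2)                 # 2x2 max-pool: size halved
    sizes.append(size)

print(sizes)
```

The 3x3 convolutions with padding 1 never change the spatial size, so only the five pooling layers matter: 224 / 2^5 = 7.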

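The concept of prior boxes was introduced in the previous section. In this experiment the anchors are configured as follows:

Divide the original image evenly into 7x7 cells
Use 3 different scales: 0.2, 0.4, 0.6
Use 3 different aspect ratios: 1:1, 1:2, 2:1
So for the 7x7 feature map we set up 7x7x9 anchors, 9 per cell.

For each anchor we need to predict two kinds of information: the class of the anchor and the bounding box of the object, as in Figure 3-19:

In our experiment the class information consists of scores for 21 classes (the 20 VOC classes plus one background class). The model finally takes the class with the highest predicted score as the class of the object in the box.

The bounding-box information describes how, given that the anchor roughly contains an object, the anchor should be adjusted so that the final prediction matches the object's bbox accurately.

We call these two predictions the classification head and the regression head. How are they obtained?

We only need to attach two 3x3 convolutional layers after the 7x7 feature map, one for classification and one for regression.

The classification and regression heads are described in more detail below.

Tiny_Detector does not predict the object box directly. It regresses how much the anchor must be adjusted to predict the bounding box more accurately, so we need a way to quantify that offset.

Figure 3-21 shows an example of a ground-truth box and a prior box for a dog:

The model predicts the offset between the anchor and the object box, and this offset is normalized in a particular way. We call this process bounding-box encoding.

The encoding and decoding of boxes is implemented in utils.py: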

import torch


def cxcy_to_gcxgcy(cxcy, priors_cxcy):
    """
    Encode bounding boxes (that are in center-size form) w.r.t. the corresponding prior boxes (that are in center-size form).

    For the center coordinates, find the offset with respect to the prior box, and scale by the size of the prior box.
    For the size coordinates, scale by the size of the prior box, and convert to the log-space.

    In the model, we are predicting bounding box coordinates in this encoded form.

    :param cxcy: bounding boxes in center-size coordinates, a tensor of size (n_priors, 4)
    :param priors_cxcy: prior boxes with respect to which the encoding must be performed, a tensor of size (n_priors, 4)
    :return: encoded bounding boxes, a tensor of size (n_priors, 4)
    """

    # The 10 and 5 below are referred to as 'variances' in the original SSD Caffe repo, completely empirical
    # They are for some sort of numerical conditioning, for 'scaling the localization gradient'
    # See https://github.com/weiliu89/caffe/issues/155
    return torch.cat([(cxcy[:, :2] - priors_cxcy[:, :2]) / (priors_cxcy[:, 2:] / 10),  # g_c_x, g_c_y
                      torch.log(cxcy[:, 2:] / priors_cxcy[:, 2:]) * 5], 1)  # g_w, g_h


def gcxgcy_to_cxcy(gcxgcy, priors_cxcy):
    """
    Decode bounding box coordinates predicted by the model, since they are encoded in the form mentioned above.

    They are decoded into center-size coordinates.

    This is the inverse of the function above.

    :param gcxgcy: encoded bounding boxes, i.e. output of the model, a tensor of size (n_priors, 4)
    :param priors_cxcy: prior boxes with respect to which the encoding is defined, a tensor of size (n_priors, 4)
    :return: decoded bounding boxes in center-size form, a tensor of size (n_priors, 4)
    """

    return torch.cat([gcxgcy[:, :2] * priors_cxcy[:, 2:] / 10 + priors_cxcy[:, :2],  # c_x, c_y
                      torch.exp(gcxgcy[:, 2:] / 5) * priors_cxcy[:, 2:]], 1)  # w, h

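To see what the encoding does to a single box, here is a scalar re-statement of the same formulas using only the math module. The functions `encode` and `decode` and the example boxes are illustrative; they mirror cxcy_to_gcxgcy and gcxgcy_to_cxcy above for one box at a time, including the empirical factors 10 and 5.

```python
import math


def encode(box, prior):
    """Encode a (cx, cy, w, h) box as offsets relative to a (cx, cy, w, h) prior."""
    cx, cy, w, h = box
    pcx, pcy, pw, ph = prior
    return ((cx - pcx) / (pw / 10), (cy - pcy) / (ph / 10),
            math.log(w / pw) * 5, math.log(h / ph) * 5)


def decode(offsets, prior):
    """Invert encode(): recover the (cx, cy, w, h) box from its offsets."""
    gcx, gcy, gw, gh = offsets
    pcx, pcy, pw, ph = prior
    return (gcx * pw / 10 + pcx, gcy * ph / 10 + pcy,
            math.exp(gw / 5) * pw, math.exp(gh / 5) * ph)


prior = (0.5, 0.5, 0.4, 0.4)      # a prior at the image center
gt = (0.55, 0.45, 0.5, 0.3)       # a ground-truth box near it
offsets = encode(gt, prior)
recovered = decode(offsets, prior)
print(offsets)     # what the regression head is trained to output
print(recovered)   # decoding gives back the ground-truth box
```

A center shift of 0.05 relative to a prior of width 0.4 becomes an offset of 1.25, so the regression targets have a comparable magnitude regardless of how large the prior is.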

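To obtain the classes and offsets we want to predict, we attach two convolutional layers after the feature map:

1) A classification layer using 3x3 kernels with padding and stride both 1. Each anchor needs 21 filters and each position has 9 anchors, so 21x9 filters are needed.

2) A localization layer, also using 3x3 kernels with padding and stride both 1. Each anchor needs 4 filters, so 4x9 filters are needed.

The structure of the classification and regression heads is implemented by the PredictionConvolutions class in model.py: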

class PredictionConvolutions(nn.Module):

    """
    Convolutions to predict class scores and bounding boxes using feature maps.

    The bounding boxes (locations) are predicted as encoded offsets w.r.t each of the 441 prior (default) boxes.
    See 'cxcy_to_gcxgcy' in utils.py for the encoding definition.
    The encoding of the predicted coordinates follows the SSD definition exactly.

    The class scores represent the scores of each object class in each of the 441 bounding boxes located.
    A high score for 'background' = no object.
    """

    def __init__(self, n_classes):
        """ 
        :param n_classes: number of different types of objects
        """
        super(PredictionConvolutions, self).__init__()

        self.n_classes = n_classes

        # Number of prior-boxes we are considering per position in the feature map
        # 9 prior-boxes per position = 3 scales x 3 aspect ratios
        n_boxes = 9 

        # Localization prediction convolutions (predict offsets w.r.t prior-boxes)
        self.loc_conv = nn.Conv2d(512, n_boxes * 4, kernel_size=3, padding=1)

        # Class prediction convolutions (predict classes in localization boxes)
        self.cl_conv = nn.Conv2d(512, n_boxes * n_classes, kernel_size=3, padding=1)

        # Initialize convolutions' parameters
        self.init_conv2d()


    def init_conv2d(self):
        """
        Initialize convolution parameters.
        """
        for c in self.children():
            if isinstance(c, nn.Conv2d):
                nn.init.xavier_uniform_(c.weight)
                nn.init.constant_(c.bias, 0.)


    def forward(self, pool5_feats):
        """
        Forward propagation.

        :param pool5_feats: pool5 feature map, a tensor of dimensions (N, 512, 7, 7)
        :return: 441 locations and class scores (i.e. w.r.t each prior box) for each image
        """
        batch_size = pool5_feats.size(0)

        # Predict localization boxes' bounds (as offsets w.r.t prior-boxes)
        l_conv = self.loc_conv(pool5_feats)  # (N, n_boxes * 4, 7, 7)
        l_conv = l_conv.permute(0, 2, 3, 1).contiguous()  
        # (N, 7, 7, n_boxes * 4), to match prior-box order (after .view())
        # (.contiguous() ensures it is stored in a contiguous chunk of memory, needed for .view() below)
        locs = l_conv.view(batch_size, -1, 4)  # (N, 441, 4), there are a total 441 boxes on this feature map

        # Predict classes in localization boxes
        c_conv = self.cl_conv(pool5_feats)  # (N, n_boxes * n_classes, 7, 7)
        c_conv = c_conv.permute(0, 2, 3, 1).contiguous()  # (N, 7, 7, n_boxes * n_classes), to match prior-box order (after .view())
        classes_scores = c_conv.view(batch_size, -1, self.n_classes)  # (N, 441, n_classes), there are a total 441 boxes on this feature map

        return locs, classes_scores

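Following the description above, the output shapes of the model should be:

Classification head: batch_size x 7 x 7 x 189
Regression head: batch_size x 7 x 7 x 36
For easier processing later, however, we would much rather have each anchor's prediction on its own axis:

Classification head: batch_size x 441 x 21
Regression head: batch_size x 441 x 4
441 is the total number of priors the model defines, 441 = 7x7x9. The conversion corresponds to these two lines of code:

locs = l_conv.view(batch_size, -1, 4)

classes_scores = c_conv.view(batch_size, -1, self.n_classes)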
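The reshape works because, after the channels-last permute, the 36 (or 189) channels of each cell are stored contiguously, anchor by anchor. The following pure-Python sketch spells out the arithmetic; `flat_prior_index` is a helper invented for this illustration, and it assumes the row-major layout produced by permute(0, 2, 3, 1) followed by contiguous().

```python
fmap_dims, n_boxes, n_classes = 7, 9, 21

loc_channels = n_boxes * 4                    # channels of the regression head
cls_channels = n_boxes * n_classes            # channels of the classification head
n_priors = fmap_dims * fmap_dims * n_boxes    # rows after .view(batch_size, -1, ...)


def flat_prior_index(i, j, a):
    """Row of the (441, ...) output that holds anchor a of cell (row i, column j)."""
    return (i * fmap_dims + j) * n_boxes + a


indices = [flat_prior_index(i, j, a)
           for i in range(fmap_dims)
           for j in range(fmap_dims)
           for a in range(n_boxes)]

print(loc_channels, cls_channels, n_priors)
print(indices[:3], indices[-1])
```

The loop order (row, column, anchor) is the same order in which create_prior_boxes appends its boxes, which is why row k of the predictions lines up with prior k.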


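Task3 Loss function

0 Overview

A detection prediction has two parts, the object location and the class, so the loss function also has two parts: a localization loss and a class confidence loss. The tutorial defines them in detail:

1.1 Matching strategy:

We have laid out many prior bboxes. Before they can predict classes and boxes, we need to know which object each prior bbox corresponds to; only then can we judge whether a prediction is accurate and carry on with training.

Different methods match ground truth boxes to prior bboxes in broadly similar ways, with differences in the details. Here we use the matching strategy from SSD:

First rule: starting from the ground truth boxes, find for each ground truth box the prior bbox with the largest jaccard overlap. This guarantees that every ground truth box is paired with a prior bbox (jaccard overlap is simply IoU, introduced earlier and shown in Figure 3-26). Conversely, a prior bbox that is matched to no ground truth can only be matched to the background; it is a negative sample.

An image contains very few ground truths but a great many prior bboxes. Matching by the first rule alone would leave most prior bboxes negative and the positive and negative samples badly imbalanced, hence the second rule.

Second rule: starting from the prior bboxes, try to pair each still-unmatched prior bbox with any ground truth box. If the jaccard overlap between the two exceeds a threshold (usually 0.5), that prior bbox is also matched to that ground truth. This means one ground truth may be matched to several prior bboxes, which is fine. The reverse is not allowed: a prior bbox can match only one ground truth. If several ground truths have an IoU above the threshold with the same prior bbox, the prior bbox is matched only to the ground truth with the largest IoU.

Note: the second rule must be applied after the first. Consider this case: the prior bbox with the largest IoU for some ground truth has an IoU below the threshold, yet that same prior bbox has an IoU above the threshold with another ground truth. Which should the prior bbox match? The answer is the former, because we must first make sure every ground truth has a prior bbox matched to it.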

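An example illustrates these matching rules:

The 7 red boxes in the image are the priors and the yellow boxes are the ground truths; this image contains three real objects. Applying the steps listed above to this image yields the matches shown in the figure.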
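The two rules can also be written down as a short sketch. This is an illustration with plain tuples and invented names (`match_priors`, the example boxes), not the tutorial's own tensor-based implementation. It returns, for each prior, the index of its matched ground truth, or -1 for background.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def match_priors(priors, gt_boxes, threshold=0.5):
    overlaps = [[iou(p, g) for g in gt_boxes] for p in priors]
    matches = [-1] * len(priors)

    # Rule 1: every ground truth claims the prior it overlaps most,
    # even if that overlap is below the threshold.
    for g in range(len(gt_boxes)):
        best_p = max(range(len(priors)), key=lambda p: overlaps[p][g])
        matches[best_p] = g

    # Rule 2: each remaining prior takes its best ground truth,
    # but only if the overlap exceeds the threshold.
    for p in range(len(priors)):
        if matches[p] != -1:
            continue
        best_g = max(range(len(gt_boxes)), key=lambda g: overlaps[p][g])
        if overlaps[p][best_g] > threshold:
            matches[p] = best_g
    return matches


gt_boxes = [(0.1, 0.1, 0.4, 0.4), (0.6, 0.6, 0.9, 0.9)]
priors = [
    (0.12, 0.1, 0.42, 0.4),   # best prior for ground truth 0
    (0.1, 0.15, 0.4, 0.45),   # also overlaps ground truth 0 well
    (0.7, 0.7, 1.0, 1.0),     # weak overlap, but the best one for ground truth 1
    (0.4, 0.4, 0.6, 0.6),     # touches both objects, overlaps neither
]
matches = match_priors(priors, gt_boxes)
print(matches)
```

The third prior shows why rule 1 matters: its IoU with the second object is below 0.5, so rule 2 alone would have left that object without any positive prior.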

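1.2 Summary

The matching strategy uses IoU as its measure. First, every ground-truth box must get a matched prior. Then, for the remaining priors, each takes the ground truth with which it has the largest IoU, and counts as a positive example of that class whenever the IoU exceeds the threshold. Finally, even when a ground-truth box overlaps poorly with all priors, it must still be guaranteed one matched prior first.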

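2 Loss function

Next we look at how the loss function is designed; see the formula in the tutorial:

Here l is the predicted box and g is the ground truth. (cx, cy) is the center of the default box d after compensation (regress to offsets), and (w, h) are the width and height of the default box. The figure in the tutorial gives a more detailed explanation.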
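For reference, the loss as defined in the SSD paper, whose strategy this tutorial says it follows, is written out here. It is given as a reconstruction on the assumption that the tutorial's formula is the SSD one; the symbols l, g, and d match the description above.

```latex
L(x, c, l, g) = \frac{1}{N}\left( L_{conf}(x, c) + \alpha L_{loc}(x, l, g) \right)

L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}}
    x_{ij}^{k} \, \mathrm{smooth}_{L1}\left( l_i^{m} - \hat{g}_j^{m} \right)

\hat{g}_j^{cx} = \frac{g_j^{cx} - d_i^{cx}}{d_i^{w}}, \quad
\hat{g}_j^{cy} = \frac{g_j^{cy} - d_i^{cy}}{d_i^{h}}, \quad
\hat{g}_j^{w} = \log\frac{g_j^{w}}{d_i^{w}}, \quad
\hat{g}_j^{h} = \log\frac{g_j^{h}}{d_i^{h}}

L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log \hat{c}_i^{p}
    - \sum_{i \in Neg} \log \hat{c}_i^{0},
\quad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}
```

N is the number of matched (positive) priors, x_ij^p is 1 when prior i is matched to ground truth j of class p and 0 otherwise, c are the class scores, and α weights the localization term (the SSD paper sets it to 1). The targets ĝ are the same encoded offsets computed by cxcy_to_gcxgcy, apart from the extra factors 10 and 5 used in that code.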

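3 Hard negative mining:

Note that in general the number of negative prior bboxes is much larger than the number of positive prior bboxes (negative >> positive). Training on all of them directly makes the network pay too much attention to negatives, and the predictions are poor. To keep positives and negatives reasonably balanced, we use the online hard negative mining strategy from SSD: sort the negative prior bboxes by confidence loss, train only on those with high confidence loss, and hold the ratio at positive:negative = 1:3. In essence, only the hard negatives that are easily misclassified are used for training, which keeps the samples balanced and the training effective.

For example: suppose that among the 441 prior bboxes, matching yields P positive priors and 441−P negative priors. Sort the negative prior bboxes by prediction loss in descending order and keep the top M. M is determined by the chosen positive-to-negative ratio; with a ratio of 1:3 we take M = 3P. These M hardest negatives are the ones that actually take part in the loss; the remaining negatives do not contribute to the classification loss.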
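The selection step can be sketched on a toy example with 8 priors instead of 441. The function name `hard_negative_mining`, the loss values, and the final averaging over the number of positives are assumptions made for this illustration.

```python
def hard_negative_mining(conf_losses, is_positive, neg_pos_ratio=3):
    """Return the indices of priors that contribute to the classification loss."""
    positives = [i for i, pos in enumerate(is_positive) if pos]
    negatives = [i for i, pos in enumerate(is_positive) if not pos]

    # Sort negatives from hardest (largest loss) to easiest and keep the top M = ratio * P.
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    hard_negatives = negatives[:neg_pos_ratio * len(positives)]

    return sorted(positives + hard_negatives)


# Per-prior classification losses (made-up numbers); only prior 1 is positive.
losses = [0.2, 2.5, 0.1, 1.7, 0.05, 0.9, 3.0, 0.3]
is_pos = [False, True, False, False, False, False, False, False]

keep = hard_negative_mining(losses, is_pos)
conf_loss = sum(losses[i] for i in keep) / sum(is_pos)
print(keep)        # the positive prior plus the 3 hardest negatives
print(conf_loss)   # classification loss, averaged over the number of positives
```

With P = 1 positive, M = 3 negatives survive; the four easy negatives with small losses are ignored entirely.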

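Summary

This section was about how training is carried out, in three parts:

Matching priors to GT boxes
Computing the loss function
Hard negative mining
The three parts have to be understood together, so let us go through the loss computation once more:

1) Matching priors to GT boxes

Following the scheme described above, assign a class to every prior and decide whether it is a positive or a negative sample.

2) Computing the loss
Compute the classification loss and the box regression loss using the loss function we defined.

Negative samples do not contribute to the box regression loss.

3) Hard negative mining
The classification part of the loss computed above is not yet the final loss.

Because there are too many negative priors, we keep the negatives with the highest loss according to a preset ratio, usually 1:3, ignore the rest, and recompute the classification loss.

The complete loss computation is in the MultiBoxLoss class in model.py.

The material in this study session comes from the Datawhale team-study course 动手学CV (Hands-on CV).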

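Task4 Training and testing

1 Model training

The previous chapters covered each of the key pieces of training a detector. Now we tie the whole pipeline together and train the model.

Training a detection network roughly follows these steps:

Set the hyperparameters
Define the data loading module (dataloader)
Define the network (model)
Define the loss function (loss)
Define the optimizer (optimizer)
Iterate over the training data: predict, compute the loss, backpropagate
First we import the necessary libraries and set the hyperparameters: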

import time                                                                                                                                    
import torch.backends.cudnn as cudnn
import torch.optim
import torch.utils.data
from model import tiny_detector, MultiBoxLoss
from datasets import PascalVOCDataset
from utils import *

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cudnn.benchmark = True

# Data parameters
data_folder = '../../../dataset/VOCdevkit'  # data files root path
keep_difficult = True  # use objects considered difficult to detect?
n_classes = len(label_map)  # number of different types of objects

# Learning parameters
total_epochs = 230 # number of epochs to train
batch_size = 32  # batch size
workers = 4  # number of workers for loading data in the DataLoader
print_freq = 100  # print training status every __ batches
lr = 1e-3  # learning rate
decay_lr_at = [150, 190]  # decay learning rate after these many epochs
decay_lr_to = 0.1  # decay learning rate to this fraction of the existing learning rate
momentum = 0.9  # momentum
weight_decay = 5e-4  # weight decay
Following the steps outlined above, the training code is:

def main():
    """
    Training.
    """
    # Initialize model and optimizer
    model = tiny_detector(n_classes=n_classes)
    criterion = MultiBoxLoss(priors_cxcy=model.priors_cxcy)
    optimizer = torch.optim.SGD(params=model.parameters(),
                                lr=lr, 
                                momentum=momentum,
                                weight_decay=weight_decay)

    # Move to default device
    model = model.to(device)
    criterion = criterion.to(device)

    # Custom dataloaders
    train_dataset = PascalVOCDataset(data_folder,
                                     split='train',
                                     keep_difficult=keep_difficult)
    train_loader = torch.utils.data.DataLoader(train_dataset,   
                                    batch_size=batch_size,
                                    shuffle=True,
                                    collate_fn=train_dataset.collate_fn, 
                                    num_workers=workers,
                                    pin_memory=True) 

    # Epochs
    for epoch in range(total_epochs):
        # Decay learning rate at particular epochs
        if epoch in decay_lr_at:
            adjust_learning_rate(optimizer, decay_lr_to)

        # One epoch's training                                                                                                                 
        train(train_loader=train_loader,
              model=model,
              criterion=criterion,
              optimizer=optimizer,
              epoch=epoch)

        # Save checkpoint
        save_checkpoint(epoch, model, optimizer)
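The learning-rate schedule implied by decay_lr_at and decay_lr_to can be tabulated with a small sketch. It assumes that adjust_learning_rate multiplies the current rate by decay_lr_to, as the hyperparameter comment suggests; `lr_at_epoch` is a name invented for this illustration and is not part of utils.py.

```python
def lr_at_epoch(epoch, base_lr=1e-3, decay_at=(150, 190), decay_to=0.1):
    """Learning rate in effect during the given epoch (0-based)."""
    lr = base_lr
    for boundary in decay_at:
        if epoch >= boundary:
            lr *= decay_to   # each boundary passed scales the rate once
    return lr


for epoch in (0, 149, 150, 189, 190, 229):
    print(epoch, lr_at_epoch(epoch))
```

Under this assumption the run spends 150 epochs at 1e-3, 40 epochs at 1e-4, and the last 40 epochs at 1e-5.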
The training logic for a single epoch is wrapped in its own function, implemented as follows:

def train(train_loader, model, criterion, optimizer, epoch):
    """
    One epoch's training.

    :param train_loader: DataLoader for training data
    :param model: model
    :param criterion: MultiBox loss
    :param optimizer: optimizer
    :param epoch: epoch number
    """
    model.train()  # training mode enables dropout

    batch_time = AverageMeter()  # forward prop. + back prop. time
    data_time = AverageMeter()  # data loading time
    losses = AverageMeter()  # loss

    start = time.time()

    # Batches
    for i, (images, boxes, labels, _) in enumerate(train_loader):
        data_time.update(time.time() - start)

        # Move to default device
        images = images.to(device)  # (batch_size (N), 3, 224, 224)
        boxes = [b.to(device) for b in boxes]
        labels = [l.to(device) for l in labels]

        # Forward prop.
        predicted_locs, predicted_scores = model(images)  # (N, 441, 4), (N, 441, n_classes)

        # Loss
        loss = criterion(predicted_locs, predicted_scores, boxes, labels)  # scalar

        # Backward prop.
        optimizer.zero_grad()
        loss.backward()

        # Update model
        optimizer.step()

        losses.update(loss.item(), images.size(0))
        batch_time.update(time.time() - start)

        start = time.time()

        # Print status
        if i % print_freq == 0:
            print('Epoch: [{0}][{1}/{2}]\t'
                  'Batch Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
                  'Data Time {data_time.val:.3f} ({data_time.avg:.3f})\t'
                  'Loss {loss.val:.4f} ({loss.avg:.4f})\t'.format(epoch,
                                                                  i, 
                                                                  len(train_loader),
                                                                  batch_time=batch_time,
                                                                  data_time=data_time, 
                                                                  loss=losses))
    del predicted_locs, predicted_scores, images, boxes, labels  # free some memory since their histories may be stored
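The train function relies on an AverageMeter helper imported from utils.py, which is not shown in this post. A minimal sketch consistent with how it is used above (update(value, n), plus the .val and .avg attributes read by the print statement) might look like the following; the actual implementation in utils.py may differ.

```python
class AverageMeter:
    """Track the most recent value and the running average of a metric."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0.0    # most recent value
        self.sum = 0.0    # weighted sum of all values
        self.count = 0    # total weight (for the loss: number of images seen)
        self.avg = 0.0    # running average

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


losses = AverageMeter()
losses.update(6.0, 32)   # a batch of 32 images with mean loss 6.0
losses.update(4.0, 16)   # a smaller final batch with mean loss 4.0
print(losses.val, losses.avg)
```

Weighting by the batch size keeps the average correct when the last batch of an epoch is smaller than the others.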

With the code written, we can start training. The output looks like this:

$ python train.py

Loaded base model.
Epoch: [0][0/518] Batch Time 6.556 (6.556) Data Time 3.879 (3.879) Loss 27.7129 (27.7129)
Epoch: [0][100/518] Batch Time 0.185 (0.516) Data Time 0.000 (0.306) Loss 6.1569 (8.4569)
Epoch: [0][200/518] Batch Time 1.251 (0.487) Data Time 1.065 (0.289) Loss 6.3175 (7.3364)
Epoch: [0][300/518] Batch Time 1.207 (0.476) Data Time 1.019 (0.282) Loss 5.6598 (6.9211)
Epoch: [0][400/518] Batch Time 1.174 (0.470) Data Time 0.988 (0.278) Loss 6.2519 (6.6751)
Epoch: [0][500/518] Batch Time 1.303 (0.468) Data Time 1.117 (0.276) Loss 5.4864 (6.4894)
Epoch: [1][0/518] Batch Time 1.061 (1.061) Data Time 0.871 (0.871) Loss 5.7480 (5.7480)
Epoch: [1][100/518] Batch Time 0.189 (0.227) Data Time 0.000 (0.037) Loss 5.8557 (5.6431)
Epoch: [1][200/518] Batch Time 0.188 (0.225) Data Time 0.000 (0.036) Loss 5.2024 (5.5586)
Epoch: [1][300/518] Batch Time 0.190 (0.225) Data Time 0.000 (0.036) Loss 5.5348 (5.4957)
Epoch: [1][400/518] Batch Time 0.188 (0.226) Data Time 0.000 (0.036) Loss 5.2623 (5.4442)
Epoch: [1][500/518] Batch Time 0.190 (0.225) Data Time 0.000 (0.035) Loss 5.3105 (5.3835)
Epoch: [2][0/518] Batch Time 1.156 (1.156) Data Time 0.967 (0.967) Loss 5.3755 (5.3755)
Epoch: [2][100/518] Batch Time 0.206 (0.232) Data Time 0.016 (0.042) Loss 5.6532 (5.1418)
Epoch: [2][200/518] Batch Time 0.197 (0.226) Data Time 0.007 (0.036) Loss 4.6704 (5.0717)
All that is left is to wait.

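2 Post-processing

2.1 Decoding the box information

As mentioned earlier, the model does not predict the object boxes directly; it predicts encoded offsets relative to the anchors. The first post-processing step is therefore to decode the output of the regression head to obtain the actual predicted boxes.

What else does post-processing need to do? Because we preset a large number of priors, prediction produces many highly overlapping detection boxes around each object, whereas we want to keep only one sufficiently accurate box per object. Some algorithm is needed to remove the duplicates. This algorithm is called NMS, which we now describe in detail.

2.2 NMS (non-maximum suppression)

The NMS algorithm runs roughly as follows:

Group the boxes by class and process each class in turn.

Within the current class, sort the boxes by classification confidence and set a minimum confidence threshold, for example 0.05; boxes below the threshold are discarded immediately.

Take the box with the highest probability as the candidate. Every other box whose IoU with the candidate exceeds a threshold (chosen by you, for example 0.5) is considered suppressed and is deleted from the array of remaining boxes.

Then find the box with the second-highest probability among the remaining boxes, and suppress every other box whose IoU with it exceeds the threshold.

Repeat this process until all remaining boxes have been visited; the boxes that were never suppressed are the final detections.

Figure 2-29 The NMS process

2.3 Code implementation:

The whole post-processing procedure is implemented in the detect_objects function of the tiny_detector class in model.py.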


def detect_objects(self, predicted_locs, predicted_scores, min_score, max_overlap, top_k):
    """                                                                                                                                                       
    Decipher the 441 locations and class scores (output of the tiny_detector) to detect objects.

    For each class, perform Non-Maximum Suppression (NMS) on boxes that are above a minimum threshold.

    :param predicted_locs: predicted locations/boxes w.r.t the 441 prior boxes, a tensor of dimensions (N, 441, 4)
    :param predicted_scores: class scores for each of the encoded locations/boxes, a tensor of dimensions (N, 441, n_classes)
    :param min_score: minimum threshold for a box to be considered a match for a certain class
    :param max_overlap: maximum overlap two boxes can have so that the one with the lower score is not suppressed via NMS
    :param top_k: if there are a lot of resulting detection across all classes, keep only the top 'k'
    :return: detections (boxes, labels, and scores), lists of length batch_size
    """
    batch_size = predicted_locs.size(0)
    n_priors = self.priors_cxcy.size(0)
    predicted_scores = F.softmax(predicted_scores, dim=2)  # (N, 441, n_classes)

    # Lists to store final predicted boxes, labels, and scores for all images in batch
    all_images_boxes = list()
    all_images_labels = list()
    all_images_scores = list()

    assert n_priors == predicted_locs.size(1) == predicted_scores.size(1)

    for i in range(batch_size):
        # Decode object coordinates from the form we regressed predicted boxes to
        decoded_locs = cxcy_to_xy(                                                                                                                            
            gcxgcy_to_cxcy(predicted_locs[i], self.priors_cxcy))  # (441, 4), these are fractional pt. coordinates

        # Lists to store boxes and scores for this image
        image_boxes = list()
        image_labels = list()
        image_scores = list()

        max_scores, best_label = predicted_scores[i].max(dim=1)  # (441)

        # Check for each class
        for c in range(1, self.n_classes):
            # Keep only predicted boxes and scores where scores for this class are above the minimum score
            class_scores = predicted_scores[i][:, c]  # (441)
            score_above_min_score = class_scores > min_score  # torch.uint8 (byte) tensor, for indexing
            n_above_min_score = score_above_min_score.sum().item()
            if n_above_min_score == 0:
                continue
            class_scores = class_scores[score_above_min_score]  # (n_qualified), n_min_score <= 441
            class_decoded_locs = decoded_locs[score_above_min_score]  # (n_qualified, 4)

            # Sort predicted boxes and scores by scores
            class_scores, sort_ind = class_scores.sort(dim=0, descending=True)  # (n_qualified), (n_min_score)
            class_decoded_locs = class_decoded_locs[sort_ind]  # (n_min_score, 4)

            # Find the overlap between predicted boxes
            overlap = find_jaccard_overlap(class_decoded_locs, class_decoded_locs)  # (n_qualified, n_min_score)

            # Non-Maximum Suppression (NMS)

            # A torch.uint8 (byte) tensor to keep track of which predicted boxes to suppress
            # 1 implies suppress, 0 implies don't suppress
            suppress = torch.zeros((n_above_min_score), dtype=torch.uint8).to(device)  # (n_qualified)

            # Consider each box in order of decreasing scores
            for box in range(class_decoded_locs.size(0)):
                # If this box is already marked for suppression
                if suppress[box] == 1:
                    continue

                # Suppress boxes whose overlaps (with current box) are greater than maximum overlap
                # Find such boxes and update suppress indices
                suppress = torch.max(suppress, (overlap[box] > max_overlap).to(torch.uint8))
                # The max operation retains previously suppressed boxes, like an 'OR' operation

                # Don't suppress this box, even though it has an overlap of 1 with itself
                suppress[box] = 0

            # Store only unsuppressed boxes for this class
            image_boxes.append(class_decoded_locs[1 - suppress])
            image_labels.append(torch.LongTensor((1 - suppress).sum().item() * [c]).to(device))
            image_scores.append(class_scores[1 - suppress])

        # If no object in any class is found, store a placeholder for 'background'
        if len(image_boxes) == 0:
            image_boxes.append(torch.FloatTensor([[0., 0., 1., 1.]]).to(device))
            image_labels.append(torch.LongTensor([0]).to(device))
            image_scores.append(torch.FloatTensor([0.]).to(device))

        # Concatenate into single tensors
        image_boxes = torch.cat(image_boxes, dim=0)  # (n_objects, 4)
        image_labels = torch.cat(image_labels, dim=0)  # (n_objects)
        image_scores = torch.cat(image_scores, dim=0)  # (n_objects)
        n_objects = image_scores.size(0)

        # Keep only the top k objects
        if n_objects > top_k:
            image_scores, sort_ind = image_scores.sort(dim=0, descending=True)
            image_scores = image_scores[:top_k]  # (top_k)
            image_boxes = image_boxes[sort_ind][:top_k]  # (top_k, 4)
            image_labels = image_labels[sort_ind][:top_k]  # (top_k)

        # Append to lists that store predicted boxes and scores for all images
        all_images_boxes.append(image_boxes)
        all_images_labels.append(image_labels)
        all_images_scores.append(image_scores)

    return all_images_boxes, all_images_labels, all_images_scores  # lists of length batch_size

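The NMS part of our post-processing code is admittedly somewhat convoluted. For comparison, the NMS implementation from Fast R-CNN is more concise and clearer: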


# --------------------------------------------------------
# Fast R-CNN
# Copyright (c) 2015 Microsoft
# Licensed under The MIT License [see LICENSE for details]
# Written by Ross Girshick
# --------------------------------------------------------
import numpy as np
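# dets: detected boxes with their scores, one row per box as (x1, y1, x2, y2, score)
# thresh: IoU threshold above which a lower-scoring box is suppressed

def nms(dets, thresh):
    # Box coordinates
    x1 = dets[:, 0]
    y1 = dets[:, 1]
    x2 = dets[:, 2]
    y2 = dets[:, 3]
    # Box scores
    scores = dets[:, 4]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)  # area of each box (pixel-inclusive)
    order = scores.argsort()[::-1]          # box indices sorted by descending score
    keep = []                               # indices of the boxes that are kept
    while order.size > 0:
        i = order[0]    # index of the highest-scoring remaining box
        keep.append(i)  # keep the highest-scoring box of this round
        # Compute the IoU between the current box and all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        w = np.maximum(0.0, xx2 - xx1 + 1)  # width of the intersection
        h = np.maximum(0.0, yy2 - yy1 + 1)  # height of the intersection
        inter = w * h
        ovr = inter / (areas[i] + areas[order[1:]] - inter)
        # Keep only the boxes whose IoU with the current box is at most the threshold
        inds = np.where(ovr <= thresh)[0]
        order = order[inds + 1]  # +1 because ovr was computed against order[1:]
    return keep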
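
To make the overlap test concrete, here is a minimal pure-Python sketch of the same greedy procedure on three hand-made boxes. The helper names `iou_inclusive` and `greedy_nms` and the box values are illustrative assumptions, not part of the Fast R-CNN code; the sketch keeps the same pixel-inclusive "+1" convention and the same `<= thresh` rule.

```python
def iou_inclusive(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes, using the pixel-inclusive "+1" convention."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1 + 1) * (ay2 - ay1 + 1)
    area_b = (bx2 - bx1 + 1) * (by2 - by1 + 1)
    # Width and height of the intersection, clamped at 0 when the boxes do not overlap
    w = max(0, min(ax2, bx2) - max(ax1, bx1) + 1)
    h = max(0, min(ay2, by2) - max(ay1, by1) + 1)
    inter = w * h
    return inter / (area_a + area_b - inter)


def greedy_nms(dets, thresh):
    """dets: list of (x1, y1, x2, y2, score). Returns the indices of the kept boxes."""
    order = sorted(range(len(dets)), key=lambda k: dets[k][4], reverse=True)
    keep = []
    while order:
        i = order[0]  # highest-scoring remaining box
        keep.append(i)
        # Drop every remaining box that overlaps box i by more than thresh
        order = [j for j in order[1:]
                 if iou_inclusive(dets[i][:4], dets[j][:4]) <= thresh]
    return keep


# Hypothetical detections for illustration only
dets = [
    (0, 0, 9, 9, 0.9),      # box 0
    (1, 1, 10, 10, 0.8),    # box 1: overlaps box 0 with IoU = 81 / 119
    (20, 20, 29, 29, 0.7),  # box 2: no overlap with the others
]
print(greedy_nms(dets, thresh=0.5))  # [0, 2]
```

Box 1 is dropped because its IoU with box 0 (81 / 119, about 0.68) exceeds 0.5; raising the threshold to 0.7 would keep all three boxes.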

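3 Single-Image Inference

Once the model has been trained, let's look at how to run inference on a single image and obtain detection results.

First we import the necessary Python packages and load the trained model weights.

Next we define the preprocessing. For the best predictions, preprocessing at test time must match what was used during training, with only the data-augmentation transforms removed.

So the preprocessing needed here is:

Resize the image to 224 * 224
Convert it to a Tensor and divide by 255
Normalize by subtracting the mean and dividing by the standard deviation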


# Set detect transforms (It's important to be consistent with training)
resize = transforms.Resize((224, 224))
to_tensor = transforms.ToTensor()
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
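
As a rough illustration of what the last two steps do to a single pixel, the sketch below reproduces the arithmetic in plain Python: scale an 8-bit value to [0, 1], then subtract the per-channel mean and divide by the per-channel standard deviation. The function name `normalize_pixel` is a hypothetical helper for this example; the mean and std values are the ones used in the transforms above.

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Map one 8-bit RGB pixel to the values the network receives."""
    scaled = [v / 255.0 for v in rgb]  # the "divide by 255" step
    # Per channel: subtract the mean, then divide by the standard deviation
    return [(s - m) / d for s, m, d in zip(scaled, MEAN, STD)]


print(normalize_pixel((0, 0, 0)))        # a black pixel
print(normalize_pixel((255, 255, 255)))  # a white pixel
```

Because the divisor is the standard deviation rather than the variance, a black pixel in the red channel maps to -0.485 / 0.229 and a white one to (1 - 0.485) / 0.229.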

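Now we can run inference. The process is simple, and the core flow is:

Read an image
Preprocess it
Run the model
Post-process the model's predictions

The core code is as follows: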



# Transform the image
image = normalize(to_tensor(resize(original_image)))

# Move to default device
image = image.to(device)

# Forward prop.
predicted_locs, predicted_scores = model(image.unsqueeze(0))

# Post process, get the final detect objects from our tiny detector output
det_boxes, det_labels, det_scores = model.detect_objects(predicted_locs, predicted_scores, min_score=min_score, max_overlap=max_overlap, top_k=top_k)

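The detect_objects function post-processes the model's predictions. It does two main things: first it decodes the model output into predicted boxes with concrete position information, then it applies NMS per class to all predicted boxes to filter out redundant detections, which is what we covered in the previous subsection.

Finally, we draw the resulting detection boxes and obtain a detection result like the figure below:

See the detect.py script for the complete code. Below are predictions on some more images from the VOC test set:

As you can see, our tiny_detector model does quite well on some simple test images. Predictions on some harder images are shown below:

As you can see, when faced with slightly more challenging images, our detector starts to expose all kinds of problems, including but not limited to:

Missed detections (many bottles in the right image are not detected)
False detections (a bottle is falsely detected in the right image)
Duplicate detections (the car in the left image and the person at the front of the right image)
Inaccurate localization, especially for small objects
Try running detect.py and see how your trained model performs. What problems do you observe, and do you have any ideas for improving it?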

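4 Evaluation on the VOC Test Set

4.1 Introducing the mAP metric

Take binary classification, the simplest kind of classification model, as an example. For this kind of problem, the model ultimately has to decide whether each sample is 0 or 1, or in other words positive or negative. From how the samples were collected, we know directly which samples are truly positive and which are truly negative. By running the classifier on the samples, we also know which ones the model considers positive and which negative. This gives us four basic metrics, which we call first-level metrics (the lowest level):

1) The number of samples whose true value is positive and which the model predicts as positive (True Positive = TP)

2) The number of samples whose true value is positive and which the model predicts as negative (False Negative = FN): this is a Type II error in statistics

3) The number of samples whose true value is negative and which the model predicts as positive (False Positive = FP): this is a Type I error in statistics

4) The number of samples whose true value is negative and which the model predicts as negative (True Negative = TN)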
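
The four counts can be tallied in a few lines. The following is a minimal sketch; the function name `confusion_counts` and the label lists are made up for illustration, with 1 meaning positive and 0 meaning negative.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels where 1 = positive, 0 = negative."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1  # positive, predicted positive
        elif truth == 1 and pred == 0:
            fn += 1  # positive, predicted negative (Type II error)
        elif truth == 0 and pred == 1:
            fp += 1  # negative, predicted positive (Type I error)
        else:
            tn += 1  # negative, predicted negative
    return tp, fn, fp, tn


# Hypothetical ground truth and predictions
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 3)
```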

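In machine learning, a confusion matrix is also called a contingency table or an error matrix. It is a specific table layout that visualizes the performance of an algorithm, usually in supervised learning (in unsupervised learning it is usually called a matching matrix). Each column represents a predicted class and each row represents an actual class. The name comes from the fact that it makes it easy to see whether classes are being confused (that is, one class being predicted as another).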
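
The row and column convention can be sketched for the multi-class case as follows. The function name `confusion_matrix` and the eight labelled samples are invented for this illustration and are not the data of any table in this tutorial; rows are actual classes and columns are predicted classes.

```python
def confusion_matrix(actual, predicted, classes):
    """Rows index the actual class, columns index the predicted class."""
    index = {name: k for k, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


# Hypothetical labels for illustration only
classes = ["cat", "dog", "rabbit"]
actual = ["cat", "cat", "cat", "dog", "dog", "rabbit", "rabbit", "rabbit"]
predicted = ["cat", "cat", "dog", "dog", "rabbit", "rabbit", "rabbit", "dog"]

for name, row in zip(classes, confusion_matrix(actual, predicted, classes)):
    print(name, row)
```

The diagonal holds the correctly classified samples, and each off-diagonal entry shows which class a sample was mistaken for.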

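Example: suppose we have a system that classifies cats, dogs, and rabbits. The confusion matrix summarizes the algorithm's test results so that its performance can be analyzed further. Suppose there are 27 animals in total: 8 cats, 6 dogs, and 13 rabbits. The resulting confusion matrix is shown in the table below:

Table 3-30
Second-level metrics: the confusion matrix holds counts, and with large amounts of data it is hard to judge a model by raw counts alone. So four further metrics are derived from the basic counts. I call them second-level metrics (obtained by adding, subtracting, multiplying, and dividing the lowest-level metrics):

1) Accuracy: applies to the model as a whole

2) Precision

3) Sensitivity: this is the same as Recall

4) Specificity

The table below summarizes the definition, calculation, and interpretation of these four metrics:

Table 3-31
With these four second-level metrics, the counts in the confusion matrix are turned into ratios between 0 and 1, which makes standardized comparison easier.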
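
For reference, the four ratios can be computed from the first-level counts as in the sketch below, which uses the standard definitions of these metrics. The function name `secondary_metrics` and the example counts are assumptions made for illustration; a zero denominator is mapped to 0.0 here, which is one possible convention.

```python
def secondary_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall (sensitivity), and specificity from raw counts."""
    def ratio(num, den):
        return num / den if den else 0.0  # assumed convention for empty denominators

    return {
        "accuracy": ratio(tp + tn, tp + fn + fp + tn),  # correct / all samples
        "precision": ratio(tp, tp + fp),                # correct among predicted positives
        "recall": ratio(tp, tp + fn),                   # found among actual positives
        "specificity": ratio(tn, tn + fp),              # correct among actual negatives
    }


# Hypothetical counts for illustration only
for name, value in secondary_metrics(tp=40, fn=10, fp=20, tn=30).items():
    print(name, round(value, 4))
```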

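Third-level metric: this metric is called the F1 Score. It is computed as:

F1 Score = 2PR / (P + R)

where P is Precision and R is Recall. The F1 Score combines the results of Precision and Recall. It ranges from 0 to 1, where 1 means the best possible model output and 0 the worst.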
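
A quick numeric check of the formula, as a minimal sketch: the function name `f1_score` is chosen for this example, and returning 0.0 when both inputs are zero is an assumed convention to avoid dividing by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both are zero
    return 2 * precision * recall / (precision + recall)


print(f1_score(1.0, 1.0))  # 1.0, the best case
print(f1_score(0.5, 1.0))  # pulled toward the lower of the two values
print(f1_score(0.0, 0.0))  # 0.0, the worst case
```

Since F1 is a harmonic mean, a model cannot score well by maximizing only one of precision or recall.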

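AP stands for Average Precision.

mAP stands for Mean Average Precision: the mean of the per-class AP values, used in object detection as the measure of detection accuracy.

How is AP computed in object detection? This is where the P-R curve comes in: a two-dimensional curve with precision on the vertical axis and recall on the horizontal axis. It is drawn from the precision and recall obtained at different thresholds, as shown in the figure below:

Figure 3-32 P-R curve
The overall trend of the P-R curve is that the higher the precision, the lower the recall. When recall reaches 1, the threshold equals the score of the lowest-scoring positive sample; at that point, the number of positive samples divided by the number of all samples scoring at or above that threshold gives the lowest precision value. The area under the P-R curve is the AP value, and in general the better the classifier, the higher its AP.

Summary: in object detection, a P-R curve can be drawn for each class from its recall and precision; AP is the area under that curve, and mAP is the mean of the AP over all classes. (This describes how mAP is computed for the VOC dataset; the COCO dataset computes it slightly differently.)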
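
The summary above can be sketched in code. The version below ranks one class's detections by score, builds the precision and recall at each rank, and sums the area under the curve after making precision monotonically non-increasing. This is only one way to measure the area and is an assumption of this sketch: the original VOC2007 protocol uses an 11-point interpolated variant, and eval.py may differ. The function name, the input format (a score plus a flag saying whether the detection matched a ground-truth object), and all numbers are hypothetical.

```python
def average_precision(scores, is_true_positive, n_positives):
    """Area under the P-R curve for one class (all-point version, an assumption)."""
    if n_positives == 0:
        return 0.0
    ranked = sorted(zip(scores, is_true_positive), key=lambda pair: pair[0], reverse=True)
    precisions, recalls = [], []
    tp = fp = 0
    for _, hit in ranked:  # lower the score threshold one detection at a time
        if hit:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / n_positives)
    # Replace each precision by the best precision at any equal or higher recall
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum the rectangles under the curve wherever recall increases
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Hypothetical detections for two classes
aps = {
    "cat": average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, True], n_positives=4),
    "dog": average_precision([0.9, 0.5], [True, True], n_positives=2),
}
mean_ap = sum(aps.values()) / len(aps)  # mAP: mean of the per-class AP values
print(aps, mean_ap)
```

In the "cat" example one of the four ground-truth objects is never detected, so recall stops at 0.75 and the AP cannot reach 1 even if every remaining detection were correct.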

4.2 Evaluating Tiny-Detection on the VOC test set

Run the eval.py script to evaluate the model on the VOC2007 test set. The results are as follows:

$ python eval.py

Evaluating: 100%|███████████████████████████████| 78/78 [00:57<00:00, 1.35it/s]
{'aeroplane': 0.6086561679840088,
'bicycle': 0.7144593596458435,
'bird': 0.5847545862197876,
'boat': 0.44902321696281433,
'bottle': 0.2160634696483612,
'bus': 0.7212041616439819,
'car': 0.629608154296875,
'cat': 0.8124480843544006,
'chair': 0.3599272668361664,
'cow': 0.5980824828147888,
'diningtable': 0.6459739804267883,
'dog': 0.7577021718025208,
'horse': 0.7861635088920593,
'motorbike': 0.702280580997467,
'person': 0.5821948051452637,
'pottedplant': 0.2793791592121124,
'sheep': 0.5655995607376099,
'sofa': 0.708049476146698,
'train': 0.7575671672821045,
'tvmonitor': 0.5641061663627625}