A Detailed Guide to the PASCAL VOC Dataset: Downloading, Parsing, and Visualizing It with Python and PyTorch [Object Detection + Class Segmentation]

Original article, last modified 2024-08-27 14:04:19


Tags:

#python #pytorch #object-detection

First published 2024-08-27 12:12:25


Introduction to the PASCAL VOC Dataset

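The PASCAL VOC dataset is a benchmark for object detection and segmentation in computer vision. The data and the associated challenge grew out of the EU-funded PASCAL2 Network of Excellence on Pattern Analysis, Statistical Modelling and Computational Learning. The challenge ran annually from 2005 to 2012 and was discontinued after 2012. PASCAL VOC is therefore a family of datasets released year by year over those eight years, abbreviated VOC2005, VOC2006, and so on. Note that annotations for the test set were no longer released publicly after VOC2007. The images come from the flickr website and the Microsoft Research Cambridge (MSRC) dataset, so any use must comply with flickr's terms of use.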

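* In the author's view, the organizers chose the name PASCAL rather than PASMCL, the literal initials of "pattern analysis, statistical modelling, and computational learning", presumably because PASCAL is shorter and easier to recognize: the pascal is the SI unit of pressure.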

PASCAL VOC homepage: http://host.robots.ox.ac.uk/pascal/VOC/
Visual Object Classes Challenge 2012 (VOC2012)：http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html
The PASCAL Visual Object Classes Challenge 2007：http://host.robots.ox.ac.uk/pascal/VOC/voc2007/index.html

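Summary of the PASCAL VOC Datasets by Year

Year	Statistics	Tasks	Notes
2005	4 classes: bicycles, cars, motorbikes, people; 1578 images, 2209 annotated objects; subsets: train, validation, test	classification, detection
2006	10 classes: bicycle, bus, car, cat, cow, dog, horse, motorbike, person, sheep; 2618 images, 4754 annotated objects; subsets: train, validation, test	classification, detection
2007	20 classes: person; bird, cat, cow, dog, horse, sheep; aeroplane, bicycle, boat, bus, car, motorbike, train; bottle, chair, dining table, potted plant, sofa, tv/monitor; 9963 images, 24640 annotated objects; subsets: train, validation, test	classification, detection, segmentation, person layout	Last year in which the test annotations were released publicly; the classes were fixed at 20; a `truncation` flag was added to the annotations; the evaluation measure changed from ROC-AUC to AP
2008	Same 20 classes as 2007; train+validation and test are split roughly 1:1; train/val: 4340 images, 10363 annotated objects; subsets: train, validation	classification, detection, segmentation, person layout	An `occlusion` flag was added to the annotations; the segmentation and person layout subsets include the VOC2007 data
2009	Same 20 classes as 2007; train/val: 7054 images, 17218 ROI annotations, 3211 segmentations; subsets: train, validation	classification, detection, segmentation, person layout	From this year on, the data consists of the previous years' images plus new images
2010	Same 20 classes as 2007; train/val: 10103 images, 23374 ROI annotations, 4203 segmentations; subsets: train, validation	classification, detection, segmentation, person layout, action classification	The method of computing AP changed from TREC-style sampling to using all data points
2011	Same 20 classes as 2007; train/val: 11530 images, 27450 ROI annotations, 5034 segmentations; subsets: train, validation	classification, detection, segmentation, person layout, action classification	Action classification was extended to "10 classes + other"; the layout annotation is not complete, so not every person in every image is annotated
2012	Same 20 classes as 2007; train/val: 11530 images, 27450 ROI annotations, 6929 segmentations; subsets: train, validation	classification, detection, segmentation, person layout, action classification	The action classification data was annotated with a reference point on the person's body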
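
The change in AP computation noted for 2010 can be made concrete with a small sketch. The two functions below are not the official devkit code; they are a minimal illustration that assumes `recall` and `precision` are cumulative values computed over detections sorted by descending confidence. The function names and the toy numbers are made up for this example.

```python
def ap_11_point(recall, precision):
    """TREC-style AP (VOC2007 and earlier): average the best precision
    reached at recall >= 0.0, 0.1, ..., 1.0."""
    total = 0.0
    for i in range(11):
        threshold = i / 10
        candidates = [p for r, p in zip(recall, precision) if r >= threshold]
        total += max(candidates, default=0.0)  # 0 if this recall is never reached
    return total / 11


def ap_all_points(recall, precision):
    """All-point AP (VOC2010 and later): area under the precision envelope."""
    mrec = [0.0] + list(recall) + [1.0]
    mpre = [0.0] + list(precision) + [0.0]
    # Make precision monotonically non-increasing, scanning from the right.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum the rectangles wherever recall increases.
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1] for i in range(len(mrec) - 1))


if __name__ == "__main__":
    # Toy ranking: TP, FP, TP with two ground-truth objects.
    recall = [0.5, 0.5, 1.0]
    precision = [1.0, 0.5, 2 / 3]
    print(ap_11_point(recall, precision), ap_all_points(recall, precision))
```

On the same toy ranking the two measures differ slightly, which is why results computed with the 2007 protocol and the 2010+ protocol should not be compared directly.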

Downloading the Dataset

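Across VOC2005 and VOC2006, the number of images and the number of object classes kept changing. The classes were only fixed with VOC2007, whose annotations are all fairly complete; VOC2007 is also the last release that ships a reasonably complete test subset. For VOC2009 through VOC2012 the test subsets were not publicly released, while the number of images and the annotation quality kept improving. VOC2012 is thus the final release in the series and the most complete public PASCAL VOC dataset without a test subset.
For this reason, VOC2007 and VOC2012 are the two releases used in practice. The train/val subsets of VOC2007 and VOC2012 are merged for training and validation during model development, giving 16551 images in total (5011 from VOC2007 plus 11540 from VOC2012). The VOC2007 test subset, with 4952 images, is then used on its own for testing.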
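
As a rough sketch of what merging the two train/val subsets means at the file level, the snippet below concatenates the image-id lists stored in `ImageSets/Main/<image_set>.txt`. The helper names `read_image_ids` and `merge_splits` are hypothetical, the devkit root is assumed to contain `VOC2007/` and `VOC2012/` side by side, and the demo ids are invented.

```python
from pathlib import Path
import tempfile


def read_image_ids(devkit_root, year, image_set):
    """Read the image ids listed in VOC<year>/ImageSets/Main/<image_set>.txt."""
    split_file = Path(devkit_root) / f"VOC{year}" / "ImageSets" / "Main" / f"{image_set}.txt"
    lines = split_file.read_text().splitlines()
    return [line.split()[0] for line in lines if line.strip()]


def merge_splits(devkit_root, splits):
    """Concatenate several (year, image_set) splits into (year, image_id) pairs."""
    merged = []
    for year, image_set in splits:
        ids = read_image_ids(devkit_root, year, image_set)
        merged.extend((year, image_id) for image_id in ids)
    return merged


if __name__ == "__main__":
    # Build a tiny fake devkit so the sketch runs without downloading anything.
    with tempfile.TemporaryDirectory() as tmp:
        for year, ids in (("2007", ["000005", "000007"]), ("2012", ["2008_000002"])):
            main_dir = Path(tmp) / f"VOC{year}" / "ImageSets" / "Main"
            main_dir.mkdir(parents=True)
            (main_dir / "trainval.txt").write_text("\n".join(ids) + "\n")
        print(merge_splits(tmp, [("2007", "trainval"), ("2012", "trainval")]))
```

With the PyTorch dataset classes, the same idea would typically be expressed by concatenating the two dataset objects instead of the id lists.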

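Downloading from the official site

The archives can be downloaded from the official VOC2007 and VOC2012 pages linked in the introduction above.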

Downloading via the PyTorch API

torchvision.datasets.VOCDetection: https://pytorch.org/vision/0.17/generated/torchvision.datasets.VOCDetection.html
torchvision.datasets.VOCSegmentation: https://pytorch.org/vision/main/generated/torchvision.datasets.VOCSegmentation.html

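Parsing the Dataset

Object Detection Dataset

After download, the VOC2007 dataset has the following folder structure (this should be the case whether you download directly by URL or let the PyTorch API download and extract it):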

VOCdevkit/VOC2007/
├─ Annotations/
│  ├─ 000001.xml
│  ├─ 000002.xml
│  └─ ...
├─ ImageSets/
│  ├─ Layout/
│  │  ├─ train.txt
│  │  ├─ trainval.txt
│  │  └─ val.txt
│  ├─ Main/
│  │  ├─ aeroplane_train.txt
│  │  ├─ aeroplane_trainval.txt
│  │  ├─ aeroplane_val.txt
│  │  └─ ...
│  └─ Segmentation/
│     ├─ train.txt
│     ├─ trainval.txt
│     └─ val.txt
├─ JPEGImages/
│  ├─ 000001.jpg
│  ├─ 000002.jpg
│  └─ ...
├─ SegmentationClass/
│  ├─ 000001.png
│  ├─ 000002.png
│  └─ ...
└─ SegmentationObject/
   ├─ 000001.png
   ├─ 000002.png
   └─ ...
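
Given this layout, the files that belong to one image id can be located with a few path joins. The helper below is only a sketch: `voc_paths` is a hypothetical name, and it assumes the archive was extracted under `<root>/VOCdevkit/VOC<year>/` as in the tree above.

```python
from pathlib import Path


def voc_paths(root, year, image_id):
    """Return the expected file locations for one image id."""
    base = Path(root) / "VOCdevkit" / f"VOC{year}"
    return {
        "image": base / "JPEGImages" / f"{image_id}.jpg",
        "annotation": base / "Annotations" / f"{image_id}.xml",
        # Masks exist only for images listed in ImageSets/Segmentation.
        "class_mask": base / "SegmentationClass" / f"{image_id}.png",
        "object_mask": base / "SegmentationObject" / f"{image_id}.png",
    }


if __name__ == "__main__":
    for kind, path in voc_paths("voc", "2007", "000001").items():
        print(kind, path.as_posix())
```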

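For object detection, the torchvision.datasets.VOCDetection API downloads, extracts, and reads the data. It loads the JPEG images from the JPEGImages folder and returns them as PIL.Image objects, and it loads the XML files from the Annotations subfolder and returns them as Python dicts that serve as the labels.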
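
To make the shape of an annotation more tangible, here is a minimal sketch that parses a VOC-style XML string with the standard library alone. It is not how torchvision does it internally; `parse_annotation` is a hypothetical helper, the flattened output layout is a choice made for this example, and the sample XML uses invented values. The tag names (`filename`, `size`, `object`, `name`, `pose`, `bndbox`, `xmin`, ...) are the ones accessed by the script in this article.

```python
import xml.etree.ElementTree as ET

# Invented sample in the VOC annotation layout (values are illustrative only).
SAMPLE_XML = """<annotation>
  <filename>demo.jpg</filename>
  <size><width>400</width><height>300</height><depth>3</depth></size>
  <object>
    <name>dog</name><pose>Left</pose>
    <bndbox><xmin>40</xmin><ymin>60</ymin><xmax>200</xmax><ymax>250</ymax></bndbox>
  </object>
  <object>
    <name>person</name><pose>Frontal</pose>
    <bndbox><xmin>210</xmin><ymin>20</ymin><xmax>390</xmax><ymax>290</ymax></bndbox>
  </object>
</annotation>"""


def parse_annotation(xml_text):
    """Parse one VOC-style annotation into a plain dict with integer boxes."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    objects = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        objects.append({
            "name": obj.findtext("name"),
            "pose": obj.findtext("pose"),
            # (xmin, ymin, xmax, ymax) in pixel coordinates
            "bbox": tuple(int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")),
        })
    return {
        "filename": root.findtext("filename"),
        "size": tuple(int(size.findtext(tag)) for tag in ("width", "height", "depth")),
        "objects": objects,
    }


if __name__ == "__main__":
    print(parse_annotation(SAMPLE_XML))
```

Unlike the dict returned by VOCDetection, where every leaf value is a string, this sketch converts sizes and box coordinates to integers up front and always returns the objects as a list.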

A Python script that loads the images and displays the detection annotations:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

'''
@File: **
@Date: 2024-08-27
@Author: KRISNAT
@Version: 0.0.0
@Email: **
@Copyright: (C)Copyright 2024, KRISNAT
@Desc: None
'''
import cv2
import PIL.Image
import matplotlib.pyplot as plt
from torchvision.datasets import VOCDetection
import numpy as np

RED_BGR = (0, 0, 255)  # OpenCV uses BGR channel order


def draw_bndbox(image: PIL.Image.Image, objects=None) -> PIL.Image.Image:
    """Draw the bounding boxes of a PASCAL VOC image and return a new image."""
    if objects is None:
        return image
    # Depending on the torchvision version, an image with a single object may
    # yield a dict instead of a list, so normalize to a list first.
    if isinstance(objects, dict):
        objects = [objects]
    elif not isinstance(objects, list):
        raise TypeError("objects must be a dict or a list of dicts.")

    # Convert the PIL RGB image to an OpenCV BGR array
    img_cv2 = cv2.cvtColor(np.asarray(image.convert("RGB")), cv2.COLOR_RGB2BGR)
    for obj in objects:
        name = obj["name"]
        pose = obj["pose"]
        # Coordinates are stored as strings in the parsed XML
        xmin = int(obj["bndbox"]["xmin"])
        ymin = int(obj["bndbox"]["ymin"])
        xmax = int(obj["bndbox"]["xmax"])
        ymax = int(obj["bndbox"]["ymax"])
        img_cv2 = cv2.rectangle(img_cv2, (xmin, ymin), (xmax, ymax), RED_BGR)  # red rectangle
        img_cv2 = cv2.putText(img_cv2, name + ", " + pose, (xmin, ymin),
                              cv2.FONT_HERSHEY_COMPLEX, 1, RED_BGR, 1, cv2.LINE_AA)

    return PIL.Image.fromarray(cv2.cvtColor(img_cv2, cv2.COLOR_BGR2RGB))


if __name__ == "__main__":
    # Download VOC2007 for the detection task, or load it from `root` if present
    voc2007_det_train = VOCDetection(
        root='voc',
        year='2007',
        image_set='train',
        download=True,
    )

    # Show the first four images and targets
    fig = plt.figure(figsize=(10, 9))
    fig.suptitle("The first four images and labels"
                 " \nof VOC2007 train dataset for Detection in PyTorch")
    for idx, (image, label) in enumerate(voc2007_det_train):
        annotation = label["annotation"]
        filename = annotation["filename"]
        size = annotation["size"]

        xlabel = f'{filename}\n({size["width"]}, {size["height"]}, {size["depth"]})'
        ax = fig.add_subplot(2, 2, idx + 1)
        ax.imshow(draw_bndbox(image, annotation["object"]))
        ax.set_xlabel(xlabel)

        # Disable the ticks and frame of the axes
        ax.set_frame_on(False)
        ax.set_xticks([])
        ax.set_yticks([])

        # Only read the first four images and labels
        if idx >= 3:
            break

    # plt.show()
    plt.savefig("PASCAL VOC Detection.jpg", bbox_inches='tight')  # Save the result

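Running the script produces the following figure:

[Figure: the first four images of the VOC2007 train set with their bounding boxes and labels drawn in red]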

Segmentation Dataset

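For semantic segmentation, the torchvision.datasets.VOCSegmentation API downloads, extracts, and reads the data. It loads the JPEG images from the JPEGImages folder and returns them as PIL.Image objects, and it loads the PNG files from the SegmentationClass subfolder and returns them as PIL.PngImagePlugin.PngImageFile objects that serve as the mask labels. Note that torchvision.datasets.VOCSegmentation does not use the instance segmentation masks in the SegmentationObject directory; it uses the semantic segmentation masks in SegmentationClass.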

In the semantic segmentation data, the pixel values of a mask encode the classes:
- 0=background
- 1=aeroplane
- 2=bicycle
- 3=bird
- 4=boat
- 5=bottle
- 6=bus
- 7=car
- 8=cat
- 9=chair
- 10=cow
- 11=dining table
- 12=dog
- 13=horse
- 14=motorbike
- 15=person
- 16=potted plant
- 17=sheep
- 18=sofa
- 19=train
- 20=tv/monitor
- 255=void or unlabelled
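
The mapping above can be used to summarize which classes a mask contains. The sketch below counts pixels per class with the standard library only; `summarize_mask` is a hypothetical helper, and the mask is assumed to be given as nested lists of integers (for a real mask this could be obtained with something like `numpy.asarray(target).tolist()`).

```python
from collections import Counter

VOC_CLASSES = (
    "background", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus",
    "car", "cat", "chair", "cow", "dining table", "dog", "horse",
    "motorbike", "person", "potted plant", "sheep", "sofa", "train",
    "tv/monitor",
)
VOID_LABEL = 255  # void or unlabelled pixels


def summarize_mask(mask_rows):
    """Count the pixels of each class in a mask given as rows of class ids."""
    counts = Counter(value for row in mask_rows for value in row)
    summary = {}
    for value, n_pixels in sorted(counts.items()):
        name = "void" if value == VOID_LABEL else VOC_CLASSES[value]
        summary[name] = n_pixels
    return summary


if __name__ == "__main__":
    # Tiny invented 3x4 mask: background, a dog, a person, and void borders.
    demo_mask = [
        [0, 0, 255, 15],
        [0, 255, 15, 15],
        [12, 12, 255, 0],
    ]
    print(summarize_mask(demo_mask))
```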

A Python script that loads the images and displays the class segmentation masks:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

'''
@File: **
@Date: 2024-08-27
@Author: KRISNAT
@Version: 0.0.0
@Email: **
@Copyright: (C)Copyright 2024, KRISNAT
@Desc: None
'''
import PIL.Image
import matplotlib.pyplot as plt
from torchvision.datasets import VOCSegmentation
import numpy as np

# RGBA color for each class id; alpha=128 makes the overlay translucent
COLOR_MAP = {
    0: (0, 0, 0, 128),  # background, black
    1: (247, 116, 95, 128),  # aeroplane
    2: (232, 129, 49, 128),  # bicycle
    3: (208, 142, 49, 128),  # bird
    4: (190, 150, 49, 128),  # boat
    5: (173, 156, 49, 128),  # bottle
    6: (164, 159, 49, 128),  # bus (adjusted so it differs from bottle)
    7: (155, 162, 49, 128),  # car
    8: (134, 167, 49, 128),  # cat
    9: (99, 174, 49, 128),  # chair
    10: (49, 178, 82, 128),  # cow
    11: (51, 176, 122, 128),  # dining table
    12: (52, 174, 142, 128),  # dog
    13: (53, 173, 157, 128),  # horse
    14: (54, 172, 170, 128),  # motorbike
    15: (54, 170, 182, 128),  # person
    16: (56, 168, 197, 128),  # potted plant
    17: (57, 166, 216, 128),  # sheep
    18: (73, 160, 244, 128),  # sofa
    19: (135, 149, 244, 128),  # train
    20: (172, 136, 244, 128),  # tv/monitor
    255: (255, 255, 255, 128),  # void or unlabelled, white
}


def draw_mask(image: PIL.Image.Image, target: PIL.Image.Image = None) -> PIL.Image.Image:
    """Overlay the segmentation mask of a PASCAL VOC image on the image."""
    if target is None:
        return image

    # NumPy arrays are indexed (row, column), i.e. (height, width), whereas
    # PIL reports its size as (width, height); no transpose is needed here.
    class_ids = np.asarray(target)

    # Lookup table from class id to RGBA; rows left at zero stay transparent
    lut = np.zeros((256, 4), dtype=np.uint8)
    for class_id, rgba in COLOR_MAP.items():
        if class_id not in (0, 255):  # leave background and void uncolored
            lut[class_id] = rgba
    mask = PIL.Image.fromarray(lut[class_ids])  # (H, W, 4) uint8 -> RGBA image

    # Merge the mask and the original image
    return PIL.Image.alpha_composite(image.convert("RGBA"), mask)


if __name__ == "__main__":
    # Download VOC2007 for the segmentation task, or load it from `root` if present
    voc2007_seg_train = VOCSegmentation(
        root='voc',
        year='2007',
        image_set='train',
        download=True,
    )

    # Show the first four images and masks
    fig = plt.figure(figsize=(8, 6))
    fig.suptitle("The first four images and masks"
                 " \nof VOC2007 train dataset for Segmentation in PyTorch")
    for idx, (image, target) in enumerate(voc2007_seg_train):
        ax = fig.add_subplot(2, 2, idx + 1)
        ax.imshow(draw_mask(image, target))

        # Disable the ticks and frame of the axes
        ax.set_frame_on(False)
        ax.set_xticks([])
        ax.set_yticks([])
        if idx >= 3:
            break

    # plt.show()
    plt.savefig("PASCAL VOC SegmentationClass.jpg", bbox_inches='tight')  # Save the result

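Running the script produces the following figure:

[Figure: the first four images of the VOC2007 train set with their class segmentation masks overlaid]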

References

Everingham M., Van Gool L., Williams C.K.I., et al. The PASCAL Visual Object Classes (VOC) Challenge[J]. International Journal of Computer Vision, 2010, 88(2): 303-339.
Everingham M., Eslami S.M.A., Van Gool L., et al. The PASCAL Visual Object Classes Challenge: A Retrospective[J]. International Journal of Computer Vision, 2015, 111(1): 98-136.
The PASCAL Visual Object Classes Homepage[EB/OL]. [2024-08-25]. http://host.robots.ox.ac.uk/pascal/VOC/
