VOC Data Format and YOLO Data Format

Latest recommended article published on 2024-08-05 11:19:50

Peter`Young


Category: YOLO Series Network Interpretation. Tags: deep learning, computer vision, artificial intelligence

Article link: https://blog.csdn.net/dzkdyhr1208/article/details/128790425


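Some early YOLO versions were trained on VOC data, but the VOC annotation format differs from the input format that YOLO networks expect, and VOC has no preprocessing library like the one available for COCO, so VOC annotations have to be converted to YOLO format. First, look at the VOC format: annotations are stored as XML files in the Annotations folder. The fragment below shows the annotation for one image; the fields to pay attention to are filename, size, name, difficult, and the contents of bndbox. Here name is the class of the object, size gives the width and height of the image, bndbox gives the corner coordinates of the object's bounding box, and difficult indicates whether the object is hard to recognize, as described in the English passage below.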

difficult: an object marked as 'difficult' indicates that the object is considered difficult to recognize, for example an object which is clearly visible but unidentifiable without substantial use of context. Objects marked as difficult are currently ignored in the evaluation of the challenge.

<filename>2007_000027.jpg</filename>
	<source>
		<database>The VOC2007 Database</database>
		<annotation>PASCAL VOC2007</annotation>
		<image>flickr</image>
	</source>
	<size>
		<width>486</width>
		<height>500</height>
		<depth>3</depth>
	</size>
	<segmented>0</segmented>
	<object>
		<name>person</name>
		<pose>Unspecified</pose>
		<truncated>0</truncated>
		<difficult>0</difficult>
		<bndbox>
			<xmin>174</xmin>
			<ymin>101</ymin>
			<xmax>349</xmax>
			<ymax>351</ymax>
		</bndbox>
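
The fields listed above can be read with Python's standard `xml.etree.ElementTree` module. The following is only a minimal sketch, not the author's code: the XML string is a trimmed copy of the annotation above, the `<annotation>` root element and the closing `</object>` tag are assumed here so that the fragment parses, and `read_voc_fields` is a hypothetical helper name.

```python
from xml.etree import ElementTree

# Trimmed copy of the annotation above. The <annotation> root and the closing
# </object> tag are assumptions added so that the fragment is well-formed.
SAMPLE_XML = """
<annotation>
    <filename>2007_000027.jpg</filename>
    <size>
        <width>486</width>
        <height>500</height>
        <depth>3</depth>
    </size>
    <object>
        <name>person</name>
        <difficult>0</difficult>
        <bndbox>
            <xmin>174</xmin>
            <ymin>101</ymin>
            <xmax>349</xmax>
            <ymax>351</ymax>
        </bndbox>
    </object>
</annotation>
""".strip()


def read_voc_fields(xml_text):
    """Return (filename, (width, height), objects) for one VOC annotation."""
    root = ElementTree.fromstring(xml_text)
    size = root.find("size")
    width = int(size.find("width").text)
    height = int(size.find("height").text)
    objects = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        objects.append({
            "name": obj.find("name").text,
            "difficult": int(obj.find("difficult").text),
            # Corner coordinates in pixels: (xmin, ymin, xmax, ymax)
            "bbox": tuple(int(box.find(tag).text)
                          for tag in ("xmin", "ymin", "xmax", "ymax")),
        })
    return root.find("filename").text, (width, height), objects


filename, image_size, objects = read_voc_fields(SAMPLE_XML)
print(filename, image_size, objects)
```

Each entry in `objects` keeps the class name, the difficult flag, and the pixel corners, which is everything the conversion to YOLO format needs.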

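So what does the data that YOLO needs look like? As shown below, each object is described by four values between 0 and 1 followed by a class index. In the class list, person is at index 14 (counting from 0). Converting the class index is easy, so most of the work lies in converting the position of the ground-truth box.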

names = ["aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]

0.536008 0.450000 0.360082 0.500000 14
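
To make the layout of such a label line concrete, here is a small sketch that splits it back into its parts. It assumes the field order used in this article (four box values, then the class index); other toolchains may order the fields differently (darknet's own label files, for instance, put the class index first). `NAMES` and `parse_label_line` are hypothetical names chosen for illustration.

```python
# The 20 VOC classes in the order given above.
NAMES = ["aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car",
         "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike",
         "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]


def parse_label_line(line):
    """Split one label line into (x_center, y_center, w, h, class_name).

    Assumes the class index is the last field, as in this article.
    """
    *coords, cls_id = line.split()
    x_center, y_center, box_w, box_h = map(float, coords)
    return x_center, y_center, box_w, box_h, NAMES[int(cls_id)]


parsed = parse_label_line("0.536008 0.450000 0.360082 0.500000 14")
print(parsed)
```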

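Because YOLO works on a grid, it requires all lengths to be normalized by the image size to values between 0 and 1. Of the four numbers, the first two are the coordinates of the center of the ground-truth box, and the last two are the width and height of the box. All of them can be computed from the XML above. The box center is $\left(\frac{x_{\min}+x_{\max}}{2 \cdot width}, \frac{y_{\min}+y_{\max}}{2 \cdot height}\right)$. In addition, because VOC pixel coordinates start at 1 while the output counts from 0, 1 is subtracted from each coordinate, which gives (174-1+349-1)/(2*486) = 0.536 and (351-1+101-1)/(2*500) = 0.45, exactly the YOLO values shown above. The last two values match as well: they are obtained from (xmax-xmin)/width and (ymax-ymin)/height, that is, (349-174)/486 = 0.360082 and (351-101)/500 = 0.5.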
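
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch using the numbers from the example annotation; `voc_box_to_yolo` is a hypothetical helper name, and the minus-one offset is applied to the center only, as in the calculation above.

```python
def voc_box_to_yolo(xmin, ymin, xmax, ymax, width, height):
    """Convert 1-based VOC pixel corners to normalized (x_center, y_center, w, h)."""
    # Center of the box, shifted by one pixel to account for 1-based coordinates
    x_center = ((xmin + xmax) / 2.0 - 1) / width
    y_center = ((ymin + ymax) / 2.0 - 1) / height
    # Width and height are differences, so the offset cancels out
    box_w = (xmax - xmin) / width
    box_h = (ymax - ymin) / height
    return x_center, y_center, box_w, box_h


values = voc_box_to_yolo(174, 101, 349, 351, width=486, height=500)
line = " ".join(f"{v:.6f}" for v in values)
print(line)  # 0.536008 0.450000 0.360082 0.500000
```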


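The code below shows the concrete implementation. Note that this way of producing YOLO-format labels is not the only one: some networks keep the original representation based on the top-left and bottom-right corners, but in every case the values are normalized to the range 0 to 1.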

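from xml.etree import ElementTree

import config  # the author's own module; assumed to define XML_DIR, LABEL_DIR and names


def voc2yolo(xml_file: str) -> None:
    """Convert one VOC XML annotation into a YOLO-format label file.

    Args:
        xml_file: file name of the annotation inside config.XML_DIR
    """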
    with open(f"{config.XML_DIR}/{xml_file}") as in_file:
        tree = ElementTree.parse(in_file)
        size = tree.getroot().find("size")
        height, width = map(int, [size.find("height").text, size.find("width").text])

    class_exists = False
    for obj in tree.findall("object"):
        name = obj.find("name").text
        if name in config.names:
            class_exists = True

    if class_exists:
        with open(f"{config.LABEL_DIR}/{xml_file[:-4]}.txt", "w") as out_file:
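            for obj in tree.findall("object"):
                # Skip objects whose class is not in config.names;
                # otherwise config.names.index() below raises ValueError
                if obj.find("name").text not in config.names:
                    continue
                # Skip objects marked as difficult
                difficult = obj.find("difficult").text
                if int(difficult) == 1:
                    continue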
                xml_box = obj.find("bndbox")

                x_min = float(xml_box.find("xmin").text)
                y_min = float(xml_box.find("ymin").text)

                x_max = float(xml_box.find("xmax").text)
                y_max = float(xml_box.find("ymax").text)

                # VOC pixel coordinates are 1-based; subtract 1 to get a 0-based
                # center, following the darknet VOC conversion script
                box_x_center = (x_min + x_max) / 2.0 - 1
                box_y_center = (y_min + y_max) / 2.0 - 1

                box_w = x_max - x_min
                box_h = y_max - y_min

                box_x = box_x_center * 1.0 / width
                box_w = box_w * 1.0 / width

                box_y = box_y_center * 1.0 / height
                box_h = box_h * 1.0 / height

                b = [box_x, box_y, box_w, box_h]

                cls_id = config.names.index(obj.find("name").text)
                out_file.write(" ".join(f"{v:.6f}" for v in b) + f" {cls_id}\n")
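
For networks that keep the corner-based representation mentioned earlier, the center form written by the function above can be turned back into normalized corners. The following is a minimal sketch under that assumption; `yolo_to_corners` is a hypothetical helper and is not part of the original code. It does not undo the minus-one offset, so the recovered pixel corners are 0-based.

```python
def yolo_to_corners(x_center, y_center, box_w, box_h):
    """Convert normalized (center, size) values to normalized corner coordinates."""
    x_min = x_center - box_w / 2.0
    y_min = y_center - box_h / 2.0
    x_max = x_center + box_w / 2.0
    y_max = y_center + box_h / 2.0
    return x_min, y_min, x_max, y_max


corners = yolo_to_corners(0.536008, 0.450000, 0.360082, 0.500000)
# Scale back to pixels for the 486 x 500 example image
pixels = [round(c * s) for c, s in zip(corners, (486, 500, 486, 500))]
print(pixels)  # [173, 100, 348, 350]
```

The recovered corners are the original VOC values (174, 101, 349, 351) shifted by one pixel, which is the effect of the offset applied to the box center.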