[python3] Batch-delete certain nodes from the XML files of a VOC dataset to obtain a dataset containing only specific classes (clear code, easy to use!)

Article link: https://blog.csdn.net/qq_43348528/article/details/107336676

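As an example, take VOC 2007 train+val and keep only the person and car classes, removing every other class:

The code below removes the nodes of the unwanted classes from the XML files:
Note: use absolute paths everywhere when running the code.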

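import os
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

# VOC 2007 train+val
path_root = "/*/VOCdevkit/VOC2007/Annotations/"
# Output directory for the filtered XML files
path_out = "/*/VOCdevkit/VOC2007/Annotations1/"

# Classes to keep; every other <object> node is removed
CLASSES = ["person", "car"]

os.makedirs(path_out, exist_ok=True)  # tree.write() does not create directories
xml_list = os.listdir(path_root)
count = 0

for axml in xml_list:
    path_xml = os.path.join(path_root, axml)
    tree = ET.parse(path_xml)
    root = tree.getroot()

    # findall() returns a list, so removing nodes while looping over it is safe
    for child in root.findall('object'):
        name = child.find('name').text
        if name not in CLASSES:
            root.remove(child)

    tree.write(os.path.join(path_out, axml))
    print(axml)
    count = count + 1

# Number of XML files processed
print(count)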
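
To check that the filtering did what was intended, one option is to tally the class names that remain in the output directory. The following is a minimal sketch and not part of the original script; `count_classes` is a hypothetical helper name, and the directory you pass in is assumed to be the output directory used above.

```python
import os
import xml.etree.ElementTree as ET
from collections import Counter


def count_classes(xml_dir):
    """Return a Counter mapping class name -> number of <object> nodes in xml_dir."""
    counts = Counter()
    for fname in sorted(os.listdir(xml_dir)):
        if not fname.endswith(".xml"):
            continue  # skip anything that is not an annotation file
        root = ET.parse(os.path.join(xml_dir, fname)).getroot()
        for obj in root.findall("object"):
            # findtext returns the text of the first <name> child
            counts[obj.findtext("name")] += 1
    return counts
```

Calling `count_classes` on the Annotations1 directory should return only the keys `person` and `car` if the filtering step ran as expected.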

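The code below deals with the XML files that, after the other nodes were removed, no longer contain any person or car object. As written, it only prints the names of those files (each followed by a backslash) so that they can be deleted; it does not remove them itself. For every file that still contains objects, it prints the image ID (for VOC 2007, the six-digit file name without .xml).

If you use the TensorFlow Object Detection API to convert the VOC dataset to TFRecord, remember to update /*/trainval/VOCdevkit/VOC2007/ImageSets/Main/aeroplane_trainval.txt so that it lists only the images that correspond to the remaining XML files.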

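import os
import xml.etree.ElementTree as ET

# 1. Find the XML files that no longer contain a person or car object
#    (they are only printed here, not deleted)
# 2. Print the remaining image IDs, used to update
#    /*/trainval/VOCdevkit/VOC2007/ImageSets/Main/aeroplane_trainval.txt
#    for the conversion to TFRecord
path_root = "/*/VOCdevkit/VOC2007/Annotations1/"

xml_list = os.listdir(path_root)
count = 0

for axml in xml_list:
    path_xml = os.path.join(path_root, axml)
    tree = ET.parse(path_xml)
    root = tree.getroot()
    size = len(root.findall('object'))
    if size == 0:
        # No objects left: print the file name followed by a backslash
        print(axml + " " + '\\')
    else:
        count = count + 1
        # VOC 2007 file names are six digits plus ".xml"; print the image ID
        print(axml[0:6])

# Number of XML files that still contain objects
print(count)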
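
If you would rather have the script delete the empty XML files and write the new image list in one go, something along the following lines might work. This is a sketch under assumptions, not the original author's method: `prune_empty_annotations` and its parameters are hypothetical names, and it writes one image ID per line, whereas the original VOC image-set files also carry a label column. Check what your conversion script expects, and try it on a copy of the directory first, because it removes files.

```python
import os
import xml.etree.ElementTree as ET


def prune_empty_annotations(xml_dir, list_path):
    """Delete object-less XML files in xml_dir and write the kept image IDs to list_path."""
    kept_ids = []
    for fname in sorted(os.listdir(xml_dir)):
        if not fname.endswith(".xml"):
            continue  # ignore non-annotation files
        path_xml = os.path.join(xml_dir, fname)
        root = ET.parse(path_xml).getroot()
        if root.find("object") is None:
            os.remove(path_xml)  # destructive: the annotation has no objects left
        else:
            kept_ids.append(os.path.splitext(fname)[0])  # "000005.xml" -> "000005"
    # Assumption: one image ID per line is enough for the downstream tool
    with open(list_path, "w") as f:
        for image_id in kept_ids:
            f.write(image_id + "\n")
    return kept_ids
```

The returned list has the same length as the count printed by the script above, which gives a quick consistency check between the two approaches.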