Visual Object Classes Challenge 2012 (VOC2012)
Introduction
The main goal of this challenge is to recognize objects from a number of visual object classes in realistic scenes (i.e. not pre-segmented objects). It is fundamentally a supervised learning problem in that a training set of labelled images is provided. The twenty object classes that have been selected are:
- Person: person
- Animal: bird, cat, cow, dog, horse, sheep
- Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
- Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor
There are 20 classes in total. The train/val data has 11,530 images containing 27,450 ROI-annotated objects and 6,929 segmentations.
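As a small illustration of the class list above, the twenty names can be kept in one Python structure and mapped to integer indices. This is a minimal sketch: the names `VOC_GROUPS` and `class_to_index`, and the choice to reserve index 0 for a background entry, are assumptions made for the example rather than part of the challenge definition.

```python
# Hypothetical grouping of the twenty VOC2012 classes, copied from the list above.
VOC_GROUPS = {
    "Person": ["person"],
    "Animal": ["bird", "cat", "cow", "dog", "horse", "sheep"],
    "Vehicle": ["aeroplane", "bicycle", "boat", "bus", "car", "motorbike", "train"],
    "Indoor": ["bottle", "chair", "dining table", "potted plant", "sofa", "tv/monitor"],
}

def class_to_index(groups):
    """Return {class_name: index}; index 0 is reserved for background (an assumption)."""
    names = ["__background__"]
    for members in groups.values():
        names.extend(members)
    return {name: idx for idx, name in enumerate(names)}

CLASS_TO_INDEX = class_to_index(VOC_GROUPS)
print(len(CLASS_TO_INDEX))
```

Note that the annotation files spell some of these names without spaces (for example `diningtable`), so a mapping used with real XML files would need the spellings found there.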
There are three main object recognition competitions: classification, detection, and segmentation, a competition on action classification, and a competition on large scale recognition run by ImageNet. In addition there is a “taster” competition on person layout.
Classification/Detection Competitions
- Classification: For each of the twenty classes, predicting presence/absence of an example of that class in the test image.
- Detection: Predicting the bounding box and label of each object from the twenty target classes in the test image.
Segmentation Competition
- Segmentation: Generating pixel-wise segmentations giving the class of the object visible at each pixel, or “background” otherwise.
Action Classification Competition
- Action Classification: Predicting the action(s) being performed by a person in a still image.
VOC2012
- Annotations
  - 2008_003420.xml
- ImageSets
  - Action
  - Layout
  - Main
  - Segmentation
- JPEGImages
- SegmentationClass
- SegmentationObject
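Annotations: holds the XML files. Each XML file corresponds to one image and stores the position and class of every labelled object in it. The file is usually named after the image it describes.
JPEGImages: your own original images go in the JPEGImages folder.
ImageSets:
- Action: predicting the action performed by a person in a still image (running, jumping and so on).
- Layout: person layout. The goal of this task is to predict the bounding box and the label of each body part (head, hand, feet and so on).
- Main: holds the object recognition data for the 20 classes. For each class there are four files: xxx_test.txt, xxx_train.txt, xxx_val.txt and xxx_trainval.txt. On each line the first field is the image name and the second is the label, where 1 marks a positive sample and -1 a negative sample. For example:
tail -n5 person_train.txt
2011_003253 -1 // 2011_003253.jpg contains no person
2011_003255 1 // 2011_003255.jpg contains a person
2011_003259 1
2011_003274 -1
2011_003276 -1
- Segmentation: holds the segmentation data. train.txt lists the image IDs of the training set, val.txt lists the image IDs of the validation set, and trainval.txt is the union of the two.
VOC2012/ImageSets/Main/train.txt lists the file names of all training images. For each name, the image file is found in VOC2012/JPEGImages/ and the matching label file in VOC2012/Annotations/.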
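The per-class list format described above can be read with a few lines of Python. This is a minimal sketch: the function name `parse_class_list` is hypothetical, and it assumes each line holds an image name followed by an integer label, separated by whitespace.

```python
def parse_class_list(text):
    """Parse lines of '<image_name> <label>' into a {name: label} dict."""
    labels = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, label = parts
        labels[name] = int(label)
    return labels

# The five lines shown in the person_train.txt example.
sample = """2011_003253 -1
2011_003255  1
2011_003259  1
2011_003274 -1
2011_003276 -1"""

labels = parse_class_list(sample)
positives = sorted(name for name, label in labels.items() if label == 1)
print(positives)  # ['2011_003255', '2011_003259']
```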
Annotations
The Annotations folder holds the label files in XML format. Each XML file corresponds to one image in the JPEGImages folder.
The XML file format looks like this:
<annotation>
  <filename>2012_000056.jpg</filename> // file name
  <folder>VOC2012</folder>
  <object> // information about an annotated object
    <name>person</name> // object class
    <actions> // action being performed
      <jumping>0</jumping>
      <other>0</other>
      <phoning>1</phoning>
      <playinginstrument>0</playinginstrument>
      <reading>0</reading>
      <ridingbike>0</ridingbike>
      <ridinghorse>0</ridinghorse>
      <running>0</running>
      <takingphoto>0</takingphoto>
      <usingcomputer>0</usingcomputer>
      <walking>0</walking>
    </actions>
    <bndbox> // bbox info, [left, top, right, bottom]
      <xmax>63</xmax>
      <xmin>1</xmin>
      <ymax>375</ymax>
      <ymin>84</ymin>
    </bndbox>
    <difficult>0</difficult> // whether the object is hard to recognize (0 means easy)
    <pose>Unspecified</pose> // pose of the object
    <point> // if the object has a reference point annotated
      <x>26</x>
      <y>183</y>
    </point>
  </object>
  <segmented>0</segmented> // whether the image is used for segmentation
  <size> // image size: width, height, channels
    <depth>3</depth>
    <height>375</height>
    <width>500</width>
  </size>
  <source> // image source
    <annotation>PASCAL VOC2012</annotation>
    <database>The VOC2012 Database</database>
    <image>flickr</image>
  </source>
</annotation>
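To show how such a file can be read, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The embedded XML is a trimmed copy of the example above with the `//` notes removed (they are explanatory only and are not valid XML); the helper name `read_objects` is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """<annotation>
  <filename>2012_000056.jpg</filename>
  <object>
    <name>person</name>
    <bndbox>
      <xmax>63</xmax>
      <xmin>1</xmin>
      <ymax>375</ymax>
      <ymin>84</ymin>
    </bndbox>
    <difficult>0</difficult>
  </object>
  <size>
    <depth>3</depth>
    <height>375</height>
    <width>500</width>
  </size>
</annotation>"""

def read_objects(xml_text):
    """Return (filename, [(name, xmin, ymin, xmax, ymax, difficult), ...])."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        # Look tags up by name, since the file does not store them in box order.
        coords = [int(box.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")]
        difficult = int(obj.find("difficult").text)
        objects.append((obj.find("name").text, coords[0], coords[1],
                        coords[2], coords[3], difficult))
    return root.find("filename").text, objects

filename, objects = read_objects(SAMPLE_XML)
print(filename, objects)
```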
ImageSets
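ImageSets holds the image lists for each type of challenge. There are four folders under ImageSets:
- Action holds the person action data (for example running and jumping; this is part of the VOC challenge).
- Layout holds the data with person body parts (head, hand, feet and so on; this is also part of the VOC challenge).
- Main holds the image object recognition data, 20 classes in total.
- Segmentation holds the data that can be used for segmentation.
The Main folder contains ***_train.txt, ***_val.txt and ***_trainval.txt for each of the 20 classes. The contents of these files are all similar: the first field is the image name and the second is the label, where 1 marks a positive sample and -1 a negative sample. The _train files list the data used for training, 5717 images per class. The _val files list the data used for validation, 5823 images per class. The _trainval files merge the two, 11540 images per class. train and val must not intersect, that is, no image may appear in both the training data and the validation data, and the training data should be selected at random.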
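The requirement that train and val do not overlap is easy to verify with set operations. A minimal sketch, assuming the image names have already been read from the list files; `check_split` is a hypothetical helper and the names below are made up for the example.

```python
def check_split(train_names, val_names):
    """Return (overlap, trainval): the shared names and the merged, sorted list."""
    train, val = set(train_names), set(val_names)
    overlap = sorted(train & val)
    trainval = sorted(train | val)
    return overlap, trainval

# Made-up image names, for illustration only.
train = ["2008_000001", "2008_000002", "2008_000003"]
val = ["2008_000004", "2008_000005"]

overlap, trainval = check_split(train, val)
assert not overlap, "train and val share images: %s" % overlap
print(len(trainval))  # 5
```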
JPEGImages
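The JPEGImages folder contains all the images provided by PASCAL VOC, both training and test images, 17125 in total. The images are named in the form "year_number.jpg". Pixel dimensions vary, but landscape images are around 500×375 and portrait images around 375×500, and they rarely deviate by more than 100 pixels. In later training, the first step is to resize these images to 300×300 or 500×500, so no original image should be too far from that standard. These are the images used for training, testing and validation.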
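Resizing an image to a fixed size also changes where its annotated boxes fall, so the box coordinates have to be scaled by the same factors. A minimal sketch of that arithmetic, assuming 0-based pixel coordinates and a plain stretch to the target size (no padding or cropping); `scale_box` is a hypothetical helper.

```python
def scale_box(box, src_size, dst_size):
    """Scale (xmin, ymin, xmax, ymax) from src (width, height) to dst (width, height)."""
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    sx = dst_w / float(src_w)  # horizontal scale factor
    sy = dst_h / float(src_h)  # vertical scale factor
    xmin, ymin, xmax, ymax = box
    return (xmin * sx, ymin * sy, xmax * sx, ymax * sy)

# A 500x375 landscape image stretched to 300x300: x scales by 0.6, y by 0.8.
scaled = scale_box((100, 75, 250, 150), (500, 375), (300, 300))
print([round(v, 2) for v in scaled])
```

Because the two factors differ for a non-square image, the aspect ratio of every box changes along with the image.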
SegmentationClass
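Contains 2913 images, each matching the image with the same ID in JPEGImages. Apart from the background, the pixel colors take 20 values, one for each of the 20 object classes.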
SegmentationObject
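Contains 2913 images with the same IDs as those in SegmentationClass. The difference is that these images are annotated per object. In SegmentationClass, if an image contains several aeroplanes they are all marked in red, whereas in SegmentationObject each aeroplane in the same image is marked in a different color.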
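The difference between the two folders can be illustrated with two tiny label grids. This is a toy sketch in plain Python, not a reader for the actual image files: the grids, the value 1 standing for the aeroplane class, and the helper `instances_per_class` are all made up for the example.

```python
# Toy 3x6 masks: 0 is background. In CLASS_MASK both aeroplanes share the value 1;
# in OBJECT_MASK each aeroplane gets its own id (1 and 2).
CLASS_MASK = [
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
OBJECT_MASK = [
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
]

def instances_per_class(class_mask, object_mask):
    """Return {class_id: set of object ids} by pairing the two masks pixel by pixel."""
    result = {}
    for class_row, object_row in zip(class_mask, object_mask):
        for class_id, object_id in zip(class_row, object_row):
            if class_id == 0:
                continue  # background pixel
            result.setdefault(class_id, set()).add(object_id)
    return result

print(instances_per_class(CLASS_MASK, OBJECT_MASK))  # {1: {1, 2}}
```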
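Building a VOC2012 dataset
Building a VOC dataset mainly involves the following steps:
- Data preparation
- Labelling the images: generate label files containing the class and bounding box information
- Generating the files required by the VOC format, mainly Annotations/*.xml and ImageSets/Main/*.txt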
Step1 Create the VOC2012 directory
-- VOC2012
|-- Annotations
|-- ImageSets
| |-- Action
| |-- Layout
| |-- Main
| `-- Segmentation
|-- JPEGImages
|-- SegmentationClass
`-- SegmentationObject
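The tree above can be created with `os.makedirs`. A minimal sketch: the function name `make_voc_dirs` is hypothetical, and the root path is whatever location suits the project.

```python
import os

VOC_SUBDIRS = [
    "Annotations",
    os.path.join("ImageSets", "Action"),
    os.path.join("ImageSets", "Layout"),
    os.path.join("ImageSets", "Main"),
    os.path.join("ImageSets", "Segmentation"),
    "JPEGImages",
    "SegmentationClass",
    "SegmentationObject",
]

def make_voc_dirs(root):
    """Create the VOC2012 folder layout under root; existing folders are left alone."""
    paths = []
    for sub in VOC_SUBDIRS:
        path = os.path.join(root, "VOC2012", sub)
        if not os.path.isdir(path):
            os.makedirs(path)
        paths.append(path)
    return paths
```

Calling `make_voc_dirs('.')` would create the layout in the current directory, and calling it again is harmless.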
Step2 Generate the XML files in the Annotations directory
Run Step2.1, Step2.2, or both.
Step2.1 Generate the XML files in the Annotations directory
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import os
import cv2
import numpy
from lxml.etree import Element, SubElement, tostring

def save_Annotations_xml(root_dir, label_list, Annotations_dir, JPEGImages_dir):
    """
    Read a label list and write one VOC-style XML file plus a renumbered
    JPEG copy for every image that keeps at least one box.
    """
    filename = os.path.join(root_dir, label_list)
    print('-------------label list filename:', filename)
    # note: i is the start index; if JPEGImages already holds
    # train, val or test images, change i
    i = 1
    with open(filename, 'r') as f:
        lines = f.readlines()
    print('save_Annotations_xml read lines:', len(lines))
    for line in lines:
        line_info = line.rstrip('\n').split(' ')
        imgname = os.path.join(root_dir, line_info[0])
        img = cv2.imread(imgname)
        if img is None:
            # cv2.imread returns None when the file is missing or unreadable
            continue
        height, width, channel = img.shape
        image_name = '%09d' % i + '.jpg'
        # path of the copy saved under JPEGImages
        new_image_name = os.path.join(JPEGImages_dir, image_name)
        # construct annotation
        node_root = Element('annotation')
        node_folder = SubElement(node_root, 'folder')
        node_folder.text = 'JPEGImages'
        # image name
        node_filename = SubElement(node_root, 'filename')
        node_filename.text = image_name
        # image width, height, channel
        node_size = SubElement(node_root, 'size')
        node_depth = SubElement(node_size, 'depth')
        node_depth.text = '%s' % channel
        node_height = SubElement(node_size, 'height')
        node_height.text = '%s' % height
        node_width = SubElement(node_size, 'width')
        node_width.text = '%s' % width
        write_infile = False
        # bounding box info: drop the file name, then the trailing class field
        line_info = [int(b) for b in line_info[1:]]
        array = numpy.array(line_info[:-1])
        bboxs = array.reshape(-1, 4)
        for bbox in bboxs:
            # each box is read as x, y, width, height
            x, y, w, h = [int(b) for b in bbox]
            # data filter: skip boxes that are too small
            if w < 12 or h < 32:
                continue
            write_infile = True
            left, top, right, bottom = x, y, x + w, y + h
            node_object = SubElement(node_root, 'object')
            node_name = SubElement(node_object, 'name')
            node_name.text = 'person'
            node_difficult = SubElement(node_object, 'difficult')
            node_difficult.text = '0'
            node_bndbox = SubElement(node_object, 'bndbox')
            node_xmin = SubElement(node_bndbox, 'xmin')
            node_xmin.text = '%s' % left
            node_ymin = SubElement(node_bndbox, 'ymin')
            node_ymin.text = '%s' % top
            node_xmax = SubElement(node_bndbox, 'xmax')
            node_xmax.text = '%s' % right
            node_ymax = SubElement(node_bndbox, 'ymax')
            node_ymax.text = '%s' % bottom
        if write_infile:
            # serialize the tree to bytes
            xml = tostring(node_root, pretty_print=True)
            # save the XML next to the renumbered image name
            save_xml = os.path.join(Annotations_dir, image_name.replace('.jpg', '.xml'))
            with open(save_xml, 'wb') as f:
                f.write(xml)
            cv2.imwrite(new_image_name, img)
            # advance only when a file was written, so numbering has no gaps
            i = i + 1
    print('******************* make Annotations xml Done *******************', i - 1)

if __name__ == '__main__':
    # dataset to convert
    # each line of test_label is: filename, four integers per box, then a trailing class field
    # (the function above reads each group of four as x, y, w, h)
    root_dir = '/opt/notebook_files/datasets/PedestrianDataset/Caltech'
    test_label = 'caltech_train_label.txt'
    # Voc dataset
    Annotations_dir = 'VOC2012/Annotations'
    JPEGImages_dir = 'VOC2012/JPEGImages'
    save_Annotations_xml(root_dir, test_label, Annotations_dir, JPEGImages_dir)
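The box handling in the script above can be isolated into a small pure-Python function, which makes the filtering rule easier to test without images. A minimal sketch, assuming the same line layout the script reads (a file name, four integers `x y w h` per box, and one trailing class field); `parse_label_line` is a hypothetical name and the sample line is made up.

```python
def parse_label_line(line, min_w=12, min_h=32):
    """Return (image_name, [(left, top, right, bottom), ...]) for one label line."""
    fields = line.split()
    image_name = fields[0]
    values = [int(v) for v in fields[1:-1]]  # drop the trailing class field
    if len(values) % 4 != 0:
        raise ValueError("expected 4 integers per box: %r" % line)
    boxes = []
    for k in range(0, len(values), 4):
        x, y, w, h = values[k:k + 4]
        if w < min_w or h < min_h:
            continue  # same size filter as the script above
        boxes.append((x, y, x + w, y + h))
    return image_name, boxes

# Made-up line: two boxes, the second is too small and gets filtered out.
name, boxes = parse_label_line("images/0001.jpg 10 20 30 60 5 5 8 16 1")
print(name, boxes)  # images/0001.jpg [(10, 20, 40, 80)]
```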
Step2.2 Incrementally add XML files to the Annotations directory
This routine adds a new dataset to an existing VOC dataset directory.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import os
import cv2
import numpy
from lxml.etree import Element, SubElement, tostring

def add_Annotations_xml(root_dir, label_list, Annotations_dir, JPEGImages_dir):
    """
    Append new images and XML files to an existing VOC directory,
    continuing the numbering after the files already present.
    """
    filename = os.path.join(root_dir, label_list)
    print('-------------label list filename:', filename)
    Annotations_xmls = os.listdir(Annotations_dir)
    print('Annotations_xmls:', len(Annotations_xmls))
    # assumes the existing files are numbered 1..N without gaps,
    # so the first new file is N + 1 and nothing is overwritten
    start_idx = len(Annotations_xmls) + 1
    with open(filename, 'r') as f:
        lines = f.readlines()
    print('add_Annotations_xml read lines:', len(lines))
    for line in lines:
        line_info = line.rstrip('\n').split(' ')
        imgname = os.path.join(root_dir, line_info[0])
        img = cv2.imread(imgname)
        if img is None:
            # cv2.imread returns None when the file is missing or unreadable
            continue
        height, width, channel = img.shape
        image_name = '%09d' % start_idx + '.jpg'
        # path of the copy saved under JPEGImages
        new_image_name = os.path.join(JPEGImages_dir, image_name)
        # construct annotation
        node_root = Element('annotation')
        node_folder = SubElement(node_root, 'folder')
        node_folder.text = 'JPEGImages'
        # image name
        node_filename = SubElement(node_root, 'filename')
        node_filename.text = image_name
        # image width, height, channel
        node_size = SubElement(node_root, 'size')
        node_depth = SubElement(node_size, 'depth')
        node_depth.text = '%s' % channel
        node_height = SubElement(node_size, 'height')
        node_height.text = '%s' % height
        node_width = SubElement(node_size, 'width')
        node_width.text = '%s' % width
        write_infile = False
        # bounding box info: drop the file name, then the trailing class field
        line_info = [int(b) for b in line_info[1:]]
        array = numpy.array(line_info[:-1])
        bboxs = array.reshape(-1, 4)
        for bbox in bboxs:
            # each box is read as x, y, width, height
            x, y, w, h = [int(b) for b in bbox]
            # data filter: skip boxes that are too small
            if w < 12 or h < 32:
                continue
            write_infile = True
            left, top, right, bottom = x, y, x + w, y + h
            node_object = SubElement(node_root, 'object')
            node_name = SubElement(node_object, 'name')
            node_name.text = 'person'
            node_difficult = SubElement(node_object, 'difficult')
            node_difficult.text = '0'
            node_bndbox = SubElement(node_object, 'bndbox')
            node_xmin = SubElement(node_bndbox, 'xmin')
            node_xmin.text = '%s' % left
            node_ymin = SubElement(node_bndbox, 'ymin')
            node_ymin.text = '%s' % top
            node_xmax = SubElement(node_bndbox, 'xmax')
            node_xmax.text = '%s' % right
            node_ymax = SubElement(node_bndbox, 'ymax')
            node_ymax.text = '%s' % bottom
        if write_infile:
            # serialize the tree to bytes
            xml = tostring(node_root, pretty_print=True)
            # save the XML next to the renumbered image name
            save_xml = os.path.join(Annotations_dir, image_name.replace('.jpg', '.xml'))
            with open(save_xml, 'wb') as f:
                f.write(xml)
            cv2.imwrite(new_image_name, img)
            # advance only when a file was written, so numbering has no gaps
            start_idx += 1
    print('*******************Have Annotations xmls:{} *******************'.format(start_idx - 1))

if __name__ == '__main__':
    # dataset to convert
    # each line of test_label is: filename, four integers per box, then a trailing class field
    # (the function above reads each group of four as x, y, w, h)
    root_dir = '/opt/notebook_files/datasets/PedestrianDataset/Caltech'
    test_label = 'caltech_train_label.txt'
    # Voc dataset
    Annotations_dir = 'VOC2012/Annotations'
    JPEGImages_dir = 'VOC2012/JPEGImages'
    add_Annotations_xml(root_dir, test_label, Annotations_dir, JPEGImages_dir)
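Counting the files in Annotations only gives a safe start index when the existing files are numbered 1..N without gaps. A more defensive variant, sketched here as an assumption rather than the author's method, derives the next index from the largest numeric file name already present; `next_index` is a hypothetical helper.

```python
import os

def next_index(annotations_dir):
    """Return one more than the largest all-digit XML file stem, or 1 if none exist."""
    largest = 0
    for name in os.listdir(annotations_dir):
        stem, ext = os.path.splitext(name)
        if ext.lower() == ".xml" and stem.isdigit():
            largest = max(largest, int(stem))
    return largest + 1
```

With files `000000001.xml` and `000000007.xml` present, this returns 8, whereas a plain file count would suggest 2 or 3 and risk a name collision later.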
Step3 Convert Test
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import os
import cv2
import matplotlib.pyplot as plt
import xml.etree.ElementTree as ET
import numpy as np
# Jupyter magic that renders figures inline; drop this line outside a notebook
%matplotlib inline

VOC_CLASSES = ["__background__", "person"]
classes = VOC_CLASSES
class_to_ind = dict(zip(VOC_CLASSES, range(len(VOC_CLASSES))))

def AnnotationTransform(xml_path):
    """
    Read a VOC XML file and return an (N, 5) array whose rows are
    [xmin, ymin, xmax, ymax, label_idx].
    """
    # ET.parse accepts a path and closes the file itself
    tree = ET.parse(xml_path)
    root = tree.getroot()
    res = np.empty((0, 5))
    for obj in root.iter('object'):
        difficult = obj.find('difficult').text
        name = obj.find('name').text.lower().strip()
        # skip unknown classes and objects marked as difficult
        if name not in classes or int(difficult) == 1:
            continue
        bbox = obj.find('bndbox')
        pts = ['xmin', 'ymin', 'xmax', 'ymax']
        bndbox = []
        for pt in pts:
            # subtract 1, following the VOC convention of 1-based pixel coordinates
            cur_pt = int(bbox.find(pt).text) - 1
            bndbox.append(cur_pt)
        label_idx = class_to_ind[name]
        bndbox.append(label_idx)
        res = np.vstack((res, bndbox))
    return res

JPEGImages_dir = 'VOC2012/JPEGImages'
Annotations_dir = 'VOC2012/Annotations'
result = []
JPEGImages = os.listdir(JPEGImages_dir)
# draw the boxes of the first 10 images
for img_file in JPEGImages[:10]:
    imgname = os.path.join(JPEGImages_dir, img_file)
    Annotation = os.path.join(Annotations_dir, os.path.splitext(img_file)[0] + '.xml')
    img = cv2.imread(imgname)
    bboxs = AnnotationTransform(Annotation)
    for bbox in bboxs:
        tx, ty, bx, by, label = [int(b) for b in bbox]
        cv2.rectangle(img, (tx, ty), (bx, by), (0, 0, 255), 1)
    if len(bboxs) > 0:
        # label is the class index of the last box drawn
        result.append([img, label])

win_idx = 1
plt.ion()
for k in range(len(result)):
    try:
        plt.figure(win_idx)
        plt.title(str(result[k][1]))
        # OpenCV loads images as BGR; reverse the channels for matplotlib
        plt.imshow(result[k][0][:, :, ::-1])
    except Exception as e:
        print('failed to show image', k, e)
    finally:
        win_idx += 1
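Besides drawing a few boxes, the conversion can be checked by confirming that every image has an annotation file and vice versa. A minimal sketch using only the standard library; `find_unpaired` is a hypothetical helper and assumes `.jpg` images paired with `.xml` annotations of the same stem.

```python
import os

def find_unpaired(jpeg_dir, annotations_dir):
    """Return (images_without_xml, xmls_without_image) as sorted lists of stems."""
    def stems(folder, ext):
        # Collect file names without extension for files ending in ext.
        return {os.path.splitext(n)[0] for n in os.listdir(folder)
                if n.lower().endswith(ext)}
    images = stems(jpeg_dir, ".jpg")
    xmls = stems(annotations_dir, ".xml")
    return sorted(images - xmls), sorted(xmls - images)
```

Two empty lists mean the folders are consistent; anything else names the files that need attention.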
Step4 Generate the txt files in the Main directory
That is, build the test, validation and other image sets and store them as txt files.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import os
import random

def make_ImageSets_label(ImageSets_Main_dir, Annotations_dir, label_list):
    """
    Write the name (without extension) of every XML file in Annotations_dir
    to ImageSets_Main_dir/label_list, one name per line.
    """
    # split ratios; not used below, since every image goes into a single list
    trainval_percent = 0.1
    train_percent = 0.9
    # sorted so that the list order is reproducible
    total_xml = sorted(os.listdir(Annotations_dir))
    num = len(total_xml)
    print('image nums:', num)
    label_list_path = os.path.join(ImageSets_Main_dir, label_list)
    print('label_list_path:', label_list_path)
    with open(label_list_path, 'w') as f:
        for xml_name in total_xml:
            # strip the '.xml' extension
            name = xml_name[:-4] + '\n'
            f.write(name)
    print('******************* make ImageSets Main Done *******************')

if __name__ == '__main__':
    ImageSets_Main_dir = 'VOC2012/ImageSets/Main'
    Annotations_dir = 'VOC2012/Annotations'
    label_list = 'trainval.txt'
    make_ImageSets_label(ImageSets_Main_dir, Annotations_dir, label_list)
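The script above writes every annotated image into a single list. If separate train and val lists are also wanted, one way is a seeded random split. This is a sketch under assumptions: the helpers `split_names` and `write_list`, the 0.9 train fraction and the fixed seed are choices made for the example.

```python
import random

def split_names(names, train_fraction=0.9, seed=0):
    """Shuffle names with a fixed seed and return (train, val) as sorted lists."""
    shuffled = sorted(names)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return sorted(shuffled[:cut]), sorted(shuffled[cut:])

def write_list(path, names):
    """Write one image name per line."""
    with open(path, "w") as f:
        for name in names:
            f.write(name + "\n")

# Twenty made-up names in the same zero-padded style as the converted files.
names = ["%09d" % k for k in range(1, 21)]
train, val = split_names(names)
print(len(train), len(val))  # 18 2
```

Sorting before shuffling keeps the split independent of the order in which the names were collected, so the same seed always yields the same train and val lists.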