1. Basic Concepts of Object Detection
1.1 Object Detection
Object detection is an important direction in computer vision, a field that today mostly relies on deep learning. Compared with deep-learning-based image classification, object detection is a more difficult task.
Task | Description |
---|---|
Image classification | Classify the image, i.e. determine which category it belongs to. |
Object detection | Besides recognizing the category of every object in the image, also localize each object precisely and enclose it with a bounding rectangle. |
The figure below illustrates the difference between the two tasks; note that in object detection an image may contain more than one object.
1.2 The General Approach to Object Detection
In deep learning, image classification is usually solved with convolutional neural networks (CNNs). When CNNs were applied to object detection, however, it became clear that a plain classification network cannot directly predict coordinates. A divide-and-conquer strategy is therefore used: **first generate many candidate boxes, then classify and fine-tune each of them.** Concretely, a window is slid across the whole image to obtain local regions; each region is fed into a CNN to predict its class, and the box boundary is then fine-tuned based on that result. Every local region thus yields five attributes (class, x1, y1, x2, y2), and aggregating them over all regions gives the categories and coordinates of all objects in the image.
The figure shows one pass of this process: candidate boxes move across the image pixel by pixel, a classification result is predicted for each box, and the box with the highest score is taken as the most accurate one; its position is the location of the detected object.
Many classic network models such as R-CNN, YOLO and SSD were developed by optimizing along this line of thought.
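To make the idea concrete, here is a minimal, purely illustrative sketch of the sliding-window procedure; classify_patch is a hypothetical stand-in for a trained CNN classifier, and real detectors such as R-CNN, YOLO and SSD do not literally enumerate every window like this:
import torch

def sliding_window_detect(image, classify_patch, win=64, stride=16, score_thresh=0.9):
    """Purely illustrative sliding-window detector.
    image: a tensor of shape (3, H, W); classify_patch: a callable returning class scores of shape (n_classes,).
    Returns a list of (class_id, x1, y1, x2, y2, score) tuples."""
    _, H, W = image.shape
    detections = []
    for y1 in range(0, H - win + 1, stride):
        for x1 in range(0, W - win + 1, stride):
            patch = image[:, y1:y1 + win, x1:x1 + win]   # local region cut out by the window
            scores = classify_patch(patch)               # hypothetical CNN classifier
            score, cls = scores.max(dim=0)
            if score.item() > score_thresh:              # keep only confident windows
                detections.append((cls.item(), x1, y1, x1 + win, y1 + win, score.item()))
    return detections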
1.3 Defining the Target Box
The training data for supervised deep learning must contain two parts: the images and the ground-truth labels (ground truth, GT).
In image classification, the ground truth is simply the image category.
In object detection, the ground truth must contain not only the category but also the location of each object, i.e. its bounding box.
A bounding box (bbox) can be represented in two forms; both exist because each simplifies the computation in different situations.
(x1, y1, x2, y2) | (c_x, c_y, w, h) |
---|---|
top-left and bottom-right corners | center coordinates plus width and height |
The following code converts between the two representations.
import torch

def xy_to_cxcy(xy):
    """
    Convert boxes from boundary coordinates (x_min, y_min, x_max, y_max)
    to center-size coordinates (c_x, c_y, w, h).
    :param xy: boxes in boundary coordinates, a tensor of dimensions (n_boxes, 4)
    :return: the same boxes in center-size coordinates, a tensor of dimensions (n_boxes, 4)
    """
    return torch.cat([(xy[:, 2:] + xy[:, :2]) / 2,  # c_x, c_y
                      xy[:, 2:] - xy[:, :2]],       # w, h
                     dim=1)                         # dim 0 indexes the boxes

def cxcy_to_xy(cxcy):
    """
    Convert boxes from center-size coordinates (c_x, c_y, w, h)
    to boundary coordinates (x_min, y_min, x_max, y_max).
    :param cxcy: boxes in center-size coordinates, a tensor of dimensions (n_boxes, 4)
    :return: the same boxes in boundary coordinates, a tensor of dimensions (n_boxes, 4)
    """
    return torch.cat([cxcy[:, :2] - cxcy[:, 2:] / 2,  # x_min, y_min
                      cxcy[:, :2] + cxcy[:, 2:] / 2], # x_max, y_max
                     dim=1)
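As a quick sanity check, a round-trip conversion should recover the original boxes (the values below are made up for illustration):
boxes = torch.FloatTensor([[10., 20., 50., 80.],
                           [ 0.,  0., 100., 40.]])   # (x_min, y_min, x_max, y_max)
cxcy = xy_to_cxcy(boxes)             # tensor([[30., 50., 40., 60.], [50., 20., 100., 40.]])
recovered = cxcy_to_xy(cxcy)         # identical to boxes
print(torch.allclose(recovered, boxes))  # True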
1.4 Intersection over Union (IoU)
IoU is an important concept in object detection; it measures how much two boxes overlap.
IoU stands for Intersection over Union, and is defined as the area of the intersection of two boxes divided by the area of their union.
The computation proceeds as shown in the figure below:
1. Get the coordinates of the two boxes. Red box: top-left (red_x1, red_y1), bottom-right (red_x2, red_y2); green box: top-left (green_x1, green_y1), bottom-right (green_x2, green_y2).
2. Take the maximum of the two top-left corners, (max(red_x1, green_x1), max(red_y1, green_y1)), and the minimum of the two bottom-right corners, (min(red_x2, green_x2), min(red_y2, green_y2)).
3. Use the result of step 2 to compute the area of the intersection (the yellow box): yellow_area.
4. Compute the areas of the red and green boxes: red_area and green_area.
5. The IoU is then yellow_area / (red_area + green_area - yellow_area).
The code below implements the IoU computation. It contains two functions: find_intersection (computes the intersection areas) and find_jaccard_overlap (computes the union areas and the IoU).
def find_intersection(set_1, set_2):
    """
    Find the intersection area of every box combination between two sets of boxes
    that are in boundary coordinates (x_min, y_min, x_max, y_max).
    :param set_1: set 1, a tensor of dimensions (n1, 4)
    :param set_2: set 2, a tensor of dimensions (n2, 4)
    :return: intersection areas of each box in set 1 with each box in set 2, a tensor of dimensions (n1, n2)
    """
    lower_bounds = torch.max(set_1[:, :2].unsqueeze(1), set_2[:, :2].unsqueeze(0))  # maximum of the top-left corners, (n1, n2, 2)
    upper_bounds = torch.min(set_1[:, 2:].unsqueeze(1), set_2[:, 2:].unsqueeze(0))  # minimum of the bottom-right corners, (n1, n2, 2)
    intersection = torch.clamp(upper_bounds - lower_bounds, min=0)  # width and height of the overlap; clamp keeps non-overlapping pairs at 0
    return intersection[:, :, 0] * intersection[:, :, 1]  # (n1, n2)
def find_jaccard_overlap(set_1, set_2):
    """
    Find the IoU (Jaccard overlap) of every box combination between two sets of boxes
    that are in boundary coordinates.
    :param set_1: set 1, a tensor of dimensions (n1, 4)
    :param set_2: set 2, a tensor of dimensions (n2, 4)
    :return: Jaccard overlap of each box in set 1 with each box in set 2, a tensor of dimensions (n1, n2)
    """
    intersection = find_intersection(set_1, set_2)  # (n1, n2)
    areas_set_1 = (set_1[:, 2] - set_1[:, 0]) * (set_1[:, 3] - set_1[:, 1])  # (n1)
    areas_set_2 = (set_2[:, 2] - set_2[:, 0]) * (set_2[:, 3] - set_2[:, 1])  # (n2)
    union = areas_set_1.unsqueeze(1) + areas_set_2.unsqueeze(0) - intersection  # (n1, n2)
    return intersection / union  # (n1, n2)
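A small hand-checkable example (box values made up for illustration): two 4x4 boxes that overlap in a 2x2 region have IoU = 4 / (16 + 16 - 4) ≈ 0.143.
red = torch.FloatTensor([[0., 0., 4., 4.]])
green = torch.FloatTensor([[2., 2., 6., 6.]])
print(find_jaccard_overlap(red, green))  # tensor([[0.1429]])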
2. The VOC Object Detection Dataset
2.1 An Overview of the VOC Dataset
The VOC dataset is one of the most commonly used benchmark datasets in object detection. This experiment uses the two most popular versions, VOC2007 and VOC2012, as the training and test data.
The classes of the VOC dataset can be grouped into 4 super-categories and 20 classes, as shown in the figure.
The basic statistics of the VOC dataset (numbers of images and objects) are shown in the figure below:
After downloading VOC2007 and VOC2012 from the official VOC website, extract them into the dataset folder.
The VOC dataset is organized into the following parts.
Folder | Description |
---|---|
JPEGImages | Stores all the images, named id.jpg |
Annotations | For each image id.jpg, an id.xml file stores the classes and bounding boxes of its objects |
ImageSets | Contains the three sub-folders Layout, Main and Segmentation. Layout holds the train, val, test and trainval splits; Main holds the train, val, test and trainval splits for each object class; Segmentation holds the train, val, test and trainval splits used for segmentation |
SegmentationClass and SegmentationObject | Used for image segmentation and not discussed further here. |
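As a quick sanity check (the paths below assume the two archives were extracted into dataset/VOC2007 and dataset/VOC2012; adjust them to your own layout), the following sketch verifies that the expected folders are present:
import os

for root in ['dataset/VOC2007', 'dataset/VOC2012']:   # assumed extraction paths
    for sub in ['JPEGImages', 'Annotations', 'ImageSets/Main']:
        path = os.path.join(root, sub)
        print(path, 'OK' if os.path.isdir(path) else 'MISSING')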
2.2 Processing the VOC Data
In this part, the image ids are first split into train and test according to ImageSets, and then each image in JPEGImages is matched with its objects in Annotations.
First, define the function parse_annotation, which reads the objects from an xml file.
import torch
import json
import os
import random
import xml.etree.ElementTree as ET
import torchvision.transforms.functional as FT

# GPU setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Label map
# voc_labels holds the names of the 20 object classes in the VOC dataset
voc_labels = ('aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable',
              'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor')
# label_map maps each class name to its index, e.g. {'aeroplane': 1, 'bicycle': 2, ...}
label_map = {k: v + 1 for v, k in enumerate(voc_labels)}
# Anything that does not belong to one of the 20 classes is treated as background, with index 0
label_map['background'] = 0
# Inverse mapping, {class index: class name}
rev_label_map = {v: k for k, v in label_map.items()}
# Parse an xml file and return the bounding boxes and classes of all the objects in the image,
# together with a flag indicating whether each object is marked as difficult
def parse_annotation(annotation_path):
    # Parse the xml
    tree = ET.parse(annotation_path)
    root = tree.getroot()
    boxes = list()          # bounding boxes
    labels = list()         # label of each bounding box
    difficulties = list()   # difficult flag of each bounding box
    # Iterate over all the objects in the xml file; each <object> element is one target
    for obj in root.iter('object'):
        # Extract the difficult flag, label and bbox of each object
        difficult = int(obj.find('difficult').text == '1')
        label = obj.find('name').text.lower().strip()
        if label not in label_map:
            continue
        bbox = obj.find('bndbox')
        xmin = int(bbox.find('xmin').text) - 1
        ymin = int(bbox.find('ymin').text) - 1
        xmax = int(bbox.find('xmax').text) - 1
        ymax = int(bbox.find('ymax').text) - 1
        # Store
        boxes.append([xmin, ymin, xmax, ymax])
        labels.append(label_map[label])
        difficulties.append(difficult)
    # Return a dictionary with the annotation information of this image
    return {'boxes': boxes, 'labels': labels, 'difficulties': difficulties}
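For example, parsing a single annotation file (the path below is only a placeholder; use any id that exists in your copy of the dataset) returns a dictionary with three parallel lists:
# Placeholder path - adjust to wherever VOC2007 was extracted
ann = parse_annotation('dataset/VOC2007/Annotations/000005.xml')
print(ann.keys())         # dict_keys(['boxes', 'labels', 'difficulties'])
print(len(ann['boxes']))  # number of objects annotated in this image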
Then define create_data_lists, which follows trainval.txt and test.txt under ImageSets/Main to save the image paths and their objects, in matching order, into four files: TRAIN_images.json, TRAIN_objects.json, TEST_images.json and TEST_objects.json.
def create_data_lists(voc07_path, voc12_path, output_folder):
    '''
    train_images stores the image paths
    train_objects stores one dictionary per image, containing boxes, labels and difficulties
    label_map maps each class name to its index
    :param voc07_path: path to the VOC2007 folder
    :param voc12_path: path to the VOC2012 folder
    :param output_folder: folder in which the json files are saved
    :return:
    '''
    voc07_path = os.path.abspath(voc07_path)
    voc12_path = os.path.abspath(voc12_path)
    train_images = list()
    train_objects = list()
    n_objects = 0
    # Training data
    for path in [voc07_path, voc12_path]:
        with open(os.path.join(path, 'ImageSets/Main/trainval.txt')) as f:
            ids = f.read().splitlines()  # split into lines
        for id in ids:
            objects = parse_annotation(os.path.join(path, 'Annotations', id + '.xml'))
            # objects is {'boxes': boxes, 'labels': labels, 'difficulties': difficulties}
            if len(objects['boxes']) == 0:
                continue
            n_objects += len(objects['boxes'])
            train_objects.append(objects)
            train_images.append(os.path.join(path, 'JPEGImages', id + '.jpg'))
    assert len(train_objects) == len(train_images)
    # Save to file
    with open(os.path.join(output_folder, 'TRAIN_images.json'), 'w') as j:
        json.dump(train_images, j)
    with open(os.path.join(output_folder, 'TRAIN_objects.json'), 'w') as j:
        json.dump(train_objects, j)
    with open(os.path.join(output_folder, 'label_map.json'), 'w') as j:
        json.dump(label_map, j)
    print('\nThere are %d training images containing a total of %d objects. Files have been saved to %s.' % (
        len(train_images), n_objects, os.path.abspath(output_folder)))
    # Same as for the training data: save the test image paths, annotation information
    # and label map as json files (see the comments above)
    # Test data
    test_images = list()
    test_objects = list()
    n_objects = 0
    # Find IDs of images in the test data
    with open(os.path.join(voc07_path, 'ImageSets/Main/test.txt')) as f:
        ids = f.read().splitlines()
    for id in ids:
        # Parse annotation's XML file
        objects = parse_annotation(os.path.join(voc07_path, 'Annotations', id + '.xml'))
        if len(objects['boxes']) == 0:
            continue
        test_objects.append(objects)
        n_objects += len(objects['boxes'])
        test_images.append(os.path.join(voc07_path, 'JPEGImages', id + '.jpg'))
    assert len(test_objects) == len(test_images)
    # Save to file
    with open(os.path.join(output_folder, 'TEST_images.json'), 'w') as j:
        json.dump(test_images, j)
    with open(os.path.join(output_folder, 'TEST_objects.json'), 'w') as j:
        json.dump(test_objects, j)
    print('\nThere are %d test images containing a total of %d objects. Files have been saved to %s.' % (
        len(test_images), n_objects, os.path.abspath(output_folder)))
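Calling the function then produces the json files. The VOC paths below are placeholders for wherever the archives were extracted, and the output folder is assumed to exist:
# Placeholder paths - adjust to your own directory layout
create_data_lists(voc07_path='dataset/VOC2007',
                  voc12_path='dataset/VOC2012',
                  output_folder='./dataset_json')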
2.3 Defining the PascalVOCDataset Class
PascalVOCDataset inherits from torch.utils.data.Dataset and overrides four methods: __init__, __getitem__, __len__ and collate_fn.
import torch
from torch.utils.data import Dataset
import json
import os
from PIL import Image

class PascalVOCDataset(Dataset):
    def __init__(self, data_folder, split, keep_difficult=False):
        self.split = split.upper()
        assert self.split in {'TRAIN', 'TEST'}
        self.data_folder = data_folder
        self.keep_difficult = keep_difficult
        with open(os.path.join(data_folder, self.split + '_images.json'), 'r') as j:
            self.images = json.load(j)
        with open(os.path.join(data_folder, self.split + '_objects.json'), 'r') as j:
            self.objects = json.load(j)
        assert len(self.images) == len(self.objects)

    # i is the index of the item
    def __getitem__(self, i):
        image = Image.open(self.images[i], mode='r')
        image = image.convert('RGB')
        objects = self.objects[i]
        boxes = torch.FloatTensor(objects['boxes'])                # (n_objects, 4)
        labels = torch.LongTensor(objects['labels'])               # (n_objects)
        difficulties = torch.ByteTensor(objects['difficulties'])   # (n_objects)
        # If difficult objects are not wanted (keep_difficult is False), discard them
        if not self.keep_difficult:
            boxes = boxes[1 - difficulties]   # difficulties is 0/1, so 1 - difficulties keeps the non-difficult objects
            labels = labels[1 - difficulties]
            difficulties = difficulties[1 - difficulties]
        image, boxes, labels, difficulties = transform(image, boxes, labels, difficulties, split=self.split)
        return image, boxes, labels, difficulties

    def __len__(self):
        return len(self.images)

    # Stack the individual images into one 4D tensor; boxes, labels and difficulties stay in lists,
    # because each image may contain a different number of objects
    def collate_fn(self, batch):
        images = list()
        boxes = list()
        labels = list()
        difficulties = list()
        for b in batch:
            images.append(b[0])
            boxes.append(b[1])
            labels.append(b[2])
            difficulties.append(b[3])
        images = torch.stack(images, dim=0)
        return images, boxes, labels, difficulties
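A custom collate_fn is needed because the default collation cannot stack a variable number of boxes per image. A minimal illustration (assuming the json files from section 2.2 already exist in ./dataset_json):
dataset = PascalVOCDataset(data_folder='./dataset_json', split='train', keep_difficult=False)  # assumed folder
image, boxes, labels, difficulties = dataset[0]
print(image.shape)   # torch.Size([3, 224, 224]) after the transform defined in section 2.4
print(boxes.shape)   # (n_objects, 4) - n_objects varies from image to image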
2.4 Data Augmentation
One important step in building the dataset was only mentioned in passing above: the transform operation (data augmentation).
image, boxes, labels, difficulties = transform(image, boxes, labels, difficulties, split=self.split)
Note that any augmentation that changes positions must also be applied consistently to the bounding boxes. This is why the data-processing part of a detection framework usually involves a fair amount of code and is prone to bugs. To keep things simple, only a few relatively simple augmentations are used here.
def expand(image, boxes, filler):
"""
Perform a zooming out operation by placing the image in a larger canvas of filler material.
Helps to learn to detect smaller objects.
:param image: image, a tensor of dimensions (3, original_h, original_w)
:param boxes: bounding boxes in boundary coordinates, a tensor of dimensions (n_objects, 4)
    :param filler: RGB values of the filler material, a list like [R, G, B]
:return: expanded image, updated bounding box coordinates
"""
# Calculate dimensions of proposed expanded (zoomed-out) image
original_h = image.size(1)
original_w = image.size(2)
max_scale = 4
scale = random.uniform(1, max_scale)
new_h = int(scale * original_h)
new_w = int(scale * original_w)
# Create such an image with the filler
filler = torch.FloatTensor(filler) # (3)
new_image = torch.ones((3, new_h, new_w), dtype=torch.float) * filler.unsqueeze(1).unsqueeze(1) # (3, new_h, new_w)
# Note - do not use expand() like new_image = filler.unsqueeze(1).unsqueeze(1).expand(3, new_h, new_w)
# because all expanded values will share the same memory, so changing one pixel will change all
# Place the original image at random coordinates in this new image (origin at top-left of image)
left = random.randint(0, new_w - original_w)
right = left + original_w
top = random.randint(0, new_h - original_h)
bottom = top + original_h
new_image[:, top:bottom, left:right] = image
# Adjust bounding boxes' coordinates accordingly
new_boxes = boxes + torch.FloatTensor([left, top, left, top]).unsqueeze(
0) # (n_objects, 4), n_objects is the no. of objects in this image
return new_image, new_boxes
def random_crop(image, boxes, labels, difficulties):
"""
Performs a random crop in the manner stated in the paper. Helps to learn to detect larger and partial objects.
Note that some objects may be cut out entirely.
Adapted from https://github.com/amdegroot/ssd.pytorch/blob/master/utils/augmentations.py
:param image: image, a tensor of dimensions (3, original_h, original_w)
:param boxes: bounding boxes in boundary coordinates, a tensor of dimensions (n_objects, 4)
:param labels: labels of objects, a tensor of dimensions (n_objects)
:param difficulties: difficulties of detection of these objects, a tensor of dimensions (n_objects)
:return: cropped image, updated bounding box coordinates, updated labels, updated difficulties
"""
original_h = image.size(1)
original_w = image.size(2)
# Keep choosing a minimum overlap until a successful crop is made
while True:
# Randomly draw the value for minimum overlap
min_overlap = random.choice([0., .1, .3, .5, .7, .9, None]) # 'None' refers to no cropping
# If not cropping
if min_overlap is None:
return image, boxes, labels, difficulties
# Try up to 50 times for this choice of minimum overlap
# This isn't mentioned in the paper, of course, but 50 is chosen in paper authors' original Caffe repo
max_trials = 50
for _ in range(max_trials):
# Crop dimensions must be in [0.3, 1] of original dimensions
# Note - it's [0.1, 1] in the paper, but actually [0.3, 1] in the authors' repo
min_scale = 0.3
scale_h = random.uniform(min_scale, 1)
scale_w = random.uniform(min_scale, 1)
new_h = int(scale_h * original_h)
new_w = int(scale_w * original_w)
# Aspect ratio has to be in [0.5, 2]
aspect_ratio = new_h / new_w
if not 0.5 < aspect_ratio < 2:
continue
# Crop coordinates (origin at top-left of image)
left = random.randint(0, original_w - new_w)
right = left + new_w
top = random.randint(0, original_h - new_h)
bottom = top + new_h
crop = torch.FloatTensor([left, top, right, bottom]) # (4)
# Calculate Jaccard overlap between the crop and the bounding boxes
overlap = find_jaccard_overlap(crop.unsqueeze(0),
boxes) # (1, n_objects), n_objects is the no. of objects in this image
overlap = overlap.squeeze(0) # (n_objects)
# If not a single bounding box has a Jaccard overlap of greater than the minimum, try again
if overlap.max().item() < min_overlap:
continue
# Crop image
new_image = image[:, top:bottom, left:right] # (3, new_h, new_w)
# Find centers of original bounding boxes
bb_centers = (boxes[:, :2] + boxes[:, 2:]) / 2. # (n_objects, 2)
# Find bounding boxes whose centers are in the crop
centers_in_crop = (bb_centers[:, 0] > left) * (bb_centers[:, 0] < right) * (bb_centers[:, 1] > top) * (
bb_centers[:, 1] < bottom) # (n_objects), a Torch uInt8/Byte tensor, can be used as a boolean index
# If not a single bounding box has its center in the crop, try again
if not centers_in_crop.any():
continue
# Discard bounding boxes that don't meet this criterion
new_boxes = boxes[centers_in_crop, :]
new_labels = labels[centers_in_crop]
new_difficulties = difficulties[centers_in_crop]
# Calculate bounding boxes' new coordinates in the crop
new_boxes[:, :2] = torch.max(new_boxes[:, :2], crop[:2]) # crop[:2] is [left, top]
new_boxes[:, :2] -= crop[:2]
new_boxes[:, 2:] = torch.min(new_boxes[:, 2:], crop[2:]) # crop[2:] is [right, bottom]
new_boxes[:, 2:] -= crop[:2]
return new_image, new_boxes, new_labels, new_difficulties
def flip(image, boxes):
"""
Flip image horizontally.
:param image: image, a PIL Image
:param boxes: bounding boxes in boundary coordinates, a tensor of dimensions (n_objects, 4)
:return: flipped image, updated bounding box coordinates
"""
# Flip image
new_image = FT.hflip(image)
# Flip boxes
new_boxes = boxes
new_boxes[:, 0] = image.width - boxes[:, 0] - 1
new_boxes[:, 2] = image.width - boxes[:, 2] - 1
new_boxes = new_boxes[:, [2, 1, 0, 3]]
return new_image, new_boxes
def resize(image, boxes, dims=(300, 300), return_percent_coords=True):
"""
Resize image.
For the SSD300, resize to (300, 300).
For our demo, resize to (224, 224).
Since percent/fractional coordinates are calculated for the bounding boxes (w.r.t image dimensions) in this process,
you may choose to retain them.
:param image: image, a PIL Image
:param boxes: bounding boxes in boundary coordinates, a tensor of dimensions (n_objects, 4)
:return: resized image, updated bounding box coordinates (or fractional coordinates, in which case they remain the same)
"""
# Resize image
new_image = FT.resize(image, dims)
# Resize bounding boxes
old_dims = torch.FloatTensor([image.width, image.height, image.width, image.height]).unsqueeze(0)
new_boxes = boxes / old_dims # percent coordinates
if not return_percent_coords:
new_dims = torch.FloatTensor([dims[1], dims[0], dims[1], dims[0]]).unsqueeze(0)
new_boxes = new_boxes * new_dims
return new_image, new_boxes
def photometric_distort(image):
"""
Distort brightness, contrast, saturation, and hue, each with a 50% chance, in random order.
:param image: image, a PIL Image
:return: distorted image
"""
new_image = image
distortions = [FT.adjust_brightness,
FT.adjust_contrast,
FT.adjust_saturation,
FT.adjust_hue]
random.shuffle(distortions)
for d in distortions:
if random.random() < 0.5:
            if d.__name__ == 'adjust_hue':
# Caffe repo uses a 'hue_delta' of 18 - we divide by 255 because PyTorch needs a normalized value
adjust_factor = random.uniform(-18 / 255., 18 / 255.)
else:
# Caffe repo uses 'lower' and 'upper' values of 0.5 and 1.5 for brightness, contrast, and saturation
adjust_factor = random.uniform(0.5, 1.5)
# Apply this distortion
new_image = d(new_image, adjust_factor)
return new_image
def transform(image, boxes, labels, difficulties, split):
"""
Apply the transformations above.
:param image: image, a PIL Image
:param boxes: bounding boxes in boundary coordinates, a tensor of dimensions (n_objects, 4)
:param labels: labels of objects, a tensor of dimensions (n_objects)
:param difficulties: difficulties of detection of these objects, a tensor of dimensions (n_objects)
:param split: one of 'TRAIN' or 'TEST', since different sets of transformations are applied
:return: transformed image, transformed bounding box coordinates, transformed labels, transformed difficulties
"""
assert split in {'TRAIN', 'TEST'}
# Mean and standard deviation of ImageNet data that our base VGG from torchvision was trained on
# see: https://pytorch.org/docs/stable/torchvision/models.html
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]
new_image = image
new_boxes = boxes
new_labels = labels
new_difficulties = difficulties
# Skip the following operations for evaluation/testing
if split == 'TRAIN':
# A series of photometric distortions in random order, each with 50% chance of occurrence, as in Caffe repo
new_image = photometric_distort(new_image)
# Convert PIL image to Torch tensor
new_image = FT.to_tensor(new_image)
# Expand image (zoom out) with a 50% chance - helpful for training detection of small objects
# Fill surrounding space with the mean of ImageNet data that our base VGG was trained on
if random.random() < 0.5:
new_image, new_boxes = expand(new_image, boxes, filler=mean)
# Randomly crop image (zoom in)
new_image, new_boxes, new_labels, new_difficulties = random_crop(new_image, new_boxes, new_labels,
new_difficulties)
# Convert Torch tensor to PIL image
new_image = FT.to_pil_image(new_image)
# Flip image with a 50% chance
if random.random() < 0.5:
new_image, new_boxes = flip(new_image, new_boxes)
# Resize image to (224, 224) - this also converts absolute boundary coordinates to their fractional form
new_image, new_boxes = resize(new_image, new_boxes, dims=(224, 224))
# Convert PIL image to Torch tensor
new_image = FT.to_tensor(new_image)
# Normalize by mean and standard deviation of ImageNet data that our base VGG was trained on
new_image = FT.normalize(new_image, mean=mean, std=std)
return new_image, new_boxes, new_labels, new_difficulties
2.5 Implementing the DataLoader
The four steps above turn the VOC data into a Dataset; next, PyTorch's DataLoader is used to read the data in batches.
import torch
from datasets import PascalVOCDataset
train_dataset = PascalVOCDataset(data_folder,
split='train',
keep_difficult=keep_difficult)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True,
collate_fn=train_dataset.collate_fn, num_workers=workers,
pin_memory=True)
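The snippet above assumes that data_folder, keep_difficult, batch_size and workers have been defined beforehand; the sketch below shows one possible set of placeholder values and how a single batch can then be read:
# Placeholder values - adjust to your own setup, and define them before building the dataset
data_folder = './dataset_json'   # folder containing the json files created in section 2.2
keep_difficult = False
batch_size = 8
workers = 4

# After building train_loader as above, each batch yields a stacked image tensor plus
# per-image lists of boxes, labels and difficulties (thanks to collate_fn)
images, boxes, labels, difficulties = next(iter(train_loader))
print(images.shape)   # torch.Size([8, 3, 224, 224])
print(len(boxes))     # 8 tensors, one of shape (n_objects_i, 4) per image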