Data Loading and Processing
Author: Sasank Chilamkurthy <https://chsasank.github.io>
A lot of effort in solving any machine learning problem goes into
preparing the data. PyTorch provides many tools to make data loading
easy and, hopefully, to make your code more readable. In this tutorial,
we will see how to load and preprocess/augment data from a non-trivial
dataset.
To run this tutorial, please make sure the following packages are
installed:
- scikit-image: for image I/O and transforms
- pandas: for easier CSV parsing
from __future__ import print_function, division
import os
import torch
import pandas as pd
from skimage import io, transform
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, utils
# Ignore warnings
import warnings
warnings.filterwarnings("ignore")
plt.ion() # interactive mode
The dataset we are going to deal with is that of facial pose.
This means that a face is annotated like this:
[Figure: a face annotated with its landmark points (landmarked_face2.png)]
Overall, 68 different landmark points are annotated for each face.
Note
Download the dataset from `here
The dataset comes with a CSV file of annotations, which looks like this:
image_name,part_0_x,part_0_y,part_1_x,part_1_y,part_2_x, ... ,part_67_x,part_67_y
0805personali01.jpg,27,83,27,98, ... 84,134
1084239450_e76e00b7e7.jpg,70,236,71,257, ... ,128,312
Let's read the annotations from the CSV file and store them in an array of shape (N, 2), where N is the number of landmarks.
landmarks_frame = pd.read_csv('F:/工作学习/编程与操作系统/Pytorch/datasets/faces/faces/face_landmarks.csv')
n = 65
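img_name = landmarks_frame.iloc[n, 0]  # column 0 holds the image file name
# Convert the annotation columns to a NumPy array.
# (DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it.)
landmarks = landmarks_frame.iloc[n, 1:].to_numpy()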
landmarks = landmarks.astype('float').reshape(-1, 2)
print('Image name: {}'.format(img_name))
print('Landmarks shape: {}'.format(landmarks.shape))
print('First 4 Landmarks: {}'.format(landmarks[:4]))
Image name: person-7.jpg
Landmarks shape: (68, 2)
First 4 Landmarks: [[32. 65.]
[33. 76.]
[34. 86.]
[34. 97.]]
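The reshape step can be checked on a tiny in-memory table. This is a minimal sketch: the two-row CSV text and its file names are made up for illustration, and each image has only three landmarks instead of 68.

```python
import io

import pandas as pd

# Hypothetical miniature annotation file: 3 landmarks (x, y pairs) per image.
csv_text = (
    "image_name,part_0_x,part_0_y,part_1_x,part_1_y,part_2_x,part_2_y\n"
    "a.jpg,10,20,30,40,50,60\n"
    "b.jpg,1,2,3,4,5,6\n"
)
frame = pd.read_csv(io.StringIO(csv_text))

# Column 0 is the file name; the remaining columns are x0, y0, x1, y1, ...
flat = frame.iloc[0, 1:].to_numpy()

# reshape(-1, 2) pairs consecutive values, so each row becomes one (x, y) point.
landmarks = flat.astype('float').reshape(-1, 2)
print(landmarks.shape)
print(landmarks)
```

Because the columns alternate between x and y, no explicit loop is needed to pair them up.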
Let’s write a simple helper function to show an image and its landmarks
and use it to show a sample.
def show_landmarks(image, landmarks):
    """Show image with landmarks"""
    plt.imshow(image)
    # Draw the landmarks as a scatter plot on top of the image.
    plt.scatter(landmarks[:, 0], landmarks[:, 1], s=10, marker='o', c='g')
    plt.pause(0.001)  # pause a bit so that plots are updated

plt.figure()
image = io.imread(os.path.join('F:/工作学习/编程与操作系统/Pytorch/datasets/faces/faces/', img_name))
show_landmarks(image, landmarks)
Dataset class
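torch.utils.data.Dataset is an abstract class representing a dataset.
Your custom dataset should inherit from Dataset and override the
following two methods:
- __len__, so that len(dataset) returns the size of the dataset.
- __getitem__, to support indexing, so that dataset[i] returns the i-th sample.
Let's create a dataset class for our face landmarks dataset. We will
read the CSV file in __init__ but leave the reading of images to
__getitem__. This is memory efficient because the images are not all
loaded into memory at once; each one is read only when it is needed.
A sample of our dataset is a dict {'image': image, 'landmarks': landmarks}.
The dataset takes an optional argument transform, which determines the
processing applied to each sample. The use of transform is covered in
the next section.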
class FaceLandmarksDataset(Dataset):
    """Face Landmarks dataset."""

    def __init__(self, csv_file, root_dir, transform=None):
        """
        Args:
            csv_file (string): Path to the csv file with annotations.
            root_dir (string): Directory with all the images.
            transform (callable, optional): Optional transform to be applied
                on a sample.
        """
        self.landmarks_frame = pd.read_csv(csv_file)  # read the CSV file
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.landmarks_frame)

    def __getitem__(self, idx):
        img_name = os.path.join(self.root_dir,
                                self.landmarks_frame.iloc[idx, 0])
        image = io.imread(img_name)  # read the image from disk
        # Read the annotations that belong to this image.
        # (to_numpy() replaces as_matrix(), which was removed in pandas 1.0.)
        landmarks = self.landmarks_frame.iloc[idx, 1:].to_numpy()
        landmarks = landmarks.astype('float').reshape(-1, 2)
        sample = {'image': image, 'landmarks': landmarks}
        # Apply the transform only if one was given.
        if self.transform:
            sample = self.transform(sample)
        return sample
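Let's instantiate this class and iterate through the data samples. We will print the sizes of the first four samples and show their landmarks.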
face_dataset = FaceLandmarksDataset(csv_file='F:/工作学习/编程与操作系统/Pytorch/datasets/faces/faces/face_landmarks.csv',
                                    root_dir='F:/工作学习/编程与操作系统/Pytorch/datasets/faces/faces/')
fig = plt.figure()
for i in range(len(face_dataset)):
    sample = face_dataset[i]  # indexing calls __getitem__
    print(i, sample['image'].shape, sample['landmarks'].shape)
    ax = plt.subplot(1, 4, i + 1)
    plt.tight_layout()
    ax.set_title('Sample #{}'.format(i))
    ax.axis('off')
    show_landmarks(**sample)
    if i == 3:
        plt.show()
        break
0 (324, 215, 3) (68, 2)
1 (500, 333, 3) (68, 2)
2 (250, 258, 3) (68, 2)
3 (434, 290, 3) (68, 2)
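The memory argument for reading images inside __getitem__ can be made visible with a toy dataset. In this rough sketch, LazySquaresDataset and its loaded list are hypothetical names, and a small array created on demand stands in for io.imread.

```python
import numpy as np
from torch.utils.data import Dataset


class LazySquaresDataset(Dataset):
    """Hypothetical dataset whose samples are built only when indexed."""

    def __init__(self, n_samples):
        self.n_samples = n_samples
        self.loaded = []  # records which indices were actually materialized

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        self.loaded.append(idx)
        # Stand-in for io.imread: create the "image" on demand.
        image = np.full((2, 2), idx, dtype=np.float32)
        landmarks = np.array([[idx, idx ** 2]], dtype=np.float32)
        return {'image': image, 'landmarks': landmarks}


dataset = LazySquaresDataset(1000)
sample = dataset[7]
print(len(dataset), dataset.loaded)
```

Although the dataset reports 1000 samples, only the one that was indexed has been created.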
Transforms
One issue we can see from the above is that the samples are not of the
same size. Most neural networks expect images of a fixed size.
Therefore, we will need to write some preprocessing code.
Let’s create three transforms:
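- Rescale: to change the size of the image
- RandomCrop: to crop the image at a random position (a form of data augmentation)
- ToTensor: to convert numpy images to torch images (the axes need to be swapped)
We will write them as callable classes instead of separate functions, so
that the parameters of a transform are passed once, when the object is
created, rather than on every call. For this, we only need to implement
the __call__ method and, if required, the __init__ method.
We can then use a transform like this:
tsfm = Transform(params)
transformed_sample = tsfm(sample)
The transforms have to be applied to both the image and the landmarks.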
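As a small illustration of the callable-class pattern, here is a sketch of a transform that is not part of this tutorial: HorizontalFlip and its parameter p are hypothetical. It shows why the image and the landmarks must change together, because mirroring the pixels without mirroring the x coordinates would leave the points on the wrong side of the face.

```python
import numpy as np


class HorizontalFlip(object):
    """Hypothetical transform: mirror the image and its landmarks left-right.

    Args:
        p (float): Probability of applying the flip.
    """

    def __init__(self, p=0.5):
        self.p = p

    def __call__(self, sample):
        image, landmarks = sample['image'], sample['landmarks']
        if np.random.rand() >= self.p:
            return sample  # leave the sample unchanged
        w = image.shape[1]
        flipped = image[:, ::-1].copy()  # reverse the column (x) axis
        mirrored = landmarks.copy()
        # Pixel-index convention (an assumption): column x maps to w - 1 - x.
        mirrored[:, 0] = (w - 1) - mirrored[:, 0]
        return {'image': flipped, 'landmarks': mirrored}


image = np.zeros((2, 4, 3))
image[1, 0] = 1.0                      # mark the pixel at x=0, y=1
sample = {'image': image, 'landmarks': np.array([[0.0, 1.0]])}

flip = HorizontalFlip(p=1.0)           # parameters are fixed once, here
out = flip(sample)                     # and are not repeated on each call
print(out['landmarks'])
```

With p=1.0 the flip is always applied, so the marked pixel and its landmark both move from x=0 to x=3.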
class Rescale(object):
    """Rescale the image in a sample to a given size.

    Args:
        output_size (tuple or int): Desired output size. If tuple, the output
            is matched to output_size. If int, the smaller of the image edges
            is matched to output_size, keeping the aspect ratio the same.
    """

    def __init__(self, output_size):
        assert isinstance(output_size, (int, tuple))
        self.output_size = output_size

    def __call__(self, sample):
        image, landmarks = sample['image'], sample['landmarks']
        h, w = image.shape[:2]
        if isinstance(self.output_size, int):
            # int: match the shorter side and keep the aspect ratio
            if h > w:
                new_h, new_w = self.output_size * h / w, self.output_size
            else:
                new_h, new_w = self.output_size, self.output_size * w / h
        else:
            # tuple: use the requested (height, width) directly
            new_h, new_w = self.output_size
        new_h, new_w = int(new_h), int(new_w)
        img = transform.resize(image, (new_h, new_w))
        # h and w are swapped for landmarks because for images,
        # x and y axes are axis 1 and 0 respectively
        landmarks = landmarks * [new_w / w, new_h / h]
        return {'image': img, 'landmarks': landmarks}
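The size arithmetic in Rescale is easy to check by hand. The sketch below copies only that arithmetic into a helper (rescaled_shape is a hypothetical name) and applies it to two of the image shapes printed earlier; it assumes Python 3 division.

```python
def rescaled_shape(h, w, output_size):
    """Return (new_h, new_w) following the same rule as Rescale.__call__."""
    if isinstance(output_size, int):
        if h > w:
            new_h, new_w = output_size * h / w, output_size
        else:
            new_h, new_w = output_size, output_size * w / h
    else:
        new_h, new_w = output_size
    return int(new_h), int(new_w)


# Image shapes taken from the samples printed earlier.
for h, w in [(324, 215), (250, 258)]:
    new_h, new_w = rescaled_shape(h, w, 256)
    # Landmarks are (x, y), so x scales with the width and y with the height.
    print((h, w), '->', (new_h, new_w), 'scale:', (new_w / w, new_h / h))
```

In both cases the shorter side becomes 256 and the longer side grows in proportion, truncated to an integer.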
class RandomCrop(object):
    """Crop randomly the image in a sample.

    Args:
        output_size (tuple or int): Desired output size. If int, square crop
            is made.
    """

    def __init__(self, output_size):
        assert isinstance(output_size, (int, tuple))
        if isinstance(output_size, int):
            self.output_size = (output_size, output_size)
        else:
            assert len(output_size) == 2
            self.output_size = output_size

    def __call__(self, sample):
        image, landmarks = sample['image'], sample['landmarks']
        h, w = image.shape[:2]
        new_h, new_w = self.output_size
        # Pick the top-left corner of the crop at random. randint excludes
        # the upper bound, so + 1 keeps an image that already has the crop
        # size from raising an error.
        top = np.random.randint(0, h - new_h + 1)
        left = np.random.randint(0, w - new_w + 1)
        image = image[top: top + new_h,
                      left: left + new_w]
        landmarks = landmarks - [left, top]
        return {'image': image, 'landmarks': landmarks}
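The crop offsets and the landmark shift can be tried on a small array. This is a sketch under a few assumptions: random_crop is a hypothetical free function, it draws the offsets from a seeded NumPy Generator instead of the global random state, and it uses an inclusive upper bound so that a crop of the full image size is allowed.

```python
import numpy as np


def random_crop(image, landmarks, new_h, new_w, rng):
    """Crop a (new_h, new_w) window and shift the landmarks to match."""
    h, w = image.shape[:2]
    # rng.integers excludes the upper bound, hence the + 1.
    top = int(rng.integers(0, h - new_h + 1))
    left = int(rng.integers(0, w - new_w + 1))
    cropped = image[top: top + new_h, left: left + new_w]
    # Landmarks are (x, y): subtract left from x and top from y.
    shifted = landmarks - [left, top]
    return cropped, shifted, (top, left)


rng = np.random.default_rng(0)          # seeded generator (an illustrative choice)
image = np.arange(6 * 8).reshape(6, 8)
landmarks = np.array([[5.0, 3.0], [2.0, 4.0]])

cropped, shifted, (top, left) = random_crop(image, landmarks, 4, 4, rng)
print(cropped.shape, top, left)

# Cropping to the full image size is valid and leaves everything in place.
same, unchanged, offset = random_crop(image, landmarks, 6, 8, rng)
print(offset)
```

Note that a landmark can end up with negative coordinates or outside the window if the crop does not cover it; the transform does not filter such points.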
class ToTensor(object):
    """Convert ndarrays in sample to Tensors."""

    def __call__(self, sample):
        image, landmarks = sample['image'], sample['landmarks']
        # swap color axis because
        # numpy image: H x W x C
        # torch image: C X H X W
        image = image.transpose((2, 0, 1))  # a transpose is all that is needed
        return {'image': torch.from_numpy(image),
                'landmarks': torch.from_numpy(landmarks)}
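The axis swap in ToTensor can be verified on a small fake image. The array contents below are arbitrary; the sketch only checks that the same pixel is reachable before and after the transpose.

```python
import numpy as np
import torch

# A fake H x W x C image: height 4, width 5, 3 color channels.
image = np.arange(4 * 5 * 3, dtype=np.float64).reshape(4, 5, 3)

# Move the channel axis to the front: (H, W, C) -> (C, H, W).
chw = image.transpose((2, 0, 1))
tensor = torch.from_numpy(chw)

print(image.shape, '->', tuple(tensor.shape))
# The same pixel is addressed as image[y, x, c] and tensor[c, y, x].
print(image[1, 2, 0], tensor[0, 1, 2].item())
```

torch.from_numpy keeps the dtype of the array, so a float64 image becomes a float64 tensor; a cast may be needed before feeding it to a model that expects float32.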
Compose transforms
Now, we apply the transforms on a sample.
Let’s say we want to rescale the shorter side of the image to 256 and
then randomly crop a square of size 224 from it, i.e., we want to compose
the Rescale and RandomCrop transforms.
torchvision.transforms.Compose is a simple callable class which allows us
to do this.
scale = Rescale(256)
crop = RandomCrop(128)
composed = transforms.Compose([Rescale(256),
                               RandomCrop(224)])
# Apply each of the above transforms on sample.
fig = plt.figure()
sample = face_dataset[65]
for i, tsfrm in enumerate([scale, crop, composed]):
    transformed_sample = tsfrm(sample)
    ax = plt.subplot(1, 3, i + 1)
    plt.tight_layout()
    ax.set_title(type(tsfrm).__name__)
    show_landmarks(**transformed_sample)
plt.show()
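Conceptually, composing transforms means calling them one after another, each on the output of the previous one. The class below is a hypothetical stand-in written for illustration, not torchvision's implementation, and it uses plain numbers instead of image samples.

```python
class SimpleCompose(object):
    """Hypothetical stand-in that chains callables the way Compose is used here."""

    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, sample):
        for t in self.transforms:
            sample = t(sample)  # the output of one transform feeds the next
        return sample


def add_one(x):
    return x + 1


def double(x):
    return x * 2


pipeline = SimpleCompose([add_one, double])
print(pipeline(3))            # (3 + 1) * 2

reversed_pipeline = SimpleCompose([double, add_one])
print(reversed_pipeline(3))   # 3 * 2 + 1
```

The order matters: rescaling before cropping yields a different image than cropping before rescaling, just as the two toy pipelines give different numbers.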
Iterating through the dataset
Let’s put this all together to create a dataset with composed
transforms.
To summarize, every time this dataset is sampled:
- An image is read from the file on the fly
- Transforms are applied on the read image
- Since one of the transforms is random, data is augmented on
  sampling
We can iterate over the created dataset with a for i in range loop as before.
transformed_dataset = FaceLandmarksDataset(csv_file='F:/工作学习/编程与操作系统/Pytorch/datasets/faces/faces/face_landmarks.csv',
                                           root_dir='F:/工作学习/编程与操作系统/Pytorch/datasets/faces/faces/',
                                           transform=transforms.Compose([
                                               Rescale(256),
                                               RandomCrop(224),
                                               ToTensor()
                                           ]))
for i in range(len(transformed_dataset)):
    sample = transformed_dataset[i]
    print(i, sample['image'].size(), sample['landmarks'].size())
    if i == 3:
        break
0 torch.Size([3, 224, 224]) torch.Size([68, 2])
1 torch.Size([3, 224, 224]) torch.Size([68, 2])
2 torch.Size([3, 224, 224]) torch.Size([68, 2])
3 torch.Size([3, 224, 224]) torch.Size([68, 2])
However, we are losing a lot of features by using a simple for loop to
iterate over the data. In particular, we are missing out on:
- Batching the data
- Shuffling the data
- Loading the data in parallel using multiprocessing workers
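To make the first two items concrete, here is a hand-rolled sketch of shuffling and batching over a list of equally shaped arrays. iterate_batches is a hypothetical helper, the seeded generator is an illustrative choice, and parallel loading is deliberately left out.

```python
import numpy as np


def iterate_batches(samples, batch_size, shuffle, rng):
    """Yield stacked batches from a list of equally shaped arrays."""
    order = np.arange(len(samples))
    if shuffle:
        rng.shuffle(order)              # visit the samples in random order
    for start in range(0, len(order), batch_size):
        idx = order[start: start + batch_size]
        # Stacking adds a leading batch dimension.
        yield np.stack([samples[i] for i in idx])


samples = [np.full((3, 2, 2), i) for i in range(10)]
rng = np.random.default_rng(0)
batches = list(iterate_batches(samples, batch_size=4, shuffle=True, rng=rng))
print([b.shape for b in batches])
```

Ten samples with a batch size of four give two full batches and one smaller final batch, and every sample appears exactly once per pass.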
torch.utils.data.DataLoader is an iterator which provides all these
features. Parameters used below should be clear. One parameter of
interest is collate_fn. You can specify how exactly the samples need
to be batched using collate_fn. However, the default collate should work
fine for most use cases.
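One case where a custom collate_fn helps is when samples cannot simply be stacked, for example when the number of landmarks differs between images. The sketch below is illustrative only: ToyLandmarks and collate_keep_lists are hypothetical names, and the variable landmark count is an assumption that does not apply to the face dataset used here, where every image has 68 points.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyLandmarks(Dataset):
    """Hypothetical dataset: fixed-size images, varying number of landmarks."""

    def __len__(self):
        return 6

    def __getitem__(self, idx):
        return {'image': torch.zeros(3, 8, 8),
                'landmarks': torch.ones(idx + 1, 2)}  # idx + 1 points


def collate_keep_lists(samples):
    """Stack the images; keep the unequal-length landmarks as a list."""
    return {'image': torch.stack([s['image'] for s in samples]),
            'landmarks': [s['landmarks'] for s in samples]}


loader = DataLoader(ToyLandmarks(), batch_size=3, shuffle=False,
                    num_workers=0, collate_fn=collate_keep_lists)
first = next(iter(loader))
print(first['image'].shape, [l.shape[0] for l in first['landmarks']])
```

The function receives a plain list of samples and may return any structure, so the images still form one tensor while the landmarks stay separate.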
dataloader = DataLoader(transformed_dataset, batch_size=4,
                        shuffle=True, num_workers=4)
# Helper function to show a batch
def show_landmarks_batch(sample_batched):
    """Show image with landmarks for a batch of samples."""
    images_batch, landmarks_batch = \
            sample_batched['image'], sample_batched['landmarks']
    batch_size = len(images_batch)
    im_size = images_batch.size(2)
    grid = utils.make_grid(images_batch)
    plt.imshow(grid.numpy().transpose((1, 2, 0)))
    for i in range(batch_size):
        plt.scatter(landmarks_batch[i, :, 0].numpy() + i * im_size,
                    landmarks_batch[i, :, 1].numpy(),
                    s=10, marker='.', c='r')
        plt.title('Batch from dataloader')
for i_batch, sample_batched in enumerate(dataloader):
    print(i_batch, sample_batched['image'].size(),
          sample_batched['landmarks'].size())
    # observe 4th batch and stop.
    if i_batch == 3:
        plt.figure()
        show_landmarks_batch(sample_batched)
        plt.axis('off')
        plt.ioff()
        plt.show()
        break
I have not got this part to run yet; the loop above fails with the following error:
---------------------------------------------------------------------------
BrokenPipeError Traceback (most recent call last)
<ipython-input-16-7777157b058e> in <module>()
20 plt.title('Batch from dataloader')
21
---> 22 for i_batch, sample_batched in enumerate(dataloader):
23 print(i_batch, sample_batched['image'].size(),
24 sample_batched['landmarks'].size())
E:\Anaconda\envs\python35\lib\site-packages\torch\utils\data\dataloader.py in __iter__(self)
499
500 def __iter__(self):
--> 501 return _DataLoaderIter(self)
502
503 def __len__(self):
E:\Anaconda\envs\python35\lib\site-packages\torch\utils\data\dataloader.py in __init__(self, loader)
287 for w in self.workers:
288 w.daemon = True # ensure that the worker exits on process exit
--> 289 w.start()
290
291 _update_worker_pids(id(self), tuple(w.pid for w in self.workers))
E:\Anaconda\envs\python35\lib\multiprocessing\process.py in start(self)
103 'daemonic processes are not allowed to have children'
104 _cleanup()
--> 105 self._popen = self._Popen(self)
106 self._sentinel = self._popen.sentinel
107 _children.add(self)
E:\Anaconda\envs\python35\lib\multiprocessing\context.py in _Popen(process_obj)
210 @staticmethod
211 def _Popen(process_obj):
--> 212 return _default_context.get_context().Process._Popen(process_obj)
213
214 class DefaultContext(BaseContext):
E:\Anaconda\envs\python35\lib\multiprocessing\context.py in _Popen(process_obj)
311 def _Popen(process_obj):
312 from .popen_spawn_win32 import Popen
--> 313 return Popen(process_obj)
314
315 class SpawnContext(BaseContext):
E:\Anaconda\envs\python35\lib\multiprocessing\popen_spawn_win32.py in __init__(self, process_obj)
64 try:
65 reduction.dump(prep_data, to_child)
---> 66 reduction.dump(process_obj, to_child)
67 finally:
68 context.set_spawning_popen(None)
E:\Anaconda\envs\python35\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
57 def dump(obj, file, protocol=None):
58 '''Replacement for pickle.dump() using ForkingPickler.'''
---> 59 ForkingPickler(file, protocol).dump(obj)
60
61 #
BrokenPipeError: [Errno 32] Broken pipe
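The traceback passes through popen_spawn_win32, so the failure happens while a worker process is being started on Windows. A commonly suggested way to narrow this down is to set num_workers=0 so that loading happens in the main process, and, when the code lives in a script, to put the iteration under an if __name__ == '__main__': guard. The sketch below shows that pattern on a stand-in TensorDataset. batch_shapes is a hypothetical helper, and whether this resolves the error above has not been verified here.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def batch_shapes(num_workers=0):
    """Iterate a small stand-in dataset and collect the batch shapes."""
    dataset = TensorDataset(torch.zeros(8, 3, 4, 4), torch.zeros(8, 68, 2))
    loader = DataLoader(dataset, batch_size=4, shuffle=True,
                        num_workers=num_workers)
    return [(tuple(images.shape), tuple(landmarks.shape))
            for images, landmarks in loader]


if __name__ == '__main__':
    # num_workers=0 loads data in the main process, so no worker
    # processes have to be spawned.
    print(batch_shapes(num_workers=0))
```

If the loop runs with num_workers=0, the dataset and transforms themselves are fine and the problem is limited to how the worker processes are launched.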
Afterword: torchvision
In this tutorial, we have seen how to write and use datasets, transforms
and dataloaders. The torchvision package provides some common datasets and
transforms. You might not even have to write custom classes. One of the
more generic datasets available in torchvision is ImageFolder.
It assumes that images are organized in the following way:
root/ants/xxx.png
root/ants/xxy.jpeg
root/ants/xxz.png
.
.
.
root/bees/123.jpg
root/bees/nsdf3.png
root/bees/asd932_.png
where ‘ants’, ‘bees’ etc. are class labels. Similarly, generic transforms
which operate on PIL.Image, like RandomHorizontalFlip and Resize (called
Scale in old torchvision releases), are also available. You can use these
to write a dataloader like this:
import torch
from torchvision import transforms, datasets
data_transform = transforms.Compose([
        # RandomResizedCrop was called RandomSizedCrop in old torchvision releases.
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ])
hymenoptera_dataset = datasets.ImageFolder(root='hymenoptera_data/train',
                                           transform=data_transform)
dataset_loader = torch.utils.data.DataLoader(hymenoptera_dataset,
                                             batch_size=4, shuffle=True,
                                             num_workers=4)
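The directory convention can be illustrated without torchvision. The sketch below builds a throwaway tree with two class folders and derives an integer label for each folder name in sorted order; folder_classes is a hypothetical helper that mimics the convention and is not torchvision's code.

```python
import os
import tempfile


def folder_classes(root):
    """Map each sub-directory name of root to an integer label (sorted order)."""
    classes = sorted(d for d in os.listdir(root)
                     if os.path.isdir(os.path.join(root, d)))
    return {name: index for index, name in enumerate(classes)}


# Build a throwaway directory tree shaped like the layout described above.
root = tempfile.mkdtemp()
for name in ('bees', 'ants'):
    os.makedirs(os.path.join(root, name))
    open(os.path.join(root, name, 'xxx.png'), 'w').close()

class_to_idx = folder_classes(root)
print(class_to_idx)
```

With a real ImageFolder, the corresponding mapping is available as the dataset's class_to_idx attribute.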