关于Pytorch中的采样
Pytorch Sampler
在训练神经网络时,如果数据量太大,无法一次性将数据放入到网络中进行训练,所以需要进行分批处理数据读取。这一个问题涉及到如何从数据集中进行读取数据的问题,PyTorch 框架提供了 Sampler 基类与多个子类实现不同方式的数据采样。
基类 Sampler
class Sampler(object):
r"""Base class for all Samplers.
Every Sampler subclass has to provide an :meth:`__iter__` method, providing a
way to iterate over indices of dataset elements, and a :meth:`__len__` method
that returns the length of the returned iterators.
.. note:: The :meth:`__len__` method isn't strictly required by
:class:`~torch.utils.data.DataLoader`, but is expected in any
calculation involving the length of a :class:`~torch.utils.data.DataLoader`.
"""
def __init__(self, data_source):
pass
def __iter__(self):
raise NotImplementedError
顺序采样 Sequential Sampler
class SequentialSampler(Sampler):
r"""Samples elements sequentially, always in the same order.
Arguments:
data_source (Dataset): dataset to sample from
"""
def __init__(self, data_source):
self.data_source = data_source
def __iter__(self):
return iter(range(len(self.data_source)))
def __len__(self):
return len(self.data_source)
顺序采样类并没有定义过多的方法,其中初始化方法仅仅需要一个 Dataset 类作为参数。
对于 len() 只负责返回数据源包含的数据个数, iter() 方法返回可迭代对象,这个可迭代对象是一个由 range 方法产生的顺序数值序列,也就是说迭代是按照顺序进行的。
每个 Epoch 包含很多 iteration,每个 Epoch 执行一次 iter() 函数,每个 iteration 执行一次可迭代对象的 next() 函数。
//测试
# 定义数据和对应的采样器
data = list([1, 2, 3, 4, 5])
seq_sampler = sampler.SequentialSampler(data_source=data)
# 迭代获取采样器生成的索引
for index in seq_sampler:
print("index: {}, data: {}".format(str(index), str(data[index])))
//输出
index: 0, data: 1
index: 1, data: 2
index: 2, data: 3
index: 3, data: 4
index: 4, data: 5
随机采样 Random Sampler
class RandomSampler(Sampler):
r"""Samples elements randomly. If without replacement, then sample from a shuffled dataset.
If with replacement, then user can specify :attr:`num_samples` to draw.
Arguments:
data_source (Dataset): dataset to sample from
replacement (bool): samples are drawn with replacement if ``True``, default=``False``
num_samples (int): number of samples to draw, default=`len(dataset)`. This argument
is supposed to be specified only when `replacement` is ``True``.
generator (Generator): Generator used in sampling.
"""
def __init__(self, data_source, replacement=False, num_samples=None, generator=None):
self.data_source = data_source
# 这个参数控制的应该是否重复采样
self.replacement =