At the heart of PyTorch's data loading capability is the torch.utils.data.DataLoader class. It represents a Python iterable over a dataset, with support for:
- map-style and iterable-style datasets
- customizing data loading order
- automatic batching
- single- and multi-process data loading
- automatic memory pinning
- Data Loading Order and Sampler
For iterable-style datasets, the data loading order is entirely controlled by the user-defined iterable.
For map-style datasets, a sampler specifies the sequence of keys/indices used when loading data.
By default, DataLoader automatically constructs a sequential or shuffled sampler based on its shuffle argument.
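A minimal sketch of how the shuffle argument maps to a sampler (the 4-element dataset is only illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, SequentialSampler, RandomSampler

# A tiny map-style dataset: 4 scalar samples indexed 0..3.
ds = TensorDataset(torch.arange(4))

# shuffle=False -> DataLoader builds a SequentialSampler internally.
loader = DataLoader(ds, shuffle=False)
assert isinstance(loader.sampler, SequentialSampler)
assert list(loader.sampler) == [0, 1, 2, 3]

# shuffle=True -> DataLoader builds a RandomSampler (a permutation of indices).
loader = DataLoader(ds, shuffle=True)
assert isinstance(loader.sampler, RandomSampler)
assert sorted(loader.sampler) == [0, 1, 2, 3]
```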
- automatic batching
When batch_size (default 1) is not None, the data loader yields batched samples instead of individual samples.
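A quick sketch of automatic batching (the dataset size and batch_size here are arbitrary choices):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 scalar samples; batch_size=4 groups them into batched tensors.
ds = TensorDataset(torch.arange(10, dtype=torch.float32))
loader = DataLoader(ds, batch_size=4)

batches = [b[0] for b in loader]
assert batches[0].shape == (4,)   # full batch
assert batches[-1].shape == (2,)  # final partial batch (drop_last defaults to False)
```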
Disable automatic batching: When both batch_size and batch_sampler are None (the default value for batch_sampler is already None), automatic batching is disabled. Each sample obtained from the dataset is processed with the function passed as the collate_fn argument.
When automatic batching is disabled, the default collate_fn simply converts NumPy arrays into PyTorch Tensors, and keeps everything else untouched.
- single- and multi-process data loading
Setting num_workers to a positive integer turns on multi-process data loading with that many worker processes. In this mode, the worker processes are created when the DataLoader's iterator is created; at that point, dataset, collate_fn, and worker_init_fn are passed to each worker, where they are used to initialize and fetch data.
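A minimal multi-process loading sketch (num_workers=2 is an arbitrary choice, and the worker_init function below is only an illustrative worker_init_fn):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(8))

def worker_init(worker_id):
    # Called once in each worker process right after it is created;
    # a typical use is seeding per-worker randomness.
    torch.manual_seed(1234 + worker_id)

# Two worker processes are spawned when the iterator is created;
# dataset, collate_fn (the default here), and worker_init_fn are
# passed to each of them.
loader = DataLoader(ds, batch_size=4, num_workers=2, worker_init_fn=worker_init)

total = sum(batch[0].sum().item() for batch in loader)
assert total == sum(range(8))  # every sample is delivered exactly once
```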
For map-style datasets, the main process generates the indices using sampler and sends them to the workers. So any shuffle randomization is done in the main process, which guides loading by assigning indices to load.
- memory pinning
Host-to-GPU copies are much faster when they originate from pinned (page-locked) memory.
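A sketch of enabling pinned memory in a loader (the non_blocking copy is the usual companion; the CUDA branch only runs when a GPU is present):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(16, 3))

# pin_memory=True makes the loader place fetched batches in page-locked
# host memory, so host-to-GPU copies can be faster and asynchronous.
loader = DataLoader(ds, batch_size=8, pin_memory=True)

for (batch,) in loader:
    if torch.cuda.is_available():
        assert batch.is_pinned()
        # A pinned source allows an asynchronous host-to-device copy.
        batch = batch.to("cuda", non_blocking=True)
```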