At the heart of PyTorch's data loading capability is the torch.utils.data.DataLoader class. It represents a Python iterable over a dataset, with support for:
- map-style and iterable-style datasets
- customizing data loading order
- automatic batching
- single- and multi-process data loading
- automatic memory pinning
- Data Loading Order and Sampler
For iterable-style datasets, the data loading order is entirely controlled by the user-defined iterable.
For map-style datasets, a sampler specifies the sequence of keys/indices used when loading data.
By default, DataLoader automatically constructs a sequential or shuffled sampler based on its shuffle argument.
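A minimal sketch of how the shuffle argument maps to a sampler (the 4-element dataset is only illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, SequentialSampler, RandomSampler

# A tiny map-style dataset: 4 scalar samples indexed 0..3.
ds = TensorDataset(torch.arange(4))

# shuffle=False -> DataLoader builds a SequentialSampler internally.
loader = DataLoader(ds, shuffle=False)
assert isinstance(loader.sampler, SequentialSampler)
assert list(loader.sampler) == [0, 1, 2, 3]

# shuffle=True -> DataLoader builds a RandomSampler (a permutation of indices).
loader = DataLoader(ds, shuffle=True)
assert isinstance(loader.sampler, RandomSampler)
assert sorted(loader.sampler) == [0, 1, 2, 3]
```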
- automatic batching
When batch_size (default 1) is not None, the data loader yields batched samples instead of individual samples.
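A quick sketch of automatic batching (the dataset size and batch_size here are arbitrary choices):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 scalar samples; batch_size=4 groups them into batched tensors.
ds = TensorDataset(torch.arange(10, dtype=torch.float32))
loader = DataLoader(ds, batch_size=4)

batches = [b[0] for b in loader]
assert batches[0].shape == (4,)   # full batch
assert batches[-1].shape == (2,)  # final partial batch (drop_last defaults to False)
```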
Disable automatic batching: When both batch_size and batch_sampler are None (the default value for batch_sampler is already None), automatic batching is disabled. Each sample obtained from the dataset is processed with the function passed as the collate_fn argument.
When automatic batching is disabled, the default collate_fn simply converts NumPy arrays into PyTorch Tensors, and keeps everything else untouched.
- single- and multi-process data loading
Setting num_workers to a positive integer turns on multi-process data loading with that many worker processes. In this mode, the worker processes are created when the DataLoader's iterator is created; at that point, dataset, collate_fn, and worker_init_fn are passed to each worker, where they are used to initialize and fetch data.
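A minimal multi-process loading sketch (num_workers=2 is an arbitrary choice, and the worker_init function below is only an illustrative worker_init_fn):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(8))

def worker_init(worker_id):
    # Called once in each worker process right after it is created;
    # a typical use is seeding per-worker randomness.
    torch.manual_seed(1234 + worker_id)

# Two worker processes are spawned when the iterator is created;
# dataset, collate_fn (the default here), and worker_init_fn are
# passed to each of them.
loader = DataLoader(ds, batch_size=4, num_workers=2, worker_init_fn=worker_init)

total = sum(batch[0].sum().item() for batch in loader)
assert total == sum(range(8))  # every sample is delivered exactly once
```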
For map-style datasets, the main process generates the indices using sampler and sends them to the workers. So any shuffle randomization is done in the main process, which guides loading by assigning indices to load.
- memory pinning
Host-to-GPU copies are much faster when they originate from pinned (page-locked) memory.
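A sketch of enabling pinned memory in a loader (the non_blocking copy is the usual companion; the CUDA branch only runs when a GPU is present):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(16, 3))

# pin_memory=True makes the loader place fetched batches in page-locked
# host memory, so host-to-GPU copies can be faster and asynchronous.
loader = DataLoader(ds, batch_size=8, pin_memory=True)

for (batch,) in loader:
    if torch.cuda.is_available():
        assert batch.is_pinned()
        # A pinned source allows an asynchronous host-to-device copy.
        batch = batch.to("cuda", non_blocking=True)
```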