【Data Loading: Using Dataset and DataLoader】

漫天飘雪13

Last modified 2022-11-16 11:20:18


Tags: pytorch, deep learning, artificial intelligence

First published 2022-02-08 10:09:13

Article link: https://blog.csdn.net/w1530/article/details/122783687


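This article walks through data loading in PyTorch: how to use the Dataset base class, how to write a custom Dataset for the SMS Spam Collection dataset, and how DataLoader provides batched loading, shuffling, and parallel loading with multiple worker processes to speed up the training of deep learning models.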


Data loading in PyTorch

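In deep learning, datasets are usually very large. It is not feasible to run the forward computation and backpropagation over all of the data at once, so we typically shuffle the whole dataset randomly, split it into batches, and also preprocess the data.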
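The shuffle, batch, and preprocess steps described above can be sketched in plain Python before bringing in any PyTorch classes. This is only an illustration of the idea, not how PyTorch implements it; the function name `iter_batches` and its parameters are hypothetical.

```python
import random


def iter_batches(samples, batch_size, shuffle=True, transform=None, seed=None):
    """Yield lists of at most `batch_size` samples.

    `iter_batches` is a hypothetical helper for illustration only.
    """
    indices = list(range(len(samples)))
    if shuffle:
        # Shuffle the index order rather than the data itself.
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch = [samples[i] for i in indices[start:start + batch_size]]
        if transform is not None:
            # Optional per-sample preprocessing.
            batch = [transform(x) for x in batch]
        yield batch


samples = list(range(10))
batches = list(iter_batches(samples, batch_size=4, seed=0))
scaled = list(iter_batches(samples, batch_size=4, shuffle=False,
                           transform=lambda x: x / 10))
```

With 10 samples and a batch size of 4, the last batch holds only the 2 leftover samples.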

1. Introduction to the Dataset base class

torch provides a base class for datasets, torch.utils.data.Dataset. By subclassing it, we can quickly implement data loading.
The source code of torch.utils.data.Dataset is as follows:

class Dataset(object):
    """An abstract class representing a Dataset.

    All other datasets should subclass it. All subclasses should override
    ``__len__``, that provides the size of the dataset, and ``__getitem__``,
    supporting integer indexing in range from 0 to len(self) exclusive.
    """

    def __getitem__(self, index):
        raise NotImplementedError

    def __len__(self):
        raise NotImplementedError

    def __add__(self, other):
        return ConcatDataset([self, other])
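
As the docstring says, a subclass should override `__len__` and `__getitem__`. Below is a minimal sketch of such a subclass, assuming `torch` is installed. The class `SquaresDataset` and its toy data are hypothetical and only illustrate the protocol.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i squared)."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        # Size of the dataset.
        return self.n

    def __getitem__(self, index):
        # Integer indexing in the range 0 <= index < len(self).
        if not 0 <= index < self.n:
            raise IndexError(index)
        x = torch.tensor(float(index))
        return x, x ** 2


dataset = SquaresDataset(5)
x, y = dataset[3]

# The inherited __add__ wraps both datasets in a ConcatDataset.
combined = dataset + SquaresDataset(3)
```

Here `combined` has 8 samples: indices 0 to 4 come from the first dataset and indices 5 to 7 map to indices 0 to 2 of the second.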