PyTorch Learning (1): Dataset

Article link: https://blog.csdn.net/qq_39324954/article/details/104766866

The Dataset class:
PyTorch reads images mainly through the Dataset class. Dataset is the base class for all datasets, and every dataset must inherit from it.
Source code:

class Dataset(object):
    """An abstract class representing a Dataset.

    All other datasets should subclass it. All subclasses should override
    ``__len__``, that provides the size of the dataset, and ``__getitem__``,
    supporting integer indexing in range from 0 to len(self) exclusive.
    """

    def __getitem__(self, index):
        raise NotImplementedError

    def __len__(self):
        raise NotImplementedError

    def __add__(self, other):
        return ConcatDataset([self, other])
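
The `__add__` method in the source above means that two datasets can be joined with `+`. Here is a minimal sketch of that behaviour, assuming a reasonably recent PyTorch install; `RangeDataset` is a hypothetical toy class invented for this illustration and is not part of PyTorch.

```python
from torch.utils.data import ConcatDataset, Dataset


class RangeDataset(Dataset):
    """Hypothetical toy dataset: yields the integers start, ..., stop - 1."""

    def __init__(self, start, stop):
        self.values = list(range(start, stop))

    def __getitem__(self, index):
        return self.values[index]

    def __len__(self):
        return len(self.values)


first = RangeDataset(0, 3)     # 0, 1, 2
second = RangeDataset(10, 12)  # 10, 11

# Dataset.__add__ wraps both operands in a ConcatDataset.
combined = first + second

print(type(combined).__name__)  # ConcatDataset
print(len(combined))            # 5
print([combined[i] for i in range(len(combined))])  # [0, 1, 2, 10, 11]
```

Indices past the end of the first dataset are routed to the second one, so the combined object behaves like a single, longer dataset.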
    

You write your own Dataset when the ones already provided cannot meet your actual needs. This is done by inheriting from torch.utils.data.Dataset; when subclassing, you need to define three methods:

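__init__: initializes the parameters needed to work with the dataset.
__getitem__: defines how a sample is fetched (reading the data, applying transforms, and so on). It supports indices from 0 to len(self) - 1. obj[index] is equivalent to obj.__getitem__(index).
__len__: returns the size of the dataset. len(obj) is equivalent to obj.__len__().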
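
The equivalences in the list above are ordinary Python behaviour rather than anything specific to PyTorch, so they can be checked without importing torch. A minimal sketch follows; `TinyDataset` and its `samples` argument are made-up names used only for illustration.

```python
class TinyDataset:
    """Hypothetical stand-in that implements the same three methods."""

    def __init__(self, samples):
        # __init__: store whatever is needed to locate the data.
        self.samples = list(samples)

    def __getitem__(self, index):
        # __getitem__: return exactly one sample for an integer index.
        return self.samples[index]

    def __len__(self):
        # __len__: report how many samples exist.
        return len(self.samples)


obj = TinyDataset(["a", "b", "c"])

# obj[index] is dispatched to obj.__getitem__(index).
print(obj[1] == obj.__getitem__(1))       # True
# len(obj) is dispatched to obj.__len__().
print(len(obj) == obj.__len__())          # True
# Valid indices run from 0 to len(obj) - 1.
print([obj[i] for i in range(len(obj))])  # ['a', 'b', 'c']
```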

Skeleton of a custom Dataset:

from torch.utils import data


class CustomDataset(data.Dataset):  # must inherit from data.Dataset
    def __init__(self):
        # TODO
        # 1. Initialize file path or list of file names.
        pass

    def __getitem__(self, index):
        # TODO
        # 1. Read one data from file (e.g. using numpy.fromfile, PIL.Image.open).
        # 2. Preprocess the data (e.g. torchvision.transforms).
        # 3. Return a data pair (e.g. image and label).
        # Note: step 1 reads ONE sample, not a whole batch.
        pass

    def __len__(self):
        # You should change 0 to the total size of your dataset.
        return 0
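
To make the skeleton concrete, here is one way it could be filled in. This is a sketch under assumptions: the data lives in memory rather than in image files, and `InMemoryDataset`, `features`, `labels` and `transform` are illustrative names, not part of PyTorch.

```python
import torch
from torch.utils import data


class InMemoryDataset(data.Dataset):
    """Hypothetical dataset holding (feature, label) pairs in memory."""

    def __init__(self, features, labels, transform=None):
        # Keep references to the data (a file-based dataset would
        # store a list of file paths here instead).
        assert len(features) == len(labels)
        self.features = features
        self.labels = labels
        self.transform = transform

    def __getitem__(self, index):
        # 1. Read ONE sample.
        x = self.features[index]
        y = self.labels[index]
        # 2. Preprocess it if a transform was supplied.
        if self.transform is not None:
            x = self.transform(x)
        # 3. Return a (data, label) pair.
        return x, y

    def __len__(self):
        return len(self.features)


features = torch.arange(12, dtype=torch.float32).reshape(6, 2)
labels = torch.tensor([0, 1, 0, 1, 0, 1])

dataset = InMemoryDataset(features, labels, transform=lambda x: x * 2)

x, y = dataset[3]
print(len(dataset))  # 6
print(x)             # tensor([12., 14.])
print(y)             # tensor(1)
```

For image data, the same structure would apply, with `__init__` collecting file names and `__getitem__` opening a single file per call.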

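# By contrast, DataLoader implements __len__ through its batch sampler,
# so len(loader) is the number of batches rather than the number of samples:
def __len__(self):
    return len(self.batch_sampler)

The complete workflow for creating and using a Dataset is as follows: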

1. First write your own Dataset class, then create a Dataset object from it.
2. Pass the Dataset object to DataLoader to create a DataLoader object.
3. Iterate over the DataLoader object to retrieve your data.
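
The three steps above can be sketched as follows. To keep the example short it uses `TensorDataset`, a ready-made Dataset from `torch.utils.data`, in place of a hand-written class; the tensor contents are arbitrary placeholder values.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Step 1: create a Dataset object (10 samples, 3 features each).
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
labels = torch.arange(10)
dataset = TensorDataset(features, labels)

# Step 2: hand the Dataset to a DataLoader.
loader = DataLoader(dataset, batch_size=4, shuffle=False)

# Step 3: iterate over the DataLoader to get batches.
batch_shapes = []
for batch_features, batch_labels in loader:
    batch_shapes.append((tuple(batch_features.shape), tuple(batch_labels.shape)))

# A single dataset item is one sample; the loader stacks samples into batches.
print(tuple(dataset[0][0].shape))  # (3,)
print(batch_shapes)
# [((4, 3), (4,)), ((4, 3), (4,)), ((2, 3), (2,))]
```

With 10 samples and a batch size of 4, the last batch holds only the 2 remaining samples.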

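When training a neural network, it is best to operate on one batch of data at a time, and the data also needs to be shuffled and loaded in parallel. PyTorch provides DataLoader to handle these tasks. Dataset is only responsible for abstracting the data: one call to __getitem__ returns a single sample.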

DataLoader is defined as follows:
DataLoader(dataset, batch_size=1, shuffle=False, sampler=None, num_workers=0, collate_fn=default_collate, pin_memory=False, drop_last=False)

The parameters have the following meanings:

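dataset: the dataset to load, i.e. an instance of a Dataset subclass.
batch_size: the number of samples loaded in each training iteration.
shuffle: whether to shuffle the data.
sampler: the strategy used to draw samples from the dataset.
num_workers: the number of worker processes used for loading. 0 means no multiprocessing; using several workers can speed up data loading.
collate_fn: how to merge multiple samples into one batch. The default is usually sufficient.
pin_memory: whether to place the data (tensors) in pinned memory. Data in pinned memory transfers to the GPU somewhat faster.
drop_last: the number of samples in the dataset may not be a multiple of batch_size. With drop_last=True the final incomplete batch is discarded; with False it is kept.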
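
A few of these parameters are easiest to understand by watching their effect. The sketch below assumes a dataset of 10 integers; `to_int_list` is a hypothetical `collate_fn` written only for this example, and `num_workers` is left at 0 so everything runs in the main process.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10))

# drop_last: 10 samples with batch_size=4 leaves a final batch of 2.
keep = DataLoader(dataset, batch_size=4, drop_last=False)
drop = DataLoader(dataset, batch_size=4, drop_last=True)
print(len(keep), len(drop))  # 3 2

# shuffle: the order changes, but every sample is still visited once per epoch.
shuffled = DataLoader(dataset, batch_size=4, shuffle=True)
seen = sorted(int(v) for (batch,) in shuffled for v in batch)
print(seen == list(range(10)))  # True


def to_int_list(samples):
    """Hypothetical collate_fn: return a plain list instead of a stacked tensor."""
    # Each sample from this TensorDataset is a 1-tuple holding a 0-d tensor.
    return [int(sample[0]) for sample in samples]


custom = DataLoader(dataset, batch_size=4, collate_fn=to_int_list)
custom_batches = list(custom)
print(custom_batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

A custom `collate_fn` is mainly useful when samples cannot simply be stacked, for example when they have different lengths.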