Notes on the Dataset class
An abstract class representing a :class:`Dataset`.
All datasets that represent a map from keys to data samples should subclass
it. All subclasses should overwrite :meth:`__getitem__`, supporting fetching a
data sample for a given key. Subclasses could also optionally overwrite
:meth:`__len__`, which is expected to return the size of the dataset by many
:class:`~torch.utils.data.Sampler` implementations and the default options
of :class:`~torch.utils.data.DataLoader`.
.. note::
  :class:`~torch.utils.data.DataLoader` by default constructs an index
  sampler that yields integral indices. To make it work with a map-style
  dataset with non-integral indices/keys, a custom sampler must be provided.
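To illustrate that note, here is a minimal sketch of a map-style dataset whose keys are strings rather than integers, together with the custom sampler it needs. The names KeyedDataset and KeySampler are hypothetical, not PyTorch APIs; only Dataset, Sampler, and DataLoader come from the library.

```python
import torch
from torch.utils.data import Dataset, DataLoader, Sampler

class KeyedDataset(Dataset):
    """Map-style dataset keyed by strings (hypothetical example)."""
    def __init__(self, data):
        self.data = data            # dict: str key -> tensor sample
        self.keys = list(data)
    def __getitem__(self, key):     # fetch a sample by its string key
        return self.data[key]
    def __len__(self):
        return len(self.data)

class KeySampler(Sampler):
    """Yields string keys instead of the default integral indices."""
    def __init__(self, keys):
        self.keys = keys
    def __iter__(self):
        return iter(self.keys)
    def __len__(self):
        return len(self.keys)

data = {"a": torch.zeros(3), "b": torch.ones(3)}
ds = KeyedDataset(data)
# Without sampler=..., DataLoader's default sampler would call ds[0], which fails.
loader = DataLoader(ds, batch_size=2, sampler=KeySampler(ds.keys))
for batch in loader:
    print(batch.shape)  # torch.Size([2, 3])
```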
The passage above is from the official PyTorch documentation. It says that Dataset is an abstract class: before writing your own dataset you must subclass it. A custom Dataset must override __getitem__, and in practice should also override __len__ (which many samplers and the DataLoader defaults expect).
__getitem__: fetches the data sample at a given index
__len__: returns the size of the dataset
import os
from PIL import Image
from torch.utils.data import Dataset

class MyData(Dataset):
    def __init__(self, root_dir, label_dir):
        self.root_dir = root_dir
        self.label_dir = label_dir
        self.path = os.path.join(self.root_dir, self.label_dir)
        self.img_path = os.listdir(self.path)

    def __getitem__(self, index):
        image_name = self.img_path[index]
        image_item_path = os.path.join(self.root_dir, self.label_dir, image_name)
        img = Image.open(image_item_path)
        label = self.label_dir  # the folder name serves as the label
        return img, label

    def __len__(self):
        return len(self.img_path)
root_dir = r"../Pytorch学习/dataset/train"
ants_label_dir = "ants"
bees_label_dir = "bees"
ants_data = MyData(root_dir=root_dir, label_dir=ants_label_dir)
print(len(ants_data))
bees_data = MyData(root_dir=root_dir, label_dir=bees_label_dir)
print(len(bees_data))
# Dataset overloads "+", so the training set becomes the concatenation of the two smaller datasets
train_data = ants_data + bees_data
print(len(train_data))
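The `+` above works because Dataset implements __add__, which returns a torch.utils.data.ConcatDataset. A small sketch with TensorDataset (the toy data is assumed for illustration) shows that lengths add up and indexing runs past the boundary of the first dataset:

```python
import torch
from torch.utils.data import TensorDataset, ConcatDataset

a = TensorDataset(torch.zeros(3, 2))   # 3 samples of zeros
b = TensorDataset(torch.ones(5, 2))    # 5 samples of ones

combined = a + b                       # equivalent to ConcatDataset([a, b])
print(type(combined).__name__)         # ConcatDataset
print(len(combined))                   # 8
print(combined[3][0])                  # index 3 falls past a, so this is b's first sample
```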
The DataLoader class is another key piece when writing a neural network. In every epoch the whole dataset is iterated over once: a DataLoader wraps a Dataset and yields it batch by batch (batch_size samples at a time), and for each batch the training loop follows this template:
start training --> zero the gradients --> compute predictions --> compute the loss --> backpropagate --> next iteration
In pseudocode:
for epoch in range(epochs):
    # track loss, accuracy, and other metrics
    for batch in data_loader:
        # compute predictions, compute the loss, backpropagate
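The template above can be fleshed out as a runnable toy loop. The linear model, MSE loss, SGD optimizer, and random data here are assumptions chosen for illustration, not part of the original notes:

```python
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical toy data: 20 samples with 4 features, regression target.
X = torch.randn(20, 4)
y = torch.randn(20, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=5, shuffle=True)

model = nn.Linear(4, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(3):
    running_loss = 0.0
    for inputs, targets in loader:      # one iteration per batch
        optimizer.zero_grad()           # zero the gradients
        preds = model(inputs)           # compute predictions
        loss = loss_fn(preds, targets)  # compute the loss
        loss.backward()                 # backpropagate
        optimizer.step()                # update parameters
        running_loss += loss.item()
    print(f"epoch {epoch}: mean loss {running_loss / len(loader):.4f}")
```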
import torchvision
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
from torchvision.datasets import CIFAR10

test_data = CIFAR10("../data", train=False, transform=torchvision.transforms.ToTensor(), download=True)
# batch_size=4: each batch contains 4 samples
# shuffle=True: samples are drawn in random order each epoch
test_loader = DataLoader(dataset=test_data, batch_size=4, shuffle=True, num_workers=0, drop_last=False)
writer = SummaryWriter("dataloader")
step = 0
for data in test_loader:
    imgs, targets = data
    writer.add_images("test_data", imgs, global_step=step)
    step += 1
writer.close()
# Run `tensorboard --logdir="dataloader"` in the terminal to view the results
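The effect of batch_size and drop_last in the loader above can be checked on a toy dataset. The 10-sample TensorDataset here is an assumption for illustration; with batch_size=4, drop_last controls whether the incomplete final batch of 2 is kept:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

ds = TensorDataset(torch.arange(10).float())  # 10 samples

# drop_last=False keeps the smaller final batch
loader = DataLoader(ds, batch_size=4, shuffle=False, drop_last=False)
sizes = [batch[0].shape[0] for batch in loader]
print(sizes)   # [4, 4, 2]

# drop_last=True discards the incomplete final batch
loader2 = DataLoader(ds, batch_size=4, shuffle=False, drop_last=True)
sizes2 = [batch[0].shape[0] for batch in loader2]
print(sizes2)  # [4, 4]
```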