Learning PyTorch 6: Using the Datasets in torchvision


Study notes for the Bilibili video series by 小土堆

1. Using the datasets in torchvision

Official documentation

Note the version selector in the top-left corner of the docs page.

https://pytorch.org/vision/0.9/

Note 1: don't forget the parentheses when instantiating ToTensor

Don't forget the parentheses when instantiating ToTensor(); otherwise, indexing into the dataset later will raise an error.
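The pitfall can be reproduced without torchvision at all. In the sketch below, the hypothetical `Doubler` class stands in for `ToTensor`: passing the class itself (no parentheses) into a transform pipeline means the pipeline later "calls" the class on an image, which tries to construct an object and fails.

```python
class Doubler:
    """A stand-in for a transform class such as torchvision's ToTensor."""
    def __call__(self, x):
        return 2 * x

# Correct: instantiate first (note the parentheses), then call on the data.
result = Doubler()(3)
print(result)  # 6

# Forgetting the parentheses passes the class itself. A pipeline like
# Compose([Doubler]) would later execute Doubler(img), i.e. try to construct
# a Doubler from the image, which raises TypeError -- the same failure mode
# as Compose([transforms.ToTensor]).
error_seen = False
try:
    Doubler(3)
except TypeError:
    error_seen = True
print(error_seen)  # True
```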

Note 2: download can stay set to True

download can stay True: once the dataset has been downloaded into the target directory, the code will not download it again. You can also place a pre-downloaded dataset archive into that directory yourself, and the code will extract it automatically.

Code

from torch.utils.tensorboard import SummaryWriter
from torchvision import datasets, transforms

# Usage 1
# If the download is slow, you can fetch the archive with a download manager such as Thunder (it pulls from multiple sources, so it is faster): https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
train_set = datasets.CIFAR10(root='./dataset', train=True, download=True)
test_set = datasets.CIFAR10(root='./dataset', train=False, download=True)
# The downloaded dataset holds PIL images; you can inspect the data in a debugger
print(test_set[0])  # __getitem__ returns (img, target)
print(type(test_set[0]))
img, target = test_set[0]
print(target)
print(test_set.classes[target])
print(img)
# A PIL image can be displayed directly with its show() method
img.show()

# Usage 2
# Apply transforms to every sample in the dataset, producing tensor data
# trans_compose = transforms.Compose([transforms.ToTensor])  # wrong: missing parentheses, fails later
trans_compose = transforms.Compose([transforms.ToTensor()])
train_set2 = datasets.CIFAR10(root='./dataset', train=True, transform=trans_compose, download=True)
test_set2 = datasets.CIFAR10(root='./dataset', train=False, transform=trans_compose, download=True)
print(type(test_set2[2]))
img, target = test_set2[0]
print(target)
print(test_set2.classes[target])
print(type(img))
writer = SummaryWriter("logs")
for i in range(10):
    img_tensor, target = test_set2[i]
    writer.add_image('tensor dataset', img_tensor, i)
writer.close()

Output

> p11_torchvision_dataset.py
Files already downloaded and verified
Files already downloaded and verified
(<PIL.Image.Image image mode=RGB size=32x32 at 0x1CF47DA9E20>, 3)
<class 'tuple'>
3
cat
<PIL.Image.Image image mode=RGB size=32x32 at 0x1CF47DA9E20>
Files already downloaded and verified
Files already downloaded and verified
<class 'tuple'>
3
cat
<class 'torch.Tensor'>

Process finished with exit code 0
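The `test_set[0]` indexing above works because CIFAR10 is a map-style dataset: it implements `__getitem__` (returning a `(sample, target)` pair, with any transform applied on access) and `__len__`. A minimal sketch of that contract, using a hypothetical `TinySet` class with toy numeric data in place of images:

```python
from torch.utils.data import Dataset

class TinySet(Dataset):
    """A hypothetical map-style dataset mimicking CIFAR10's (sample, target) pairs."""
    def __init__(self, samples, targets, transform=None):
        self.samples = samples
        self.targets = targets
        self.transform = transform  # applied per-sample on access, like CIFAR10

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        if self.transform is not None:
            sample = self.transform(sample)
        return sample, self.targets[idx]

ds = TinySet(samples=[10, 20, 30], targets=[0, 1, 1])
print(ds[0])    # (10, 0)
print(len(ds))  # 3
```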

2. Using DataLoader


Official documentation

https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader

Parameter notes

Parameters:

  • dataset (Dataset) – dataset from which to load the data.
    Which dataset to load from

  • batch_size (int, optional) – how many samples per batch to load (default: 1).
    How many samples are fetched per batch

  • shuffle (bool, optional) – set to True to have the data reshuffled at every epoch (default: False).
    When False, the data comes back in the same order every epoch (as if a random seed were fixed); when True, the order differs between epochs

  • sampler (Sampler or Iterable, optional) – defines the strategy to draw samples from the dataset. Can be any Iterable with len implemented. If specified, shuffle must not be specified.

  • batch_sampler (Sampler or Iterable, optional) – like sampler, but returns a batch of indices at a time. Mutually exclusive with batch_size, shuffle, sampler, and drop_last.

  • num_workers (int, optional) – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 0)
    Number of subprocesses used to fetch data; 0 means loading happens in the main process. On Windows, a nonzero value may trigger the BrokenPipeError described below

  • collate_fn (Callable, optional) – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.

  • pin_memory (bool, optional) – If True, the data loader will copy Tensors into device/CUDA pinned memory before returning them. If your data elements are a custom type, or your collate_fn returns a batch that is a custom type, see the example below.

  • drop_last (bool, optional) – set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)
    When the dataset size is not evenly divisible by batch_size, True drops the leftover samples (the remainder is discarded from training), while False keeps the final, smaller batch. Defaults to False

  • timeout (numeric, optional) – if positive, the timeout value for collecting a batch from workers. Should always be non-negative. (default: 0)

  • worker_init_fn (Callable, optional) – If not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading. (default: None)

  • generator (torch.Generator, optional) – If not None, this RNG will be used by RandomSampler to generate random indexes and multiprocessing to generate base_seed for workers. (default: None)

  • prefetch_factor (int, optional, keyword-only arg) – Number of batches loaded in advance by each worker. 2 means there will be a total of 2 * num_workers batches prefetched across all workers. (default value depends on the set value for num_workers. If value of num_workers=0 default is None. Otherwise if value of num_workers>0 default is 2).

  • persistent_workers (bool, optional) – If True, the data loader will not shutdown the worker processes after a dataset has been consumed once. This allows to maintain the workers Dataset instances alive. (default: False)

  • pin_memory_device (str, optional) – the data loader will copy Tensors into device pinned memory before returning them if pin_memory is set to true.
When num_workers > 0 on Windows, you may hit a BrokenPipeError; if so, try setting num_workers back to 0.

https://blog.csdn.net/Ginomica_xyx/article/details/113745596
https://www.ngui.cc/el/1916356.html?action=onClick
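The interaction of batch_size and drop_last can be checked on a toy dataset. This sketch uses a hypothetical 10-element TensorDataset, so batch_size=3 leaves a remainder of 1:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# A 10-sample toy dataset: 10 / batch_size=3 leaves one leftover sample.
ds = TensorDataset(torch.arange(10))

# drop_last=False (the default): the last, smaller batch is kept.
sizes_keep = [len(batch[0]) for batch in DataLoader(ds, batch_size=3, drop_last=False)]
print(sizes_keep)  # [3, 3, 3, 1]

# drop_last=True: the incomplete final batch is discarded.
sizes_drop = [len(batch[0]) for batch in DataLoader(ds, batch_size=3, drop_last=True)]
print(sizes_drop)  # [3, 3, 3]
```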


What a DataLoader yields: each iteration returns two things, a batch of images and a batch of targets.

When fetching data, the default sampler is used: SequentialSampler when shuffle=False, RandomSampler when shuffle=True.
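The sampler choice can be verified by inspecting the loader's sampler attribute. A minimal sketch on a toy dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(8))

# shuffle=False -> SequentialSampler: same order every epoch.
seq_loader = DataLoader(ds, batch_size=4, shuffle=False)
print(type(seq_loader.sampler).__name__)  # SequentialSampler

# shuffle=True -> RandomSampler: order is reshuffled each epoch.
rnd_loader = DataLoader(ds, batch_size=4, shuffle=True)
print(type(rnd_loader.sampler).__name__)  # RandomSampler
```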

Code

from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
from torchvision import datasets
from torchvision import transforms


test_set = datasets.CIFAR10(root='./dataset', train=False, transform=transforms.ToTensor(), download=True)
"""
注意调整的参数:
1. batch_size 一次拿多少张图片
2. shuffle  两个epoch数据顺序是否一致 false不打乱,顺序一致
3. drop_last 除不尽batch_size余下的样本是否丢弃不处理, false 剩余样本不丢弃,数据一样处理
"""
# data = DataLoader(dataset=test_set, batch_size=4, shuffle=False, num_workers=0, drop_last=False)
data = DataLoader(dataset=test_set, batch_size=64, shuffle=False, num_workers=0, drop_last=False)
# data = DataLoader(dataset=test_set, batch_size=64, shuffle=False, num_workers=0, drop_last=True)
# data = DataLoader(dataset=test_set, batch_size=64, shuffle=True, num_workers=0, drop_last=True)

writer = SummaryWriter("logs")
for epoch in range(2):
    step = 0
    for one in data:
        imgs, targets = one
        # print(imgs.shape)
        # print(targets)
        """add_images 可以在同一张画布上展示多张图片, imgs有多少张展示多少张"""
        writer.add_images("step_dropFalse: {}".format(epoch), imgs, step)
        # writer.add_images("step_dropTrue: {}".format(epoch), imgs, step)
        # writer.add_images("step_dropTrue_shufTrue: {}".format(epoch), imgs, step)
        step += 1

writer.close()
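Note the shape convention behind the loop above: add_image takes a single image shaped (C, H, W), while add_images takes a batch shaped (N, C, H, W), which is exactly what iterating a DataLoader over tensor images yields. A quick shape check:

```python
import torch

# One CIFAR10-sized tensor image: (channels, height, width).
single = torch.rand(3, 32, 32)

# Stacking four such images gives the batched (N, C, H, W) layout
# that add_images expects.
batch = torch.stack([single] * 4)

print(tuple(single.shape))  # (3, 32, 32)
print(tuple(batch.shape))   # (4, 3, 32, 32)
```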

Output

1. shuffle=False: no reshuffling; the data order is identical across epochs

2. shuffle=True: reshuffled; the data order differs across epochs

3. drop_last=False: leftover samples that do not fill a complete batch are kept
