Datasets & DataLoader
PyTorch provides two data primitives, torch.utils.data.DataLoader and torch.utils.data.Dataset, that allow you to use pre-loaded datasets as well as your own data. Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.
Loading a Dataset
Here is an example of how to load the Fashion-MNIST dataset from TorchVision. Each sample is a 28×28 grayscale image with a label from one of 10 classes.
We load the FashionMNIST Dataset with the following parameters:
root is the path where the train/test data is stored,
train specifies training or test dataset,
download=True downloads the data from the internet if it's not available at root,
transform and target_transform specify the feature and label transformations.
import torch
from torch.utils.data import Dataset
from torchvision import datasets
from torchvision.transforms import ToTensor
import matplotlib.pyplot as plt
training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)
test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)
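The loading code above sets transform but leaves target_transform unused. As a minimal sketch of what a label transformation could look like, the hypothetical helper one_hot below (not part of the original notes) converts an integer label into a one-hot vector. It assumes the ten FashionMNIST classes and runs without downloading anything.

```python
import torch

NUM_CLASSES = 10  # FashionMNIST has ten classes (labels 0-9)


def one_hot(label: int) -> torch.Tensor:
    """Turn an integer class label into a one-hot float vector."""
    encoded = torch.zeros(NUM_CLASSES, dtype=torch.float)
    encoded[label] = 1.0
    return encoded


# Applied by hand here; as an assumption about usage, the same function
# could be passed to the dataset as target_transform=one_hot.
encoded = one_hot(3)
print(encoded)
```

With such a function passed as target_transform, every label returned by the dataset would be a vector of length 10 instead of a plain integer.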
Iterating and Visualizing the Dataset
We can index Datasets manually like a list: training_data[index]. We use matplotlib to visualize some samples from the training data.
labels_map = {
    0: "T-Shirt",
    1: "Trouser",
    2: "Pullover",
    3: "Dress",
    4: "Coat",
    5: "Sandal",
    6: "Shirt",
    7: "Sneaker",
    8: "Bag",
    9: "Ankle Boot",
}
figure = plt.figure(figsize=(8, 8))
cols, rows = 3, 3
for i in range(1, cols * rows + 1):
    sample_idx = torch.randint(len(training_data), size=(1,)).item()
    img, label = training_data[sample_idx]
    figure.add_subplot(rows, cols, i)
    plt.title(labels_map[label])
    plt.axis("off")
    plt.imshow(img.squeeze(), cmap="gray")
plt.show()
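To see what indexing returns without downloading FashionMNIST, the sketch below uses a small stand-in TensorDataset filled with random tensors. The names toy_data, images, and labels are assumptions for illustration. One difference from FashionMNIST is that the label comes back as a tensor here rather than a plain int.

```python
import torch
from torch.utils.data import TensorDataset

# Stand-in data (an assumption for illustration): 5 fake grayscale images
# shaped like FashionMNIST samples after ToTensor(), i.e. (1, 28, 28).
images = torch.rand(5, 1, 28, 28)
labels = torch.tensor([0, 1, 2, 3, 4])
toy_data = TensorDataset(images, labels)

print(len(toy_data))       # number of samples in the dataset
img, label = toy_data[2]   # indexing returns one (image, label) pair
print(img.shape, label)

squeezed = img.squeeze()   # drop the channel dimension, as done before imshow
print(squeezed.shape)
```

The squeeze() call is what lets plt.imshow treat the sample as a 2-D grayscale image.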
Creating a Custom Dataset for your files
import os
import pandas as pd
from torchvision.io import read_image
# Create your own local dataset.
# A custom dataset needs to subclass Dataset.
class CustomImageDataset(Dataset):
    def __init__(self, annotations_file, img_dir, transform=None, target_transform=None):
        # Annotations table: column 0 holds file names, column 1 holds labels
        self.img_labels = pd.read_csv(annotations_file)
        self.img_dir = img_dir
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.img_labels)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
        image = read_image(img_path)  # image file -> tensor
        label = self.img_labels.iloc[idx, 1]
        if self.transform:
            image = self.transform(image)
        if self.target_transform:
            label = self.target_transform(label)
        return image, label
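__init__
The __init__ function is run once when instantiating the Dataset object. It stores the annotations table, the image directory, and both transforms.
__len__
The __len__ function returns the number of samples in our dataset.
__getitem__
The __getitem__ function loads and returns a sample from the dataset at the given index idx. Based on the index, it builds the image path, reads the image into a tensor with read_image, looks up the label in self.img_labels, applies the transforms if any were given, and returns the image and label as a tuple.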
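The class above indexes the annotations table by position: column 0 for the file name and column 1 for the label. The sketch below shows that lookup in isolation. The CSV content, the column names file_name and label, and the images directory are all hypothetical. Note that pd.read_csv treats the first row as a header by default, so this sketch assumes the file has one; for a headerless file, header=None would be needed to keep the first sample.

```python
import io
import os

import pandas as pd

# Hypothetical annotations file: a header row, then one
# "file name,label" row per image.
csv_text = (
    "file_name,label\n"
    "tshirt1.jpg,0\n"
    "tshirt2.jpg,0\n"
    "ankleboot999.jpg,9\n"
)
img_labels = pd.read_csv(io.StringIO(csv_text))

img_dir = "images"  # hypothetical image directory
idx = 2
img_path = os.path.join(img_dir, img_labels.iloc[idx, 0])  # column 0: file name
label = img_labels.iloc[idx, 1]                            # column 1: class label

print(len(img_labels))   # the value __len__ would return
print(img_path, label)   # the path and label __getitem__ would use
```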
Preparing your data for training with DataLoaders
from torch.utils.data import DataLoader
train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=64, shuffle=True)
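How batch_size splits a dataset can be checked on a small stand-in dataset. The sketch below assumes 150 random samples (toy_data is a hypothetical name), so a batch size of 64 yields two full batches and one smaller final batch. It also shows the drop_last option, which the original notes do not use, for discarding that incomplete batch.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset (an assumption for illustration): 150 fake samples.
features = torch.rand(150, 1, 28, 28)
targets = torch.randint(0, 10, (150,))
toy_data = TensorDataset(features, targets)

loader = DataLoader(toy_data, batch_size=64, shuffle=True)
batch_sizes = [len(y) for _, y in loader]  # size of each batch in one pass
print(len(loader), batch_sizes)

# The number of batches is ceil(dataset size / batch size).
expected_batches = math.ceil(len(toy_data) / 64)

# drop_last=True discards the final incomplete batch.
full_only = DataLoader(toy_data, batch_size=64, shuffle=True, drop_last=True)
print(len(full_only))
```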
Iterate through the DataLoader
# Display image and label.
train_features, train_labels = next(iter(train_dataloader))
print(f"Feature batch shape: {train_features.size()}")
print(f"Labels batch shape: {train_labels.size()}")
img = train_features[0].squeeze()
label = train_labels[0]
plt.imshow(img, cmap="gray")
plt.show()
print(f"Label: {label}")
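The snippet above pulls a single batch with next(iter(...)). In training, the DataLoader is usually consumed in a for loop, once per epoch, and shuffle=True draws a new order on each pass. The sketch below illustrates this on a stand-in dataset whose features are just the sample indices. The names toy_data and epoch_orders and the seeded generator are assumptions for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset: the "feature" of each sample is its own index,
# which makes the visiting order easy to read.
toy_data = TensorDataset(torch.arange(10), torch.arange(10) % 2)

# A seeded generator makes the shuffling reproducible between runs.
generator = torch.Generator().manual_seed(0)
loader = DataLoader(toy_data, batch_size=4, shuffle=True, generator=generator)

epoch_orders = []
for epoch in range(2):
    seen = []
    for features, labels in loader:  # one iteration per batch
        seen.extend(features.tolist())
    epoch_orders.append(seen)
    print(f"epoch {epoch}: {seen}")
```

Each epoch visits every sample exactly once; only the order changes between passes.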