PyTorch DataLoader: custom Sampler

子燕若水


Article link: https://blog.csdn.net/u010087338/article/details/117927204

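This article shows how to write a custom Dataset and a custom Sampler in PyTorch. A basic code example defines a Sampler subclass named `IndependentHalvesSampler`, which splits the dataset indices into two halves, shuffles each half independently, and then iterates over the first half followed by the second half. Building a DataLoader with this Sampler shows the data being loaded in that order, which is a useful starting point for understanding and customizing sampling strategies.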

This article is summarized from:

https://www.scottcondron.com/jupyter/visualisation/audio/2020/12/02/dataloaders-samplers-collate.html#SequentialSampler

Basic code

from torch.utils.data import DataLoader, Dataset

# Build the data
xs = list(range(11))
ys = list(range(10, 21))
print('xs values: ', xs)
print('ys values: ', ys)

# Define a Dataset subclass
class MyDataset(Dataset):
    def __init__(self, xs, ys):
        self.xs = xs
        self.ys = ys

    def __getitem__(self, i):
        # Return the (x, y) pair stored at index i
        return self.xs[i], self.ys[i]

    def __len__(self):
        return len(self.xs)

# Instantiate the Dataset
dataset = MyDataset(xs, ys)
dataset[2]  # returns the tuple (xs[2], ys[2])

Custom Sampler

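Every custom Sampler subclass needs an `__iter__` method that yields dataset indices in the order they should be visited (in this example, every index exactly once), and a `__len__` method that returns the number of indices it yields (here, the size of the dataset).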
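To make the role of the two methods concrete, here is a minimal plain-Python sketch of how a loader consumes a sampler. It is a simplification, not the actual DataLoader implementation: `simple_loader` and `EvenIndicesFirstSampler` are hypothetical names introduced for illustration, and batching and collation are left out.

```python
class EvenIndicesFirstSampler:
    """Hypothetical sampler: yields all even indices, then all odd ones."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        evens = list(range(0, self.n, 2))
        odds = list(range(1, self.n, 2))
        return iter(evens + odds)

    def __len__(self):
        # Number of indices produced by one pass of __iter__
        return self.n


def simple_loader(dataset, sampler):
    """Simplified stand-in for a loader: no batching, no collation."""
    for idx in sampler:
        yield dataset[idx]


xs = list(range(11))
ys = list(range(10, 21))
pairs = list(zip(xs, ys))  # stands in for a dataset with __getitem__

sampler = EvenIndicesFirstSampler(len(pairs))
loaded = list(simple_loader(pairs, sampler))
print(loaded[:3])  # [(0, 10), (2, 12), (4, 14)]
```

The sampler only decides the order of the indices; reading the items is left to the dataset's `__getitem__`.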

import random
from torch.utils.data.sampler import Sampler

class IndependentHalvesSampler(Sampler):
    def __init__(self, dataset):
        halfway_point = len(dataset) // 2  # integer midpoint of the dataset
        self.first_half_indices = list(range(halfway_point))
        self.second_half_indices = list(range(halfway_point, len(dataset)))
        
    def __iter__(self):
        random.shuffle(self.first_half_indices)
        random.shuffle(self.second_half_indices)
        return iter(self.first_half_indices + self.second_half_indices)
    
    def __len__(self):
        return len(self.first_half_indices) + len(self.second_half_indices)


our_sampler = IndependentHalvesSampler(dataset)
print('First half indices: ', our_sampler.first_half_indices)
print('Second half indices:', our_sampler.second_half_indices)

dl = DataLoader(dataset, sampler=our_sampler, batch_size=1)

for data in dl:
    print(data)

[tensor([1]), tensor([11])]
[tensor([4]), tensor([14])]
[tensor([3]), tensor([13])]
[tensor([0]), tensor([10])]
[tensor([2]), tensor([12])]
[tensor([7]), tensor([17])]
[tensor([8]), tensor([18])]
[tensor([9]), tensor([19])]
[tensor([5]), tensor([15])]
[tensor([6]), tensor([16])]
[tensor([10]), tensor([20])]
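In the output above, indices 0 to 4 appear in shuffled order before indices 5 to 10. Because the dataset has 11 items, the first half holds 5 indices and the second half holds 6. The sketch below checks this property over several epochs using a plain-Python copy of the sampler's ordering logic. `independent_halves_order` is a hypothetical helper written for this check, and the seeded `random.Random` is only there to make runs repeatable.

```python
import random


def independent_halves_order(n, rng):
    """Return one epoch's index order: shuffled first half, then shuffled second half."""
    halfway = n // 2
    first = list(range(halfway))
    second = list(range(halfway, n))
    rng.shuffle(first)
    rng.shuffle(second)
    return first + second


rng = random.Random(0)  # seeded so repeated runs give the same orders
orders = []
for epoch in range(3):
    order = independent_halves_order(11, rng)
    # Every index is visited exactly once per epoch
    assert sorted(order) == list(range(11))
    # The first 5 positions only ever contain indices from the first half
    assert set(order[:5]) == {0, 1, 2, 3, 4}
    orders.append(order)
    print(epoch, order)
```

Each epoch reshuffles inside the halves, but an index never crosses from one half to the other.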