Understanding WeightedRandomSampler

WeightedRandomSampler

 

sampler = WeightedRandomSampler(samples_weight, samples_num)

train_loader = DataLoader(train_dataset, batch_size=bs, num_workers=1, sampler=sampler)

 

My dataset is imbalanced. Working in PyTorch, I came across WeightedRandomSampler. After searching around online I more or less learned the usage above, but it took me quite a while to understand why it is used this way.

The biggest problem was that I couldn't understand how WeightedRandomSampler actually works. Apart from the official explanation, I couldn't find anything more helpful.

Now I think I finally have a reasonable grasp of it.

 

The official explanation (from the docstring) is:

Samples elements from [0,..,len(weights)-1] with given probabilities (weights). The arguments are weights (a sequence of weights, not necessarily summing to one), num_samples (the number of samples to draw), and replacement.

The docs also give an example:

list(WeightedRandomSampler([0.1, 0.9, 0.4, 0.7, 3.0, 0.6], 5, replacement=True))

list(WeightedRandomSampler([0.9, 0.4, 0.05, 0.2, 0.3, 0.1], 5, replacement=False))

At first I still didn't really get it and didn't know how to use it. I feel this example doesn't illustrate the point very well (or maybe my comprehension is just poor); it took me several tries to understand.

 

Let me use a different example:

list(WeightedRandomSampler([1, 9], 5, replacement=True))

Run the line above repeatedly, and guess what happens?

My results across runs were as follows (yours will certainly differ):

[1, 0, 1, 1, 1]

[1, 1, 1, 1, 1]

[1, 1, 1, 0, 1]

[1, 1, 1, 1, 1]

[1, 1, 1, 1, 1]

[1, 1, 0, 1, 1]

Starting to get it?

The 5 means "generate 5 numbers". What can those 5 numbers be? That depends on how many numbers are inside the brackets: there are 2 of them above, so by the [0,..,len(weights)-1] rule each generated number is either 0 or 1.

And which value does each of the 5 numbers take? It is 0 with probability 10% (1 out of 1+9) and 1 with probability 90% (9 out of 1+9).

Got it? I won't go over the other parameters.
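To convince yourself, you can draw many more samples and check the empirical frequencies (a quick sketch; the exact counts vary from run to run):

```python
from collections import Counter

from torch.utils.data import WeightedRandomSampler

# Draw 10,000 indices with weights [1, 9]; with replacement,
# index 0 should come up about 10% of the time, index 1 about 90%.
draws = list(WeightedRandomSampler([1, 9], 10_000, replacement=True))
freq = Counter(draws)
print(freq[0] / 10_000)  # roughly 0.1
print(freq[1] / 10_000)  # roughly 0.9
```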

 

Usage

A common usage (though not the only one) is this:

Suppose a classification problem with 3 classes.

sampler = WeightedRandomSampler(samples_weight, samples_num)

The length of samples_weight equals the total number of training samples, say 1000.

Each entry of samples_weight is the reciprocal of the proportion that the sample's class makes up of the whole dataset.

samples_num is how many samples we want to draw, with repeats allowed, say 2000.

 

Suppose the three classes are distributed as cat, dog, pig = 0.1, 0.2, 0.7:

Count = [0.1, 0.2, 0.7]

Weight = 1/Count = [10, 5, 1.43], which normalized is roughly [0.61, 0.30, 0.09], i.e. the class proportions inverted.

Every entry of samples_weight is one of 10, 5, or 1.43; an entry of 10 means that sample is a cat, and so on.

So samples_weight looks something like:

[10, 5, 5, 1.43, 1.43, 1.43, 1.43, ......, 10]

The 10s are the fewest in number but carry the largest weight, which is exactly what balances the sampling.
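That construction can be sketched as follows (the labels here are hypothetical, chosen only to match the 0.1 / 0.2 / 0.7 split above):

```python
import torch

# Hypothetical 1000-sample label tensor: 100 cats (0), 200 dogs (1), 700 pigs (2)
labels = torch.cat([torch.zeros(100), torch.ones(200), torch.full((700,), 2.0)]).long()

# Per-class weight = 1 / (class proportion) = [10, 5, 1.43]
class_count = torch.bincount(labels)              # tensor([100, 200, 700])
class_weight = len(labels) / class_count.float()  # tensor([10.0, 5.0, 1.4286])

# Per-sample weight: each sample looks up the weight of its own class
samples_weight = class_weight[labels]  # 10 for cats, 5 for dogs, 1.43 for pigs
```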

 

So, putting this together with how WeightedRandomSampler works above:

It will generate samples_num numbers, i.e. 2000 of them,

each of which is some number between 0 and 999,

and each number (matching the entries of samples_weight):

is 0 with probability 10/sum(samples_weight)

is 1 with probability 5/sum(samples_weight)

is 2 with probability 1.43/sum(samples_weight)

is 3 with probability 1.43/sum(samples_weight)

is 4 with probability 1.43/sum(samples_weight)

......

is 999 with probability 10/sum(samples_weight)

The drawn numbers are then used as indices, and the DataLoader fetches the corresponding samples.
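Continuing the hypothetical 3-class setup, here is a quick check that the drawn indices really do balance the classes (again just a sketch; counts vary per run, and note the weights need not sum to one, so any positive multiple of [10, 5, 1.43] gives the same probabilities):

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical labels: 100 cats (0), 200 dogs (1), 700 pigs (2)
labels = torch.cat([torch.zeros(100), torch.ones(200), torch.full((700,), 2.0)]).long()
samples_weight = (1.0 / torch.bincount(labels).float())[labels]

# Draw 2000 indices in [0, 999]; index i is picked with probability
# samples_weight[i] / samples_weight.sum()
sampler = WeightedRandomSampler(samples_weight, 2000, replacement=True)
drawn = torch.tensor(list(sampler))

# Map the drawn indices back to classes: each class should get roughly 2000/3
print(torch.bincount(labels[drawn]))
```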

 

 

end

This is just my current understanding; omissions and mistakes are likely, and corrections are very welcome.
WeightedRandomSampler is a sampling technique used in PyTorch to sample data from a dataset with a probability proportional to the weights assigned to each sample. This is useful when the dataset is imbalanced and we want the model to see a balanced representation of the data during training. The WeightedRandomSampler takes two arguments: the weights of each sample and the number of samples to draw. The weights can be any positive numbers and do not need to sum to one. The number of samples to draw can be less than the total number of samples in the dataset, allowing us to create smaller subsets of the data. Here is an example of using WeightedRandomSampler:

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# create a dataset with 100 samples
dataset = torch.utils.data.TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))

# calculate weights for each sample based on the class distribution
class_count = torch.tensor([len(torch.where(dataset.tensors[1] == t)[0]) for t in torch.unique(dataset.tensors[1])])
weights = 1.0 / class_count.float()
sample_weights = weights[dataset.tensors[1]]

# create a sampler with the weighted probabilities
sampler = WeightedRandomSampler(sample_weights, num_samples=len(dataset), replacement=True)

# create a dataloader using the sampler
dataloader = DataLoader(dataset, batch_size=16, sampler=sampler)

# iterate over the dataloader and print the class distribution
for i, (x, y) in enumerate(dataloader):
    print(f"Batch {i}, Class distribution: {torch.bincount(y)}")
```

In this example, we create a dataset with 100 samples, where the second tensor represents the class labels. We then calculate the weights for each sample based on the class distribution; samples from the minority class are given higher weights. We create a WeightedRandomSampler with the sample weights and use it to create a DataLoader. Finally, we iterate over the dataloader and print the class distribution of each batch to confirm that the sampler is working as expected.