Writing this up takes effort; if you find it helpful, please follow me~
Reference links:
【1】https://blog.csdn.net/huhuan4480/article/details/113248503
【2】https://blog.csdn.net/zw__chen/article/details/82806900
My training job kept dying after about one epoch, and since I always fix the random seed out of habit, I suspected that on every restart my Sampler replayed the exact same order starting from the first batch. In other words, training kept seeing the same early batches over and over, never the other orderings, and was overfitting to them.
So the current goal: use randperm to generate the within-epoch index order up front, save it to a file, and on every restart have a custom Sampler read back the order for the current epoch so the DataLoader replays a fixed, known sequence. This avoids the overfitting scenario above.
Small-scale test: 30 samples total, batch size 4, so 8 batches cover the dataset once (the last batch holds only 2 samples); run 2 epochs.
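As a quick sanity check of the suspicion above: under a fixed seed, torch.randperm returns the identical permutation on every run, which is exactly why a restarted, seed-fixed job keeps replaying the same batch order.

```python
import torch

# Same seed -> same permutation: a restarted job with a fixed seed
# sees batches in the identical order every time.
torch.manual_seed(3)
a = torch.randperm(30).tolist()
torch.manual_seed(3)
b = torch.randperm(30).tolist()
print(a == b)  # True
```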
1. First draw 2 permutations of the 30 indices and store them in dd_100.json
import json
import torch

dd_list = {}
epoch = 2
for i in range(epoch):
    # one fresh permutation of the 30 sample indices per epoch
    indices = torch.randperm(30)
    dd_list[str(i)] = indices.numpy().tolist()
# dump all epoch orders to disk so a restarted run can replay them
with open('dd_100.json', 'w') as f:
    json.dump(dd_list, f)
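A quick self-contained check of what the file should contain (written to a hypothetical dd_check.json here so the real dd_100.json is untouched): each epoch entry, after a JSON round-trip, should still be a full permutation of 0..29.

```python
import json
import torch

# write two epoch orders, read them back, and verify each covers 0..29 exactly once
dd_list = {str(i): torch.randperm(30).tolist() for i in range(2)}
with open('dd_check.json', 'w') as f:
    json.dump(dd_list, f)

with open('dd_check.json', 'r') as f:
    loaded = json.load(f)

for order in loaded.values():
    assert sorted(order) == list(range(30))
print("both stored epoch orders are valid permutations")
```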
2. Then use a custom Sampler so batch selection follows the order stored above
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import os
import numpy as np
import json

def seed_torch(seed):
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    print("seed: ", seed)

seed_torch(3)

class MyDataset_v1(Dataset):
    def __init__(self):
        self.data = range(30)
    def __len__(self):
        return len(self.data)
    def __getitem__(self, item):
        return self.data[item]

class SamplerDef(object):
    """Yields indices in exactly the order passed in."""
    def __init__(self, data_source, indices):
        self.data_source = data_source
        self.indices = indices
    def __iter__(self):
        return iter(self.indices)
    def __len__(self):
        # length of the index list, not the dataset, in case they differ
        return len(self.indices)

if __name__ == "__main__":
    myDataset1 = MyDataset_v1()
    dataloader = {}
    with open('dd_100.json', 'r') as f:
        dd_list = json.load(f)
    epoch = 2
    for i in range(epoch):
        # rebuild the sampler each epoch with that epoch's stored order
        mySampler = SamplerDef(data_source=myDataset1, indices=dd_list[str(i)])
        # shuffle must stay False when a custom sampler is supplied
        dataloader['v1'] = DataLoader(dataset=myDataset1, batch_size=4, shuffle=False,
                                      pin_memory=True, sampler=mySampler)
        for batch_ind, d1 in enumerate(dataloader['v1']):
            print("Epoch: {} Batch_ind: {} data in Dataset1: {}".format(i, batch_ind, d1))
Result screenshot (omitted here): the batches come out exactly in the order we defined in dd_100.json~~
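If a run dies in the middle of an epoch rather than at an epoch boundary, the same idea extends naturally: checkpoint the number of batches already consumed and slice them off the stored order before rebuilding the sampler. A minimal sketch under that assumption (resumed_batches stands in for a hypothetical checkpointed counter, and ListSampler is a stripped-down variant of SamplerDef above):

```python
import torch
from torch.utils.data import DataLoader, Dataset, Sampler

class ToyDataset(Dataset):
    def __init__(self):
        self.data = list(range(30))
    def __len__(self):
        return len(self.data)
    def __getitem__(self, item):
        return self.data[item]

class ListSampler(Sampler):
    """Yield indices in exactly the given order."""
    def __init__(self, indices):
        self.indices = indices
    def __iter__(self):
        return iter(self.indices)
    def __len__(self):
        return len(self.indices)

batch_size = 4
# hypothetical checkpoint: the job died after finishing 3 full batches,
# so skip the first 3 * batch_size indices of this epoch's stored order
resumed_batches = 3
indices = torch.randperm(30).tolist()
remaining = indices[resumed_batches * batch_size:]

loader = DataLoader(ToyDataset(), batch_size=batch_size,
                    sampler=ListSampler(remaining))
n_batches = sum(1 for _ in loader)
print(n_batches)  # 18 leftover samples -> batches of 4,4,4,4,2 -> 5
```

As a side note, recent PyTorch versions also let you pass a `generator` to the DataLoader and checkpoint its RNG state with `get_state()`/`set_state()`, which achieves reproducible shuffling without storing index files at all.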