🚀 Debug column
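While training with mmseg I hit a bug in the data loading stage, so here is a record of how I debugged it and the reasoning behind each step. For other debugging notes, see the [Debug column] above.
❓❓ Problem 1:
The first error came from the dataloader:
RuntimeError: Caught RuntimeError in DataLoader worker process 0. This wrapper error shows up often during data loading; what matters is the detailed error that follows it.
The error that followed was:
RuntimeError: Trying to resize storage that is not resizable
🌻🌻 Solution:
With an error like this, the first step is to locate the detailed message, here the line "RuntimeError: Trying to resize storage that is not resizable". The full traceback is: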
Traceback (most recent call last):
File "train.py", line 100, in <module>
for data in train_dataloader:
File "/data0/thw/anaconda3/envs/Mmseg/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
data = self._next_data()
File "/XXX/anaconda3/envs/Mmseg/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
return self._process_data(data)
File "/XXX/anaconda3/envs/Mmseg/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
data.reraise()
File "/XXX/anaconda3/envs/Mmseg/lib/python3.8/site-packages/torch/_utils.py", line 543, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/XXX/anaconda3/envs/Mmseg/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/XXX/anaconda3/envs/Mmseg/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
return self.collate_fn(data)
File "/XXX/anaconda3/envs/Mmseg/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 265, in default_collate
return collate(batch, collate_fn_map=default_collate_fn_map)
File "/XXX/anaconda3/envs/Mmseg/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 143, in collate
return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed] # Backwards compatibility.
File "/XXX/anaconda3/envs/Mmseg/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 143, in <listcomp>
return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed] # Backwards compatibility.
File "/XXX/anaconda3/envs/Mmseg/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 120, in collate
return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
File "/XXX/anaconda3/envs/Mmseg/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 172, in collate_numpy_array_fn
return collate([torch.as_tensor(b) for b in batch], collate_fn_map=collate_fn_map)
File "/XXX/anaconda3/envs/Mmseg/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 120, in collate
return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
File "/XXX/anaconda3/envs/Mmseg/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 162, in collate_tensor_fn
out = elem.new(storage).resize_(len(batch), *list(elem.size()))
RuntimeError: Trying to resize storage that is not resizable
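Fix: many posts online say this is caused by a wrong num_workers setting, and that it should be set to 0 or to the number of GPUs. I changed it and still got the error.
I then checked the sizes of the input images and labels, and the cause was indeed inconsistent sizes. This matches the traceback, which ends in collate_tensor_fn: that function stacks the samples of a batch into one tensor, so every sample must have the same shape. After making the sizes consistent, the error was gone!!!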
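The size check described above can be sketched as a small scan over the dataset. This is only an illustration, not the author's actual script: the helper name `find_size_mismatches` is hypothetical, and it assumes channel-last images shaped (H, W, C) with labels shaped (H, W). It reads nothing but `.shape`, so NumPy arrays can be passed in directly; the demo uses stand-in objects to stay dependency-free.

```python
from collections import Counter
from types import SimpleNamespace


def find_size_mismatches(samples):
    """Scan (image, label) pairs and report inconsistent spatial sizes.

    Only the `.shape` attribute is read, so NumPy arrays work as-is.
    Assumes channel-last images (H, W, C) and labels shaped (H, W).
    """
    pair_mismatches = []     # samples whose image and label disagree
    image_sizes = Counter()  # how often each image size occurs
    for index, (image, label) in enumerate(samples):
        image_hw = tuple(image.shape[:2])
        label_hw = tuple(label.shape[:2])
        image_sizes[image_hw] += 1
        if image_hw != label_hw:
            pair_mismatches.append((index, image_hw, label_hw))
    return pair_mismatches, image_sizes


# Stand-in objects that carry only a shape (hypothetical toy data).
toy_samples = [
    (SimpleNamespace(shape=(4, 4, 3)), SimpleNamespace(shape=(4, 4))),
    (SimpleNamespace(shape=(4, 4, 3)), SimpleNamespace(shape=(4, 5))),  # label too wide
    (SimpleNamespace(shape=(6, 4, 3)), SimpleNamespace(shape=(6, 4))),  # differs from the rest
]

pair_mismatches, image_sizes = find_size_mismatches(toy_samples)
print("image/label mismatches:", pair_mismatches)
print("image sizes seen:", dict(image_sizes))
```

If more than one image size shows up, or any image/label pair disagrees, those samples are the ones to resize or crop before they reach the dataloader.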
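❓❓ Problem 2:
While loading data for training, this error appeared: DataLoader worker (pid xxx) is killed by signal: Killed.
This time there was no further detail, just that one line, which was frustrating. A web search pointed back to the same fix as before.
🌻🌻 Solution: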
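The num_workers setting was the problem. The usual advice is to set it to 0 or to the number of GPUs. In my experience, a multiple of the GPU count also works, and that solved it!!!
Example: with 2 GPUs, num_workers can be set to 4, 8, 16 and so on. Larger values generally load data faster as long as memory is not exhausted, but I would not go above 64. Note that worker processes consume host RAM and CPU cores rather than GPU memory, and "killed by signal: Killed" usually means the operating system terminated the worker, often because host memory ran out.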
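The rule of thumb above can be written down as a small helper. This is a sketch under stated assumptions: the function name `suggest_num_workers` and the `workers_per_gpu` default are made up for illustration, and the extra cap at the CPU core count is my own addition, not part of the original advice. The only real PyTorch name involved is the `num_workers` argument of `DataLoader`, shown in the trailing comment.

```python
import os


def suggest_num_workers(num_gpus, workers_per_gpu=4, hard_cap=64, cpu_count=None):
    """Pick a num_workers value that is a multiple of the GPU count.

    The result never exceeds `hard_cap` (64, per the advice above) or the
    number of CPU cores (an extra assumption made in this sketch).
    """
    if num_gpus < 1:
        return 0  # no GPUs given: fall back to loading in the main process
    cpu_count = cpu_count or os.cpu_count() or 1
    limit = min(hard_cap, cpu_count)
    wanted = min(num_gpus * workers_per_gpu, limit)
    # Round down to the nearest multiple of the GPU count.
    return (wanted // num_gpus) * num_gpus


print(suggest_num_workers(2, cpu_count=32))                       # 2 GPUs, 4 workers each
print(suggest_num_workers(2, workers_per_gpu=64, cpu_count=128))  # limited by hard_cap

# Typical use (requires torch, shown as a comment only):
# loader = DataLoader(dataset, batch_size=8, num_workers=suggest_num_workers(2))
```

If workers still get killed with the suggested value, lowering `workers_per_gpu` step by step is a simple way to find a setting that fits in memory.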
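Summary:
Don't panic when training fails. Check whether there are other error messages above the point where the run stopped, because the detailed error may be further up. Scroll back to the first error: it is usually the real one, and the vaguer errors after it are often just consequences of it. Fixing the first error may well resolve the later ones too.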
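For long logs, listing every error line in order makes the first real error easier to spot. The sketch below is one way to do that with a regular expression; the helper name `error_lines` is hypothetical, and the sample log is abbreviated from the traceback in this post. When a log contains several separate failures, read the list from the top. Inside a single DataLoader traceback, the wrapper error comes first and the underlying error is the one in the "Original Traceback" section after it.

```python
import re

# Matches lines such as "RuntimeError: some message" or "pkg.MyException: msg".
ERROR_LINE = re.compile(r"^\s*([A-Za-z_][\w.]*(?:Error|Exception)): (.+)$")


def error_lines(log_text):
    """Return (line_number, error_type, message) for each error line, in order."""
    found = []
    for line_number, line in enumerate(log_text.splitlines(), start=1):
        match = ERROR_LINE.match(line)
        if match:
            found.append((line_number, match.group(1), match.group(2)))
    return found


# Abbreviated sample log based on the traceback shown earlier.
sample_log = "\n".join([
    "Traceback (most recent call last):",
    '  File "train.py", line 100, in <module>',
    "RuntimeError: Caught RuntimeError in DataLoader worker process 0.",
    "Original Traceback (most recent call last):",
    '  File "collate.py", line 162, in collate_tensor_fn',
    "RuntimeError: Trying to resize storage that is not resizable",
])

errors = error_lines(sample_log)
for line_number, error_type, message in errors:
    print(f"line {line_number}: {error_type}: {message}")
```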