While training YOLOv5 recently, I ran into the following error:
Traceback (most recent call last):
File "train.py", line 399, in <module>
train(hyp)
File "train.py", line 228, in train
for i, (imgs, targets, paths, _) in pbar: # batch -------------------------------------------------------------
File "/home/videoc/miniconda3/envs/yolov5/lib/python3.8/site-packages/tqdm/std.py", line 1171, in __iter__
for obj in iterable:
File "/home/videoc/miniconda3/envs/yolov5/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
data = self._next_data()
File "/home/videoc/miniconda3/envs/yolov5/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
return self._process_data(data)
File "/home/videoc/miniconda3/envs/yolov5/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
data.reraise()
File "/home/videoc/miniconda3/envs/yolov5/lib/python3.8/site-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
File "/home/videoc/miniconda3/envs/yolov5/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in _pin_memory_loop
data = pin_memory(data)
File "/home/videoc/miniconda3/envs/yolov5/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in pin_memory
return [pin_memory(sample) for sample in data]
File "/home/videoc/miniconda3/envs/yolov5/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in <listcomp>
return [pin_memory(sample) for sample in data]
File "/home/videoc/miniconda3/envs/yolov5/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 47, in pin_memory
return data.pin_memory()
RuntimeError: cuda runtime error (46) : all CUDA-capable devices are busy or unavailable at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278
The error is raised during data loading: the GPU was already fully occupied, so I switched training to the CPU, but the data loader still calls into the CUDA runtime by default (to pin host memory), hence the failure.
Solution:
Searching online showed that the cause is pin_memory=True in PyTorch's DataLoader; after setting pin_memory=False, the code entered training normally.
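A minimal sketch of the fix, using a dummy TensorDataset in place of YOLOv5's actual dataset class (the shapes below are illustrative, not YOLOv5's real ones):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the training dataset: 8 fake images and label rows.
ds = TensorDataset(torch.randn(8, 3, 640, 640), torch.zeros(8, 6))

# pin_memory=False skips the pin-memory thread entirely, so no CUDA
# host-allocator call is made and the RuntimeError above cannot occur.
loader = DataLoader(ds, batch_size=2, num_workers=0, pin_memory=False)

imgs, targets = next(iter(loader))
print(imgs.shape)  # torch.Size([2, 3, 640, 640])
```

The trade-off is slightly slower host-to-GPU transfers when a GPU is available, but loading works regardless of GPU state.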
Some background on the pin_memory parameter:
pin_memory refers to page-locked (pinned) host memory. Setting pin_memory=True when creating a DataLoader means the tensors it produces are placed in pinned host memory from the start, which makes transferring them from host memory to GPU memory faster.
Host memory comes in two kinds: pinned and pageable. The contents of pinned memory are never swapped out to the host's virtual memory (i.e., to disk), whereas pageable memory can be swapped to virtual memory when host RAM runs low.
GPU memory, by contrast, is entirely page-locked!
When the machine has plenty of RAM, you can set pin_memory=True. When the system stalls or swap usage is high, set pin_memory=False. Because the benefit of pin_memory depends on the hardware, and the PyTorch developers cannot assume every user has a high-end machine, pin_memory defaults to False.
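The speedup from pinned memory can be checked directly. Below is a rough sketch that times a host-to-GPU copy from pageable versus pinned memory; it is guarded with torch.cuda.is_available() since Tensor.pin_memory() itself needs the CUDA runtime (the tensor size and timing approach are my own choices, not from any reference):

```python
import time
import torch

def transfer_seconds(t: torch.Tensor, device: torch.device) -> float:
    """Time one host-to-device copy, synchronizing so the copy finishes."""
    torch.cuda.synchronize(device)
    start = time.perf_counter()
    t.to(device, non_blocking=True)  # non_blocking only helps for pinned sources
    torch.cuda.synchronize(device)
    return time.perf_counter() - start

x = torch.randn(64, 3, 640, 640)  # ordinary (pageable) host tensor

if torch.cuda.is_available():
    dev = torch.device("cuda:0")
    x_pinned = x.pin_memory()  # page-locked copy of the same data
    print("pageable:", transfer_seconds(x, dev))
    print("pinned:  ", transfer_seconds(x_pinned, dev))
else:
    print("CUDA unavailable; cannot demonstrate pinned transfer here")
```

On a machine with a free GPU, the pinned copy is typically faster because the driver can DMA directly from page-locked memory; with pageable memory it must first stage the data into an internal pinned buffer.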
Reference:
1. Understanding the pin_memory parameter when creating a data.DataLoader in PyTorch: http://www.voidcn.com/article/p-fsdktdik-bry.html