1. Environment: one GTX 1080 Ti GPU (11 GB), 2 × 16 GB RAM, Ubuntu 18.04.
2. Problem: running a deep-learning model with PyTorch fails with the following error:
xx@xx-H310M-A-V2:~/Documents/NTS-Net-master$ python train.py
Traceback (most recent call last):
  File "train.py", line 58, in <module>
    for i, data in enumerate(trainloader):
  File "/home/zls/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 193, in __iter__
    return _DataLoaderIter(self)
  File "/home/zls/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 469, in __init__
    w.start()
  File "/home/zls/anaconda3/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/zls/anaconda3/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/zls/anaconda3/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/home/zls/anaconda3/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/zls/anaconda3/lib/python3.6/multiprocessing/popen_fork.py", line 66, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
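For context, the last frame shows os.fork() failing while the DataLoader starts its worker processes. A minimal, self-contained reconstruction of that code path (a sketch only; the dummy dataset, batch size, and worker count below are placeholders, not the actual train.py) looks like this:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data standing in for the real training set.
dummy_set = TensorDataset(torch.randn(64, 3, 224, 224),
                          torch.zeros(64, dtype=torch.long))

# Any num_workers > 0 makes the DataLoader start one child process per worker
# via os.fork(); if the kernel cannot allocate the child's address space, the
# fork raises OSError: [Errno 12] Cannot allocate memory.
trainloader = DataLoader(dummy_set, batch_size=16, shuffle=True, num_workers=8)

for i, data in enumerate(trainloader):   # workers are forked when iteration begins
    pass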
3. Solution: run the commands
watch -n 2 nvidia-smi
watch -n 2 free -m
to monitor the CPU, GPU, physical memory, and swap usage for the whole run. They showed that memory exhaustion was not actually the cause, and hunting for the bug along that line went nowhere.
So I changed approach: judging from the failing code and the error message, the problem lies in dataloader.py, so I Googled the keywords: dataloader OSError: [Errno 12] Cannot allocate memory
Sure enough, many people hit the same error in the DataLoader, and several causes have been reported:
1. Insufficient system memory (already ruled out above)
2. A system-wide limit on the number of processes/threads: https://blog.csdn.net/m0_37644085/article/details/92795488 suggests raising the maximum process count (tried, no effect)
3. Setting pin_memory=False (tried, no effect)
4. Reducing the number of loader workers: setting num_workers to 0 solved the problem and the training script now runs, as in the sketch below.
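A minimal sketch of the working configuration, assuming a torchvision-style image-folder dataset; the dataset path, image size, and batch size are placeholders, not the actual NTS-Net settings:

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Placeholder dataset standing in for the project's training set.
train_set = datasets.ImageFolder(
    root='./data/train',                      # hypothetical path
    transform=transforms.Compose([transforms.Resize((224, 224)),
                                  transforms.ToTensor()]))

# num_workers=0 loads batches in the main process, so no worker process is
# forked and the os.fork() failure disappears. pin_memory=False alone did not
# help (cause 3 above), but it is kept off here as well.
trainloader = DataLoader(train_set,
                         batch_size=16,
                         shuffle=True,
                         num_workers=0,
                         pin_memory=False)

for i, data in enumerate(trainloader):
    images, labels = data
    # ... forward/backward pass as in train.py ...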
Reference blog: https://blog.csdn.net/breeze210/article/details/99679048