Fixing a hang when training with torch.nn.DataParallel

Latest recommended article published 2024-05-20 12:59:31

HumbleHater


Tags: pytorch, deep learning, artificial intelligence

Article link: https://blog.csdn.net/qq_43693857/article/details/124536274


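The situation is roughly this: when training with torch.nn.DataParallel, the run reports no error but also shows no progress at all. This can happen right at the start of training, or only when training is restarted. Killing the process produces: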

Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

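I then checked GPU usage. The primary GPU (the one the model is loaded onto) already had some memory in use, which suggests the model had been loaded. The other GPUs used for parallel training had almost no memory in use, which suggests no data had reached them, so the problem most likely lies in the DataLoader.
Many fixes are circulating online, but most of them did not help in my case; have a look at [1] in case any of them apply to you.
I then came across a GitHub issue [2], tried the various suggestions in it, and found one that solves my kind of problem. The relevant comment from that issue reads: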

I recently came across a situation where I need to load many small images. My workstation has a CPU with 22 cores and four GPUs, so I run four experiments with different random seeds, and each experiment uses one separate GPU. I found that the run time of four processes is almost four times the run time of a single process (no parallel benefit).

The model I train is relatively small and the most time-consuming part actually comes from data loading. I have tried many different approaches including:

pin_memory =False/True
num_workers = 0/1/8
Increase ulimit
staggering the start of each experiment
Thanks to the system level diagnosis by @vjorlikowski, we find out that if we set num_workers = 0/1/8, each process will try to use all CPU cores and viciously compete with each other for CPU cores.

Solution:
Use export OMP_NUM_THREADS=N, as described here
or use torch.set_num_threads(N), as described here
We set num_workers = 0 and N=5 in our case, as we have 22 cores. The estimated run time of my program is reduced from 12 days to 1.5 days.
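
As a rough illustration of the environment-variable route mentioned in the quote, the sketch below sets OMP_NUM_THREADS from inside Python before any numerical library is imported. The helper names `threads_per_process` and `cap_omp_threads` are hypothetical, and dividing the cores evenly across processes is only an assumed heuristic; the quoted post does not say how N=5 was chosen, although 22 cores shared by four experiments happens to give the same number.

```python
import os


def threads_per_process(total_cores: int, n_processes: int) -> int:
    """Split CPU cores evenly across processes, keeping at least one thread each.

    Assumed heuristic for illustration only; not taken from the quoted post.
    """
    if total_cores < 1 or n_processes < 1:
        raise ValueError("total_cores and n_processes must be positive")
    return max(1, total_cores // n_processes)


def cap_omp_threads(n_threads: int) -> None:
    """Set OMP_NUM_THREADS for this process.

    This should run before importing torch or numpy, because the OpenMP
    runtime typically reads the variable when it is first initialised.
    """
    os.environ["OMP_NUM_THREADS"] = str(n_threads)


# Numbers from the quoted post: 22 cores, four concurrent experiments.
n = threads_per_process(total_cores=22, n_processes=4)
cap_omp_threads(n)
print(f"OMP_NUM_THREADS={os.environ['OMP_NUM_THREADS']}")
```

Running `export OMP_NUM_THREADS=5` in the shell before launching the training script should have the same effect without touching the code.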

This is the line that fixed the hang for me:

torch.set_num_threads(N)
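
For context, here is a minimal sketch of where that call might sit in a training script. The toy model, the random data, and the choice of N=5 with num_workers=0 (the values from the quoted post) are assumptions for illustration, not the setup used in this article; tune N for your own machine.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

N = 5  # value from the quoted post (22 cores); an assumption for other machines

# Cap the number of intra-op CPU threads before any heavy work starts.
torch.set_num_threads(N)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy model, purely illustrative.
model = nn.Linear(16, 2)
if torch.cuda.device_count() > 1:
    # Replicate the model across all visible GPUs.
    model = nn.DataParallel(model)
model = model.to(device)

# Random data standing in for a real dataset; num_workers=0 loads in the main process.
dataset = TensorDataset(torch.randn(32, 16), torch.randint(0, 2, (32,)))
loader = DataLoader(dataset, batch_size=8, num_workers=0)

for inputs, labels in loader:
    outputs = model(inputs.to(device))  # one forward pass to confirm nothing hangs
    break

print(torch.get_num_threads(), tuple(outputs.shape))
```

The sketch falls back to a single device when fewer than two GPUs are visible, so it can be tried on a CPU-only machine as well.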

References

[1] https://3water.com/article/8MTM21NDY22Ljg4
[2] https://github.com/pytorch/pytorch/issues/1355
