1. Converting from single-GPU to multi-GPU training on one machine
For PyTorch 1.0 and later, the way to switch to multi-GPU training is different from 0.4.0. After many attempts, the approach in the article below turned out to be the most convenient of everything I tried.
https://zhuanlan.zhihu.com/p/86441879
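For reference, a minimal sketch of the DistributedDataParallel setup that article describes, as I understand it (netG, MyGenerator, and the script layout are placeholders of mine, not code from the post). Each GPU gets its own process, started with python -m torch.distributed.launch --nproc_per_node=NUM_GPUS train.py:

import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# torch.distributed.launch passes --local_rank to every process it spawns
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

dist.init_process_group(backend='nccl')      # one process per GPU
torch.cuda.set_device(args.local_rank)

netG = MyGenerator().cuda(args.local_rank)   # MyGenerator is a placeholder model
netG = DistributedDataParallel(netG, device_ids=[args.local_rank], output_device=args.local_rank)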
2. Balancing load across GPUs
Same article as above. After trying it, the load only became slightly more balanced for me; I am still experimenting.
https://zhuanlan.zhihu.com/p/86441879
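One related detail (my addition, not something the article guarantees to fix balance): for the work to be split evenly under DistributedDataParallel, each process should read its own shard of the data through DistributedSampler, roughly like this (dataset and num_epochs are placeholders):

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(dataset)        # shards the dataset across processes
loader = DataLoader(dataset, batch_size=16, sampler=sampler, num_workers=4)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)                 # different shuffle every epoch
    for batch in loader:
        ...                                  # forward/backward as usual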
3. Errors during multi-GPU training
Error 1:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /tmp/pip-req-build-ocx5vxk7/torch/csrc/distributed/c10d/reducer.cpp:518)
Solution:
The message already lists two fixes; in my testing, the first one alone is enough. Following its suggestion, change the wrapping to:
netG = DistributedDataParallel(netG, find_unused_parameters=True)
and the error goes away.
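In context, the wrapped call looks roughly like this (device_ids and local_rank follow the usual DDP setup and are my assumption, not part of the original fix). Note that find_unused_parameters=True adds a small per-iteration overhead, so only keep it if your forward really leaves some parameters unused:

netG = DistributedDataParallel(netG,
                               device_ids=[args.local_rank],
                               output_device=args.local_rank,
                               find_unused_parameters=True)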
Error 2:
Segmentation fault (core dumped) suddenly appears during training.
Solution:
See https://zhuanlan.zhihu.com/p/66667725
First check the current limits with ulimit -a,
then raise the stack size with ulimit -s 81920.
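If you would rather raise the limit from inside the training script instead of the shell, the standard-library resource module can do the same thing on Linux (a sketch; note that ulimit -s counts KB while setrlimit takes bytes, and the new soft limit must not exceed the hard limit):

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_STACK)   # current stack limits, in bytes
print(soft, hard)
# equivalent of `ulimit -s 81920`: raise the soft stack limit to 81920 KB
resource.setrlimit(resource.RLIMIT_STACK, (81920 * 1024, hard))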