1. Converting from single-GPU to multi-GPU training on one machine
For PyTorch 1.0 and later, the way to switch to multi-GPU training is different from 0.4.0. After many attempts, the approach in the article below turned out to be the most convenient of everything I tried.
https://zhuanlan.zhihu.com/p/86441879
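For reference, a minimal sketch of the DistributedDataParallel setup that article describes, as I understand it (netG, MyGenerator, and the script layout are placeholders of mine, not code from the post). Each GPU gets its own process, started with python -m torch.distributed.launch --nproc_per_node=NUM_GPUS train.py:

import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# torch.distributed.launch passes --local_rank to every process it spawns
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

dist.init_process_group(backend='nccl')      # one process per GPU
torch.cuda.set_device(args.local_rank)

netG = MyGenerator().cuda(args.local_rank)   # MyGenerator is a placeholder model
netG = DistributedDataParallel(netG, device_ids=[args.local_rank], output_device=args.local_rank)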
2. Balancing load across GPUs
Same article as above. After trying it, the load only became slightly more balanced for me; I am still experimenting.
https://zhuanlan.zhihu.com/p/86441879
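One related detail (my addition, not something the article guarantees to fix balance): for the work to be split evenly under DistributedDataParallel, each process should read its own shard of the data through DistributedSampler, roughly like this (dataset and num_epochs are placeholders):

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(dataset)        # shards the dataset across processes
loader = DataLoader(dataset, batch_size=16, sampler=sampler, num_workers=4)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)                 # different shuffle every epoch
    for batch in loader:
        ...                                  # forward/backward as usual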
3. Errors during multi-GPU training
Error 1:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /tmp/pip-req-build-ocx5vxk7/torch/csrc/distributed/c10d/reducer.cpp:518)
Solution:
The message already lists two fixes; in my testing, the first one alone is enough. Following its suggestion, change the wrapping to:
netG = DistributedDataParallel(netG, find_unused_parameters=True)
and the error goes away.
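In context, the wrapped call looks roughly like this (device_ids and local_rank follow the usual DDP setup and are my assumption, not part of the original fix). Note that find_unused_parameters=True adds a small per-iteration overhead, so only keep it if your forward really leaves some parameters unused:

netG = DistributedDataParallel(netG,
                               device_ids=[args.local_rank],
                               output_device=args.local_rank,
                               find_unused_parameters=True)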
Error 2:
Segmentation fault (core dumped) suddenly appears during training.
Solution:
See https://zhuanlan.zhihu.com/p/66667725
First check the current limits with ulimit -a,
then raise the stack size with ulimit -s 81920.
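If you would rather raise the limit from inside the training script instead of the shell, the standard-library resource module can do the same thing on Linux (a sketch; note that ulimit -s counts KB while setrlimit takes bytes, and the new soft limit must not exceed the hard limit):

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_STACK)   # current stack limits, in bytes
print(soft, hard)
# equivalent of `ulimit -s 81920`: raise the soft stack limit to 81920 KB
resource.setrlimit(resource.RLIMIT_STACK, (81920 * 1024, hard))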