RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 17 18 19 20 47 48 49 50 66 67 68 69 85 86 87 88 104 105 106 107 123 124 125 126 144 145 146 147 174 175 176 177 193 194 195 196 212 213 214 215 231 232 233 234 250 251 252 253 271 272 273 274 301 302 303 304 320 321 322 323 339 340 341 342
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
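The flag the message suggests is passed to the `DistributedDataParallel` constructor. Below is a minimal single-process sketch (gloo backend on CPU; the `Net` class is a hypothetical example, not my actual model) where a layer defined in `__init__()` is never called in `forward()` — exactly the situation that triggers this error:

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# A one-process gloo group, just enough to construct DDP for illustration.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)  # defined but never called in forward()

    def forward(self, x):
        return self.used(x)

# Without find_unused_parameters=True, the reducer waits forever for
# gradients of self.unused and raises the error above.
ddp = DDP(Net(), find_unused_parameters=True)
for _ in range(2):
    ddp.zero_grad()
    ddp(torch.randn(2, 4)).sum().backward()

dist.destroy_process_group()
```

With the flag set, DDP scans the autograd graph each iteration and marks the parameters of `self.unused` as ready even though they got no gradient; this has a per-iteration cost, which is why removing the unused layer (as I ended up doing below) is the cleaner fix.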
This error appeared when I ran someone else's code after modifying it. The tutorial I first found online had me hacking the code further, and I paid for it: after trying its fix, every layer in my network output None (possibly because I inserted the code in the wrong place). I then read the first tutorial that one referenced, https://blog.csdn.net/racesu/article/details/122260113?utm_medium=distribute.pc_relevant.none-task-blog-2. Although my error was not exactly the same as the one in that post, it gave me the key hint: my `__init__()` function defined a variable that was never used in `forward()`. I had been too lazy to comment it out after my changes, and I never expected it to affect execution. Once I commented out that defined-but-unused variable, the code ran normally.
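If you are not sure which definition is the culprit, you can find it without DDP at all: run one forward/backward pass on the bare model and list the parameters whose `.grad` is still `None`. A minimal sketch (the `Net` class and helper here are illustrative, not the original code):

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        # Defined in __init__ but never called in forward() --
        # under DDP this is what triggers the reduction error.
        self.unused = nn.Linear(4, 4)

    def forward(self, x):
        return self.used(x)

def find_unused_params(model, sample):
    """Return names of parameters that received no gradient."""
    model.zero_grad()
    model(sample).sum().backward()
    return [name for name, p in model.named_parameters() if p.grad is None]

print(find_unused_params(Net(), torch.randn(2, 4)))
# → ['unused.weight', 'unused.bias']
```

The printed names map directly onto the "parameter indices which did not receive grad" list in the error message, so you know exactly which attribute to delete or comment out.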