When using DDP without one process per GPU (i.e., more than one GPU per process), torch.autocast should be applied inside the model's forward method.
The autocast state is thread-local. If you want it enabled in a new thread, the context manager or decorator must be invoked in that thread. This affects torch.nn.DataParallel and torch.nn.parallel.DistributedDataParallel when used with more than one GPU per process (see Working with Multiple GPUs).
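A minimal sketch of the pattern the docs describe: re-entering autocast inside forward so that each worker thread spawned by DataParallel (or by DDP with several GPUs per process) sees autocast enabled. This assumes a single process driving at least two CUDA GPUs; MyModel, the layer sizes, and the dummy input are illustrative, not taken from the original note.

```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    """Toy model (illustrative). autocast is entered inside forward, so the
    thread-local autocast state is set in each DataParallel worker thread."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 8)

    def forward(self, x):
        # Re-enter autocast in the thread that actually runs this forward.
        with torch.autocast(device_type="cuda"):
            return self.linear(x)

if __name__ == "__main__":
    # One process, several GPUs: the case the documentation warns about.
    model = nn.DataParallel(MyModel().cuda())
    scaler = torch.cuda.amp.GradScaler()

    x = torch.randn(16, 8, device="cuda")
    out = model(x)            # autocast is active inside each replica's forward
    loss = out.float().sum()
    scaler.scale(loss).backward()
```

With one process per GPU (the recommended DDP setup), this is unnecessary: the training loop and forward run in the same thread, so wrapping the forward pass in torch.autocast at the call site is enough.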
Reference: Automatic Mixed Precision package - torch.amp — PyTorch 1.12 documentation