A Detailed Guide to DDP Data Parallelism in PyTorch Distributed Training: Principles and Code

Article link: https://blog.csdn.net/my_name_is_learn/article/details/146468992


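Original English article: PyTorch Distributed Data Parallel

The implementation of torch.nn.parallel.DistributedDataParallel evolves over time. This design note is written based on the state as of v1.4.

torch.nn.parallel.DistributedDataParallel (DDP) transparently performs distributed data parallel training. This page describes how it works and reveals implementation details.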

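Introduction

Data Parallel (DP) in PyTorch is a technique for training a model on multiple GPUs in parallel. It replicates the model on each device and has each device process a portion of the input data. The rest of this section describes the single-process torch.nn.DataParallel wrapper; DistributedDataParallel, covered from the example onward, follows the same idea but runs one process per GPU.

1. Basic concepts

Model replication: the model is copied to every available GPU.
Data splitting: each input batch is split along the batch dimension into smaller chunks, and each chunk is assigned to one GPU.
Parallel computation: each GPU independently runs the forward and backward passes on its chunk.
Gradient aggregation: the gradients computed on all GPUs are combined (usually summed), and the result is used to update the model parameters.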
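
As a sanity check on the four points above, here is a tiny pure-Python sketch (no PyTorch, no GPUs) showing that summing gradients computed on separate chunks of a batch gives the same result as computing the gradient on the whole batch. The one-parameter model `y = w * x`, the summed squared-error loss, and the helper names are assumptions chosen for illustration only.

```python
def grad_sum_squared_error(w, samples):
    """Gradient of sum((w*x - y)**2) with respect to w over the given samples."""
    return sum(2 * (w * x - y) * x for x, y in samples)


def split_batch(samples, num_devices):
    """Split a batch into contiguous, near-equal chunks, one per simulated device."""
    chunk_size = -(-len(samples) // num_devices)  # ceiling division
    return [samples[i:i + chunk_size] for i in range(0, len(samples), chunk_size)]


batch = [(1, 3), (2, 5), (3, 8), (4, 9)]  # (x, y) pairs; integers keep the math exact
w = 2

chunks = split_batch(batch, num_devices=2)
per_device_grads = [grad_sum_squared_error(w, chunk) for chunk in chunks]
full_batch_grad = grad_sum_squared_error(w, batch)

print(chunks)
print(per_device_grads, sum(per_device_grads), full_batch_grad)
```

The equality holds because the loss is a sum over samples. With a mean loss and equally sized chunks, the same argument applies to averaging instead of summing.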

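2. Implementation details

Model initialization: to use data parallelism, first wrap the model in torch.nn.DataParallel. PyTorch then handles model replication and data distribution automatically.
```
import torch
import torch.nn as nn

# nn.Linear stands in for your own model class (MyModel in the original snippet).
model = nn.Linear(10, 10)
model = nn.DataParallel(model)
```
Data loading: the data loader yields batches as usual; DataParallel splits each batch along the batch dimension and scatters the chunks to the GPUs.
Forward pass: each GPU runs the forward computation independently on its chunk.
Backward pass: each GPU computes the gradients for its chunk, and PyTorch automatically accumulates them onto the original model.
Parameter update: once the gradients have been aggregated, the optimizer updates the parameters held on the primary GPU.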

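3. Pros and cons

Pros:
- Simple to implement: only a few code changes are required.
- Suitable for small models: for small models, data parallelism can make effective use of multiple GPUs.
Cons:
- High memory usage: every GPU has to hold a full copy of the model.
- Communication overhead: transferring gradients between GPUs adds communication cost and can become a bottleneck.

4. Recommendations

When to use: data parallelism fits scenarios where the model is small and the dataset is large.
Optimization tips: performance can be improved by tuning the batch size, using mixed-precision training, and similar measures.

The official PyTorch documentation provides more details and sample code on data parallelism.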

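Example

Let us start with a simple torch.nn.parallel.DistributedDataParallel example. This example uses a torch.nn.Linear as the local model, wraps it with DDP, and then runs one forward pass, one backward pass, and one optimizer step on the DDP model. After that, the parameters of the local model are updated, and the models on all processes are updated in the same way. Note: each GPU is driven by its own process.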

```
import os

import torch
import torch.distributed as dist  # distributed training primitives
import torch.multiprocessing as mp  # multi-process support
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP  # PyTorch's distributed data parallel module


def example(rank, world_size):
    """
    rank: index of the current process (starting from 0).
    world_size: total number of processes (here, the number of GPUs).
    """
    # create default process group
    # "gloo" is the communication backend; it works on CPU and GPU (nccl is the usual choice on GPU)
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    # create local model
    # .to(rank) moves the model to the device that belongs to this process (e.g. GPU `rank`)
    model = nn.Linear(10, 10).to(rank)
    # construct DDP model
    # wrap the local model with DistributedDataParallel; device_ids=[rank] selects this process's device
    ddp_model = DDP(model, device_ids=[rank])
    # define loss function and optimizer
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    # forward pass
    outputs = ddp_model(torch.randn(20, 10).to(rank))
    labels = torch.randn(20, 10).to(rank)
    # backward pass
    loss_fn(outputs, labels).backward()

    # update parameters
    optimizer.step()


def main():
    """
    mp.spawn is the multi-process launcher provided by PyTorch.
    example is the function each process runs.
    args=(world_size,) are the arguments passed to example (the rank is prepended automatically).
    nprocs=world_size sets the number of processes to start.
    join=True makes the main process wait until all child processes finish.
    """
    world_size = 2
    mp.spawn(example,
        args=(world_size,),
        nprocs=world_size,
        join=True)


if __name__=="__main__":
    # Environment variables which need to be
    # set when using c10d's default "env"
    # initialization mode.
    os.environ["MASTER_ADDR"] = "localhost"  # address of the master node: the local host
    os.environ["MASTER_PORT"] = "29500"  # port of the master node
    main()  # start distributed training
```

DDP works with TorchDynamo. When used with TorchDynamo, apply the DDP model wrapper before compiling the model, so that torchdynamo can apply DDPOptimizer (graph-break optimizations) based on DDP bucket sizes. (See TorchDynamo DDPOptimizer for more information.)

```
ddp_model = DDP(model, device_ids=[rank])
ddp_model = torch.compile(ddp_model)
```

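Summary

The core of this code is using PyTorch's DistributedDataParallel (DDP) module for distributed training. Its main steps are:

Initialize the distributed process group.
Create the model and wrap it as a DDP model.
Define the loss function and optimizer.
Run the forward pass, compute the loss, run the backward pass, and update the parameters.
Use mp.spawn to launch multiple processes for distributed training.

Notes

Backend choice: this example uses the gloo backend, which works on both CPU and GPU. For GPU training, the nccl backend is recommended for better performance.
Environment variables: with the default "env" initialization mode used here, MASTER_ADDR and MASTER_PORT must be set to specify the address and port of the master node.
Device placement: each process must move its model and tensors to its own device (e.g. its GPU).
Data parallelism: DDP synchronizes gradients automatically, but it does not split the dataset; each process has to load its own share of the data, for example with torch.utils.data.distributed.DistributedSampler.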
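
On the data side, each process needs a different slice of the dataset. The sketch below is a simplified, pure-Python illustration of a strided split with padding, loosely modelled on what torch.utils.data.distributed.DistributedSampler does without shuffling. The function name `shard_indices` is hypothetical, and the sketch assumes the dataset has at least as many samples as the padding requires.

```python
import math


def shard_indices(dataset_len, rank, world_size):
    """Return the sample indices that the process with this rank should load."""
    num_samples = math.ceil(dataset_len / world_size)  # samples per process
    total_size = num_samples * world_size
    indices = list(range(dataset_len))
    # Pad by repeating leading indices so every process gets the same count.
    indices += indices[: total_size - dataset_len]
    # Strided split: rank r takes positions r, r + world_size, r + 2 * world_size, ...
    return indices[rank:total_size:world_size]


for rank in range(4):
    print(rank, shard_indices(10, rank, world_size=4))
```

Giving every process the same number of samples matters because all processes must take part in every gradient synchronization; a process that runs out of data early would leave the others waiting.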

Internal Design


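The following explains in detail how torch.nn.parallel.DistributedDataParallel (DDP) works, in particular what happens in each step of one iteration. Each quoted passage is followed by an explanation to help you understand DDP's internals.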

1. Prerequisite

Prerequisite: DDP relies on c10d ProcessGroup for communications. Hence, applications must create ProcessGroup instances before constructing DDP.

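• Explanation:
• DDP relies on c10d.ProcessGroup for inter-process communication.
• Before creating a DDP instance, the distributed process group must be initialized (dist.init_process_group) so that the processes can communicate.
• c10d is PyTorch's distributed communication library and supports multiple backends (such as gloo and nccl).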

2. Construction

Construction: The DDP constructor takes a reference to the local module, and broadcasts state_dict() from the process with rank 0 to all other processes in the group to make sure that all model replicas start from the exact same state.

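• Explanation:
• The DDP constructor takes a local module as input.
• During initialization, DDP broadcasts the model state (state_dict()) from rank 0 (the process whose rank is 0) to all other processes.
• This guarantees that the model replicas on all processes start training from exactly the same state.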

Creating the Reducer

Then, each DDP process creates a local Reducer, which later will take care of the gradients synchronization during the backward pass.

• Explanation:
• Each DDP process creates a local Reducer object.
• The Reducer's main job is to synchronize gradients during the backward pass.

Gradient bucketing

To improve communication efficiency, the Reducer organizes parameter gradients into buckets, and reduces one bucket at a time. Bucket size can be configured by setting the bucket_cap_mb argument in DDP constructor.

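• Explanation:
• To improve communication efficiency, the Reducer groups parameter gradients into buckets.
• Gradients are synchronized (allreduce) one bucket at a time.
• The bucket size is configured with the bucket_cap_mb argument (in MB).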

Bucket assignment rules

The mapping from parameter gradients to buckets is determined at the construction time, based on the bucket size limit and parameter sizes. Model parameters are allocated into buckets in (roughly) the reverse order of Model.parameters() from the given model.

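• Explanation:
• The mapping from parameter gradients to buckets is fixed when DDP is constructed.
• The mapping is based on the bucket size limit and the parameter sizes.
• By default, parameters are assigned to buckets in roughly the reverse order of Model.parameters() (that is, from last to first).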

Why reverse order?

The reason for using the reverse order is because DDP expects gradients to become ready during the backward pass in approximately that order.

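• Explanation:
• During the backward pass, gradients are usually computed from the last layer to the first.
• By assigning parameters to buckets in reverse order, DDP expects the buckets to fill up in roughly the order in which gradients become ready, which reduces the time spent waiting before communication can start.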
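
To make the reverse-order packing concrete, here is a minimal pure-Python sketch of a greedy bucket assignment. It is an illustration under simplifying assumptions, not DDP's actual algorithm: the function name `assign_buckets`, the parameter names, and the unit-less sizes are all hypothetical, and the real implementation takes more factors into account.

```python
def assign_buckets(param_sizes, bucket_cap):
    """Greedily pack parameters into buckets, walking the list in reverse order.

    param_sizes: list of (name, size) in model.parameters() order.
    bucket_cap:  maximum total size per bucket (same unit as size).
    """
    buckets, current, current_size = [], [], 0
    for name, size in reversed(param_sizes):
        # Start a new bucket when adding this parameter would exceed the cap.
        if current and current_size + size > bucket_cap:
            buckets.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        buckets.append(current)
    return buckets


params = [("layer0.weight", 4), ("layer0.bias", 1),
          ("layer1.weight", 4), ("layer1.bias", 1)]
print(assign_buckets(params, bucket_cap=5))
```

In this toy run the last layer's parameters land in the first bucket, so that bucket can be communicated while the gradients of the earlier layer are still being computed.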

When the bucket assumption fails

Note that, the grad0 and grad1 are in bucket1, and the other two gradients are in bucket0. Of course, this assumption might not always be true, and when that happens it could hurt DDP backward speed as the Reducer cannot kick off the communication at the earliest possible time.

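• Explanation:
• The assumption that gradients become ready in reverse order does not always hold.
• If a gradient is computed later than expected, the Reducer cannot start communication for its bucket at the earliest possible time, which slows down the backward pass.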

Autograd hooks

Besides bucketing, the Reducer also registers autograd hooks during construction, one hook per parameter. These hooks will be triggered during the backward pass when the gradient becomes ready.

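• Explanation:
• During construction, the Reducer registers one autograd hook per parameter.
• These hooks fire during the backward pass when the corresponding gradient becomes ready, notifying the Reducer.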

3. Forward pass

Forward Pass: The DDP takes the input and passes it to the local model, and then analyzes the output from the local model if find_unused_parameters is set to True.

• Explanation:
• DDP passes the input to the local model and runs the forward pass.
• If find_unused_parameters=True is set, DDP then analyzes the output of the local model.

What `find_unused_parameters` does

This mode allows running backward on a subgraph of the model, and DDP finds out which parameters are involved in the backward pass by traversing the autograd graph from the model output and marking all unused parameters as ready for reduction.

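• Explanation:
• find_unused_parameters=True allows the backward pass to run on a subgraph of the model.
• DDP traverses the autograd graph starting from the model output and marks every parameter that is not reached as "ready" for reduction.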

Performance overhead

During the backward pass, the Reducer would only wait for unready parameters, but it would still reduce all buckets. Marking a parameter gradient as ready does not help DDP skip buckets as for now, but it will prevent DDP from waiting for absent gradients forever during the backward pass. Note that traversing the autograd graph introduces extra overheads, so applications should only set find_unused_parameters to True when necessary.

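• Explanation:
• During the backward pass the Reducer only waits for parameters that are not yet ready, but it still reduces all buckets.
• Marking a gradient as ready does not let DDP skip buckets, but it prevents the Reducer from waiting forever for gradients that will never arrive.
• Traversing the autograd graph adds overhead, so enable find_unused_parameters=True only when necessary.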
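
The graph traversal can be pictured with a small pure-Python sketch. This is a toy model, not PyTorch's autograd graph: the graph is a plain dictionary from each node to the nodes it was computed from, and the function name and node names are hypothetical.

```python
def find_unused_parameters(all_params, graph, output):
    """Walk backwards from `output` and return parameters that were never reached.

    graph maps each node to the nodes it was computed from (its inputs).
    """
    reached, stack = set(), [output]
    while stack:
        node = stack.pop()
        if node in reached:
            continue
        reached.add(node)
        stack.extend(graph.get(node, []))
    return [p for p in all_params if p not in reached]


# A forward pass that took branch A only, so branch B's weight never fed the output.
graph = {
    "loss": ["head_out"],
    "head_out": ["head.weight", "branch_a_out"],
    "branch_a_out": ["branch_a.weight", "input"],
}
params = ["branch_a.weight", "branch_b.weight", "head.weight"]
print(find_unused_parameters(params, graph, "loss"))
```

The traversal visits every node reachable from the output on every iteration, which is the extra cost the quoted passage warns about.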

4. Backward pass

Backward Pass: The backward() function is directly invoked on the loss Tensor, which is out of DDP’s control, and DDP uses autograd hooks registered at construction time to trigger gradients synchronizations.

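• Explanation:
• backward() is called directly on the loss tensor, which is outside DDP's control.
• DDP uses the autograd hooks registered at construction time to trigger gradient synchronization.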

How gradient synchronization proceeds

When one gradient becomes ready, its corresponding DDP hook on that grad accumulator will fire, and DDP will then mark that parameter gradient as ready for reduction. When gradients in one bucket are all ready, the Reducer kicks off an asynchronous allreduce on that bucket to calculate mean of gradients across all processes.

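• Explanation:
• When a gradient becomes ready, its hook fires and marks that gradient as "ready".
• When all gradients in a bucket are ready, the Reducer launches an asynchronous allreduce on that bucket to compute the mean of the gradients across all processes.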
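
The bookkeeping described here can be sketched in a few lines of pure Python. The class below is a toy stand-in, not the real Reducer: it only counts how many gradients each bucket is still waiting for and records when a bucket would be handed to allreduce. The class name, parameter names, and bucket layout are assumptions for illustration.

```python
class ToyReducer:
    """Tracks which gradients are ready and 'launches' a bucket once it is complete."""

    def __init__(self, buckets):
        # Map each parameter name to the index of the bucket that holds it.
        self.bucket_of = {name: i for i, names in enumerate(buckets) for name in names}
        self.pending = [len(names) for names in buckets]  # gradients still missing
        self.launched = []  # bucket indices in the order their allreduce started

    def autograd_hook(self, name):
        """Called when the gradient of `name` becomes ready."""
        index = self.bucket_of[name]
        self.pending[index] -= 1
        if self.pending[index] == 0:
            # In DDP this is where an asynchronous allreduce would start.
            self.launched.append(index)


reducer = ToyReducer([["w1", "b1"], ["w0", "b0"]])
for name in ["b1", "w1", "b0", "w0"]:  # backward visits the last layer first
    reducer.autograd_hook(name)
    print(name, "ready -> launched buckets:", reducer.launched)
```

Because the first bucket is launched while the second is still filling up, communication overlaps with the remaining backward computation.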

Synchronizing all buckets

When all buckets are ready, the Reducer will block waiting for all allreduce operations to finish. When this is done, averaged gradients are written to the param.grad field of all parameters. So after the backward pass, the grad field on the same corresponding parameter across different DDP processes should be the same.

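• Explanation:
• Once all buckets are ready, the Reducer blocks until all allreduce operations have finished.
• When synchronization is complete, the averaged gradients are written to the param.grad field of every parameter on every process.
• As a result, the gradient of a given parameter is identical across all DDP processes.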
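
The net effect of the allreduce step can be simulated without any communication library. The sketch below takes the gradient vectors of all ranks at once, which a real process could not do, and returns what each rank would hold afterwards; the function name `allreduce_mean` is hypothetical.

```python
def allreduce_mean(per_rank_grads):
    """Average gradient vectors element-wise and give every rank the same result."""
    world_size = len(per_rank_grads)
    summed = [sum(values) for values in zip(*per_rank_grads)]  # the sum across ranks
    averaged = [value / world_size for value in summed]        # divide by world size
    return [list(averaged) for _ in range(world_size)]


grads = [[1.0, 2.0, 3.0],   # gradients computed by rank 0
         [3.0, 4.0, 5.0]]   # gradients computed by rank 1
print(allreduce_mean(grads))
```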

5. Optimizer step

Optimizer Step: From the optimizer’s perspective, it is optimizing a local model. Model replicas on all DDP processes can keep in sync because they all start from the same state and they have the same averaged gradients in every iteration.

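• Explanation:
• From the optimizer's point of view, it is simply optimizing a local model.
• The model replicas on all DDP processes stay in sync because they:
1. Start from the same state.
2. Receive the same averaged gradients in every iteration.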
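
These two conditions are enough to keep replicas identical, which a short pure-Python simulation can confirm. The model `y = w * x`, the data, and the learning rate below are made up for illustration; each "rank" is just a list entry, not a real process.

```python
def local_grad(w, shard):
    """Gradient of the mean squared error of y = w * x on one process's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


shards = [[(1.0, 2.0), (2.0, 4.0)],   # data seen by rank 0
          [(3.0, 6.0), (4.0, 8.0)]]   # data seen by rank 1
weights = [0.0, 0.0]                  # identical start, as after the initial broadcast
lr = 0.01

for step in range(3):
    grads = [local_grad(w, shard) for w, shard in zip(weights, shards)]
    mean_grad = sum(grads) / len(grads)              # what allreduce would deliver
    weights = [w - lr * mean_grad for w in weights]  # each rank's local SGD step
    print(step, grads, weights)
```

Although the two ranks see different data and compute different local gradients, their weights remain equal after every step because each applies the same averaged gradient to the same starting value.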

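Summary

DDP's core mechanism can be summarized as follows:

Initialization:
• Broadcast the model state so that the replicas on all processes are identical.
• Create the Reducer, which is responsible for gradient synchronization.
• Assign parameters to buckets in reverse order and register the autograd hooks.
Forward pass:
• Pass the input to the local model.
• If find_unused_parameters=True is enabled, analyze the model output and mark unused parameters.
Backward pass:
• Autograd hooks trigger gradient synchronization.
• Once the gradients in a bucket are ready, the Reducer runs allreduce on that bucket.
• After synchronization, the gradients are identical on all processes.
Optimizer update:
• The optimizer only optimizes the local model, but because gradients are synchronized, the model parameters stay identical across processes.

Through these mechanisms, DDP achieves efficient distributed training while keeping model parameters consistent across all processes.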

Implementing DDP

Process group

ProcessGroup and Store are two important modules in c10d, PyTorch's distributed communication library.

1. ProcessGroup

ProcessGroup: ProcessGroup.hpp contains the abstract API of all process group implementations. The c10d library provides 3 implementations out of the box, namely, ProcessGroupGloo, ProcessGroupNCCL, and ProcessGroupMPI.

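Explanation:

(1) What is ProcessGroup?
• ProcessGroup is an abstract class defined in the c10d library that provides a common interface for communication within a group of processes.
• It defines a set of collective operations (such as broadcast and allreduce); the concrete implementations are supplied by the different backends.

(2) Implementations provided by c10d:
(2.1) ProcessGroupGloo:
◦ Built on the Gloo library.
◦ Gloo is a collective communication library that supports both CPU and GPU tensors.
◦ Typically used for CPU training, or in environments where NCCL is not available.

(2.2) ProcessGroupNCCL:
◦ Built on the NVIDIA NCCL library.
◦ NCCL is a high-performance collective communication library designed specifically for NVIDIA GPUs.
◦ The usual choice for GPU training, on a single machine or across machines, where it generally performs best.

(2.3) ProcessGroupMPI:
◦ Built on MPI.
◦ MPI is a long-established message-passing standard widely used in high-performance computing (HPC).
◦ Suitable when you need to integrate with existing MPI infrastructure.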

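What `ProcessGroup` is used for:

(1) Initialization:
• During DistributedDataParallel (DDP) initialization, ProcessGroup::broadcast() is used to send the model state (state_dict()) from rank 0 to the other processes.
• This ensures that the replicas on all processes start training from the same state.

(2) Backward pass:
• During the backward pass, ProcessGroup::allreduce() is used to sum the gradients across processes, which DDP combines with a division by the number of processes to obtain the average.
• This is the core operation behind DDP's gradient synchronization.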
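
The split between an abstract interface and interchangeable backends can be sketched in Python as follows. This is only an analogy for the C++ design: `ToyProcessGroup` and `InMemoryGroup` are hypothetical names, and the in-memory backend fakes all ranks inside a single process.

```python
from abc import ABC, abstractmethod


class ToyProcessGroup(ABC):
    """Backend-independent interface; each backend supplies the implementation."""

    @abstractmethod
    def broadcast(self, buffers, src):
        """Copy the buffer of rank `src` to every rank."""

    @abstractmethod
    def allreduce(self, buffers):
        """Replace every rank's buffer with the element-wise sum over all ranks."""


class InMemoryGroup(ToyProcessGroup):
    """A single-process stand-in where buffers[r] is the buffer owned by rank r."""

    def broadcast(self, buffers, src):
        return [list(buffers[src]) for _ in buffers]

    def allreduce(self, buffers):
        total = [sum(values) for values in zip(*buffers)]
        return [list(total) for _ in buffers]


group = InMemoryGroup()
weights = group.broadcast([[1.0, 2.0], [9.0, 9.0]], src=0)   # initialization
summed = group.allreduce([[1.0, 1.0], [3.0, 5.0]])           # backward pass
averaged = [[v / len(summed) for v in buf] for buf in summed]
print(weights, averaged)
```

Code written against the abstract interface does not need to change when the backend does, which is the reason the same DDP logic can run on Gloo, NCCL, or MPI.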

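Summary:

• ProcessGroup is an abstract class that defines the common interface for distributed communication.
• c10d provides three backend implementations: ProcessGroupGloo, ProcessGroupNCCL, and ProcessGroupMPI.
• Different backends suit different hardware and scenarios:
• Gloo: CPU training, or GPU training where NCCL is unavailable.
• NCCL: GPU training.
• MPI: high-performance computing environments.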

2. Store

Store.hpp: assists the rendezvous service for process group instances to find each other.

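Explanation:

(1) What is Store?
• Store is a helper module in the c10d library that supports the rendezvous service between processes.
• Its main purpose is to help the processes in a distributed job find each other and exchange the metadata they need (such as IP addresses and port numbers).

(2) How Store works:
• In distributed training, each process needs information about the other processes (such as rank and world_size) before communication can be established.
• Store provides a key-value store in which processes can write and read information.
• For example, rank 0 can write its address and port into the Store, and the other processes can read them from the Store to establish communication.

(3) Typical uses of Store:
• Process discovery:
◦ When distributed training starts, the processes need some way to find each other.
◦ Store lets processes find each other through shared key-value pairs.
• Metadata exchange:
◦ Store can be used to hold and exchange metadata needed for distributed training, such as master_addr (master node address) and master_port (master node port).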
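
The rendezvous idea can be illustrated with a toy in-memory key-value store. This sketch is not the c10d Store API: `ToyStore` is a hypothetical class, threads stand in for processes, and the address string is made up. The key property it shows is that a reader blocks until the writer has published the key.

```python
import threading


class ToyStore:
    """A minimal in-memory key-value store whose get() blocks until the key exists."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def set(self, key, value):
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()  # wake up anyone waiting for a key

    def get(self, key, timeout=5.0):
        with self._cond:
            if not self._cond.wait_for(lambda: key in self._data, timeout=timeout):
                raise TimeoutError(f"key {key!r} was never set")
            return self._data[key]


store = ToyStore()
seen = {}


def worker(rank):
    if rank == 0:
        store.set("master_addr", "localhost:29500")  # rank 0 publishes its address
    seen[rank] = store.get("master_addr")            # every rank reads it


threads = [threading.Thread(target=worker, args=(rank,)) for rank in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(seen)
```

Blocking reads mean the workers can start in any order: a rank that asks for the address before rank 0 has written it simply waits.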

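`Store` implementations:

(1) Store is an abstract class, and c10d provides several concrete implementations, for example:
• File-based implementation:
◦ Stores key-value pairs on the file system.
• TCP-based implementation:
◦ Stores and retrieves key-value pairs over the network.
• Other custom implementations:
◦ Users can implement their own Store as needed.

Where `Store` is used:

• During process group initialization, which must happen before DDP is constructed, Store is used to coordinate communication setup between processes.
• For example, dist.init_process_group() uses a Store for process discovery and metadata exchange.

Summary:

• Store is a helper module for process discovery and metadata exchange in distributed training.
• It provides a key-value store in which processes can write and read information.
• Store is one of the basic building blocks for establishing communication in distributed training.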

DistributedDataParallel


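The following describes the implementation details of PyTorch's DistributedDataParallel (DDP) module, including the division of work between the Python and C++ layers, parameter synchronization, gradient synchronization, and the broadcasting of model buffers, to help you understand how DDP works internally.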

1. DistributedDataParallel (Python layer)

distributed.py: is the Python entry point for DDP. It implements the initialization steps and the forward function for the nn.parallel.DistributedDataParallel module which call into C++ libraries.

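Explanation:

(1) What is distributed.py?
• distributed.py is DDP's Python interface file; it provides the nn.parallel.DistributedDataParallel class that users call.
• It is the entry point of DDP and contains the initialization and forward-pass logic.

(2) DDP initialization steps:
• During initialization, distributed.py calls into the C++ implementation to set up the communication-related state.
• It also calls the _sync_param function to synchronize model parameters and buffers.

(3) Forward pass:
• The forward-pass logic is also implemented in distributed.py, but it calls into the C++ layer underneath.
• If one DDP process works on multiple devices (e.g. multiple GPUs), _sync_param also handles parameter synchronization within the process.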

The _sync_param function

Its _sync_param function performs intra-process parameter synchronization when one DDP process works on multiple devices, and it also broadcasts model buffers from the process with rank 0 to all other processes.

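(1) Purpose:
• _sync_param is an internal DDP function responsible for parameter synchronization.
• It has two main roles:
1. Intra-process parameter synchronization:
◦ When one DDP process works on multiple devices (e.g. multiple GPUs), _sync_param synchronizes the model parameters across those devices within the process.
2. Broadcasting model buffers:
◦ _sync_param broadcasts the model buffers from rank 0 to all other processes.
◦ Model buffers are parts of the model state that are not trained by the optimizer (such as the running mean and variance of BatchNorm) but must stay consistent across all processes.

(2) Why broadcast buffers?
• Buffers are part of the model, but they do not receive gradients, so gradient synchronization alone does not keep them in sync.
• To keep the model consistent across processes, buffers are broadcast from rank 0 to the other processes.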

2. comm.h

comm.h: implements the coalesced broadcast helper function which is invoked to broadcast model states during initialization and synchronize model buffers before the forward pass.

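Explanation:

(1) What is comm.h?
• comm.h is a C++ header file of DDP that provides communication helper functions.
• It implements coalesced broadcast, which is used to broadcast model state and synchronize model buffers efficiently.

Coalesced broadcast:

(1) What is coalesced broadcast?
• Coalesced broadcast is an optimization that merges many small tensors into one large tensor before broadcasting, which reduces communication overhead.
• In distributed training, model parameters and buffers often consist of many small tensors, and broadcasting them one by one would require many separate communication operations.
• Coalesced broadcast packs these small tensors into one large tensor and broadcasts it in a single operation, which reduces the number of communication calls.

(2) Where coalesced broadcast is used:

Initialization:
◦ When DDP is initialized, the coalesced broadcast in comm.h is used to send the model state (state_dict()) from rank 0 to the other processes.
Before the forward pass:
◦ Before each forward pass, the coalesced broadcast in comm.h is used to synchronize the model buffers.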
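
The packing and unpacking behind a coalesced broadcast can be sketched with plain Python lists. This is an illustration of the idea only: the function names `coalesce` and `uncoalesce` are hypothetical, lists stand in for tensors, and the "broadcast" is just a copy of the flat buffer.

```python
def coalesce(tensors):
    """Concatenate several small 'tensors' (lists) into one flat buffer."""
    flat = [value for tensor in tensors for value in tensor]
    sizes = [len(tensor) for tensor in tensors]
    return flat, sizes


def uncoalesce(flat, sizes):
    """Split a flat buffer back into pieces with the original sizes."""
    pieces, offset = [], 0
    for size in sizes:
        pieces.append(flat[offset:offset + size])
        offset += size
    return pieces


rank0_state = [[0.1, 0.2], [0.3], [0.4, 0.5, 0.6]]  # e.g. a weight, a bias, a buffer
flat, sizes = coalesce(rank0_state)
received = list(flat)          # stands in for a single broadcast of the flat buffer
print(len(rank0_state), "tensors sent as 1 buffer of", len(flat), "values")
print(uncoalesce(received, sizes))
```

Every receiver already knows the sizes from its own copy of the model, so only the flat buffer has to travel over the network.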

3. reducer.h

reducer.h: provides the core implementation for gradient synchronization in the backward pass. It has three entry point functions:

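Explanation:

(1) What is reducer.h?
• reducer.h is a C++ header file of DDP that provides the core implementation of gradient synchronization.
• It defines the Reducer class, which synchronizes gradients during the backward pass.

(2) What the Reducer does:
• During the backward pass, the Reducer collects the gradients from all processes, computes their average, and writes the result back to the param.grad field on each process.
• This is the core functionality that makes DDP's distributed training work.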

The three entry points of `Reducer`

1. Reducer constructor

Reducer: The constructor is called in distributed.py which registers Reducer::autograd_hook() to gradient accumulators.

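(1) Purpose:
• The Reducer constructor is called from distributed.py.
• It registers an autograd hook (autograd_hook()) for every parameter; the hook fires when that parameter's gradient is ready.

(2) Why register hooks?
• During the backward pass, gradients become ready one after another rather than all at once.
• With hooks registered, the Reducer can start synchronizing a bucket as soon as its gradients are ready, instead of waiting until every gradient has been computed.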

2. The autograd_hook() function

autograd_hook() function will be invoked by the autograd engine when a gradient becomes ready.

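(1) Purpose:
• autograd_hook() is an autograd hook that PyTorch's autograd engine calls when a gradient becomes ready.
• When the gradient of a parameter is ready, autograd_hook() fires and notifies the Reducer.

(2) Workflow:

During the backward pass, once a gradient has been computed, PyTorch's autograd engine calls autograd_hook().
autograd_hook() marks that parameter's gradient as "ready".
When all gradients in a bucket are ready, the Reducer launches an allreduce operation to synchronize them.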

3. The prepare_for_backward() function

prepare_for_backward() is called at the end of DDP forward pass in distributed.py. It traverses the autograd graph to find unused parameters when find_unused_parameters is set to True in DDP constructor.

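(1) Purpose:
• prepare_for_backward() is called at the end of DDP's forward pass.
• Its main job is to prepare for the backward pass:
1. If find_unused_parameters=True is enabled, it traverses the autograd graph to find unused parameters.
2. It marks those unused parameters as "ready" so that they do not hold up gradient synchronization in the backward pass.

(2) Why traverse the autograd graph?
• In some cases, some of the model's parameters are not used in the forward pass (for example, parameters inside a conditional branch that was not taken).
• If these parameters were not marked as "ready", the Reducer would wait forever for gradients that never arrive.
• Traversing the autograd graph finds these unused parameters and marks them as "ready", which avoids that wait.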

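4. Summary

`distributed.py` (Python layer):

• The entry point of DDP; contains the initialization and forward-pass logic.
• The _sync_param function:
• Synchronizes parameters within a process (multi-GPU case).
• Broadcasts model buffers (from rank 0 to the other processes).

`comm.h` (C++ layer):

• Provides coalesced broadcast for efficiently broadcasting model state and synchronizing model buffers.
• Called during initialization and before the forward pass.

`reducer.h` (C++ layer):

• Provides the core implementation of gradient synchronization.
• Has three entry points:

Reducer constructor:
◦ Registers autograd_hook(), which triggers synchronization when gradients become ready.
autograd_hook() function:
◦ Called when a gradient is ready; marks the gradient as "ready".
◦ When all gradients in a bucket are ready, launches an allreduce operation to synchronize them.
prepare_for_backward() function:
◦ Called at the end of the forward pass to prepare for the backward pass.
◦ If find_unused_parameters=True is enabled, traverses the autograd graph, finds unused parameters, and marks them as "ready".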