nvidia-nccl Study Notes

NCCL Background

NCCL1 vs NCCL2

  • nccl1:
    NCCL1 supports single-node multi-GPU communication only; it does not support multi-node communication.
    Source code: https://github.com/NVIDIA/nccl
  • nccl2:
    NCCL2 adds multi-node communication strategies on top of NCCL1. The inter-node transport is selectable, so data can be exchanged between nodes over InfiniBand (IB), TCP, and other protocols.

NCCL2 API

Initialization
  • Creating the unique Id

Create a single, shared Id; it is generated only once per communicator clique. The Id is then distributed to every participant in the clique.

  1. ncclResult_t ncclGetUniqueId(ncclUniqueId* uniqueId)
    Generates the Id used by the initialization function (ncclCommInitRank). The function must be called exactly once in the whole distributed job (by a single caller), and the resulting Id must then be distributed to all other tasks before each of them performs the ncclCommInitRank initialization (which requires this globally shared Id).
    > Generates an Id to be used in ncclCommInitRank. ncclGetUniqueId should be called once and the Id should be distributed to all ranks in the communicator before calling ncclCommInitRank.
  • Initializing the communicator

Create a communicator for every participant in the clique. Each participant binds to its own communicator for all subsequent communication (a combined usage sketch follows this list).

  1. ncclResult_t ncclCommInitRank(ncclComm_t* comm, int nranks, ncclUniqueId commId, int rank)
    Creates a new communicator in the multi-process/multi-thread case. The rank argument must lie between 0 and nranks-1 and be unique within the clique. Each rank should correspond to a CUDA device that has already been selected. The call implicitly synchronizes with the other ranks, so it must be issued from different processes/threads, or, when issued from a single thread, be wrapped in ncclGroupStart/ncclGroupEnd.

Creates a new communicator (multi thread/process version). rank must be between 0 and nranks-1 and unique within a communicator clique. Each rank is associated to a CUDA device, which has to be set before calling ncclCommInitRank. ncclCommInitRank implicitly synchronizes with other ranks, so it must be called by different threads/processes or use ncclGroupStart/ncclGroupEnd.

  1. ncclResult_t ncclCommInitAll(ncclComm_t* comm, int ndev, const int* devlist)
    Creates all communicators at once within a single process. comm must be pre-allocated, and the number of devices plus a device list are passed in. This function is convenient for single-node communication and is not used for multi-node communication.
    > Creates a clique of communicators (single process version). This is a convenience function to create a single-process communicator clique. Returns an array of ndev newly initialized communicators in comm. comm should be pre-allocated with size at least ndev*sizeof(ncclComm_t). If devlist is NULL, the first ndev CUDA devices are used. Order of devlist defines user-order of processors within the communicator.
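
Putting the two initialization steps together: the sketch below shows the usual bootstrap pattern, assuming one process per GPU and that MPI is available as the out-of-band channel for distributing the Id (any other channel, such as a TCP socket, works as well). The NCCLCHECK macro and the device-selection policy (device index == MPI rank) are illustrative choices, not part of NCCL.

```c
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Minimal error-checking helper for NCCL calls.
#define NCCLCHECK(cmd) do {                                    \
    ncclResult_t r = (cmd);                                    \
    if (r != ncclSuccess) {                                    \
        fprintf(stderr, "NCCL error %s:%d: %s\n",              \
                __FILE__, __LINE__, ncclGetErrorString(r));    \
        exit(EXIT_FAILURE);                                    \
    }                                                          \
} while (0)

int main(int argc, char* argv[]) {
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // Rank 0 creates the unique Id once; every rank must hold a copy
    // before calling ncclCommInitRank.
    ncclUniqueId id;
    if (rank == 0) NCCLCHECK(ncclGetUniqueId(&id));
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

    // Bind a CUDA device before initializing the communicator
    // (one process per GPU; device index == rank is assumed here).
    cudaSetDevice(rank);

    ncclComm_t comm;
    NCCLCHECK(ncclCommInitRank(&comm, nranks, id, rank));

    /* ... collective operations on `comm` go here ... */

    NCCLCHECK(ncclCommDestroy(comm));
    MPI_Finalize();
    return 0;
}
```
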
Collective operations

Collective operations must be invoked separately on every communicator in the clique. A call returns as soon as the operation has been enqueued on the CUDA stream. The collectives must be issued from independent processes/threads, or be wrapped in group semantics (ncclGroupStart/ncclGroupEnd) when issued from a single thread. In-place mode (sendbuff == recvbuff) is supported. A usage sketch follows the list of operations below.

Collective communication operations must be called separately for each communicator in a communicator clique. They return when operations have been enqueued on the CUDA stream. Since they may perform inter-CPU synchronization, each call has to be done from a different thread or process, or needs to use Group Semantics.

  • Reduce

    1. ncclResult_t ncclReduce(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream)
      Reduces the data onto the root (root is a rank, not a CUDA device).

    Reduces data arrays of length count in sendbuff into recvbuff using op operation. recvbuff may be NULL on all calls except for root device. root is the rank (not the CUDA device) where data will reside after the operation is complete. In-place operation will happen if sendbuff == recvbuff.

  • Broadcast

    1. ncclResult_t ncclBcast(void* buff, size_t count, ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream)
      Broadcasts the data on the root rank to all other ranks.

    Copies count values from root to all other devices. root is the rank (not the CUDA device) where data resides before the operation is started. This operation is implicitly in place.

  • All-Reduce

    1. ncclResult_t ncclAllReduce(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream)
      Performs the reduce operation across all ranks and leaves an identical copy of the result on every rank.

    Reduces data arrays of length count in sendbuff using op operation, and leaves identical copies of result on each recvbuff. In-place operation will happen if sendbuff == recvbuff.

  • Reduce-Scatter

  • All-Gather

    1. ncclResult_t ncclAllGather(const void* sendbuff, void* recvbuff, size_t sendcount, ncclDataType_t datatype, ncclComm_t comm, cudaStream_t stream)
      Gathers data from all other ranks into the local recvbuff; data received from rank i is stored at offset i*sendcount.

    Each device gathers sendcount values from other GPUs into recvbuff, receiving data from rank i at offset i*sendcount. Assumes recvcount is equal to nranks*sendcount, which means that recvbuff should have a size of at least nranks*sendcount elements. In-place operations will happen if sendbuff == recvbuff + rank * sendcount.
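
To make the calling convention concrete, the following sketch issues ncclAllReduce on a communicator that was initialized as in the bootstrap example above (one process per GPU). The element count, the float datatype, and the ncclSum reduction are arbitrary choices for illustration, and NCCLCHECK is the error-checking macro defined earlier.

```c
#include <nccl.h>
#include <cuda_runtime.h>

// Runs one out-of-place and one in-place all-reduce on an already
// initialized communicator. The NCCL call only enqueues work on `stream`;
// synchronizing the stream tells us when the result is actually ready.
void allreduce_example(ncclComm_t comm, size_t count) {
    float *sendbuff, *recvbuff;
    cudaStream_t stream;
    cudaMalloc((void**)&sendbuff, count * sizeof(float));
    cudaMalloc((void**)&recvbuff, count * sizeof(float));
    cudaStreamCreate(&stream);

    NCCLCHECK(ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum,
                            comm, stream));
    cudaStreamSynchronize(stream);

    // In-place variant: sendbuff == recvbuff is explicitly supported.
    NCCLCHECK(ncclAllReduce(recvbuff, recvbuff, count, ncclFloat, ncclSum,
                            comm, stream));
    cudaStreamSynchronize(stream);

    cudaFree(sendbuff);
    cudaFree(recvbuff);
    cudaStreamDestroy(stream);
}
```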

Group operations

When a single thread manages multiple GPUs, group semantics are needed to tie together the calls issued for the different ranks/devices (so that the inter-CPU synchronization does not conflict). Wrapping the calls in ncclGroupStart and ncclGroupEnd marks them as belonging to the same operation: ncclGroupStart starts queuing the calls, and ncclGroupEnd waits until all of them have been issued (for collective operations, ncclGroupEnd only guarantees that everything has been enqueued on the CUDA streams, not that the operations have completed). Group semantics can be used both with collective operations and with ncclCommInitRank; a sketch follows this section.

When managing multiple GPUs from a single thread, and since NCCL collective calls may perform inter-CPU synchronization, we need to "group" calls for different ranks/devices into a single call. Grouping NCCL calls as being part of the same collective operation is done using ncclGroupStart and ncclGroupEnd. ncclGroupStart will enqueue all collective calls until the ncclGroupEnd call, which will wait for all calls to be complete. Note that for collective communication, ncclGroupEnd only guarantees that the operations are enqueued on the streams, not that the operation is effectively done. Both collective communication and ncclCommInitRank can be used in conjunction with ncclGroupStart/ncclGroupEnd.

  • ncclResult_t ncclGroupStart()
    Starts a group call; the NCCL calls that follow will not block on inter-CPU synchronization.

    start a group call. All subsequent calls to NCCL may not block due to inter-CPU synchronization.

  • ncclResult_t ncclGroupEnd()
    Ends the group call; blocks until all calls issued since ncclGroupStart have completed, then returns.

    End a group call. Wait for all calls since ncclGroupStart to complete before returning.
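
For the single-thread, multi-GPU case, the sketch below combines ncclCommInitAll with the group semantics described above. It is only an illustration: the device count, buffer contents, and the choice of ncclBcast are arbitrary, and NCCLCHECK is again the macro from the initialization sketch.

```c
#include <nccl.h>
#include <cuda_runtime.h>
#include <stdlib.h>

// One thread drives `ndev` GPUs: ncclCommInitAll creates the whole clique,
// and the per-communicator calls of each collective sit inside one group.
void single_thread_broadcast(int ndev, size_t count) {
    ncclComm_t* comms     = (ncclComm_t*)malloc(ndev * sizeof(ncclComm_t));
    float** buffs         = (float**)malloc(ndev * sizeof(float*));
    cudaStream_t* streams = (cudaStream_t*)malloc(ndev * sizeof(cudaStream_t));

    // devlist == NULL means "use the first ndev CUDA devices".
    NCCLCHECK(ncclCommInitAll(comms, ndev, NULL));

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaMalloc((void**)&buffs[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // From a single thread, the calls for all ranks must be grouped.
    NCCLCHECK(ncclGroupStart());
    for (int i = 0; i < ndev; ++i) {
        NCCLCHECK(ncclBcast(buffs[i], count, ncclFloat, 0 /* root rank */,
                            comms[i], streams[i]));
    }
    NCCLCHECK(ncclGroupEnd());  // enqueued on the streams, not yet finished

    // Wait for completion on every stream, then release all resources.
    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(buffs[i]);
        cudaStreamDestroy(streams[i]);
        NCCLCHECK(ncclCommDestroy(comms[i]));
    }
    free(comms); free(buffs); free(streams);
}
```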

Other operations
  • ncclResult_t ncclCommDestroy(ncclComm_t comm)
    Frees the resources associated with the communicator.

    Frees resources associated with communicator object

  • const char* ncclGetErrorString(ncclResult_t result)
    Returns a readable error message for a result code (used in the error-handling sketch after this list).

    Returns a human-readable error message.

  • ncclResult_t ncclCommCount(const ncclComm_t comm, int* count)
    Gets the total number of ranks in the communicator clique.

    Gets the number of ranks in the communicator clique.

  • ncclResult_t ncclCommCuDevice(const ncclComm_t comm, int* device)
    Gets the CUDA device associated with the communicator.

    Returns the cuda device number associated with the communicator.

  • ncclResult_t ncclCommUserRank(const ncclComm_t comm, int* rank)
    Gets the rank of the caller within the communicator.

    Returns the user-ordered "rank" associated with the communicator.
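
The query and cleanup helpers above are typically combined for sanity checks and error reporting; a minimal sketch (the function name is illustrative):

```c
#include <nccl.h>
#include <stdio.h>

// Prints basic information about an initialized communicator and uses
// ncclGetErrorString to turn a failure code into a readable message.
void report_comm_info(ncclComm_t comm) {
    int nranks = 0, device = 0, rank = 0;

    ncclResult_t res = ncclCommCount(comm, &nranks);
    if (res != ncclSuccess) {
        fprintf(stderr, "ncclCommCount failed: %s\n", ncclGetErrorString(res));
        return;
    }
    ncclCommCuDevice(comm, &device);
    ncclCommUserRank(comm, &rank);

    printf("rank %d of %d, bound to CUDA device %d\n", rank, nranks, device);
}
```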

NCCL Dynamic Scaling

Single-node multi-GPU, multi-thread dynamic scaling

Design idea:
Each thread initializes its own communicator (the ncclId is created in the main thread and is visible to all threads). When a collective (merge) call fails in some thread, check whether the failure was caused by another thread exiting.
If a thread exit did cause the failure, every thread re-initializes its own communicator and retries the previous collective operation (at this point the number of devices has decreased, so this is effectively re-creating the communicator clique).

  • Test results:
    1. Each thread can initialize its own communicator successfully.
    2. If a thread exits during a collective (merge) operation, the remaining threads block indefinitely (the call never returns).
  • Conclusion
    Dynamic scaling for single-node multi-GPU (multi-thread) setups cannot be supported.
Single-node/multi-node multi-GPU, multi-process dynamic scaling

Design idea:
Each process initializes its own communicator (during initialization, process 0 broadcasts the ncclId to all processes over TCP). When a collective (merge) call fails in some process, check whether the failure was caused by another process exiting. If a process exit did cause the failure, every process re-initializes its own communicator and retries the previous collective operation (at this point the number of devices has decreased, so this is effectively re-creating the communicator clique).

  • Test results:
    1. Having the server process (the TCP server side) create the ncclId, broadcast it to all worker processes (the TCP client side), and then run the communication works correctly (the server process itself does not need to take part in the communication).
    2. If a process exits during a collective (merge) operation, all remaining processes block indefinitely (the calls never return), with both GPU and CPU utilization stuck at 100%.
  • Conclusion
    Dynamic scaling for single-node/multi-node multi-GPU (multi-process) setups cannot be supported.

Reposted from: https://my.oschina.net/u/1459307/blog/1650028
