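Before a communication library can be initialized, the RANKs must first establish a basic communication path among themselves. NCCL does this by building a Bootstrap network, which provides efficient control-plane communication and synchronization between RANKs.
NCCL programming flow
Within this flow, the Bootstrap network is built during NCCL initialization, across the calls to ncclGetUniqueId and ncclCommInitRank, and is then used for control-plane communication and synchronization between RANKs.
Building the Bootstrap network
The Bootstrap network is built in the following steps.
Building the central address
The central address, ncclUniqueId, is built through the ncclGetUniqueId interface, which by convention is called by RANK 0. The struct below holds the address information carried inside ncclUniqueId. Its ncclSocketAddress member may be an IPv4 or IPv6 socket address and can be used to instantiate a socket. Every RANK connects to this address, and once connected can send information to RANK 0.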
struct ncclBootstrapHandle {
  uint64_t magic;                 // generated by RANK 0
  union ncclSocketAddress addr;   // contains the RANK 0 socket address to connect to
};
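To make the relationship between the handle and the opaque ID concrete, here is a minimal sketch of packing such a handle into a fixed-size byte buffer and reading it back. All Demo* names are hypothetical stand-ins, not NCCL definitions, and the 128-byte ID size is an assumption of this sketch.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <netinet/in.h>
#include <sys/socket.h>

// Hypothetical stand-ins for the NCCL types (not the real definitions).
constexpr std::size_t kUniqueIdBytes = 128;  // assumed size of the opaque ID
struct DemoUniqueId { char internal[kUniqueIdBytes]; };

union DemoSocketAddress {
  sockaddr sa;
  sockaddr_in sin;    // IPv4
  sockaddr_in6 sin6;  // IPv6
};

struct DemoBootstrapHandle {
  std::uint64_t magic;
  DemoSocketAddress addr;
};
static_assert(sizeof(DemoBootstrapHandle) <= sizeof(DemoUniqueId),
              "handle must fit inside the opaque ID");

// RANK 0 side: copy the handle into a zero-filled opaque ID.
inline DemoUniqueId packHandle(const DemoBootstrapHandle& h) {
  DemoUniqueId id;
  std::memset(&id, 0, sizeof(id));
  std::memcpy(id.internal, &h, sizeof(h));
  return id;
}

// Any rank: recover the handle from the ID it received.
inline DemoBootstrapHandle unpackHandle(const DemoUniqueId& id) {
  DemoBootstrapHandle h;
  std::memcpy(&h, id.internal, sizeof(h));
  return h;
}

// Build an IPv4 loopback handle for demonstration purposes.
inline DemoBootstrapHandle makeLoopbackHandle(std::uint64_t magic, std::uint16_t port) {
  assert(port != 0);  // port 0 would mean "any port", which peers cannot connect to
  DemoBootstrapHandle h;
  std::memset(&h, 0, sizeof(h));
  h.magic = magic;
  h.addr.sin.sin_family = AF_INET;
  h.addr.sin.sin_port = htons(port);
  h.addr.sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  return h;
}
```

Because the ID is just bytes, it can be shipped through any out-of-band channel; the receiver only needs to agree on the handle layout.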
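The following is the call graph of ncclGetUniqueId. The functions relevant to ncclUniqueId are bootstrapNetInit and ncclSocketGetAddrFromString.
bootstrapNetInit searches the machine's network devices according to a set of rules, selects one for building the Bootstrap network, and uses that device's IP address to build bootstrapNetIfAddr, which can in turn be used to build the ncclUniqueId.
When bootstrapGetUniqueId runs, it builds the ncclUniqueId from bootstrapNetIfAddr if the user has not set NCCL_COMM_ID. Otherwise it calls ncclSocketGetAddrFromString to build the ncclUniqueId from the address given in NCCL_COMM_ID.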
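The selection between the two address sources can be sketched as below. This is an illustration only: splitHostPort and chooseRootAddress are hypothetical helpers, and the accepted string forms ("host:port" and "[ipv6]:port") are an assumption of this sketch rather than a statement of what ncclSocketGetAddrFromString accepts.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct HostPort {
  std::string host;
  std::string port;
  bool ok;
};

// Split "<ipv4-or-hostname>:<port>" or "[<ipv6>]:<port>" into host and port.
inline HostPort splitHostPort(const std::string& s) {
  HostPort r{"", "", false};
  std::size_t colon = 0;
  if (!s.empty() && s[0] == '[') {
    // Bracketed IPv6 literal: the port separator must follow the ']'.
    std::size_t close = s.find(']');
    if (close == std::string::npos || close + 1 >= s.size() || s[close + 1] != ':') return r;
    r.host = s.substr(1, close - 1);
    colon = close + 1;
  } else {
    colon = s.rfind(':');
    if (colon == std::string::npos) return r;
    r.host = s.substr(0, colon);
  }
  r.port = s.substr(colon + 1);
  r.ok = !r.host.empty() && !r.port.empty();
  return r;
}

// envCommId: the value of NCCL_COMM_ID, or nullptr when it is unset.
// ifAddr:    the address derived from the selected network interface.
inline std::string chooseRootAddress(const char* envCommId, const std::string& ifAddr) {
  if (envCommId != nullptr) {
    assert(splitHostPort(envCommId).ok && "malformed NCCL_COMM_ID");
    return envCommId;  // the user-specified address takes precedence
  }
  return ifAddr;       // otherwise fall back to the interface address
}
```

In a real program the first argument would typically come from std::getenv("NCCL_COMM_ID").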
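Broadcasting the central address
Broadcasting the central address means broadcasting the ncclUniqueId. There is no fixed way to do this; the example above uses MPI's MPI_Bcast to broadcast it to all RANKs. Many other approaches work as well, such as distributing it through network information preconfigured on each RANK, or, on a single machine where the RANKs may be different threads, sharing it through memory.
Building the star network
Building the star network is closely tied to bootstrapCreateRoot and to bootstrapInit/bootstrapSplit, which are triggered from ncclGetUniqueId and ncclCommInitRank. RANK 0 first calls bootstrapCreateRoot to create the bootstrapRoot thread in the background; this thread acts as the center of the star network.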
struct extInfo {
  int rank;
  int nranks;
  union ncclSocketAddress extAddressListenRoot;
  union ncclSocketAddress extAddressListen;
};
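Next, every RANK, including RANK 0, calls bootstrapInit from a newly started thread and communicates with the bootstrapRoot thread over the network. bootstrapInit uses bootstrapNetIfAddr to initialize two ncclSocket objects and fills their addresses into an extInfo struct to be sent to RANK 0. Because each RANK obtained the central address in the previous step through the ncclUniqueId broadcast, it can now send its extInfo to the center, that is, the bootstrapRoot thread. At this point every RANK can send data to RANK 0, and once RANK 0 has received a RANK's extInfo it can send data back to that RANK through extAddressListenRoot. The star network is then complete.
Building the ring network
A star network is functionally sufficient for exchanging data between RANKs, but all data has to be gathered at the central node and then forwarded to the other RANKs. Since the Bootstrap network is often used for point-to-point Send/Recv and for collectives such as AllGather, the central node of a star topology easily becomes a bottleneck.
The RANKs therefore need a more efficient interconnect topology for the Bootstrap network's point-to-point and collective operations. NCCL uses a ring topology.
After RANK 0 and the other RANKs have built the star topology, RANK 0 sends each RANK the extAddressListen address of its next RANK. On receiving it, each RANK connects to its next RANK, which forms the ring in one direction. Each RANK then accepts the connection from its previous RANK, which completes the ring in the other direction. The Bootstrap ring is now in place, and data can be sent and received through the Bootstrap module's interfaces.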
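The root's part in turning the star into a ring can be sketched as a pure function: collect one report per rank, then tell each rank where its next rank listens. This is a simplified model, not NCCL code. DemoExtInfo, RingAssignment, and buildRingAssignments are hypothetical names, and addresses are plain strings instead of socket addresses.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for extInfo: addresses are plain strings here.
struct DemoExtInfo {
  int rank;
  int nranks;
  std::string listenRoot;  // where the root can reach this rank
  std::string listen;      // where ring/peer connections reach this rank
};

// What the root would send to one rank.
struct RingAssignment {
  std::string sendTo;       // destination: that rank's listenRoot address
  std::string nextAddress;  // payload: the next rank's listen address
};

// Root side: reports may arrive in any order, so index them by rank first,
// then build one assignment per rank with next = (rank + 1) % nranks.
inline std::vector<RingAssignment> buildRingAssignments(
    const std::vector<DemoExtInfo>& reports) {
  const int nranks = static_cast<int>(reports.size());
  std::vector<const DemoExtInfo*> byRank(nranks, nullptr);
  for (const DemoExtInfo& info : reports) {
    assert(info.nranks == nranks && info.rank >= 0 && info.rank < nranks);
    assert(byRank[info.rank] == nullptr);  // each rank reports exactly once
    byRank[info.rank] = &info;
  }
  std::vector<RingAssignment> out(nranks);
  for (int r = 0; r < nranks; ++r) {
    const int next = (r + 1) % nranks;
    out[r] = RingAssignment{byRank[r]->listenRoot, byRank[next]->listen};
  }
  return out;
}
```

After this step the root is no longer on the data path: each rank talks to its neighbors directly.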
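The remainder of bootstrapInit uses the Bootstrap module's interfaces to exchange other data between RANKs, finishes initializing the bootstrapState struct, and stores it in the bootstrap member of ncclComm.
Construction process diagram
The following diagram shows the steps from ncclUniqueId to the completed Bootstrap ring topology.
The bootstrapState struct
bootstrapState is the main object in the Bootstrap process; the whole process revolves around constructing it correctly. Once built, it supports control-plane data exchange between RANKs. The struct's members are annotated in the comments below.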
struct bootstrapState {
  struct ncclSocket listenSock;                // Socket to accept incoming connections from ROOT RANK
  struct ncclSocket ringRecvSocket;            // Socket to accept incoming connections from PREV RANK in the AllGather ring
  struct ncclSocket ringSendSocket;            // Socket to connect to NEXT RANK in the AllGather ring
  union ncclSocketAddress* peerCommAddresses;  // Listen socket addresses of all ranks
  union ncclSocketAddress* peerProxyAddresses; // Proxy socket addresses of all ranks
  uint64_t* peerProxyAddressesUDS;             // Unix domain socket addresses for local IPC
  struct unexConn* unexpectedConnections;      // Cached accepted connections
  int cudaDev;                                 // Device ID
  int rank;                                    // Rank number
  int nranks;                                  // Number of ranks
  uint64_t magic;                              // Magic number generated by RANK 0 in ncclUniqueId
  volatile uint32_t *abortFlag;                // Flag used to signal an abort
};
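The Bootstrap module
The Bootstrap module's interfaces fall into two groups. The first, including bootstrapGetUniqueId, bootstrapNetInit, bootstrapCreateRoot, bootstrapInit, and bootstrapSplit, builds the Bootstrap network. The second provides communication and synchronization functions once the network is built: bootstrapAllGather, bootstrapSend, bootstrapRecv, bootstrapBarrier, bootstrapBroadcast, bootstrapIntraNodeBarrier, bootstrapIntraNodeAllGather, and bootstrapIntraNodeBroadcast.
The header below comes from the NCCL source. The author has reordered the declarations to suit the discussion and added annotations as comments.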
/*************************************************************************
* Copyright (c) 2015-2022, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#ifndef NCCL_BOOTSTRAP_H_
#define NCCL_BOOTSTRAP_H_
#include "nccl.h"
#include "comm.h"
// ncclBootstrapHandle is embedded in the ncclUniqueId generated by RANK 0
struct ncclBootstrapHandle {
  uint64_t magic;                // generated by RANK 0
  union ncclSocketAddress addr;  // RANK 0 socket address
};
static_assert(sizeof(struct ncclBootstrapHandle) <= sizeof(ncclUniqueId), "Bootstrap handle is too large to fit inside NCCL unique ID");
// bootstrapGetUniqueId: invoked by RANK 0 to create a ncclSocketAddress for the other RANKs to connect to
ncclResult_t bootstrapGetUniqueId(struct ncclBootstrapHandle* handle);
// Initialize bootstrapNetIfAddr by searching the network devices
ncclResult_t bootstrapNetInit();
// bootstrapCreateRoot: invoked by RANK 0 to create the bootstrapRoot thread,
// which interacts with bootstrapInit to finish building the bootstrap network
ncclResult_t bootstrapCreateRoot(struct ncclBootstrapHandle* handle, bool idFromEnv);
// Invoked by all RANKs including RANK 0;
// interacts with the bootstrapRoot thread to finish building the bootstrap network
ncclResult_t bootstrapInit(struct ncclBootstrapHandle* handle, struct ncclComm* comm);
// For the ncclComm split scenario: create a bootstrap network for the new group by reusing the parent's bootstrap network
ncclResult_t bootstrapSplit(struct ncclBootstrapHandle* handle, struct ncclComm* comm, struct ncclComm* parent, int color, int key, int* parentRanks);
// Communication Functions after bootstrap network setup
ncclResult_t bootstrapAllGather(void* commState, void* allData, int size);
ncclResult_t bootstrapSend(void* commState, int peer, int tag, void* data, int size);
ncclResult_t bootstrapRecv(void* commState, int peer, int tag, void* data, int size);
ncclResult_t bootstrapBarrier(void* commState, int rank, int nranks, int tag);
ncclResult_t bootstrapBroadcast(void* commState, int rank, int nranks, int root, void* bcastData, int size);
ncclResult_t bootstrapIntraNodeBarrier(void* commState, int *ranks, int rank, int nranks, int tag);
ncclResult_t bootstrapIntraNodeAllGather(void* commState, int *ranks, int rank, int nranks, void* allData, int size);
ncclResult_t bootstrapIntraNodeBroadcast(void* commState, int *ranks, int rank, int nranks, int root, void* bcastData, int size);
ncclResult_t bootstrapClose(void* commState);
ncclResult_t bootstrapAbort(void* commState);
#endif
bootstrapAllGather and bootstrapIntraNodeAllGather
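To use bootstrapAllGather, pass in the initialized bootstrapState (normally stored in ncclComm->bootstrap) together with a valid communication buffer.
Internally it calls bootstrapRingAllGather, which implements the operation over the ring.
bootstrapIntraNodeAllGather is similar, but takes a ranks array that is used to form a smaller ring for an AllGather over a subset of RANKs. Judging by its name, it is mainly meant for AllGather among the RANKs of a single machine.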
ncclResult_t bootstrapAllGather(void* commState, void* allData, int size) {
  struct bootstrapState* state = (struct bootstrapState*)commState;
  int rank = state->rank;
  int nranks = state->nranks;
  TRACE(NCCL_INIT, "rank %d nranks %d size %d", rank, nranks, size);
  NCCLCHECK(bootstrapRingAllGather(&state->ringRecvSocket, &state->ringSendSocket, rank, nranks, (char*)allData, size));
  TRACE(NCCL_INIT, "rank %d nranks %d size %d - DONE", rank, nranks, size);
  return ncclSuccess;
}
ncclResult_t bootstrapIntraNodeAllGather(void* commState, int *ranks, int rank, int nranks, void* allData, int size) {
  if (nranks == 1) return ncclSuccess;
  TRACE(NCCL_INIT, "rank %d nranks %d size %d - ENTER", rank, nranks, size);
  int prevRank = ranks[(rank - 1 + nranks)%nranks];
  int nextRank = ranks[(rank + 1) % nranks];
  struct ncclSocket prevSocket, nextSocket;
  NCCLCHECK(bootstrapConnect(commState, nextRank, 0, &nextSocket));
  NCCLCHECK(bootstrapAccept(commState, prevRank, 0, &prevSocket));
  NCCLCHECK(bootstrapRingAllGather(&prevSocket, &nextSocket, rank, nranks, (char*)allData, size));
  NCCLCHECK(ncclSocketClose(&nextSocket));
  NCCLCHECK(ncclSocketClose(&prevSocket));
  TRACE(NCCL_INIT, "rank %d nranks %d size %d - DONE", rank, nranks, size);
  return ncclSuccess;
}
ncclResult_t bootstrapRingAllGather(struct ncclSocket* prevSocket, struct ncclSocket* nextSocket, int rank, int nranks, char* data, int size) {
  /* Simple ring based AllGather
   * At each step i receive data from (rank-i-1) from prev
   * and send previous step's data from (rank-i) to next
   */
  for (int i=0; i<nranks-1; i++) {
    size_t rslice = (rank - i - 1 + nranks) % nranks;
    size_t sslice = (rank - i + nranks) % nranks;
    // Send slice to the right, recv slice from the left
    NCCLCHECK(bootstrapNetSendRecv(nextSocket, data+sslice*size, size, prevSocket, data+rslice*size, size));
  }
  return ncclSuccess;
}
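To check that the slice indices in bootstrapRingAllGather really deliver every slice to every rank, the loop can be replayed in memory for all ranks at once. The sketch below is a single-process simulation with no sockets; simulateRingAllGather and makeInputs are hypothetical helpers written for this illustration.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Each rank owns a buffer of nranks*size bytes; on entry only its own slice is filled.
inline std::vector<std::vector<char>> makeInputs(int nranks, int size) {
  assert(nranks > 0 && size > 0);
  std::vector<std::vector<char>> buffers(nranks, std::vector<char>(nranks * size, 0));
  for (int r = 0; r < nranks; ++r)
    std::memset(buffers[r].data() + r * size, 'A' + r, size);
  return buffers;
}

// Replay the ring loop for every rank, using the same slice arithmetic.
inline void simulateRingAllGather(std::vector<std::vector<char>>& buffers, int size) {
  const int nranks = static_cast<int>(buffers.size());
  assert(size > 0);
  for (int i = 0; i < nranks - 1; ++i) {
    // Snapshot what every rank sends in this step before any buffer is overwritten.
    std::vector<std::vector<char>> inFlight(nranks, std::vector<char>(size));
    for (int rank = 0; rank < nranks; ++rank) {
      const int sslice = (rank - i + nranks) % nranks;
      std::memcpy(inFlight[rank].data(), buffers[rank].data() + sslice * size, size);
    }
    // Every rank receives from its previous rank into slice (rank - i - 1).
    for (int rank = 0; rank < nranks; ++rank) {
      const int prev = (rank - 1 + nranks) % nranks;
      const int rslice = (rank - i - 1 + nranks) % nranks;
      std::memcpy(buffers[rank].data() + rslice * size, inFlight[prev].data(), size);
    }
  }
}
```

The slice that a rank receives in step i is exactly the slice its previous rank sends in that step, so after nranks-1 steps each rank has collected all nranks slices.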
bootstrapSend and bootstrapRecv
These functions provide point-to-point Send and Recv communication.
ncclResult_t bootstrapSend(void* commState, int peer, int tag, void* data, int size) {
  ncclResult_t ret = ncclSuccess;
  struct ncclSocket sock;
  TRACE(NCCL_BOOTSTRAP, "Sending to peer=%d tag=%d size=%d", peer, tag, size);
  NCCLCHECK(bootstrapConnect(commState, peer, tag, &sock));
  NCCLCHECKGOTO(bootstrapNetSend(&sock, data, size), ret, exit);
  TRACE(NCCL_BOOTSTRAP, "Sent to peer=%d tag=%d size=%d", peer, tag, size);
exit:
  NCCLCHECK(ncclSocketClose(&sock));
  return ret;
}
ncclResult_t bootstrapRecv(void* commState, int peer, int tag, void* data, int size) {
  ncclResult_t ret;
  struct ncclSocket sock;
  NCCLCHECK(bootstrapAccept(commState, peer, tag, &sock));
  TRACE(NCCL_BOOTSTRAP, "Receiving tag=%d peer=%d size=%d", tag, peer, size);
  NCCLCHECKGOTO(bootstrapNetRecv(&sock, ((char*)data), size), ret, exit);
exit:
  NCCLCHECK(ncclSocketClose(&sock));
  return ret;
}
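Since each send opens its own connection tagged with (peer, tag), connections can arrive at a receiver in a different order than its receives are posted. The bootstrapState struct lists an unexpectedConnections member described as cached accepted connections; one plausible reading, sketched below as an assumption rather than as NCCL's actual bootstrapAccept, is that mismatching connections are parked until a matching receive asks for them. DemoConn and DemoAcceptor are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <list>

struct DemoConn {
  int peer;
  int tag;
  int payload;
};

class DemoAcceptor {
 public:
  // Connections arrive in network order, not in the order receives are posted.
  void incoming(const DemoConn& c) { wire_.push_back(c); }

  // Find the connection matching (peer, tag): first search the cache of earlier
  // mismatches, then drain newly arrived connections, parking each mismatch.
  bool accept(int peer, int tag, DemoConn* out) {
    assert(out != nullptr);
    for (auto it = unexpected_.begin(); it != unexpected_.end(); ++it) {
      if (it->peer == peer && it->tag == tag) {
        *out = *it;
        unexpected_.erase(it);
        return true;
      }
    }
    while (!wire_.empty()) {
      DemoConn c = wire_.front();
      wire_.pop_front();
      if (c.peer == peer && c.tag == tag) {
        *out = c;
        return true;
      }
      unexpected_.push_back(c);  // keep it for a later accept()
    }
    return false;  // a real socket accept would block here instead
  }

  std::size_t cached() const { return unexpected_.size(); }

 private:
  std::deque<DemoConn> wire_;       // connections not yet looked at
  std::list<DemoConn> unexpected_;  // accepted, but not yet asked for
};
```

With such a cache, the order of bootstrapRecv calls on one rank does not have to match the order in which its peers call bootstrapSend.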
bootstrapBarrier and bootstrapIntraNodeBarrier
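RANK synchronization is implemented with the dissemination barrier algorithm.
In brief:
- Perform one round of ring communication with a stride of 1
- Repeat the ring communication, doubling the stride each round, for $\lceil \log_2(nranks) \rceil$ rounds in total
- The RANKs are then synchronized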
ncclResult_t bootstrapBarrier(void* commState, int rank, int nranks, int tag) {
  return bootstrapIntraNodeBarrier(commState, NULL, rank, nranks, tag);
}
ncclResult_t bootstrapIntraNodeBarrier(void* commState, int *ranks, int rank, int nranks, int tag) {
  if (nranks == 1) return ncclSuccess;
  TRACE(NCCL_INIT, "rank %d nranks %d tag %x - ENTER", rank, nranks, tag);
  /* Simple [intra] process barrier
   *
   * Based on the dissemination algorithm by Debra Hensgen, Raphael Finkel, and Udi Manber,
   * "Two Algorithms for Barrier Synchronization," International Journal of Parallel Programming, 17(1):1-17, 1988
   */
  int data[1];
  for (int mask=1; mask<nranks; mask<<=1) {
    int src = (rank - mask + nranks) % nranks;
    int dst = (rank + mask) % nranks;
    NCCLCHECK(bootstrapSend(commState, ranks ? ranks[dst] : dst, tag, data, sizeof(data)));
    NCCLCHECK(bootstrapRecv(commState, ranks ? ranks[src] : src, tag, data, sizeof(data)));
  }
  TRACE(NCCL_INIT, "rank %d nranks %d tag %x - DONE", rank, nranks, tag);
  return ncclSuccess;
}
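Why a logarithmic number of rounds is enough can be seen by tracking, for each rank, the set of ranks whose arrival it has heard about directly or transitively. The sketch below is a single-process model using the same src/mask arithmetic as the loop above; simulateDisseminationBarrier is a hypothetical helper, and the bitmask limits it to 64 ranks.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// known[r] is a bitmask of the ranks whose arrival rank r has learned about.
// Returns the number of rounds executed.
inline int simulateDisseminationBarrier(int nranks, std::vector<std::uint64_t>& known) {
  assert(nranks >= 1 && nranks <= 64);
  known.assign(nranks, 0);
  for (int r = 0; r < nranks; ++r) known[r] = std::uint64_t{1} << r;  // each rank knows itself
  int rounds = 0;
  for (int mask = 1; mask < nranks; mask <<= 1) {
    // A message sent in this round reflects what the sender knew before the round.
    const std::vector<std::uint64_t> before = known;
    for (int rank = 0; rank < nranks; ++rank) {
      const int src = (rank - mask + nranks) % nranks;
      known[rank] |= before[src];  // hearing from src implies everything src had heard
    }
    ++rounds;
  }
  return rounds;
}
```

Each round doubles the span of ranks a rank has heard from, so no rank can leave the barrier before every other rank has entered it.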
bootstrapBroadcast and bootstrapIntraNodeBroadcast
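Broadcast is implemented in a star pattern: the root sends the data to each receiver directly. bootstrapIntraNodeBroadcast takes a ranks array for a single machine plus a root ID to broadcast within a small group. bootstrapBroadcast broadcasts globally by default, that is, to every RANK other than root.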
ncclResult_t bootstrapBroadcast(void* commState, int rank, int nranks, int root, void* bcastData, int size) {
  return bootstrapIntraNodeBroadcast(commState, NULL, rank, nranks, root, bcastData, size);
}
// [IntraNode] in-place Broadcast
ncclResult_t bootstrapIntraNodeBroadcast(void* commState, int *ranks, int rank, int nranks, int root, void* bcastData, int size) {
  if (nranks == 1) return ncclSuccess;
  TRACE(NCCL_INIT, "rank %d nranks %d root %d size %d - ENTER", rank, nranks, root, size);
  if (rank == root) {
    for (int i=0; i<nranks; i++) {
      if (i != root) NCCLCHECK(bootstrapSend(commState, ranks ? ranks[i] : i, /*tag=*/ranks ? ranks[i] : i, bcastData, size));
    }
  }
  else {
    NCCLCHECK(bootstrapRecv(commState, ranks ? ranks[root] : root, /*tag=*/ranks ? ranks[rank] : rank, bcastData, size));
  }
  TRACE(NCCL_INIT, "rank %d nranks %d root %d size %d - DONE", rank, nranks, root, size);
  return ncclSuccess;
}
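The message pattern of this broadcast can be listed without any networking. The sketch below mirrors the root-side loop and records which point-to-point sends the root would issue, with the destination's global rank used as the tag. DemoMessage and broadcastPlan are hypothetical names; an empty ranks vector plays the role of the NULL ranks pointer.

```cpp
#include <cassert>
#include <vector>

struct DemoMessage {
  int from;
  int to;
  int tag;
};

// ranks: empty for a global broadcast, otherwise a map from local index to global rank.
// root:  index of the broadcasting rank within [0, nranks).
inline std::vector<DemoMessage> broadcastPlan(const std::vector<int>& ranks,
                                              int nranks, int root) {
  assert(root >= 0 && root < nranks);
  assert(ranks.empty() || static_cast<int>(ranks.size()) == nranks);
  std::vector<DemoMessage> plan;
  if (nranks == 1) return plan;  // nothing to send to
  auto global = [&](int i) { return ranks.empty() ? i : ranks[i]; };
  for (int i = 0; i < nranks; ++i) {
    // One direct send per receiver; the tag is the receiver's global rank.
    if (i != root) plan.push_back(DemoMessage{global(root), global(i), global(i)});
  }
  return plan;
}
```

The root issues nranks-1 sends one after another, which is acceptable for small control-plane payloads but scales linearly with the number of RANKs.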