NCCL Startup Mechanism: Bootstrap

慢河

Last modified: 2024-07-03 14:14:46

First published: 2024-06-27 21:13:20

Article link: https://blog.csdn.net/weixin_42371021/article/details/139804668

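Before a communication library can be initialized, the ranks need a basic communication path between them. NCCL builds a Bootstrap network for this purpose, which gives the ranks an efficient control-plane channel for communication and synchronization.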

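NCCL Programming Flow

Figure: NCCL Usage
The Bootstrap network is built during NCCL initialization, inside the calls to ncclGetUniqueId and ncclCommInitRank. It is then used for control-plane communication and synchronization between ranks.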

Building the Bootstrap Network

The Bootstrap network is built as follows.
Figure: Bootstrap Flow

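Building the Central Address

The central address, ncclUniqueId, is created by calling ncclGetUniqueId. By convention, rank 0 makes this call. The struct below holds the address information carried inside ncclUniqueId. ncclSocketAddress may be an IPv4 or IPv6 socket address and can be used to create a socket. Every rank connects to this address, and once connected it can send messages to rank 0.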

struct ncclBootstrapHandle {
  uint64_t magic; // generated by RANK 0
  union ncclSocketAddress addr; // Contains RANK 0 Socket Address to connect
};
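To make the relationship between the handle and the ID concrete, here is a minimal sketch of how such a handle can travel inside an opaque 128-byte ID. The Mock* types and the packHandle, unpackHandle and makeIpv4Handle helpers are hypothetical stand-ins written for this article, not NCCL code, and the layout of the address union is an assumption.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for ncclUniqueId: an opaque, fixed-size byte buffer.
constexpr std::size_t kUniqueIdBytes = 128;
struct MockUniqueId { char internal[kUniqueIdBytes]; };

// Hypothetical stand-in for union ncclSocketAddress (assumed layout):
// large enough to hold either an IPv4 or an IPv6 socket address.
union MockSocketAddress {
  struct sockaddr sa;
  struct sockaddr_in sin;
  struct sockaddr_in6 sin6;
};

struct MockBootstrapHandle {
  uint64_t magic;          // random value chosen by rank 0
  MockSocketAddress addr;  // address rank 0 listens on
};
static_assert(sizeof(MockBootstrapHandle) <= sizeof(MockUniqueId),
              "handle must fit inside the unique ID");

// Copy the handle into the leading bytes of a zeroed ID.
MockUniqueId packHandle(const MockBootstrapHandle& h) {
  MockUniqueId id;
  std::memset(&id, 0, sizeof(id));
  std::memcpy(&id, &h, sizeof(h));
  return id;
}

// Recover the handle on a rank that received the ID.
MockBootstrapHandle unpackHandle(const MockUniqueId& id) {
  MockBootstrapHandle h;
  std::memcpy(&h, &id, sizeof(h));
  return h;
}

// Build an IPv4 handle; ip and port are given in host byte order.
MockBootstrapHandle makeIpv4Handle(uint64_t magic, uint32_t ip, uint16_t port) {
  MockBootstrapHandle h;
  std::memset(&h, 0, sizeof(h));
  h.magic = magic;
  h.addr.sin.sin_family = AF_INET;
  h.addr.sin.sin_addr.s_addr = htonl(ip);
  h.addr.sin.sin_port = htons(port);
  return h;
}
```

Because the ID is only a byte buffer, any transport that can move 128 bytes between processes can distribute it.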

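In the call graph of ncclGetUniqueId, two functions are involved in producing the ncclUniqueId: bootstrapNetInit and ncclSocketGetAddrFromString.
bootstrapNetInit searches the machine's network devices according to a set of rules, selects one for the Bootstrap network, and stores that device's IP address in bootstrapNetIfAddr, which can then be used to build the ncclUniqueId.
When bootstrapGetUniqueId runs, it builds the ncclUniqueId from bootstrapNetIfAddr if the user has not set NCCL_COMM_ID. Otherwise it passes the user-specified NCCL_COMM_ID to ncclSocketGetAddrFromString and builds the ncclUniqueId from that address.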
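The selection rule above can be sketched as a small function. This is an illustration only: HostPort, parseHostPort and chooseRootAddress are hypothetical names, and the parser assumes the environment value has the form host:port or [ipv6]:port. NCCL's own ncclSocketGetAddrFromString also resolves the address, which this sketch does not attempt.

```cpp
#include <cstddef>
#include <string>

struct HostPort {
  std::string host;
  int port;
};

// Parse "host:port" or "[ipv6]:port". Returns false on malformed input.
bool parseHostPort(const std::string& s, HostPort* out) {
  std::string host, portStr;
  if (!s.empty() && s[0] == '[') {
    const std::size_t close = s.find(']');
    if (close == std::string::npos || close + 1 >= s.size() || s[close + 1] != ':') return false;
    host = s.substr(1, close - 1);
    portStr = s.substr(close + 2);
  } else {
    const std::size_t colon = s.rfind(':');
    if (colon == std::string::npos) return false;
    host = s.substr(0, colon);
    portStr = s.substr(colon + 1);
  }
  if (host.empty() || portStr.empty() || portStr.size() > 5) return false;
  for (char c : portStr) {
    if (c < '0' || c > '9') return false;
  }
  const int port = std::stoi(portStr);
  if (port > 65535) return false;
  out->host = host;
  out->port = port;
  return true;
}

// commIdEnv is the value of NCCL_COMM_ID, or nullptr when it is unset.
// ifAddr plays the role of bootstrapNetIfAddr.
bool chooseRootAddress(const char* commIdEnv, const HostPort& ifAddr, HostPort* out) {
  if (commIdEnv == nullptr) {  // no override: use the selected interface
    *out = ifAddr;
    return true;
  }
  return parseHostPort(commIdEnv, out);  // override: use the user's address
}
```

A caller would pass std::getenv("NCCL_COMM_ID") as the first argument.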

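Broadcasting the Central Address

Broadcasting the central address means broadcasting the ncclUniqueId. NCCL does not prescribe how this is done. The example above uses MPI_Bcast to send it to all ranks. Many other approaches work, such as configuring each rank's network information in advance, or, on a single machine where the ranks are threads, sharing the ID through memory.
Figure: ncclGetUniqueId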

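Building the Star Network

The star network is built by bootstrapCreateRoot together with bootstrapInit/bootstrapSplit, which are triggered from ncclGetUniqueId and ncclCommInitRank. Rank 0 first calls bootstrapCreateRoot, which starts the bootstrapRoot thread in the background. This thread acts as the center of the star.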

struct extInfo {
  int rank;
  int nranks;
  union ncclSocketAddress extAddressListenRoot;
  union ncclSocketAddress extAddressListen; 
};

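Next, all ranks, including rank 0, start a new thread that calls bootstrapInit and communicates with the bootstrapRoot thread over the network. bootstrapInit creates two ncclSockets initialized from bootstrapNetIfAddr and writes their addresses into an extInfo struct to send to rank 0. Because every rank already received the central address when the ncclUniqueId was broadcast, it can now send extInfo to the center, that is, the bootstrapRoot thread. At this point every rank can send data to rank 0, and once rank 0 has received a rank's extInfo it can send data back to that rank through extAddressListenRoot. The star network is now complete.
Figure: ncclCommInitRank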
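The bookkeeping the center has to do while ranks check in can be sketched as below. CheckIn and RootTable are hypothetical stand-ins for extInfo and the root thread's address tables, addresses are plain strings for brevity, and the consistency checks are an assumption about what a careful root would verify.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for extInfo, with addresses as strings.
struct CheckIn {
  int rank;
  int nranks;
  std::string listenRoot;  // where the root can reach this rank
  std::string listen;      // where this rank's ring predecessor will connect
};

class RootTable {
 public:
  // Record one check-in. Returns false if it is inconsistent with earlier ones.
  bool add(const CheckIn& info) {
    if (info.nranks <= 0 || info.rank < 0 || info.rank >= info.nranks) return false;
    if (nranks_ == 0) {  // the first check-in fixes the communicator size
      nranks_ = info.nranks;
      entries_.resize(nranks_);
      seen_.assign(nranks_, false);
    } else if (info.nranks != nranks_) {
      return false;  // ranks disagree about nranks
    }
    if (seen_[info.rank]) return false;  // this rank already checked in
    entries_[info.rank] = info;
    seen_[info.rank] = true;
    ++count_;
    return true;
  }

  // True once every rank has checked in; the star is then complete.
  bool complete() const { return nranks_ > 0 && count_ == nranks_; }

  const CheckIn& at(int rank) const {
    assert(rank >= 0 && rank < nranks_);
    return entries_[rank];
  }

 private:
  int nranks_ = 0;
  int count_ = 0;
  std::vector<CheckIn> entries_;
  std::vector<bool> seen_;
};
```

Check-ins may arrive in any order, which is why the table is indexed by the rank carried in the message and not by arrival order.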

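Building the Ring Network

A star network is functionally sufficient for exchanging data between ranks, but all data has to be collected at the center and then forwarded to the other ranks. Because the Bootstrap network is frequently used for point-to-point Send/Recv and for collectives such as AllGather, the center of a star easily becomes a bottleneck.
The ranks therefore need a more efficient topology for point-to-point and collective operations on the Bootstrap network, and NCCL uses a ring.
Once rank 0 and the other ranks have built the star, rank 0 sends each rank the extAddressListen address of its next rank. Each rank then connects to its next rank, which establishes the ring in one direction. Each rank also accepts the connection from its previous rank, which completes the other direction. The Bootstrap ring is now built, and the Bootstrap module's interfaces can be used to send and receive data.
The rest of bootstrapInit uses those interfaces to exchange further data between ranks, finishes initializing the bootstrapState struct, and stores it in the bootstrap member of ncclComm.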
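The address hand-out that turns the star into a ring reduces to simple modular arithmetic. The sketch below shows it with addresses as strings; nextPeerAddresses and prevRank are hypothetical helper names, not NCCL functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// listenAddrs[r] is the ring listen address reported by rank r.
// Returns, for each rank r, the address the root sends back to it:
// the listen address of its ring successor (r + 1) % nranks.
std::vector<std::string> nextPeerAddresses(const std::vector<std::string>& listenAddrs) {
  const std::size_t n = listenAddrs.size();
  std::vector<std::string> next(n);
  for (std::size_t r = 0; r < n; ++r) {
    next[r] = listenAddrs[(r + 1) % n];
  }
  return next;
}

// The rank whose incoming connection rank `rank` accepts to close the ring.
int prevRank(int rank, int nranks) {
  assert(nranks > 0 && rank >= 0 && rank < nranks);
  return (rank - 1 + nranks) % nranks;
}
```

With three ranks, rank 2 receives the address of rank 0, so the last rank wraps around to the first and the ring closes.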

Construction Process Illustrated

The figure below shows the steps from the ncclUniqueId to the completed Bootstrap ring topology.
Figure: Bootstrap Flow

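The bootstrapState Struct

bootstrapState is the main object of the Bootstrap process, and the whole process revolves around constructing it correctly. Once it is built, it provides control-plane communication between ranks. The members are annotated in the comments below.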

struct bootstrapState {
  struct ncclSocket listenSock;                 // Socket to accept incoming connections from ROOT RANK
  struct ncclSocket ringRecvSocket;             // Socket to accept incoming connections from PREV RANK in the allGather ring
  struct ncclSocket ringSendSocket;             // Socket to connect to NEXT RANK in the AllGather ring
  union ncclSocketAddress* peerCommAddresses;   // All Ranks Listen Socket Address
  union ncclSocketAddress* peerProxyAddresses;  // All Ranks Proxy Socket Address
  uint64_t* peerProxyAddressesUDS;              // Unix Domain Socket Address for Local IPC
  struct unexConn* unexpectedConnections;       // Cached accepted connections
  int cudaDev;                                  // Device ID
  int rank;                                     // Rank number
  int nranks;                                   // Rank amount
  uint64_t magic;                               // Magic number generated by RANK 0 in ncclUniqueID
  volatile uint32_t *abortFlag;                 // Pointer to the flag polled to abort pending operations
};

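The Bootstrap Module

The Bootstrap module's interfaces fall into two groups. The first group, bootstrapGetUniqueId, bootstrapNetInit, bootstrapCreateRoot, bootstrapInit and bootstrapSplit, builds the Bootstrap network. The second group provides communication and synchronization functions once the network is built: bootstrapAllGather, bootstrapSend, bootstrapRecv, bootstrapBarrier, bootstrapBroadcast, bootstrapIntraNodeBarrier, bootstrapIntraNodeAllGather and bootstrapIntraNodeBroadcast.
The header below is taken from the NCCL source. The author has reordered the declarations to fit the discussion and added annotations as comments.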

/*************************************************************************
 * Copyright (c) 2015-2022, NVIDIA CORPORATION. All rights reserved.
 *
 * See LICENSE.txt for license information
 ************************************************************************/

#ifndef NCCL_BOOTSTRAP_H_
#define NCCL_BOOTSTRAP_H_

#include "nccl.h"
#include "comm.h"

// ncclBootstrapHandle included in ncclUniqueId generated by RANK 0
struct ncclBootstrapHandle {
  uint64_t magic;               // generated by RANK 0
  union ncclSocketAddress addr; // RANK 0 Socket Address
};
static_assert(sizeof(struct ncclBootstrapHandle) <= sizeof(ncclUniqueId), "Bootstrap handle is too large to fit inside NCCL unique ID");

// bootstrapGetUniqueId: invoked by RANK 0 to create a ncclSocketAddress for other RANKs to connect to
ncclResult_t bootstrapGetUniqueId(struct ncclBootstrapHandle* handle);
// Initialize bootstrapNetIfAddr by searching the network devices
ncclResult_t bootstrapNetInit();


// bootstrapCreateRoot: invoked by RANK 0 to create the bootstrapRoot thread,
// which interacts with the bootstrapInit threads to finish bootstrap network construction
ncclResult_t bootstrapCreateRoot(struct ncclBootstrapHandle* handle, bool idFromEnv);
// Invoked by all RANKs including RANK 0;
// interacts with the bootstrapRoot thread to finish bootstrap network construction
ncclResult_t bootstrapInit(struct ncclBootstrapHandle* handle, struct ncclComm* comm);
// For the ncclComm split scenario: create the bootstrap network for the new group by reusing the parent's bootstrap network
ncclResult_t bootstrapSplit(struct ncclBootstrapHandle* handle, struct ncclComm* comm, struct ncclComm* parent, int color, int key, int* parentRanks);


// Communication Functions after bootstrap network setup
ncclResult_t bootstrapAllGather(void* commState, void* allData, int size);
ncclResult_t bootstrapSend(void* commState, int peer, int tag, void* data, int size);
ncclResult_t bootstrapRecv(void* commState, int peer, int tag, void* data, int size);
ncclResult_t bootstrapBarrier(void* commState, int rank, int nranks, int tag);
ncclResult_t bootstrapBroadcast(void* commState, int rank, int nranks, int root, void* bcastData, int size);
ncclResult_t bootstrapIntraNodeBarrier(void* commState, int *ranks, int rank, int nranks, int tag);
ncclResult_t bootstrapIntraNodeAllGather(void* commState, int *ranks, int rank, int nranks, void* allData, int size);
ncclResult_t bootstrapIntraNodeBroadcast(void* commState, int *ranks, int rank, int nranks, int root, void* bcastData, int size);


ncclResult_t bootstrapClose(void* commState);
ncclResult_t bootstrapAbort(void* commState);
#endif

bootstrapAllGather and bootstrapIntraNodeAllGather

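To call bootstrapAllGather, pass the initialized bootstrapState (normally stored in ncclComm->bootstrap) together with a correctly sized communication buffer.
Internally it calls bootstrapRingAllGather, which implements the operation over the ring.
bootstrapIntraNodeAllGather is similar, but takes a ranks array to form a smaller ring and run the AllGather over a subset of ranks. Judging by its name, it is mainly meant for AllGather among the ranks on a single machine.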

ncclResult_t bootstrapAllGather(void* commState, void* allData, int size) {
  struct bootstrapState* state = (struct bootstrapState*)commState;
  int rank = state->rank;
  int nranks = state->nranks;

  TRACE(NCCL_INIT, "rank %d nranks %d size %d", rank, nranks, size);

  NCCLCHECK(bootstrapRingAllGather(&state->ringRecvSocket, &state->ringSendSocket, rank, nranks, (char*)allData, size));

  TRACE(NCCL_INIT, "rank %d nranks %d size %d - DONE", rank, nranks, size);
  return ncclSuccess;
}

ncclResult_t bootstrapIntraNodeAllGather(void* commState, int *ranks, int rank, int nranks, void* allData, int size) {
  if (nranks == 1) return ncclSuccess;
  TRACE(NCCL_INIT, "rank %d nranks %d size %d - ENTER", rank, nranks, size);

  int prevRank = ranks[(rank - 1 + nranks)%nranks];
  int nextRank = ranks[(rank + 1) % nranks];
  struct ncclSocket prevSocket, nextSocket;
  NCCLCHECK(bootstrapConnect(commState, nextRank, 0, &nextSocket));
  NCCLCHECK(bootstrapAccept(commState, prevRank, 0, &prevSocket));

  NCCLCHECK(bootstrapRingAllGather(&prevSocket, &nextSocket, rank, nranks, (char*)allData, size));

  NCCLCHECK(ncclSocketClose(&nextSocket));
  NCCLCHECK(ncclSocketClose(&prevSocket));

  TRACE(NCCL_INIT, "rank %d nranks %d size %d - DONE", rank, nranks, size);
  return ncclSuccess;
}

ncclResult_t bootstrapRingAllGather(struct ncclSocket* prevSocket, struct ncclSocket* nextSocket, int rank, int nranks, char* data, int size) {
  /* Simple ring based AllGather
   * At each step i receive data from (rank-i-1) from prev
   * and send previous step's data from (rank-i) to next
   */
  for (int i=0; i<nranks-1; i++) {
    size_t rslice = (rank - i - 1 + nranks) % nranks;
    size_t sslice = (rank - i + nranks) % nranks;

    // Send slice to the right, recv slice from the left
    NCCLCHECK(bootstrapNetSendRecv(nextSocket, data+sslice*size, size, prevSocket, data+rslice*size, size));
  }
  return ncclSuccess;
}
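To see why the slice indices in bootstrapRingAllGather are enough to deliver every slice to every rank, the schedule can be replayed in a single process. The sketch below is a simulation written for this article, not NCCL code: simulateRingAllGather is a hypothetical name, each rank contributes one int, and all ranks move in lockstep.

```cpp
#include <cassert>
#include <vector>

// Replay the ring AllGather schedule for nranks = contributions.size().
// Returns data[r][s] = slice s as seen by rank r after nranks - 1 steps.
std::vector<std::vector<int>> simulateRingAllGather(const std::vector<int>& contributions) {
  const int n = static_cast<int>(contributions.size());
  std::vector<std::vector<int>> data(n, std::vector<int>(n, 0));
  std::vector<std::vector<bool>> have(n, std::vector<bool>(n, false));
  for (int r = 0; r < n; ++r) {  // each rank starts with only its own slice
    data[r][r] = contributions[r];
    have[r][r] = true;
  }
  for (int i = 0; i < n - 1; ++i) {
    // Copy the state so that all ranks send and receive "simultaneously".
    std::vector<std::vector<int>> nextData = data;
    std::vector<std::vector<bool>> nextHave = have;
    for (int r = 0; r < n; ++r) {
      const int sslice = (r - i + n) % n;  // slice rank r forwards at step i
      const int next = (r + 1) % n;        // ring successor of rank r
      assert(have[r][sslice]);             // a rank only forwards what it holds
      // On the receiver this is rslice = (next - i - 1 + n) % n == sslice.
      nextData[next][sslice] = data[r][sslice];
      nextHave[next][sslice] = true;
    }
    data.swap(nextData);
    have.swap(nextHave);
  }
  return data;
}
```

Each step forwards the slice received in the previous step, so after nranks - 1 steps every slice has travelled all the way around the ring.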

bootstrapSend and bootstrapRecv

These functions provide point-to-point Send and Recv communication.

ncclResult_t bootstrapSend(void* commState, int peer, int tag, void* data, int size) {
  ncclResult_t ret = ncclSuccess;
  struct ncclSocket sock;

  TRACE(NCCL_BOOTSTRAP, "Sending to peer=%d tag=%d size=%d", peer, tag, size);
  NCCLCHECK(bootstrapConnect(commState, peer, tag, &sock));
  NCCLCHECKGOTO(bootstrapNetSend(&sock, data, size), ret, exit);

  TRACE(NCCL_BOOTSTRAP, "Sent to peer=%d tag=%d size=%d", peer, tag, size);

exit:
  NCCLCHECK(ncclSocketClose(&sock));
  return ret;
}
ncclResult_t bootstrapRecv(void* commState, int peer, int tag, void* data, int size) {
  ncclResult_t ret;
  struct ncclSocket sock;
  NCCLCHECK(bootstrapAccept(commState, peer, tag, &sock));
  TRACE(NCCL_BOOTSTRAP, "Receiving tag=%d peer=%d size=%d", tag, peer, size);
  NCCLCHECKGOTO(bootstrapNetRecv(&sock, ((char*)data), size), ret, exit);
exit:
  NCCLCHECK(ncclSocketClose(&sock));
  return ret;
}
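Since each bootstrapSend opens its own connection, a receiver may see connections arrive in a different order from the one in which it calls bootstrapRecv. The sketch below illustrates one way to handle this, assuming the receiver matches on the (peer, tag) pair and sets aside connections it is not yet waiting for, which is what the unexpectedConnections member suggests. Mailbox and Incoming are hypothetical types, and payloads stand in for sockets.

```cpp
#include <cstddef>
#include <deque>
#include <string>

// One incoming connection, identified by the sender's rank and a tag.
struct Incoming {
  int peer;
  int tag;
  std::string payload;
};

class Mailbox {
 public:
  // A connection arrives from the network, in arbitrary order.
  void arrive(const Incoming& m) { wire_.push_back(m); }

  // Receive the message from (peer, tag). Returns false if none has arrived.
  bool recv(int peer, int tag, std::string* out) {
    // 1. A matching connection may already sit in the cache.
    for (std::deque<Incoming>::iterator it = unexpected_.begin(); it != unexpected_.end(); ++it) {
      if (it->peer == peer && it->tag == tag) {
        *out = it->payload;
        unexpected_.erase(it);
        return true;
      }
    }
    // 2. Otherwise accept new connections until one matches,
    //    caching every non-matching connection for a later recv.
    while (!wire_.empty()) {
      Incoming m = wire_.front();
      wire_.pop_front();
      if (m.peer == peer && m.tag == tag) {
        *out = m.payload;
        return true;
      }
      unexpected_.push_back(m);
    }
    return false;
  }

  std::size_t cached() const { return unexpected_.size(); }

 private:
  std::deque<Incoming> wire_;        // connections not yet accepted
  std::deque<Incoming> unexpected_;  // accepted, but not yet asked for
};
```

The cache lets a rank ask for messages in its own order without dropping connections from other peers.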

bootstrapBarrier and bootstrapIntraNodeBarrier

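Ranks are synchronized with the dissemination barrier algorithm. In brief:

1. Perform one round of ring communication with a stride of 1: each rank sends to the rank one position ahead and receives from the rank one position behind.
2. Repeat the round, doubling the stride each time, for $\lceil \log_2(\text{nranks}) \rceil$ rounds in total.
3. After the last round, all ranks are synchronized.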

ncclResult_t bootstrapBarrier(void* commState, int rank, int nranks, int tag) {
  return bootstrapIntraNodeBarrier(commState, NULL, rank, nranks, tag);
}

ncclResult_t bootstrapIntraNodeBarrier(void* commState, int *ranks, int rank, int nranks, int tag) {
  if (nranks == 1) return ncclSuccess;
  TRACE(NCCL_INIT, "rank %d nranks %d tag %x - ENTER", rank, nranks, tag);

  /* Simple [intra] process barrier
   *
   * Based on the dissemination algorithm by Debra Hensgen, Raphael Finkel, and Udi Manbet,
   * "Two Algorithms for Barrier Synchronization," International Journal of Parallel Programming, 17(1):1-17, 1988"
   */
  int data[1];
  for (int mask=1; mask<nranks; mask<<=1) {
    int src = (rank - mask + nranks) % nranks;
    int dst = (rank + mask) % nranks;
    NCCLCHECK(bootstrapSend(commState, ranks ? ranks[dst] : dst, tag, data, sizeof(data)));
    NCCLCHECK(bootstrapRecv(commState, ranks ? ranks[src] : src, tag, data, sizeof(data)));
  }

  TRACE(NCCL_INIT, "rank %d nranks %d tag %x - DONE", rank, nranks, tag);
  return ncclSuccess;
}
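The round count and the correctness of the doubling stride can be checked with a small simulation. The sketch below is written for this article and is not NCCL code: simulateDisseminationBarrier is a hypothetical name, each rank tracks as a bitmask which ranks it knows have arrived, and the bitmask limits the simulation to at most 64 ranks.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct BarrierResult {
  int rounds;        // number of send/recv rounds executed
  bool allInformed;  // true if every rank has heard, directly or not, from all ranks
};

BarrierResult simulateDisseminationBarrier(int nranks) {
  assert(nranks >= 1 && nranks <= 64);  // limit of the 64-bit mask
  // Bit j of known[r] is set when rank r knows rank j reached the barrier.
  std::vector<uint64_t> known(nranks);
  for (int r = 0; r < nranks; ++r) known[r] = uint64_t{1} << r;

  int rounds = 0;
  for (int mask = 1; mask < nranks; mask <<= 1) {  // same loop as the barrier
    std::vector<uint64_t> next = known;
    for (int r = 0; r < nranks; ++r) {
      const int dst = (r + mask) % nranks;
      next[dst] |= known[r];  // a message carries everything the sender knows
    }
    known.swap(next);
    ++rounds;
  }

  const uint64_t full = (nranks == 64) ? ~uint64_t{0} : ((uint64_t{1} << nranks) - 1);
  bool all = true;
  for (int r = 0; r < nranks; ++r) {
    if (known[r] != full) all = false;
  }
  return BarrierResult{rounds, all};
}
```

After the round with stride 2^k, each rank knows about the 2^(k+1) ranks ending at itself, so the knowledge covers all ranks once the stride has reached or passed nranks / 2.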

bootstrapBroadcast and bootstrapIntraNodeBroadcast

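Broadcast is implemented with a star pattern: the root sends the data directly to every other rank. bootstrapIntraNodeBroadcast takes a ranks array for the ranks on one machine plus a root index, and broadcasts within that subset. bootstrapBroadcast broadcasts globally by default, that is, to every rank except the root.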

ncclResult_t bootstrapBroadcast(void* commState, int rank, int nranks, int root, void* bcastData, int size) {
  return bootstrapIntraNodeBroadcast(commState, NULL, rank, nranks, root, bcastData, size);
}

// [IntraNode] in-place Broadcast
ncclResult_t bootstrapIntraNodeBroadcast(void* commState, int *ranks, int rank, int nranks, int root, void* bcastData, int size) {
  if (nranks == 1) return ncclSuccess;
  TRACE(NCCL_INIT, "rank %d nranks %d root %d size %d - ENTER", rank, nranks, root, size);
  if (rank == root) {
    for (int i=0; i<nranks; i++) {
      if (i != root) NCCLCHECK(bootstrapSend(commState, ranks ? ranks[i] : i, /*tag=*/ranks ? ranks[i] : i, bcastData, size));
    }
  }
  else {
    NCCLCHECK(bootstrapRecv(commState, ranks ? ranks[root] : root, /*tag=*/ranks ? ranks[rank] : rank, bcastData, size));
  }
  TRACE(NCCL_INIT, "rank %d nranks %d root %d size %d - DONE", rank, nranks, root, size);
  return ncclSuccess;
}
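The sends issued by the root can be listed without any networking, which makes the use of the destination rank as the tag easy to see. The sketch below mirrors the loop above. SendOp and broadcastPlan are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <vector>

struct SendOp {
  int dst;  // global rank that receives the data
  int tag;  // tag used for matching on the receiver
};

// ranks maps local indices to global ranks, or is nullptr for a global
// broadcast. root is an index into that numbering, as in the code above.
std::vector<SendOp> broadcastPlan(const int* ranks, int nranks, int root) {
  assert(nranks >= 1 && root >= 0 && root < nranks);
  std::vector<SendOp> plan;
  if (nranks == 1) return plan;  // nothing to send
  for (int i = 0; i < nranks; ++i) {
    if (i == root) continue;
    const int global = ranks ? ranks[i] : i;
    plan.push_back(SendOp{global, global});  // tag equals the destination rank
  }
  return plan;
}
```

The root issues nranks - 1 sends one after another, so the cost grows linearly with the number of ranks. That is acceptable for the small control messages the Bootstrap network carries.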