NCCL Topology Management - The XML Module

NCCL uses XML as the underlying representation for both the physical topology graph and the logical topology graph. The XML module provides interfaces for building, managing, serializing, and deserializing these topologies. NCCL's physical topology detection is done mainly through this module: it builds the single-host physical topology via the Linux sysfs interface and the NVIDIA NVML interface, and exposes XML graph merging and node manipulation interfaces so that upper-layer modules can later adjust the graph structure and node information.

Data Structures

There are two XML-related structures: ncclXmlNode represents a single node, and ncclXml represents an entire system topology or logical topology. All XML module interfaces revolve around these two structures. See the code below for details; the author's notes are added as comments.

ncclXml

ncclXml represents an entire XML topology. It can stand for the system topology (ncclTopoSystem) or a logical topology (ncclTopoGraph); both kinds of graphs can be serialized into an ncclXml. Using NCCL's environment variables (NCCL_TOPO_DUMP_FILE and NCCL_GRAPH_DUMP_FILE), ncclTopoSystem and ncclTopoGraph can be dumped to an XML file.

// Represent a System or Graph
struct ncclXml {
  int maxIndex, maxNodes;
  struct ncclXmlNode nodes[1];
};
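
The trailing nodes[1] member works as a variable-length array: the whole topology lives in one contiguous block with room for maxNodes elements. The helper below is a minimal sketch of how such a block can be allocated; the function name is made up, it assumes the ncclXmlNode definition shown in the next subsection is in scope, and it is not the NCCL API.

#include <stddef.h>
#include <stdlib.h>

// Allocate one contiguous block holding the header plus maxNodes node slots.
static struct ncclXml* xmlAllocSketch(int maxNodes) {
  size_t bytes = offsetof(struct ncclXml, nodes) + (size_t)maxNodes * sizeof(struct ncclXmlNode);
  struct ncclXml* xml = (struct ncclXml*)calloc(1, bytes);
  if (xml != NULL) { xml->maxIndex = 0; xml->maxNodes = maxNodes; }
  return xml;
}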

ncclXmlNode

ncclXmlNode describes all the information about a single node, including its parent node, its subnodes, and the node's attributes.

// A few constraints to make the implementation easy
#define MAX_STR_LEN 255  	// Max length of a string
#define MAX_ATTR_COUNT 16  	// Max number of attributes per node
#define MAX_SUBS 128  		// Max number of subnodes per node

// for xml parsing
#define NODE_TYPE_NONE 0    // Unused
#define NODE_TYPE_OPEN 1    // <tag>
#define NODE_TYPE_CLOSE 2   // </tag>  
#define NODE_TYPE_SINGLE 3  // <tag/>    

// Represent a Node
struct ncclXmlNode {
  char name[MAX_STR_LEN+1];
  struct {
    char key[MAX_STR_LEN+1];
    char value[MAX_STR_LEN+1];
  } attrs[MAX_ATTR_COUNT+1]; 	// Need an extra one to consume extra params
  int nAttrs;					
  int type; 					// NODE_TYPE_NONE, NODE_TYPE_OPEN, NODE_TYPE_CLOSE, NODE_TYPE_SINGLE for xml parsing
  struct ncclXmlNode* parent;
  struct ncclXmlNode* subs[MAX_SUBS]; // max 128 subnodes
  int nSubs;
};
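
Two operations come up again and again in the code below: finding an attribute by key on a node, and finding a direct child by tag name. The helpers below are minimal sketches of these two linear scans (the names are made up); the real NCCL helpers such as xmlGetAttrIndex and xmlGetSub follow the same idea but report results through output parameters and ncclResult_t return codes.

#include <string.h>

// Return the index of the attribute with the given key, or -1 if absent.
static int xmlAttrIndexSketch(struct ncclXmlNode* node, const char* key) {
  for (int a = 0; a < node->nAttrs; a++) {
    if (strncmp(node->attrs[a].key, key, MAX_STR_LEN) == 0) return a;
  }
  return -1;
}

// Return the first direct child whose tag name matches, or NULL if none.
static struct ncclXmlNode* xmlFindSubSketch(struct ncclXmlNode* node, const char* name) {
  for (int s = 0; s < node->nSubs; s++) {
    if (strcmp(node->subs[s]->name, name) == 0) return node->subs[s];
  }
  return NULL;
}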

Two Kinds of Nodes

NCCL's XML has two main kinds of leaf nodes: GPU nodes and NET nodes. Within a single host, GPUs have several ways to interconnect; to reach a GPU on another host, traffic must go through a NET node. These two kinds of nodes are built by ncclTopoFillGpu and ncclTopoFillNet respectively.

ncclTopoFillGpu

ncclTopoFillGpu builds the ncclXmlNode representing a GPU from its busId. The busId is the GPU's BDF (Bus, Device, Function) address as defined by the PCI standard; a BDF address locates a specific PCI device, and under Linux a GPU is treated as a PCI device.
ncclTopoFillGpu calls ncclTopoGetPciNode, ncclTopoGetXmlFromSys and ncclTopoGetXmlFromGpu to obtain the GPU's attributes from different sources and assemble a complete GPU node.
These functions query PCI information through Linux sysfs and GPU information through NVIDIA's NVML interface, creating the PCI node, the NUMA node (represented as a CPU node), the GPU node, the NVLINK nodes, and so on.

ncclResult_t ncclTopoFillGpu(struct ncclXml* xml, const char* busId, struct ncclXmlNode** gpuNode) {
  struct ncclXmlNode* node;
  NCCLCHECK(ncclTopoGetPciNode(xml, busId, &node));
  NCCLCHECK(xmlSetAttrIfUnset(node, "class", "0x03")); // PCI class 0x03 means Display Controller
  NCCLCHECK(ncclTopoGetXmlFromSys(node, xml));
  nvmlDevice_t nvmlDev;
  NCCLCHECK(ncclNvmlDeviceGetHandleByPciBusId(busId, &nvmlDev));
  NCCLCHECK(ncclTopoGetXmlFromGpu(node, nvmlDev, xml, gpuNode));
  return ncclSuccess;
}
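
A hedged sketch of a possible calling context follows (this is not NCCL's exact code, and the wrapper name is an assumption): take a CUDA device index, obtain its BDF bus ID, lower-case it so it matches sysfs paths, and hand it to ncclTopoFillGpu.

#include <ctype.h>

static ncclResult_t fillGpuForCudaDevSketch(struct ncclXml* xml, int cudaDev, struct ncclXmlNode** gpuNode) {
  char busId[NVML_DEVICE_PCI_BUS_ID_BUFFER_SIZE];
  // Ask the CUDA runtime for the device's "domain:bus:device.function" string.
  CUDACHECK(cudaDeviceGetPCIBusId(busId, NVML_DEVICE_PCI_BUS_ID_BUFFER_SIZE, cudaDev));
  // sysfs paths use lower-case hex digits, so normalize the bus ID.
  for (char* c = busId; *c; c++) *c = tolower(*c);
  NCCLCHECK(ncclTopoFillGpu(xml, busId, gpuNode));
  return ncclSuccess;
}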

ncclTopoGetPciNode

ncclTopoGetPciNode builds the PCI node for a GPU: it looks up an existing "pci" node with the given busid and creates one if none exists.

ncclResult_t ncclTopoGetPciNode(struct ncclXml* xml, const char* busId, struct ncclXmlNode** pciNode) {
  NCCLCHECK(xmlFindTagKv(xml, "pci", pciNode, "busid", busId));
  if (*pciNode == NULL) {
    NCCLCHECK(xmlAddNode(xml, NULL, "pci", pciNode));
    NCCLCHECK(xmlSetAttr(*pciNode, "busid", busId));
  }
  return ncclSuccess;
}

ncclTopoGetXmlFromSys

ncclTopoGetXmlFromSys reads PCI device information by querying Linux sysfs. The code first locates the PCI device's path on the system from the busid, extracts PCI attributes from that path, and finally places the node under the correct NUMA node (which becomes the PCI node's parent and may need to be created first; NCCL represents a NUMA node with a CPU node).

ncclResult_t ncclTopoGetXmlFromSys(struct ncclXmlNode* pciNode, struct ncclXml* xml) {
  // Fill info, then parent
  const char* busId;
  NCCLCHECK(xmlGetAttr(pciNode, "busid", &busId));
  char* path = NULL;
  ncclDebugNoWarn = NCCL_GRAPH;
  getPciPath(busId, &path);
  ncclDebugNoWarn = 0;

  if (path) {
    NCCLCHECK(ncclTopoSetAttrFromSys(pciNode, path, "class", "class"));
  }
  int index;
  ncclDebugNoWarn = NCCL_GRAPH;
  NCCLCHECK(xmlGetAttrIndex(pciNode, "vendor", &index));
  if (index == -1) {
    if (path) ncclTopoSetAttrFromSys(pciNode, path, "vendor", "vendor");
  }
  NCCLCHECK(xmlGetAttrIndex(pciNode, "device", &index));
  if (index == -1) {
    if (path) ncclTopoSetAttrFromSys(pciNode, path, "device", "device");
  }
  NCCLCHECK(xmlGetAttrIndex(pciNode, "subsystem_vendor", &index));
  if (index == -1) {
    if (path) ncclTopoSetAttrFromSys(pciNode, path, "subsystem_vendor", "subsystem_vendor");
  }
  NCCLCHECK(xmlGetAttrIndex(pciNode, "subsystem_device", &index));
  if (index == -1) {
    if (path) ncclTopoSetAttrFromSys(pciNode, path, "subsystem_device", "subsystem_device");
  }
  ncclDebugNoWarn = 0;
  NCCLCHECK(xmlGetAttrIndex(pciNode, "link_speed", &index));
  if (index == -1) {
    if (path) {
      char deviceSpeedStr[MAX_STR_LEN];
      float deviceSpeed = FLT_MAX;
      NCCLCHECK(ncclTopoGetStrFromSys(path, "max_link_speed", deviceSpeedStr));
      sscanf(deviceSpeedStr, "%f GT/s", &deviceSpeed);
      char portSpeedStr[MAX_STR_LEN];
      float portSpeed = FLT_MAX;
      NCCLCHECK(ncclTopoGetStrFromSys(path, "../max_link_speed", portSpeedStr));
      sscanf(portSpeedStr, "%f GT/s", &portSpeed);
      NCCLCHECK(xmlSetAttr(pciNode, "link_speed", portSpeed < deviceSpeed ? portSpeedStr : deviceSpeedStr));
    } else {
      NCCLCHECK(xmlSetAttr(pciNode, "link_speed", ""));
    }
  }
  NCCLCHECK(xmlGetAttrIndex(pciNode, "link_width", &index));
  if (index == -1) {
    if (path) {
      char strValue[MAX_STR_LEN];
      NCCLCHECK(ncclTopoGetStrFromSys(path, "max_link_width", strValue));
      int deviceWidth = strtol(strValue, NULL, 0);
      NCCLCHECK(ncclTopoGetStrFromSys(path, "../max_link_width", strValue));
      int portWidth = strtol(strValue, NULL, 0);
      NCCLCHECK(xmlSetAttrInt(pciNode, "link_width", std::min(deviceWidth,portWidth)));
    } else {
      NCCLCHECK(xmlSetAttr(pciNode, "link_width", ""));
    }
  }
  struct ncclXmlNode* parent = pciNode->parent;
  if (parent == NULL) {
    if (path) {
      // Save that for later in case next step is a CPU
      char numaIdStr[MAX_STR_LEN];
      NCCLCHECK(ncclTopoGetStrFromSys(path, "numa_node", numaIdStr));

      // Go up one level in the PCI tree. Rewind two "/" and follow the upper PCI
      // switch, or stop if we reach a CPU root complex.
      int slashCount = 0;
      int parentOffset;
      for (parentOffset = strlen(path)-1; parentOffset>0; parentOffset--) {
        if (path[parentOffset] == '/') {
          slashCount++;
          path[parentOffset] = '\0';
          int start = parentOffset - 1;
          while (start>0 && path[start] != '/') start--;
          // Check whether the parent path looks like "BBBB:BB:DD.F" or not.
          if (checkBDFFormat(path+start+1) == 0) {
            // This a CPU root complex. Create a CPU tag and stop there.
            struct ncclXmlNode* topNode;
            NCCLCHECK(xmlFindTag(xml, "system", &topNode));
            NCCLCHECK(xmlGetSubKv(topNode, "cpu", &parent, "numaid", numaIdStr));
            if (parent == NULL) {
              NCCLCHECK(xmlAddNode(xml, topNode, "cpu", &parent));
              NCCLCHECK(xmlSetAttrLong(parent, "host_hash", getHostHash()));
              NCCLCHECK(xmlSetAttr(parent, "numaid", numaIdStr));
            }
          } else if (slashCount == 2) {
            // Continue on the upper PCI switch
            for (int i = strlen(path)-1; i>0; i--) {
              if (path[i] == '/') {
                NCCLCHECK(xmlFindTagKv(xml, "pci", &parent, "busid", path+i+1));
                if (parent == NULL) {
                  NCCLCHECK(xmlAddNode(xml, NULL, "pci", &parent));
                  NCCLCHECK(xmlSetAttr(parent, "busid", path+i+1));
                }
                break;
              }
            }
          }
        }
        if (parent) break;
      }
    } else {
      // No information on /sys, attach GPU to unknown CPU
      NCCLCHECK(xmlFindTagKv(xml, "cpu", &parent, "numaid", "-1"));
      if (parent == NULL) {
        struct ncclXmlNode* topNode;
        NCCLCHECK(xmlFindTag(xml, "system", &topNode));
        NCCLCHECK(xmlAddNode(xml, topNode, "cpu", &parent));
        NCCLCHECK(xmlSetAttrLong(parent, "host_hash", getHostHash()));
        NCCLCHECK(xmlSetAttr(parent, "numaid", "-1"));
        NCCLCHECK(ncclTopoGetXmlFromCpu(parent, xml));
      }
    }
    pciNode->parent = parent;
    // Keep PCI sub devices ordered by PCI Bus ID (Issue #820)
    int subIndex = parent->nSubs;
    const char* newBusId;
    NCCLCHECK(xmlGetAttrStr(pciNode, "busid", &newBusId));
    for (int s=0; s<parent->nSubs; s++) {
      const char* busId;
      NCCLCHECK(xmlGetAttr(parent->subs[s], "busid", &busId));
      if (busId != NULL && strcmp(newBusId, busId) < 0) { subIndex = s; break; }
    }
    if (parent->nSubs == MAX_SUBS) {
      WARN("Error : XML parser is limited to %d subnodes", MAX_SUBS);
      return ncclInternalError;
    }
    for (int s = parent->nSubs; s > subIndex; s--) parent->subs[s] = parent->subs[s-1];
    parent->subs[subIndex] = pciNode;
    parent->nSubs++;
  }
  if (strcmp(parent->name, "pci") == 0) {
    NCCLCHECK(ncclTopoGetXmlFromSys(parent, xml));
  } else if (strcmp(parent->name, "cpu") == 0) {
    NCCLCHECK(ncclTopoGetXmlFromCpu(parent, xml));
  }
  free(path);
  return ncclSuccess;
}
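
The getPciPath call above is the step that turns a BDF bus ID into a sysfs directory. A hedged sketch of the idea follows (the helper name is made up): the symlink /sys/class/pci_bus/<domain:bus>/../../<busid> resolves, via realpath(), to the device directory under /sys/devices, e.g. /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0, and it is the components of that path which the code above walks upward through the PCI tree.

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

// Resolve a PCI bus ID such as "0000:03:00.0" to its sysfs device directory.
static char* pciPathSketch(const char* busId) {
  char link[PATH_MAX];
  // The first 7 characters of the bus ID are "<domain>:<bus>".
  snprintf(link, sizeof(link), "/sys/class/pci_bus/%.7s/../../%s", busId, busId);
  return realpath(link, NULL);  // heap-allocated absolute path, or NULL if the device is absent
}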

ncclTopoGetXmlFromGpu

ncclTopoGetXmlFromGpu obtains GPU information by querying the NVML interface: the GPU's SM version, its device index, and its GPU-to-GPU (southbound) interconnects such as NVLINK.

ncclResult_t ncclTopoGetXmlFromGpu(struct ncclXmlNode* pciNode, nvmlDevice_t nvmlDev, struct ncclXml* xml, struct ncclXmlNode** gpuNodeRet) {
  struct ncclXmlNode* gpuNode = NULL;
  NCCLCHECK(xmlGetSub(pciNode, "gpu", &gpuNode));
  if (gpuNode == NULL) NCCLCHECK(xmlAddNode(xml, pciNode, "gpu", &gpuNode));

  int index = -1;

  int dev = -1;
  NCCLCHECK(xmlGetAttrIndex(gpuNode, "dev", &index));
  if (index == -1) {
    NCCLCHECK(ncclNvmlDeviceGetIndex(nvmlDev, (unsigned int*)&dev));
    NCCLCHECK(xmlSetAttrInt(gpuNode, "dev", dev));
  }
  NCCLCHECK(xmlGetAttrInt(gpuNode, "dev", &dev));
  if (dev == -1) { *gpuNodeRet = NULL; return ncclSuccess; }

  NCCLCHECK(xmlGetAttrIndex(gpuNode, "sm", &index));
  if (index == -1) {
    int cudaMajor, cudaMinor;
    if (nvmlDev == NULL) {
      cudaDeviceProp devProp;
      CUDACHECK(cudaGetDeviceProperties(&devProp, dev));
      cudaMajor = devProp.major; cudaMinor = devProp.minor;
    } else {
      NCCLCHECK(ncclNvmlDeviceGetCudaComputeCapability(nvmlDev, &cudaMajor, &cudaMinor));
    }
    NCCLCHECK(xmlSetAttrInt(gpuNode, "sm", cudaMajor*10+cudaMinor));
  }
  int sm;
  NCCLCHECK(xmlGetAttrInt(gpuNode, "sm", &sm));

  struct ncclXmlNode* nvlNode = NULL;
  NCCLCHECK(xmlGetSub(gpuNode, "nvlink", &nvlNode));
  if (nvlNode == NULL) {
    // NVML NVLink detection
    int maxNvLinks = (sm < 60) ? 0 : (sm < 70) ? 4 : (sm < 80) ? 6 : (sm < 90) ? 12 : 18;

    if (maxNvLinks > 0 && nvmlDev == NULL) {
      WARN("No NVML device handle. Skipping nvlink detection.");
      maxNvLinks = 0;
    }

    for (int l=0; l<maxNvLinks; ++l) {
      // Check whether we can use this NVLink for P2P
      unsigned canP2P;
      if ((ncclNvmlDeviceGetNvLinkCapability(nvmlDev, l, NVML_NVLINK_CAP_P2P_SUPPORTED, &canP2P) != ncclSuccess) || !canP2P) continue;

      // Make sure the Nvlink is up. The previous call should have trained the link.
      nvmlEnableState_t isActive = NVML_FEATURE_DISABLED;
#if CUDART_VERSION >= 11080
      if (sm >= 90) {
        nvmlFieldValue_t fv;
        fv.fieldId = NVML_FI_DEV_NVLINK_GET_STATE;
        fv.scopeId = l;
        // fv.value will contain NV_FEATURE_ENABLED or NV_FEATURE_DISABLED
        if ((ncclNvmlDeviceGetFieldValues(nvmlDev, 1, &fv) == ncclSuccess) && (fv.nvmlReturn == NVML_SUCCESS))
          isActive = (nvmlEnableState_t) fv.value.uiVal;
      } else /* FALLTHRU to GetNvLinkState if before SM90 */
#endif
      {
        (void) ncclNvmlDeviceGetNvLinkState(nvmlDev, l, &isActive);
      }
      if (isActive != NVML_FEATURE_ENABLED) continue;

      // Try to figure out what's on the other side of the NVLink
      nvmlPciInfo_t remoteProc;
      if (ncclNvmlDeviceGetNvLinkRemotePciInfo(nvmlDev, l, &remoteProc) != ncclSuccess) continue;

      // Make a lower case copy of the bus ID for calling ncclDeviceType
      // PCI system path is in lower case
      char* p = remoteProc.busId;
      char lowerId[NVML_DEVICE_PCI_BUS_ID_BUFFER_SIZE];
      for (int c=0; c<NVML_DEVICE_PCI_BUS_ID_BUFFER_SIZE; c++) {
        lowerId[c] = tolower(p[c]);
        if (p[c] == 0) break;
      }

      NCCLCHECK(xmlGetSubKv(gpuNode, "nvlink", &nvlNode, "target", lowerId));
      if (nvlNode == NULL) {
        NCCLCHECK(xmlAddNode(xml, gpuNode, "nvlink", &nvlNode));
        NCCLCHECK(xmlSetAttr(nvlNode, "target", lowerId));
        NCCLCHECK(xmlSetAttrInt(nvlNode, "count", 1));
      } else {
        int count;
        NCCLCHECK(xmlGetAttrInt(nvlNode, "count", &count));
        NCCLCHECK(xmlSetAttrInt(nvlNode, "count", count+1));
      }
    }
  }
#if CUDART_VERSION >= 11080
  struct ncclXmlNode* c2cNode = NULL;
  NCCLCHECK(xmlGetSub(gpuNode, "c2c", &c2cNode));
  if (c2cNode == NULL) {
      if (sm >= 90) {
        int c2cLinksCount = 0;
        nvmlFieldValue_t fv;
        fv.fieldId = NVML_FI_DEV_C2C_LINK_COUNT;
        if ((ncclNvmlDeviceGetFieldValues(nvmlDev, 1, &fv) == ncclSuccess) && (fv.nvmlReturn == NVML_SUCCESS)) {
          c2cLinksCount = fv.value.uiVal;
          int bw = 0;
          int count = 0;
          for (int l=0; l<c2cLinksCount; l++) {
            nvmlFieldValue_t fvs[2];
            fvs[0].fieldId = NVML_FI_DEV_C2C_LINK_GET_STATUS;
            fvs[0].scopeId = l;
            fvs[1].fieldId = NVML_FI_DEV_C2C_LINK_GET_MAX_BW;
            fvs[1].scopeId = l;
            if ((ncclNvmlDeviceGetFieldValues(nvmlDev, 2, fvs) == ncclSuccess) &&
                (fvs[0].nvmlReturn == NVML_SUCCESS) &&
                (fvs[0].value.uiVal == 1) &&
                (fvs[1].nvmlReturn == NVML_SUCCESS)) {
              bw = fvs[1].value.uiVal;
              count++;
            }
          }
          if (count > 0) {
            NCCLCHECK(xmlAddNode(xml, gpuNode, "c2c", &c2cNode));
            NCCLCHECK(xmlSetAttrInt(c2cNode, "bw", bw));
            NCCLCHECK(xmlSetAttrInt(c2cNode, "count", count));
          }
        }
      }
  }
#endif
  // Fill target classes
  for (int s=0; s<gpuNode->nSubs; s++) {
    struct ncclXmlNode* sub = gpuNode->subs[s];
    if (strcmp(sub->name, "nvlink") != 0) continue;
    int index;
    NCCLCHECK(xmlGetAttrIndex(sub, "tclass", &index));
    if (index == -1) {
      const char* busId;
      NCCLCHECK(xmlGetAttr(sub, "target", &busId));
      char* path;
      ncclDebugNoWarn = NCCL_GRAPH;
      getPciPath(busId, &path);
      ncclDebugNoWarn = 0;
      if (path == NULL || strcmp(busId, "fffffff:ffff:ff") == 0) {
        // Remote NVLink device is not visible inside this VM. Assume NVSwitch.
        NCCLCHECK(xmlSetAttr(sub, "tclass", "0x068000"));
      } else {
        NCCLCHECK(ncclTopoSetAttrFromSys(sub, path, "class", "tclass"));
        free(path);
      }
    }
  }
  *gpuNodeRet = gpuNode;
  return ncclSuccess;
}

ncclTopoFillNet

ncclTopoFillNet obtains the relevant PCI information through Linux sysfs and builds the XML nodes accordingly. The remaining NIC attributes are filled in later by upper-layer modules via attribute helpers such as xmlSetAttrInt and xmlInitAttrInt. In practice the NIC information comes from library functions provided by the NIC vendor; NCCL defines a unified interface for network libraries so it can adapt to different vendors' runtime libraries.

ncclResult_t ncclTopoFillNet(struct ncclXml* xml, const char* pciPath, const char* netName, struct ncclXmlNode** netNode) {
  NCCLCHECK(xmlFindTagKv(xml, "net", netNode, "name", netName));
  if (*netNode != NULL) return ncclSuccess;

  const char* pciSysPath = pciPath;
  if (pciSysPath) {
    char subSystem[PATH_MAX];
    NCCLCHECK(ncclTopoGetSubsystem(pciSysPath, subSystem));
    // This is not a PCI device (virtual, usb, ...).
    if (strcmp(subSystem, "pci") != 0) {
      INFO(NCCL_GRAPH, "Topology detection: network path %s is not a PCI device (%s). Attaching to first CPU", pciSysPath, subSystem);
      pciSysPath = NULL;
    }
  }

  struct ncclXmlNode* parent = NULL;
  if (pciSysPath) {
    int offset;
    for (offset=strlen(pciSysPath)-1; pciSysPath[offset] != '/'; offset--);
    char busId[NVML_DEVICE_PCI_BUS_ID_BUFFER_SIZE];
    strcpy(busId, pciSysPath+offset+1);
    NCCLCHECK(ncclTopoGetPciNode(xml, busId, &parent));
    NCCLCHECK(xmlSetAttrIfUnset(parent, "class", "0x02"));
    NCCLCHECK(ncclTopoGetXmlFromSys(parent, xml));
  } else {
    // Virtual NIC, no PCI device, attach to first CPU
    NCCLCHECK(xmlFindTag(xml, "cpu", &parent));
  }

  struct ncclXmlNode* nicNode = NULL;
  NCCLCHECK(xmlGetSub(parent, "nic", &nicNode));
  if (nicNode == NULL) {
    NCCLCHECK(xmlAddNode(xml, parent, "nic", &nicNode));
  }

  // We know that this net does not exist yet (we searched for it at the
  // beginning of this function), so we can add it.
  NCCLCHECK(xmlAddNode(xml, nicNode, "net", netNode));
  NCCLCHECK(xmlSetAttr(*netNode, "name", netName));
  return ncclSuccess;
}
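
As a hedged sketch (not NCCL's exact code), this is roughly what an upper layer could do to fill in the remaining NIC attributes on the node returned by ncclTopoFillNet, using the attribute helpers mentioned above. Every parameter and the function name are assumptions standing in for whatever the vendor network plugin reports; the attribute names mirror the ones visible in the XML dump shown later.

static ncclResult_t addNetSketch(struct ncclXml* xml, const char* pciPath, const char* name,
                                 int dev, int speed, int port, long guid, int gdrSupport) {
  struct ncclXmlNode* netNode;
  NCCLCHECK(ncclTopoFillNet(xml, pciPath, name, &netNode));
  NCCLCHECK(xmlSetAttrInt(netNode, "dev", dev));         // logical NET device index
  NCCLCHECK(xmlSetAttrInt(netNode, "speed", speed));     // link speed reported by the plugin
  NCCLCHECK(xmlSetAttrInt(netNode, "port", port));
  NCCLCHECK(xmlSetAttrLong(netNode, "guid", guid));
  NCCLCHECK(xmlSetAttrInt(netNode, "gdr", gdrSupport));  // whether GPUDirect RDMA is usable
  return ncclSuccess;
}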

Two Kinds of Topologies

NCCL's XML represents both the physical topology and the logical topology; in NCCL's own terms, ncclTopoSystem and ncclTopoGraph.
The XML module provides two XML parsing functions, one for the xmlTopoFile and one for the xmlGraphFile. Looking at how these two files are deserialized is a good way to understand their structure.

/* File functions */
#define NCCL_TOPO_XML_VERSION 1
ncclResult_t ncclTopoGetXmlFromFile(const char* xmlTopoFile, struct ncclXml* xml, int warn);
ncclResult_t ncclTopoDumpXmlToFile(const char* xmlTopoFile, struct ncclXml* xml);
#define NCCL_GRAPH_XML_VERSION 1
ncclResult_t ncclTopoGetXmlGraphFromFile(const char* xmlGraphFile, struct ncclXml* xml);
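
A hedged usage sketch follows (the file names, variable names, and wrapper function are assumptions): dump the detected physical topology to a file, parse it back in, and parse a previously dumped graph file (for example one produced via NCCL_GRAPH_DUMP_FILE).

static ncclResult_t dumpAndReloadSketch(struct ncclXml* xmlSystem, struct ncclXml* xmlParsed, struct ncclXml* xmlGraphs) {
  // Serialize the detected system topology, then read it back.
  NCCLCHECK(ncclTopoDumpXmlToFile("topo.xml", xmlSystem));
  NCCLCHECK(ncclTopoGetXmlFromFile("topo.xml", xmlParsed, 1 /* warn if the file cannot be read */));
  // Parse a graph file dumped from an earlier run.
  NCCLCHECK(ncclTopoGetXmlGraphFromFile("graph.xml", xmlGraphs));
  return ncclSuccess;
}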

Physical Topology

The physical topology captures every node and all of the interconnect information. The logical topology search later runs over this information to pick the most suitable of the available paths.
The main logic of ncclTopoGetXmlFromFile is outlined in the figure below.
(Figure: ncclTopoGetXmlFromFile mindmap)
This function makes the structure of the physical topology graph clear: besides the PCI topology of the devices (the northbound interconnect), it also describes the GPUs' southbound interconnect and the NIC interconnect, which covers every form of GPU connectivity.
(Figure: topo_system_mindmap)
Below is the physical topology of an A100-SXM4-40GB machine.

<system version="1">
 <cpu numaid="-1" arch="x86_64" vendor="AuthenticAMD" familyid="175" modelid="1">
   <pci busid="0000:03:00.0" class="0x030200" vendor="0x10de" device="0x20b0" subsystem_vendor="0x10de" subsystem_device="0x144e" link_speed="5.0 GT/s PCIe" link_width="32">
     <gpu dev="0" sm="80" rank="0" gdr="1">
       <nvlink target="0000:15:00.0" count="4" tclass="0x030200"/>
       <nvlink target="0000:05:00.0" count="4" tclass="0x030200"/>
       <nvlink target="0000:1e:00.0" count="4" tclass="0x030200"/>
     </gpu>
   </pci>
   <pci busid="0000:05:00.0" class="0x030200" vendor="0x10de" device="0x20b0" subsystem_vendor="0x10de" subsystem_device="0x144e" link_speed="5.0 GT/s PCIe" link_width="32">
     <gpu dev="1" sm="80" rank="1" gdr="1">
       <nvlink target="0000:15:00.0" count="4" tclass="0x030200"/>
       <nvlink target="0000:1e:00.0" count="4" tclass="0x030200"/>
       <nvlink target="0000:03:00.0" count="4" tclass="0x030200"/>
     </gpu>
   </pci>
   <pci busid="0000:0d:00.0" class="0x020000" vendor="0x15ad" device="0x07b0" subsystem_vendor="0x15ad" subsystem_device="0x07b0" link_speed="5.0 GT/s PCIe" link_width="32">
     <nic>
       <net name="ens192" dev="0" speed="10000" port="0" latency="0.000000" guid="0x0" maxconn="65536" gdr="0"/>
     </nic>
   </pci>
   <pci busid="0000:15:00.0" class="0x030200" vendor="0x10de" device="0x20b0" subsystem_vendor="0x10de" subsystem_device="0x144e" link_speed="5.0 GT/s PCIe" link_width="32">
     <gpu dev="2" sm="80" rank="2" gdr="1">
       <nvlink target="0000:05:00.0" count="4" tclass="0x030200"/>
       <nvlink target="0000:1e:00.0" count="4" tclass="0x030200"/>
       <nvlink target="0000:03:00.0" count="4" tclass="0x030200"/>
     </gpu>
   </pci>
   <pci busid="0000:1e:00.0" class="0x030200" vendor="0x10de" device="0x20b0" subsystem_vendor="0x10de" subsystem_device="0x144e" link_speed="5.0 GT/s PCIe" link_width="32">
     <gpu dev="3" sm="80" rank="3" gdr="1">
       <nvlink target="0000:15:00.0" count="4" tclass="0x030200"/>
       <nvlink target="0000:05:00.0" count="4" tclass="0x030200"/>
       <nvlink target="0000:03:00.0" count="4" tclass="0x030200"/>
     </gpu>
   </pci>
   <nic>
     <net name="br-cacff5436e72" dev="1" speed="10000" port="0" latency="0.000000" guid="0x1" maxconn="65536" gdr="0"/>
     <net name="veth7ca7bdc" dev="2" speed="10000" port="0" latency="0.000000" guid="0x2" maxconn="65536" gdr="0"/>
     <net name="veth79962a8" dev="3" speed="10000" port="0" latency="0.000000" guid="0x3" maxconn="65536" gdr="0"/>
     <net name="vetha81454c" dev="4" speed="10000" port="0" latency="0.000000" guid="0x4" maxconn="65536" gdr="0"/>
     <net name="veth95b9bfa" dev="5" speed="10000" port="0" latency="0.000000" guid="0x5" maxconn="65536" gdr="0"/>
     <net name="veth4af3870" dev="6" speed="10000" port="0" latency="0.000000" guid="0x6" maxconn="65536" gdr="0"/>
     <net name="veth628188a" dev="7" speed="10000" port="0" latency="0.000000" guid="0x7" maxconn="65536" gdr="0"/>
     <net name="veth338552b" dev="8" speed="10000" port="0" latency="0.000000" guid="0x8" maxconn="65536" gdr="0"/>
     <net name="vethab60979" dev="9" speed="10000" port="0" latency="0.000000" guid="0x9" maxconn="65536" gdr="0"/>
     <net name="veth994a543" dev="10" speed="10000" port="0" latency="0.000000" guid="0xa" maxconn="65536" gdr="0"/>
     <net name="vethe6da593" dev="11" speed="10000" port="0" latency="0.000000" guid="0xb" maxconn="65536" gdr="0"/>
   </nic>
 </cpu>
</system>

Logical Topology

The logical topology captures the interconnect topology between GPUs and NETs and the number of channels over that interconnect.
The main logic of ncclTopoGetXmlGraphFromFile is outlined in the figure below.
(Figure: ncclTopoGetXmlGraphFromFile mindmap)
This function makes the structure of the graph XML clear. The logical topology information is stored as attributes in the graph: for a set of nodes (GPUs or NETs), it records which logical topology connects them (Ring, Tree, and so on) and how many channels (the degree of parallelism) that topology provides.
(Figure: topo_graph_mindmap)
Below is the logical topology searched out on the A100-SXM4-40GB machine.

<graphs version="1">
  <graph id="0" pattern="4" crossnic="0" nchannels="12" speedintra="20" speedinter="20" latencyinter="0" typeintra="NVL" typeinter="PIX" samechannels="0">
    <channel>
      <gpu dev="0"/>
      <gpu dev="1"/>
      <gpu dev="2"/>
      <gpu dev="3"/>
    </channel>
    <channel>
      <gpu dev="0"/>
      <gpu dev="1"/>
      <gpu dev="3"/>
      <gpu dev="2"/>
    </channel>
    <channel>
      <gpu dev="0"/>
      <gpu dev="2"/>
      <gpu dev="3"/>
      <gpu dev="1"/>
    </channel>
    <channel>
      <gpu dev="0"/>
      <gpu dev="2"/>
      <gpu dev="1"/>
      <gpu dev="3"/>
    </channel>
    <channel>
      <gpu dev="0"/>
      <gpu dev="3"/>
      <gpu dev="1"/>
      <gpu dev="2"/>
    </channel>
    <channel>
      <gpu dev="0"/>
      <gpu dev="3"/>
      <gpu dev="2"/>
      <gpu dev="1"/>
    </channel>
    <channel>
      <gpu dev="0"/>
      <gpu dev="1"/>
      <gpu dev="2"/>
      <gpu dev="3"/>
    </channel>
    <channel>
      <gpu dev="0"/>
      <gpu dev="1"/>
      <gpu dev="3"/>
      <gpu dev="2"/>
    </channel>
    <channel>
      <gpu dev="0"/>
      <gpu dev="2"/>
      <gpu dev="3"/>
      <gpu dev="1"/>
    </channel>
    <channel>
      <gpu dev="0"/>
      <gpu dev="2"/>
      <gpu dev="1"/>
      <gpu dev="3"/>
    </channel>
    <channel>
      <gpu dev="0"/>
      <gpu dev="3"/>
      <gpu dev="1"/>
      <gpu dev="2"/>
    </channel>
    <channel>
      <gpu dev="0"/>
      <gpu dev="3"/>
      <gpu dev="2"/>
      <gpu dev="1"/>
    </channel>
  </graph>
  <graph id="1" pattern="1" crossnic="0" nchannels="12" speedintra="20" speedinter="20" latencyinter="0" typeintra="NVL" typeinter="PIX" samechannels="0">
    <channel>
      <gpu dev="0"/>
      <gpu dev="1"/>
      <gpu dev="2"/>
      <gpu dev="3"/>
    </channel>
    <channel>
      <gpu dev="0"/>
      <gpu dev="1"/>
      <gpu dev="2"/>
      <gpu dev="3"/>
    </channel>
    <channel>
      <gpu dev="0"/>
      <gpu dev="1"/>
      <gpu dev="2"/>
      <gpu dev="3"/>
    </channel>
    <channel>
      <gpu dev="0"/>
      <gpu dev="1"/>
      <gpu dev="2"/>
      <gpu dev="3"/>
    </channel>
    <channel>
      <gpu dev="0"/>
      <gpu dev="2"/>
      <gpu dev="1"/>
      <gpu dev="3"/>
    </channel>
    <channel>
      <gpu dev="0"/>
      <gpu dev="2"/>
      <gpu dev="1"/>
      <gpu dev="3"/>
    </channel>
    <channel>
      <gpu dev="0"/>
      <gpu dev="2"/>
      <gpu dev="1"/>
      <gpu dev="3"/>
    </channel>
    <channel>
      <gpu dev="0"/>
      <gpu dev="2"/>
      <gpu dev="1"/>
      <gpu dev="3"/>
    </channel>
    <channel>
      <gpu dev="1"/>
      <gpu dev="0"/>
      <gpu dev="3"/>
      <gpu dev="2"/>
    </channel>
    <channel>
      <gpu dev="1"/>
      <gpu dev="0"/>
      <gpu dev="3"/>
      <gpu dev="2"/>
    </channel>
    <channel>
      <gpu dev="1"/>
      <gpu dev="0"/>
      <gpu dev="3"/>
      <gpu dev="2"/>
    </channel>
    <channel>
      <gpu dev="1"/>
      <gpu dev="0"/>
      <gpu dev="3"/>
      <gpu dev="2"/>
    </channel>
  </graph>
  <graph id="2" pattern="3" crossnic="0" nchannels="0" speedintra="0" speedinter="0" latencyinter="0" typeintra="LOC" typeinter="LOC" samechannels="0"/>
  <graph id="3" pattern="5" crossnic="0" nchannels="0" speedintra="0" speedinter="0" latencyinter="0" typeintra="LOC" typeinter="LOC" samechannels="0"/>
</graphs>

The pattern values are defined in graph.h as follows.

#define NCCL_TOPO_PATTERN_BALANCED_TREE 1   // Spread NIC traffic between two GPUs (Tree parent + one child on first GPU, second child on second GPU)
#define NCCL_TOPO_PATTERN_SPLIT_TREE 2      // Spread NIC traffic between two GPUs (Tree parent on first GPU, tree children on the second GPU)
#define NCCL_TOPO_PATTERN_TREE 3            // All NIC traffic going to/from the same GPU
#define NCCL_TOPO_PATTERN_RING 4            // Ring
#define NCCL_TOPO_PATTERN_NVLS 5            // NVLS+SHARP and NVLS+Tree

Two non-empty logical topologies came out of the search, NCCL_TOPO_PATTERN_RING and NCCL_TOPO_PATTERN_BALANCED_TREE, each built from the four GPUs.
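
Since the logical topology is stored purely as node attributes, reading it back only needs the generic attribute helpers. The function below is a hedged sketch (its name and calling context are assumptions), not NCCL's search code.

static ncclResult_t readGraphAttrsSketch(struct ncclXml* xmlGraphs) {
  struct ncclXmlNode* graphNode;
  NCCLCHECK(xmlFindTag(xmlGraphs, "graph", &graphNode));  // first <graph> element
  if (graphNode == NULL) return ncclSuccess;              // file contained no graph
  int pattern, nChannels;
  NCCLCHECK(xmlGetAttrInt(graphNode, "pattern", &pattern));      // e.g. 4 == NCCL_TOPO_PATTERN_RING
  NCCLCHECK(xmlGetAttrInt(graphNode, "nchannels", &nChannels));  // number of parallel channels
  INFO(NCCL_GRAPH, "graph pattern %d with %d channels", pattern, nChannels);
  return ncclSuccess;
}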

Detecting the Southbound Interconnect

The southbound interconnect connects GPUs to each other without relying on the CPU's PCIe hierarchy. NCCL's XML module adds southbound links (NVLINK and C2C) to a GPU by querying the NVML interface. Below is the NVLink detection code from ncclTopoGetXmlFromGpu.

  struct ncclXmlNode* nvlNode = NULL;
  NCCLCHECK(xmlGetSub(gpuNode, "nvlink", &nvlNode));
  if (nvlNode == NULL) {
    // #1 Check GPU Architecture to get maxNvlinks
    int maxNvLinks = (sm < 60) ? 0 : (sm < 70) ? 4 : (sm < 80) ? 6 : (sm < 90) ? 12 : 18;
    if (maxNvLinks > 0 && nvmlDev == NULL) {
      WARN("No NVML device handle. Skipping nvlink detection.");
      maxNvLinks = 0;
    }

    // #2 For each link, check P2P Capability, check Link Active State, query remote GPU info
    for (int l=0; l<maxNvLinks; ++l) {
      // #2.1 Check whether we can use this NVLink for P2P
      unsigned canP2P;
      if ((ncclNvmlDeviceGetNvLinkCapability(nvmlDev, l, NVML_NVLINK_CAP_P2P_SUPPORTED, &canP2P) != ncclSuccess) || !canP2P) continue;

      // #2.2 Make sure the Nvlink is up. The previous call should have trained the link.
      nvmlEnableState_t isActive = NVML_FEATURE_DISABLED;
#if CUDART_VERSION >= 11080
      if (sm >= 90) {
        nvmlFieldValue_t fv;
        fv.fieldId = NVML_FI_DEV_NVLINK_GET_STATE;
        fv.scopeId = l;
        // fv.value will contain NV_FEATURE_ENABLED or NV_FEATURE_DISABLED
        if ((ncclNvmlDeviceGetFieldValues(nvmlDev, 1, &fv) == ncclSuccess) && (fv.nvmlReturn == NVML_SUCCESS))
          isActive = (nvmlEnableState_t) fv.value.uiVal;
      } else /* FALLTHRU to GetNvLinkState if before SM90 */
#endif
      {
        (void) ncclNvmlDeviceGetNvLinkState(nvmlDev, l, &isActive);
      }
      if (isActive != NVML_FEATURE_ENABLED) continue;
      
      // #2.3 Try to figure out what's on the other side of the NVLink
      nvmlPciInfo_t remoteProc;
      if (ncclNvmlDeviceGetNvLinkRemotePciInfo(nvmlDev, l, &remoteProc) != ncclSuccess) continue;

      // Make a lower case copy of the bus ID for calling ncclDeviceType
      // PCI system path is in lower case
      char* p = remoteProc.busId;
      char lowerId[NVML_DEVICE_PCI_BUS_ID_BUFFER_SIZE];
      for (int c=0; c<NVML_DEVICE_PCI_BUS_ID_BUFFER_SIZE; c++) {
        lowerId[c] = tolower(p[c]);
        if (p[c] == 0) break;
      }

      NCCLCHECK(xmlGetSubKv(gpuNode, "nvlink", &nvlNode, "target", lowerId));
      if (nvlNode == NULL) {
        NCCLCHECK(xmlAddNode(xml, gpuNode, "nvlink", &nvlNode));
        NCCLCHECK(xmlSetAttr(nvlNode, "target", lowerId));
        NCCLCHECK(xmlSetAttrInt(nvlNode, "count", 1));
      } else {
        int count;
        NCCLCHECK(xmlGetAttrInt(nvlNode, "count", &count));
        NCCLCHECK(xmlSetAttrInt(nvlNode, "count", count+1));
      }
    }
  }
  
  // #3 Fill target pci classes
  for (int s=0; s<gpuNode->nSubs; s++) {
    struct ncclXmlNode* sub = gpuNode->subs[s];
    if (strcmp(sub->name, "nvlink") != 0) continue;
    int index;
    NCCLCHECK(xmlGetAttrIndex(sub, "tclass", &index));
    if (index == -1) {
      const char* busId;
      NCCLCHECK(xmlGetAttr(sub, "target", &busId));
      char* path;
      ncclDebugNoWarn = NCCL_GRAPH;
      getPciPath(busId, &path);
      ncclDebugNoWarn = 0;
      if (path == NULL || strcmp(busId, "fffffff:ffff:ff") == 0) {
        // Remote NVLink device is not visible inside this VM. Assume NVSwitch.
        NCCLCHECK(xmlSetAttr(sub, "tclass", "0x068000"));
      } else {
        NCCLCHECK(ncclTopoSetAttrFromSys(sub, path, "class", "tclass"));
        free(path);
      }
    }
  }
  *gpuNodeRet = gpuNode;
  return ncclSuccess;
}

The NVLink detection logic is summarized below: the links are queried through the NVML interface, and the resulting XML nodes are added to the physical topology graph.

Add NVLinks:
1. Check the GPU architecture (SM version) to determine the maximum number of NVLinks.
2. For each link, check its NVLink P2P capability, check that the link is active, and query the remote device's PCI info.
3. Fill in the target GPU's PCI class (tclass).

The example below shows the nvlink nodes that get created. The target attribute is the BDF address of the PCI device on the other end of the link, count is the number of NVLink links bonded to that target, and tclass is the target device's PCI class code. A class code consists of a base class, a subclass, and a programming interface: in 0x030200, the base class 0x03 means Display Controller, the subclass 0x02 means 3D Controller, and the programming interface is 0x00.
(Figure: PCI class code layout)

<pci busid="0000:03:00.0" class="0x030200" vendor="0x10de" device="0x20b0" subsystem_vendor="0x10de" subsystem_device="0x144e" link_speed="5.0 GT/s PCIe" link_width="32">
  <gpu dev="0" sm="80" rank="0" gdr="1">
    <nvlink target="0000:15:00.0" count="4" tclass="0x030200"/>
    <nvlink target="0000:05:00.0" count="4" tclass="0x030200"/>
    <nvlink target="0000:1e:00.0" count="4" tclass="0x030200"/>
  </gpu>
</pci>
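
As a hedged aside, the decomposition of the class code described above can be checked with a few shifts; the helper below is a standalone sketch with a made-up name, not part of NCCL.

#include <stdio.h>

// Split a PCI class code such as 0x030200 into its three fields.
static void printPciClassSketch(unsigned int tclass) {
  unsigned int baseClass = (tclass >> 16) & 0xff;  // 0x03: Display Controller
  unsigned int subClass  = (tclass >> 8)  & 0xff;  // 0x02: 3D Controller
  unsigned int progIf    =  tclass        & 0xff;  // 0x00
  printf("class=0x%06x base=0x%02x sub=0x%02x prog-if=0x%02x\n",
         tclass, baseClass, subClass, progIf);
}
// printPciClassSketch(0x030200) prints: class=0x030200 base=0x03 sub=0x02 prog-if=0x00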