NCCL Troubleshooting (1)


Official documentation: http://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/index.html#troubleshooting

========================================================================

5. Troubleshooting

Ensure you are familiar with the following known issues and useful debugging strategies.

5.1. Errors
NCCL calls may return a variety of return codes. Ensure that the return codes are always equal to ncclSuccess. If any call fails and returns a value different from ncclSuccess, setting NCCL_DEBUG to WARN will make NCCL print an explicit warning message before returning the error.
Errors are grouped into different categories.
  • ncclUnhandledCudaError and ncclSystemError indicate that a call to an external library failed.
  • ncclInvalidArgument and ncclInvalidUsage indicate there was a programming error in the application using NCCL.
In either case, refer to the NCCL warning message to understand how to resolve the problem.
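
As a minimal sketch (the NCCLCHECK macro and the example call are assumptions for illustration, not part of the NCCL API), each NCCL call can be wrapped in a return-code check so that anything other than ncclSuccess is reported immediately; running the program with NCCL_DEBUG=WARN set in the environment additionally makes NCCL print its own warning before the error is returned:

    #include <stdio.h>
    #include <stdlib.h>
    #include <nccl.h>

    /* Hypothetical helper: abort with a readable message whenever an NCCL
       call returns something other than ncclSuccess. */
    #define NCCLCHECK(cmd) do {                                        \
        ncclResult_t res = (cmd);                                      \
        if (res != ncclSuccess) {                                      \
            fprintf(stderr, "NCCL failure %s:%d '%s'\n",               \
                    __FILE__, __LINE__, ncclGetErrorString(res));      \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

    /* Example use (communicator setup elided):
       NCCLCHECK(ncclAllReduce(sendbuff, recvbuff, count, ncclFloat,
                               ncclSum, comm, stream));                */
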
5.2. Networking Issues
5.2.1. IP Network Interfaces

NCCL auto-detects which network interfaces to use for inter-node communication. If some interfaces are in the up state but are not able to communicate between nodes, NCCL may try to use them anyway and therefore fail during the init functions or even hang.
For more information about how to specify which interfaces to use, see NCCL Knobs topic, particularly the NCCL_SOCKET_IFNAME knob.
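
As an illustration only (the interface name eth0 below is a placeholder, not a recommendation), the restriction can be applied either by exporting NCCL_SOCKET_IFNAME in the shell before launching the job, or programmatically before the communicator is created:

    #include <stdlib.h>

    int main(void) {
        /* "eth0" is a placeholder; pick an interface that can actually reach
           the other nodes. The variable must be set before NCCL is
           initialized (e.g. before ncclCommInitRank) so that NCCL sees it. */
        setenv("NCCL_SOCKET_IFNAME", "eth0", 1);
        /* ... NCCL setup and communication would follow here ... */
        return 0;
    }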

5.2.2. InfiniBand

Before running NCCL on InfiniBand, running low-level InfiniBand tests (and in particular the ib_write_bw test) can help verify which nodes are able to communicate properly.

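For example, assuming two nodes with the placeholder hostnames node0 and node1, a basic point-to-point check with the perftest tools can be run with one node acting as server and the other as client:

    node0$ ib_write_bw
    node1$ ib_write_bw node0

If this low-level test already fails or reports poor bandwidth between a pair of nodes, the problem lies in the InfiniBand fabric or its configuration rather than in NCCL.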

5.3. Known Issues
Ensure you are familiar with the following known issues:

Sharing Data
In order to share data between ranks, NCCL may require shared system memory for IPC and pinned (page-locked) system memory resources. The operating system's limits on these resources may need to be increased accordingly. Please see your system's documentation for details. In particular, Docker® containers default to limited shared and pinned memory resources. When using NCCL inside a container, it is recommended that you increase these resources by issuing:

--shm-size=1g --ulimit memlock=-1
in the command line to nvidia-docker run.
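
Put together, a container launch with the increased limits might look like the following sketch, where <image> and the trailing command are placeholders for your actual image and entry point:

    nvidia-docker run --shm-size=1g --ulimit memlock=-1 -it <image> /bin/bash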

Concurrency between NCCL and CUDA calls (NCCL up to 2.0.5 or CUDA 8)
NCCL uses CUDA kernels to perform inter-GPU communication. The NCCL kernels synchronize with each other; therefore, each kernel requires the kernels on the other GPUs to also be executed in order to complete. The application should therefore make sure that nothing prevents the NCCL kernels from being executed concurrently on the different devices of a NCCL communicator.


For example, let's say you have a process that manages multiple CUDA devices and also features a thread which calls CUDA functions asynchronously. In this case, CUDA calls could be executed between the enqueuing of two NCCL kernels. The CUDA call may wait for the first NCCL kernel to complete and prevent the second one from being launched, causing a deadlock since the first kernel will not complete until the second one is executed. To avoid this issue, one solution is to have a lock around the NCCL launch on multiple devices (around ncclGroupStart and ncclGroupEnd when using a single thread, around the NCCL launch when using multiple threads, using thread synchronization if necessary) and take this lock when calling CUDA from the asynchronous thread, as in the sketch below.

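The following is a rough sketch of that locking scheme (the mutex, function names, and buffers are illustrative and not part of the NCCL API), assuming one thread launches an all-reduce across all devices of the communicator while another thread issues unrelated asynchronous CUDA calls:

    #include <pthread.h>
    #include <cuda_runtime.h>
    #include <nccl.h>

    static pthread_mutex_t launchLock = PTHREAD_MUTEX_INITIALIZER;

    /* Thread A: enqueue one NCCL kernel per device as a single group,
       holding the lock for the whole launch. */
    void launchAllReduce(ncclComm_t* comms, cudaStream_t* streams,
                         const float** sendbuf, float** recvbuf,
                         size_t count, int nDev) {
        pthread_mutex_lock(&launchLock);
        ncclGroupStart();
        for (int i = 0; i < nDev; ++i) {
            cudaSetDevice(i);
            ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                          comms[i], streams[i]);
        }
        ncclGroupEnd();
        pthread_mutex_unlock(&launchLock);
    }

    /* Thread B: any asynchronous CUDA work takes the same lock, so it can
       never be inserted between the enqueuing of two NCCL kernels that
       belong to the same group. */
    void asyncCudaWork(void* devPtr, size_t bytes, cudaStream_t stream) {
        pthread_mutex_lock(&launchLock);
        cudaMemsetAsync(devPtr, 0, bytes, stream);
        pthread_mutex_unlock(&launchLock);
    }

Because both threads take the same lock, the asynchronous CUDA call can no longer land between two NCCL launches of the same group, which is what caused the deadlock.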

Starting with NCCL 2.1.0, this issue is no longer present when using CUDA 9, unless Cooperative Group Launch is disabled with the NCCL_LAUNCH_MODE=PARALLEL setting.



