LLMs / DeepSeek-V3 / CommunicationOpt: DeepEP — A Detailed Guide to Its Overview, Installation, Usage, and Applications
Overview of DeepEP
Released on February 25, 2025, DeepEP is a communication library tailored for Mixture-of-Experts (MoE) models and expert parallelism (EP). It provides high-throughput, low-latency all-to-all GPU kernels, also known as MoE dispatch and combine, and supports low-precision operations, including FP8. DeepEP is optimized for the group-limited gating algorithm proposed in the DeepSeek-V3 paper and offers a set of kernels tuned for asymmetric-domain bandwidth forwarding, such as forwarding data from the NVLink domain to the RDMA domain. These kernels deliver high throughput, suit both training and inference prefilling, and allow controlling the number of SMs (streaming multiprocessors) used. For latency-sensitive inference decoding, DeepEP includes a set of low-latency kernels that use pure RDMA to minimize latency. The library also introduces a hook-based method for overlapping communication with computation that occupies no SM resources. Note that the implementation in this library may differ slightly from the DeepSeek-V3 paper.
In short, DeepEP is an efficient expert-parallel communication library: it provides a high-performance communication solution for MoE models and supports optimizations such as low-precision arithmetic, asymmetric-domain bandwidth forwarding, and communication-computation overlap.
GitHub: https://github.com/deepseek-ai/DeepEP
1. Features
>> High throughput and low latency: high-throughput, low-latency all-to-all GPU kernels for MoE dispatch and combine.
>> Low-precision support: low-precision operations, including FP8, reduce computation and communication cost.
>> Asymmetric-domain bandwidth forwarding: kernels optimized for the group-limited gating algorithm from the DeepSeek-V3 paper forward data across asymmetric domains (e.g., NVLink to RDMA), raising throughput for training and inference prefilling.
>> SM count control: the number of SMs used can be controlled, adapting flexibly to different hardware budgets.
>> Low-latency kernels: for latency-sensitive inference decoding, a dedicated set of low-latency kernels uses pure RDMA to minimize latency.
>> Communication-computation overlap: a hook-based overlapping method hides communication behind computation without occupying any SM resources.
2. Performance
Normal kernels with NVLink and RDMA forwarding
The normal kernels were tested on H800 GPUs (NVLink maximum bandwidth about 160 GB/s), each connected to a CX7 InfiniBand 400 Gb/s RDMA network card (maximum bandwidth about 50 GB/s), following the DeepSeek-V3/R1 pretraining setting (4096 tokens per batch, 7168 hidden, top-4 groups, top-8 experts, FP8 dispatching and BF16 combining).
Type | Dispatch #EP | Bottleneck bandwidth | Combine #EP | Bottleneck bandwidth |
---|---|---|---|---|
Intranode | 8 | 153 GB/s (NVLink) | 8 | 158 GB/s (NVLink) |
Internode | 16 | 43 GB/s (RDMA) | 16 | 43 GB/s (RDMA) |
Internode | 32 | 44 GB/s (RDMA) | 32 | 47 GB/s (RDMA) |
Internode | 64 | 46 GB/s (RDMA) | 64 | 45 GB/s (RDMA) |
Low-latency kernels with pure RDMA
The low-latency kernels were tested on H800 GPUs, each connected to a CX7 InfiniBand 400 Gb/s RDMA network card (maximum bandwidth about 50 GB/s), following a typical DeepSeek-V3/R1 production setting (128 tokens per batch, 7168 hidden, top-8 experts, FP8 dispatching and BF16 combining).
Dispatch #EP | Latency | RDMA bandwidth | Combine #EP | Latency | RDMA bandwidth |
---|---|---|---|---|---|
8 | 163 us | 46 GB/s | 8 | 318 us | 46 GB/s |
16 | 173 us | 43 GB/s | 16 | 329 us | 44 GB/s |
32 | 182 us | 41 GB/s | 32 | 350 us | 41 GB/s |
64 | 186 us | 40 GB/s | 64 | 353 us | 41 GB/s |
128 | 192 us | 39 GB/s | 128 | 369 us | 39 GB/s |
256 | 194 us | 39 GB/s | 256 | 360 us | 40 GB/s |
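As a rough sanity check (not part of the DeepEP benchmarks; it assumes every token is sent to all 8 selected experts, 1 byte per element for FP8 dispatch and 2 bytes for BF16 combine, and ignores scaling factors and packet headers), the measured latencies in the EP=8 row are close to payload size divided by the reported RDMA bandwidth:
# Back-of-the-envelope check of the EP=8 row above.
# Assumptions: each token goes to all 8 selected experts, FP8 = 1 B/element,
# BF16 = 2 B/element; FP8 scales and headers are ignored.
tokens, hidden, topk = 128, 7168, 8
dispatch_bytes = tokens * topk * hidden * 1   # ~7.3 MB sent per rank (FP8)
combine_bytes = tokens * topk * hidden * 2    # ~14.7 MB received per rank (BF16)

bandwidth = 46e9                              # 46 GB/s from the table
print(f"dispatch ~{dispatch_bytes / bandwidth * 1e6:.0f} us")  # ~160 us vs. 163 us measured
print(f"combine  ~{combine_bytes / bandwidth * 1e6:.0f} us")   # ~319 us vs. 318 us measured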
Installation and Usage of DeepEP
1. Installation
Prerequisites
Hopper-architecture GPUs (more architectures or devices may be supported later)
Python 3.8 or later
CUDA 12.3 or later
PyTorch 2.1 or later
NVLink for intranode communication
RDMA network for internode communication
Download and install the NVSHMEM dependency
DeepEP depends on a modified NVSHMEM; please follow the project's NVSHMEM installation guide.
Install
Point NVSHMEM_DIR at the installed NVSHMEM, then build and install the package:
NVSHMEM_DIR=/path/to/installed/nvshmem python setup.py install
After installation, import deep_ep in your Python project to use the library.
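As a quick post-install sanity check (a minimal sketch; the thresholds simply restate the prerequisites listed above), you can verify the environment and the import:
# Minimal sanity check: CUDA build of PyTorch, a Hopper GPU, and an importable deep_ep.
import torch

assert torch.version.cuda is not None, "a CUDA build of PyTorch is required"
assert torch.cuda.is_available(), "no CUDA device is visible"
assert torch.cuda.get_device_capability() >= (9, 0), "DeepEP currently targets Hopper (SM90) GPUs"

import deep_ep  # imported after the environment checks on purpose
print("deep_ep imported from:", deep_ep.__file__)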
Performance tests
The project ships test cases for the intranode, internode, and low-latency kernels:
python tests/test_intranode.py
python tests/test_internode.py
python tests/test_low_latency.py
Network configuration
DeepEP has been fully tested on InfiniBand networks and is, in principle, also compatible with RDMA over Converged Ethernet (RoCE). Recommendations for traffic isolation, adaptive routing, and congestion control are given below.
Traffic isolation
InfiniBand supports traffic isolation through Virtual Lanes (VL).
To prevent interference between different kinds of traffic, we recommend distributing workloads across virtual lanes as follows:
workloads using the normal kernels
workloads using the low-latency kernels
other workloads
For DeepEP, you can control the virtual lane assignment by setting the NVSHMEM_IB_SL environment variable, as in the sketch below.
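For illustration (the service-level values here are hypothetical, and how service levels map onto virtual lanes depends on your fabric configuration), the variable can be set from Python before any DeepEP buffer, and therefore NVSHMEM, is initialized:
# Illustrative only: pick service levels that your fabric maps to separate virtual lanes.
import os

# Set before constructing any deep_ep.Buffer (NVSHMEM reads it at initialization).
os.environ.setdefault("NVSHMEM_IB_SL", "0")    # e.g. a process using the normal kernels

# A separate decoding service using the low-latency kernels might instead use:
# os.environ.setdefault("NVSHMEM_IB_SL", "1")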
Adaptive routing
Adaptive routing is an advanced routing feature of InfiniBand switches that spreads traffic evenly across multiple paths. Currently the low-latency kernels support adaptive routing, while the normal kernels do not yet (support may be added soon); enabling adaptive routing for the normal internode kernels may lead to deadlocks or data corruption.
For the low-latency kernels, enabling adaptive routing can fully eliminate network congestion caused by routing conflicts, but it also introduces extra latency. For the best performance, we recommend:
enabling adaptive routing in environments with heavy network load
using static routing in environments with light network load
Congestion control
Congestion control is disabled by default, since no significant congestion has been observed in our production environment.
2. Usage
The project provides example code for model training / inference prefilling and for inference decoding.
Model training / inference prefilling
The example below shows how to use the normal kernels for communication in model training or the inference prefilling phase (without the backward part). It relies on the Buffer and EventOverlap classes for buffer management and communication-computation overlap: Buffer.set_num_sms() sets the number of SMs to use, get_dispatch_layout() computes the dispatch layout, Buffer.dispatch() performs the MoE dispatch, and Buffer.combine() performs the MoE combine. EventOverlap manages CUDA events so communication can be overlapped with computation. (A sketch of wiring these helpers into an MoE forward pass follows the example.)
import torch
import torch.distributed as dist
from typing import List, Tuple, Optional, Union

from deep_ep import Buffer, EventOverlap

# Communication buffer (will allocate at runtime)
_buffer: Optional[Buffer] = None

# Set the number of SMs to use
# NOTES: this is a static variable
Buffer.set_num_sms(24)


# You may call this function at the framework initialization
def get_buffer(group: dist.ProcessGroup, hidden_bytes: int) -> Buffer:
    global _buffer

    # NOTES: you may also replace `get_*_config` with your auto-tuned results via all the tests
    num_nvl_bytes, num_rdma_bytes = 0, 0
    for config in (Buffer.get_dispatch_config(group.size()), Buffer.get_combine_config(group.size())):
        num_nvl_bytes = max(config.get_nvl_buffer_size_hint(hidden_bytes, group.size()), num_nvl_bytes)
        num_rdma_bytes = max(config.get_rdma_buffer_size_hint(hidden_bytes, group.size()), num_rdma_bytes)

    # Allocate a buffer if not existed or not enough buffer size
    # NOTES: the adaptive routing configuration of the network **must be off**
    if _buffer is None or _buffer.group != group or _buffer.num_nvl_bytes < num_nvl_bytes or _buffer.num_rdma_bytes < num_rdma_bytes:
        _buffer = Buffer(group, num_nvl_bytes, num_rdma_bytes)
    return _buffer


def get_hidden_bytes(x: torch.Tensor) -> int:
    t = x[0] if isinstance(x, tuple) else x
    return t.size(1) * max(t.element_size(), 2)


def dispatch_forward(x: Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]],
                     topk_idx: torch.Tensor, topk_weights: torch.Tensor,
                     num_experts: int, previous_event: Optional[EventOverlap] = None) -> \
        Tuple[Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]], torch.Tensor, torch.Tensor, List, Tuple, EventOverlap]:
    # NOTES: an optional `previous_event` means a CUDA event captured that you want to make it as a dependency
    # of the dispatch kernel, it may be useful with communication-computation overlap. For more information, please
    # refer to the docs of `Buffer.dispatch`
    global _buffer

    # Calculate layout before actual dispatch
    num_tokens_per_rank, num_tokens_per_rdma_rank, num_tokens_per_expert, is_token_in_rank, previous_event = \
        _buffer.get_dispatch_layout(topk_idx, num_experts,
                                    previous_event=previous_event, async_finish=True,
                                    allocate_on_comm_stream=previous_event is not None)

    # Do MoE dispatch
    # NOTES: the CPU will wait for GPU's signal to arrive, so this is not compatible with CUDA graph
    # For more advanced usages, please refer to the docs of the `dispatch` function
    recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list, handle, event = \
        _buffer.dispatch(x, topk_idx=topk_idx, topk_weights=topk_weights,
                         num_tokens_per_rank=num_tokens_per_rank, num_tokens_per_rdma_rank=num_tokens_per_rdma_rank,
                         is_token_in_rank=is_token_in_rank, num_tokens_per_expert=num_tokens_per_expert,
                         previous_event=previous_event, async_finish=True,
                         allocate_on_comm_stream=True)

    # For event management, please refer to the docs of the `EventOverlap` class
    return recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list, handle, event


def dispatch_backward(grad_recv_x: torch.Tensor, grad_recv_topk_weights: torch.Tensor, handle: Tuple) -> \
        Tuple[torch.Tensor, torch.Tensor, EventOverlap]:
    global _buffer

    # The backward process of MoE dispatch is actually a combine
    # For more advanced usages, please refer to the docs of the `combine` function
    combined_grad_x, combined_grad_recv_topk_weights, event = \
        _buffer.combine(grad_recv_x, handle, topk_weights=grad_recv_topk_weights, async_finish=True)

    # For event management, please refer to the docs of the `EventOverlap` class
    return combined_grad_x, combined_grad_recv_topk_weights, event


def combine_forward(x: torch.Tensor, handle: Tuple, previous_event: Optional[EventOverlap] = None) -> \
        Tuple[torch.Tensor, EventOverlap]:
    global _buffer

    # Do MoE combine
    # For more advanced usages, please refer to the docs of the `combine` function
    combined_x, _, event = _buffer.combine(x, handle, async_finish=True, previous_event=previous_event,
                                           allocate_on_comm_stream=previous_event is not None)

    # For event management, please refer to the docs of the `EventOverlap` class
    return combined_x, event


def combine_backward(grad_combined_x: Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]],
                     handle: Tuple, previous_event: Optional[EventOverlap] = None) -> \
        Tuple[Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]], EventOverlap]:
    global _buffer

    # The backward process of MoE combine is actually a dispatch
    # For more advanced usages, please refer to the docs of the `dispatch` function
    grad_x, _, _, _, _, event = _buffer.dispatch(grad_combined_x, handle=handle, async_finish=True,
                                                 previous_event=previous_event,
                                                 allocate_on_comm_stream=previous_event is not None)

    # For event management, please refer to the docs of the `EventOverlap` class
    return grad_x, event
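For orientation, here is a minimal, hypothetical sketch of wiring the helpers above into an MoE layer's forward pass. run_experts stands in for your own expert computation; only get_buffer, get_hidden_bytes, dispatch_forward, and combine_forward come from the example above, and current_stream_wait() is assumed to be the EventOverlap synchronization call (check the EventOverlap docs for the exact API):
# Hypothetical MoE forward sketch built on the helpers defined above.
import torch
import torch.distributed as dist


def moe_forward(x: torch.Tensor, group: dist.ProcessGroup,
                topk_idx: torch.Tensor, topk_weights: torch.Tensor,
                num_experts: int, run_experts) -> torch.Tensor:
    # Make sure the communication buffer is large enough for this hidden size
    get_buffer(group, get_hidden_bytes(x))

    # Send every token to the ranks hosting its top-k experts
    recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list, handle, event = \
        dispatch_forward(x, topk_idx, topk_weights, num_experts)
    # Other computation could be overlapped here before waiting on the event
    event.current_stream_wait()

    # Apply the local experts to the received tokens (user-defined placeholder)
    expert_out = run_experts(recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list)

    # Gather the expert outputs back to the original token order
    combined_x, combine_event = combine_forward(expert_out, handle)
    combine_event.current_stream_wait()
    return combined_x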
Inference decoding
The example below shows how to use the low-latency kernels for communication in the inference decoding phase. It again uses the Buffer class, now constructed with low_latency_mode=True, and communicates through low_latency_dispatch() and low_latency_combine(). Passing return_recv_hook=True returns a receive hook, which enables double-batch overlapping without occupying any SM resources. (A sketch of that overlapping pattern follows the example.)
import torch
import torch.distributed as dist
from typing import Tuple, Optional

from deep_ep import Buffer

# Communication buffer (will allocate at runtime)
# NOTES: there is no SM control API for the low-latency kernels
_buffer: Optional[Buffer] = None


# You may call this function at the framework initialization
def get_buffer(group: dist.ProcessGroup, num_max_dispatch_tokens_per_rank: int, hidden: int, num_experts: int) -> Buffer:
    # NOTES: the low-latency mode will consume much more space than the normal mode
    # So we recommend that `num_max_dispatch_tokens_per_rank` (the actual batch size in the decoding engine) should be less than 256
    global _buffer

    num_rdma_bytes = Buffer.get_low_latency_rdma_size_hint(num_max_dispatch_tokens_per_rank, hidden, group.size(), num_experts)

    # Allocate a buffer if not existed or not enough buffer size
    if _buffer is None or _buffer.group != group or not _buffer.low_latency_mode or _buffer.num_rdma_bytes < num_rdma_bytes:
        # NOTES: for best performance, the QP number **must** be equal to the number of the local experts
        assert num_experts % group.size() == 0
        _buffer = Buffer(group, 0, num_rdma_bytes, low_latency_mode=True, num_qps_per_rank=num_experts // group.size())
    return _buffer


def low_latency_dispatch(hidden_states: torch.Tensor, topk_idx: torch.Tensor, num_max_dispatch_tokens_per_rank: int, num_experts: int):
    global _buffer

    # Do MoE dispatch, compatible with CUDA graph (but you may restore some buffer status once you replay)
    recv_hidden_states, recv_expert_count, handle, event, hook = \
        _buffer.low_latency_dispatch(hidden_states, topk_idx, num_max_dispatch_tokens_per_rank, num_experts,
                                     async_finish=False, return_recv_hook=True)

    # NOTES: the actual tensor will not be received until you call `hook()`,
    # which is useful for double-batch overlapping, but **without any SM occupation**
    # If you don't want to overlap, please set `return_recv_hook=False`
    # Later, you can use our GEMM library to do the computation with this specific format
    return recv_hidden_states, recv_expert_count, handle, event, hook


def low_latency_combine(hidden_states: torch.Tensor,
                        topk_idx: torch.Tensor, topk_weights: torch.Tensor, handle: Tuple):
    global _buffer

    # Do MoE combine, compatible with CUDA graph (but you may restore some buffer status once you replay)
    combined_hidden_states, event_overlap, hook = \
        _buffer.low_latency_combine(hidden_states, topk_idx, topk_weights, handle,
                                    async_finish=False, return_recv_hook=True)

    # NOTES: the same behavior as described in the dispatch kernel
    return combined_hidden_states, event_overlap, hook
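A minimal, hypothetical sketch of the double-batch overlapping that the receive hooks enable follows; attention and moe_gemm are placeholders for your own kernels, and only low_latency_dispatch and low_latency_combine come from the example above:
# Hypothetical double-batch overlapping sketch built on the helpers defined above.
import torch


def decode_step(hidden_a: torch.Tensor, topk_idx_a: torch.Tensor, topk_weights_a: torch.Tensor,
                hidden_b: torch.Tensor, num_max_dispatch_tokens_per_rank: int, num_experts: int,
                attention, moe_gemm):
    # Issue the RDMA dispatch for micro-batch A; nothing is received yet
    recv_a, recv_count_a, handle_a, _, dispatch_hook = \
        low_latency_dispatch(hidden_a, topk_idx_a, num_max_dispatch_tokens_per_rank, num_experts)

    # Overlap: run unrelated computation (e.g. attention for micro-batch B);
    # the pending receive occupies no SMs while this runs
    hidden_b = attention(hidden_b)

    # Now actually receive A's tokens and run the local experts (user-defined placeholder)
    dispatch_hook()
    expert_out_a = moe_gemm(recv_a, recv_count_a)

    # Issue the combine for A; more overlapped work could go here before `combine_hook()`
    combined_a, _, combine_hook = \
        low_latency_combine(expert_out_a, topk_idx_a, topk_weights_a, handle_a)
    combine_hook()
    return combined_a, hidden_b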
3. Other notes
For extreme performance, the project uses PTX instructions with undefined behavior, which may not work correctly on other platforms; this can be disabled by setting DISABLE_AGGRESSIVE_PTX_INSTRS=1 when building. For the best performance, run all the tests on your own cluster and use the best auto-tuned configurations.
Application Examples of DeepEP
The example code above demonstrates DeepEP in model training / inference prefilling and in inference decoding: it shows how to perform MoE dispatch and combine with the provided functions and how to use communication-computation overlap to improve efficiency. The performance numbers show that, on H800 GPU clusters, DeepEP sustains high bandwidth and low latency across a wide range of expert-parallel sizes.