LLMs / DeepSeek-V3 / Communication Optimization: A Detailed Guide to DeepEP: Introduction, Installation and Usage, and Case Applications

Table of Contents

Introduction to DeepEP
    1. Features
    2. Performance
        Normal kernels with NVLink and RDMA forwarding
        Low-latency kernels with pure RDMA
Installing and Using DeepEP
    1. Installation
        Prerequisites
        Downloading and installing the NVSHMEM dependency
        Installation
        Performance tests
        Network configuration
        Traffic isolation
        Adaptive routing
        Congestion control
    2. Usage
        Model training / inference prefilling
        Inference decoding
    3. Additional notes
Case Applications of DeepEP


Introduction to DeepEP

Released on February 25, 2025, DeepEP is a communication library tailored for Mixture-of-Experts (MoE) models and expert parallelism (EP). It provides high-throughput, low-latency all-to-all GPU kernels, also known as MoE dispatch and combine, and supports low-precision operations, including FP8. DeepEP is optimized for the group-limited gating algorithm proposed in the DeepSeek-V3 paper and provides a set of kernels optimized for asymmetric-domain bandwidth forwarding, for example forwarding data from the NVLink domain to the RDMA domain. These kernels deliver high throughput, making them suitable for training and inference prefilling tasks, and they support controlling the number of SMs (streaming multiprocessors) used. For latency-sensitive inference decoding, DeepEP includes a set of low-latency kernels using pure RDMA to minimize latency. The library also introduces a hook-based communication-computation overlapping method that does not occupy any SM resources. Note that the implementation in this library may differ slightly from the DeepSeek-V3 paper.
In short, DeepEP is an efficient expert-parallel communication library that offers a high-performance communication solution for MoE models and supports multiple optimization techniques, such as low-precision operations, asymmetric-domain bandwidth forwarding, and communication-computation overlap.

GitHub: https://github.com/deepseek-ai/DeepEP

1. Features

>> High throughput and low latency: provides high-throughput, low-latency all-to-all GPU kernels for MoE dispatch and combine (see the toy sketch after this list for what dispatch and combine mean).
>> Low-precision support: supports low-precision operations, including FP8, reducing computation cost.
>> Asymmetric-domain bandwidth forwarding: optimized for the group-limited gating algorithm from the DeepSeek-V3 paper, with kernels tuned for forwarding between asymmetric bandwidth domains (e.g., NVLink to RDMA), improving throughput for training and inference prefilling tasks.
>> SM count control: supports controlling the number of SMs used, adapting flexibly to different hardware budgets.
>> Low-latency kernels: for latency-sensitive inference decoding, provides a set of pure-RDMA low-latency kernels that minimize latency.
>> Communication-computation overlap: introduces a hook-based overlapping method that occupies no SM resources, improving efficiency.
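
To make the dispatch/combine terminology concrete, the following is a toy, single-process PyTorch sketch of the semantics only; it does not use DeepEP and is not distributed, and the function and variable names are illustrative. In an expert-parallel setting, DeepEP performs the equivalent grouping and weighted summation as all-to-all communication across ranks.

import torch

def toy_moe_dispatch_combine(x, topk_idx, topk_weights, experts):
    # Toy single-GPU reference for the dispatch/combine semantics (not DeepEP):
    #   x:            [num_tokens, hidden]
    #   topk_idx:     [num_tokens, top_k]  index of each selected expert per token
    #   topk_weights: [num_tokens, top_k]  gating weight of each selected expert
    #   experts:      list of per-expert modules (e.g. small MLPs)
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        # "Dispatch": gather the tokens that were routed to expert e
        token_ids, slot_ids = (topk_idx == e).nonzero(as_tuple=True)
        if token_ids.numel() == 0:
            continue
        y = expert(x[token_ids])
        # "Combine": scatter-add the expert outputs back, weighted by the gate scores
        out.index_add_(0, token_ids, y * topk_weights[token_ids, slot_ids].unsqueeze(-1))
    return out

# Example: 6 tokens, hidden size 4, 3 experts, top-2 routing
x = torch.randn(6, 4)
topk_weights, topk_idx = torch.randn(6, 3).softmax(dim=-1).topk(2, dim=-1)
experts = [torch.nn.Linear(4, 4) for _ in range(3)]
y = toy_moe_dispatch_combine(x, topk_idx, topk_weights, experts)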

2. Performance

Normal kernels with NVLink and RDMA forwarding

The normal kernels were tested on H800 GPUs (maximum NVLink bandwidth of about 160 GB/s), with each H800 connected to one CX7 InfiniBand 400 Gb/s RDMA network card (maximum bandwidth of about 50 GB/s), following the DeepSeek-V3/R1 pretraining setting (4096 tokens per batch, 7168 hidden units, 4 groups, 8 experts, FP8 dispatching and BF16 combining).

| Type      | Dispatch #EP | Bottleneck bandwidth | Combine #EP | Bottleneck bandwidth |
|-----------|--------------|----------------------|-------------|----------------------|
| Intranode | 8            | 153 GB/s (NVLink)    | 8           | 158 GB/s (NVLink)    |
| Internode | 16           | 43 GB/s (RDMA)       | 16          | 43 GB/s (RDMA)       |
| Internode | 32           | 44 GB/s (RDMA)       | 32          | 47 GB/s (RDMA)       |
| Internode | 64           | 46 GB/s (RDMA)       | 64          | 45 GB/s (RDMA)       |

Low-latency kernels with pure RDMA

The low-latency kernels were tested on H800 GPUs, each connected to one CX7 InfiniBand 400 Gb/s RDMA network card (maximum bandwidth of about 50 GB/s), following a typical DeepSeek-V3/R1 production setting (128 tokens per batch, 7168 hidden units, 8 experts, FP8 dispatching and BF16 combining).

| Dispatch #EP | Latency | RDMA bandwidth | Combine #EP | Latency | RDMA bandwidth |
|--------------|---------|----------------|-------------|---------|----------------|
| 8            | 163 us  | 46 GB/s        | 8           | 318 us  | 46 GB/s        |
| 16           | 173 us  | 43 GB/s        | 16          | 329 us  | 44 GB/s        |
| 32           | 182 us  | 41 GB/s        | 32          | 350 us  | 41 GB/s        |
| 64           | 186 us  | 40 GB/s        | 64          | 353 us  | 41 GB/s        |
| 128          | 192 us  | 39 GB/s        | 128         | 369 us  | 39 GB/s        |
| 256          | 194 us  | 39 GB/s        | 256         | 360 us  | 40 GB/s        |

Installing and Using DeepEP

1. Installation

Prerequisites

Hopper-architecture GPUs (more architectures or devices may be supported later)
Python 3.8 and above
CUDA 12.3 and above
PyTorch 2.1 and above
NVLink for intranode communication
RDMA network for internode communication

Downloading and installing the NVSHMEM dependency

DeepEP depends on a modified version of NVSHMEM; please refer to the project's NVSHMEM installation guide.

Installation

Set the NVSHMEM environment variable, then install the package:

NVSHMEM_DIR=/path/to/installed/nvshmem python setup.py install

After installation, import deep_ep in your Python project to use it.
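
As a quick sanity check after installation (a minimal sketch, not part of the repository), you can verify that the extension imports and that CUDA and torch.distributed are available:

import torch
import torch.distributed as dist

import deep_ep  # succeeds only if the package was built against the modified NVSHMEM

# Buffer is the main entry point used in the usage examples later in this article
print(deep_ep.Buffer)
print(torch.cuda.is_available(), dist.is_available())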

Performance tests

The project ships test cases for the intranode, internode, and low-latency kernels:

python tests/test_intranode.py
python tests/test_internode.py
python tests/test_low_latency.py

Network configuration

DeepEP has been fully tested on InfiniBand networks and is, in principle, also compatible with RDMA over Converged Ethernet (RoCE). Traffic isolation via Virtual Lanes (VL) is supported; workloads using the normal kernels, workloads using the low-latency kernels, and other workloads are best isolated on different virtual lanes. The low-latency kernels support adaptive routing, while the normal kernels currently do not (support will be added soon). Adaptive routing is recommended on heavily loaded networks, and static routing on lightly loaded ones.

Traffic isolation

InfiniBand supports traffic isolation through Virtual Lanes (VL).
To prevent interference between different kinds of traffic, we recommend distributing workloads across virtual lanes as follows:
- Workloads using the normal kernels
- Workloads using the low-latency kernels
- Other workloads
For DeepEP, you can control the virtual-lane assignment by setting the NVSHMEM_IB_SL environment variable.
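
The article does not show where this variable is applied; as a hedged sketch (assuming NVSHMEM reads NVSHMEM_IB_SL when the communication buffer is first initialized, and that the chosen service level has been mapped to a dedicated virtual lane by your fabric administrator), it could be set in the launching process before any DeepEP buffer is created:

import os

# Hypothetical example: pin this job's DeepEP/NVSHMEM traffic to InfiniBand service level 1.
# The value must match a service level that your fabric maps to the desired virtual lane,
# and it must be set before the first deep_ep.Buffer is constructed.
os.environ["NVSHMEM_IB_SL"] = "1"

Exporting the variable in the job's launch script achieves the same effect.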

Adaptive routing

Adaptive routing is an advanced routing feature of InfiniBand switches that spreads traffic evenly across multiple paths. Currently, the low-latency kernels support adaptive routing, while the normal kernels do not yet (support may be added later). Enabling adaptive routing for the normal internode kernels may lead to deadlocks or data-corruption issues.
For the low-latency kernels, enabling adaptive routing can completely eliminate network congestion caused by routing conflicts, but it also introduces extra latency. For the best performance, we recommend the following configuration:
- Enable adaptive routing in heavily loaded network environments
- Use static routing in lightly loaded network environments

Congestion control

Congestion control is disabled by default, as no significant congestion has been observed in our production environment.

2. Usage

The project provides example code for model training / inference prefilling and for inference decoding.

Model training / inference prefilling

The example code below shows how to use the normal kernels for the communication in model training or the inference prefilling phase; the backward passes are also covered (the backward of a dispatch is a combine, and vice versa). The Buffer and EventOverlap classes handle buffer management and communication-computation overlap. Buffer.set_num_sms() sets the number of SMs to use, get_dispatch_layout() computes the dispatch layout, Buffer.dispatch() performs the MoE dispatch, and Buffer.combine() performs the MoE combine. The EventOverlap class manages CUDA events to realize communication-computation overlap. A usage sketch follows after the code block.

import torch
import torch.distributed as dist
from typing import List, Tuple, Optional, Union

from deep_ep import Buffer, EventOverlap

# Communication buffer (will allocate at runtime)
_buffer: Optional[Buffer] = None

# Set the number of SMs to use
# NOTES: this is a static variable
Buffer.set_num_sms(24)


# You may call this function at the framework initialization
def get_buffer(group: dist.ProcessGroup, hidden_bytes: int) -> Buffer:
    global _buffer
    
    # NOTES: you may also replace `get_*_config` with your auto-tuned results via all the tests
    num_nvl_bytes, num_rdma_bytes = 0, 0
    for config in (Buffer.get_dispatch_config(group.size()), Buffer.get_combine_config(group.size())):
        num_nvl_bytes = max(config.get_nvl_buffer_size_hint(hidden_bytes, group.size()), num_nvl_bytes)
        num_rdma_bytes = max(config.get_rdma_buffer_size_hint(hidden_bytes, group.size()), num_rdma_bytes)

    # Allocate a buffer if none exists yet or the existing one is too small
    # NOTES: the adaptive routing configuration of the network **must be off**
    if _buffer is None or _buffer.group != group or _buffer.num_nvl_bytes < num_nvl_bytes or _buffer.num_rdma_bytes < num_rdma_bytes:
        _buffer = Buffer(group, num_nvl_bytes, num_rdma_bytes)
    return _buffer


def get_hidden_bytes(x: torch.Tensor) -> int:
    t = x[0] if isinstance(x, tuple) else x
    return t.size(1) * max(t.element_size(), 2)


def dispatch_forward(x: Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]],
                     topk_idx: torch.Tensor, topk_weights: torch.Tensor,
                     num_experts: int, previous_event: Optional[EventOverlap] = None) -> \
        Tuple[Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]], torch.Tensor, torch.Tensor, List, Tuple, EventOverlap]:
    # NOTES: the optional `previous_event` is a captured CUDA event that the dispatch kernel
    # will wait on; this is useful for communication-computation overlap. For more information,
    # please refer to the docs of `Buffer.dispatch`
    global _buffer

    # Calculate layout before actual dispatch
    num_tokens_per_rank, num_tokens_per_rdma_rank, num_tokens_per_expert, is_token_in_rank, previous_event = \
        _buffer.get_dispatch_layout(topk_idx, num_experts,
                                    previous_event=previous_event, async_finish=True,
                                    allocate_on_comm_stream=previous_event is not None)
    # Do MoE dispatch
    # NOTES: the CPU will wait for GPU's signal to arrive, so this is not compatible with CUDA graph
    # For more advanced usages, please refer to the docs of the `dispatch` function
    recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list, handle, event = \
        _buffer.dispatch(x, topk_idx=topk_idx, topk_weights=topk_weights,
                         num_tokens_per_rank=num_tokens_per_rank, num_tokens_per_rdma_rank=num_tokens_per_rdma_rank,
                         is_token_in_rank=is_token_in_rank, num_tokens_per_expert=num_tokens_per_expert,
                         previous_event=previous_event, async_finish=True,
                         allocate_on_comm_stream=True)
    # For event management, please refer to the docs of the `EventOverlap` class
    return recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list, handle, event


def dispatch_backward(grad_recv_x: torch.Tensor, grad_recv_topk_weights: torch.Tensor, handle: Tuple) -> \
        Tuple[torch.Tensor, torch.Tensor, EventOverlap]:
    global _buffer

    # The backward process of MoE dispatch is actually a combine
    # For more advanced usages, please refer to the docs of the `combine` function
    combined_grad_x, combined_grad_recv_topk_weights, event = \
        _buffer.combine(grad_recv_x, handle, topk_weights=grad_recv_topk_weights, async_finish=True)

    # For event management, please refer to the docs of the `EventOverlap` class
    return combined_grad_x, combined_grad_recv_topk_weights, event


def combine_forward(x: torch.Tensor, handle: Tuple, previous_event: Optional[EventOverlap] = None) -> \
        Tuple[torch.Tensor, EventOverlap]:
    global _buffer

    # Do MoE combine
    # For more advanced usages, please refer to the docs of the `combine` function
    combined_x, _, event = _buffer.combine(x, handle, async_finish=True, previous_event=previous_event,
                                           allocate_on_comm_stream=previous_event is not None)

    # For event management, please refer to the docs of the `EventOverlap` class
    return combined_x, event


def combine_backward(grad_combined_x: Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]],
                     handle: Tuple, previous_event: Optional[EventOverlap] = None) -> \
        Tuple[Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]], EventOverlap]:
    global _buffer

    # The backward process of MoE combine is actually a dispatch
    # For more advanced usages, please refer to the docs of the `combine` function
    grad_x, _, _, _, _, event = _buffer.dispatch(grad_combined_x, handle=handle, async_finish=True,
                                                 previous_event=previous_event,
                                                 allocate_on_comm_stream=previous_event is not None)

    # For event management, please refer to the docs of the `EventOverlap` class
    return grad_x, event
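
To show how the helpers above fit into a single MoE layer's forward pass, here is a hedged usage sketch; gate and run_experts are hypothetical placeholders for your router and expert computation (they are not part of DeepEP), and event handling is simplified (see the EventOverlap docs for the full overlap pattern):

def moe_forward(x: torch.Tensor, gate, run_experts, num_experts: int) -> torch.Tensor:
    # Hypothetical glue code (not shipped with DeepEP): route tokens, dispatch them to the
    # ranks owning the selected experts, run the local experts, then combine the results.
    topk_weights, topk_idx = gate(x)  # e.g. both of shape [num_tokens, top_k]

    recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list, handle, event = \
        dispatch_forward(x, topk_idx, topk_weights, num_experts)
    # NOTES: with `async_finish=True`, synchronize on the returned `EventOverlap` before
    # consuming `recv_x` (or chain it as the `previous_event` of the next kernel)

    expert_out = run_experts(recv_x, recv_topk_idx, recv_topk_weights,
                             num_recv_tokens_per_expert_list)

    combined_x, combine_event = combine_forward(expert_out, handle)
    return combined_x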

Inference decoding

The example code below shows how to use the low-latency kernels for communication in the inference decoding phase. It again uses the Buffer class, but with low_latency_mode=True, and communicates through low_latency_dispatch() and low_latency_combine(). With return_recv_hook=True, a receive hook is returned, enabling double-batch overlapping without occupying any SM resources. A usage sketch of the hook-based overlap follows after the code block.

import torch
import torch.distributed as dist
from typing import Tuple, Optional

from deep_ep import Buffer

# Communication buffer (will allocate at runtime)
# NOTES: there is no SM control API for the low-latency kernels
_buffer: Optional[Buffer] = None


# You may call this function at the framework initialization
def get_buffer(group: dist.ProcessGroup, num_max_dispatch_tokens_per_rank: int, hidden: int, num_experts: int) -> Buffer:
    # NOTES: the low-latency mode will consume much more space than the normal mode
    # So we recommend that `num_max_dispatch_tokens_per_rank` (the actual batch size in the decoding engine) should be less than 256
    global _buffer
    num_rdma_bytes = Buffer.get_low_latency_rdma_size_hint(num_max_dispatch_tokens_per_rank, hidden, group.size(), num_experts)

    # Allocate a buffer if none exists yet or the existing one is too small
    if _buffer is None or _buffer.group != group or not _buffer.low_latency_mode or _buffer.num_rdma_bytes < num_rdma_bytes:
        # NOTES: for best performance, the QP number **must** be equal to the number of the local experts
        assert num_experts % group.size() == 0
        _buffer = Buffer(group, 0, num_rdma_bytes, low_latency_mode=True, num_qps_per_rank=num_experts // group.size())
    return _buffer


def low_latency_dispatch(hidden_states: torch.Tensor, topk_idx: torch.Tensor, num_max_dispatch_tokens_per_rank: int, num_experts: int):
    global _buffer

    # Do MoE dispatch, compatible with CUDA graph (but you may restore some buffer status once you replay)
    recv_hidden_states, recv_expert_count, handle, event, hook = \
        _buffer.low_latency_dispatch(hidden_states, topk_idx, num_max_dispatch_tokens_per_rank, num_experts,
                                     async_finish=False, return_recv_hook=True)

    # NOTES: the received tensors only become valid once you call `hook()`;
    # this enables double-batch overlapping **without any SM occupation**
    # If you don't want to overlap, please set `return_recv_hook=False`
    # Later, you can use our GEMM library to do the computation with this specific format
    return recv_hidden_states, recv_expert_count, handle, event, hook


def low_latency_combine(hidden_states: torch.Tensor,
                        topk_idx: torch.Tensor, topk_weights: torch.Tensor, handle: Tuple):
    global _buffer

    # Do MoE combine, compatible with CUDA graph (but you may restore some buffer status once you replay)
    combined_hidden_states, event_overlap, hook = \
        _buffer.low_latency_combine(hidden_states, topk_idx, topk_weights, handle,
                                    async_finish=False, return_recv_hook=True)

    # NOTES: the same behavior as described in the dispatch kernel
    return combined_hidden_states, event_overlap, hook
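
The receive hooks returned above are what make double-batch overlapping possible: communication for one micro-batch stays in flight while another micro-batch computes, without consuming any SMs. The following is a hedged sketch of that pattern; compute_other_batch and run_local_experts are hypothetical placeholders, not DeepEP APIs:

def overlapped_decode_step(hidden_states, topk_idx, topk_weights,
                           num_max_dispatch_tokens_per_rank, num_experts,
                           compute_other_batch, run_local_experts):
    # Hypothetical sketch: issue the dispatch, overlap unrelated computation while the
    # RDMA transfer is in flight, then call `hook()` so the received tensors become valid.
    recv_hidden_states, recv_expert_count, handle, event, hook = \
        low_latency_dispatch(hidden_states, topk_idx,
                             num_max_dispatch_tokens_per_rank, num_experts)

    compute_other_batch()  # e.g. attention of the other micro-batch, overlapped with communication

    hook()                 # the received tensors are only ready to read after this call
    expert_out = run_local_experts(recv_hidden_states, recv_expert_count)

    combined, event_overlap, combine_hook = \
        low_latency_combine(expert_out, topk_idx, topk_weights, handle)
    combine_hook()         # likewise, wait for the combined result before using it
    return combined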

3. Additional notes

To achieve extreme performance, the project uses PTX instructions whose behavior is undefined, so they may not work correctly on other platforms; this can be disabled by setting DISABLE_AGGRESSIVE_PTX_INSTRS=1. It is also recommended to run all the tests on your own cluster and to use the best auto-tuned configurations for optimal performance.
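
The article does not show where this variable is set; mirroring the install command above, one plausible (unverified) way is to export it when building the package:

DISABLE_AGGRESSIVE_PTX_INSTRS=1 NVSHMEM_DIR=/path/to/installed/nvshmem python setup.py install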

Case Applications of DeepEP

The example code demonstrates how DeepEP is used for model training / inference prefilling and for inference decoding, detailing how to perform MoE dispatch and combine with the provided functions and how to exploit communication-computation overlap to improve efficiency. The performance data show that on H800 GPU clusters, DeepEP achieves high bandwidth and low latency across different expert-parallel scales.
