LLMs / DeepSeek-V3 / CommunicationOpt: DeepEP — A Detailed Guide to Its Overview, Installation, Usage, and Applications
Overview of DeepEP
Released on February 25, 2025, DeepEP is a communication library tailored for Mixture-of-Experts (MoE) models and expert parallelism (EP). It provides high-throughput, low-latency all-to-all GPU kernels, also known as MoE dispatch and combine, and supports low-precision operations, including FP8. DeepEP is optimized for the group-limited gating algorithm proposed in the DeepSeek-V3 paper and offers a set of kernels tuned for asymmetric-domain bandwidth forwarding, such as forwarding data from the NVLink domain to the RDMA domain. These kernels deliver high throughput, suit both training and inference prefilling, and allow controlling the number of SMs (streaming multiprocessors) used. For latency-sensitive inference decoding, DeepEP includes a set of low-latency kernels that use pure RDMA to minimize latency. The library also introduces a hook-based method for overlapping communication with computation that occupies no SM resources. Note that the implementation in this library may differ slightly from the DeepSeek-V3 paper.
In short, DeepEP is an efficient expert-parallel communication library: it provides a high-performance communication solution for MoE models and supports optimizations such as low-precision arithmetic, asymmetric-domain bandwidth forwarding, and communication-computation overlap.
GitHub: https://github.com/deepseek-ai/DeepEP
1. Features
>> High throughput and low latency: high-throughput, low-latency all-to-all GPU kernels for MoE dispatch and combine.
>> Low-precision support: low-precision operations, including FP8, reduce computation and communication cost.
>> Asymmetric-domain bandwidth forwarding: kernels optimized for the group-limited gating algorithm from the DeepSeek-V3 paper forward data across asymmetric domains (e.g., NVLink to RDMA), raising throughput for training and inference prefilling.
>> SM count control: the number of SMs used can be controlled, adapting flexibly to different hardware budgets.
>> Low-latency kernels: for latency-sensitive inference decoding, a dedicated set of low-latency kernels uses pure RDMA to minimize latency.
>> Communication-computation overlap: a hook-based overlapping method hides communication behind computation without occupying any SM resources.
2. Performance
Normal kernels with NVLink and RDMA forwarding
The normal kernels were tested on H800 GPUs (NVLink maximum bandwidth about 160 GB/s), each connected to a CX7 InfiniBand 400 Gb/s RDMA network card (maximum bandwidth about 50 GB/s), following the DeepSeek-V3/R1 pretraining setting (4096 tokens per batch, 7168 hidden, top-4 groups, top-8 experts, FP8 dispatching and BF16 combining).
Type | Dispatch #EP | Bottleneck bandwidth | Combine #EP | Bottleneck bandwidth |
---|---|---|---|---|
Intranode | 8 | 153 GB/s (NVLink) | 8 | 158 GB/s (NVLink) |
Internode | 16 | 43 GB/s (RDMA) | 16 | 43 GB/s (RDMA) |
Internode | 32 | 44 GB/s (RDMA) | 32 | 47 GB/s (RDMA) |
Internode | 64 | 46 GB/s (RDMA) | 64 | 45 GB/s (RDMA) |
Low-latency kernels with pure RDMA
The low-latency kernels were tested on H800 GPUs, each connected to a CX7 InfiniBand 400 Gb/s RDMA network card (maximum bandwidth about 50 GB/s), following a typical DeepSeek-V3/R1 production setting (128 tokens per batch, 7168 hidden, top-8 experts, FP8 dispatching and BF16 combining).
Dispatch #EP | Latency | RDMA bandwidth | Combine #EP | Latency | RDMA bandwidth |
---|---|---|---|---|---|
8 | 163 us | 46 GB/s | 8 | 318 us | 46 GB/s |
16 | 173 us | 43 GB/s | 16 | 329 us | 44 GB/s |
32 | 182 us | 41 GB/s | 32 | 350 us | 41 GB/s |
64 | 186 us | 40 GB/s | 64 | 353 us | 41 GB/s |
128 | 192 us | 39 GB/s | 128 | 369 us | 39 GB/s |
256 | 194 us | 39 GB/s | 256 | 360 us | 40 GB/s |
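As a rough sanity check (not part of the DeepEP benchmarks; it assumes every token is sent to all 8 selected experts, 1 byte per element for FP8 dispatch and 2 bytes for BF16 combine, and ignores scaling factors and packet headers), the measured latencies in the EP=8 row are close to payload size divided by the reported RDMA bandwidth:
# Back-of-the-envelope check of the EP=8 row above.
# Assumptions: each token goes to all 8 selected experts, FP8 = 1 B/element,
# BF16 = 2 B/element; FP8 scales and headers are ignored.
tokens, hidden, topk = 128, 7168, 8
dispatch_bytes = tokens * topk * hidden * 1   # ~7.3 MB sent per rank (FP8)
combine_bytes = tokens * topk * hidden * 2    # ~14.7 MB received per rank (BF16)

bandwidth = 46e9                              # 46 GB/s from the table
print(f"dispatch ~{dispatch_bytes / bandwidth * 1e6:.0f} us")  # ~160 us vs. 163 us measured
print(f"combine  ~{combine_bytes / bandwidth * 1e6:.0f} us")   # ~319 us vs. 318 us measured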
Installation and Usage of DeepEP
1. Installation
Prerequisites
Hopper-architecture GPUs (more architectures or devices may be supported later)
Python 3.8 or later
CUDA 12.3 or later
PyTorch 2.1 or later
NVLink for intranode communication
RDMA network for internode communication
Download and install the NVSHMEM dependency
DeepEP depends on a modified NVSHMEM; please follow the project's NVSHMEM installation guide.
Install
Point NVSHMEM_DIR at the installed NVSHMEM, then build and install the package:
NVSHMEM_DIR=/path/to/installed/nvshmem python setup.py install
After installation, import deep_ep in your Python project to use the library.
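As a quick post-install sanity check (a minimal sketch; the thresholds simply restate the prerequisites listed above), you can verify the environment and the import:
# Minimal sanity check: CUDA build of PyTorch, a Hopper GPU, and an importable deep_ep.
import torch

assert torch.version.cuda is not None, "a CUDA build of PyTorch is required"
assert torch.cuda.is_available(), "no CUDA device is visible"
assert torch.cuda.get_device_capability() >= (9, 0), "DeepEP currently targets Hopper (SM90) GPUs"

import deep_ep  # imported after the environment checks on purpose
print("deep_ep imported from:", deep_ep.__file__)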
Performance tests
The project ships test cases for the intranode, internode, and low-latency kernels:
python tests/test_intranode.py
python tests/test_internode.py
python tests/test_low_latency.py
Network configuration
DeepEP has been fully tested on InfiniBand networks and is, in principle, also compatible with RDMA over Converged Ethernet (RoCE). Recommendations for traffic isolation, adaptive routing, and congestion control are given below.
Traffic isolation
InfiniBand supports traffic isolation through Virtual Lanes (VL).
To prevent interference between different kinds of traffic, we recommend distributing workloads across virtual lanes as follows:
workloads using the normal kernels
workloads using the low-latency kernels
other workloads
For DeepEP, you can control the virtual lane assignment by setting the NVSHMEM_IB_SL environment variable, as in the sketch below.
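For illustration (the service-level values here are hypothetical, and how service levels map onto virtual lanes depends on your fabric configuration), the variable can be set from Python before any DeepEP buffer, and therefore NVSHMEM, is initialized:
# Illustrative only: pick service levels that your fabric maps to separate virtual lanes.
import os

# Set before constructing any deep_ep.Buffer (NVSHMEM reads it at initialization).
os.environ.setdefault("NVSHMEM_IB_SL", "0")    # e.g. a process using the normal kernels

# A separate decoding service using the low-latency kernels might instead use:
# os.environ.setdefault("NVSHMEM_IB_SL", "1")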
Adaptive routing
Adaptive routing is an advanced routing feature of InfiniBand switches that spreads traffic evenly across multiple paths. Currently the low-latency kernels support adaptive routing, while the normal kernels do not yet (support may be added soon); enabling adaptive routing for the normal internode kernels may lead to deadlocks or data corruption.
For the low-latency kernels, enabling adaptive routing can fully eliminate network congestion caused by routing conflicts, but it also introduces extra latency. For the best performance, we recommend:
enabling adaptive routing in environments with heavy network load
using static routing in environments with light network load
Congestion control
Congestion control is disabled by default, since no significant congestion has been observed in our production environment.
2. Usage
The project provides example code for model training / inference prefilling and for inference decoding.
Model training / inference prefilling
The example below shows how to use the normal kernels for communication in model training or the inference prefilling phase (without the backward part). It relies on the Buffer and EventOverlap classes for buffer management and communication-computation overlap: Buffer.set_num_sms() sets the number of SMs to use, get_dispatch_layout() computes the dispatch layout, Buffer.dispatch() performs the MoE dispatch, and Buffer.combine() performs the MoE combine. EventOverlap manages CUDA events so communication can be overlapped with computation. (A sketch of wiring these helpers into an MoE forward pass follows the example.)
import torch
import torch.distributed as dist
from typing import List, Tuple, Optional, Union

from deep_ep import Buffer, EventOverlap

# Communication buffer (will allocate at runtime)
_buffer: Optional[Buffer] = None

# Set the number of SMs to use
# NOTES: this is a static variable
Buffer.set_num_sms(24)


# You may call this function at the framework initialization
def get_buffer(group: dist.ProcessGroup, hidden_bytes: int) -> Buffer:
    global _buffer

    # NOTES: you may also replace `get_*_config` with your auto-tuned results via all the tests
    num_nvl_bytes, num_rdma_bytes = 0, 0
    for config in (Buffer.get_dispatch_config(group.size()), Buffer.get_combine_config(group.size())):
        num_nvl_bytes = max(config.get_nvl_buffer_size_hint(hidden_bytes, group.size()), num_nvl_bytes)
        num_rdma_bytes = max(config.get_rdma_buffer_size_hint(hidden_bytes, group.size()), num_rdma_bytes)

    # Allocate a buffer if not existed or not enough buffer size
    # NOTES: the adaptive routing configuration of the network **must be off**
    if _buffer is None or _buffer.group != group or _buffer.num_nvl_bytes < num_nvl_bytes or _buffer.num_rdma_bytes < num_rdma_bytes:
        _buffer = Buffer(group, num_nvl_bytes, num_rdma_bytes)
    return _buffer


def get_hidden_bytes(x: torch.Tensor) -> int:
    t = x[0] if isinstance(x, tuple) else x
    return t.size(1) * max(t.element_size(), 2)


def dispatch_forward(x: Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]],
                     topk_idx: torch.Tensor, topk_weights: torch.Tensor,
                     num_experts: int, previous_event: Optional[EventOverlap] = None) -> \
        Tuple[Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]], torch.Tensor, torch.Tensor, List, Tuple, EventOverlap]:
    # NOTES: an optional `previous_event` means a CUDA event captured that you want to make it as a dependency
    # of the dispatch kernel, it may be useful with communication-computation overlap. For more information, please
    # refer to the docs of `Buffer.dispatch`
    global _buffer

    # Calculate layout before actual dispatch
    num_tokens_per_rank, num_tokens_per_rdma_rank, num_tokens_per_expert, is_token_in_rank, previous_event = \
        _buffer.get_dispatch_layout(topk_idx, num_experts,
                                    previous_event=previous_event, async_finish=True,
                                    allocate_on_comm_stream=previous_event is not None)

    # Do MoE dispatch
    # NOTES: the CPU will wait for GPU's signal to arrive, so this is not compatible with CUDA graph
    # For more advanced usages, please refer to the docs of the `dispatch` function
    recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list, handle, event = \
        _buffer.dispatch(x, topk_idx=topk_idx, topk_weights=topk_weights,
                         num_tokens_per_rank=num_tokens_per_rank, num_tokens_per_rdma_rank=num_tokens_per_rdma_rank,
                         is_token_in_rank=is_token_in_rank, num_tokens_per_expert=num_tokens_per_expert,
                         previous_event=previous_event, async_finish=True,
                         allocate_on_comm_stream=True)

    # For event management, please refer to the docs of the `EventOverlap` class
    return recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list, handle, event


def dispatch_backward(grad_recv_x: torch.Tensor, grad_recv_topk_weights: torch.Tensor, handle: Tuple) -> \
        Tuple[torch.Tensor, torch.Tensor, EventOverlap]:
    global _buffer

    # The backward process of MoE dispatch is actually a combine
    # For more advanced usages, please refer to the docs of the `combine` function
    combined_grad_x, combined_grad_recv_topk_weights, event = \
        _buffer.combine(grad_recv_x, handle, topk_weights=grad_recv_topk_weights, async_finish=True)

    # For event management, please refer to the docs of the `EventOverlap` class
    return combined_grad_x, combined_grad_recv_topk_weights, event


def combine_forward(x: torch.Tensor, handle: Tuple, previous_event: Optional[EventOverlap] = None) -> \
        Tuple[torch.Tensor, EventOverlap]:
    global _buffer

    # Do MoE combine
    # For more advanced usages, please refer to the docs of the `combine` function
    combined_x, _, event = _buffer.combine(x, handle, async_finish=True, previous_event=previous_event,
                                           allocate_on_comm_stream=previous_event is not None)

    # For event management, please refer to the docs of the `EventOverlap` class
    return combined_x, event


def combine_backward(grad_combined_x: Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]],
                     handle: Tuple, previous_event: Optional[EventOverlap] = None) -> \
        Tuple[Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]], EventOverlap]:
    global _buffer

    # The backward process of MoE combine is actually a dispatch
    # For more advanced usages, please refer to the docs of the `dispatch` function
    grad_x, _, _, _, _, event = _buffer.dispatch(grad_combined_x, handle=handle, async_finish=True,
                                                 previous_event=previous_event,
                                                 allocate_on_comm_stream=previous_event is not None)

    # For event management, please refer to the docs of the `EventOverlap` class
    return grad_x, event
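For orientation, here is a minimal, hypothetical sketch of wiring the helpers above into an MoE layer's forward pass. run_experts stands in for your own expert computation; only get_buffer, get_hidden_bytes, dispatch_forward, and combine_forward come from the example above, and current_stream_wait() is assumed to be the EventOverlap synchronization call (check the EventOverlap docs for the exact API):
# Hypothetical MoE forward sketch built on the helpers defined above.
import torch
import torch.distributed as dist


def moe_forward(x: torch.Tensor, group: dist.ProcessGroup,
                topk_idx: torch.Tensor, topk_weights: torch.Tensor,
                num_experts: int, run_experts) -> torch.Tensor:
    # Make sure the communication buffer is large enough for this hidden size
    get_buffer(group, get_hidden_bytes(x))

    # Send every token to the ranks hosting its top-k experts
    recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list, handle, event = \
        dispatch_forward(x, topk_idx, topk_weights, num_experts)
    # Other computation could be overlapped here before waiting on the event
    event.current_stream_wait()

    # Apply the local experts to the received tokens (user-defined placeholder)
    expert_out = run_experts(recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list)

    # Gather the expert outputs back to the original token order
    combined_x, combine_event = combine_forward(expert_out, handle)
    combine_event.current_stream_wait()
    return combined_x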
Inference decoding
The example below shows how to use the low-latency kernels for communication in the inference decoding phase. It again uses the Buffer class, now constructed with low_latency_mode=True, and communicates through low_latency_dispatch() and low_latency_combine(). Passing return_recv_hook=True returns a receive hook, which enables double-batch overlapping without occupying any SM resources. (A sketch of that overlapping pattern follows the example.)
import torch
import torch.distributed as dist
from typing import Tuple, Optional

from deep_ep import Buffer

# Communication buffer (will allocate at runtime)
# NOTES: there is no SM control API for the low-latency kernels
_buffer: Optional[Buffer] = None


# You may call this function at the framework initialization
def get_buffer(group: dist.ProcessGroup, num_max_dispatch_tokens_per_rank: int, hidden: int, num_experts: int) -> Buffer:
    # NOTES: the low-latency mode will consume much more space than the normal mode
    # So we recommend that `num_max_dispatch_tokens_per_rank` (the actual batch size in the decoding engine) should be less than 256
    global _buffer

    num_rdma_bytes = Buffer.get_low_latency_rdma_size_hint(num_max_dispatch_tokens_per_rank, hidden, group.size(), num_experts)

    # Allocate a buffer if not existed or not enough buffer size
    if _buffer is None or _buffer.group != group or not _buffer.low_latency_mode or _buffer.num_rdma_bytes < num_rdma_bytes:
        # NOTES: for best performance, the QP number **must** be equal to the number of the local experts
        assert num_experts % group.size() == 0
        _buffer = Buffer(group, 0, num_rdma_bytes, low_latency_mode=True, num_qps_per_rank=num_experts // group.size())
    return _buffer


def low_latency_dispatch(hidden_states: torch.Tensor, topk_idx: torch.Tensor, num_max_dispatch_tokens_per_rank: int, num_experts: int):
    global _buffer

    # Do MoE dispatch, compatible with CUDA graph (but you may restore some buffer status once you replay)
    recv_hidden_states, recv_expert_count, handle, event, hook = \
        _buffer.low_latency_dispatch(hidden_states, topk_idx, num_max_dispatch_tokens_per_rank, num_experts,
                                     async_finish=False, return_recv_hook=True)

    # NOTES: the actual tensor will not be received until you call `hook()`,
    # which is useful for double-batch overlapping, but **without any SM occupation**
    # If you don't want to overlap, please set `return_recv_hook=False`
    # Later, you can use our GEMM library to do the computation with this specific format
    return recv_hidden_states, recv_expert_count, handle, event, hook


def low_latency_combine(hidden_states: torch.Tensor,
                        topk_idx: torch.Tensor, topk_weights: torch.Tensor, handle: Tuple):
    global _buffer

    # Do MoE combine, compatible with CUDA graph (but you may restore some buffer status once you replay)
    combined_hidden_states, event_overlap, hook = \
        _buffer.low_latency_combine(hidden_states, topk_idx, topk_weights, handle,
                                    async_finish=False, return_recv_hook=True)

    # NOTES: the same behavior as described in the dispatch kernel
    return combined_hidden_states, event_overlap, hook
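A minimal, hypothetical sketch of the double-batch overlapping that the receive hooks enable follows; attention and moe_gemm are placeholders for your own kernels, and only low_latency_dispatch and low_latency_combine come from the example above:
# Hypothetical double-batch overlapping sketch built on the helpers defined above.
import torch


def decode_step(hidden_a: torch.Tensor, topk_idx_a: torch.Tensor, topk_weights_a: torch.Tensor,
                hidden_b: torch.Tensor, num_max_dispatch_tokens_per_rank: int, num_experts: int,
                attention, moe_gemm):
    # Issue the RDMA dispatch for micro-batch A; nothing is received yet
    recv_a, recv_count_a, handle_a, _, dispatch_hook = \
        low_latency_dispatch(hidden_a, topk_idx_a, num_max_dispatch_tokens_per_rank, num_experts)

    # Overlap: run unrelated computation (e.g. attention for micro-batch B);
    # the pending receive occupies no SMs while this runs
    hidden_b = attention(hidden_b)

    # Now actually receive A's tokens and run the local experts (user-defined placeholder)
    dispatch_hook()
    expert_out_a = moe_gemm(recv_a, recv_count_a)

    # Issue the combine for A; more overlapped work could go here before `combine_hook()`
    combined_a, _, combine_hook = \
        low_latency_combine(expert_out_a, topk_idx_a, topk_weights_a, handle_a)
    combine_hook()
    return combined_a, hidden_b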
3. Other notes
For extreme performance, the project uses PTX instructions with undefined behavior, which may not work correctly on other platforms; this can be disabled by setting DISABLE_AGGRESSIVE_PTX_INSTRS=1 when building. For the best performance, run all the tests on your own cluster and use the best auto-tuned configurations.
Application Examples of DeepEP
The example code above demonstrates DeepEP in model training / inference prefilling and in inference decoding: it shows how to perform MoE dispatch and combine with the provided functions and how to use communication-computation overlap to improve efficiency. The performance numbers show that, on H800 GPU clusters, DeepEP sustains high bandwidth and low latency across a wide range of expert-parallel sizes.