MoE for LLMs: Translation and Commentary on "Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts"

Overview: This paper targets the communication overhead of the Mixture-of-Experts (MoE) architecture in large language models. It proposes an optimized system named Comet that achieves fine-grained computation-communication overlapping, significantly improving the execution efficiency of MoE models and providing an important system-level optimization for both training and inference of large language models.

>> Background and pain points: In distributed settings, the communication overhead of MoE models is substantial; in large MoE models, communication can occupy 47% of the total model execution time. Existing methods try to reduce this overhead by coarsely overlapping computation with communication, but such coarse-grained overlapping is inefficient: it degrades computational efficiency and hides latency poorly. There are two main reasons: (1) the granularities of computation and communication are mismatched (token-level vs. tile-level); (2) the computation and communication workloads are dynamic and imbalanced.

>> Solution: the Comet system. Comet addresses the above problems with two key designs:

● Shared-tensor-based dependency resolving: analyze the tensors shared between computation and communication operations in an MoE layer (shared tensors), decompose them along specific dimensions, and reschedule the computation tasks, thereby eliminating the granularity mismatch between computation and communication and enabling fine-grained overlapping.

● Adaptive workload assignment: fuse computation and communication tasks into a single GPU kernel and, via thread block specialization and adaptive thread block assignment, dynamically balance the computation and communication workloads to hide communication latency as much as possible.

>> Core idea and steps: The core idea of Comet is fine-grained computation-communication overlapping. The main steps are as follows (see the sketch after this list):

● Shared tensor analysis: identify the tensors shared by computation and communication operations in an MoE layer and analyze their access patterns.

● Shared tensor decomposition: decompose the shared tensors along specific dimensions to break the coarse-grained dependencies; the dimension is chosen according to the data dependencies between the computation and communication operations.

● Computation rescheduling: reorganize the decomposed sub-tensors into tiles suitable for computation and adjust the computation order to maximize the overlap between computation and communication.

● Thread block specialization: assign computation and communication tasks to different thread blocks, isolating the impact of communication on computation and improving computational efficiency.

● Adaptive thread block assignment: dynamically adjust the number of thread blocks assigned to computation and communication tasks according to the input shape, model configuration, and hardware environment, so as to balance computation and communication latencies.
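
To make the decomposition-and-rescheduling idea concrete, below is a minimal PyTorch sketch, not COMET's actual implementation (COMET fuses communication and computation into a single GPU kernel via thread block specialization). The sketch splits a hypothetical shared tensor along the token dimension and lets each chunk's "receive" (simulated here by an asynchronous copy on a side stream) overlap with that chunk's expert GEMM; all shapes, names, and the chunk count are assumptions made for illustration.

```python
# Minimal sketch of fine-grained overlap via shared-tensor decomposition.
# NOTE: shapes, chunk count, and the "communication" stand-in are assumptions;
# COMET itself fuses both workloads into one kernel with specialized thread
# blocks instead of relying on separate CUDA streams as done here.
import torch

assert torch.cuda.is_available(), "this sketch assumes a CUDA device"

M, K, N = 8192, 4096, 14336          # tokens, hidden dim, expert FFN dim (assumed)
num_chunks = 8                        # decomposition along the token (M) dimension

shared = torch.randn(M, K, device="cuda", dtype=torch.float16)  # tokens routed to one expert
weight = torch.randn(K, N, device="cuda", dtype=torch.float16)  # the expert's weight
output = torch.empty(M, N, device="cuda", dtype=torch.float16)

recv_buf = torch.empty_like(shared)   # stands in for the dispatch/receive buffer
comm_stream = torch.cuda.Stream()
events = [torch.cuda.Event() for _ in range(num_chunks)]

in_chunks = shared.chunk(num_chunks, dim=0)
buf_chunks = recv_buf.chunk(num_chunks, dim=0)
out_chunks = output.chunk(num_chunks, dim=0)

# "Communication": enqueue per-chunk async copies on a side stream.
with torch.cuda.stream(comm_stream):
    for i in range(num_chunks):
        buf_chunks[i].copy_(in_chunks[i], non_blocking=True)
        events[i].record(comm_stream)

# Computation: each chunk's expert GEMM starts as soon as that chunk has
# "arrived", overlapping with the transfers of the remaining chunks.
for i in range(num_chunks):
    torch.cuda.current_stream().wait_event(events[i])
    torch.matmul(buf_chunks[i], weight, out=out_chunks[i])

torch.cuda.synchronize()
print("output:", tuple(output.shape))
```

Under this decomposition, a compute chunk only waits for its own slice of the shared tensor rather than for the entire buffer, which is the dependency-level change that makes fine-grained overlap possible.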

>> Advantages

● Fine-grained overlapping: achieves fine-grained overlapping of computation and communication, hiding communication latency more effectively than existing methods.

● High computational efficiency: maintains high computational efficiency through shared tensor decomposition and rescheduling as well as thread block specialization.

● Strong adaptability: the adaptive workload assignment mechanism lets Comet adapt to different model configurations, runtime workloads, and system environments.

● Significant performance gains: delivers notable speedups on multiple large MoE models, accelerating a single MoE layer by 1.96x and end-to-end execution by 1.71x.

>> Conclusions and viewpoints of the paper

● Comet effectively solves the difficulty of overlapping computation and communication in MoE models, achieving fine-grained overlapping.

● The design ideas behind Comet can be applied to other scenarios that require computation-communication overlapping.

● Comet has been validated in real production environments and saves a large amount of GPU resources.

● Future work could explore using compilers such as Triton or TVM to further optimize Comet's programming model.

Table of Contents

Translation and Commentary on "Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts"

Abstract

1. Introduction

Figure 1: Analysis of the execution of MoE. (a) Time breakdown of MoE models executed on 8 H800 GPUs using Megatron-LM. (b) An illustration of communication-computation overlap by partitioning an expert computation kernel into two.

Conclusion


Translation and Commentary on "Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts"

Link

Paper: [2502.19811] Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts (https://arxiv.org/abs/2502.19811)

Date

February 27, 2025

Latest revision: March 4, 2025

Authors

ByteDance team et al.

Abstract

Mixture-of-experts (MoE) has been extensively employed to scale large language models to trillion-plus parameters while maintaining a fixed computational cost. The development of large MoE models in the distributed scenario encounters the problem of large communication overhead. The inter-device communication of a MoE layer can occupy 47% time of the entire model execution with popular models and frameworks. Therefore, existing methods suggest the communication in a MoE layer to be pipelined with the computation for overlapping. However, these coarse grained overlapping schemes introduce a notable impairment of computational efficiency and the latency concealing is sub-optimal.

To this end, we present COMET, an optimized MoE system with fine-grained communication-computation overlapping. Leveraging data dependency analysis and task rescheduling, COMET achieves precise fine-grained overlapping of communication and computation. Through adaptive workload assignment, COMET effectively eliminates fine-grained communication bottlenecks and enhances its adaptability across various scenarios. Our evaluation shows that COMET accelerates the execution of a single MoE layer by 1.96× and for end-to-end execution, COMET delivers a 1.71× speedup on average. COMET has been adopted in the production environment of clusters with ten-thousand-scale of GPUs, achieving savings of millions of GPU hours.

1. Introduction

Recent advancements in large language models have revolutionized multiple domains, including natural language processing [36, 35], computer vision [16] and multi-modal perception [14, 3]. These achievements demonstrate that scaling up model size can significantly enhance model capacity. However, the growth in model parameters poses substantial challenges for the deployment of such giant models, as computational resources increasingly constrain model capacity [28].

To this end, Mixture-of-Experts (MoE) [29] introduces a sparse structure, within which only part of the parameters is activated. Instead of interacting with all parameters in dense models, MoE models allow each input to interact with only a few experts. For example, the Mixtral-8x7B model [12] comprises 45 billion parameters in total, while only 14 billion parameters are active during runtime. Nowadays, MoE has emerged as a key architecture for scaling models to trillion-plus parameters.

The increase in parameter size in MoE models allows for the integration of greater amounts of information, but it poses challenges in expert placement. A typical approach is to distribute the experts across different GPUs as a single GPU cannot store all experts [13]. Consequently, during the execution of MoE layers, there is an intensive need for data exchange among GPUs. In the forward pass of several popular MoE models, the communication among devices accounts for 47% of the total execution time on average, as shown in Figure 1(a).

In a distributed environment, executing an MoE layer involves data reception, expert computation, and data transmission, as depicted in Figure 1(b). To reduce communication overhead, one effective strategy is to pipeline the process, overlapping communication with expert computation [10, 8, 31, 32]. This approach involves partitioning input data into smaller data chunks, allowing decomposed communication and computation phases to overlap. In the example in Figure 1(b), the received input data is divided into two chunks, and this coarse-grained overlapping reduces the overall execution time relative to non-pipelined execution.

The overlapping in existing mechanisms remains suboptimal due to two primary inefficiencies. First, the efficiency of partitioned experts declines as the data chunks assigned to each expert become smaller, potentially leading to under-utilization of GPU computational resources (e.g., the total compute time of experts after partitioning t1+t2 exceeds the original time t). The coarse-grained partitioning results in unavoidable GPU idle time during the initial and final communication phases, such as when receiving data for chunk 1 and sending data for chunk 2, which do not overlap with computation. Consequently, minimizing the non-overlapping time in these phases while maintaining computational efficiency is crucial. This is challenging because the data dependency between communication and computation is complex and it is hard to be overlapped in a fine-grained granularity efficiently. Second, due to the dynamic nature of MoE, the input shapes for experts are various at runtime, thereby posing diverse communication and computation burdens on GPUs. Encapsulating communication and computation tasks into separate kernels on different streams, like almost all the prior researches do, restricts control over hardware resources and results in non-deterministic kernel performance, thereby hindering seamless overlap (e.g., the computation of chunk 1 and the receiving of chunk 2 are misaligned). The second challenge, therefore, is to dynamically ensure precise allocation of hardware resources between computation and communication workloads at runtime.
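
As a back-of-the-envelope illustration of the first inefficiency (the numbers below are invented, not measurements from the paper): even with perfect overlap of the middle stages, the first receive and the last send stay exposed, and the two smaller expert GEMMs together take longer than the unpartitioned one.

```python
# Toy timing model of the two-chunk coarse-grained pipeline of Figure 1(b).
# All numbers are illustrative assumptions, not results from the paper.
recv = [0.4, 0.4]              # per-chunk receive time (ms)
send = [0.4, 0.4]              # per-chunk send time (ms)
compute_whole = 1.0            # expert GEMM on the full input (ms)
compute = [0.6, 0.6]           # partitioned GEMMs run less efficiently: 0.6 + 0.6 > 1.0

serial = sum(recv) + compute_whole + sum(send)
# recv(chunk 0) and send(chunk 1) cannot be hidden; only the middle stages overlap.
pipelined = recv[0] + max(compute[0], recv[1]) + max(compute[1], send[0]) + send[1]
print(f"serial: {serial:.1f} ms, coarse-grained pipeline: {pipelined:.1f} ms")  # 2.6 vs 2.0
```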

The complex data dependency, and the dynamic computation and communication workloads in MoE impede existing systems to realize efficient communication-computation overlap. We therefore propose Comet, a system that enables fine-grained communication-computation overlapping for efficient MoE execution. Comet introduces two key designs: 1) A dependency resolving method that identifies complex data dependencies between communication and computation operations in MoE, enabling optimized computation-communication pipeline structuring. 2) An adaptive workload assignment method that dynamically allocates GPU thread blocks to different workloads within a kernel, balancing communication and computation to improve latency concealment.

Comet facilitates fine-grained overlapping in MoE by analyzing shared data buffers between communication and computation operations, referred to as shared tensor. By decomposing the shared tensors along specific dimensions and reorganizing tensor data along with intra-operator execution order, Comet eliminates the granularity mismatches between communication and computation, thereby enabling fine-grained overlapping. To ensure precise resource allocation and effective latency concealment, Comet integrates communication and computation tasks within fused GPU kernels. Through thread block specialization, Comet isolates the impact of communication on computation performance, maintaining high computational efficiency. By adjusting the number of thread blocks allocated to each workload, Comet effectively balances communication and computation latencies and reduces bubbles in overlapping.
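
The adaptive assignment can be pictured as a simple search over thread-block splits; the linear cost model, total block count, and workload figures below are illustrative assumptions only (COMET derives the split from the actual input shape, model configuration, and hardware rather than from such a toy model).

```python
# Toy illustration of adaptive thread-block assignment inside a fused kernel:
# choose how many blocks serve communication vs. computation so that the
# slower of the two sides (the exposed latency) is minimized.
def pick_split(total_blocks: int, comm_work: float, comp_work: float):
    best = None
    for comm_blocks in range(1, total_blocks):
        comp_blocks = total_blocks - comm_blocks
        comm_latency = comm_work / comm_blocks    # more blocks -> shorter drain (assumed linear)
        comp_latency = comp_work / comp_blocks
        makespan = max(comm_latency, comp_latency)
        if best is None or makespan < best[0]:
            best = (makespan, comm_blocks, comp_blocks)
    return best

# e.g. 132 thread blocks (roughly one resident block per SM on an H800-class GPU, assumed)
makespan, comm_blocks, comp_blocks = pick_split(132, comm_work=40.0, comp_work=200.0)
print(f"comm blocks: {comm_blocks}, compute blocks: {comp_blocks}, "
      f"balanced latency ~ {makespan:.2f}")
```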

We have integrated Comet into Megatron-LM [33] and verified the capability of Comet with various parallel strategies. Our extensive experiments on Nvidia H800 and L20 clusters show that Comet delivers 1.96× speedup for typical MoE layers, and 1.71× speedup for end-to-end MoE model execution (Mixtral-8x7B [12], Qwen2-MoE [2], Phi3.5-MoE [1]) on average, compared with the SOTA MoE systems. Comet has been deployed to accelerate training and inference of large MoE models in production clusters comprising over ten thousand GPUs, achieving savings of millions of GPU hours. Comet introduces a fine-grained pipelined programming model for computation and communication. We will open-source COMET, aiming to inspire further optimizations, such as implementing the programming model in Comet using compilers like Triton [26] or TVM [6].

Figure 1: Analysis of the execution of MoE. (a) Time breakdown of MoE models executed on 8 H800 GPUs using Megatron-LM. (b) An illustration of communication-computation overlap by partitioning an expert computation kernel into two.

Conclusion

In this paper, we propose Comet, a MoE system that aims to achieve fine-grained communication and computation overlapping for MoE. Comet features two key designs to achieve seamless overlapping without impacting the computational efficiency: Shared tensor based dependency resolving that enables fine-grained overlapping, while eliminating the bottleneck caused by fine-grained communication I/O; The workload assignment mechanism that promises precise and adaptive overlapping of operators, inducing maximal latency concealing. Comet achieves 1.96× speedup in a single MoE layer and 1.71× speedup in the end-to-end execution of MoE models, compared with existing literature.
