[Code Walkthrough] Implementing a Mixture-of-Experts (MoE) Language Model in PyTorch

Latest recommended article published 2025-04-28 11:00:42

Kaydeon

Tags: pytorch, language models, artificial intelligence

Article link: https://blog.csdn.net/weixin_42744466/article/details/143261989

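In deep learning and natural language processing, Mixture-of-Experts (MoE) models have drawn wide attention for their strong performance and scalability. DeepSeek-V2, a powerful open-source MoE language model, has recently attracted considerable community interest because its Transformer architecture makes training and inference economical. This article describes the architecture of DeepSeek-V2 and provides PyTorch code for its main components to help readers understand the model.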

DeepSeek-V2 Model Overview

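DeepSeek-V2 is an open-source Mixture-of-Experts (MoE) language model with 236 billion total parameters, of which 21 billion are activated for each token, and it supports a context length of up to 128K tokens. It achieves top-tier performance among open-source models and is one of the strongest open-source MoE language models. On the MMLU (Massive Multitask Language Understanding) benchmark, it reaches top-tier accuracy with relatively few activated parameters. Compared with its predecessor DeepSeek 67B, DeepSeek-V2 performs significantly better while cutting training costs by 42.5%, shrinking the KV cache by 93.3%, and raising maximum generation throughput to 5.76 times.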
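
To put these headline numbers in proportion, the short sketch below works out what share of the model is active for a single token and how much of the KV cache remains after the quoted reduction. It takes 236 billion total and 21 billion activated parameters as inputs; the helper names are ours and purely illustrative.

```python
TOTAL_PARAMS = 236e9   # total parameters
ACTIVE_PARAMS = 21e9   # parameters activated for each token


def active_fraction(total: float, active: float) -> float:
    """Share of the parameters that take part in processing one token."""
    return active / total


def remaining_after_reduction(reduction: float) -> float:
    """Fraction that is left after a relative reduction (0.933 means 93.3%)."""
    return 1.0 - reduction


frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
kv_left = remaining_after_reduction(0.933)

print(f"activated per token: {frac:.1%}")     # activated per token: 8.9%
print(f"KV cache remaining:  {kv_left:.1%}")  # KV cache remaining:  6.7%
```

In other words, each token touches under a tenth of the weights, which is where the savings of a sparse MoE model come from.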

Architecture Details

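DeepSeek-V2 combines two architectural innovations: the DeepSeekMoE architecture for the feed-forward networks (FFNs) and Multi-head Latent Attention (MLA) for the attention mechanism. Both are discussed below.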

DeepSeekMoE

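In a standard MoE architecture, each token is routed to one or two experts, and every MoE layer contains several experts that are structurally identical to a standard feed-forward network (FFN). DeepSeekMoE adds two strategies, fine-grained expert segmentation and shared expert isolation, to make the experts more specialized.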

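Fine-grained expert segmentation: each expert is split into smaller experts by dividing the intermediate hidden dimension of its FFN, so that each expert can acquire more targeted knowledge.
Shared expert isolation: some experts are set aside as shared experts that are always activated. They are meant to capture knowledge common to different contexts, and concentrating that common knowledge in the shared experts reduces redundancy among the routed experts.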
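
The effect of fine-grained segmentation can be illustrated with some simple counting. The sketch below assumes that each expert is split into `m` smaller experts and that `m` times as many experts are activated, so the total and activated FFN widths stay the same while the number of possible expert combinations grows sharply. The layer sizes (16 experts, top-2, hidden width 4096) and the function names are illustrative assumptions, not DeepSeek-V2's actual configuration.

```python
import math


def routing_combinations(num_experts: int, top_k: int) -> int:
    """Number of distinct expert subsets a token can be routed to."""
    return math.comb(num_experts, top_k)


def segment_experts(num_experts: int, top_k: int, hidden_dim: int, m: int):
    """Split every expert into m smaller experts and activate m times as many.

    Total FFN width (num_experts * hidden_dim) and activated width
    (top_k * hidden_dim) are both unchanged by the split.
    """
    if hidden_dim % m != 0:
        raise ValueError("hidden_dim must be divisible by m")
    return num_experts * m, top_k * m, hidden_dim // m


coarse = (16, 2, 4096)               # hypothetical coarse-grained layer
fine = segment_experts(*coarse, m=4)

print(fine)                                    # (64, 8, 1024)
print(routing_combinations(16, 2))             # 120
print(routing_combinations(fine[0], fine[1]))  # 4426165368
```

With the same parameter and compute budget, the router goes from 120 possible expert combinations to more than four billion, which gives it far more room to specialize.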

Multi-head Latent Attention (MLA)

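Multi-head Latent Attention (MLA) achieves better performance than standard Multi-Head Attention (MHA) while significantly reducing the KV cache, which improves inference efficiency. Instead of caching the full key and value matrices, MLA jointly compresses the keys and values into a single latent vector, so far fewer entries need to be cached per token.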
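
A rough per-token, per-layer count shows why caching a latent vector is cheaper. The sketch below assumes that MHA caches one key and one value vector per head, and that MLA caches the compressed latent plus one rotary key shared by all heads (which is what the output width of `kv_a_proj_with_mqa` in this article's code suggests). The sizes used (128 heads of width 128, a latent of 512, a rotary key of 64) are illustrative assumptions.

```python
def mha_cache_per_token(num_heads: int, head_dim: int) -> int:
    """Elements cached per token and layer by standard multi-head attention."""
    return 2 * num_heads * head_dim  # one key and one value per head


def mla_cache_per_token(kv_lora_rank: int, qk_rope_head_dim: int) -> int:
    """Elements cached per token and layer when only the latent is stored."""
    return kv_lora_rank + qk_rope_head_dim  # latent + shared rotary key


mha = mha_cache_per_token(num_heads=128, head_dim=128)
mla = mla_cache_per_token(kv_lora_rank=512, qk_rope_head_dim=64)

print(mha)                       # 32768
print(mla)                       # 576
print(f"{1 - mla / mha:.1%}")    # 98.2%
```

This compares MLA with plain MHA at the same head count, so the result is not the same quantity as the 93.3% figure quoted for DeepSeek-V2, which is measured against a different baseline model.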

PyTorch Implementation

Next, we provide PyTorch code for the main components of DeepSeek-V2: the gating module, the MoE layer, and Multi-head Latent Attention.

Gating Module Implementation

import math

import torch


class MoEGate(torch.nn.Module):
    def __init__(self, num_experts_per_tok: int, n_routed_experts: int, routed_scaling_factor: int, topk_method: str, n_group: int, topk_group: int, hidden_size: int):
        super().__init__()
        self.top_k = num_experts_per_tok
        self.n_routed_experts = n_routed_experts
        self.routed_scaling_factor = routed_scaling_factor
        self.topk_method = topk_method
        self.n_group = n_group
        self.topk_group = topk_group
        self.weight = torch.nn.Parameter(torch.empty((self.n_routed_experts, hidden_size)))
        torch.nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))

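    def forward(self, x: torch.Tensor):
        batch, seq_len, h = x.shape
        # Flatten to [n, h], where n = batch * seq_len
        hidden_states = x.view(-1, h)
        # Compute routing logits and softmax scores in float32 for numerical stability
        logits = torch.nn.functional.linear(hidden_states.type(torch.float32), self.weight.type(torch.float32), None)
        scores = logits.softmax(dim=-1, dtype=torch.float32)  # [n, e]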
        if self.topk_method == "greedy":
            topk_weight, topk_idx = torch.topk(scores, k=self.top_k, dim=-1, sorted=False)
        elif self.topk_method == "group_limited_greedy":
            group_scores = (scores.view(batch * seq_len, self.n_group, -1).max(dim=-1).values)
            group_idx = torch.topk(group_scores, k=self.topk_group, dim=-1, sorted=False)[1]  # [n, top_k_group]
            group_mask = torch.zeros_like(group_scores)  # [n, n_group]
            group_mask.scatter_(1, group_idx, 1)  # [n, n_group]
            score_mask = (
                group_mask.unsqueeze(-1)
                .expand(
                    batch * seq_len, self.n_group, self.n_routed_experts // self.n_group
                )
                .reshape(batch * seq_len, -1)
            )  # [n, e]
            tmp_scores = scores.masked_fill(~score_mask.bool(), 0.0)  # [n, e]
            topk_weight, topk_idx = torch.topk(
                tmp_scores, k=self.top_k, dim=-1, sorted=False
            )
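        else:
            raise ValueError(f"Unsupported topk_method: {self.topk_method}")
        # Note: routed_scaling_factor is stored but not applied to topk_weight in this simplified gate
        return topk_idx, topk_weight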
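
To see how the `group_limited_greedy` branch differs from plain top-k, here is a minimal standalone sketch of the same masking steps on a single token with six experts. The function name `group_limited_topk` and the toy scores are ours; the tensor operations mirror the gate above.

```python
import torch


def group_limited_topk(scores: torch.Tensor, n_group: int, topk_group: int, top_k: int):
    """Pick top_k experts, restricted to the topk_group best-scoring groups."""
    n, e = scores.shape
    # Score each group by its best expert: [n, n_group]
    group_scores = scores.view(n, n_group, -1).max(dim=-1).values
    group_idx = torch.topk(group_scores, k=topk_group, dim=-1, sorted=False)[1]
    # Mark the selected groups with 1 and all others with 0
    group_mask = torch.zeros_like(group_scores)
    group_mask.scatter_(1, group_idx, 1)
    # Expand the group mask to one entry per expert: [n, e]
    score_mask = group_mask.unsqueeze(-1).expand(n, n_group, e // n_group).reshape(n, -1)
    # Zero out experts outside the selected groups, then take the usual top-k
    masked = scores.masked_fill(~score_mask.bool(), 0.0)
    return torch.topk(masked, k=top_k, dim=-1)


# One token, six experts in three groups: [0, 1], [2, 3], [4, 5]
scores = torch.tensor([[0.05, 0.30, 0.10, 0.25, 0.20, 0.10]])

greedy_idx = torch.topk(scores, k=2, dim=-1).indices
limited_idx = group_limited_topk(scores, n_group=3, topk_group=1, top_k=2).indices

print(greedy_idx.tolist())   # [[1, 3]]
print(limited_idx.tolist())  # [[1, 0]]
```

Plain greedy routing picks experts 1 and 3, which sit in different groups. With only one group allowed, both selected experts come from the group containing the best expert, even though expert 0 has the lowest score overall. Restricting each token to a few groups is useful when groups are placed on different devices, because it bounds how many devices a token has to reach.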

MoE Implementation

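class MoE(torch.nn.Module):
    # SwiGLU is assumed to be an FFN block defined elsewhere with the signature SwiGLU(dim, hidden_dim)
    def __init__(self, dim: int, routed_scaling_factor: int, topk_method: str, n_group: int, topk_group: int, hidden_dim: int | None = None, n_routed_experts: int = 12, num_experts_per_tok: int = 4, n_shared_experts: int = 2, mlp: str = "swiglu"):
        super().__init__()
        if hidden_dim is None:
            # Assumption: fall back to the common 4 * dim FFN width when no size is given,
            # because hidden_dim * n_shared_experts below would fail on None
            hidden_dim = 4 * dim
        self.experts_per_rank = n_routed_experts
        self.num_experts_per_tok = num_experts_per_tok
        self.n_shared_experts = n_shared_experts
        mlp_block = SwiGLU  # the `mlp` argument is currently ignored; only SwiGLU is used
        self.experts = torch.nn.ModuleList([mlp_block(dim, hidden_dim) for _ in range(n_routed_experts)])
        self.gate = MoEGate(num_experts_per_tok, n_routed_experts, routed_scaling_factor, topk_method, n_group, topk_group, dim)
        # The shared experts are fused into one wider FFN that every token passes through
        self.shared_experts = mlp_block(dim, hidden_dim * n_shared_experts)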

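    def forward(self, x: torch.Tensor):
        identity = x
        orig_shape = x.shape
        topk_idx, topk_weight = self.gate(x)  # both [n, top_k], n = batch * seq_len
        x = x.view(-1, x.shape[-1])  # [n, dim]
        flat_topk_idx = topk_idx.view(-1)  # [n * top_k]
        # Duplicate each token once per selected expert: [n * top_k, dim]
        x = x.repeat_interleave(self.num_experts_per_tok, dim=0)
        y = torch.empty_like(x)
        # Run every expert on the token copies routed to it
        for i, expert in enumerate(self.experts):
            selected = flat_topk_idx == i
            y[selected] = expert(x[selected]).to(dtype=x.dtype)
        # Weight each expert output by its gate score and sum over the top_k experts
        y = (y.view(*topk_weight.shape, -1) * topk_weight.unsqueeze(-1)).sum(dim=1)

        y = y.view(*orig_shape)
        # Add the output of the always-active shared experts
        output = y + self.shared_experts(identity)
        return output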
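
The `MoE` class above depends on a `SwiGLU` block that this article does not show. The sketch below is one common way to write such a block, assuming the `SwiGLU(dim, hidden_dim)` constructor used above; the layer names `w_gate`, `w_up`, and `w_down` are our own choice, and the original project's version may differ.

```python
import torch


class SwiGLU(torch.nn.Module):
    """Gated feed-forward block: down(silu(gate(x)) * up(x))."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = torch.nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = torch.nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = torch.nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The SiLU-activated gate modulates the up projection element-wise
        gated = torch.nn.functional.silu(self.w_gate(x)) * self.w_up(x)
        return self.w_down(gated)


block = SwiGLU(dim=16, hidden_dim=32)
out = block(torch.randn(2, 5, 16))
print(out.shape)  # torch.Size([2, 5, 16])
```

Because the block maps `dim` back to `dim`, it can serve both as a routed expert and, with a wider `hidden_dim`, as the fused shared expert.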

Multi-head Latent Attention (MLA) Implementation

# DeepseekConfig, RMSNorm, and apply_rope are assumed to be defined elsewhere in the project
class MLA(torch.nn.Module):
    def __init__(self, model_args: DeepseekConfig):
        super().__init__()
        d_model = model_args.d_model
        self.num_heads = model_args.num_heads
        self.head_dim = model_args.d_model // model_args.num_heads
        self.attn_dropout = torch.nn.Dropout(model_args.dropout)
        self.res_dropout = torch.nn.Dropout(model_args.dropout)
        self.flash_attn = hasattr(torch.nn.functional, "scaled_dot_product_attention")

        self.q_lora_rank = model_args.q_lora_rank
        self.qk_rope_head_dim = model_args.qk_rope_head_dim
        self.kv_lora_rank = model_args.kv_lora_rank
        self.v_head_dim = model_args.v_head_dim
        self.qk_nope_head_dim = model_args.qk_nope_head_dim
        self.q_head_dim = model_args.qk_nope_head_dim + model_args.qk_rope_head_dim
        self.q_a_proj = torch.nn.Linear(d_model, model_args.q_lora_rank, bias=False)
        self.q_a_layernorm = RMSNorm(model_args.q_lora_rank)
        self.q_b_proj = torch.nn.Linear(model_args.q_lora_rank, self.num_heads * self.q_head_dim, bias=False)
        self.kv_a_proj_with_mqa = torch.nn.Linear(d_model, model_args.kv_lora_rank + model_args.qk_rope_head_dim, bias=False)
        self.kv_a_layernorm = RMSNorm(model_args.kv_lora_rank)
        self.kv_b_proj = torch.nn.Linear(
            model_args.kv_lora_rank,
            self.num_heads * (self.q_head_dim - self.qk_rope_head_dim + self.v_head_dim),
            bias=False,
        )
        self.o_proj = torch.nn.Linear(self.num_heads * self.v_head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor, mask: torch.Tensor, freqs_cis) -> torch.Tensor:
        batch, seq_len, d_model = x.shape
        q = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(x)))
        q = q.view(batch, seq_len, self.num_heads, self.q_head_dim).transpose(1, 2)
        q_nope, q_pe = torch.split(q, [self.qk_nope_head_dim, self.qk_rope_head_dim], dim=-1)
        compressed_kv = self.kv_a_proj_with_mqa(x)
        compressed_kv, k_pe = torch.split(compressed_kv, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1)
        k_pe = k_pe.view(batch, seq_len, 1, self.qk_rope_head_dim).transpose(1, 2)
        kv = (self.kv_b_proj(self.kv_a_layernorm(compressed_kv))
            .view(batch, seq_len, self.num_heads, self.qk_nope_head_dim + self.v_head_dim)
            .transpose(1, 2))
        k_nope, value_states = torch.split(kv, [self.qk_nope_head_dim, self.v_head_dim], dim=-1)
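        # Note: the transposes below only match the [batch, heads, seq_len, dim] buffers
        # if apply_rope (not shown here) returns tensors laid out as [batch, seq_len, heads, dim]
        q_pe, k_pe = apply_rope(q_pe, k_pe, freqs_cis)
        k_pe = k_pe.transpose(2, 1)
        q_pe = q_pe.transpose(2, 1)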
        query_states = k_pe.new_empty(batch, self.num_heads, seq_len, self.q_head_dim)
        query_states[:, :, :, : self.qk_nope_head_dim] = q_nope
        query_states[:, :, :, self.qk_nope_head_dim :] = q_pe
        key_states = k_pe.new_empty(batch, self.num_heads, seq_len, self.q_head_dim)
        key_states[:, :, :, : self.qk_nope_head_dim] = k_nope
        key_states[:, :, :, self.qk_nope_head_dim :] = k_pe
        # Scale by the query/key width per head (q_head_dim), the dimension the dot product runs over
        attn_mtx = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.q_head_dim)
        attn_mtx = attn_mtx + mask[:, :, :seq_len, :seq_len]
        attn_mtx = torch.nn.functional.softmax(attn_mtx.float(), dim=-1).type_as(key_states)
        attn_mtx = self.attn_dropout(attn_mtx)
        output = torch.matmul(attn_mtx, value_states)  # (batch, n_head, seq_len, v_head_dim)
        output = output.transpose(1, 2).contiguous().view(batch, seq_len, self.num_heads * self.v_head_dim)
        output = self.o_proj(output)
        output = self.res_dropout(output)
        return output
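
The `MLA` class reads its sizes from a `DeepseekConfig` object and normalizes the low-rank projections with `RMSNorm`, neither of which is shown in this article. The sketch below is one plausible minimal version of both. The field names are taken from the attributes the class accesses; the default values, the `eps` argument, and the small test sizes are hypothetical and chosen only so the example runs quickly.

```python
from dataclasses import dataclass

import torch


@dataclass
class DeepseekConfig:
    # Field names follow the attributes read by MLA; the values are placeholders
    d_model: int = 64
    num_heads: int = 4
    dropout: float = 0.0
    q_lora_rank: int = 32
    kv_lora_rank: int = 16
    qk_nope_head_dim: int = 16
    qk_rope_head_dim: int = 8
    v_head_dim: int = 16


class RMSNorm(torch.nn.Module):
    """Root-mean-square normalization with a learned per-channel scale."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize in float32, then cast back to the input dtype
        x32 = x.float()
        variance = x32.pow(2).mean(dim=-1, keepdim=True)
        normed = x32 * torch.rsqrt(variance + self.eps)
        return (normed * self.weight).type_as(x)


cfg = DeepseekConfig()
norm = RMSNorm(cfg.kv_lora_rank)
y = norm(torch.randn(3, cfg.kv_lora_rank))
rms = y.pow(2).mean(dim=-1).sqrt()
print(y.shape)  # torch.Size([3, 16])
```

With the scale initialized to ones, every output row has a root mean square close to 1, which is the property the two `*_layernorm` layers in `MLA` rely on. The rotary helper `apply_rope` is left out here because its expected tensor layout cannot be determined from the article alone.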

Summary

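This article introduced DeepSeek-V2, a powerful open-source Mixture-of-Experts (MoE) language model whose architecture makes training and inference more economical and efficient. Its DeepSeekMoE feed-forward layers rely on two strategies, fine-grained expert segmentation and shared expert isolation, which markedly increase expert specialization. The article also covered Multi-head Latent Attention (MLA), an improved attention mechanism that uses low-rank joint key-value compression and decoupled rotary position embeddings to reduce the model's memory footprint and computation at inference time.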

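A Mixture-of-Experts (MoE) model combines multiple expert networks, with a gating network choosing which experts process each input, to improve overall performance and generalization. In natural language processing, MoE models are commonly used for large-scale datasets and complex tasks because they expand model capacity while keeping training and inference efficient.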

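The Transformer is one of the dominant architectures in natural language processing today; it uses self-attention to capture long-range dependencies in sequence data. DeepSeek-V2 builds on it with an MoE architecture, which further increases the model's expressive power and flexibility.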

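Multi-head attention is a key component of the Transformer: it lets the model attend to different parts of a sequence at the same time, which improves its ability to capture information. MLA refines this mechanism by caching a compressed latent vector in place of the full keys and values.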