Code Walkthrough: Qwen-3-MoE, the Third-Generation Tongyi Model

Latest recommended article published on 2025-05-08 08:30:00

AGI大模型老王

Tags: Qwen-3, Agent, artificial intelligence, AI large models, programmers, large models, large model tutorials

Article link: https://blog.csdn.net/2401_85390073/article/details/147477172

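The Qwen-3 series includes two models:

- Qwen-3, the upgraded successor to Qwen-2
- Qwen-3-MoE, the newest model, built on a MoE architecture

Summary of the improvements:

Qwen-3:
- Attention applies RMSNorm to the query and key vectors (QK RMSNorm), the same idea LLaMA-4 uses.

Qwen-3-MoE:
- The FFN uses a MoE architecture, and the MoE block can be enabled every k layers.
- The MoE block combines the outputs of the Top-K experts as a weighted sum.
- Training uses the auxiliary load-balancing loss from Switch Transformer.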

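1. Qwen-3 Model Architecture

Let's first recap the Qwen-2 model structure, then look at what Qwen-3 changes:

- GQA (grouped-query attention)
- SwiGLU activation function
- RoPE positional encoding
- QKV bias in the attention projections
- RMSNorm normalization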

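1.1. Attention Module Differences

1.1.1. QK Pre-Normalization (QK RMSNorm)

Qwen3: Qwen3Attention applies a dedicated normalization layer (Qwen3RMSNorm) to the query and key vectors produced by the linear projections. In the forward method, the queries and keys each pass through their own norm layer before rotary position embedding (RoPE) is applied:

    self.q_norm = Qwen3RMSNorm(self.head_dim, eps=config.rms_norm_eps)
    self.k_norm = Qwen3RMSNorm(self.head_dim, eps=config.rms_norm_eps)

Qwen2: Qwen2Attention does not apply any such pre-normalization to the queries and keys. It obtains the query, key, and value from the linear layers and applies RoPE directly (via apply_rotary_pos_emb). QK normalization is therefore a new addition in Qwen3; normalizing per head keeps the scale of the attention logits bounded, which is generally done to stabilize training.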
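
To make the effect of QK normalization concrete, here is a minimal pure-Python sketch. It is not the Transformers implementation: it uses no PyTorch, assumes a unit weight, and the helper names `rms_norm` and `dot` as well as the toy vectors are made up for illustration. It shows that once the query and key are RMS-normalized, their dot product barely depends on the scale of the raw projections:

```python
import math

def rms_norm(vec, weight=None, eps=1e-6):
    """RMS-normalize one head vector: weight * x / sqrt(mean(x^2) + eps)."""
    mean_sq = sum(v * v for v in vec) / len(vec)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    if weight is None:  # assumption: unit weight, i.e. a freshly initialized norm layer
        weight = [1.0] * len(vec)
    return [w * v * inv_rms for w, v in zip(weight, vec)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy per-head query/key vectors (head_dim = 4); the values are made up.
q = [0.5, -1.0, 2.0, 0.1]
k = [1.5, 0.3, -0.7, 2.2]

# Simulate projections whose scale has drifted by a factor of 100.
q_big = [100.0 * v for v in q]
k_big = [100.0 * v for v in k]

raw_small = dot(q, k)
raw_big = dot(q_big, k_big)  # 100 * 100 = 10,000 times larger
normed_small = dot(rms_norm(q), rms_norm(k))
normed_big = dot(rms_norm(q_big), rms_norm(k_big))

print(raw_small, raw_big)
print(normed_small, normed_big)  # nearly identical
```

With a unit weight, each normalized vector has length at most sqrt(head_dim), so the magnitude of the normalized dot product cannot exceed head_dim regardless of how large the raw projections become.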

1.1.2. Removing the Bias from the Projection Layers

Qwen3: In Qwen3Attention, every linear layer (q_proj, k_proj, v_proj, o_proj) takes its bias setting from config.attention_bias, which defaults to no bias:

    self.q_proj = nn.Linear(config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.attention_bias)

Qwen2: In Qwen2Attention, the query, key, and value projections are hard-coded to bias=True, while the output projection o_proj uses bias=False:

    self.q_proj = nn.Linear(config.hidden_size, config.num_attention_heads * self.head_dim, bias=True)
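
As a rough illustration of what dropping the bias changes, the sketch below counts the attention projection parameters with and without bias. The dimensions are hypothetical placeholders, not taken from any released checkpoint, and `attention_param_count` is a helper introduced here only for illustration. Following the snippets above, `qkv_bias=True, o_bias=False` mimics the Qwen2 setup, and both flags set to `False` mimics the Qwen3 default:

```python
def attention_param_count(hidden_size, num_heads, num_kv_heads, head_dim,
                          qkv_bias, o_bias):
    """Count weights (plus optional biases) of q_proj, k_proj, v_proj, o_proj."""
    q_out = num_heads * head_dim
    kv_out = num_kv_heads * head_dim  # GQA: fewer key/value heads than query heads
    weights = hidden_size * q_out + 2 * hidden_size * kv_out + q_out * hidden_size
    biases = 0
    if qkv_bias:
        biases += q_out + 2 * kv_out
    if o_bias:
        biases += hidden_size
    return weights + biases

# Hypothetical dimensions, chosen only to make the arithmetic easy to follow.
dims = dict(hidden_size=2048, num_heads=16, num_kv_heads=4, head_dim=128)
qwen2_style = attention_param_count(**dims, qkv_bias=True, o_bias=False)
qwen3_style = attention_param_count(**dims, qkv_bias=False, o_bias=False)

print(qwen3_style)                # 10485760
print(qwen2_style - qwen3_style)  # 3072
```

Under these assumed dimensions the bias terms account for a tiny share of the attention parameters, so the change matters for the architecture rather than for model size.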

1.2. MLP

The default is still the SwiGLU structure (an MLP with no bias).
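
For reference, a SwiGLU MLP computes down_proj(SiLU(gate_proj(x)) * up_proj(x)). The following is a minimal pure-Python sketch of that computation for a single token. The helper names (`silu`, `matvec`, `swiglu_mlp`) and the tiny weight matrices are made up for illustration and are not the library code:

```python
import math

def silu(x):
    """SiLU / swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(matrix, vec):
    """Bias-free linear layer: one output per row of `matrix`."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def swiglu_mlp(x, w_gate, w_up, w_down):
    """down_proj(silu(gate_proj(x)) * up_proj(x)), with no bias anywhere."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

# Made-up weights: hidden_size = 2, intermediate_size = 3.
w_gate = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
w_up = [[0.3, 0.1], [-0.2, 0.5], [0.6, -0.4]]
w_down = [[0.2, -0.1, 0.3], [0.05, 0.4, -0.2]]

out = swiglu_mlp([1.0, 2.0], w_gate, w_up, w_down)
zero_out = swiglu_mlp([0.0, 0.0], w_gate, w_up, w_down)
print(out)
print(zero_out)  # no bias, so a zero input maps to a zero output
```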

1.3. RMSNorm

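This part is the same as in Qwen2. As a reminder, the L2Norm in LLaMA-4 that we analyzed a few days ago is essentially an RMSNorm as well (without a learnable weight). Qwen3RMSNorm is also what Qwen3 uses for QK-Norm, so the change to the Attention module follows the same idea as LLaMA-4.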

    import torch
    from torch import nn

    class Qwen3RMSNorm(nn.Module):
        def __init__(self, hidden_size, eps=1e-6):
            """
            Qwen3RMSNorm is equivalent to T5LayerNorm
            """
            super().__init__()
            self.weight = nn.Parameter(torch.ones(hidden_size))
            self.variance_epsilon = eps

        def forward(self, hidden_states):
            input_dtype = hidden_states.dtype
            # Compute the statistics in float32 for numerical stability
            hidden_states = hidden_states.to(torch.float32)
            variance = hidden_states.pow(2).mean(-1, keepdim=True)
            hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
            # Cast back to the input dtype before applying the learnable scale
            return self.weight * hidden_states.to(input_dtype)

        def extra_repr(self):
            return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
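
Written as a formula, the forward pass above amounts to the following, where d is the size of the last dimension, w is the learnable weight, and epsilon is variance_epsilon (the symbols are introduced here for illustration):

```latex
\mathrm{RMSNorm}(x) = w \odot \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^{2} + \epsilon}}
```

As a quick worked example with w = 1 and epsilon ignored: for x = (3, 4), the mean square is (9 + 16) / 2 = 12.5, its square root is about 3.536, and the output is roughly (0.849, 1.131). Unlike LayerNorm, no mean is subtracted.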

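2. Qwen-3-MoE Model Architecture

Taking Qwen-3 as the baseline, we now analyze the architectural changes in Qwen-3-MoE:

2.1. Attention Module

The Attention changes are identical to those in Qwen-3.

2.2. MoE FFN Module

The MoE block uses an nn.Linear(config.hidden_size, config.num_experts, bias=False) gate to decide which experts process each token. Experts are selected as follows:

- Experts are chosen with a Top-K strategy.
- The outputs of the selected Top-K experts (current_hidden_states) are weighted by their routing probabilities and summed.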
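
The two steps above can be sketched for a single token in plain Python. This is a simplified illustration, not the library code: the helper names (`softmax`, `route_token`, `moe_output`), the toy "experts", and the logits are all made up, and the top-k probabilities are renormalized as when norm_topk_prob is enabled:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k, norm_topk_prob=True):
    """Return [(expert_index, weight), ...] for one token."""
    probs = softmax(router_logits)
    # Indices of the top_k largest probabilities.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    weights = [probs[i] for i in chosen]
    if norm_topk_prob:
        total = sum(weights)
        weights = [w / total for w in weights]
    return list(zip(chosen, weights))

def moe_output(x, router_logits, experts, top_k):
    """Weighted sum of the selected experts' outputs for one token."""
    out = [0.0] * len(x)
    for expert_idx, weight in route_token(router_logits, top_k):
        expert_out = experts[expert_idx](x)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out

# Toy "experts": each one just scales its input by a different constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

logits = [2.0, 0.5, 1.0, -1.0]
routing = route_token(logits, top_k=2)
y = moe_output([1.0, -1.0], logits, experts, top_k=2)
print(routing)  # experts 0 and 2, with weights that sum to 1
print(y)
```

Only the selected experts are evaluated for a token, which is what keeps the compute cost of a MoE layer well below that of running every expert.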

    import torch
    import torch.nn.functional as F
    from torch import nn

    # Qwen3MoeMLP is the per-expert SwiGLU MLP defined in the same modeling file.
    class Qwen3MoeSparseMoeBlock(nn.Module):
        def __init__(self, config):
            super().__init__()
            self.num_experts = config.num_experts          # total number of experts
            self.top_k = config.num_experts_per_tok        # each token uses only top_k experts
            self.norm_topk_prob = config.norm_topk_prob    # whether to renormalize the selected top_k probabilities

            # gating: computes, for every token, logits (routing scores) over the num_experts experts
            self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False)
            # experts: each expert is an independent MLP, stored in a ModuleList
            self.experts = nn.ModuleList(
                [Qwen3MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)]
            )

        def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
            batch_size, sequence_length, hidden_dim = hidden_states.shape
            # (batch_size, seq_len, hidden_dim) -> (batch_size*seq_len, hidden_dim)
            hidden_states = hidden_states.view(-1, hidden_dim)

            # ===================== 1) Compute the gating scores =====================
            # router_logits: (batch_size * sequence_length, num_experts)
            router_logits = self.gate(hidden_states)
            # softmax over the routing scores gives a probability distribution over the experts
            routing_weights = F.softmax(router_logits, dim=1, dtype=torch.float)
            # keep only the top_k experts and their probabilities
            routing_weights, selected_experts = torch.topk(routing_weights, self.top_k, dim=-1)
            # norm_topk_prob: renormalize the selected top_k probabilities so that they sum to 1
            if self.norm_topk_prob:
                routing_weights /= routing_weights.sum(dim=-1, keepdim=True)
            # cast the probabilities back to the input dtype (e.g. FP16/BF16)
            routing_weights = routing_weights.to(hidden_states.dtype)

            # ===================== 2) Initialize the output tensor =====================
            final_hidden_states = torch.zeros(
                (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype, device=hidden_states.device
            )

            # ===================== 3) Build a one-hot mask marking which experts each token is routed to =====================
            # expert_mask: (num_experts, top_k, batch_size*sequence_length)
            #   - one_hot first yields (batch_size*sequence_length, top_k, num_experts)
            #   - permute(2, 1, 0) moves the expert dimension to the front, which suits the
            #     `for expert_idx in ...` loop below
            expert_mask = torch.nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0)

            # ===================== 4) Compute each expert's output and accumulate =====================
            for expert_idx in range(self.num_experts):
                expert_layer = self.experts[expert_idx]
                # idx, top_x record which tokens are routed to expert_idx
                # expert_mask[expert_idx] is a (top_k, batch_size*sequence_length) 0/1 mask
                # torch.where returns (row indices, column indices):
                # "idx" is the position within the top_k slots, "top_x" is the token index
                idx, top_x = torch.where(expert_mask[expert_idx])

                # gather the hidden_states of the matching tokens
                current_state = hidden_states[None, top_x].reshape(-1, hidden_dim)
                # run them through this expert's MLP and scale by the gating weight for this expert
                current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None]

                # accumulate this expert's output into final_hidden_states
                # index_add_ works in place: it adds current_hidden_states onto final_hidden_states[top_x]
                final_hidden_states.index_add_(0, top_x, current_hidden_states.to(hidden_states.dtype))

            # finally reshape back to (batch_size, sequence_length, hidden_dim)
            final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim)
            return final_hidden_states, router_logits
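
The one_hot / permute / torch.where bookkeeping in the loop can be hard to picture. The sketch below reproduces the same grouping with plain Python lists. The function `group_by_expert` and the routing choices are hypothetical and serve only to show which (slot, token) pairs each expert ends up with:

```python
def group_by_expert(selected_experts, num_experts):
    """Group routing decisions by expert.

    selected_experts[token][slot] is the expert chosen for that token in
    top-k slot `slot`. Returns, per expert, a list of (slot, token) pairs,
    mirroring the (idx, top_x) pair produced by torch.where in the loop.
    """
    groups = [[] for _ in range(num_experts)]
    top_k = len(selected_experts[0])
    # Slot-major order matches the row-major order of a (top_k, num_tokens) mask.
    for slot in range(top_k):
        for token, choices in enumerate(selected_experts):
            groups[choices[slot]].append((slot, token))
    return groups

# 4 tokens, top_k = 2, 3 experts; the choices are made up.
selected = [[0, 2], [1, 0], [2, 1], [0, 1]]
groups = group_by_expert(selected, num_experts=3)
for expert_idx, pairs in enumerate(groups):
    print(expert_idx, pairs)
# 0 [(0, 0), (0, 3), (1, 1)]
# 1 [(0, 1), (1, 2), (1, 3)]
# 2 [(0, 2), (1, 0)]
```

Each expert then processes its tokens in one batch, and the (token, slot) pair selects the matching routing weight, as in routing_weights[top_x, idx, None].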

2.3. Enabling MoE Every decoder_sparse_step Layers

    if (layer_idx not in config.mlp_only_layers) and (
        config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0
    ):
        self.mlp = Qwen3MoeSparseMoeBlock(config)
    else:
        self.mlp = Qwen3MoeMLP(config, intermediate_size=config.intermediate_size)
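
To see which layers this condition selects, the sketch below applies the same rule to a few hypothetical configurations. `moe_layer_indices` is a helper written for this illustration, and the layer counts and expert counts are not taken from any released model:

```python
def moe_layer_indices(num_hidden_layers, num_experts, decoder_sparse_step,
                      mlp_only_layers=()):
    """Return the layer indices that would get a sparse MoE block."""
    return [
        layer_idx
        for layer_idx in range(num_hidden_layers)
        if layer_idx not in mlp_only_layers
        and num_experts > 0
        and (layer_idx + 1) % decoder_sparse_step == 0
    ]

# Hypothetical configurations, chosen only to illustrate the rule.
every_layer = moe_layer_indices(8, num_experts=4, decoder_sparse_step=1)
every_other = moe_layer_indices(8, num_experts=4, decoder_sparse_step=2)
with_dense = moe_layer_indices(8, num_experts=4, decoder_sparse_step=2,
                               mlp_only_layers=(3,))

print(every_layer)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(every_other)  # [1, 3, 5, 7]
print(with_dense)   # [1, 5, 7]
```

With decoder_sparse_step=1 every layer is a MoE layer, and mlp_only_layers forces individual layers back to a dense MLP.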

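Qwen3-MoE additionally provides a load_balancing_loss_func(). In the forward of Qwen3MoeForCausalLM, if output_router_logits=True, this auxiliary loss is computed and added to the main loss, scaled by router_aux_loss_coef.

The design of this auxiliary loss matches that of Switch Transformer. Specifically: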

    from typing import Optional, Tuple, Union

    import torch

    def load_balancing_loss_func(
        gate_logits: Union[torch.Tensor, Tuple[torch.Tensor], None],
        num_experts: Optional[int] = None,
        top_k=2,
        attention_mask: Optional[torch.Tensor] = None,
    ) -> Union[torch.Tensor, int]:
        ...
        if attention_mask is None:
            # Compute the percentage of tokens routed to each experts
            tokens_per_expert = torch.mean(expert_mask.float(), dim=0)

            # Compute the average probability of routing to these experts
            router_prob_per_expert = torch.mean(routing_weights, dim=0)
        else:
            batch_size, sequence_length = attention_mask.shape
            num_hidden_layers = concatenated_gate_logits.shape[0] // (batch_size * sequence_length)

            # Compute the mask that masks all padding tokens as 0 with the same shape of expert_mask
            expert_attention_mask = (
                attention_mask[None, :, :, None, None]
                .expand((num_hidden_layers, batch_size, sequence_length, top_k, num_experts))
                .reshape(-1, top_k, num_experts)
                .to(compute_device)
            )

            # Compute the percentage of tokens routed to each experts
            tokens_per_expert = torch.sum(expert_mask.float() * expert_attention_mask, dim=0) / torch.sum(
                expert_attention_mask, dim=0
            )

            # Compute the mask that masks all padding tokens as 0 with the same shape of tokens_per_expert
            router_per_expert_attention_mask = (
                attention_mask[None, :, :, None]
                .expand((num_hidden_layers, batch_size, sequence_length, num_experts))
                .reshape(-1, num_experts)
                .to(compute_device)
            )

            # Compute the average probability of routing to these experts
            router_prob_per_expert = torch.sum(routing_weights * router_per_expert_attention_mask, dim=0) / torch.sum(
                router_per_expert_attention_mask, dim=0
            )

        overall_loss = torch.sum(tokens_per_expert * router_prob_per_expert.unsqueeze(0))
        return overall_loss * num_experts
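
To get a feel for how this loss behaves, here is a pure-Python sketch of the unmasked branch. It assumes that the elided part of the function computes a softmax over the gate logits, takes the top-k experts, and one-hot encodes them; that assumption, the helper names, and the toy logits are not from the excerpt above. The loss is num_experts * sum(f * P), where f is the fraction of tokens routed to an expert in each top-k slot and P is the mean router probability of that expert:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def load_balancing_loss(gate_logits, num_experts, top_k):
    """Unmasked case: num_experts * sum_k sum_e f[k][e] * P[e].

    f[k][e]: fraction of tokens whose k-th choice is expert e.
    P[e]:    mean router probability assigned to expert e.
    """
    num_tokens = len(gate_logits)
    probs = [softmax(row) for row in gate_logits]
    f = [[0.0] * num_experts for _ in range(top_k)]
    for p in probs:
        ranked = sorted(range(num_experts), key=lambda e: p[e], reverse=True)
        for slot, expert in enumerate(ranked[:top_k]):
            f[slot][expert] += 1.0 / num_tokens
    mean_prob = [sum(p[e] for p in probs) / num_tokens for e in range(num_experts)]
    total = sum(f[k][e] * mean_prob[e] for k in range(top_k) for e in range(num_experts))
    return total * num_experts

# Balanced: each of the 4 tokens prefers a different expert (logits are made up).
balanced = [[3.0, 2.0, 0.0, 0.0], [0.0, 3.0, 2.0, 0.0],
            [0.0, 0.0, 3.0, 2.0], [2.0, 0.0, 0.0, 3.0]]
# Collapsed: every token prefers experts 0 and 1.
collapsed = [[3.0, 2.0, 0.0, 0.0]] * 4

balanced_loss = load_balancing_loss(balanced, num_experts=4, top_k=2)
collapsed_loss = load_balancing_loss(collapsed, num_experts=4, top_k=2)
print(balanced_loss)   # about 2.0, i.e. top_k
print(collapsed_loss)  # about 3.73
```

In this toy setup, perfectly balanced routing gives a loss equal to top_k, and the loss grows as routing collapses onto a few experts, which is what pushes the router toward spreading tokens evenly.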
