Motivation
The previous posts walking through large-model code from input to output all covered plain transformer architectures; this time I take on MoE. My version is based on the Deepseek-MOE-16B-chat code. Compared with deepseek-llm-7b-chat, the changes to the model structure are not large; the adjustments are mainly in the MoE part. Please read the previous post first; content that is exactly the same is not repeated here.
Model structure
# Note: transformers==4.46.0 is used here
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
model_name = "/usr/downloads/deepseek-moe-16b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id
messages = [
{"role": "user", "content": "你是谁?"}
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
# ' 我是DeepSeek Chat,一个基于DeepSeek大语言模型开发的智能AI机器人。我可以回答各种问题,包括但不限于科学、文化、历史、技术、生活等方面的问题。如果您有任何问题,欢迎随时向我提问。'
"""
DeepseekForCausalLM(
(model): DeepseekModel(
(embed_tokens): Embedding(102400, 2048)
(layers): ModuleList(
(0): DeepseekDecoderLayer(
(self_attn): DeepseekAttention(
(q_proj): Linear(in_features=2048, out_features=2048, bias=False)
(k_proj): Linear(in_features=2048, out_features=2048, bias=False)
(v_proj): Linear(in_features=2048, out_features=2048, bias=False)
(o_proj): Linear(in_features=2048, out_features=2048, bias=False)
(rotary_emb): DeepseekRotaryEmbedding()
)
(mlp): DeepseekMLP(
(gate_proj): Linear(in_features=2048, out_features=10944, bias=False)
(up_proj): Linear(in_features=2048, out_features=10944, bias=False)
(down_proj): Linear(in_features=10944, out_features=2048, bias=False)
(act_fn): SiLU()
)
(input_layernorm): DeepseekRMSNorm()
(post_attention_layernorm): DeepseekRMSNorm()
)
(1-27): 27 x DeepseekDecoderLayer(
(self_attn): DeepseekAttention(
(q_proj): Linear(in_features=2048, out_features=2048, bias=False)
(k_proj): Linear(in_features=2048, out_features=2048, bias=False)
(v_proj): Linear(in_features=2048, out_features=2048, bias=False)
(o_proj): Linear(in_features=2048, out_features=2048, bias=False)
(rotary_emb): DeepseekRotaryEmbedding()
)
(mlp): DeepseekMoE(
(experts): ModuleList(
(0-63): 64 x DeepseekMLP(
(gate_proj): Linear(in_features=2048, out_features=1408, bias=False)
(up_proj): Linear(in_features=2048, out_features=1408, bias=False)
(down_proj): Linear(in_features=1408, out_features=2048, bias=False)
(act_fn): SiLU()
)
)
(gate): MoEGate()
(shared_experts): DeepseekMLP(
(gate_proj): Linear(in_features=2048, out_features=2816, bias=False)
(up_proj): Linear(in_features=2048, out_features=2816, bias=False)
(down_proj): Linear(in_features=2816, out_features=2048, bias=False)
(act_fn): SiLU()
)
)
(input_layernorm): DeepseekRMSNorm()
(post_attention_layernorm): DeepseekRMSNorm()
)
)
(norm): DeepseekRMSNorm()
)
(lm_head): Linear(in_features=2048, out_features=102400, bias=False)
)
"""
Loading the tokenizer and model is exactly the same as for Deepseek-llm-7b-chat; the model structure, however, has a few changes. Here is a quick rundown of the visible differences:
- To keep the total parameter count in check, hidden_size drops from 4096 to 2048
- Layer 0 is the same as before (a dense MLP)
- Layers 1-27 add MoE
- Each of the 64 routed experts in the MoE is itself an MLP, except the intermediate dimension is not expanded but reduced to 1408
- Each of layers 1-27 also carries a shared expert, again an MLP, which expands the dimension from 2048 to 2816
- To let each token choose which experts to use, there is a Gate layer (it does not show up in the printed architecture, but think about it: the input dimension is 2048 and the output dimension is 64, so the simplest choice is a single linear layer; checked just below)
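As a quick sanity check on that last point, the gate weight can be inspected on the loaded model from the snippet above (this assumes the `model` object is still in scope); given how MoEGate is defined later in this post, it should be a single (64, 2048) matrix.
print(model.model.layers[1].mlp.gate.weight.shape)
# expected: torch.Size([64, 2048])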
Parameter count
A quick look at the parameter count. It can actually be read straight off the model structure; I am just writing it down here.
# total parameter count, about 16.4B
419430400+2048+(4096+16777216+8650752*64+17301504+131072)*27+67239936+16777216+4096=16375728128
# activated parameters are only about 2.8B (each token uses 6 of the 64 routed experts)
419430400+2048+(4096+16777216+8650752*6+17301504+131072)*27+67239936+16777216+4096=2828650496
"""
# 注意一下,所有的线性层都没有bias
# word embedding+lm_head
2048*102400*2=419430400
# 最后一层后面的LN
2048
# 0层有的参数
# LN
2048*2=4096
# self_attn
2048*2048*4=16777216
# MLP
2048*10944*3=67239936
# 1-27层每层都有的参数
# LN
2048*2=4096
# self_attn
2048*2048*4=16777216
# 64个专家都有的
2048*1408*3=8650752
# 共享专家
2048*2816*3=17301504
# gate
2048*64=131072
"""
Unchanged parts
The main flow diagram, the model part, the MLP layer, RMSNorm, the attention part, and the rotary-embedding part are all unchanged; only the Llama prefix in the code becomes Deepseek.
That part is omitted here.
DecoderLayer
Only the initialization differs slightly here; the flow diagram is the same.
class DeepseekDecoderLayer(torch.nn.Module):
"""A single transformer layer.
Transformer layer takes input with size [s, b, h] and returns an
output of the same size.
"""
def __init__(self, config: DeepseekConfig, layer_idx, device=None):
super(DeepseekDecoderLayer, self).__init__()
# Layernorm on the input data.
self.input_layernorm = DeepseekRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
# Self attention.
self.self_attn = DeepseekAttention(config=config, layer_idx=layer_idx)
# Layernorm on the attention output
self.post_attention_layernorm = DeepseekRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
# MLP
# ------------------------------------------------------------------------------
# the only change
# when layer_idx meets the condition, use DeepseekMoE; otherwise use DeepseekMLP
# for this checkpoint that means layers 1 through 27 all use DeepseekMoE
self.mlp = DeepseekMoE(config) if (config.n_routed_experts is not None and \
layer_idx >= config.first_k_dense_replace and layer_idx % config.moe_layer_freq == 0) \
else DeepseekMLP(config)
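For reference, the config fields that drive this branch can be read off the loaded checkpoint. Based on the printed architecture (layer 0 dense, layers 1-27 MoE, 64 routed experts, 6 picked per token, shared expert width 2816), the values I would expect are noted in the comments below; treat them as my reading of the structure rather than as quoted from the config file.
# inspection sketch, assuming the `model` object loaded earlier
cfg = model.config
print(cfg.first_k_dense_replace)# presumably 1: layer 0 keeps the dense MLP
print(cfg.moe_layer_freq)# presumably 1: every layer from index 1 on is MoE
print(cfg.n_routed_experts)# 64
print(cfg.num_experts_per_tok)# 6
print(cfg.n_shared_experts)# presumably 2, since 2816 = 2 * 1408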
MoEGate
class MoEGate(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
self.top_k = config.num_experts_per_tok
self.n_routed_experts = config.n_routed_experts
self.scoring_func = config.scoring_func
self.alpha = config.aux_loss_alpha
self.seq_aux = config.seq_aux
# topk selection algorithm
self.norm_topk_prob = config.norm_topk_prob
self.gating_dim = config.hidden_size
self.weight = nn.Parameter(torch.empty((self.n_routed_experts, self.gating_dim)))
self.reset_parameters()
def reset_parameters(self) -> None:
import torch.nn.init as init
init.kaiming_uniform_(self.weight, a=math.sqrt(5))
def forward(self, hidden_states):
bsz, seq_len, h = hidden_states.shape
### compute gating score
hidden_states = hidden_states.view(-1, h)# bsz*seq_len, h
logits = F.linear(hidden_states, self.weight, None)# weight is (64, 2048), giving (bsz*seq_len, 64)
scores = logits.softmax(dim=-1)
### select top-k experts
topk_weight, topk_idx = torch.topk(scores, k=self.top_k, dim=-1, sorted=False)# bsz*seq_len, 6
### norm gate to sum 1
if self.top_k > 1 and self.norm_topk_prob:
denominator = topk_weight.sum(dim=-1, keepdim=True) + 1e-20
topk_weight = topk_weight / denominator
### expert-level computation auxiliary loss
aux_loss = None
return topk_idx, topk_weight, aux_loss
The whole thing works just like the flow diagram and is fairly straightforward.
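To see the gate in isolation, here is a tiny self-contained run of the same logic with random weights of my own (not the checkpoint's), just to show the shapes involved:
import torch
import torch.nn.functional as F
n_experts, top_k, hidden = 64, 6, 2048
weight = torch.randn(n_experts, hidden)# stand-in for MoEGate.weight
hidden_states = torch.randn(1, 8, hidden)# bsz=1, seq_len=8
flat = hidden_states.view(-1, hidden)# (8, 2048)
scores = F.linear(flat, weight).softmax(dim=-1)# (8, 64)
topk_weight, topk_idx = torch.topk(scores, k=top_k, dim=-1, sorted=False)
print(topk_idx.shape, topk_weight.shape)# torch.Size([8, 6]) for both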
DeepseekMoE
As the flow diagram shows, hidden_states first goes through the gate to produce topk_idx and topk_weight; these two, together with hidden_states, are fed into moe_infer to get y, and y is added to the shared_experts output to give the final result. That flow is clear enough; the tricky part is moe_infer.
class DeepseekMoE(nn.Module):
"""
A mixed expert module containing shared experts.
"""
def __init__(self, config):
super().__init__()
self.config = config
self.num_experts_per_tok = config.num_experts_per_tok# 6
self.experts = nn.ModuleList([DeepseekMLP(config, intermediate_size = config.moe_intermediate_size) for i in range(config.n_routed_experts)])
self.gate = MoEGate(config)
if config.n_shared_experts is not None:
intermediate_size = config.moe_intermediate_size * config.n_shared_experts
self.shared_experts = DeepseekMLP(config=config, intermediate_size = intermediate_size)
def forward(self, hidden_states):# the part you can read straight off the flow diagram
identity = hidden_states
orig_shape = hidden_states.shape
topk_idx, topk_weight, aux_loss = self.gate(hidden_states)
hidden_states = hidden_states.view(-1, hidden_states.shape[-1])# bsz*seq_len,h
flat_topk_idx = topk_idx.view(-1)# bsz*seq_len*6
y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape)
if self.config.n_shared_experts is not None:
y = y + self.shared_experts(identity)
return y
@torch.no_grad()
def moe_infer(self, x, flat_topk_idx, topk_weight):
expert_cache = torch.zeros_like(x)# bsz*seq_len,h
idxs = flat_topk_idx.argsort()# indices that sort the flattened expert ids in ascending order
tokens_per_expert = flat_topk_idx.bincount().cpu().numpy().cumsum(0)
token_idxs = idxs // self.num_experts_per_tok
for i, end_idx in enumerate(tokens_per_expert):
start_idx = 0 if i == 0 else tokens_per_expert[i-1]
if start_idx == end_idx:
continue
expert = self.experts[i]
exp_token_idx = token_idxs[start_idx:end_idx]
expert_tokens = x[exp_token_idx]
expert_out = expert(expert_tokens)
expert_out.mul_(topk_weight[idxs[start_idx:end_idx]])
expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum')
return expert_cache
First, a quick description of what moe_infer is actually doing (see OUTRAGEOUSLY LARGE NEURAL NETWORKS: THE SPARSELY-GATED MIXTURE-OF-EXPERTS LAYER). As in the figure there, it performs the right-hand side of the picture: use the gate's result to find the matching experts for each token, multiply each expert output by its weight, and sum everything up.
If I were doing this myself, I would of course just iterate over tokens. Let's look at the official code first.
import torch
from torch import nn
from transformers import AutoModelForCausalLM
model_name = "/usr/downloads/deepseek-moe-16b-chat"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)
# basic setting: number of experts selected per token
num_experts_per_tok = 6
# the flattened gate output: 8 tokens, each mapped to 6 experts
flat_topk_idx = torch.tensor([12, 36, 39, 56, 58, 40, 6, 9, 26, 44, 50, 55, 8, 26, 30, 32, 63, 16, 6, 16, 28, 36, 49, 48, 30, 35, 57, 59, 61, 18, 28, 35, 40, 54, 55, 63, 2, 21, 56, 59, 63, 52, 23, 32, 33, 38, 39, 25]).to(model.device)
# the weight for each selected expert
topk_weight = torch.randn(48,1, dtype=torch.bfloat16).to(model.device)
# hidden states of the 8 tokens
hidden_states = torch.randn(8,2048, dtype=torch.bfloat16).to(model.device)
# grab the 64 experts of layer 1 from the model
experts = model.model.layers[1].mlp.experts
# the official approach
def moe_infer(x, flat_topk_idx, topk_weight):
expert_cache = torch.zeros_like(x)# bsz*seq_len,h
idxs = flat_topk_idx.argsort()# indices that sort the flattened expert ids in ascending order
tokens_per_expert = flat_topk_idx.bincount().cpu().numpy().cumsum(0)
token_idxs = idxs // num_experts_per_tok
for i, end_idx in enumerate(tokens_per_expert):
start_idx = 0 if i == 0 else tokens_per_expert[i-1]
if start_idx == end_idx:
continue
expert = experts[i]
exp_token_idx = token_idxs[start_idx:end_idx]
expert_tokens = x[exp_token_idx]
expert_out = expert(expert_tokens)
expert_out.mul_(topk_weight[idxs[start_idx:end_idx]])
expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum')
return expert_cache
final = moe_infer(hidden_states, flat_topk_idx, topk_weight)
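Before comparing implementations, it can help to print the bookkeeping that moe_infer builds from these toy inputs; the lines below are just the start of the function pulled out for inspection (the concrete values depend on the flat_topk_idx above):
idxs = flat_topk_idx.argsort()# the 48 slot positions, grouped by expert id
tokens_per_expert = flat_topk_idx.bincount().cpu().numpy().cumsum(0)# cumulative slot count per expert id
token_idxs = idxs // num_experts_per_tok# which of the 8 tokens owns each sorted slot
print(idxs, tokens_per_expert, token_idxs)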
After writing the token-wise version, I ran into something I had never paid attention to before: in bfloat16, the order in which values are added affects the result. The same is true for float16 and float32; integer addition is unaffected.
import torch
a = torch.randn(10,40,dtype=torch.bfloat16)
b = torch.randn(10,40,dtype=torch.bfloat16)
c = torch.randn(10,40,dtype=torch.bfloat16)
aa = a + b + c
bb = a + c + b
aa == bb
"""
tensor([[ True, False, True, True, False, True, True, True, True, False,
False, False, True, True, True, False, True, True, True, True,
True, True, False, True, True, True, True, True, True, True,
False, False, True, True, True, True, True, True, False, True]
....
"""
-0.2090-0.3535+0.8047
# 0.24219999999999997
-0.2090+0.8047-0.3535
# 0.24220000000000003
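For contrast, the same check with integer tensors, where addition is exact and the order genuinely does not matter (a quick check of my own):
a = torch.randint(0, 100, (10, 40))
b = torch.randint(0, 100, (10, 40))
c = torch.randint(0, 100, (10, 40))
print(((a + b + c) == (a + c + b)).all())# tensor(True)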
Token-wise traversal: the wrong way
# the summation order differs, so the result does not match exactly
def moe_infer_1(x, flat_topk_idx, topk_weight):
expert_cache = torch.zeros_like(x)
for i, token_vec in enumerate(x):# iterate over tokens
for j in range(6):# iterate over the 6 experts of this token
print (flat_topk_idx[i*6+j])
expert = experts[flat_topk_idx[i*6+j]]# find the expert this slot points to
expert_out = expert(token_vec)# the expert processes this token
expert_cache[i] = expert_cache[i] + expert_out.mul_(topk_weight[i*6+j])# accumulate the weighted expert outputs
return expert_cache
final1 = moe_infer_1(hidden_states, flat_topk_idx, topk_weight)# the summation order differs, so the result does not match exactly
final == final1
# not all True: the results do not match exactly
# tokens 1 and 5 do match, because their expert ids happen to already be in ascending order
Token-wise traversal: the correct way
def moe_infer_2(x, flat_topk_idx, topk_weight):
expert_cache = torch.zeros_like(x)
for i, token_vec in enumerate(x):
aa, bb = flat_topk_idx[i*6:(i+1)*6].sort()# sort this token's expert ids first: aa holds the sorted ids, bb their original positions
for j in range(6):
expert = experts[aa[j]]
expert_out = expert(token_vec)
expert_cache[i] = expert_cache[i] + expert_out.mul_(topk_weight[i*6 + bb[j]])
return expert_cache
final2 = moe_infer_2(hidden_states, flat_topk_idx, topk_weight)# the summation order now matches, so the results agree exactly
final == final2
# True
The official approach
for i, end_idx in enumerate(tokens_per_expert):
start_idx = 0 if i == 0 else tokens_per_expert[i-1]# cumulative slot count up to the previous expert
if start_idx == end_idx:# equal cumulative counts mean no token was routed to this expert
continue
expert = experts[i]# i is the expert index
exp_token_idx = token_idxs[start_idx:end_idx]# row indices of the tokens routed to this expert
expert_tokens = x[exp_token_idx]# gather those tokens' hidden states
expert_out = expert(expert_tokens)# the expert processes its whole batch of tokens at once
expert_out.mul_(topk_weight[idxs[start_idx:end_idx]])# look up each token's weight for this expert and scale the expert output
expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum')# scatter-add each weighted expert output into its token's row
A few notes:
- scatter_reduce_: see the linked write-up on PyTorch's scatter_add_ function (pytorch中的scatter_add_函数解析); a small hand-checkable demo follows this list
- Looping over experts launches at most 64 batched expert forward passes no matter how many tokens there are, while looping over tokens launches 6*num_of_token tiny forward passes, one per (token, expert) pair; when the token count is large, the token-wise version is far slower
- I have tried to make this as clear as I can, but the logic is still worth thinking through carefully
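To make the bookkeeping concrete, here is a hand-checkable toy case of my own (2 tokens, 2 experts in total, 2 experts per token) plus a minimal scatter_reduce_ call of the kind used above; the numbers are made up purely for illustration.
import torch
# toy routing: token 0 -> experts (1, 0), token 1 -> experts (0, 1)
flat_topk_idx = torch.tensor([1, 0, 0, 1])
num_experts_per_tok = 2
idxs = flat_topk_idx.argsort()# slot positions grouped by expert id (expert 0's slots first, then expert 1's)
tokens_per_expert = flat_topk_idx.bincount().cumsum(0)# tensor([2, 4]): slots 0-1 belong to expert 0, slots 2-3 to expert 1
token_idxs = idxs // num_experts_per_tok# maps each sorted slot back to its token (here every expert sees both tokens)
# scatter_reduce_ with reduce='sum': rows of src are summed into the rows of dst named by index
dst = torch.zeros(2, 4)
src = torch.ones(4, 4)
index = token_idxs.view(-1, 1).repeat(1, 4)
dst.scatter_reduce_(0, index, src, reduce='sum')
print(dst)# each token accumulates two rows of ones, so every entry is 2.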
The last bit of code
With that, the code is essentially complete; all that is left is a handful of imports and the code that actually calls the model.
import copy
import re
import torch
import torch.nn.functional as F
from torch import nn
from typing import Optional, Tuple, List, Callable
from transformers.modeling_utils import PreTrainedModel
from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS
from configuration_deepseek import DeepseekConfig# note the small change here: it is deepseek, not Deepseek
from transformers.activations import ACT2FN
import math
Save all of this as deepseek.py, drop it in alongside the deepseek-MOE-16b-chat code, and it can be used directly:
from deepseek import *
from transformers import AutoTokenizer
model_path = "/usr/downloads/deepseek-moe-16b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = DeepseekForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
messages = [
{"role": "user", "content": "你是谁?"}
]
response = model.chat(tokenizer, messages)
print (response)
The result is about 550 lines of code, down from roughly 1500 in the original, so the small goal of halving the code is basically met (success).
Closing words
I already knew how MoE computes its output (weight times MLP output), but I had never thought through the concrete implementation. Deepseek gave me a good excuse to dig into it, and it is a clever idea. The code went from 1500+ lines down to under 600, simple and easy to follow. Success.