Motivation
This time the model is qwen1.5-14b-chat, and I started from my Llama code and modified it. There is actually no difference at all in the model structure, yet it still took me a day to get the input and output to line up, mostly because I was careless and missed the cause at the beginning. Please read the previous post first; content that is exactly the same is not repeated here.
Introduction to Qwen1.5
Model structure
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
device = "cuda" # the device to load the model onto
model_name = "/usr/downloads/Qwen1.5-14B-Chat"
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "介绍一下LLM"
messages = [
{"role": "system", "content": "你是一个有用的AI助手"},
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)
# to make sure the response has no randomness
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.do_sample = False
print(model.generation_config.repetition_penalty)
# 1.05
#####################################
# model.generation_config.repetition_penalty = 1  # I finally found the cause
#####################################
generated_ids = model.generate(
model_inputs.input_ids,
max_new_tokens=512
)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print (response)
#LLM,全称为Master of Laws(法学硕士),是一种研究生学位,专为那些已经拥有法学学士学位的人设计,旨在提供更深入的法律知识和专业技能。LLM课程通常针对特定的法律领域,如国际法、商业法、知识产权法、税法、环境法等,或者可以是法学理论研究。
#在LLM课程中,学生会进行高级法律分析、案例研究、研讨会和可能的独立研究项目,以深化他们对法律体系的理解,提升批判性思考和解决问题的能力。许多LLM课程也鼓励学生参与教学助理工作或学术研究,以便积累实践经验并为未来的职业生涯做准备。
#LLM学位在全球范围内被广泛认可,对于希望在法律界进一步发展,如成为律师、法律顾问、法律教授或从事国际法律事务的人来说,这是一个重要的学术里程碑。完成LLM后,学生通常需要通过国家或地区的律师资格考试才能执业。
"""
Qwen2ForCausalLM(
(model): Qwen2Model(
(embed_tokens): Embedding(152064, 5120)
(layers): ModuleList(
(0-39): 40 x Qwen2DecoderLayer(
(self_attn): Qwen2Attention(
(q_proj): Linear(in_features=5120, out_features=5120, bias=True)
(k_proj): Linear(in_features=5120, out_features=5120, bias=True)
(v_proj): Linear(in_features=5120, out_features=5120, bias=True)
(o_proj): Linear(in_features=5120, out_features=5120, bias=False)
(rotary_emb): Qwen2RotaryEmbedding()
)
(mlp): Qwen2MLP(
(gate_proj): Linear(in_features=5120, out_features=13696, bias=False)
(up_proj): Linear(in_features=5120, out_features=13696, bias=False)
(down_proj): Linear(in_features=13696, out_features=5120, bias=False)
(act_fn): SiLU()
)
(input_layernorm): Qwen2RMSNorm((5120,), eps=1e-06)
(post_attention_layernorm): Qwen2RMSNorm((5120,), eps=1e-06)
)
)
(norm): Qwen2RMSNorm((5120,), eps=1e-06)
(rotary_emb): Qwen2RotaryEmbedding()
)
(lm_head): Linear(in_features=5120, out_features=152064, bias=False)
)
"""
- Loading the model and tokenizer is no different from the previous models.
- The model's default generation mode is sampling.
- The main difference is model.generation_config.repetition_penalty.
- This parameter comes from generation_config.json and belongs to the sampling/decoding strategy, an area we have not touched before (see the sketch after this list).
- The model contains two rotary_emb modules; by default the model-level one (listed after norm in the dump above) is used, and the one inside self_attn is not. My code never initializes the rotary_emb inside self_attn, because I did not notice it at first; I only noticed it on 2025.03.21.
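A quick way to see where repetition_penalty comes from: it is not hard-coded in the modeling file, it is read from generation_config.json in the checkpoint directory. A minimal sketch using the local path from this post:
from transformers import GenerationConfig

gen_config = GenerationConfig.from_pretrained("/usr/downloads/Qwen1.5-14B-Chat")
print(gen_config.repetition_penalty)  # 1.05 for this checkpoint
print(gen_config)                     # the remaining decoding defaults (sampling parameters etc.)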
Parameter count
"""
# 注意一下,只有三个线性层有bias
# word embedding+lm_head
152064*5120*2=1557135360
# 最后一层后面的LN
5120
# 0层有的参数
# LN
5120*2=10240
# self_attn weight
5120*5120*4=104857600
# self_attn bias 注意这里o_proj没有bias
5120*3=15360
# MLP
5120*13696*3=210370560
"""
1557135360+5120+(10240+104857600+15360+210370560)*40=14167290880
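The hand count can be cross-checked in code, assuming the model from the first snippet is already loaded; the second part simply repeats the arithmetic above:
# direct count over the loaded weights (rotary_emb holds only buffers, so it adds nothing)
total = sum(p.numel() for p in model.parameters())
print(total)  # expected to match the hand count: 14167290880

# the same number recomputed from the config values used above
vocab, d, d_ff, n_layers = 152064, 5120, 13696, 40
embed_and_head = vocab * d * 2          # word embedding + lm_head
per_layer = (2 * d                      # input/post-attention RMSNorm weights
             + 4 * d * d                # q/k/v/o projection weights
             + 3 * d                    # q/k/v biases (o_proj has none)
             + 3 * d * d_ff)            # gate/up/down projections
print(embed_and_head + d + n_layers * per_layer)  # 14167290880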
RepetitionPenaltyLogitsProcessor
At each decoding step, tokens that have already appeared are penalized: their scores are lowered so that the model is steered towards decoding tokens it has not produced yet.
class RepetitionPenaltyLogitsProcessor():
    r"""
    [`LogitsProcessor`] that prevents the repetition of previous tokens through a penalty. This penalty is applied at
    most once per token. Note that, for decoder-only models like most LLMs, the considered tokens include the prompt.

    In the original [paper](https://arxiv.org/pdf/1909.05858.pdf), the authors suggest the use of a penalty of around
    1.2 to achieve a good balance between truthful generation and lack of repetition. To penalize and reduce
    repetition, use `penalty` values above 1.0, where a higher value penalizes more strongly. To reward and encourage
    repetition, use `penalty` values between 0.0 and 1.0, where a lower value rewards more strongly.

    Args:
        penalty (`float`):
            The parameter for repetition penalty. 1.0 means no penalty. Above 1.0 penalizes previously generated
            tokens. Between 0.0 and 1.0 rewards previously generated tokens.
    """

    def __init__(self, penalty: float):
        self.penalty = penalty

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # gather the current score of every token that already appears in input_ids
        score = torch.gather(scores, 1, input_ids)
        # if score < 0 then repetition penalty has to be multiplied to reduce the token probabilities
        score = torch.where(score < 0, score * self.penalty, score / self.penalty)
        # write the penalized scores back in place
        scores.scatter_(1, input_ids, score)
        return scores
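A tiny worked example (made-up logits) of what the processor does to tokens that already appear in input_ids:
import torch

# batch of 1, vocabulary of 5; tokens 1 and 3 are already in the context
scores = torch.tensor([[2.0, 1.0, 0.5, -1.0, 3.0]])
input_ids = torch.LongTensor([[1, 3]])

processor = RepetitionPenaltyLogitsProcessor(penalty=1.05)
penalized = processor(input_ids, scores.clone())
print(penalized)
# token 1:  1.0 / 1.05 ≈ 0.952   (positive scores are divided by the penalty)
# token 3: -1.0 * 1.05 = -1.05   (negative scores are multiplied by the penalty)
# the scores of all other tokens are untouched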
Code changes
In the code, Llama needs to be renamed to Qwen2 everywhere; note that the bias is True for q_proj, k_proj and v_proj, and False for all the other linear layers.
Add the RepetitionPenaltyLogitsProcessor into the chat method:
class Qwen2ForCausalLM(Qwen2PreTrainedModel):
    def chat(self, tokenizer, query):
        ...
        next_token_logits = hidden_states[:, -1, :].float()
        # note: the logits need to be cast to float here
        if self.generation_config.repetition_penalty > 1:
            logits_processor = RepetitionPenaltyLogitsProcessor(self.generation_config.repetition_penalty)
            next_token_scores = logits_processor(input_ids, next_token_logits)
            next_token_logits = next_token_scores
            # modify the logits
        next_tokens = torch.argmax(next_token_logits, dim=-1)
        ...
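To make the flow concrete, here is a rough sketch of the whole greedy loop such a chat method performs. This is an illustration only, assuming a standard HF-style forward that returns .logits; it is not the actual implementation in llama.py (no KV cache, and the stop condition is simplified to a single eos id):
import torch

def greedy_chat_sketch(model, tokenizer, messages, max_new_tokens=512):
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    input_ids = tokenizer([text], return_tensors="pt").input_ids.to(model.device)
    penalty = model.generation_config.repetition_penalty
    processor = RepetitionPenaltyLogitsProcessor(penalty) if penalty > 1 else None
    generated = input_ids
    for _ in range(max_new_tokens):
        logits = model(generated).logits                # (batch, seq_len, vocab); full re-forward, no cache
        next_token_logits = logits[:, -1, :].float()    # cast to float, as in the snippet above
        if processor is not None:
            next_token_logits = processor(generated, next_token_logits)
        next_token = torch.argmax(next_token_logits, dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:  # simplified; the real config may list several eos ids
            break
    return tokenizer.decode(generated[0, input_ids.shape[1]:], skip_special_tokens=True)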
One last bit of code
""" PyTorch Qwen2 model. """
import copy
import re
import torch
import torch.nn.functional as F
from torch import nn
from typing import Optional, Tuple, List, Callable
from transformers.modeling_utils import PreTrainedModel
from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS
from configuration_qwen2 import Qwen2Config
from transformers.activations import ACT2FN
Save this code as llama.py, put it alongside the Qwen1.5-14B-Chat files, and it can be used normally:
from llama import *
from transformers import AutoTokenizer
model_path = "/usr/downloads/Qwen1.5-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = Qwen2ForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
print (model.generation_config)
prompt = "介绍一下LLM"
messages = [
{"role": "system", "content": "你是一个有用的AI助手"},
{"role": "user", "content": prompt}
]
response = model.chat(tokenizer, messages)
print (response)
Closing remarks
This post walked through Qwen1.5-14b-chat from input to output. The model uses the Llama architecture but additionally introduces a logits_processor; later posts will gradually cover the other sampling methods.