A Quick Start with InternLM2.5-1.8B: Down-Scaling It to Train Your Own LLM

程序猿李巡天

Published 2024-09-14 18:02:14

Article link: https://blog.csdn.net/m0_59235945/article/details/142264109

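In natural language processing, optimizing and fine-tuning a model is a key step toward better performance. This post walks through the internal structure of InternLM2.5-1.8B and uses working code to show how to load, inspect, shrink, and fine-tune the model.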

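Loading the model and inspecting its structure

We use AutoTokenizer and AutoModel from transformers to load the model and inspect its structure.

Loading the model

InternLM2.5-1.8B serves as the base model.

from transformers import AutoModel, AutoTokenizer

model_name = "./Shanghai_AI_Laboratory/internlm2_5-1_8b"
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Load the model and move it to the GPU
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).to("cuda")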

Inspecting the structure and counting parameters

Calling print(model) shows the structure of InternLM2.5-1.8B:

InternLM2ForCausalLM(
  (model): InternLM2Model(
    (tok_embeddings): Embedding(92544, 2048, padding_idx=2)
    (layers): ModuleList(
      (0-23): 24 x InternLM2DecoderLayer(
        (attention): InternLM2Attention(
          (wqkv): Linear(in_features=2048, out_features=4096, bias=False)
          (wo): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): InternLM2DynamicNTKScalingRotaryEmbedding()
        )
        (feed_forward): InternLM2MLP(
          (w1): Linear(in_features=2048, out_features=8192, bias=False)
          (w3): Linear(in_features=2048, out_features=8192, bias=False)
          (w2): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (attention_norm): InternLM2RMSNorm()
        (ffn_norm): InternLM2RMSNorm()
      )
    )
    (norm): InternLM2RMSNorm()
  )
  (output): Linear(in_features=2048, out_features=92544, bias=False)
)

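The printout shows the main characteristics of InternLM2.5-1.8B:

Embedding layer:

An Embedding layer with a vocabulary of 92,544 and an embedding dimension of 2,048

Main body:

24 InternLM2DecoderLayer blocks

Decoder layer:

Attention (InternLM2Attention):
  A single fused linear layer (wqkv) projects Q, K, and V
  Rotary position embedding (Rotary Embedding)
Feed-forward network (InternLM2MLP):
  Two linear layers that expand the dimension and one that projects it back down
  SiLU activation
Two normalization layers (InternLM2RMSNorm):
  attention_norm and ffn_norm, applied to the input of the attention sublayer and of the feed-forward sublayer respectively (pre-norm)

Output layer:

A linear layer that maps hidden states back to the vocabulary size (92,544)

Other characteristics:

RMSNorm is used instead of LayerNorm
The hidden dimension is 2,048
The intermediate (MLP) dimension is 8,192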

The following function counts the model's parameters:

def print_nparams(model):
    """Calculate the total number of model parameters"""
    nparams = sum(p.numel() for p in model.parameters())
    print(f"The total number of parameters is: {nparams}")

Calling print_nparams(model) gives:

The total number of parameters is: 1889110016

So the model has 1,889,110,016 parameters, roughly 1.89B, which is the "1.8B" in its name.
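As a cross-check, the same total can be reproduced by hand from the shapes in the print(model) output. The sketch below is only an illustration: the helper name count_params is made up for this example, and it assumes each InternLM2RMSNorm holds a single weight vector of size hidden_size and that no layer has a bias (consistent with bias=False in the printout).

```python
# Dimensions read off the print(model) output.
VOCAB_SIZE = 92544
HIDDEN_SIZE = 2048
INTERMEDIATE_SIZE = 8192
QKV_OUT_FEATURES = 4096  # out_features of the fused wqkv projection
NUM_LAYERS = 24


def count_params(num_layers: int = NUM_LAYERS) -> int:
    """Closed-form parameter count for the printed architecture (no biases)."""
    embedding = VOCAB_SIZE * HIDDEN_SIZE      # tok_embeddings
    output_head = HIDDEN_SIZE * VOCAB_SIZE    # output (a separate Linear layer)
    attention = HIDDEN_SIZE * QKV_OUT_FEATURES + HIDDEN_SIZE * HIDDEN_SIZE  # wqkv + wo
    mlp = 3 * HIDDEN_SIZE * INTERMEDIATE_SIZE  # w1, w3, w2
    norms = 2 * HIDDEN_SIZE                    # assumed: one weight vector per RMSNorm
    per_layer = attention + mlp + norms
    final_norm = HIDDEN_SIZE                   # model.norm
    return embedding + output_head + num_layers * per_layer + final_norm


total = count_params()
print(f"{total:,} parameters ({total / 1e9:.2f}B)")
```

The result agrees with the 1889110016 reported by print_nparams(model). About 80% of the parameters sit in the 24 decoder layers and about 20% in the embedding and output layers.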

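Down-scaling the model

Reducing the DecoderLayer stack

The 1.8B model has 24 InternLM2DecoderLayer blocks. Can we down-scale it by dropping some of them?

from copy import deepcopy

# Reference to the original stack of 24 decoder layers
layers = model.model.layers
# Keep the first 5 and the last 5 layers
model.model.layers = deepcopy(layers[:5]) + deepcopy(layers[-5:])
model.model.tok_embeddings = deepcopy(model.model.tok_embeddings)
model.output = deepcopy(model.output)
# Keep the config consistent with the layers that remain
model.config.num_hidden_layers = len(model.model.layers)
print(model.config)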

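model.model.layers = deepcopy(layers[:5]) + deepcopy(layers[-5:]) redefines the model's layer stack: it deep-copies the first 5 and the last 5 of the original layers and concatenates them into a new 10-layer stack. This is a simple form of layer pruning that keeps only part of the original model.
model.model.tok_embeddings = deepcopy(model.model.tok_embeddings) replaces the token embedding layer with a deep copy of itself. The intent is to keep the original embeddings while making sure the module is an independent object that can change on its own during later training.
model.output = deepcopy(model.output) does the same for the output layer: the original structure and weights are kept, and the copy can change independently in later steps.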
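To make the slicing concrete, the sketch below works on plain indices rather than real modules. It shows which of the original 24 layers survive layers[:5] + layers[-5:], and estimates the resulting parameter count. The helper name kept_layer_indices is hypothetical, and the two constants are derived from the shapes in the print(model) output under the assumption of one weight vector per RMSNorm.

```python
def kept_layer_indices(num_layers: int, head: int, tail: int) -> list[int]:
    """Indices of the original layers kept by layers[:head] + layers[-tail:]."""
    indices = list(range(num_layers))
    # indices[num_layers - tail:] equals indices[-tail:] for tail > 0
    # and avoids the tail == 0 pitfall, where indices[-0:] is the whole list.
    return indices[:head] + indices[num_layers - tail:]


# One InternLM2DecoderLayer: wqkv + wo + (w1, w3, w2) + two norm vectors.
PER_LAYER_PARAMS = 2048 * 4096 + 2048 * 2048 + 3 * 2048 * 8192 + 2 * 2048
# tok_embeddings + output + final norm.
NON_LAYER_PARAMS = 2 * 92544 * 2048 + 2048

kept = kept_layer_indices(24, head=5, tail=5)
pruned_total = NON_LAYER_PARAMS + len(kept) * PER_LAYER_PARAMS

print(kept)
print(f"{pruned_total:,} parameters after pruning")
```

With 10 of 24 layers kept, the estimate comes to roughly 1.0B parameters. The embedding and output layers are untouched, so they now account for a much larger share of the model.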

Finally, print(model.config) shows the config:

InternLM2Config {
  "_name_or_path": "/root/share/new_models/Shanghai_AI_Laboratory/internlm2_5-1_8b",
  "architectures": [
    "InternLM2ForCausalLM"
  ],
  "attn_implementation": "eager",
  "auto_map": {
    "AutoConfig": "configuration_internlm2.InternLM2Config",
    "AutoModel": "modeling_internlm2.InternLM2ForCausalLM",
    "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM"
  },
  "bias": false,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 32768,
  "model_type": "internlm2",
  "num_attention_heads": 16,
  "num_hidden_layers": 8,
  "num_key_value_heads": 8,
  "pad_token_id": 2,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 2.0,
    "type": "dynamic"
  },
  "rope_theta": 1000000,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.39.0",
  "use_cache": true,
  "vocab_size": 92544
}

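Reading the config of the down-scaled model:

Model architecture:

Uses the "InternLM2ForCausalLM" architecture
Model type is "internlm2"

Model scale:

Hidden size (hidden_size): 2048
Intermediate size (intermediate_size): 8192
Number of attention heads (num_attention_heads): 16
Number of key-value heads (num_key_value_heads): 8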
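Number of hidden layers (num_hidden_layers): 8 (the pruning snippet above keeps 5 + 5 = 10 layers, so this printout appears to come from a run that kept 8)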
Vocabulary size (vocab_size): 92544
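The head counts also explain why the fused wqkv projection in the model printout has out_features=4096 rather than 3 × 2048. With fewer key-value heads than query heads (grouped-query attention), K and V are narrower than Q. The arithmetic below is a small sketch using only values from the config; the variable names are chosen for this example.

```python
hidden_size = 2048
num_attention_heads = 16
num_key_value_heads = 8

head_dim = hidden_size // num_attention_heads      # size of one attention head
q_dim = num_attention_heads * head_dim             # width of the Q projection
kv_dim = num_key_value_heads * head_dim            # width of K (and of V)
wqkv_out_features = q_dim + 2 * kv_dim             # Q, K, and V fused into one Linear
queries_per_kv_head = num_attention_heads // num_key_value_heads

print(head_dim, wqkv_out_features, queries_per_kv_head)  # prints: 128 4096 2
```

Each key-value head is shared by two query heads, which also halves the size of the KV cache compared with full multi-head attention.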

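Sequence length:

Maximum position embeddings (max_position_embeddings): 32768, so the model can handle very long sequences

Activation function:

Hidden-layer activation (hidden_act): "silu"

Special tokens:

Beginning-of-sequence token ID (bos_token_id): 1
End-of-sequence token ID (eos_token_id): 2
Padding token ID (pad_token_id): 2

Normalization and related settings:

Uses RMSNorm with epsilon (rms_norm_eps): 1e-05
No bias terms (bias): false
KV cache enabled (use_cache): true

Position encoding:

Uses RoPE (Rotary Positional Embedding)
RoPE scaling (rope_scaling): dynamic, with a factor of 2.0
RoPE theta (rope_theta): 1000000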
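To give a feel for what rope_theta and the dynamic scaling factor do, here is a minimal sketch of the RoPE inverse frequencies and of the dynamic NTK rule for enlarging the base on long inputs. The formula in dynamic_ntk_base follows the Llama-style implementation in Hugging Face Transformers; that InternLM2DynamicNTKScalingRotaryEmbedding uses the same rule is an assumption here, and both function names are made up for this example.

```python
def rope_inv_freq(head_dim: int, base: float) -> list[float]:
    """Standard RoPE inverse frequencies: base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def dynamic_ntk_base(base: float, head_dim: int, seq_len: int,
                     max_pos: int, factor: float) -> float:
    """Assumed dynamic NTK rule: the base grows only once seq_len exceeds max_pos."""
    if seq_len <= max_pos:
        return base
    scale = (factor * seq_len / max_pos) - (factor - 1)
    return base * scale ** (head_dim / (head_dim - 2))


HEAD_DIM = 128          # hidden_size / num_attention_heads = 2048 / 16
ROPE_THETA = 1_000_000  # rope_theta from the config
MAX_POS = 32768         # max_position_embeddings from the config
FACTOR = 2.0            # rope_scaling["factor"] from the config

inv_freq = rope_inv_freq(HEAD_DIM, ROPE_THETA)
print(len(inv_freq), inv_freq[0], inv_freq[-1])

for seq_len in (4096, 32768, 65536):
    print(seq_len, dynamic_ntk_base(ROPE_THETA, HEAD_DIM, seq_len, MAX_POS, FACTOR))
```

Within the trained context length the base is left at rope_theta. Beyond it the base is enlarged, which lowers the rotation frequencies so that longer positions stay within a range the model has seen.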

Data type:

bfloat16 precision (torch_dtype): "bfloat16"
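The data type mostly matters for memory. As a rough illustration, the sketch below estimates the space taken by the weights alone for the original 1,889,110,016-parameter model; activations, the KV cache, and optimizer state come on top. The helper weight_gib is a name made up for this example.

```python
NUM_PARAMS = 1_889_110_016  # reported by print_nparams(model) for the full model
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_gib(num_params: int, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024 ** 3


for dtype, nbytes in BYTES_PER_PARAM.items():
    print(f"{dtype}: {weight_gib(NUM_PARAMS, nbytes):.2f} GiB")
```

In bfloat16 the weights take about 3.5 GiB, half of what float32 would need.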

Testing the model

Now that the model has been down-scaled, can the remaining weights still produce sensible output?

from transformers import TextStreamer

prompt = "你好，我是"  # "Hello, I am"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Stream tokens to stdout as they are generated
streamer = TextStreamer(
    tokenizer,
    skip_prompt=True,
    skip_special_tokens=True
)

outputs = model.generate(
    **inputs,
    streamer=streamer,
    use_cache=True,
    max_new_tokens=64,
    do_sample=False
)

The output is:

胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸胸

Saving the current model weights

Let's call the new model InternLM-community and save it while we're at it.

import os

os.makedirs('./data/InternLM-community', exist_ok=True)
model.save_pretrained('./data/InternLM-community')
tokenizer.save_pretrained('./data/InternLM-community')

Fine-tuning the small language model

Getting the data

The dataset is WanJuan2.0.

import openxlab
from openxlab.dataset import download, get, info

# Log in with your AK/SK
openxlab.login(ak='', sk='')

# View dataset information and the file list
info(dataset_repo='OpenDataLab/WanJuanCC')

# Download the whole dataset
get(dataset_repo='OpenDataLab/WanJuanCC', target_path='/root/code/nanointernlm/data/wanjuan')

# Download a single file from the dataset
download(dataset_repo='OpenDataLab/WanJuanCC', source_path='/README.md', target_path='/root/code/nanointernlm/data/wanjuan')

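WanJuan2.0 (WanJuan-CC) is a high-quality English web-text dataset of 1T tokens drawn from CommonCrawl. Evaluated with the Perspective API along several dimensions, WanJuan2.0 shows higher safety than other open-source English CC corpora. Its usefulness is demonstrated by perplexity (PPL) on 4 validation sets and accuracy on 6 downstream tasks. WanJuan2.0's PPL is competitive across the validation sets, especially on sets such as tiny-storys that demand greater language fluency. In a comparison that trained 1B models on datasets of the same type, with validation perplexity and downstream-task accuracy as metrics, WanJuan2.0 significantly improved performance on English text completion and general English ability tasks.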

Fine-tuning

For data processing, write a script that converts the data into the format the later steps need.
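The author's own processing script is not shown, so the following is only a sketch of what such a conversion might look like. It rests on two assumptions that should be checked against the real data and the XTuner documentation: that each WanJuan-CC record is one JSON object per line with the text in a "content" field, and that XTuner's incremental pre-training expects a JSON list of {"conversation": [{"input": "", "output": text}]} items. The function names and the min_chars filter are hypothetical.

```python
import json
from pathlib import Path


def to_pretrain_records(jsonl_lines, text_field="content", min_chars=1):
    """Turn JSONL lines into pre-training records, skipping empty documents."""
    records = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        text = json.loads(line).get(text_field, "").strip()
        if len(text) < min_chars:
            continue
        # Assumed XTuner pre-training layout: empty input, raw text as output.
        records.append({"conversation": [{"input": "", "output": text}]})
    return records


def convert_file(src: Path, dst: Path) -> int:
    """Convert one JSONL file and return the number of records written."""
    with src.open(encoding="utf-8") as f:
        records = to_pretrain_records(f)
    dst.write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")
    return len(records)


sample = [
    '{"id": "a", "content": "Hello world."}',
    "",
    '{"id": "b", "content": "   "}',
]
print(to_pretrain_records(sample))
```

For a corpus of this size, a real script would more likely stream records to disk in shards than build one list in memory; the version above favors readability.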

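Next we bring in our old friend XTuner and run incremental pre-training.

Incremental pre-training improves a large language model by continuing to train it on new data, which updates the model's knowledge and understanding.

The corresponding config and scripts are in the code repository, reachable through the "Read the original" link at the end of the article.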

Testing the small language model

We test the model with TextStreamer.

from transformers import TextStreamer

# model and tokenizer are assumed to be the fine-tuned checkpoint and its tokenizer
prompt = "Hello"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

streamer = TextStreamer(
    tokenizer,
    skip_prompt=True,
    skip_special_tokens=True
)

outputs = model.generate(
    **inputs,
    streamer=streamer,
    use_cache=True,
    max_new_tokens=64,
    do_sample=False
)

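The code first imports the TextStreamer class, a tool for streaming generated text as it is produced.
It sets a simple prompt, "Hello".
The tokenizer converts the prompt into input tensors the model understands and moves them to the model's device (GPU or CPU).
A TextStreamer object controls the streamed output: skip_prompt=True suppresses the original prompt, and skip_special_tokens=True suppresses special tokens.
model.generate() then produces the text. Its main arguments are:

**inputs passes the inputs prepared earlier
streamer is set to the TextStreamer object just created
use_cache=True enables the KV cache to speed up generation
max_new_tokens=64 caps generation at 64 new tokens
do_sample=False disables sampling and uses greedy decoding (always picking the most likely next token)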
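Greedy decoding with a token cap can be illustrated without any model at all. The toy sketch below stands in for what do_sample=False and max_new_tokens control: at each step it takes the most likely next token and stops at the cap or at an end-of-sequence token. The lookup table TOY_TABLE and the functions are invented for this example and have nothing to do with the real model's probabilities.

```python
def greedy_decode(next_token_probs, prompt, max_new_tokens, eos_token=None):
    """Repeatedly append the most likely next token (argmax), up to a cap."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        next_token = max(probs, key=probs.get)  # greedy choice, no sampling
        if next_token == eos_token:
            break
        tokens.append(next_token)
    return tokens[len(prompt):]  # like skip_prompt=True: only the new tokens


# Toy "model": next-token probabilities depend only on the last token.
TOY_TABLE = {
    "Hello": {",": 0.6, "!": 0.4},
    ",": {"I": 0.7, "you": 0.3},
    "I": {"am": 0.9, "was": 0.1},
    "am": {"<eos>": 0.8, "here": 0.2},
}


def toy_model(tokens):
    return TOY_TABLE.get(tokens[-1], {"<eos>": 1.0})


print(greedy_decode(toy_model, ["Hello"], max_new_tokens=64, eos_token="<eos>"))
print(greedy_decode(toy_model, ["Hello"], max_new_tokens=2, eos_token="<eos>"))
```

Because the choice is deterministic, the same prompt always yields the same continuation. This is also why greedy decoding tends to fall into loops once a high-probability phrase repeats.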

The model returns:

, I’m a 20-year-old who is studying Psychology and I’m looking for a job. I’m looking for a job in a medical laboratory. I’m looking for a job in a laboratory. I’m looking for a job in a laboratory. I’m looking for a job in

That does look a lot better than the earlier 胸胸胸.
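One rough way to put a number on "better" is the share of distinct n-grams in the output. This is a crude sketch rather than a proper evaluation: the helper distinct_ratio is made up for this example, the first output is split into characters and the second into whitespace-separated words, and the metric says nothing about whether the text is correct.

```python
def distinct_ratio(tokens, n=1):
    """Fraction of n-grams that are unique; values near 0 mean heavy repetition."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Output of the pruned model before training: one character repeated 64 times.
before = list("胸" * 64)

# Output after incremental pre-training, copied from the result shown earlier.
after = (
    ", I’m a 20-year-old who is studying Psychology and I’m looking for a job. "
    "I’m looking for a job in a medical laboratory. I’m looking for a job in a "
    "laboratory. I’m looking for a job in a laboratory. I’m looking for a job in"
).split()

print(f"before: distinct-1 = {distinct_ratio(before):.3f}")
print(f"after:  distinct-1 = {distinct_ratio(after):.3f}")
print(f"after:  distinct-3 = {distinct_ratio(after, n=3):.3f}")
```

The second output scores far higher than the first, though its repeated "I’m looking for a job in a laboratory" still shows up as a reduced distinct-3 ratio, a typical symptom of greedy decoding with a small model.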
