ChatGLM2-6B Fine-Tuning in Practice

weixin_43870390

Last modified on 2023-10-07 10:44:56


Tags: chatgpt

First published on 2023-09-07 17:25:46

Link to this post: https://blog.csdn.net/weixin_43870390/article/details/132737336


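Open-source projects that can fine-tune ChatGLM2-6B

https://github.com/THUDM/ChatGLM2-6B
https://github.com/hiyouga/ChatGLM-Efficient-Tuning
https://github.com/hiyouga/LLaMA-Efficient-Tuning
The first is the official ChatGLM2 repository. It supports fine-tuning with P-Tuning, and the model can be tried out from the command line, a web demo, or an API.
The second and third repositories are by the same developer. The second is the earlier project and can only fine-tune ChatGLM and ChatGLM2.
The third is the version the author currently maintains. It can fine-tune many models, including ChatGLM2, LLaMA, LLaMA-2, Baichuan, BLOOM, BLOOMZ, InternLM, Qwen, and Falcon, and it supports several training stages: pre-training, supervised fine-tuning, reward modeling, PPO training, and DPO training. Since it is the most complete of the three, I used the third repository for the fine-tuning experiments here.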

Downloading the ChatGLM2-6B model

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/THUDM/chatglm2-6b

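The command above downloads only the model implementation (code and configuration); the large weight files are skipped.
Then download the model parameter files manually and use them to replace the placeholder files in the local chatglm2-6b directory.
This method comes from CSDN: https://blog.csdn.net/qq_41185868/article/details/131427832
The files total about 12 GB.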
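
After a clone with GIT_LFS_SKIP_SMUDGE=1, each large file is a small text stub (a Git LFS pointer) until it is replaced by the real file. The sketch below is one way to list which files still need replacing. The function name find_lfs_pointers is my own, and the directory name is assumed to be the chatglm2-6b clone from the command above; the only fact relied on is that a Git LFS pointer file starts with the line "version https://git-lfs.github.com/spec/v1".

```python
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir):
    """Return the names of files in model_dir that are still Git LFS pointer stubs."""
    pointers = []
    for path in sorted(Path(model_dir).iterdir()):
        if not path.is_file():
            continue
        with path.open("rb") as f:
            head = f.read(len(LFS_HEADER))
        if head == LFS_HEADER:
            pointers.append(path.name)
    return pointers


if __name__ == "__main__":
    # Assumed location of the clone; adjust to your own path.
    model_dir = Path("chatglm2-6b")
    if model_dir.is_dir():
        for name in find_lfs_pointers(model_dir):
            print("still a pointer, replace with the downloaded file:", name)
```

An empty result means every weight file has been replaced and the directory can be passed as model_name_or_path.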

Fine-tuning ChatGLM2-6B with LLaMA-Efficient-Tuning

Download the code

git clone https://github.com/hiyouga/LLaMA-Efficient-Tuning

Install the dependencies

cd LLaMA-Efficient-Tuning
pip install -r requirements.txt
pip install -U Jinja2 -i https://mirrors.aliyun.com/pypi/simple

Data preparation
Data format:

{
  "instruction": "Paraphrase the following sentence",
  "input": "The scientists conducted a rigorous experiment",
  "output": "A thorough investigation was executed by the researchers.",
  "history": []
}

Notes: every dataset must have an instruction field; a record without one is invalid.
There are four fields in total: instruction, input, output, history.
For pt, only instruction is used.
For sft, instruction and output are used.
For rm, instruction is used, and the length of output must be greater than 1.
For ppo and dpo, only instruction is used.
Internally the code maps these four fields to prompt, query, response, history by default.
If the columns in your dataset are named differently, use columns to map them to prompt, query, response, history. To do this, add your dataset and its mapping to data/dataset_info.json, for example:

"mofang": {
  "file_name": "mofang.json",
  "file_sha1": "e9432eee04a1ce3495df215b4287e8cf48005bb0",
  "columns": {
    "prompt": "input",
    "query": "",
    "response": "target",
    "history": ""
  }
}

mofang is the value of the dataset argument when running src/train_bash.py; it names the training dataset. The entry above maps mofang to the file data/mofang.json, maps the input column in mofang.json to prompt in the training data, and maps target to the internal response.
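
To make the column mapping concrete, here is a minimal sketch of what such a remapping does to one record. It is not the repository's loader code; remap_record and its defaults are hypothetical names of mine, and only the columns dictionary mirrors the dataset_info.json entry above.

```python
def remap_record(record, columns):
    """Map one raw record to the internal prompt/query/response/history fields.

    `columns` has the same shape as the "columns" entry in dataset_info.json:
    internal field name -> column name in the user's file ("" means unused).
    """
    # Built per call so every record gets its own empty history list.
    defaults = {"prompt": "", "query": "", "response": "", "history": []}
    mapped = {}
    for internal_name, default in defaults.items():
        source_name = columns.get(internal_name, "")
        if source_name:
            mapped[internal_name] = record.get(source_name, default)
        else:
            mapped[internal_name] = default
    return mapped


columns = {"prompt": "input", "query": "", "response": "target", "history": ""}
record = {"input": "你好", "target": "您好，我是AI智能助手，请问有什么需要帮助？"}
print(remap_record(record, columns))
```

The input column ends up as prompt and target as response, while the unmapped query and history fields fall back to empty values.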

There is also a system field. It can be set in the template or passed in at training time.
At training time: set it with system_prompt.
In the template: edit src/llmtuner/extras/template.py and change the system value in the code below.

register_template(
    name="chatglm2",
    prefix=[
        {"token": "[gMASK]"},
        {"token": "sop"},
        "{{system}}"
    ],
    prompt=[
        "[Round {{idx}}]\n\n问：{{query}}\n\n答："
    ],
    system="",
    sep=[
        "\n\n"
    ]
)

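Template
The code above is a template. During dataset preprocessing, the training code uses the template to turn each training sample into a model-specific input format. LLaMA-Efficient-Tuning supports a large number of models, and each model expects its own input format, so each model has its own template.
Only the ChatGLM2 template is covered here.
The pattern:
Final model input: bos + prefix items concatenated in order + sep + prompt + response + eos + (sep + bos + prompt + …)
Final model labels: response + eos
Example:
If mofang.json contains

"input": "你好",
"target": "您好，我是AI智能助手，请问有什么需要帮助？"

the actual model input is:

 [Round 0]

问：你好

答：您好，我是AI智能助手，请问有什么需要帮助？

and the actual label is: 您好，我是AI智能助手，请问有什么需要帮助？

input_ids: the id of [gMASK], the id of sop, the ids of sep, the ids of the text above, the eos id
label_ids: the same ids, with every position that belongs to the input part set to -100 and the rest, the response ids, left unchanged

Note: for chatglm2, bos is empty; in the template above sep='\n\n' and system is empty.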
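
The input/label construction described above can be sketched in a few lines. This is an illustration under my own assumptions, not the repository's preprocessing code: build_example is a hypothetical helper, the sep ids are folded into prompt_ids for brevity, and all ids other than 64790/64792 (used here to stand for [gMASK] and sop) are made up.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def build_example(prefix_ids, prompt_ids, response_ids, eos_id):
    """Concatenate ids and mask everything except response + eos in the labels."""
    source_ids = prefix_ids + prompt_ids
    target_ids = response_ids + [eos_id]
    input_ids = source_ids + target_ids
    labels = [IGNORE_INDEX] * len(source_ids) + target_ids
    return input_ids, labels


# Toy ids: the prompt and response ids are invented for the example.
input_ids, labels = build_example([64790, 64792], [11, 12, 13], [21, 22], eos_id=2)
print(input_ids)  # [64790, 64792, 11, 12, 13, 21, 22, 2]
print(labels)     # [-100, -100, -100, -100, -100, 21, 22, 2]
```

Both lists have the same length, so the loss is computed only at the response and eos positions.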

Fine-tuning the model

Single GPU

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --model_name_or_path pre_model/chatglm2-6b \
    --do_train \
    --dataset mofang \
    --template chatglm2 \
    --finetuning_type lora \
    --lora_target query_key_value \
    --output_dir temp_sft_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --save_steps 2 \
    --learning_rate 5e-5 \
    --num_train_epochs 8.0 \
    --plot_loss \
    --fp16 \
    --max_source_length 2048 \
    --max_samples 100

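Different models need different template and lora_target values; see Supported Models on the LLaMA-Efficient-Tuning front page for the values for each model.
max_samples is for debugging: it limits how many samples are used as the training set. Remove it for real training.
stage selects the training method. This repository supports pt, sft, rm, ppo, and dpo; pick the one you need. I used supervised fine-tuning here.
Set model_name_or_path to the local path of the model downloaded in the download step above.
dataset is the name of your own dataset; mine is mofang.

Multiple GPUs
First install deepspeed:
pip install deepspeed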

nohup deepspeed --num_gpus 4 --master_port=9901 src/train_bash.py \
    --deepspeed ds_config.json \
    --stage sft \
    --model_name_or_path pre_model/chatglm2-6b \
    --do_train \
    --dataset mofang \
    --template chatglm2 \
    --finetuning_type lora \
    --lora_target query_key_value \
    --output_dir mofang_sft_train1_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 100 \
    --save_steps 4000 \
    --learning_rate 5e-5 \
    --num_train_epochs 10.0 \
    --plot_loss \
    --bf16 \
    --max_source_length 2048 \
     > train_20230906.log 2>&1 &

ds_config.json is the configuration the author provides on the LLaMA-Efficient-Tuning front page.
To choose which GPUs are used: export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
Multi-node command:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
deepspeed --num_gpus 8 \
    --num_nodes 2 \
    --master_addr Addr \
    --master_port Port \
    --hostfile Hostfile \
    src/train_bash.py \
    --stage sft \
    --model_name_or_path "/mnt/model/Llama-2-70b-hf/" \
    --do_train \
    --dataset zr_test_math \
    --finetuning_type lora \
    --output_dir /mnt/output/70B/ \
    --overwrite_cache \
    --overwrite_output_dir \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1 \
    --plot_loss \
    --fp16 \
    --lora_target q_proj,v_proj \
    --deepspeed "/mnt/deepspeed/deepspeed.json"
(This command is taken from someone else's issue.)
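
The batch-size flags in the commands above combine into one effective global batch size per optimizer step. A small sketch of that arithmetic, assuming plain data parallelism where every GPU processes its own micro-batch (effective_batch_size is my own helper name):

```python
def effective_batch_size(per_device_batch, grad_accum_steps, gpus_per_node, num_nodes=1):
    """Samples contributing to one optimizer step under data parallelism."""
    return per_device_batch * grad_accum_steps * gpus_per_node * num_nodes


print(effective_batch_size(4, 4, 1))     # single-GPU command: 16
print(effective_batch_size(8, 4, 4))     # 4-GPU deepspeed command: 128
print(effective_batch_size(8, 1, 8, 2))  # 2 nodes x 8 GPUs: 128
```

This matters when moving between the single-GPU and multi-GPU commands: the learning rate stays at 5e-5 in all of them while the effective batch size changes from 16 to 128.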

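Exporting the model

python src/export_model.py \
    --model_name_or_path pre_model/chatglm2-6b \
    --template chatglm2 \
    --finetuning_type lora \
    --checkpoint_dir temp_sft_checkpoint \
    --output_dir mofang_export_model_temp

LoRA training produces only the LoRA parameters. They have to be merged with the original chatglm2-6b weights; the merged result is a single model that can be used directly for later testing.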
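
Conceptually, merging folds the low-rank update into each adapted weight matrix: W' = W + (alpha / r) · B · A, which is the standard LoRA formulation. The sketch below shows this on a toy 2x2 matrix in pure Python. It does not reproduce what export_model.py does internally; matmul and merge_lora are names I made up for the illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the merged weight."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (out x in)
A = [[1.0, 2.0]]              # LoRA A (rank x in), rank = 1
B = [[0.5], [0.0]]            # LoRA B (out x rank)
print(merge_lora(W, A, B, alpha=2, rank=1))  # [[2.0, 2.0], [0.0, 1.0]]
```

After the merge there is no separate adapter left, which is why the exported directory can be loaded like an ordinary model.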

Testing the model

Either of the following works.
Option 1: test with the exported model

python src/web_demo.py \
    --model_name_or_path mofang_export_model_temp \
    --template chatglm2 \
    --finetuning_type lora

Option 2: test with the checkpoint combined with the original chatglm2-6b model

python src/web_demo.py \
    --model_name_or_path pre_model/chatglm2-6b \
    --template chatglm2 \
    --finetuning_type lora \
    --checkpoint_dir temp_sft_checkpoint

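Other notes from the learning process

https://github.com/hiyouga/ChatGLM-Efficient-Tuning
Hands-on training blog post: https://article.juejin.cn/post/7254032949230256189
Note: this project is no longer maintained; work has moved to https://github.com/hiyouga/LLaMA-Efficient-Tuning

Training can be done from a web UI; the entry point is train_web.py
Provides FastEdit for editing language models to inject up-to-date knowledge: https://github.com/hiyouga/FastEdit
The demo API is aligned with OpenAI's, so a fine-tuned model can be plugged into any ChatGPT-based application
Supports fine-tuning ChatGLM2-6B
Supports 4-bit LoRA training with --quantization_bit 4
Supports training LLaMA and BLOOM models, see: https://github.com/hiyouga/LLaMA-Efficient-Tuning
ChatGLM-Efficient-Tuning: an efficient ChatGLM-6B fine-tuning tool based on PEFT
Supports a dev (validation) set
Supports RLHF (Reinforcement Learning with Human Feedback)
Supports merging the weights of LoRA-trained models
Supports training 4-bit and 8-bit quantized models
Supports resuming training from a checkpoint
Supports training on several datasets at once

The following efficient fine-tuning methods are currently supported

LoRA: fine-tunes only the low-rank adapters
P-Tuning V2: fine-tunes only the prefix encoder
Freeze Tuning: fine-tunes only the fully connected layers in the last few layers
Full fine-tuning: fine-tunes all model parameters

https://github.com/hiyouga/LLaMA-Efficient-Tuning

Supports resuming training (transformers 4.31.0)
Supports RoPE scaling to extend the context length of LLaMA models
Supports DPO training for instruction-tuned models ✨❓
Supports training the Qwen-7B model
Supports streaming data loading
Two instruction-tuned 13B models have been released: https://huggingface.co/hiyouga/baichuan-13b-sft ✨❓
https://huggingface.co/hiyouga/Llama-2-Chinese-13b-chat ✨❓
Supports LLaMA-2 models
Provides a web UI for training, evaluation, and inference
Supports training the Baichuan-13B model
Provides the FastEdit tool for injecting new knowledge into language models: https://github.com/hiyouga/FastEdit
Supports training Falcon-7B/40B models
Provides a reproducible example of training a chat model: https://huggingface.co/hiyouga/baichuan-7b-sft ✨❓
The demo API is aligned with OpenAI's, so the trained model can be plugged into any ChatGPT-based application
Supports training the Baichuan-7B model
Supports quantized training and inference
Supports training BLOOM and BLOOMZ models

Qwen: Alibaba Tongyi Qianwen, 7B
LLaMA: Meta, 7B and 13B
ChatGLM: Tsinghua, 6B
Baichuan: Baichuan Intelligence
ChatGPT: OpenAI, 175B

What is the difference between pt and sft? The data: pt uses unsupervised data, sft uses supervised data.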

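Basics and concepts

DDP DP
torch.nn.DataParallel (DP)
torch.nn.parallel.DistributedDataParallel (DDP)
DP is easier to use than DDP (less code), but DDP supports multi-node multi-GPU training, trains faster, and balances load across GPUs better. Prefer DDP.
BPE
https://www.datalearner.com/blog/1051671195034710
Tokens and words are usually converted at about 0.75 words per token; for example, a maximum of 2048 tokens corresponds to roughly 2048 × 0.75 = 1536 words.
For Chinese, some multi-character phrases map to a single token, so the token sequence is shorter than the character sequence.
For English, the token count is larger than the word count (words ≈ 0.75 × tokens) but smaller than the character count.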
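
The token-to-word rule of thumb above is simple arithmetic. A short sketch, assuming the 0.75 ratio applies (it is only a rough estimate for English text; the helper names are mine):

```python
import math

WORDS_PER_TOKEN = 0.75  # rule-of-thumb ratio from the text, English only


def approx_words(num_tokens, words_per_token=WORDS_PER_TOKEN):
    """Estimate how many English words fit in a token budget."""
    return int(num_tokens * words_per_token)


def approx_tokens(num_words, words_per_token=WORDS_PER_TOKEN):
    """Estimate how many tokens a given number of English words needs."""
    return math.ceil(num_words / words_per_token)


print(approx_words(2048))   # 1536
print(approx_tokens(1500))  # 2000
```

For a real dataset, counting with the actual tokenizer is more reliable than any fixed ratio, especially for Chinese text.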
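Some background on large models
https://zhuanlan.zhihu.com/p/624918286
top-k, top-p, temperature
https://www.cnblogs.com/deali/p/llm-2.html

A commonly seen default top-p is 0.7 or 0.8. The same reasoning applies as before: set it too low and the output becomes too rigid; set it too high and the model drifts off with no restraint.
In general, the longer and clearer the prompt, the better and more confident the model's output, so temperature can be raised a little. Conversely, with a short, vague prompt, a high temperature makes the output very unstable.
The decoding strategy can be one of:
- greedy decoding: always pick the highest-scoring token; useful, but it has drawbacks
- top-k: keep the k highest-scoring tokens as candidates, then sample among them according to their likelihood scores
- top-p: the candidate list is dynamic; keep the highest-scoring tokens until their cumulative probability reaches p, then sample among them
- top-k and top-p introduce randomness into token selection, giving other high-scoring tokens a chance to be picked, unlike greedy decoding, which always picks the top one.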
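
The three strategies can be compared on a toy distribution. This is a minimal sketch over a hand-written probability table, not the sampling code of any particular library; the function names and the example vocabulary are my own.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


def sample(probs, rng=random):
    """Draw one token according to the given probabilities."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]


probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
print(max(probs, key=probs.get))         # greedy decoding always picks "the"
print(sorted(top_k_filter(probs, 2)))    # candidates: ['a', 'the']
print(sorted(top_p_filter(probs, 0.7)))  # candidates: ['a', 'the']
print(sample(top_p_filter(probs, 0.7)))  # random draw among the candidates
```

With top-p the number of candidates adapts to the shape of the distribution: raising p to 0.9 here admits "cat" as a third candidate, while top-k always keeps exactly k.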

Other links:
https://baijiahao.baidu.com/s?id=1770262029151452182&wfr=spider&for=pc
Long-context LLM evaluation

Two commercial LLMs, GPT-3.5-Turbo-16K and Claude-1.3-100K, are very stable on long-context evaluation tasks and beat every open-source model. Worse, the leading Chinese model ChatGLM2-6B degrades sharply beyond 6K on the long topic-retrieval task, with accuracy dropping to 0 at 10K, 13K, and 16K. LongChat does well for inputs within 16K.

https://zhuanlan.zhihu.com/p/643856746
chatglm2-6b fine-tuning

https://blog.csdn.net/qq_41185868/article/details/131427832
chatglm2-6b introduction, installation, and usage

https://baijiahao.baidu.com/s?id=1769835821474647681&wfr=spider&for=pc
chatglm2 evaluation

https://www.heywhale.com/mw/project/64984a7b72ebe240516ae79c
Deployment and fine-tuning

https://baijiahao.baidu.com/s?id=1771926480948343579&wfr=spider&for=pc
Chain-of-thought (CoT) explained in detail ✅ step-by-step reasoning; zero-shot vs few-shot: few-shot includes human-written examples, zero-shot does not

https://zhuanlan.zhihu.com/p/644106525 ✅
The stages of LLM training ‼️

https://zhuanlan.zhihu.com/p/647843635
Ways to download Hugging Face models: 1. manual download 2. clone 3.

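Questions and answers
(1) What is a base model? A base model is the pretrained model, without SFT or RLHF, so it retains more knowledge and its benchmark numbers may be somewhat better. ✅
ChatGLM2-6B (base)
The default ChatGLM2-6B release is the chat model

The ChatGLM2-6B base model's context length was extended from 2K to 32K; the chat stage was trained with an 8K context, and there is also a longer 32K-context version. Other changes: a more open license, faster inference, stronger results on a range of benchmarks, and improvements along several dimensions including mathematical logic, knowledge reasoning, and long-document understanding. The techniques behind these improvements are the hybrid GLM objective function (performance), Flash Attention (longer context), and Multi-Query Attention (faster inference).

(2) What are instruction tuning and alignment tuning? They correspond to SFT (supervised fine-tuning) and RLHF alignment, respectively. The former improves or unlocks an LLM's capabilities; the latter brings the LLM in line with human values and preferences. ✅
(3) What does ICL mean? In-context learning; chain-of-thought is a special case of ICL. ✅
(4) When input text is converted to tokens, does it get shorter or longer? Shorter.
(5) What are bf16 and fp16? bf16 (bfloat16, Brain Floating Point) is a 16-bit floating-point format that lowers numeric precision to reduce the compute and power needed for tensor multiplication. FP32 is single precision, the default in machine learning; FP16 is half precision. Using BF16 or FP16 can roughly double throughput and halve memory use, but the two trade precision differently: BF16 has a wider representable range (8 exponent bits) but lower mantissa precision (7 bits); FP16 has a narrower range (5 exponent bits) but higher mantissa precision (10 bits). For details see: https://blog.csdn.net/hellochenlian/article/details/132010077 ✅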
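
The range-versus-precision trade-off between the two 16-bit formats can be seen with the standard library alone. In this sketch fp16 uses the struct module's half-precision format, and bf16 (8 exponent bits, 7 mantissa bits) is emulated by keeping only the top 16 bits of a float32. Truncation is a simplification I chose to keep the code short; real hardware usually rounds to nearest.

```python
import struct


def to_fp16(x):
    """Round-trip x through IEEE half precision (5 exponent bits, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x):
    """Emulate bfloat16 by truncating a float32 to its upper 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


# Near 1.0, fp16 keeps more digits than bf16.
print(to_fp16(1.001), to_bf16(1.001))

# A large value fits in bf16 (coarsely) but is out of range for fp16.
print(to_bf16(70000.0))
try:
    to_fp16(70000.0)
except (OverflowError, struct.error):
    print("70000 does not fit in fp16 (largest finite value is 65504)")
```

This is the practical reason bf16 is often preferred for training when the hardware supports it: overflow is much less likely, at the cost of coarser values.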

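(6) How should PRE_SEQ_LEN (the soft prompt length) be understood, and how does it differ from max_source_length? ✅
https://www.yht7.com/news/263257 explains the PRE_SEQ_LEN length ‼️‼️ good article
Do pre_seq_len and max_source_length mean roughly the same thing, or can one value be set with reference to the other?

Not quite. They mean different things. pre_seq_len is the soft prompt length, that is, the number of trainable prefix positions that P-Tuning v2 adds to the model. max_source_length is the maximum length, in tokens, of the input text, and anything beyond it is truncated. The input text can contain the instruction plus other content such as context or dialogue history, so max_source_length should be large enough to hold the complete instruction; otherwise the model never sees all of it and cannot generate a correct output. The soft prompt is not part of that text, so pre_seq_len is chosen separately.

How does this distinction show up during training, and is the instruction handled specially? ❓ The current repository does not have this parameter.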

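(7) Does P-Tuning v2 freeze all of the model's parameters? According to the explanation below, training updates only the PrefixEncoder parameters, which are saved separately; the original model is left unchanged. ✅
With P-Tuning v2, training saves only the parameters of the PrefixEncoder, which is one component of the transformer:
--model_name_or_path THUDM/chatglm2-6b
--ptuning_checkpoint $CHECKPOINT_PATH
(8) Using your own dataset: specify prompt_column and response_column; for dialogue data also specify history_column, and the chat history is concatenated automatically. Note that content beyond the input length max_source_length is truncated. ✅
https://github.com/THUDM/ChatGLM2-6B/blob/main/ptuning/README.md
(9) How LoRA training and P-Tuning training work: https://www.yht7.com/news/263257
https://article.juejin.cn/post/7254032949230256189
1) P-Tuning: inserts trainable continuous vectors (prompts) at the input layer of the pretrained model as task-specific information, then fine-tunes only those vectors while freezing the rest of the pretrained model. This reduces the number of parameters and the amount of data needed for fine-tuning and improves efficiency and generalization, but it may also reduce the model's interactivity and generation quality.
2) LoRA: injects trainable low-rank matrices (Low-Rank Adaptation) into each layer of the pretrained model to capture the low-rank changes needed by the downstream task, then fine-tunes only those matrices while freezing the rest of the pretrained model. This reduces the number of parameters and the amount of computation, improves fine-tuning efficiency and inference speed, and preserves generation quality.
3) Finetune: fine-tunes all parameters of the pretrained model to fit the downstream task. This makes full use of the pretrained model's knowledge, but needs more compute and data and can lead to overfitting or catastrophic forgetting.

Freeze: parameter freezing. Part of the original model's parameters are frozen and only the remaining part is trained, so that a large model can be trained on a single GPU, or without tensor parallelism (TP) or pipeline parallelism (PP).
P-Tuning: before the input embedding layer, the prompt is turned into an additional learnable embedding layer. (I don't fully understand this part either.)
LoRA: adds extra low-rank matrices in parallel with selected parameters (weight matrices) of the large language model and, during training, trains only those added matrices while freezing all other parameters. When the rank is much smaller than the original parameter dimensions, the added low-rank matrices contain very few parameters. Tuning for a downstream task then trains only a small number of parameters yet still achieves good results.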
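
How small the added matrices are is easy to quantify: for a weight of shape d_out x d_in, LoRA with rank r trains r x (d_in + d_out) parameters instead of d_in x d_out. The sketch below works this out for one matrix. The 4096 x 4096 shape and rank 8 are illustrative numbers I picked, not the actual shape of ChatGLM2's query_key_value layer.

```python
def full_param_count(d_in, d_out):
    """Parameters in the frozen weight matrix."""
    return d_in * d_out


def lora_param_count(d_in, d_out, rank):
    """Parameters in the two LoRA matrices A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


d_in, d_out, rank = 4096, 4096, 8  # illustrative sizes only
full = full_param_count(d_in, d_out)
lora = lora_param_count(d_in, d_out, rank)
print(full)                 # 16777216
print(lora)                 # 65536
print(100.0 * lora / full)  # 0.390625 (percent of the full matrix)
```

Under these assumed sizes, the trainable parameters for that matrix are well under one percent of the original, which is why LoRA checkpoints are so much smaller than the base model.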

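(10) How P-Tuning and LoRA training work and how they are implemented, mainly the config ❓
(11) Among known LLMs, which are autoregressive and which are not? ❓ PrefixLM vs CausalLM
T5 and ChatGLM are prefix LMs; the others are causal LMs, and the ChatGPT series is a typical causal LM.
(12) ✅
[gMASK]: a long blank of random length at the end of a sentence that provides the prefix context.
tokenizer.encode outputs [gMASK, sop, actual text tokens]
https://zhuanlan.zhihu.com/p/641063879

Build inputs and outputs strictly according to the official prompt ("note that [Round 1] matters and must not be removed"):
Input: "[Round 1]\n\n问：{}\n\n答："
Output: "{}"
Input ids: [gMASK, sop, input tokens, gMASK, sop]
Output ids: [output tokens, EOS]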
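
A small helper makes the prompt format above harder to get wrong. This sketch follows the quoted single-turn format and extends it to multi-turn history by numbering rounds from 1 and separating turns with "\n\n"; the multi-turn part and the name build_prompt are my assumptions based on that format, so check them against the tokenizer you actually use.

```python
def build_prompt(query, history=None):
    """Format a query (plus optional (question, answer) history) in the Round style."""
    history = history or []
    prompt = ""
    for i, (old_query, response) in enumerate(history):
        prompt += "[Round {}]\n\n问：{}\n\n答：{}\n\n".format(i + 1, old_query, response)
    prompt += "[Round {}]\n\n问：{}\n\n答：".format(len(history) + 1, query)
    return prompt


print(repr(build_prompt("你好")))
print(repr(build_prompt("今天天气怎么样", history=[("你好", "您好")])))
```

The first call reproduces the single-turn input string quoted above, with the query substituted for {}.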

(13) https://zhuanlan.zhihu.com/p/614508046?utm_id=0
(14) Using the tokenizer ✅
https://blog.csdn.net/Wang_Dou_Dou_/article/details/127360500

https://huggingface.co/docs/tokenizers/python/latest/api/reference.html?highlight=padding#tokenizers.Tokenizer.padding

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("pre_model/chatglm2-6b", use_fast=False, padding_side="left", trust_remote_code=True)

from tokenization_chatglm import ChatGLMTokenizer
tokenizer = ChatGLMTokenizer.from_pretrained(".")
tokenizer("爱你")
{'input_ids': [64790, 64792, 30910, 34893], 'attention_mask': [1, 1, 1, 1], 'position_ids': [0, 1, 2, 3]}
tokenizer.decode([64790, 64792, 30910, 13, 13, 790, 30951, 517, 30910, 30940, 30996])
'\n\n [Round 0]'
tokenizer.get_vocab()
tokenizer.get_command('[gMASK]')
64790

tokenizer.get_command('sop')
64792
tokenizer.get_command('<eos>')
2
tokenizer.get_prefix_tokens()
[64790, 64792]
tokenizer.special_tokens
{'<bos>': 1, '<eos>': 2, '<pad>': 0}
tokenizer.convert ???
With add_special_tokens=False, encode does not put 64790, 64792 (that is, [gMASK] sop) at the start.

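(15) What does padding left mean? Isn't padding usually on the right? ❓
(16) How should skip_special_tokens=False be understood? ✅
It controls whether the tokenizer skips special tokens when decoding.
https://blog.csdn.net/qq_28790663/article/details/117073917
(17) attention_mask all 1s? ❓
(18) Finally understood the ChatGLM2 template ✅
Every dataset must have an instruction field; a record without one is invalid.
There are four fields in total: instruction, input, output, history.
You can also define your own mapping with columns.
Internally they are mapped to prompt, query, response.
For pt, only instruction is used.
For sft, instruction and output are used.
For rm, instruction is used, and the length of output must be greater than 1.
For ppo and dpo, only instruction is used.
Preprocessing uses the template to produce the model-specific format.
system can be passed in from two places.
The template assembles the training data in this order: first the prefix, whose items are concatenated in order (including system if the prefix contains it); then the separator; then the prompt, where the input may be wrapped with fixed leading and trailing text; then the response + eos. Together these form one data pair. For multi-turn data this repeats: separator, bos, prompt, response, eos, and so on.

Three items are produced in the end:
input_ids: contains both the question and the answer
attention_mask: all 1s
labels: -100 for the question part, otherwise identical to input_ids
To print the dataset's column names: dataset.column_names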
(19) How to train on multiple GPUs ✅

Single GPU:
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --model_name_or_path pre_model/chatglm2-6b \
    --do_train \
    --dataset mofang \
    --template chatglm2 \
    --finetuning_type lora \
    --lora_target query_key_value \
    --output_dir temp_sft_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --save_steps 2 \
    --learning_rate 5e-5 \
    --num_train_epochs 8.0 \
    --plot_loss \
    --fp16 \
    --max_source_length 2048 \
    --max_samples 100

With accelerate:
accelerate config  # configure the environment
accelerate launch src/train_bash.py \
    --stage sft \
    --model_name_or_path pre_model/chatglm2-6b \
    --do_train \
    --dataset mofang \
    --template chatglm2 \
    --finetuning_type lora \
    --lora_target query_key_value \
    --output_dir temp2_sft_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --save_steps 2 \
    --learning_rate 5e-5 \
    --num_train_epochs 4.0 \
    --plot_loss \
    --fp16 \
    --max_source_length 2048 \
    --max_samples 100

With deepspeed:
pip install deepspeed

# add --max_samples 100 to the command below when debugging
nohup deepspeed --num_gpus 4 --master_port=9901 src/train_bash.py \
    --deepspeed ds_config.json \
    --stage sft \
    --model_name_or_path pre_model/chatglm2-6b \
    --do_train \
    --dataset mofang \
    --template chatglm2 \
    --finetuning_type lora \
    --lora_target query_key_value \
    --output_dir mofang_sft_train1_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 100 \
    --save_steps 4000 \
    --learning_rate 5e-5 \
    --num_train_epochs 10.0 \
    --plot_loss \
    --bf16 \
    --max_source_length 2048 \
    > train_20230906.log 2>&1 &

(20) How to export the model ✅

python src/export_model.py \
    --model_name_or_path pre_model/chatglm2-6b \
    --template chatglm2 \
    --finetuning_type lora \
    --checkpoint_dir temp_sft_checkpoint \
    --output_dir mofang_export_model_temp

(21) How to test with the demo ✅

Either of the following works.
python src/web_demo.py \
    --model_name_or_path mofang_export_model_temp \
    --template chatglm2 \
    --finetuning_type lora

python src/web_demo.py \
    --model_name_or_path pre_model/chatglm2-6b \
    --template chatglm2 \
    --finetuning_type lora \
    --checkpoint_dir temp_sft_checkpoint

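(22) How to resume training: just change the number of epochs and training continues from the previous run ✅
(23) Pre-training methods for large models ❓
https://zhuanlan.zhihu.com/p/625896377?utm_id=0&wd=&eqid=b0a62b010001ac39000000036482d751
(24) deepspeed ❓
https://zhuanlan.zhihu.com/p/630734624?utm_id=0
https://blog.csdn.net/qq_44193969/article/details/132612837
(25) When predictions often repeat themselves, raise the temperature slightly. The higher the temperature, the flatter the probability distribution. ✅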
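
The flattening effect of temperature can be checked numerically: logits are divided by the temperature before the softmax, so a higher temperature narrows the gaps between them. A minimal sketch with made-up logits (the function names are mine); entropy is used as a measure of how flat the distribution is.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; larger means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [4.0, 2.0, 1.0]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

As the temperature rises, the top token's probability falls and the entropy grows, so tokens other than the most likely one get sampled more often, which is what breaks repetition loops.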
(26) What is Vicuna? ✅
A conversational AI model trained on top of LLaMA; examples of such models are Alpaca (Stanford) and Vicuna (Berkeley)
(27) With longer contexts, GPU memory use grows a lot. Ways to reduce it: gradient checkpointing and Flash Attention
FSDP can also reduce GPU memory use
(29) What are PPO training and DPO training? ❓
PPO: reinforcement learning
(30) What do the various length arguments mean, and how should they be adjusted for your own dataset? ✅
--model_max_length MODEL_MAX_LENGTH  tokenizer argument; the default is very large, so it is not a concern
--max_source_length MAX_SOURCE_LENGTH  maximum length of the input
The maximum total input sequence length after
--max_target_length MAX_TARGET_LENGTH  maximum length of the target
The maximum total output sequence length after
--group_by_length [GROUP_BY_LENGTH]
length together when batching. (default: False)
--length_column_name LENGTH_COLUMN_NAME
Column name with precomputed lengths to use when
grouping by length. (default: length)
--generation_max_length GENERATION_MAX_LENGTH
The max_length to use on each evaluation loop when predict_with_generate=True. Will default to the max_length value of the model configuration. (default: None)
--max_length MAX_LENGTH
The maximum length the generated tokens can have. It can be overridden by max_new_tokens. (default:None)
--length_penalty LENGTH_PENALTY  length penalty
Exponential penalty to the length that is used with
--predict_with_generate [PREDICT_WITH_GENERATE]
Whether to use generate to calculate generative
metrics (ROUGE, BLEU). (default: False)
(31) When the model is very large, its parameters can be partitioned across GPUs with DeepSpeed ZeRO-3
Reference configuration:

{
    "bfloat16": {
        "enabled": false
    },
    "fp16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_fp16_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 1e5,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

(32) QLoRA 4-bit/8-bit training is only supported with accelerate
(33) Introduction to PEFT LoRA
https://blog.csdn.net/weixin_44826203/article/details/129733930
A guide to LoRA tuning for large language models
https://www.bilibili.com/video/BV1yu411L7JN/?spm_id_from=333.337.search-card.all.click&vd_source=6be4e5549b9f25b79411bbf5e0db04ce
(34) chatglm2-6B model introduction and fine-tuning practice (bilibili)
https://www.bilibili.com/video/BV1x34y1A7uQ/?spm_id_from=333.1007.tianma.2-2-5.click&vd_source=6be4e5549b9f25b79411bbf5e0db04ce