Baichuan-7B Model Training and Inference
Overview
- Baichuan2 GitHub repository: https://github.com/baichuan-inc/Baichuan2
Environment Setup
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple deepspeed==0.9.2
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple numpy==1.23.5
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple sentencepiece==0.1.97
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple torch==2.0.0
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple transformers==4.29.1
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple xformers==0.0.20
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple tensorboard
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple datasets
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple accelerate
# Install mpi4py only after running one of the options below, otherwise the install fails.
# Option 1
sudo apt update
sudo apt install openmpi-bin
# Option 2
sudo apt update
sudo apt-get install libopenmpi-dev
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple mpi4py
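Since several of the packages above are pinned to exact versions, it can help to confirm what actually ended up in the environment. The following is only a minimal sketch using the standard library; the `PINNED` table is copied from the pip commands above, and the helper names `installed_version` and `find_mismatches` are made up for illustration.

```python
from importlib import metadata

# Pins copied from the pip commands above.
PINNED = {
    "deepspeed": "0.9.2",
    "numpy": "1.23.5",
    "sentencepiece": "0.1.97",
    "torch": "2.0.0",
    "transformers": "4.29.1",
    "xformers": "0.0.20",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose version differs from its pin to (expected, found)."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        # A local build tag such as "2.0.0+cu117" still counts as a match.
        if found is None or found.split("+")[0] != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches(PINNED).items():
        print(f"{package}: expected {expected}, found {found}")
```

If the script prints nothing, every pinned package matches; otherwise each line names a package to reinstall.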
Model Inference
Baichuan2-7B-Chat-4bits
Python code for running the quantized model
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig
tokenizer = AutoTokenizer.from_pretrained("/home/shidonghai/Baichuan-7B/Baichuan2-7B-Chat-4bits", use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("/home/shidonghai/Baichuan-7B/Baichuan2-7B-Chat-4bits", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
model.generation_config = GenerationConfig.from_pretrained("/home/shidonghai/Baichuan-7B/Baichuan2-7B-Chat-4bits")
messages = []
# The prompt asks the model to explain the saying "温故而知新" (review the old to learn the new).
messages.append({"role": "user", "content": "解释一下“温故而知新”"})
response = model.chat(tokenizer, messages)
print(response)
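The script above sends a single question. For a multi-turn conversation the same `messages` list can carry the history. The sketch below is one possible way to do that; it assumes that `model.chat` also accepts turns with the role `"assistant"`, which the code above does not show, and the helpers `add_turn` and `chat_once` are hypothetical names, not part of the Baichuan2 code.

```python
def add_turn(messages, role, content):
    """Append one chat turn and return the list so calls can be chained."""
    # Assumption: only these two roles are used in the history.
    if role not in ("user", "assistant"):
        raise ValueError(f"unexpected role: {role!r}")
    messages.append({"role": role, "content": content})
    return messages


def chat_once(model, tokenizer, messages, user_text):
    """Send user_text, record the reply in the history, and return the reply."""
    add_turn(messages, "user", user_text)
    response = model.chat(tokenizer, messages)
    add_turn(messages, "assistant", response)
    return response
```

With the `model` and `tokenizer` loaded as above, calling `chat_once(model, tokenizer, messages, "...")` repeatedly keeps earlier questions and answers in `messages`.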
Troubleshooting Inference Errors
AttributeError: 'list' object has no attribute 'as_dict'
(baichuan) shidonghai@shidonghai:~/Baichuan-7B$ python 123.py
Traceback (most recent call last):
File "123.py", line 5, in <module>
model = AutoModelForCausalLM.from_pretrained("/home/shidonghai/Baichuan-7B/Baichuan2-7B-Chat-4bits", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
File "/home/shidonghai/anaconda3/envs/baichuan/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
return model_class.from_pretrained(
File "/home/shidonghai/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat-4bits/modeling_baichuan.py", line 656, in from_pretrained
dispatch_model(model, device_map=device_map)
File "/home/shidonghai/anaconda3/envs/baichuan/lib/python3.8/site-packages/accelerate/big_modeling.py", line 343, in dispatch_model
check_device_map(model, device_map)
File "/home/shidonghai/anaconda3/envs/baichuan/lib/python3.8/site-packages/accelerate/utils/modeling.py", line 1165, in check_device_map
all_model_tensors = [name for name, _ in model.state_dict().items()]
File "/home/shidonghai/anaconda3/envs/baichuan/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1897, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "/home/shidonghai/anaconda3/envs/baichuan/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1897, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "/home/shidonghai/anaconda3/envs/baichuan/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1897, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
[Previous line repeated 2 more times]
File "/home/shidonghai/anaconda3/envs/baichuan/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1894, in state_dict
self._save_to_state_dict(destination, prefix, keep_vars)
File "/home/shidonghai/anaconda3/envs/baichuan/lib/python3.8/site-packages/bitsandbytes/nn/modules.py", line 237, in _save_to_state_dict
for k, v in self.weight.quant_state.as_dict(packed=True).items():
AttributeError: 'list' object has no attribute 'as_dict'
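The traceback shows the installed bitsandbytes calling quant_state.as_dict() while the quantized weights store quant_state as a plain list. Downgrading bitsandbytes fixes this: pip install bitsandbytes==0.41.0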
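Quantized Model Deployment
Based on the GPU memory usage for quantized inference published in the official repository, the int4 model is the recommended choice if you only want to try the model on a local GPU.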
Precision | GPU Mem (GB) |
---|---|
bf16 / fp16 | 26.0 |
int8 | 15.8 |
int4 | 9.7 |
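The table can be turned into a small lookup for choosing a precision. This is only a sketch: the figures are copied from the table above, while the function name `pick_precision` and the `headroom_gb` safety margin are assumptions added for illustration.

```python
# GPU memory requirements in GB, copied from the table above.
MEMORY_GB = {"bf16 / fp16": 26.0, "int8": 15.8, "int4": 9.7}


def pick_precision(free_gb, headroom_gb=1.0):
    """Return the highest precision that fits in free_gb, or None if none fits.

    headroom_gb is a hypothetical safety margin, not a value from the table.
    """
    # Try the most memory-hungry option first, then fall back to smaller ones.
    options = sorted(MEMORY_GB.items(), key=lambda item: item[1], reverse=True)
    for name, needed_gb in options:
        if needed_gb + headroom_gb <= free_gb:
            return name
    return None


if __name__ == "__main__":
    for free in (40.0, 24.0, 12.0, 8.0):
        print(free, "GB free ->", pick_precision(free))
```

On a machine with PyTorch and CUDA, the free memory could be read with `torch.cuda.mem_get_info()`, which returns free and total bytes for the current device.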
Model Links
https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat-4bits
https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat-4bits
https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat
https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat
https://huggingface.co/baichuan-inc/Baichuan2-13B-Base
https://huggingface.co/baichuan-inc/Baichuan2-7B-Base
https://huggingface.co/baichuan-inc/Baichuan2-7B-Intermediate-Checkpoints
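Model Training
Running on a Single GPU
Before running, change the original DeepSpeed setting "stage": 2 to "stage": 3. With stage 2 the run fails; I do not yet know why, but it looks like a version issue.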
"zero_optimization": {
"stage": 3,
"contiguous_gradients": false,
"allgather_bucket_size": 1e8,
"reduce_bucket_size": 1e8,
"overlap_comm": true,
"reduce_scatter": true
},
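Instead of editing the file by hand, the stage can be switched with a few lines of Python. This is a minimal sketch that assumes the config is plain JSON at `config/deepspeed.json`, the path used in the launch command; `set_zero_stage` is a hypothetical helper name.

```python
import json
from pathlib import Path


def set_zero_stage(config_path, stage=3):
    """Rewrite zero_optimization.stage in a DeepSpeed JSON config.

    Returns the previous stage value (None if it was not set).
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # Create the section if the config does not have one yet.
    zero = config.setdefault("zero_optimization", {})
    previous = zero.get("stage")
    zero["stage"] = stage
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous
```

Calling `set_zero_stage("config/deepspeed.json")` from the repository root rewrites the file in place and leaves the other keys unchanged, so keep a copy of the original if you may want to switch back.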
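Create the following folders in the directory that contains train.py:
- data_dir: holds the training corpus files
- checkpoints: stores the DeepSpeed training checkpoints
- The parameter below sets how many training steps pass between checkpoint saves. I recommend at least 2000; if a checkpoint is saved at every iteration, storage use reaches about 3 TB in roughly 30 minutes.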
parser.add_argument("--steps_per_epoch", type=int, default=4096, help="Step intervals to save checkpoint")
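The folder setup and the storage warning above can be sketched in a few lines. The folder names come from the list above; `make_training_dirs`, `checkpoint_disk_gb`, and in particular the `gb_per_checkpoint` argument are hypothetical, since the size of one checkpoint depends on your setup and has to be measured. The estimate also assumes that every saved checkpoint is kept on disk.

```python
from pathlib import Path


def make_training_dirs(root="."):
    """Create data_dir and checkpoints under root if they do not exist yet."""
    created = []
    for name in ("data_dir", "checkpoints"):
        folder = Path(root) / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


def checkpoint_disk_gb(total_steps, save_every, gb_per_checkpoint):
    """Estimate disk usage in GB when one checkpoint is kept per save interval.

    gb_per_checkpoint is a placeholder: measure one real checkpoint first.
    """
    return (total_steps // save_every) * gb_per_checkpoint
```

For example, `checkpoint_disk_gb(100_000, 4096, 100.0)` counts 24 saves over 100,000 steps, while `save_every=1` would count 100,000 saves, which is why a large interval matters.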
Run the following command to start training.
#!/bin/bash
deepspeed train.py \
--deepspeed \
--deepspeed_config config/deepspeed.json