MindFormers Large Model Training - Qwen1.5-72B
1. Environment Setup
Docker image:
swr.cn-central-221.ovaijisuan.com/mindformers/mindformers1.2_mindspore2.3:20240722
Start the container:
docker run --rm --name mf_qwen --privileged --shm-size=8g -it -u root \
  --ipc=host --network=host \
  --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
  --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
  --device=/dev/davinci_manager --device=/dev/devmm_svm --device=/dev/hisi_hdc \
  -v /etc/localtime:/etc/localtime \
  -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
  -v /var/log/npu/:/usr/slog \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /local/path:/opt/work \
  swr.cn-central-221.ovaijisuan.com/mindformers/mindformers1.2_mindspore2.3:20240722 /bin/bash
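Inside the container, it is worth confirming that all eight NPUs are visible before going further; npu-smi is mounted in from the host by the run command above:
npu-smi info   # should list all 8 davinci devices; if not, re-check the --device flags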
Environment versions:
mindformers 1.2.0
mindieclient 1.0rc1
mindietorch 1.0rc1+torch2.1.0.abi0
mindpet 1.0.4
mindspore 2.3.0
Dependency installation
- Install mindformers: the image does not ship with mindformers preinstalled. Download the release package, extract it, and run python setup.py install.
- Install the dependencies required by the official Qwen1.5 model weight conversion (the 20240722 image already includes them, so this can be skipped):
pip install torch transformers>=4.37.2 transformers_stream_generator einops accelerate -i http://10.183.157.39/artifactory/api/pypi/alipipy/simple --trusted-host 10.183.157.39
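A quick sanity check of the installation (a minimal sketch, assuming both packages expose __version__, as recent releases do); the versions printed should match the list above:
python -c "import mindformers; print(mindformers.__version__)"
python -c "import transformers; print(transformers.__version__)"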
Environment configuration
source /usr/local/Ascend/ascend-toolkit/set_env.sh
export LD_PRELOAD=$LD_PRELOAD:/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/torch/lib/../../torch.libs/libgomp-6e1a1d1b.so.1.0.0
Set environment variables
export MS_DEV_SIDE_EFFECT_LOAD_ELIM=3 # remove TensorMove
export MS_MEMORY_POOL_RECYCLE=1 # memory optimization
export GE_NOT_CUT=1 # memory optimization
export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE" # detect overflow via INF/NAN values
export HCCL_CONNECT_TIMEOUT=120 # HCCL connection timeout in seconds
export HCCL_ENTRY_LOG_ENABLE=1 # enable HCCL entry logging
export HCCL_SOCKET_IFNAME=fsb_bond.2400 # NIC used for HCCL inter-node communication
export ENABLE_CELL_REUSE=1 # reuse cell compilation results
export MS_FORCE_NO_INLINE=1 # disable graph inlining
2. Training Preparation
2.1 Weight Conversion
Model weight conversion
- torch weights to mindspore weights
Run the convert_weight.py conversion script to convert the HuggingFace weights into a complete ckpt weight file.
python research/qwen1_5/convert_weight.py --torch_ckpt_dir /opt/work/pretrained_models/Qwen1p5-72B-Chat --mindspore_ckpt_path /opt/work/mindspore_ckpt/qwen1p5_72b_chat_new.ckpt
# Parameter description:
torch_ckpt_dir: directory containing the pretrained weight files (required)
mindspore_ckpt_path: output path for the converted weight file; defaults to './transform.ckpt'
# Output:
...
model.layers.79.ffn_norm.weight (8192,)
model.norm_out.weight (8192,)
lm_head.weight (152064, 8192)
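The converted checkpoint can be spot-checked against the shapes printed above (a minimal sketch; note that loading a 72B checkpoint requires a large amount of host memory):
python - <<'EOF'
import mindspore as ms

# load_checkpoint returns a dict mapping parameter names to Parameter objects
params = ms.load_checkpoint("/opt/work/mindspore_ckpt/qwen1p5_72b_chat_new.ckpt")
print("total parameters:", len(params))
# Compare a few entries with the conversion log, e.g. lm_head.weight (152064, 8192)
for name in ("model.norm_out.weight", "lm_head.weight"):
    print(name, params[name].shape)
EOF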
2.2 Data Preparation
2.2.1 Dataset Creation (r1.1.0)
A preprocessing script for the alpaca dataset is currently provided for the full-parameter fine-tuning task.
The dataset download link is as follows:
Run alpaca_converter.py to convert the raw dataset into the required format.
python research/qwen/alpaca_converter.py --data_path /opt/work/dataset/code_alpaca/code_alpaca_20k.json --output_path /opt/work/src/train_ms/qwen/dataset/alpaca-data-conversation.json
# Parameter description
# data_path: path to the alpaca data file
# output_path: output path for the converted conversation-format data
Sample of the converted format:
{
  "id": "1",
  "conversations": [
    {
      "from": "user",
      "value": "Create an array of length 5 which contains all even numbers between 1 and 10."
    },
    {
      "from": "assistant",
      "value": "arr = [2, 4, 6, 8, 10]"
    }
  ]
},
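Before preprocessing, the converted file can be spot-checked (a minimal sketch, assuming the converter writes a JSON list of records in the format above):
python - <<'EOF'
import json

# The converted file is a JSON list of {"id", "conversations"} records
with open("/opt/work/src/train_ms/qwen/dataset/alpaca-data-conversation.json") as f:
    data = json.load(f)
print("records:", len(data))
print("first conversation:", data[0]["conversations"])
EOF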
Run qwen_preprocess.py to preprocess the data and generate the MindRecord dataset.
python research/qwen/qwen_preprocess.py --input_glob /opt/work/src/train_ms/qwen/dataset/alpaca-data-conversation.json --model_file /opt/work/pretrained_models/Qwen-14B_base/qwen.tiktoken --seq_length 4096 --output_file /opt/work/src/train_ms/qwen/dataset/alpaca-4096.mindrecord
Transformed 20022 records.
Transform finished, output files refer: /opt/work/src/train_ms/qwen/dataset/alpaca-4096.mindrecord
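The generated MindRecord can be verified to be readable, and the sample count should match the "Transformed 20022 records" line above (a minimal sketch using the MindSpore dataset API):
python - <<'EOF'
import mindspore.dataset as ds

# Open the generated MindRecord and report sample count and column names
dataset = ds.MindDataset(dataset_files="/opt/work/src/train_ms/qwen/dataset/alpaca-4096.mindrecord")
print("samples:", dataset.get_dataset_size())
print("columns:", dataset.get_col_names())
EOF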
2.2.2 Dataset Creation
- alpaca data preprocessing
Run research/qwen1_5/alpaca_converter.py to convert the raw dataset into the required format.
python alpaca_converter.py \
  --data_path /opt/work/dataset/code_alpaca/code_alpaca_20k.json \
  --output_path /opt/work/dataset/dataset_mf/code_alpaca_20k_msg_v1.2.json
# Parameter description
# data_path: path to the downloaded input file
# output_path: path to save the output file
Run research/qwen1_5/qwen1_5_preprocess.py to preprocess the data and generate the MindRecord dataset.
python research/qwen1_5/qwen1_5_preprocess.py \
  --dataset_type 'qa' \
  --input_glob /opt/work/dataset/dataset_mf/code_alpaca_20k_msg_v1.2.json \
  --vocab_file /opt/work/pretrained_models/Qwen1p5-72B-Chat/vocab.json \
  --merges_file /opt/work/pretrained_models/Qwen1p5-72B-Chat/merges.txt \
  --seq_length 4096 \
  --output_file /opt/work/dataset/dataset_mf/code_alpaca_20k_msg_v1.2_4096.mindrecord
# Parameter description
# dataset_type: type of data to preprocess
# input_glob: path to the converted alpaca file
# vocab_file: path to vocab.json
# merges_file: path to merges.txt
# seq_length: sequence length of the output data
# output_file: path to save the output file
# Run result
Transformed 20022 records.
Transform finished, output files refer: /opt/work/dataset/dataset_mf/code_alpaca_20k_msg_v1.2_4096.mindrecord
3. Training Execution
3.1 LoRA
TODO
3.2 SFT Full-Parameter Fine-Tuning
Model training performance:
Config | Task | Datasets | SeqLength | Phase | Performance(tokens/s/p) |
---|---|---|---|---|---|
qwen1.5-7b | text_generation | wikitext-103-v1 | 32768 | Pretrain | 1048 |
qwen1.5-14b | text_generation | wikitext-103-v1 | 32768 | Pretrain | 675 |
qwen1.5-72b | text_generation | wikitext-103-v1 | 32768 | Pretrain | 186 |
qwen1.5-7b | text_generation | alpaca | 4096 | Finetune | 2457 |
qwen1.5-14b | text_generation | alpaca | 4096 | Finetune | 1077 |
qwen1.5-72b | text_generation | alpaca | 2048 | Finetune | 180.2 |
Multi-node multi-card full-parameter fine-tuning
- Note:
- For multi-node distributed training, the launch script must be run separately on each node, with the parameter MASTER_ADDR set to the IP address of the master node;
- all nodes set the same IP address, and only the parameter NODE_RANK differs between nodes.
- Multi-node training requires setting the HCCL NIC, e.g. export HCCL_SOCKET_IFNAME=fsb_bond.2400 (available interfaces can be listed with ip addr).
Simulation test
Used to validate the parallel-training parameter settings before launching the full multi-node job; the peak mem value in the memory-pool warnings below (46895575040 bytes, about 43.7 GB) is the simulated per-device memory peak and should fit within device memory.
# Set the simulation environment variables
export MS_SIMULATION_LEVEL=1 # dry-run simulation; no real multi-device execution
export RANK_SIZE=32 # total number of simulated ranks
export RANK_ID=31 # the rank to simulate (here the last rank)
export MS_MEMORY_STATISTIC=1 # print memory usage statistics
export GLOG_v=2 # log level: WARNING
python run_qwen1_5.py --config /opt/work/src/train_ms/qwen/finetune_qwen1p5_72b_node4.yaml --run_mode finetune --use_parallel True --train_dataset /opt/work/dataset/dataset_mf/code_alpaca_20k_msg_v1.2_4096.mindrecord
2024-09-02 16:57:52,110 - mindformers[mindformers/core/callback/callback.py:333] - INFO - 5.1% |██ | 0.28539 samples/s/p 2:53:21 }
[WARNING] PRE_ACT(32,fffefa39f120,python):2024-09-02-16:57:52.124.835 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:138] AllocTensorMem] Need Profile Memory, Memory pool alloc, total mem: 2199023255552, peak mem: 46895575040, in use mem: 46895575040, used by event mem: 0, device address addr: 0x1341c0000000, size: 18797974528, from persistent mem: 0, need recycle: 0
[WARNING] PRE_ACT(32,fffeff3bf120,python):2024-09-02-16:57:53.930.136 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:691] CombineMemBuf] Need Profile Memory, Memory pool free, total mem: 2199023255552, peak mem: 46895575040, in use mem: 46895575040, used by event mem: 0, device address addr: 0x1341c0000000, size: 18797974528
...
2024-09-02 16:57:55,732 - mindformers[mindformers/core/callback/callback.py:259] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2024-09-02 16:57:55,732 - mindformers[mindformers/core/callback/callback.py:323] - INFO - { Epoch:[ 1/ 5], step:[ 320/ 1251], loss: 0.000, per_step_time: 1803ms, lr: 0.0, overflow cond: False, loss_scale: unavailable
2024-09-02 16:57:55,732 - mindformers[mindformers/core/callback/callback.py:333] - INFO - 5.1% |██ | 0.27724 samples/s/p 2:58:23 }
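Because pipeline parallelism places different layers on different ranks, per-rank memory peaks differ; it can be worth re-running the simulation for a first-stage and a last-stage rank (a sketch reusing the command above):
# Simulate one rank from the first and one from the last pipeline stage
for r in 0 31; do
  export RANK_ID=$r
  python run_qwen1_5.py --config /opt/work/src/train_ms/qwen/finetune_qwen1p5_72b_node4.yaml \
    --run_mode finetune --use_parallel True \
    --train_dataset /opt/work/dataset/dataset_mf/code_alpaca_20k_msg_v1.2_4096.mindrecord
done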
Training task
Taking qwen1_5-72b on 4 nodes / 32 cards as an example, launch the multi-node fine-tuning task. Training uses 4 nodes, servers 1 through 4.
- Modify research/qwen1_5/finetune_qwen1_5_72b.yaml:
parallel_config:
  data_parallel: 1
  model_parallel: 8
  pipeline_stage: 4
  micro_batch_num: 48
  vocab_emb_dp: True
  gradient_aggregation_group: 4
Here data_parallel × model_parallel × pipeline_stage = 1 × 8 × 4 = 32, which must equal the total number of cards.
- Run the distributed launch command
Launch the task on all nodes at the same time, setting the parameter MASTER_ADDR to the master node's IP address; all nodes set the same IP address, and only the parameter NODE_RANK differs between nodes (see the usage guide for details). In the mindformers working directory, run:
# Node 0, IP 172.191.132.5, the master node; 32 cards in total with 8 cards per node
bash scripts/msrun_launcher.sh "research/qwen1_5/run_qwen1_5.py \
 --config /opt/work/src/train_ms/qwen/finetune_qwen1p5_72b_node4.yaml \
 --load_checkpoint /opt/work/mindspore_ckpt/qwen1p5_72b_chat.ckpt \
 --use_parallel True \
 --run_mode finetune \
 --auto_trans_ckpt True \
 --train_data /opt/work/dataset/dataset_mf/code_alpaca_20k_msg_v1.2_4096.mindrecord" \
 32 8 172.191.132.5 8118 0 output/msrun_log False 300

# Node 1, IP 172.191.132.2; the launch command differs from node 0 only in NODE_RANK
bash scripts/msrun_launcher.sh "research/qwen1_5/run_qwen1_5.py \
 --config /opt/work/src/train_ms/qwen/finetune_qwen1p5_72b_node4.yaml \
 --load_checkpoint /opt/work/mindspore_ckpt/qwen1p5_72b_chat.ckpt \
 --use_parallel True \
 --run_mode finetune \
 --auto_trans_ckpt True \
 --train_data /opt/work/dataset/dataset_mf/code_alpaca_20k_msg_v1.2_4096.mindrecord" \
 32 8 172.191.132.5 8118 1 output/msrun_log False 300

# Node 2, IP 172.191.132.3; the launch command differs from node 0 only in NODE_RANK
bash scripts/msrun_launcher.sh "research/qwen1_5/run_qwen1_5.py \
 --config /opt/work/src/train_ms/qwen/finetune_qwen1p5_72b_node4.yaml \
 --load_checkpoint /opt/work/mindspore_ckpt/qwen1p5_72b_chat.ckpt \
 --use_parallel True \
 --run_mode finetune \
 --auto_trans_ckpt True \
 --train_data /opt/work/dataset/dataset_mf/code_alpaca_20k_msg_v1.2_4096.mindrecord" \
 32 8 172.191.132.5 8118 2 output/msrun_log False 300

# Node 3, IP 172.191.132.4; the launch command differs from node 0 only in NODE_RANK
bash scripts/msrun_launcher.sh "research/qwen1_5/run_qwen1_5.py \
 --config /opt/work/src/train_ms/qwen/finetune_qwen1p5_72b_node4.yaml \
 --load_checkpoint /opt/work/mindspore_ckpt/qwen1p5_72b_chat.ckpt \
 --use_parallel True \
 --run_mode finetune \
 --auto_trans_ckpt True \
 --train_data /opt/work/dataset/dataset_mf/code_alpaca_20k_msg_v1.2_4096.mindrecord" \
 32 8 172.191.132.5 8118 3 output/msrun_log False 300

# Parameter description
# config: path to the configuration file
# load_checkpoint: path to the checkpoint folder; weights are stored in the 'model_dir/rank_0/xxx.ckpt' layout
# auto_trans_ckpt: switch for automatic checkpoint transformation
# run_mode: run mode; set to finetune for fine-tuning
# train_data: path to the training dataset
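Once all four nodes are launched, training progress can be followed from the per-rank logs under the log directory passed to msrun_launcher.sh (output/msrun_log above). The worker_*.log file naming below is assumed from msrun defaults and may vary across versions:
# Follow rank 0's log on the master node (file name is an assumption; check with ls output/msrun_log)
tail -f output/msrun_log/worker_0.log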
4. Common Errors
4.1 Weight conversion fails with a dynamic library error
ImportError: /root/miniconda3/envs/mf1.1_ms2.3_py39/lib/python3.9/site-packages/torch/lib/../../torch.libs/libgomp-6e1a1d1b.so.1.0.0: cannot allocate memory in static TLS block
Fix: preload the library before running the conversion:
export LD_PRELOAD=$LD_PRELOAD:/root/miniconda3/envs/mf1.1_ms2.3_py39/lib/python3.9/site-packages/torch/lib/../../torch.libs/libgomp-6e1a1d1b.so.1.0.0
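If the conda environment name differs from the one in the message above, the bundled libgomp can be located first (a minimal sketch, assuming the torch wheel ships libgomp under torch.libs as in this error):
# Find site-packages via the installed torch, then preload the bundled libgomp
SITE_PACKAGES=$(python -c "import os, torch; print(os.path.dirname(os.path.dirname(torch.__file__)))")
ls "$SITE_PACKAGES"/torch.libs/libgomp*
export LD_PRELOAD=$LD_PRELOAD:$(ls "$SITE_PACKAGES"/torch.libs/libgomp* | head -n 1)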
4.2 Compilation cycle
If graph compilation forms a cycle, the following environment variables can be added to work around it:
export ENABLE_CELL_REUSE=1
export MS_DEV_CELL_REUSE=1
export MS_ENABLE_FORMAT_MODE=1
export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE"