LLM Training/Inference Adaptation - [Ascend 910B] - Qwen1.5-72B Model SFT

Large-model training with MindFormers: Qwen1.5-72B

1. Environment Preparation

Docker image:
swr.cn-central-221.ovaijisuan.com/mindformers/mindformers1.2_mindspore2.3:20240722

Start the container:
docker run --rm --name mf_qwen --privileged --shm-size=8g -it -u root --ipc=host --network=host --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 --device=/dev/davinci_manager --device=/dev/devmm_svm --device=/dev/hisi_hdc -v /etc/localtime:/etc/localtime -v /usr/local/Ascend/driver:/usr/local/Ascend/driver -v /var/log/npu/:/usr/slog -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi -v /local/path:/opt/work swr.cn-central-221.ovaijisuan.com/mindformers/mindformers1.2_mindspore2.3:20240722 /bin/bash
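Once inside the container, it is worth confirming that all eight NPUs are visible before continuing; a minimal check using the mounted npu-smi tool:

# Expect eight 910B devices (davinci0-7) in the output
npu-smi info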

Relevant package versions:
mindformers 1.2.0
mindieclient 1.0rc1
mindietorch 1.0rc1+torch2.1.0.abi0
mindpet 1.0.4
mindspore 2.3.0

Dependency installation

  1. Install mindformers
    The image does not ship with mindformers pre-installed.
    Download the source package, extract it, and run python setup.py install (a minimal install sketch is shown after this list).

  2. Install the dependency packages required by the official Qwen1.5 weight-conversion script:
    (the 0722 image already includes these dependencies, so this step can be skipped)

pip install torch "transformers>=4.37.2" transformers_stream_generator einops accelerate -i http://10.183.157.39/artifactory/api/pypi/alipipy/simple --trusted-host 10.183.157.39
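For step 1, a minimal install sketch under stated assumptions follows; the archive name mindformers-1.2.0.zip is a placeholder for whatever mindformers 1.2.0 source package was actually downloaded:

# Unpack the downloaded mindformers source package (file name is a placeholder)
unzip mindformers-1.2.0.zip
cd mindformers-1.2.0
# Install into the current Python environment
python setup.py install
# Sanity check: the package should now be importable from outside the source tree
cd .. && python -c "import mindformers; print(mindformers.__file__)"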

Environment configuration

source /usr/local/Ascend/ascend-toolkit/set_env.sh
export LD_PRELOAD=$LD_PRELOAD:/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/torch/lib/../../torch.libs/libgomp-6e1a1d1b.so.1.0.0

Set environment variables

export MS_DEV_SIDE_EFFECT_LOAD_ELIM=3  # remove TensorMove
export MS_MEMORY_POOL_RECYCLE=1  # memory optimization
export GE_NOT_CUT=1   # memory optimization
export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE"
export HCCL_CONNECT_TIMEOUT=120
export HCCL_ENTRY_LOG_ENABLE=1
export HCCL_SOCKET_IFNAME=fsb_bond.2400
export ENABLE_CELL_REUSE=1
export MS_FORCE_NO_INLINE=1

2. Training Preparation

2.1 Weight Conversion

Model weight conversion
  • Convert Torch weights to MindSpore weights

Run the convert_weight.py conversion script to convert the Hugging Face weights into a complete ckpt file.

python research/qwen1_5/convert_weight.py --torch_ckpt_dir /opt/work/pretrained_models/Qwen1p5-72B-Chat --mindspore_ckpt_path /opt/work/mindspore_ckpt/qwen1p5_72b_chat_new.ckpt

# Parameter description:
torch_ckpt_dir:      directory containing the pretrained weight files (required)
mindspore_ckpt_path: output path for the converted file; defaults to './transform.ckpt'

# Execution result:
...
model.layers.79.ffn_norm.weight (8192,)
model.norm_out.weight (8192,)
lm_head.weight (152064, 8192)
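As an optional sanity check, the converted checkpoint can be loaded back on the host. This is a sketch only and assumes the machine has enough free RAM, since the full 72B checkpoint is read into memory:

# Load the converted ckpt and report how many parameters it contains
python -c "import mindspore as ms; params = ms.load_checkpoint('/opt/work/mindspore_ckpt/qwen1p5_72b_chat_new.ckpt'); print(len(params), 'parameters')"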

2.2 Data Preparation

2.2.1 Dataset creation (r1.1.0)

A preprocessing script for the alpaca dataset is currently provided for the full-parameter fine-tuning task.

Dataset download link:

Run alpaca_converter.py to convert the raw dataset into the required format.

python research/qwen/alpaca_converter.py --data_path /opt/work/dataset/code_alpaca/code_alpaca_20k.json --output_path /opt/work/src/train_ms/qwen/dataset/alpaca-data-conversation.json
# Parameter description
# data_path: path to the alpaca data
# output_path: output path for the converted conversation-format data

Example of the converted format:

   {
    "id": "1",
    "conversations": [
      {
        "from": "user",
        "value": "Create an array of length 5 which contains all even numbers between 1 and 10."
      },
      {
        "from": "assistant",
        "value": "arr = [2, 4, 6, 8, 10]"
      }
    ]
  },

Run qwen_preprocess.py to preprocess the data and generate the MindRecord file.

python research/qwen/qwen_preprocess.py --input_glob /opt/work/src/train_ms/qwen/dataset/alpaca-data-conversation.json --model_file /opt/work/pretrained_models/Qwen-14B_base/qwen.tiktoken --seq_length 4096 --output_file /opt/work/src/train_ms/qwen/dataset/alpaca-4096.mindrecord

Transformed 20022 records.
Transform finished, output files refer: /opt/work/src/train_ms/qwen/dataset/alpaca-4096.mindrecord
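The generated MindRecord file can be spot-checked with MindSpore's FileReader. Field names vary by preprocessing script, so this sketch only prints whatever columns the first record contains:

# Print the column names and value shapes/types of the first record
python -c "from mindspore.mindrecord import FileReader; r = FileReader('/opt/work/src/train_ms/qwen/dataset/alpaca-4096.mindrecord'); s = next(r.get_next()); [print(k, getattr(v, 'shape', type(v))) for k, v in s.items()]; r.close()"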

2.2.2 Dataset creation
  • alpaca data preprocessing

    Run research/qwen1_5/alpaca_converter.py to convert the raw dataset into the required format.

    python alpaca_converter.py \
     --data_path /opt/work/dataset/code_alpaca/code_alpaca_20k.json \
     --output_path /opt/work/dataset/dataset_mf/code_alpaca_20k_msg_v1.2.json
    
    # Parameter description
    data_path:   path to the downloaded input file
    output_path: path where the output file is saved
    

    Run research/qwen1_5/qwen1_5_preprocess.py to preprocess the data and generate the MindRecord file.

    python research/qwen1_5/qwen1_5_preprocess.py --dataset_type 'qa' --input_glob /opt/work/dataset/dataset_mf/code_alpaca_20k_msg_v1.2.json --vocab_file /opt/work/pretrained_models/Qwen1p5-72B-Chat/vocab.json --merges_file /opt/work/pretrained_models/Qwen1p5-72B-Chat/merges.txt --seq_length 4096 --output_file /opt/work/dataset/dataset_mf/code_alpaca_20k_msg_v1.2_4096.mindrecord
    
    # Parameter description
    dataset_type: type of data to preprocess
    input_glob:   path to the converted alpaca file
    vocab_file:   path to vocab.json
    merges_file:  path to merges.txt
    seq_length:   sequence length of the output data
    output_file:  path where the output file is saved
    
    # Result
    Transformed 20022 records.
    Transform finished, output files refer: /opt/work/dataset/dataset_mf/code_alpaca_20k_msg_v1.2_4096.mindrecord
    
    

3. Training Execution

3.1 LoRA

TODO

3.2 Full-parameter SFT

Model training performance:

Config       | Task            | Datasets        | SeqLength | Phase    | Performance (tokens/s/p)
qwen1.5-7b   | text_generation | wikitext-103-v1 | 32768     | Pretrain | 1048
qwen1.5-14b  | text_generation | wikitext-103-v1 | 32768     | Pretrain | 675
qwen1.5-72b  | text_generation | wikitext-103-v1 | 32768     | Pretrain | 186
qwen1.5-7b   | text_generation | alpaca          | 4096      | Finetune | 2457
qwen1.5-14b  | text_generation | alpaca          | 4096      | Finetune | 1077
qwen1.5-72b  | text_generation | alpaca          | 2048      | Finetune | 180.2
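For a rough sense of scale, the per-card figure multiplied by the card count gives cluster throughput; for example, the 72B fine-tune row works out to:

180.2 tokens/s/p × 32 cards ≈ 5,766 tokens/s overall (for a 32-card job)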
Multi-node, multi-card full-parameter fine-tuning
  • Notes:
    • For multi-node, multi-card distributed training, the launch script must be run separately on each node, with the MASTER_ADDR parameter set to the IP address of the master node.
    • All nodes set the same (master) IP address; between nodes only the NODE_RANK parameter differs.
    • For multi-node training the HCCL network interface must be set, e.g. export HCCL_SOCKET_IFNAME=fsb_bond.2400 (an interface check is sketched below).
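Before launching, it helps to confirm on every node that the interface named in HCCL_SOCKET_IFNAME actually exists and carries the expected address (the interface name fsb_bond.2400 is specific to this environment; substitute your own):

# Show the HCCL interface and its IP address on this node
ip -brief addr show fsb_bond.2400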

Simulation test
Used to validate the parallel-training parameter settings before launching the full multi-node job.

# Set simulation environment variables
export MS_SIMULATION_LEVEL=1
export RANK_SIZE=32
export RANK_ID=31
export MS_MEMORY_STATISTIC=1
export GLOG_v=2


python run_qwen1_5.py --config /opt/work/src/train_ms/qwen/finetune_qwen1p5_72b_node4.yaml --run_mode finetune --use_parallel True --train_dataset /opt/work/dataset/dataset_mf/code_alpaca_20k_msg_v1.2_4096.mindrecord

2024-09-02 16:57:52,110 - mindformers[mindformers/core/callback/callback.py:333] - INFO -    5.1% |██                                                | 0.28539 samples/s/p  2:53:21 }
[WARNING] PRE_ACT(32,fffefa39f120,python):2024-09-02-16:57:52.124.835 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:138] AllocTensorMem] Need Profile Memory, Memory pool alloc, total mem: 2199023255552, peak mem: 46895575040, in use mem: 46895575040, used by event mem: 0, device address addr: 0x1341c0000000, size: 18797974528, from persistent mem: 0, need recycle: 0
[WARNING] PRE_ACT(32,fffeff3bf120,python):2024-09-02-16:57:53.930.136 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:691] CombineMemBuf] Need Profile Memory, Memory pool free, total mem: 2199023255552, peak mem: 46895575040, in use mem: 46895575040, used by event mem: 0, device address addr: 0x1341c0000000, size: 18797974528
[WARNING] PRE_ACT(32,fffeff3bf120,python):2024-09-02-16:57:53.930.363 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:691] CombineMemBuf] Need Profile Memory, Memory pool free, total mem: 2199023255552, peak mem: 46895575040, in use mem: 28097602560, used by event mem: 0, device address addr: 0x1341c0000400, size: 512
[WARNING] PRE_ACT(32,fffeff3bf120,python):2024-09-02-16:57:53.930.403 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:691] CombineMemBuf] Need Profile Memory, Memory pool free, total mem: 2199023255552, peak mem: 46895575040, in use mem: 28097602048, used by event mem: 0, device address addr: 0x1341c0000200, size: 512
[WARNING] PRE_ACT(32,fffeff3bf120,python):2024-09-02-16:57:53.930.439 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:691] CombineMemBuf] Need Profile Memory, Memory pool free, total mem: 2199023255552, peak mem: 46895575040, in use mem: 28097601536, used by event mem: 0, device address addr: 0x1341c0000600, size: 512
[WARNING] PRE_ACT(32,fffeff3bf120,python):2024-09-02-16:57:53.930.473 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:691] CombineMemBuf] Need Profile Memory, Memory pool free, total mem: 2199023255552, peak mem: 46895575040, in use mem: 28097601024, used by event mem: 0, device address addr: 0x1341c0000000, size: 512
[WARNING] PRE_ACT(32,fffeff3bf120,python):2024-09-02-16:57:53.930.562 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:138] AllocTensorMem] Need Profile Memory, Memory pool alloc, total mem: 2199023255552, peak mem: 46895575040, in use mem: 46895575040, used by event mem: 0, device address addr: 0x1341c0000000, size: 18797974528, from persistent mem: 0, need recycle: 0
[WARNING] PRE_ACT(32,ffff01bcf120,python):2024-09-02-16:57:55.728.118 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:691] CombineMemBuf] Need Profile Memory, Memory pool free, total mem: 2199023255552, peak mem: 46895575040, in use mem: 46895575040, used by event mem: 0, device address addr: 0x1341c0000000, size: 18797974528
[WARNING] PRE_ACT(32,ffff2a2b1180,python):2024-09-02-16:57:55.731.344 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:691] CombineMemBuf] Need Profile Memory, Memory pool free, total mem: 2199023255552, peak mem: 46895575040, in use mem: 28097602560, used by event mem: 0, device address addr: 0x1341c0000600, size: 512
[WARNING] PRE_ACT(32,ffff2a2b1180,python):2024-09-02-16:57:55.731.581 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:691] CombineMemBuf] Need Profile Memory, Memory pool free, total mem: 2199023255552, peak mem: 46895575040, in use mem: 28097602048, used by event mem: 0, device address addr: 0x1341c0000400, size: 512
[WARNING] PRE_ACT(32,ffff2a2b1180,python):2024-09-02-16:57:55.731.923 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:691] CombineMemBuf] Need Profile Memory, Memory pool free, total mem: 2199023255552, peak mem: 46895575040, in use mem: 28097601536, used by event mem: 0, device address addr: 0x1341c0000200, size: 512
[WARNING] PRE_ACT(32,ffff2a2b1180,python):2024-09-02-16:57:55.732.058 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:691] CombineMemBuf] Need Profile Memory, Memory pool free, total mem: 2199023255552, peak mem: 46895575040, in use mem: 28097601024, used by event mem: 0, device address addr: 0x1341c0000000, size: 512
2024-09-02 16:57:55,732 - mindformers[mindformers/core/callback/callback.py:259] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2024-09-02 16:57:55,732 - mindformers[mindformers/core/callback/callback.py:323] - INFO - { Epoch:[  1/  5], step:[  320/ 1251], loss: 0.000, per_step_time: 1803ms, lr: 0.0, overflow cond: False, loss_scale: unavailable
2024-09-02 16:57:55,732 - mindformers[mindformers/core/callback/callback.py:333] - INFO -    5.1% |██                                                | 0.27724 samples/s/p  2:58:23 }
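As a reading aid, the peak device memory reported in the simulation log is in bytes, so the figure above converts to roughly:

46895575040 bytes ÷ 1024³ ≈ 43.7 GiB peak per device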

Training task
Taking qwen1_5-72b on 4 nodes with 32 cards as an example, launch the multi-node fine-tuning task.
Training uses 4 nodes; servers 1-4 are selected.

  1. Modify research/qwen1_5/finetune_qwen1_5_72b.yaml (note that data_parallel × model_parallel × pipeline_stage = 1 × 8 × 4 = 32 must match the total card count):

    parallel_config:
      data_parallel: 1
      model_parallel: 8
      pipeline_stage: 4
      micro_batch_num: 48
      vocab_emb_dp: True
      gradient_aggregation_group: 4
    
  2. Execute the distributed launch command

    Launch the task on all nodes at the same time, setting the MASTER_ADDR parameter to the master node's IP address. All nodes use the same IP address; between nodes only the NODE_RANK parameter differs (see the usage guide for details). A per-node wrapper that parameterizes NODE_RANK is sketched after the parameter notes below.

    In the mindformers working directory, run:

    # Node 0, node IP 172.191.132.5, acting as the master node; 32 cards in total with 8 cards per node
    bash scripts/msrun_launcher.sh "research/qwen1_5/run_qwen1_5.py \
     --config /opt/work/src/train_ms/qwen/finetune_qwen1p5_72b_node4.yaml \
     --load_checkpoint /opt/work/mindspore_ckpt/qwen1p5_72b_chat.ckpt \
     --use_parallel True \
     --run_mode finetune \
     --auto_trans_ckpt True \
     --train_data /opt/work/dataset/dataset_mf/code_alpaca_20k_msg_v1.2_4096.mindrecord" \
    32 8 172.191.132.5 8118 0 output/msrun_log False 300
    
    # Node 1, node IP 172.191.132.2; the launch commands for node 0 and node 1 differ only in NODE_RANK
    bash scripts/msrun_launcher.sh "research/qwen1_5/run_qwen1_5.py \
     --config /opt/work/src/train_ms/qwen/finetune_qwen1p5_72b_node4.yaml \
     --load_checkpoint /opt/work/mindspore_ckpt/qwen1p5_72b_chat.ckpt \
     --use_parallel True \
     --run_mode finetune \
     --auto_trans_ckpt True \
     --train_data /opt/work/dataset/dataset_mf/code_alpaca_20k_msg_v1.2_4096.mindrecord" \
    32 8 172.191.132.5 8118 1 output/msrun_log False 300
    
    # Node 2, node IP 172.191.132.3; the launch commands for node 0 and node 2 differ only in NODE_RANK
    bash scripts/msrun_launcher.sh "research/qwen1_5/run_qwen1_5.py \
     --config /opt/work/src/train_ms/qwen/finetune_qwen1p5_72b_node4.yaml \
     --load_checkpoint /opt/work/mindspore_ckpt/qwen1p5_72b_chat.ckpt \
     --use_parallel True \
     --run_mode finetune \
     --auto_trans_ckpt True \
     --train_data /opt/work/dataset/dataset_mf/code_alpaca_20k_msg_v1.2_4096.mindrecord" \
    32 8 172.191.132.5 8118 2 output/msrun_log False 300
    
    # Node 3, node IP 172.191.132.4; the launch commands for node 0 and node 3 differ only in NODE_RANK
    bash scripts/msrun_launcher.sh "research/qwen1_5/run_qwen1_5.py \
     --config /opt/work/src/train_ms/qwen/finetune_qwen1p5_72b_node4.yaml \
     --load_checkpoint /opt/work/mindspore_ckpt/qwen1p5_72b_chat.ckpt \
     --use_parallel True \
     --run_mode finetune \
     --auto_trans_ckpt True \
     --train_data /opt/work/dataset/dataset_mf/code_alpaca_20k_msg_v1.2_4096.mindrecord" \
    32 8 172.191.132.5 8118 3 output/msrun_log False 300
    
    # Parameter description
    config:          path to the configuration file
    load_checkpoint: path to the checkpoint folder; weights are stored in the 'model_dir/rank_0/xxx.ckpt' layout
    auto_trans_ckpt: switch for automatic checkpoint transformation
    run_mode:        run mode; set to finetune for fine-tuning
    train_data:      path to the training dataset
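    Since the four commands above differ only in NODE_RANK, a small per-node wrapper avoids copy-paste mistakes. This is a sketch only; the script name launch_node.sh is illustrative and it is assumed to be run from the mindformers working directory:

    #!/bin/bash
    # Usage on node k (k = 0..3): bash launch_node.sh <k>
    NODE_RANK=$1
    bash scripts/msrun_launcher.sh "research/qwen1_5/run_qwen1_5.py \
     --config /opt/work/src/train_ms/qwen/finetune_qwen1p5_72b_node4.yaml \
     --load_checkpoint /opt/work/mindspore_ckpt/qwen1p5_72b_chat.ckpt \
     --use_parallel True \
     --run_mode finetune \
     --auto_trans_ckpt True \
     --train_data /opt/work/dataset/dataset_mf/code_alpaca_20k_msg_v1.2_4096.mindrecord" \
    32 8 172.191.132.5 8118 "${NODE_RANK}" output/msrun_log False 300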
    

4. Common Errors

4.1 Weight conversion error: missing dynamic library

ImportError: /root/miniconda3/envs/mf1.1_ms2.3_py39/lib/python3.9/site-packages/torch/lib/../../torch.libs/libgomp-6e1a1d1b.so.1.0.0: cannot allocate memory in static TLS block

Fix: preload the library before running the conversion script:

export LD_PRELOAD=$LD_PRELOAD:/root/miniconda3/envs/mf1.1_ms2.3_py39/lib/python3.9/site-packages/torch/lib/../../torch.libs/libgomp-6e1a1d1b.so.1.0.0

4.2 Graph compilation forms a cycle

If graph compilation forms a cycle, the following environment variables can be added as a workaround:
export ENABLE_CELL_REUSE=1
export MS_DEV_CELL_REUSE=1
export MS_ENABLE_FORMAT_MODE=1
export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE"
