Fine-Tuning Yi-34B

Environment Setup

  • Base environment

# Create and activate the environment
conda create -n llama_factory python=3.10
conda activate llama_factory
# Pick the CUDA build that matches your machine
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install the fine-tuning toolkit
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -r requirements.txt
# Install the distributed training library
pip install deepspeed
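
Before moving on, it is worth a quick sanity check that the PyTorch build actually sees your GPUs (a minimal check; the printed version and device count depend on your driver and CUDA setup):

python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"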
  • Download the model

# Mirror: https://hf-mirror.com/
conda activate base
pip install -U huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download --local-dir-use-symlinks False 01-ai/Yi-34B-Chat --local-dir Yi-34B-Chat
# If an interrupted download leaves files that fail checksum verification, re-fetch just those
# files with --include or --exclude; multiple patterns are separated by spaces
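
For example, to re-download only the weight shards (the "*.safetensors" pattern here is illustrative; adjust it to whichever files actually failed):

huggingface-cli download --resume-download --local-dir-use-symlinks False 01-ai/Yi-34B-Chat --include "*.safetensors" --local-dir Yi-34B-Chat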

Training

  • Configuration file

# Reference: https://github.com/hiyouga/LLaMA-Factory/issues/256
vi ds_config_lora.json
{
    "bfloat16": {
        "enabled": false
    },
    "fp16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_fp16_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 1e5,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
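
The "auto" values are resolved at launch time by the HuggingFace Trainer integration from the corresponding command-line arguments. A quick way to catch typos in the file before launching (a plain syntax check, nothing DeepSpeed-specific):

python -m json.tool ds_config_lora.json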
  • Training script

deepspeed --include localhost:0,1,2,3,4,5,6,7 src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path /path/to/model/Yi-34B-Chat/ \
    --dataset alpaca_gpt4_zh \
    --template yi \
    --finetuning_type lora \
    --lora_target k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
    --output_dir ./yi_sft_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16 \
    --deepspeed "./ds_config_lora.json"
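
After training, the LoRA adapter in ./yi_sft_checkpoint can be merged back into the base model for deployment. A sketch using LLaMA-Factory's export entry point; the script name and flags have changed across versions (older releases used --checkpoint_dir instead of --adapter_name_or_path), so check the README of the revision you cloned:

python src/export_model.py \
    --model_name_or_path /path/to/model/Yi-34B-Chat/ \
    --adapter_name_or_path ./yi_sft_checkpoint \
    --template yi \
    --finetuning_type lora \
    --export_dir ./yi_sft_merged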

ZeRO-1 partitions optimizer states across GPUs; ZeRO-2 additionally partitions gradients; ZeRO-3 partitions optimizer states, gradients, and parameters.

  • Quantized training

--quantization_bit 4
# ValueError: DeepSpeed ZeRO-3 is incompatible with quantization.
# Note: do not put comment lines between the backslash-continued arguments of the launch
# command; a comment breaks the line continuation and every argument after it is silently dropped
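
Since ZeRO-3 cannot be combined with quantization, a 4-bit (QLoRA) run has to drop it; for example, launch on a single GPU without DeepSpeed (a sketch reusing the arguments above; adjust paths and batch size to your hardware):

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path /path/to/model/Yi-34B-Chat/ \
    --dataset alpaca_gpt4_zh \
    --template yi \
    --finetuning_type lora \
    --lora_target k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
    --quantization_bit 4 \
    --output_dir ./yi_qlora_checkpoint \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --fp16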

Common Issues

Excessive CPU memory usage

Since you are offloading both parameters and optimizer state to CPU, you would need roughly 18 bytes per model parameter. That means for a 7B model you would need ~126GB of CPU memory. Please see page 3 of https://arxiv.org/pdf/1910.02054.pdf for a discussion of the memory breakdown.

Reference: How to calculate the cpu memory required for DeepSpeedZeRoOffload initialization? · Issue #3606 · microsoft/DeepSpeed · GitHub
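
Where do those 18 bytes come from? A plausible breakdown, following the mixed-precision Adam accounting in the ZeRO paper (the exact figure depends on the optimizer and precision settings, so treat it as an estimate):

fp16 parameter:        2 bytes
fp32 parameter copy:   4 bytes
fp32 gradient:         4 bytes
fp32 momentum (Adam):  4 bytes
fp32 variance (Adam):  4 bytes
total: 18 bytes/param  ->  7e9 params x 18 bytes ≈ 126 GB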

Enabling DeepSpeed ZeRO-3 without CPU offload instead partitions the model and loads it directly into GPU memory, keeping CPU memory usage low throughout initialization and training.
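Concretely, this means switching off the CPU offload in the config above (a minimal change; "none" keeps parameters and optimizer states on the GPUs):

"offload_optimizer": {
    "device": "none"
},
"offload_param": {
    "device": "none"
}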
