Fine-Tuning Yi-34B

Environment Setup

  • Base environment

# Create and activate the environment
conda create -n llama_factory python=3.10
conda activate llama_factory
# Pick the CUDA build that matches your machine
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install the fine-tuning toolkit
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -r requirements.txt
# Install the distributed training library
pip install deepspeed
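
Before moving on, it is worth a quick sanity check that the PyTorch build actually sees your GPUs (a minimal check; the printed version and device count depend on your driver and CUDA setup):

python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"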
  • Download the model

# Mirror: https://hf-mirror.com/
conda activate base
pip install -U huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download --local-dir-use-symlinks False 01-ai/Yi-34B-Chat --local-dir Yi-34B-Chat
# If an interrupted download leaves files that fail checksum verification, re-fetch just those
# files with --include or --exclude; multiple patterns are separated by spaces
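
For example, to re-download only the weight shards (the "*.safetensors" pattern here is illustrative; adjust it to whichever files actually failed):

huggingface-cli download --resume-download --local-dir-use-symlinks False 01-ai/Yi-34B-Chat --include "*.safetensors" --local-dir Yi-34B-Chat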

Training

  • Configuration file

# Reference: https://github.com/hiyouga/LLaMA-Factory/issues/256
vi ds_config_lora.json
{
    "bfloat16": {
        "enabled": false
    },
    "fp16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_fp16_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 1e5,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
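
The "auto" values are resolved at launch time by the HuggingFace Trainer integration from the corresponding command-line arguments. A quick way to catch typos in the file before launching (a plain syntax check, nothing DeepSpeed-specific):

python -m json.tool ds_config_lora.json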
  • Training script

deepspeed --include localhost:0,1,2,3,4,5,6,7 src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path /path/to/model/Yi-34B-Chat/ \
    --dataset alpaca_gpt4_zh \
    --template yi \
    --finetuning_type lora \
    --lora_target k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
    --output_dir ./yi_sft_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16 \
    --deepspeed "./ds_config_lora.json"
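
After training, the LoRA adapter in ./yi_sft_checkpoint can be merged back into the base model for deployment. A sketch using LLaMA-Factory's export entry point; the script name and flags have changed across versions (older releases used --checkpoint_dir instead of --adapter_name_or_path), so check the README of the revision you cloned:

python src/export_model.py \
    --model_name_or_path /path/to/model/Yi-34B-Chat/ \
    --adapter_name_or_path ./yi_sft_checkpoint \
    --template yi \
    --finetuning_type lora \
    --export_dir ./yi_sft_merged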

ZeRO-1 partitions optimizer states across GPUs; ZeRO-2 additionally partitions gradients; ZeRO-3 partitions optimizer states, gradients, and parameters.

  • Quantized training

--quantization_bit 4
# ValueError: DeepSpeed ZeRO-3 is incompatible with quantization.
# Note: do not put comment lines between the backslash-continued arguments of the launch
# command; a comment breaks the line continuation and every argument after it is silently dropped
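
Since ZeRO-3 cannot be combined with quantization, a 4-bit (QLoRA) run has to drop it; for example, launch on a single GPU without DeepSpeed (a sketch reusing the arguments above; adjust paths and batch size to your hardware):

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path /path/to/model/Yi-34B-Chat/ \
    --dataset alpaca_gpt4_zh \
    --template yi \
    --finetuning_type lora \
    --lora_target k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
    --quantization_bit 4 \
    --output_dir ./yi_qlora_checkpoint \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --fp16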

Common Issues

Excessive CPU memory usage

Since you are offloading both parameters and optimizer state to CPU, you would need roughly 18 bytes per model parameter. That means for a 7B model you would need ~126GB of CPU memory. Please see page 3 of https://arxiv.org/pdf/1910.02054.pdf for a discussion of the memory breakdown.

Reference: How to calculate the cpu memory required for DeepSpeedZeRoOffload initialization? · Issue #3606 · microsoft/DeepSpeed · GitHub
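
Where do those 18 bytes come from? A plausible breakdown, following the mixed-precision Adam accounting in the ZeRO paper (the exact figure depends on the optimizer and precision settings, so treat it as an estimate):

fp16 parameter:        2 bytes
fp32 parameter copy:   4 bytes
fp32 gradient:         4 bytes
fp32 momentum (Adam):  4 bytes
fp32 variance (Adam):  4 bytes
total: 18 bytes/param  ->  7e9 params x 18 bytes ≈ 126 GB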

Enabling DeepSpeed ZeRO-3 without CPU offload instead partitions the model and loads it directly into GPU memory, keeping CPU memory usage low throughout initialization and training.
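Concretely, this means switching off the CPU offload in the config above (a minimal change; "none" keeps parameters and optimizer states on the GPUs):

"offload_optimizer": {
    "device": "none"
},
"offload_param": {
    "device": "none"
}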
