Fine-tuning Large Models in Practice

Notes from experiments on fine-tuning large models, with particular attention to: data cleaning, data synthesis methods, the kinds of SFT tasks, and the amount of SFT data.

Data loading

For users in mainland China, the usual advice is to download datasets from https://modelscope.cn/datasets. After downloading, however, the data does not plug directly into Hugging Face datasets; instead it raises an error:

  • AttributeError: 'MsDataset' object has no attribute 'column_names'

So we can keep downloading data through ModelScope, but convert the result into the format that datasets expects, which is also a chance to understand the whole data pipeline a bit better.

The simplest fix, though, is:

from modelscope.msdatasets import MsDataset

dataset = MsDataset.load()               # dataset name and other load() arguments omitted here
train_dataset = dataset.to_hf_dataset()  # convert the ModelScope download into a Hugging Face datasets.Dataset

Then, for reference, two dataset-loading implementations:

  • https://github.com/modelscope/modelscope/blob/a903ec7a898f5dfb44349e2ce15971ec5f08e528/examples/pytorch/llm/utils/dataset.py#L34
  • https://github.com/hiyouga/LLaMA-Factory/blob/6c94305e4746c9a735ff62a6428e295d1a67da52/src/llmtuner/data/loader.py#L83
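
If you'd rather do the conversion by hand (which also helps with understanding the data pipeline, as mentioned above), a minimal sketch could look like the following. It assumes the ModelScope dataset iterates as plain dict records; the load() arguments are again omitted.

from datasets import Dataset
from modelscope.msdatasets import MsDataset

ms_dataset = MsDataset.load()                        # dataset name/arguments omitted, as above
records = [dict(example) for example in ms_dataset]  # materialize the records as plain dicts
hf_dataset = Dataset.from_list(records)              # rebuild them as a Hugging Face Dataset
print(hf_dataset.column_names)                       # column_names now works as expected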

A few approaches:

# load_from_disk() does not accept a split argument; load_dataset() supports split slicing
from datasets import load_dataset

train_dataset = load_dataset(args.dataset_name, split="train[:1024]")

# `tokenizer`, `args`, `task` and get_detailed_instruct() are assumed to be defined elsewhere
def preprocess_function(examples):
    # queries: prepend the task instruction, tokenize, then append EOS and pad
    queries = [get_detailed_instruct(task, q) for q in examples["sentence"]]
    batch_dict = tokenizer(queries, max_length=args.max_length - 1, return_attention_mask=False, padding=False, truncation=True)
    batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id] for input_ids in batch_dict['input_ids']]
    batch_dict = tokenizer.pad(batch_dict, padding=True, return_attention_mask=True, return_tensors='pt')
    result = {f"sentence_{k}": v for k, v in batch_dict.items()}

    # positives: same tokenization, no instruction prefix
    queries = examples["positive"]
    batch_dict = tokenizer(queries, max_length=args.max_length - 1, return_attention_mask=False, padding=False, truncation=True)
    batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id] for input_ids in batch_dict['input_ids']]
    batch_dict = tokenizer.pad(batch_dict, padding=True, return_attention_mask=True, return_tensors='pt')
    for k, v in batch_dict.items():
        result[f"positive_{k}"] = v

    # negatives: same tokenization, no instruction prefix
    queries = examples["negative"]
    batch_dict = tokenizer(queries, max_length=args.max_length - 1, return_attention_mask=False, padding=False, truncation=True)
    batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id] for input_ids in batch_dict['input_ids']]
    batch_dict = tokenizer.pad(batch_dict, padding=True, return_attention_mask=True, return_tensors='pt')
    for k, v in batch_dict.items():
        result[f"negative_{k}"] = v

    # the positive is always candidate 0, so every label is 0
    result["labels"] = [0] * len(examples["sentence"])
    return result

# here `dataset` is a DatasetDict, e.g. the result of load_dataset(args.dataset_name)
processed_datasets = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["train"].column_names,
    desc="Running tokenizer on dataset",
)
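
The snippet above assumes that task and get_detailed_instruct are defined elsewhere. A common definition, in the style of the instruction-tuned e5-mistral embedding models, looks roughly like this (the task string is only an example):

def get_detailed_instruct(task_description: str, query: str) -> str:
    # prepend a one-line task instruction to each query before tokenization
    return f"Instruct: {task_description}\nQuery: {query}"

task = "Given a web search query, retrieve relevant passages that answer the query"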
    

Data construction

DeepSpeed ZeRO-0

LoRA, 1× A100 80GB

{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "zero_allow_untested_optimizer": true,
    "fp16": {
      "enabled": "auto",
      "loss_scale": 0,
      "loss_scale_window": 1000,
      "initial_scale_power": 16,
      "hysteresis": 2,
      "min_loss_scale": 1
    },
    "bf16": {
      "enabled": "auto"
    },
    "zero_optimization": {
      "stage": 0,
      "allgather_partitions": true,
      "allgather_bucket_size": 5e8,
      "overlap_comm": true,
      "reduce_scatter": true,
      "reduce_bucket_size": 5e8,
      "contiguous_gradients": true,
      "round_robin_gradients": true
    }
  }
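
This config is handed to the Hugging Face Trainer via TrainingArguments; a minimal sketch of the wiring is below. The LoRA hyperparameters, batch sizes and file names are placeholders, not the exact values used for the runs in this post.

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

model_path = "/root/share/model_repos/internlm2-chat-7b"   # path taken from the logs below
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# wrap the base model with LoRA adapters (rank/alpha here are illustrative)
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=4,     # the "auto" fields in the JSON are resolved from these values
    gradient_accumulation_steps=16,
    bf16=True,                         # resolves "bf16.enabled": "auto"
    deepspeed="ds_zero0.json",         # the ZeRO-0 config above, saved to a file
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,       # the tokenized dataset prepared earlier
    tokenizer=tokenizer,
)
trainer.train()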
{'loss': 1.3997, 'grad_norm': 1.9448336362838745, 'learning_rate': 1.589403973509934e-05, 'epoch': 0.96}                                           
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 156/156 [10:07<00:00,  3.71s/it]/root/.conda/envs/demo/lib/python3.10/site-packages/peft/utils/save_and_load.py:195: UserWarning: Could not find a config file in /root/share/model_repos/internlm2-chat-7b - will assume that the vocabulary was not modified.
  warnings.warn(
{'train_runtime': 620.0575, 'train_samples_per_second': 16.128, 'train_steps_per_second': 0.252, 'train_loss': 1.4265562815543933, 'epoch': 1.0}   
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 156/156 [10:20<00:00,  3.97s/it]
Training seconds: 638.3791897296906 seconds.
Training minutes: 10.64 minutes.
Peak reserved memory = 60.09 GB.
Peak reserved memory for training = 60.09 GB.
Peak reserved memory % of max memory = 75.918 %.
Peak reserved memory for training % of max memory = 75.918 %.
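
The timing and peak-memory lines in these logs follow the usual torch.cuda reserved-memory bookkeeping; below is a sketch of how such numbers can be collected around trainer.train() (the exact instrumentation used for these runs is not shown in the post).

import time
import torch

max_memory = torch.cuda.get_device_properties(0).total_memory / 1024**3
start_reserved = torch.cuda.max_memory_reserved() / 1024**3   # GB already reserved before training
start_time = time.time()

trainer.train()

peak_reserved = torch.cuda.max_memory_reserved() / 1024**3
peak_for_training = peak_reserved - start_reserved
print(f"Training seconds: {time.time() - start_time} seconds.")
print(f"Peak reserved memory = {round(peak_reserved, 3)} GB.")
print(f"Peak reserved memory for training = {round(peak_for_training, 3)} GB.")
print(f"Peak reserved memory % of max memory = {round(peak_reserved / max_memory * 100, 3)} %.")
print(f"Peak reserved memory for training % of max memory = {round(peak_for_training / max_memory * 100, 3)} %.")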

DeepSpeed ZeRO-2, no offload

LoRA

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 100,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1e-10
    },

    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 1e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 1e8,
        "contiguous_gradients": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
{'loss': 1.366, 'grad_norm': 2.294084072113037, 'learning_rate': 1.589403973509934e-05, 'epoch': 0.96}                                             
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 156/156 [10:14<00:00,  3.73s/it]/root/.conda/envs/demo/lib/python3.10/site-packages/peft/utils/save_and_load.py:195: UserWarning: Could not find a config file in /root/share/model_repos/internlm2-chat-7b - will assume that the vocabulary was not modified.
  warnings.warn(
{'train_runtime': 622.2199, 'train_samples_per_second': 16.071, 'train_steps_per_second': 0.251, 'train_loss': 1.4371743569007287, 'epoch': 1.0}   
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 156/156 [10:22<00:00,  3.99s/it]
Training seconds: 631.9961657524109 seconds.
Training minutes: 10.53 minutes.
Peak reserved memory = 59.59 GB.
Peak reserved memory for training = 59.59 GB.
Peak reserved memory % of max memory = 75.286 %.
Peak reserved memory for training % of max memory = 75.286 %.

DeepSpeed ZeRO-2, optimizer offload

{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "zero_allow_untested_optimizer": true,
    "fp16": {
      "enabled": "auto",
      "loss_scale": 0,
      "loss_scale_window": 1000,
      "initial_scale_power": 16,
      "hysteresis": 2,
      "min_loss_scale": 1
    },
    "bf16": {
      "enabled": "auto"
    },
    "zero_optimization": {
      "stage": 2,
      "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true
      },
      "allgather_partitions": true,
      "allgather_bucket_size": 5e8,
      "overlap_comm": true,
      "reduce_scatter": true,
      "reduce_bucket_size": 5e8,
      "contiguous_gradients": true,
      "round_robin_gradients": true
    }
  }

With CPU optimizer offload enabled, this run failed while JIT-compiling DeepSpeed's cpu_adam extension:

RuntimeError: Error building extension 'cpu_adam' Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f087b065750>

Things that help (see the sketch after this list):

  • rm -rf /tmp/torch_extensions/* to clear the stale cached build
  • https://github.com/microsoft/DeepSpeed/issues/889#issuecomment-808357696
  • check whether the installed torch and deepspeed versions actually match
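
A small Python sketch of the same two checks. It assumes the JIT-build cache lives under /tmp/torch_extensions as in the command above; the actual location is controlled by the TORCH_EXTENSIONS_DIR environment variable.

import os
import shutil

import torch
import deepspeed

# 1. cpu_adam build failures are often a torch/deepspeed version or CUDA mismatch
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("deepspeed:", deepspeed.__version__)

# 2. remove the stale cached build so the extension is recompiled from scratch
ext_dir = os.environ.get("TORCH_EXTENSIONS_DIR", "/tmp/torch_extensions")
if os.path.isdir(ext_dir):
    shutil.rmtree(ext_dir)
    print(f"removed {ext_dir}; cpu_adam will be rebuilt on the next run")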

DeepSpeed ZeRO-3 offload

{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "zero_allow_untested_optimizer": true,
    "fp16": {
      "enabled": "auto",
      "loss_scale": 0,
      "loss_scale_window": 1000,
      "initial_scale_power": 16,
      "hysteresis": 2,
      "min_loss_scale": 1
    },
    "bf16": {
      "enabled": "auto"
    },
    "zero_optimization": {
      "stage": 3,
      "overlap_comm": true,
      "contiguous_gradients": true,
      "sub_group_size": 1e9,
      "reduce_bucket_size": "auto",
      "stage3_prefetch_bucket_size": "auto",
      "stage3_param_persistence_threshold": "auto",
      "stage3_max_live_parameters": 1e9,
      "stage3_max_reuse_distance": 1e9,
      "stage3_gather_16bit_weights_on_model_save": true
    }
  }
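
One detail worth calling out in this config: with "stage3_gather_16bit_weights_on_model_save": true, the Trainer gathers the ZeRO-3 shards and saves a regular 16-bit checkpoint. If that flag were false, the full weights would have to be reconstructed from the sharded checkpoint afterwards, roughly as sketched below (the checkpoint path is a placeholder).

import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# consolidate the sharded ZeRO-3 checkpoint into a single fp32 state dict
state_dict = get_fp32_state_dict_from_zero_checkpoint("./output/checkpoint-156")
torch.save(state_dict, "./output/pytorch_model_fp32.bin")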
{'loss': 1.4062, 'grad_norm': 2.122793574276295, 'learning_rate': 1.589403973509934e-05, 'epoch': 0.96}                                            
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 156/156 [20:17<00:00,  7.65s/it]/root/.conda/envs/demo/lib/python3.10/site-packages/peft/utils/save_and_load.py:195: UserWarning: Could not find a config file in /root/share/model_repos/internlm2-chat-7b - will assume that the vocabulary was not modified.
  warnings.warn(
{'train_runtime': 1225.8007, 'train_samples_per_second': 8.158, 'train_steps_per_second': 0.127, 'train_loss': 1.4307525463593311, 'epoch': 1.0}   
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 156/156 [20:25<00:00,  7.86s/it]
Training seconds: 1227.789188861847 seconds.
Training minutes: 20.46 minutes.
Peak reserved memory = 65.516 GB.
Peak reserved memory for training = 48.928 GB.
Peak reserved memory % of max memory = 82.773 %.
Peak reserved memory for training % of max memory = 61.816 %.