Implementing traditional NLP tasks by fine-tuning an LLM

Preface

The combination of a large language model and instruction tuning is ultimately not a great fit for text-understanding tasks that chase accuracy: given sufficient labeled data, it struggles to match traditional BERT-style fine-tuning. Yet an LLM dwarfs the BERT family in both parameter count and the amount of knowledge absorbed during pretraining, so in principle, if that knowledge can be fully exploited, its text-understanding ability should surpass BERT-family models. The generative recipe of instruction tuning plus prompt engineering does not fully tap that knowledge on text-understanding tasks. So can we instead follow the BERT-style fine-tuning approach, treating the LLM's pretrained weights as a backbone and adapting them to the downstream task? The answer is yes: an LLM is, at its core, also a transformer network, just pretrained differently, so we only need to add a task-specific layer on top of the final layer. In practice, however, this approach runs into the following problem:

Mainstream LLMs today are typically at the 7B-parameter scale or above. Even with LoRA fine-tuning, training and online inference for a model of that size are not cheap, and overfitting such a large model for a single task's accuracy gain looks like a poor trade-off.

This problem was greatly eased when Google released the gemma 2B model. Compared with a 7B model, a model at roughly the 2B scale keeps both training cost and inference latency well under control. In this experiment we therefore take gemma 2B as the base model and explore how well a BERT-style fine-tuning recipe works on it.
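For intuition, here is a minimal sketch of the BERT-style idea described above: reuse the LLM weights as a backbone and attach a task-specific classification head. It is illustrative only; it assumes the local gemma-2b-it path used later in this article and a transformers version that ships a sequence-classification head for Gemma. The experiment that follows instead takes the generative LoRA/SFT route through LLaMA-Factory.

# Illustrative sketch only: gemma-2b-it as a backbone with a 2-class head for MRPC
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_path = "/root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it"  # local path (downloaded in section 2)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(
    model_path,
    num_labels=2,            # MRPC: paraphrase (1) vs. not (0)
    torch_dtype="auto",
)
model.config.pad_token_id = tokenizer.pad_token_id  # the classification head needs a pad token

inputs = tokenizer("Sentence 1: ...\nSentence 2: ...", return_tensors="pt", truncation=True)
logits = model(**inputs).logits  # shape [1, 2]; fine-tune with standard cross-entropy on MRPC labels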

1. Install LLaMA Factory

!nvidia-smi
Wed Jul 31 14:49:16 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:0E:00.0 Off |                  N/A |
| 38%   35C    P8             16W /  350W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
!git clone https://gitclone.com/github.com/hiyouga/LLaMA-Factory.git
Cloning into 'LLaMA-Factory'...
remote: Counting objects: 15692, done.
remote: Compressing objects: 100% (4034/4034), done.
remote: Total 15692 (delta 11655), reused 15452 (delta 11497)
Receiving objects: 100% (15692/15692), 221.51 MiB | 1022.00 KiB/s, done.
Resolving deltas: 100% (11655/11655), done.

The next step takes a while, so you can start it and work on something else in the meantime.

%cd LLaMA-Factory
%ls
!pip install -e .[torch,bitsandbytes]
/root/LLaMA-Factory
CITATION.cff  README.md     docker/         requirements.txt  tests/
LICENSE       README_zh.md  evaluation/     scripts/          train_data/
MANIFEST.in   assets/       examples/       setup.py
Makefile      data/         pyproject.toml  src/
Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Obtaining file:///root/LLaMA-Factory
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
......

Installing collected packages: llamafactory
  Attempting uninstall: llamafactory
    Found existing installation: llamafactory 0.8.3.dev0
    Uninstalling llamafactory-0.8.3.dev0:
      Successfully uninstalled llamafactory-0.8.3.dev0
Successfully installed llamafactory-0.8.3.dev0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

2. Download the LLM model

If you have already downloaded the model, there is no need to download it again. Here we download it from ModelScope:

!pip install modelscope
Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Collecting modelscope
  Downloading http://mirrors.aliyun.com/pypi/packages/38/37/9fe505ebc67ba5e0345a69d6e8b2ee8630523975b484d221691
Installing collected packages: modelscope
Successfully installed modelscope-1.16.1
from modelscope.hub.snapshot_download import snapshot_download
model_dir = snapshot_download(
    'AI-ModelScope/gemma-2b-it', 
     cache_dir='/root/models/gemma-2b-it' 
    )
Downloading: 100%|██████████| 627/627 [00:00<00:00, 1.40kB/s]
Downloading: 100%|██████████| 38.0/38.0 [00:00<00:00, 132B/s]
Downloading: 100%|██████████| 9.34G/9.34G [09:20<00:00, 17.9MB/s]  
Downloading: 100%|██████████| 137/137 [00:00<00:00, 494B/s]
Downloading: 100%|██████████| 4.61G/4.61G [04:35<00:00, 18.0MB/s]
Downloading: 100%|██████████| 64.0M/64.0M [00:02<00:00, 31.4MB/s]
Downloading: 100%|██████████| 13.2k/13.2k [00:02<00:00, 5.88kB/s]
Downloading: 100%|██████████| 23.1k/23.1k [00:00<00:00, 26.7kB/s]
Downloading: 100%|██████████| 636/636 [00:00<00:00, 1.99kB/s]
Downloading: 100%|██████████| 16.7M/16.7M [00:01<00:00, 17.0MB/s]
Downloading: 100%|██████████| 4.04M/4.04M [00:00<00:00, 8.94MB/s]
Downloading: 100%|██████████| 33.4k/33.4k [00:00<00:00, 72.6kB/s]
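snapshot_download returns the directory it actually wrote to; printing it (or listing its contents) is the quickest way to confirm the exact nested path before wiring it into the training config later.

import os

print(model_dir)              # ModelScope nests the repo under cache_dir, e.g. .../AI-ModelScope/gemma-2b-it
print(os.listdir(model_dir))  # should list config.json, the tokenizer files and the safetensors shards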

3. Data conversion

We need to convert the different source data formats into Alpaca-format training data.
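An Alpaca-format dataset is simply a JSON list of records with instruction / input / output fields; for MRPC each record will look like this (the sentences are elided here for brevity):

[
  {
    "instruction": "Determine if the two sentences are semantically equivalent . Output '1' if they are equivalent, '0' if they are not.",
    "input": "Sentence 1: ...\nSentence 2: ...",
    "output": "1"
  },
  ...
]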

3.1 Install the dependencies required for data conversion

!pip install pandas
Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Requirement already satisfied: pandas in ./miniconda3/lib/python3.10/site-packages (2.2.2)
Requirement already satisfied: python-dateutil>=2.8.2 in ./miniconda3/lib/python3.10/site-packages (from pandas) (2.8.2)
Requirement already satisfied: tzdata>=2022.7 in ./miniconda3/lib/python3.10/site-packages (from pandas) (2024.1)
Requirement already satisfied: numpy>=1.22.4 in ./miniconda3/lib/python3.10/site-packages (from pandas) (1.26.3)
Requirement already satisfied: pytz>=2020.1 in ./miniconda3/lib/python3.10/site-packages (from pandas) (2024.1)
Requirement already satisfied: six>=1.5 in ./miniconda3/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

3.2 MRPC

Each source format needs its own small conversion script.

import pandas as pd
import json

Create the instruction: determine whether Sentence 1 and Sentence 2 are semantically equivalent.

instruction="Determine if the two sentences are semantically equivalent . Output '1' if they are equivalent, '0' if they are not."

3.2.1 Converting from TSV format

df = pd.read_csv('./LLaMA-Factory/train_data/MRPC_tsv/train.tsv',sep='\t',on_bad_lines='skip')
df.tail(5)
      Quality    #1 ID    #2 ID                                          #1 String                                          #2 String
3544        1  1620264  1620507  At this point , Mr. Brando announced : ' Some...  Brando said that " somebody ought to put a bul...
3545        0  1848001  1848224  Martin , 58 , will be freed today after servin... Martin served two thirds of a five-year senten...
3546        1   747160   747144  We have concluded that the outlook for price ...  In a statement , the ECB said the outlook for ...
3547        1  2539933  2539850  The notification was first reported Friday by ... MSNBC.com first reported the CIA request on Fr...
3548        0   453575   453448  The 30-year bond US30YT = RR rose 22 / 32 for ... The 30-year bond US30YT = RR grew 1-3 / 32 for...
# Create an empty list to hold the converted records
alpaca_data = []
for index, row in df.iterrows():
    # Build a dict for the current row
    data_point = {
        "instruction": instruction,
        "input": f"Sentence 1: {row['#1 String']}\nSentence 2: {row['#2 String']}",
        "output": str(row['Quality'])
    }
    # Append it to the list
    alpaca_data.append(data_point)

# Serialize the list to a JSON string
alpaca_json = json.dumps(alpaca_data, indent=4)
alpaca_data[-1]
{'instruction': "Determine if the two sentences are semantically equivalent . Output '1' if they are equivalent, '0' if they are not.",
 'input': "Sentence 1: The 30-year bond US30YT = RR rose 22 / 32 for a yield of 4.31 percent , versus 4.35 percent at Wednesday 's close .\nSentence 2: The 30-year bond US30YT = RR grew 1-3 / 32 for a yield of 4.30 percent , down from 4.35 percent late Wednesday .",
 'output': '0'}
# Save to file
with open('LLaMA-Factory/data/MRPC_train_data.json', 'w') as f:
    f.write(alpaca_json)

3.2.2 Converting from Parquet format

df = pd.read_parquet('./LLaMA-Factory/train_data/MRPC_hf/train-00000-of-00001.parquet')
df.tail()
                                              sentence1                                          sentence2  label   idx
3663  " At this point , Mr. Brando announced : ' Som... Brando said that " somebody ought to put a bul...      1  4071
3664  Martin , 58 , will be freed today after servin... Martin served two thirds of a five-year senten...      0  4072
3665  " We have concluded that the outlook for price... In a statement , the ECB said the outlook for ...      1  4073
3666  The notification was first reported Friday by ... MSNBC.com first reported the CIA request on Fr...      1  4074
3667  The 30-year bond US30YT = RR rose 22 / 32 for ... The 30-year bond US30YT = RR grew 1-3 / 32 for...      0  4075
# Create an empty list to hold the converted records
alpaca_data = []
for index, row in df.iterrows():
    # Build a dict for the current row
    data_point = {
        "instruction": instruction,
        "input": f"Sentence 1: {row['sentence1']}\nSentence 2: {row['sentence2']}",
        "output": str(row['label'])
    }
    # Append it to the list
    alpaca_data.append(data_point)

# Serialize the list to a JSON string
alpaca_json = json.dumps(alpaca_data, indent=4)
alpaca_data[-1]
{'instruction': "Determine if the two sentences are semantically equivalent . Output '1' if they are equivalent, '0' if they are not.",
 'input': "Sentence 1: The 30-year bond US30YT = RR rose 22 / 32 for a yield of 4.31 percent , versus 4.35 percent at Wednesday 's close .\nSentence 2: The 30-year bond US30YT = RR grew 1-3 / 32 for a yield of 4.30 percent , down from 4.35 percent late Wednesday .",
 'output': '0'}
# Save to file
with open('LLaMA-Factory/data/MRPC_train_data.json', 'w') as f:
    f.write(alpaca_json)

3.2.3 Converting from JSONL format

# Create an empty list to hold the converted records
alpaca_data = []
# Open and read the JSONL file
with open('./LLaMA-Factory/train_data/MRPC_hf/train.jsonl', 'r') as f:
    for line in f:
        # Parse one JSON record per line
        data_point = json.loads(line)

        # Build the converted record
        alpaca_point = {
            "instruction": instruction,
            "input": f"Sentence 1: {data_point['text1']}\nSentence 2: {data_point['text2']}",
            "output": str(data_point['label'])
        }

        # Append it to the list
        alpaca_data.append(alpaca_point)
alpaca_data[-1]
{'instruction': "Determine if the two sentences are semantically equivalent . Output '1' if they are equivalent, '0' if they are not.",
 'input': "Sentence 1: The 30-year bond US30YT = RR rose 22 / 32 for a yield of 4.31 percent , versus 4.35 percent at Wednesday 's close .\nSentence 2: The 30-year bond US30YT = RR grew 1-3 / 32 for a yield of 4.30 percent , down from 4.35 percent late Wednesday .",
 'output': '0'}
# Serialize and save to file (re-dump alpaca_data here so we don't write the stale alpaca_json left over from the previous cell)
alpaca_json = json.dumps(alpaca_data, indent=4)
with open('LLaMA-Factory/data/MRPC_train_data.json', 'w') as f:
    f.write(alpaca_json)
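All three conversions above produce the same record structure, so if you plan to convert several GLUE-style datasets it can be handy to fold the logic into one small helper. A sketch (the to_alpaca name and its parameters are my own, not part of LLaMA-Factory; it reuses the instruction string and the json import from the cells above):

def to_alpaca(rows, sent1_key, sent2_key, label_key, out_path):
    """Convert an iterable of dict-like rows into an Alpaca-format JSON file."""
    data = [
        {
            "instruction": instruction,
            "input": f"Sentence 1: {r[sent1_key]}\nSentence 2: {r[sent2_key]}",
            "output": str(r[label_key]),
        }
        for r in rows
    ]
    with open(out_path, "w") as f:
        json.dump(data, f, indent=4)
    return data

# e.g. for the parquet variant:
# to_alpaca(df.to_dict("records"), "sentence1", "sentence2", "label", "LLaMA-Factory/data/MRPC_train_data.json")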

3.3 Edit dataset_info.json

Register the generated file name in LLaMA-Factory/data/dataset_info.json:

{
  "MRPC_train_data": {
    "file_name": "MRPC_train_data.json"
  },
  "identity": {
    "file_name": "identity.json"
  },
  "alpaca_en_demo": {
    "file_name": "alpaca_en_demo.json"
  },...
}
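A quick sanity check that the registered file parses and that every record carries the three keys expected for the default Alpaca columns (a simple sketch, not an official LLaMA-Factory validator):

import json

with open('LLaMA-Factory/data/MRPC_train_data.json') as f:
    records = json.load(f)

assert all({"instruction", "input", "output"} <= set(r) for r in records)
print(len(records), "records; first label:", records[0]["output"])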

3.4 Prepare the test set

df = pd.read_csv('./LLaMA-Factory/train_data/MRPC_tsv/dev.tsv',sep='\t',on_bad_lines='skip')
df.tail()
     Quality    #1 ID    #2 ID                                          #1 String                                          #2 String
383        0  2977500  2977547  Their contract will expire at 12 : 01 a.m. Wed... It has outraged the membership , said Rian W...
384        1  3107137  3107119  But plaque volume increased by 2.7 percent in ... The volume of plaque in Pravachol patients ' a...
385        1  1619244  1619274  Today in the US , the book - kept under wraps ... Tomorrow the book , kept under wraps by G. P. ...
386        0  3061836  3062031  The S & P / TSX composite rose 87.74 points on... On the week , the Dow Jones industrial average...
387        1   485999   486011  Ex-KGB agent Putin added that the Beatles were... In Soviet times the Beatles ' music " was cons...
# Create an empty list to hold the converted records
alpaca_data = []
for index, row in df.iterrows():
    # Build a dict for the current row
    data_point = {
        "instruction": instruction,
        "input": f"Sentence 1: {row['#1 String']}\nSentence 2: {row['#2 String']}",
        "output": str(row['Quality'])
    }
    # Append it to the list
    alpaca_data.append(data_point)

# Serialize the list to a JSON string
alpaca_json = json.dumps(alpaca_data, indent=4)
# Save to file
with open('LLaMA-Factory/data/MRPC_test_data.json', 'w') as f:
    f.write(alpaca_json)

4. Training configuration

Since I don't know how to expose a custom port on the AutoDL compute platform, and GRADIO_SHARE links are not reachable from mainland China, we skip the web UI here and train from the command line instead.

%cd LLaMA-Factory/
/root/LLaMA-Factory
# Use the physical path of the gemma-2b-it model downloaded in section 2;
# ModelScope nests the repo under cache_dir, so verify the exact path manually
model_name_or_path="/root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it"
# Path where the LoRA adapter will be saved
adapter_name_or_path="train_MRPC"
  
import json

args = dict(
  stage="sft",  # supervised instruction fine-tuning (SFT)
  do_train="True",
  model_name_or_path=model_name_or_path,
  preprocessing_num_workers=16,
  finetuning_type="lora",   # use a LoRA adapter to save GPU memory
  template="gemma",   # use the gemma prompt template
  flash_attn="auto",
  dataset_dir="data",
  dataset="MRPC_train_data",      # the MRPC_train_data dataset registered in section 3.3
  cutoff_len=1024,
  learning_rate=5e-05,    # learning rate
  num_train_epochs=3.0,    # number of training epochs
  max_samples=100000,      # cap on the number of samples taken from each dataset
  per_device_train_batch_size=2,
  gradient_accumulation_steps=8,
  lr_scheduler_type="cosine",    # cosine learning-rate schedule
  max_grad_norm=1.0,           # clip the gradient norm to 1.0
  logging_steps=10,             # log every 10 steps
  save_steps=1000,         # save a checkpoint every 1000 steps
  warmup_steps=0,             # learning-rate warmup steps (0 = no warmup here)
  optim="adamw_torch",
  output_dir="train_MRPC",   # path where the LoRA adapter is saved
  plot_loss=True,
  ddp_timeout=180000000,
  include_num_input_tokens_seen=True,
  lora_rank=8,
  lora_alpha=16,
  lora_dropout=0,  # LoRA dropout rate
  fp16=True,  # float16 mixed-precision training
  lora_target="all"   # attach LoRA adapters to all linear layers
)

json.dump(args, open("train_gemma.json", "w", encoding="utf-8"), indent=2)
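As a quick cross-check of the schedule: with a per-device batch size of 2, 8 gradient-accumulation steps and a single GPU, the effective batch size is 16, so the 3,668 MRPC training examples give roughly 3668 // 16 = 229 optimization steps per epoch and about 687 steps over 3 epochs, which matches the "Total optimization steps = 687" line in the log below.

# rough schedule cross-check (numbers match the training log below)
num_examples = 3668                 # "Num examples" reported by the Trainer
effective_batch = 2 * 8 * 1         # per-device batch * grad accumulation * num GPUs
steps_per_epoch = num_examples // effective_batch   # 229
total_steps = steps_per_epoch * 3                   # 687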

!llamafactory-cli train train_gemma.json
07/31/2024 17:34:30 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: torch.float16
[INFO|tokenization_utils_base.py:2287] 2024-07-31 17:34:30,213 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2287] 2024-07-31 17:34:30,213 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2287] 2024-07-31 17:34:30,213 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2287] 2024-07-31 17:34:30,213 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2287] 2024-07-31 17:34:30,213 >> loading file tokenizer_config.json
07/31/2024 17:34:31 - INFO - llamafactory.data.loader - Loading dataset MRPC_train_data.json...
input_ids:
[2, 106, 1645, 108, 99211, 1013, 573, 1378, 26099, 708, 147440, 11070, 13107, 954, 16230, 777, 235274, 235303, 1013, 984, 708, 13107, 235269, 777, 235276, 235303, 1013, 984, 708, 780, 235265, 108, 86386, 235248, 235274, 235292, 4181, 514, 3423, 19538, 926, 8462, 1688, 7624, 693, 3151, 664, 573, 13229, 664, 1688, 576, 46971, 1697, 87242, 926, 5820, 954, 108, 86386, 235248, 235284, 235292, 165244, 577, 1357, 685, 1297, 664, 573, 13229, 664, 1688, 4181, 514, 3423, 19538, 926, 8462, 576, 46971, 1697, 87242, 926, 5820, 954, 107, 108, 106, 2516, 108, 235274, 1]
inputs:
<bos><start_of_turn>user
Determine if the two sentences are semantically equivalent . Output '1' if they are equivalent, '0' if they are not.
Sentence 1: Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .
Sentence 2: Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .<end_of_turn>
<start_of_turn>model
1<eos>
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 235274, 1]
labels:
1<eos>
[INFO|configuration_utils.py:731] 2024-07-31 17:34:32,639 >> loading configuration file /root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it/config.json
[INFO|configuration_utils.py:800] 2024-07-31 17:34:32,640 >> Model config GemmaConfig {
  "_name_or_path": "/root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it",
  "architectures": [
    "GemmaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 2,
  "eos_token_id": 1,
  "head_dim": 256,
  "hidden_act": "gelu",
  "hidden_activation": null,
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 16384,
  "max_position_embeddings": 8192,
  "model_type": "gemma",
  "num_attention_heads": 8,
  "num_hidden_layers": 18,
  "num_key_value_heads": 1,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 256000
}

[INFO|modeling_utils.py:3631] 2024-07-31 17:34:32,662 >> loading weights file /root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it/model.safetensors.index.json
[INFO|modeling_utils.py:1572] 2024-07-31 17:34:32,663 >> Instantiating GemmaForCausalLM model under default dtype torch.float16.
[INFO|configuration_utils.py:1038] 2024-07-31 17:34:32,664 >> Generate config GenerationConfig {
  "bos_token_id": 2,
  "eos_token_id": 1,
  "pad_token_id": 0
}

[WARNING|logging.py:328] 2024-07-31 17:34:32,667 >> `config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:02<00:00,  1.15s/it]
[INFO|modeling_utils.py:4463] 2024-07-31 17:34:35,011 >> All model checkpoint weights were used when initializing GemmaForCausalLM.

[INFO|modeling_utils.py:4471] 2024-07-31 17:34:35,011 >> All the weights of GemmaForCausalLM were initialized from the model checkpoint at /root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GemmaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:991] 2024-07-31 17:34:35,013 >> loading configuration file /root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it/generation_config.json
[INFO|configuration_utils.py:1038] 2024-07-31 17:34:35,013 >> Generate config GenerationConfig {
  "bos_token_id": 2,
  "eos_token_id": 1,
  "pad_token_id": 0
}

07/31/2024 17:34:35 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
07/31/2024 17:34:35 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
07/31/2024 17:34:35 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
07/31/2024 17:34:35 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
07/31/2024 17:34:35 - INFO - llamafactory.model.model_utils.misc - Found linear modules: k_proj,gate_proj,o_proj,down_proj,v_proj,up_proj,q_proj
07/31/2024 17:34:36 - INFO - llamafactory.model.loader - trainable params: 9,805,824 || all params: 2,515,978,240 || trainable%: 0.3897
[INFO|trainer.py:648] 2024-07-31 17:34:36,180 >> Using auto half precision backend
[INFO|trainer.py:2134] 2024-07-31 17:34:36,526 >> ***** Running training *****
[INFO|trainer.py:2135] 2024-07-31 17:34:36,526 >>   Num examples = 3,668
[INFO|trainer.py:2136] 2024-07-31 17:34:36,526 >>   Num Epochs = 3
[INFO|trainer.py:2137] 2024-07-31 17:34:36,526 >>   Instantaneous batch size per device = 2
[INFO|trainer.py:2140] 2024-07-31 17:34:36,526 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:2141] 2024-07-31 17:34:36,526 >>   Gradient Accumulation steps = 8
[INFO|trainer.py:2142] 2024-07-31 17:34:36,526 >>   Total optimization steps = 687
[INFO|trainer.py:2143] 2024-07-31 17:34:36,530 >>   Number of trainable parameters = 9,805,824
{'loss': 2.8161, 'grad_norm': 17.23887062072754, 'learning_rate': 4.9978830041808596e-05, 'epoch': 0.04, 'num_input_tokens_seen': 17600}
......
{'loss': 0.0738, 'grad_norm': 0.30729034543037415, 'learning_rate': 1.6727376094963222e-08, 'epoch': 2.97, 'num_input_tokens_seen': 1214928}
100%|█████████████████████████████████████████| 687/687 [13:29<00:00,  1.15s/it][INFO|trainer.py:3503] 2024-07-31 17:48:05,664 >> Saving model checkpoint to train_MRPC/checkpoint-687
[INFO|configuration_utils.py:731] 2024-07-31 17:48:05,684 >> loading configuration file /root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it/config.json
[INFO|configuration_utils.py:800] 2024-07-31 17:48:05,685 >> Model config GemmaConfig {
  "architectures": [
    "GemmaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 2,
  "eos_token_id": 1,
  "head_dim": 256,
  "hidden_act": "gelu",
  "hidden_activation": null,
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 16384,
  "max_position_embeddings": 8192,
  "model_type": "gemma",
  "num_attention_heads": 8,
  "num_hidden_layers": 18,
  "num_key_value_heads": 1,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 256000
}

[INFO|tokenization_utils_base.py:2702] 2024-07-31 17:48:05,739 >> tokenizer config file saved in train_MRPC/checkpoint-687/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-07-31 17:48:05,739 >> Special tokens file saved in train_MRPC/checkpoint-687/special_tokens_map.json
[INFO|trainer.py:2394] 2024-07-31 17:48:06,301 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)


{'train_runtime': 809.7713, 'train_samples_per_second': 13.589, 'train_steps_per_second': 0.848, 'train_loss': 0.19551172700684838, 'epoch': 3.0, 'num_input_tokens_seen': 1227456}
100%|█████████████████████████████████████████| 687/687 [13:29<00:00,  1.18s/it]
[INFO|trainer.py:3503] 2024-07-31 17:48:06,304 >> Saving model checkpoint to train_MRPC
[INFO|configuration_utils.py:731] 2024-07-31 17:48:06,324 >> loading configuration file /root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it/config.json
[INFO|configuration_utils.py:800] 2024-07-31 17:48:06,325 >> Model config GemmaConfig {
  "architectures": [
    "GemmaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 2,
  "eos_token_id": 1,
  "head_dim": 256,
  "hidden_act": "gelu",
  "hidden_activation": null,
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 16384,
  "max_position_embeddings": 8192,
  "model_type": "gemma",
  "num_attention_heads": 8,
  "num_hidden_layers": 18,
  "num_key_value_heads": 1,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 256000
}

[INFO|tokenization_utils_base.py:2702] 2024-07-31 17:48:06,369 >> tokenizer config file saved in train_MRPC/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-07-31 17:48:06,369 >> Special tokens file saved in train_MRPC/special_tokens_map.json
***** train metrics *****
  epoch                    =     2.9967
  num_input_tokens_seen    =    1227456
  total_flos               = 13660893GF
  train_loss               =     0.1955
  train_runtime            = 0:13:29.77
  train_samples_per_second =     13.589
  train_steps_per_second   =      0.848
Figure saved at: train_MRPC/training_loss.png
07/31/2024 17:48:06 - WARNING - llamafactory.extras.ploting - No metric eval_loss to plot.
07/31/2024 17:48:06 - WARNING - llamafactory.extras.ploting - No metric eval_accuracy to plot.
[INFO|modelcard.py:449] 2024-07-31 17:48:06,913 >> Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}

5. View the training results

from IPython.display import Image

# Path to the loss curve saved during training
image_path = 'train_MRPC/training_loss.png'

# Display the image
Image(filename=image_path)
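Besides the saved PNG, LLaMA-Factory normally keeps a per-step JSONL log next to the adapter (trainer_log.jsonl in the output directory); if it is present you can inspect or re-plot the loss yourself:

# optional: read the raw per-step log if train_MRPC/trainer_log.jsonl was written
import json, os

log_path = 'train_MRPC/trainer_log.jsonl'
if os.path.exists(log_path):
    with open(log_path) as f:
        steps = [json.loads(line) for line in f]
    print(steps[-1])  # last logged entry: loss, learning rate, epoch, ...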

[Figure: training loss curve, saved at train_MRPC/training_loss.png]

6. Evaluate on the test set

from llamafactory.chat import ChatModel
from llamafactory.extras.misc import torch_gc
from tqdm import tqdm
import json
import pandas as pd

# Load the evaluation-set JSON file
with open('data/MRPC_test_data.json', 'r') as f:
    evaluation_data = json.load(f)

args = dict(
    model_name_or_path=model_name_or_path,
    adapter_name_or_path="train_MRPC",  # load the LoRA adapter saved during training
    template="gemma",  # must match training
    finetuning_type="lora"  # must match training
)
chat_model = ChatModel(args)

results = []

# Use tqdm to show a progress bar
for sample in tqdm(evaluation_data, desc="Evaluating"):
    instruction = sample['instruction']
    input_text = sample['input']
    expected_output = sample['output']

    messages = [
        {"role": "user", "content": f"{instruction}\n{input_text}"}
    ]

    response = ""
    for new_text in chat_model.stream_chat(messages):
        response += new_text

    # Record the model output alongside the expected output
    results.append({
        "instruction": instruction,
        "input": input_text,
        "expected_output": expected_output,
        "model_output": response.strip()
    })

# Collect the results into a DataFrame
df = pd.DataFrame(results)
# Save the DataFrame to a CSV file
df.to_csv('evaluation_results.csv', index=False)

torch_gc()
[INFO|tokenization_utils_base.py:2287] 2024-07-31 22:19:25,520 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2287] 2024-07-31 22:19:25,520 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2287] 2024-07-31 22:19:25,520 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2287] 2024-07-31 22:19:25,521 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2287] 2024-07-31 22:19:25,521 >> loading file tokenizer_config.json
[INFO|configuration_utils.py:731] 2024-07-31 22:19:26,175 >> loading configuration file /root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it/config.json
[INFO|configuration_utils.py:800] 2024-07-31 22:19:26,176 >> Model config GemmaConfig {
  "_name_or_path": "/root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it",
  "architectures": [
    "GemmaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 2,
  "eos_token_id": 1,
  "head_dim": 256,
  "hidden_act": "gelu",
  "hidden_activation": null,
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 16384,
  "max_position_embeddings": 8192,
  "model_type": "gemma",
  "num_attention_heads": 8,
  "num_hidden_layers": 18,
  "num_key_value_heads": 1,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 256000
}

......
07/31/2024 22:19:27 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
07/31/2024 22:19:28 - INFO - llamafactory.model.adapter - Merged 1 adapter(s).
07/31/2024 22:19:28 - INFO - llamafactory.model.adapter - Loaded adapter(s): train_MRPC
07/31/2024 22:19:28 - INFO - llamafactory.model.loader - all params: 2,506,172,416


Evaluating: 100%|██████████| 388/388 [00:16<00:00, 23.67it/s]

7. A simple score on the evaluation set

!pip install scikit-learn
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Requirement already satisfied: scikit-learn in /root/miniconda3/lib/python3.10/site-packages (1.5.1)
Requirement already satisfied: threadpoolctl>=3.1.0 in /root/miniconda3/lib/python3.10/site-packages (from scikit-learn) (3.5.0)
Requirement already satisfied: scipy>=1.6.0 in /root/miniconda3/lib/python3.10/site-packages (from scikit-learn) (1.14.0)
Requirement already satisfied: numpy>=1.19.5 in /root/miniconda3/lib/python3.10/site-packages (from scikit-learn) (1.26.3)
Requirement already satisfied: joblib>=1.2.0 in /root/miniconda3/lib/python3.10/site-packages (from scikit-learn) (1.4.2)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Read the CSV file
df = pd.read_csv('evaluation_results.csv')

# Make sure the model_output and expected_output columns are strings
df['model_output'] = df['model_output'].astype(str)
df['expected_output'] = df['expected_output'].astype(str)

# Map the model outputs and expected outputs to binary labels
df['model_output_binary'] = df['model_output'].apply(lambda x: 1 if x.strip() == '1' else 0)
df['expected_output_binary'] = df['expected_output'].apply(lambda x: 1 if x.strip() == '1' else 0)

# Compute the evaluation metrics
accuracy = accuracy_score(df['expected_output_binary'], df['model_output_binary'])
precision = precision_score(df['expected_output_binary'], df['model_output_binary'])
recall = recall_score(df['expected_output_binary'], df['model_output_binary'])
f1 = f1_score(df['expected_output_binary'], df['model_output_binary'])

# Print the metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
Accuracy: 0.8686
Precision: 0.8873
Recall: 0.9242
F1 Score: 0.9054

Accuracy, precision, recall, and F1 score are the standard metrics for evaluating a classification model's performance; their definitions are given below.
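Taking label '1' (paraphrase) as the positive class and writing TP, FP, TN, FN for the confusion-matrix counts:

Accuracy  = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F1        = 2 * Precision * Recall / (Precision + Recall)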
