Fine-tuning an LLM for traditional NLP tasks

Article link: https://blog.csdn.net/qq_22544887/article/details/140841391

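Contents

Preface
1. Install LLaMA Factory
2. Download the LLM
3. Data conversion
4. Training configuration
5. View the training results
6. Evaluation
7. Simple scoring on the evaluation set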

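Preface

Combining a large model with instruction tuning is ultimately a poor fit for text-understanding tasks where the goal is higher accuracy. When enough labeled data is available, it struggles to match traditional BERT-style fine-tuning. Still, large models far exceed the earlier BERT-family models both in parameter count and in the amount of knowledge they have absorbed, so in principle, if that knowledge can be fully exploited, their text-understanding ability should surpass that of BERT-family models. Generative approaches built on instruction tuning plus prompt engineering do not make full use of this knowledge on text-understanding tasks. Can we instead follow the BERT-style recipe, use the large model's weights as a backbone, and adapt it specifically to the downstream task? Yes: a large model is essentially a transformer network too, only pretrained in a different way, so it is enough to add a task-specific layer on top of the final layer. In practice, however, this approach may run into the following problem:

Mainstream large models today usually have 7B parameters or more. At that scale, training and online inference are costly even with LoRA fine-tuning, and overfitting a large model to a single task for the sake of an accuracy gain hardly seems worth it.

This problem eased considerably once Google released the Gemma 2B model. Compared with 7B parameters, a model of around 2B keeps both training cost and inference latency well under control. This experiment therefore takes Gemma 2B as the base model and explores how it performs under BERT-style fine-tuning.

1. Install LLaMA Factory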

!nvidia-smi

Wed Jul 31 14:49:16 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:0E:00.0 Off |                  N/A |
| 38%   35C    P8             16W /  350W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

!git clone https://gitclone.com/github.com/hiyouga/LLaMA-Factory.git

Cloning into 'LLaMA-Factory'...
remote: Counting objects: 15692, done.
remote: Compressing objects: 100% (4034/4034), done.
remote: Total 15692 (delta 11655), reused 15452 (delta 11497)
Receiving objects: 100% (15692/15692), 221.51 MiB | 1022.00 KiB/s, done.
Resolving deltas: 100% (11655/11655), done.

The next step takes a while; start it and do something else in the meantime.

%cd LLaMA-Factory
%ls
!pip install -e .[torch,bitsandbytes]

/root/LLaMA-Factory
CITATION.cff  README.md     docker/         requirements.txt  tests/
LICENSE       README_zh.md  evaluation/     scripts/          train_data/
MANIFEST.in   assets/       examples/       setup.py
Makefile      data/         pyproject.toml  src/
Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Obtaining file:///root/LLaMA-Factory
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
......

Installing collected packages: llamafactory
  Attempting uninstall: llamafactory
    Found existing installation: llamafactory 0.8.3.dev0
    Uninstalling llamafactory-0.8.3.dev0:
      Successfully uninstalled llamafactory-0.8.3.dev0
Successfully installed llamafactory-0.8.3.dev0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

2. Download the LLM

If the model has already been downloaded, skip this step. Here we download it from ModelScope.

!pip install modelscope

Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Collecting modelscope
  Downloading http://mirrors.aliyun.com/pypi/packages/38/37/9fe505ebc67ba5e0345a69d6e8b2ee8630523975b484d221691
Installing collected packages: modelscope
Successfully installed modelscope-1.16.1

from modelscope.hub.snapshot_download import snapshot_download
model_dir = snapshot_download(
    'AI-ModelScope/gemma-2b-it', 
     cache_dir='/root/models/gemma-2b-it' 
    )

Downloading: 100%|██████████| 627/627 [00:00<00:00, 1.40kB/s]
Downloading: 100%|██████████| 38.0/38.0 [00:00<00:00, 132B/s]
Downloading: 100%|██████████| 9.34G/9.34G [09:20<00:00, 17.9MB/s]  
Downloading: 100%|██████████| 137/137 [00:00<00:00, 494B/s]
Downloading: 100%|██████████| 4.61G/4.61G [04:35<00:00, 18.0MB/s]
Downloading: 100%|██████████| 64.0M/64.0M [00:02<00:00, 31.4MB/s]
Downloading: 100%|██████████| 13.2k/13.2k [00:02<00:00, 5.88kB/s]
Downloading: 100%|██████████| 23.1k/23.1k [00:00<00:00, 26.7kB/s]
Downloading: 100%|██████████| 636/636 [00:00<00:00, 1.99kB/s]
Downloading: 100%|██████████| 16.7M/16.7M [00:01<00:00, 17.0MB/s]
Downloading: 100%|██████████| 4.04M/4.04M [00:00<00:00, 8.94MB/s]
Downloading: 100%|██████████| 33.4k/33.4k [00:00<00:00, 72.6kB/s]

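3. Data conversion

The source data comes in several formats, and all of them need to be converted into Alpaca-format training data.

3.1 Install the dependencies needed for data conversion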

!pip install pandas

Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Requirement already satisfied: pandas in ./miniconda3/lib/python3.10/site-packages (2.2.2)
Requirement already satisfied: python-dateutil>=2.8.2 in ./miniconda3/lib/python3.10/site-packages (from pandas) (2.8.2)
Requirement already satisfied: tzdata>=2022.7 in ./miniconda3/lib/python3.10/site-packages (from pandas) (2024.1)
Requirement already satisfied: numpy>=1.22.4 in ./miniconda3/lib/python3.10/site-packages (from pandas) (1.26.3)
Requirement already satisfied: pytz>=2020.1 in ./miniconda3/lib/python3.10/site-packages (from pandas) (2024.1)
Requirement already satisfied: six>=1.5 in ./miniconda3/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

3.2 MRPC

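Each source format needs its own conversion code. The variants in 3.2.1 to 3.2.3 all write to the same output file, so each run overwrites the previous one and only one of them needs to be executed.

import pandas as pd
import json

Create the instruction: decide whether Sentence 1 and Sentence 2 are semantically equivalent.

instruction="Determine if the two sentences are semantically equivalent . Output '1' if they are equivalent, '0' if they are not."

3.2.1 Converting from CSV/TSV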

df = pd.read_csv('./LLaMA-Factory/train_data/MRPC_tsv/train.tsv',sep='\t',on_bad_lines='skip')

df.tail(5)

	Quality	#1 ID	#2 ID	#1 String	#2 String
3544	1	1620264	1620507	At this point , Mr. Brando announced : ' Some...	Brando said that " somebody ought to put a bul...
3545	0	1848001	1848224	Martin , 58 , will be freed today after servin...	Martin served two thirds of a five-year senten...
3546	1	747160	747144	We have concluded that the outlook for price ...	In a statement , the ECB said the outlook for ...
3547	1	2539933	2539850	The notification was first reported Friday by ...	MSNBC.com first reported the CIA request on Fr...
3548	0	453575	453448	The 30-year bond US30YT = RR rose 22 / 32 for ...	The 30-year bond US30YT = RR grew 1-3 / 32 for...

# Create an empty list to hold the converted data
alpaca_data = []
for index, row in df.iterrows():
    # Build a dict for the current row
    data_point = {
        "instruction": instruction,
        "input": f"Sentence 1: {row['#1 String']}\nSentence 2: {row['#2 String']}",
        "output": str(row['Quality'])
    }
    # Append the dict to the list
    alpaca_data.append(data_point)

# Convert the list to a JSON string
alpaca_json = json.dumps(alpaca_data, indent=4)

alpaca_data[-1]

{'instruction': "Determine if the two sentences are semantically equivalent . Output '1' if they are equivalent, '0' if they are not.",
 'input': "Sentence 1: The 30-year bond US30YT = RR rose 22 / 32 for a yield of 4.31 percent , versus 4.35 percent at Wednesday 's close .\nSentence 2: The 30-year bond US30YT = RR grew 1-3 / 32 for a yield of 4.30 percent , down from 4.35 percent late Wednesday .",
 'output': '0'}

# Save to file
with open('LLaMA-Factory/data/MRPC_train_data.json', 'w') as f:
    f.write(alpaca_json)
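
A side note on the TSV load above: its last index is 3548, while the Parquet version of the same split in the next subsection ends at 3667, and the leading quotation mark of the "At this point" sentence is missing from the TSV preview. This suggests that quote characters in the raw file interfered with parsing and that some rows were merged or skipped. The following is a minimal sketch of one possible workaround, assuming the file contains no real CSV quoting: disable quote handling with `csv.QUOTE_NONE`. The two sample rows and the `tsv_to_alpaca` helper are made up for illustration and are not part of the original notebook.

```python
import csv
import io
import json

INSTRUCTION = ("Determine if the two sentences are semantically equivalent . "
               "Output '1' if they are equivalent, '0' if they are not.")

# Two made-up rows in the MRPC column layout; the first contains an unbalanced quote.
SAMPLE_TSV = (
    "Quality\t#1 ID\t#2 ID\t#1 String\t#2 String\n"
    "1\t11\t12\t\" He left , she said .\tShe said he left .\n"
    "0\t21\t22\tPrices rose .\tPrices fell .\n"
)

def tsv_to_alpaca(handle):
    """Read tab-separated rows without any quote processing and build Alpaca records."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        {
            "instruction": INSTRUCTION,
            "input": f"Sentence 1: {row['#1 String']}\nSentence 2: {row['#2 String']}",
            "output": str(row["Quality"]),
        }
        for row in reader
    ]

records = tsv_to_alpaca(io.StringIO(SAMPLE_TSV))
print(len(records))
print(json.dumps(records[0], indent=4))
```

pandas' `read_csv` accepts the same constant through its `quoting` parameter, so the notebook's loader could be adapted in the same way if the row counts turn out to differ.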

3.2.2 Converting from Parquet

df = pd.read_parquet('./LLaMA-Factory/train_data/MRPC_hf/train-00000-of-00001.parquet')
df.tail()

	sentence1	sentence2	label	idx
3663	" At this point , Mr. Brando announced : ' Som...	Brando said that " somebody ought to put a bul...	1	4071
3664	Martin , 58 , will be freed today after servin...	Martin served two thirds of a five-year senten...	0	4072
3665	" We have concluded that the outlook for price...	In a statement , the ECB said the outlook for ...	1	4073
3666	The notification was first reported Friday by ...	MSNBC.com first reported the CIA request on Fr...	1	4074
3667	The 30-year bond US30YT = RR rose 22 / 32 for ...	The 30-year bond US30YT = RR grew 1-3 / 32 for...	0	4075

# Create an empty list to hold the converted data
alpaca_data = []
for index, row in df.iterrows():
    # Build a dict for the current row
    data_point = {
        "instruction": instruction,
        "input": f"Sentence 1: {row['sentence1']}\nSentence 2: {row['sentence2']}",
        "output": str(row['label'])
    }
    # Append the dict to the list
    alpaca_data.append(data_point)

# Convert the list to a JSON string
alpaca_json = json.dumps(alpaca_data, indent=4)
alpaca_data[-1]

{'instruction': "Determine if the two sentences are semantically equivalent . Output '1' if they are equivalent, '0' if they are not.",
 'input': "Sentence 1: The 30-year bond US30YT = RR rose 22 / 32 for a yield of 4.31 percent , versus 4.35 percent at Wednesday 's close .\nSentence 2: The 30-year bond US30YT = RR grew 1-3 / 32 for a yield of 4.30 percent , down from 4.35 percent late Wednesday .",
 'output': '0'}

# Save to file
with open('LLaMA-Factory/data/MRPC_train_data.json', 'w') as f:
    f.write(alpaca_json)

3.2.3 Converting from JSONL

# Create an empty list to hold the converted data
alpaca_data = []
# Open and read the JSONL file
with open('./LLaMA-Factory/train_data/MRPC_hf/train.jsonl', 'r') as f:
    for line in f:
        # Parse one JSON record per line
        data_point = json.loads(line)
        
        # Build a dict for the converted record
        alpaca_point = {
            "instruction": instruction,
            "input": f"Sentence 1: {data_point['text1']}\nSentence 2: {data_point['text2']}",
            "output": str(data_point['label'])
        }
        
        # Append the dict to the list
        alpaca_data.append(alpaca_point)

alpaca_data[-1]

{'instruction': "Determine if the two sentences are semantically equivalent . Output '1' if they are equivalent, '0' if they are not.",
 'input': "Sentence 1: The 30-year bond US30YT = RR rose 22 / 32 for a yield of 4.31 percent , versus 4.35 percent at Wednesday 's close .\nSentence 2: The 30-year bond US30YT = RR grew 1-3 / 32 for a yield of 4.30 percent , down from 4.35 percent late Wednesday .",
 'output': '0'}

# Convert the list to a JSON string
alpaca_json = json.dumps(alpaca_data, indent=4)

# Save to file
with open('LLaMA-Factory/data/MRPC_train_data.json', 'w') as f:
    f.write(alpaca_json)

3.3 Edit dataset_info.json

Register the generated file in LLaMA-Factory/data/dataset_info.json:

{
  "MRPC_train_data": {
    "file_name": "MRPC_train_data.json"
  },
  "identity": {
    "file_name": "identity.json"
  },
  "alpaca_en_demo": {
    "file_name": "alpaca_en_demo.json"
  },...
}
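
The same entry can also be added from code instead of by hand. The snippet below is a minimal sketch using only the standard library; the `register_dataset` helper is a hypothetical name introduced here, and the demo runs against a temporary file rather than the real LLaMA-Factory/data/dataset_info.json.

```python
import json
import os
import tempfile

def register_dataset(info_path, name, file_name):
    """Add or overwrite one entry in dataset_info.json and return the full mapping."""
    if os.path.exists(info_path):
        with open(info_path, "r", encoding="utf-8") as f:
            info = json.load(f)
    else:
        info = {}
    info[name] = {"file_name": file_name}
    with open(info_path, "w", encoding="utf-8") as f:
        json.dump(info, f, indent=2, ensure_ascii=False)
    return info

# Demo on a throwaway copy that already holds one entry
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "dataset_info.json")
with open(demo_path, "w", encoding="utf-8") as f:
    json.dump({"identity": {"file_name": "identity.json"}}, f)

info = register_dataset(demo_path, "MRPC_train_data", "MRPC_train_data.json")
print(sorted(info))
```

Existing entries are left untouched, so the helper can be called once per dataset file.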

3.4 Prepare the test set

df = pd.read_csv('./LLaMA-Factory/train_data/MRPC_tsv/dev.tsv',sep='\t',on_bad_lines='skip')
df.tail()

	Quality	#1 ID	#2 ID	#1 String	#2 String
383	0	2977500	2977547	Their contract will expire at 12 : 01 a.m. Wed...	It has outraged the membership , said Rian W...
384	1	3107137	3107119	But plaque volume increased by 2.7 percent in ...	The volume of plaque in Pravachol patients ' a...
385	1	1619244	1619274	Today in the US , the book - kept under wraps ...	Tomorrow the book , kept under wraps by G. P. ...
386	0	3061836	3062031	The S & P / TSX composite rose 87.74 points on...	On the week , the Dow Jones industrial average...
387	1	485999	486011	Ex-KGB agent Putin added that the Beatles were...	In Soviet times the Beatles ' music " was cons...

# Create an empty list to hold the converted data
alpaca_data = []
for index, row in df.iterrows():
    # Build a dict for the current row
    data_point = {
        "instruction": instruction,
        "input": f"Sentence 1: {row['#1 String']}\nSentence 2: {row['#2 String']}",
        "output": str(row['Quality'])
    }
    # Append the dict to the list
    alpaca_data.append(data_point)

# Convert the list to a JSON string
alpaca_json = json.dumps(alpaca_data, indent=4)

# Save to file
with open('LLaMA-Factory/data/MRPC_test_data.json', 'w') as f:
    f.write(alpaca_json)

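4. Training configuration

I could not work out how to reach a custom port on the AutoDL compute platform, and GRADIO_SHARE links are not reachable from mainland China, so the web UI for training is not started here.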

%cd LLaMA-Factory/

/root/LLaMA-Factory

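# Local path of gemma-2b-it (see the model download step in the environment-setup notebook, 1训练环境准备).
# Check the ModelScope download path by hand: the repository name is nested under cache_dir.
model_name_or_path="/root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it"
# Path where the LoRA adapter is saved
adapter_name_or_path="train_MRPC"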

import json

args = dict(
  stage="sft",  # supervised instruction fine-tuning
  do_train=True,
  model_name_or_path=model_name_or_path,    
  preprocessing_num_workers=16,
  finetuning_type="lora",   # use LoRA adapters to save GPU memory
  template="gemma",   # use the gemma prompt template
  flash_attn="auto",
  dataset_dir="data",
  dataset="MRPC_train_data",      # the MRPC_train_data dataset; see section 2.2 of the data-preparation notebook (2数据准备.ipynb)
  cutoff_len=1024,
  learning_rate=5e-05,    # learning rate
  num_train_epochs=3.0,    # number of training epochs
  max_samples=100000,      # maximum number of samples used from each dataset
  per_device_train_batch_size=2,
  gradient_accumulation_steps=8,
  lr_scheduler_type="cosine",    # cosine learning-rate decay
  max_grad_norm=1.0,           # clip the gradient norm to 1.0
  logging_steps=10,             # log every 10 steps
  save_steps=1000,         # save a checkpoint every 1000 steps
  warmup_steps=0,             # learning-rate warmup steps; 0 means no warmup is used here
  optim="adamw_torch",
  output_dir="train_MRPC",   # where the LoRA adapter is saved
  plot_loss=True,
  ddp_timeout=180000000,
  include_num_input_tokens_seen=True,
  lora_rank=8,
  lora_alpha=16,
  lora_dropout=0,  # LoRA dropout rate
  fp16=True,  # float16 mixed-precision training
  lora_target="all"   # attach LoRA adapters to all linear layers
)

json.dump(args, open("train_gemma.json", "w", encoding="utf-8"), indent=2)
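
Before launching, it can help to estimate how many optimizer steps this configuration implies. The sketch below is an approximation under stated assumptions: a single GPU, 3,668 training examples (the count reported in the training log further down), and a trainer that only counts full gradient-accumulation windows within each epoch. The `total_optimization_steps` helper is a hypothetical name, not a LLaMA Factory function.

```python
import math

def total_optimization_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Estimate the number of optimizer updates for a run."""
    # Mini-batches per epoch; the last one may be smaller than the others
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    # Assumption: batches that do not fill a whole accumulation window are not counted as a step
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * updates_per_epoch)

# per_device_train_batch_size * gradient_accumulation_steps * number of GPUs
effective_batch = 2 * 8 * 1
steps = total_optimization_steps(3668, per_device_batch=2, grad_accum=8, epochs=3.0)
print(effective_batch, steps)
```

Under these assumptions the estimate agrees with the "Total train batch size" and "Total optimization steps" lines printed by the run below, and it also shows that `save_steps=1000` never triggers an intermediate checkpoint during a run this short.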

!llamafactory-cli train train_gemma.json

07/31/2024 17:34:30 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: torch.float16
[INFO|tokenization_utils_base.py:2287] 2024-07-31 17:34:30,213 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2287] 2024-07-31 17:34:30,213 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2287] 2024-07-31 17:34:30,213 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2287] 2024-07-31 17:34:30,213 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2287] 2024-07-31 17:34:30,213 >> loading file tokenizer_config.json
07/31/2024 17:34:31 - INFO - llamafactory.data.loader - Loading dataset MRPC_train_data.json...
input_ids:
[2, 106, 1645, 108, 99211, 1013, 573, 1378, 26099, 708, 147440, 11070, 13107, 954, 16230, 777, 235274, 235303, 1013, 984, 708, 13107, 235269, 777, 235276, 235303, 1013, 984, 708, 780, 235265, 108, 86386, 235248, 235274, 235292, 4181, 514, 3423, 19538, 926, 8462, 1688, 7624, 693, 3151, 664, 573, 13229, 664, 1688, 576, 46971, 1697, 87242, 926, 5820, 954, 108, 86386, 235248, 235284, 235292, 165244, 577, 1357, 685, 1297, 664, 573, 13229, 664, 1688, 4181, 514, 3423, 19538, 926, 8462, 576, 46971, 1697, 87242, 926, 5820, 954, 107, 108, 106, 2516, 108, 235274, 1]
inputs:
<bos><start_of_turn>user
Determine if the two sentences are semantically equivalent . Output '1' if they are equivalent, '0' if they are not.
Sentence 1: Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .
Sentence 2: Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .<end_of_turn>
<start_of_turn>model
1<eos>
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 235274, 1]
labels:
1<eos>
[INFO|configuration_utils.py:731] 2024-07-31 17:34:32,639 >> loading configuration file /root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it/config.json
[INFO|configuration_utils.py:800] 2024-07-31 17:34:32,640 >> Model config GemmaConfig {
  "_name_or_path": "/root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it",
  "architectures": [
    "GemmaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 2,
  "eos_token_id": 1,
  "head_dim": 256,
  "hidden_act": "gelu",
  "hidden_activation": null,
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 16384,
  "max_position_embeddings": 8192,
  "model_type": "gemma",
  "num_attention_heads": 8,
  "num_hidden_layers": 18,
  "num_key_value_heads": 1,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 256000
}

[INFO|modeling_utils.py:3631] 2024-07-31 17:34:32,662 >> loading weights file /root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it/model.safetensors.index.json
[INFO|modeling_utils.py:1572] 2024-07-31 17:34:32,663 >> Instantiating GemmaForCausalLM model under default dtype torch.float16.
[INFO|configuration_utils.py:1038] 2024-07-31 17:34:32,664 >> Generate config GenerationConfig {
  "bos_token_id": 2,
  "eos_token_id": 1,
  "pad_token_id": 0
}

[WARNING|logging.py:328] 2024-07-31 17:34:32,667 >> `config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:02<00:00,  1.15s/it]
[INFO|modeling_utils.py:4463] 2024-07-31 17:34:35,011 >> All model checkpoint weights were used when initializing GemmaForCausalLM.

[INFO|modeling_utils.py:4471] 2024-07-31 17:34:35,011 >> All the weights of GemmaForCausalLM were initialized from the model checkpoint at /root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GemmaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:991] 2024-07-31 17:34:35,013 >> loading configuration file /root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it/generation_config.json
[INFO|configuration_utils.py:1038] 2024-07-31 17:34:35,013 >> Generate config GenerationConfig {
  "bos_token_id": 2,
  "eos_token_id": 1,
  "pad_token_id": 0
}

07/31/2024 17:34:35 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
07/31/2024 17:34:35 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
07/31/2024 17:34:35 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
07/31/2024 17:34:35 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
07/31/2024 17:34:35 - INFO - llamafactory.model.model_utils.misc - Found linear modules: k_proj,gate_proj,o_proj,down_proj,v_proj,up_proj,q_proj
07/31/2024 17:34:36 - INFO - llamafactory.model.loader - trainable params: 9,805,824 || all params: 2,515,978,240 || trainable%: 0.3897
[INFO|trainer.py:648] 2024-07-31 17:34:36,180 >> Using auto half precision backend
[INFO|trainer.py:2134] 2024-07-31 17:34:36,526 >> ***** Running training *****
[INFO|trainer.py:2135] 2024-07-31 17:34:36,526 >>   Num examples = 3,668
[INFO|trainer.py:2136] 2024-07-31 17:34:36,526 >>   Num Epochs = 3
[INFO|trainer.py:2137] 2024-07-31 17:34:36,526 >>   Instantaneous batch size per device = 2
[INFO|trainer.py:2140] 2024-07-31 17:34:36,526 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:2141] 2024-07-31 17:34:36,526 >>   Gradient Accumulation steps = 8
[INFO|trainer.py:2142] 2024-07-31 17:34:36,526 >>   Total optimization steps = 687
[INFO|trainer.py:2143] 2024-07-31 17:34:36,530 >>   Number of trainable parameters = 9,805,824
{'loss': 2.8161, 'grad_norm': 17.23887062072754, 'learning_rate': 4.9978830041808596e-05, 'epoch': 0.04, 'num_input_tokens_seen': 17600}
......
{'loss': 0.0738, 'grad_norm': 0.30729034543037415, 'learning_rate': 1.6727376094963222e-08, 'epoch': 2.97, 'num_input_tokens_seen': 1214928}
100%|█████████████████████████████████████████| 687/687 [13:29<00:00,  1.15s/it][INFO|trainer.py:3503] 2024-07-31 17:48:05,664 >> Saving model checkpoint to train_MRPC/checkpoint-687
[INFO|configuration_utils.py:731] 2024-07-31 17:48:05,684 >> loading configuration file /root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it/config.json
[INFO|configuration_utils.py:800] 2024-07-31 17:48:05,685 >> Model config GemmaConfig {
  "architectures": [
    "GemmaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 2,
  "eos_token_id": 1,
  "head_dim": 256,
  "hidden_act": "gelu",
  "hidden_activation": null,
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 16384,
  "max_position_embeddings": 8192,
  "model_type": "gemma",
  "num_attention_heads": 8,
  "num_hidden_layers": 18,
  "num_key_value_heads": 1,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 256000
}

[INFO|tokenization_utils_base.py:2702] 2024-07-31 17:48:05,739 >> tokenizer config file saved in train_MRPC/checkpoint-687/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-07-31 17:48:05,739 >> Special tokens file saved in train_MRPC/checkpoint-687/special_tokens_map.json
[INFO|trainer.py:2394] 2024-07-31 17:48:06,301 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)


{'train_runtime': 809.7713, 'train_samples_per_second': 13.589, 'train_steps_per_second': 0.848, 'train_loss': 0.19551172700684838, 'epoch': 3.0, 'num_input_tokens_seen': 1227456}
100%|█████████████████████████████████████████| 687/687 [13:29<00:00,  1.18s/it]
[INFO|trainer.py:3503] 2024-07-31 17:48:06,304 >> Saving model checkpoint to train_MRPC
[INFO|configuration_utils.py:731] 2024-07-31 17:48:06,324 >> loading configuration file /root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it/config.json
[INFO|configuration_utils.py:800] 2024-07-31 17:48:06,325 >> Model config GemmaConfig {
  "architectures": [
    "GemmaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 2,
  "eos_token_id": 1,
  "head_dim": 256,
  "hidden_act": "gelu",
  "hidden_activation": null,
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 16384,
  "max_position_embeddings": 8192,
  "model_type": "gemma",
  "num_attention_heads": 8,
  "num_hidden_layers": 18,
  "num_key_value_heads": 1,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 256000
}

[INFO|tokenization_utils_base.py:2702] 2024-07-31 17:48:06,369 >> tokenizer config file saved in train_MRPC/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-07-31 17:48:06,369 >> Special tokens file saved in train_MRPC/special_tokens_map.json
***** train metrics *****
  epoch                    =     2.9967
  num_input_tokens_seen    =    1227456
  total_flos               = 13660893GF
  train_loss               =     0.1955
  train_runtime            = 0:13:29.77
  train_samples_per_second =     13.589
  train_steps_per_second   =      0.848
Figure saved at: train_MRPC/training_loss.png
07/31/2024 17:48:06 - WARNING - llamafactory.extras.ploting - No metric eval_loss to plot.
07/31/2024 17:48:06 - WARNING - llamafactory.extras.ploting - No metric eval_accuracy to plot.
[INFO|modelcard.py:449] 2024-07-31 17:48:06,913 >> Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
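
The trainable-parameter count in the log can be checked by hand from the printed GemmaConfig. The sketch below assumes the standard LoRA parameterization, in which each adapted linear layer with shape (in, out) gains two matrices of total size rank × (in + out), and it assumes the seven projection modules listed under "Found linear modules" have the usual Gemma shapes. The variable names are ours.

```python
# Values copied from the GemmaConfig printed in the log above
hidden_size = 2048
intermediate_size = 16384
num_layers = 18
num_heads = 8
num_kv_heads = 1
head_dim = 256
lora_rank = 8

# (in_features, out_features) of each linear module that receives an adapter
linear_shapes = {
    "q_proj": (hidden_size, num_heads * head_dim),
    "k_proj": (hidden_size, num_kv_heads * head_dim),
    "v_proj": (hidden_size, num_kv_heads * head_dim),
    "o_proj": (num_heads * head_dim, hidden_size),
    "gate_proj": (hidden_size, intermediate_size),
    "up_proj": (hidden_size, intermediate_size),
    "down_proj": (intermediate_size, hidden_size),
}

def lora_params(in_features, out_features, rank):
    # LoRA adds A (rank x in) and B (out x rank) next to each adapted weight
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, lora_rank) for i, o in linear_shapes.values())
trainable = per_layer * num_layers
print(per_layer, trainable)
print(f"{100 * trainable / 2515978240:.4f}%")
```

The result matches the 9,805,824 trainable parameters reported by the loader, and subtracting it from the 2,515,978,240 total gives the 2,506,172,416 parameters shown in the evaluation log once the adapter has been merged.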

5. View the training results

from IPython.display import Image

# Path to the image
image_path = 'train_MRPC/training_loss.png'

# Display the image
Image(filename=image_path)

[Figure: training loss curve, train_MRPC/training_loss.png]

6. Evaluation

from llamafactory.chat import ChatModel
from llamafactory.extras.misc import torch_gc
from tqdm import tqdm
import json
import pandas as pd

# Load the evaluation set from its JSON file
with open('data/MRPC_test_data.json', 'r') as f:
    evaluation_data = json.load(f)

args = dict(
    model_name_or_path=model_name_or_path,
    adapter_name_or_path="train_MRPC",  # load the LoRA adapter saved earlier
    template="gemma",  # same as in training
    finetuning_type="lora"  # same as in training
)
chat_model = ChatModel(args)

results = []

# Use tqdm to show a progress bar
for sample in tqdm(evaluation_data, desc="Evaluating"):
    instruction = sample['instruction']
    input_text = sample['input']
    expected_output = sample['output']

    messages = [
        {"role": "user", "content": f"{instruction}\n{input_text}"}
    ]

    response = ""
    for new_text in chat_model.stream_chat(messages):
        response += new_text

    # Record the model output next to the expected output
    results.append({
        "instruction": instruction,
        "input": input_text,
        "expected_output": expected_output,
        "model_output": response.strip()
    })

# Collect the results in a DataFrame
df = pd.DataFrame(results)
# Save the DataFrame to a CSV file
df.to_csv('evaluation_results.csv', index=False)

torch_gc()

[INFO|tokenization_utils_base.py:2287] 2024-07-31 22:19:25,520 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2287] 2024-07-31 22:19:25,520 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2287] 2024-07-31 22:19:25,520 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2287] 2024-07-31 22:19:25,521 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2287] 2024-07-31 22:19:25,521 >> loading file tokenizer_config.json
[INFO|configuration_utils.py:731] 2024-07-31 22:19:26,175 >> loading configuration file /root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it/config.json
[INFO|configuration_utils.py:800] 2024-07-31 22:19:26,176 >> Model config GemmaConfig {
  "_name_or_path": "/root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it",
  "architectures": [
    "GemmaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 2,
  "eos_token_id": 1,
  "head_dim": 256,
  "hidden_act": "gelu",
  "hidden_activation": null,
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 16384,
  "max_position_embeddings": 8192,
  "model_type": "gemma",
  "num_attention_heads": 8,
  "num_hidden_layers": 18,
  "num_key_value_heads": 1,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 256000
}

......
07/31/2024 22:19:27 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
07/31/2024 22:19:28 - INFO - llamafactory.model.adapter - Merged 1 adapter(s).
07/31/2024 22:19:28 - INFO - llamafactory.model.adapter - Loaded adapter(s): train_MRPC
07/31/2024 22:19:28 - INFO - llamafactory.model.loader - all params: 2,506,172,416


Evaluating: 100%|██████████| 388/388 [00:16<00:00, 23.67it/s]
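
One caveat before scoring: the scoring code in the next section maps every output other than the exact string '1' to the negative class, so a malformed generation would silently count as '0'. The following is a minimal sketch of a stricter parser that reports unparseable outputs separately. The `parse_label` helper and the sample outputs are hypothetical and are not taken from the actual run.

```python
import re

def parse_label(text):
    """Return 0 or 1 if the output is a single such digit (ignoring whitespace and quotes), else None."""
    cleaned = text.strip().strip("'\"")
    return int(cleaned) if re.fullmatch(r"[01]", cleaned) else None

# Made-up model outputs covering clean, padded, quoted, and invalid cases
outputs = ["1", " 0\n", "'1'", "equivalent", ""]
parsed = [parse_label(o) for o in outputs]
invalid = sum(p is None for p in parsed)
print(parsed, invalid)
```

If the invalid count is zero on the real results, the simpler mapping used below gives the same scores.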

7. Simple scoring on the evaluation set

!pip install scikit-learn

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Requirement already satisfied: scikit-learn in /root/miniconda3/lib/python3.10/site-packages (1.5.1)
Requirement already satisfied: threadpoolctl>=3.1.0 in /root/miniconda3/lib/python3.10/site-packages (from scikit-learn) (3.5.0)
Requirement already satisfied: scipy>=1.6.0 in /root/miniconda3/lib/python3.10/site-packages (from scikit-learn) (1.14.0)
Requirement already satisfied: numpy>=1.19.5 in /root/miniconda3/lib/python3.10/site-packages (from scikit-learn) (1.26.3)
Requirement already satisfied: joblib>=1.2.0 in /root/miniconda3/lib/python3.10/site-packages (from scikit-learn) (1.4.2)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Read the CSV file
df = pd.read_csv('evaluation_results.csv')

# Make sure the model_output and expected_output columns are strings
df['model_output'] = df['model_output'].astype(str)
df['expected_output'] = df['expected_output'].astype(str)

# Convert model outputs and expected outputs to binary labels
df['model_output_binary'] = df['model_output'].apply(lambda x: 1 if x.strip() == '1' else 0)
df['expected_output_binary'] = df['expected_output'].apply(lambda x: 1 if x.strip() == '1' else 0)

# Compute the evaluation metrics
accuracy = accuracy_score(df['expected_output_binary'], df['model_output_binary'])
precision = precision_score(df['expected_output_binary'], df['model_output_binary'])
recall = recall_score(df['expected_output_binary'], df['model_output_binary'])
f1 = f1_score(df['expected_output_binary'], df['model_output_binary'])

# Print the evaluation metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

Accuracy: 0.8686
Precision: 0.8873
Recall: 0.9242
F1 Score: 0.9054

Accuracy, precision, recall, and F1 score are standard metrics for evaluating classification models. These metrics are explained in detail below:
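
As a compact reference, the four metrics can be written directly in terms of confusion-matrix counts. The sketch below is plain Python without scikit-learn. The counts passed in (TP=244, FP=31, FN=20, TN=93) were not reported by the run; they are inferred here from the rounded scores above and the 388 evaluated samples, so treat them as an assumption.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class of a binary task."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all predictions that are correct
    precision = tp / (tp + fp)                   # share of predicted positives that are correct
    recall = tp / (tp + fn)                      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
    return accuracy, precision, recall, f1

# Inferred counts, not taken from the original run's output
acc, prec, rec, f1 = binary_metrics(tp=244, fp=31, fn=20, tn=93)
print(f"{acc:.4f} {prec:.4f} {rec:.4f} {f1:.4f}")
```

With these counts the formulas reproduce the four printed scores. Recall is higher than precision here, meaning the model leans toward predicting '1'.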