Fine-tuning an LLM to perform traditional NLP tasks

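That said, the problems described above were greatly eased once Google released the Gemma 2B model. Compared with models of 7B or 13B parameters, a model of around 2B parameters keeps both training cost and inference latency well under control. This experiment therefore takes Gemma 2B as the base model and explores how well it performs when it is fine-tuned the way BERT is.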

1. Install LLaMA Factory

Wed Jul 31 14:49:16 2024       
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:0E:00.0 Off |                  N/A |
| 38%   35C    P8             16W /  350W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|  No running processes found                                                             |
!git clone https://gitclone.com/github.com/hiyouga/LLaMA-Factory.git
Cloning into 'LLaMA-Factory'...
remote: Counting objects: 15692, done.
remote: Compressing objects: 100% (4034/4034), done.
remote: Total 15692 (delta 11655), reused 15452 (delta 11497)
Receiving objects: 100% (15692/15692), 221.51 MiB | 1022.00 KiB/s, done.
Resolving deltas: 100% (11655/11655), done.


%cd LLaMA-Factory
!pip install -e .[torch,bitsandbytes]
CITATION.cff  README.md     docker/         requirements.txt  tests/
LICENSE       README_zh.md  evaluation/     scripts/          train_data/
MANIFEST.in   assets/       examples/       setup.py
Makefile      data/         pyproject.toml  src/
Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Obtaining file:///root/LLaMA-Factory
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done

Installing collected packages: llamafactory
  Attempting uninstall: llamafactory
    Found existing installation: llamafactory 0.8.3.dev0
    Uninstalling llamafactory-0.8.3.dev0:
      Successfully uninstalled llamafactory-0.8.3.dev0
Successfully installed llamafactory-0.8.3.dev0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv



!pip install modelscope
Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Collecting modelscope
  Downloading http://mirrors.aliyun.com/pypi/packages/38/37/9fe505ebc67ba5e0345a69d6e8b2ee8630523975b484d221691
Installing collected packages: modelscope
Successfully installed modelscope-1.16.1
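from modelscope.hub.snapshot_download import snapshot_download
# The call's arguments were cut off in the original; these are inferred from the model path in the training log.
model_dir = snapshot_download('AI-ModelScope/gemma-2b-it', cache_dir='/root/models/gemma-2b-it')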
Downloading: 100%|██████████| 627/627 [00:00<00:00, 1.40kB/s]
Downloading: 100%|██████████| 38.0/38.0 [00:00<00:00, 132B/s]
Downloading: 100%|██████████| 9.34G/9.34G [09:20<00:00, 17.9MB/s]  
Downloading: 100%|██████████| 137/137 [00:00<00:00, 494B/s]
Downloading: 100%|██████████| 4.61G/4.61G [04:35<00:00, 18.0MB/s]
Downloading: 100%|██████████| 64.0M/64.0M [00:02<00:00, 31.4MB/s]
Downloading: 100%|██████████| 13.2k/13.2k [00:02<00:00, 5.88kB/s]
Downloading: 100%|██████████| 23.1k/23.1k [00:00<00:00, 26.7kB/s]
Downloading: 100%|██████████| 636/636 [00:00<00:00, 1.99kB/s]
Downloading: 100%|██████████| 16.7M/16.7M [00:01<00:00, 17.0MB/s]
Downloading: 100%|██████████| 4.04M/4.04M [00:00<00:00, 8.94MB/s]
Downloading: 100%|██████████| 33.4k/33.4k [00:00<00:00, 72.6kB/s]


We need to convert the source data, which comes in several different formats, into training data in the Alpaca format.
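As a quick illustration of the target format, here is a minimal sketch of one Alpaca-style record: a dict with "instruction", "input" and "output" fields, stored in a JSON list. The helper name `make_alpaca_record` and the sample sentences are made up for this example and are not part of LLaMA Factory.

```python
import json

# Hypothetical helper (not part of LLaMA Factory): builds one Alpaca-style record.
def make_alpaca_record(instruction: str, input_text: str, output: str) -> dict:
    return {"instruction": instruction, "input": input_text, "output": output}

# Made-up sentence pair, only to show the shape of a record.
record = make_alpaca_record(
    "Determine if the two sentences are semantically equivalent.",
    "Sentence 1: A cat sat .\nSentence 2: A cat was sitting .",
    "1",
)

# An Alpaca-format dataset file is simply a JSON list of such records.
dataset = [record]
print(json.dumps(dataset, indent=2))
```

The conversions in the following sections all produce records of this shape; only the source column names differ.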


!pip install pandas
Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Requirement already satisfied: pandas in ./miniconda3/lib/python3.10/site-packages (2.2.2)
Requirement already satisfied: python-dateutil>=2.8.2 in ./miniconda3/lib/python3.10/site-packages (from pandas) (2.8.2)
Requirement already satisfied: tzdata>=2022.7 in ./miniconda3/lib/python3.10/site-packages (from pandas) (2024.1)
Requirement already satisfied: numpy>=1.22.4 in ./miniconda3/lib/python3.10/site-packages (from pandas) (1.26.3)
Requirement already satisfied: pytz>=2020.1 in ./miniconda3/lib/python3.10/site-packages (from pandas) (2024.1)
Requirement already satisfied: six>=1.5 in ./miniconda3/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

3.2 MRPC


import pandas as pd
import json

Create the instruction, which asks whether Sentence 1 and Sentence 2 are semantically equivalent.

instruction="Determine if the two sentences are semantically equivalent . Output '1' if they are equivalent, '0' if they are not."

3.2.1 Converting from CSV format

df = pd.read_csv('./LLaMA-Factory/train_data/MRPC_tsv/train.tsv',sep='\t',on_bad_lines='skip')
      Quality    #1 ID    #2 ID  #1 String                                           #2 String
3544        1  1620264  1620507  At this point , Mr. Brando announced : ' Some...   Brando said that " somebody ought to put a bul...
3545        0  1848001  1848224  Martin , 58 , will be freed today after servin...  Martin served two thirds of a five-year senten...
3546        1   747160   747144  We have concluded that the outlook for price ...   In a statement , the ECB said the outlook for ...
3547        1  2539933  2539850  The notification was first reported Friday by ...  MSNBC.com first reported the CIA request on Fr...
3548        0   453575   453448  The 30-year bond US30YT = RR rose 22 / 32 for ...  The 30-year bond US30YT = RR grew 1-3 / 32 for...
# Create an empty list to hold the converted records
alpaca_data = []
for index, row in df.iterrows():
    # Build a dict for the current row
    data_point = {
        "instruction": instruction,
        "input": f"Sentence 1: {row['#1 String']}\nSentence 2: {row['#2 String']}",
        "output": str(row['Quality'])
    }
    # Append the dict to the list
    alpaca_data.append(data_point)

# Serialize the list to a JSON string
alpaca_json = json.dumps(alpaca_data, indent=4)
{'instruction': "Determine if the two sentences are semantically equivalent . Output '1' if they are equivalent, '0' if they are not.",
 'input': "Sentence 1: The 30-year bond US30YT = RR rose 22 / 32 for a yield of 4.31 percent , versus 4.35 percent at Wednesday 's close .\nSentence 2: The 30-year bond US30YT = RR grew 1-3 / 32 for a yield of 4.30 percent , down from 4.35 percent late Wednesday .",
 'output': '0'}
with open('LLaMA-Factory/data/MRPC_train_data.json', 'w') as f:
    f.write(alpaca_json)
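One thing to watch in this section: `on_bad_lines='skip'` silently drops rows that cannot be parsed. The TSV preview above ends at index 3548, while the Parquet copy of the same split ends at index 3667, so some rows appear to be lost or merged on the TSV path. A plausible cause, stated here as an assumption rather than something verified on the actual file, is that the sentences contain unbalanced double quotes, which a CSV parser treats as the start of a quoted field. The sketch below uses the standard-library `csv` module and a made-up three-line TSV to show the effect of `quoting=csv.QUOTE_NONE`; `pd.read_csv` accepts the same `quoting` argument.

```python
import csv
import io

# Made-up MRPC-style TSV; the first data row contains an unbalanced double quote.
tsv_text = (
    'Quality\t#1 ID\t#2 ID\t#1 String\t#2 String\n'
    '1\t1\t2\t" He said hello .\tHe greeted them .\n'
    '0\t3\t4\tPrices rose .\tPrices fell .\n'
)

def read_rows(text, **kwargs):
    """Parse tab-separated text into a list of rows (header included)."""
    return list(csv.reader(io.StringIO(text), delimiter='\t', **kwargs))

# Default quoting: the lone quote opens a quoted field that can swallow later lines.
default_rows = read_rows(tsv_text)
# QUOTE_NONE: the quote character is ordinary text, one physical line gives one row.
literal_rows = read_rows(tsv_text, quoting=csv.QUOTE_NONE)

print(len(default_rows), len(literal_rows))
print(literal_rows[1][3])
```

If the row count of the DataFrame matters for a comparison with published MRPC results, it is worth checking `len(df)` against the expected 3,668 training examples after loading.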

3.2.2 Converting from Parquet format

df = pd.read_parquet('./LLaMA-Factory/train_data/MRPC_hf/train-00000-of-00001.parquet')
      sentence1                                          sentence2                                          label   idx
3663  " At this point , Mr. Brando announced : ' Som...  Brando said that " somebody ought to put a bul...      1  4071
3664  Martin , 58 , will be freed today after servin...  Martin served two thirds of a five-year senten...      0  4072
3665  " We have concluded that the outlook for price...  In a statement , the ECB said the outlook for ...      1  4073
3666  The notification was first reported Friday by ...  MSNBC.com first reported the CIA request on Fr...      1  4074
3667  The 30-year bond US30YT = RR rose 22 / 32 for ...  The 30-year bond US30YT = RR grew 1-3 / 32 for...      0  4075
# Create an empty list to hold the converted records
alpaca_data = []
for index, row in df.iterrows():
    # Build a dict for the current row
    data_point = {
        "instruction": instruction,
        "input": f"Sentence 1: {row['sentence1']}\nSentence 2: {row['sentence2']}",
        "output": str(row['label'])
    }
    # Append the dict to the list
    alpaca_data.append(data_point)

# Serialize the list to a JSON string
alpaca_json = json.dumps(alpaca_data, indent=4)
{'instruction': "Determine if the two sentences are semantically equivalent . Output '1' if they are equivalent, '0' if they are not.",
 'input': "Sentence 1: The 30-year bond US30YT = RR rose 22 / 32 for a yield of 4.31 percent , versus 4.35 percent at Wednesday 's close .\nSentence 2: The 30-year bond US30YT = RR grew 1-3 / 32 for a yield of 4.30 percent , down from 4.35 percent late Wednesday .",
 'output': '0'}
with open('LLaMA-Factory/data/MRPC_train_data.json', 'w') as f:
    f.write(alpaca_json)

3.2.3 Converting from JSONL format

# Create an empty list to hold the converted records
alpaca_data = []
# Open and read the JSONL file
with open('./LLaMA-Factory/train_data/MRPC_hf/train.jsonl', 'r') as f:
    for line in f:
        # Parse one JSON object per line
        data_point = json.loads(line)
        # Build a dict for the converted record
        alpaca_point = {
            "instruction": instruction,
            "input": f"Sentence 1: {data_point['text1']}\nSentence 2: {data_point['text2']}",
            "output": str(data_point['label'])
        }
        # Append the dict to the list
        alpaca_data.append(alpaca_point)
{'instruction': "Determine if the two sentences are semantically equivalent . Output '1' if they are equivalent, '0' if they are not.",
 'input': "Sentence 1: The 30-year bond US30YT = RR rose 22 / 32 for a yield of 4.31 percent , versus 4.35 percent at Wednesday 's close .\nSentence 2: The 30-year bond US30YT = RR grew 1-3 / 32 for a yield of 4.30 percent , down from 4.35 percent late Wednesday .",
 'output': '0'}
with open('LLaMA-Factory/data/MRPC_train_data.json', 'w') as f:
    json.dump(alpaca_data, f, indent=4)
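The three conversions above differ only in the names of the two sentence fields and the label field. As a sketch of how they could share one code path, the hypothetical helper `rows_to_alpaca` below (our own name, not a LLaMA Factory or pandas function) takes any iterable of dict-like rows plus the three key names. The sample rows are made up.

```python
INSTRUCTION = (
    "Determine if the two sentences are semantically equivalent . "
    "Output '1' if they are equivalent, '0' if they are not."
)

def rows_to_alpaca(rows, s1_key, s2_key, label_key, instruction=INSTRUCTION):
    """Hypothetical helper: map dict-like rows to Alpaca-style records."""
    return [
        {
            "instruction": instruction,
            "input": f"Sentence 1: {row[s1_key]}\nSentence 2: {row[s2_key]}",
            "output": str(row[label_key]),
        }
        for row in rows
    ]

# Made-up rows using the TSV and JSONL column names from this section.
tsv_rows = [{"#1 String": "A .", "#2 String": "B .", "Quality": 1}]
jsonl_rows = [{"text1": "A .", "text2": "B .", "label": 1}]

records_a = rows_to_alpaca(tsv_rows, "#1 String", "#2 String", "Quality")
records_b = rows_to_alpaca(jsonl_rows, "text1", "text2", "label")
print(records_a == records_b)
```

With pandas, `df.to_dict("records")` yields rows in the shape this helper expects, so the same function would cover the TSV and Parquet cases as well.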

3.3 Editing dataset_info.json

Register the generated file by adding an entry for it to LLaMA-Factory/data/dataset_info.json:

  "MRPC_train_data": {
    "file_name": "MRPC_train_data.json"
  },
  "identity": {
    "file_name": "identity.json"
  },
  "alpaca_en_demo": {
    "file_name": "alpaca_en_demo.json"
  },

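The entry can also be added from Python instead of by hand. The sketch below is one way to do it under our own assumptions: `register_dataset` is a hypothetical helper, and the demo runs against a throwaway file in a temporary directory rather than the real dataset_info.json.

```python
import json
import os
import tempfile

def register_dataset(info_path: str, name: str, file_name: str) -> dict:
    """Hypothetical helper: add {name: {"file_name": file_name}} to a dataset_info.json file."""
    with open(info_path, "r", encoding="utf-8") as f:
        info = json.load(f)
    info[name] = {"file_name": file_name}
    with open(info_path, "w", encoding="utf-8") as f:
        json.dump(info, f, indent=2, ensure_ascii=False)
    return info

# Demonstrate on a temporary copy so the real file is untouched.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "dataset_info.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump({"identity": {"file_name": "identity.json"}}, f)
    updated = register_dataset(demo_path, "MRPC_train_data", "MRPC_train_data.json")
    print(sorted(updated))
```

For the real project the call would point at LLaMA-Factory/data/dataset_info.json; existing entries are preserved because the file is read before it is rewritten.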
3.4 Preparing the test set

df = pd.read_csv('./LLaMA-Factory/train_data/MRPC_tsv/dev.tsv',sep='\t',on_bad_lines='skip')
     Quality    #1 ID    #2 ID  #1 String                                           #2 String
383        0  2977500  2977547  Their contract will expire at 12 : 01 a.m. Wed...  It has outraged the membership , said Rian W...
384        1  3107137  3107119  But plaque volume increased by 2.7 percent in ...  The volume of plaque in Pravachol patients ' a...
385        1  1619244  1619274  Today in the US , the book - kept under wraps ...  Tomorrow the book , kept under wraps by G. P. ...
386        0  3061836  3062031  The S & P / TSX composite rose 87.74 points on...  On the week , the Dow Jones industrial average...
387        1   485999   486011  Ex-KGB agent Putin added that the Beatles were...  In Soviet times the Beatles ' music " was cons...
# Create an empty list to hold the converted records
alpaca_data = []
for index, row in df.iterrows():
    # Build a dict for the current row
    data_point = {
        "instruction": instruction,
        "input": f"Sentence 1: {row['#1 String']}\nSentence 2: {row['#2 String']}",
        "output": str(row['Quality'])
    }
    # Append the dict to the list
    alpaca_data.append(data_point)

# Serialize the list to a JSON string
alpaca_json = json.dumps(alpaca_data, indent=4)
with open('LLaMA-Factory/data/MRPC_test_data.json', 'w') as f:
    f.write(alpaca_json)



%cd LLaMA-Factory/
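# Use the local path of gemma-2b-it (see the model download step in 《1训练环境准备》).
# The directory that ModelScope downloads into is a little unusual, so verify it by hand.
import json

args = dict(
  stage="sft",  # supervised instruction fine-tuning
  do_train=True,  # run training (missing from the extracted cell; restored so the config runs)
  model_name_or_path="/root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it",  # local model path, as shown in the training log
  finetuning_type="lora",   # use LoRA adapters to save GPU memory
  template="gemma",   # use the gemma prompt template
  dataset="MRPC_train_data",      # use the MRPC_train_data dataset (see 《2数据准备.ipynb》, section 2.2)
  per_device_train_batch_size=2,  # batch size per device (value taken from the training log)
  gradient_accumulation_steps=8,  # gradient accumulation steps (value taken from the training log)
  learning_rate=5e-05,    # learning rate
  num_train_epochs=3.0,    # number of training epochs
  max_samples=100000,      # maximum number of samples to use from each dataset
  lr_scheduler_type="cosine",    # cosine learning-rate annealing
  max_grad_norm=1.0,           # clip the gradient norm to 1.0
  logging_steps=10,             # log every 10 steps
  save_steps=1000,         # save a checkpoint every 1000 steps
  warmup_steps=0,             # learning-rate warmup steps; 0 means no warmup is used here
  output_dir="train_MRPC",   # where the LoRA adapter is saved
  lora_dropout=0,  # LoRA dropout rate
  fp16=True,  # float16 mixed-precision training
  lora_target="all"   # attach LoRA adapters to all linear layers
)

with open("train_gemma.json", "w", encoding="utf-8") as f:
    json.dump(args, f, indent=2)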
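To get a feel for what this configuration implies before launching the run, the sketch below works out the number of optimizer steps and the cosine learning-rate curve. The dataset size (3,668), per-device batch size (2) and gradient accumulation (8) are read off the training log that follows; the step formula mirrors how the Hugging Face Trainer counts steps as far as we understand it, so treat it as an approximation rather than the trainer's own code.

```python
import math

# Values from the configuration and the training log (copied by hand, not queried from the trainer).
num_examples = 3668
per_device_batch = 2
grad_accum = 8
num_epochs = 3
base_lr = 5e-5
warmup_steps = 0

# Batches per epoch, then whole optimizer steps per epoch (leftover batches are dropped).
batches_per_epoch = math.ceil(num_examples / per_device_batch)
steps_per_epoch = batches_per_epoch // grad_accum
total_steps = steps_per_epoch * num_epochs

def cosine_lr(step: int) -> float:
    """Cosine decay from base_lr to 0 over total_steps, after optional linear warmup."""
    if warmup_steps > step:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(total_steps)
print(cosine_lr(9), cosine_lr(679))
```

The step count agrees with the "Total optimization steps = 687" line in the log. The two learning rates shown in the log (about 4.9979e-05 early on and 1.6727e-08 near the end) match this schedule evaluated at steps 9 and 679; that one-step offset is an observation from the numbers, not something the log states.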

!llamafactory-cli train train_gemma.json
07/31/2024 17:34:30 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: torch.float16
[INFO|tokenization_utils_base.py:2287] 2024-07-31 17:34:30,213 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2287] 2024-07-31 17:34:30,213 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2287] 2024-07-31 17:34:30,213 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2287] 2024-07-31 17:34:30,213 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2287] 2024-07-31 17:34:30,213 >> loading file tokenizer_config.json
07/31/2024 17:34:31 - INFO - llamafactory.data.loader - Loading dataset MRPC_train_data.json...
[2, 106, 1645, 108, 99211, 1013, 573, 1378, 26099, 708, 147440, 11070, 13107, 954, 16230, 777, 235274, 235303, 1013, 984, 708, 13107, 235269, 777, 235276, 235303, 1013, 984, 708, 780, 235265, 108, 86386, 235248, 235274, 235292, 4181, 514, 3423, 19538, 926, 8462, 1688, 7624, 693, 3151, 664, 573, 13229, 664, 1688, 576, 46971, 1697, 87242, 926, 5820, 954, 108, 86386, 235248, 235284, 235292, 165244, 577, 1357, 685, 1297, 664, 573, 13229, 664, 1688, 4181, 514, 3423, 19538, 926, 8462, 576, 46971, 1697, 87242, 926, 5820, 954, 107, 108, 106, 2516, 108, 235274, 1]
Determine if the two sentences are semantically equivalent . Output '1' if they are equivalent, '0' if they are not.
Sentence 1: Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .
Sentence 2: Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .<end_of_turn>
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 235274, 1]
[INFO|configuration_utils.py:731] 2024-07-31 17:34:32,639 >> loading configuration file /root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it/config.json
[INFO|configuration_utils.py:800] 2024-07-31 17:34:32,640 >> Model config GemmaConfig {
  "_name_or_path": "/root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it",
  "architectures": [
    "GemmaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 2,
  "eos_token_id": 1,
  "head_dim": 256,
  "hidden_act": "gelu",
  "hidden_activation": null,
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 16384,
  "max_position_embeddings": 8192,
  "model_type": "gemma",
  "num_attention_heads": 8,
  "num_hidden_layers": 18,
  "num_key_value_heads": 1,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 256000
}

[INFO|modeling_utils.py:3631] 2024-07-31 17:34:32,662 >> loading weights file /root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it/model.safetensors.index.json
[INFO|modeling_utils.py:1572] 2024-07-31 17:34:32,663 >> Instantiating GemmaForCausalLM model under default dtype torch.float16.
[INFO|configuration_utils.py:1038] 2024-07-31 17:34:32,664 >> Generate config GenerationConfig {
  "bos_token_id": 2,
  "eos_token_id": 1,
  "pad_token_id": 0
}

[WARNING|logging.py:328] 2024-07-31 17:34:32,667 >> `config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:02<00:00,  1.15s/it]
[INFO|modeling_utils.py:4463] 2024-07-31 17:34:35,011 >> All model checkpoint weights were used when initializing GemmaForCausalLM.

[INFO|modeling_utils.py:4471] 2024-07-31 17:34:35,011 >> All the weights of GemmaForCausalLM were initialized from the model checkpoint at /root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GemmaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:991] 2024-07-31 17:34:35,013 >> loading configuration file /root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it/generation_config.json
[INFO|configuration_utils.py:1038] 2024-07-31 17:34:35,013 >> Generate config GenerationConfig {
  "bos_token_id": 2,
  "eos_token_id": 1,
  "pad_token_id": 0
}

07/31/2024 17:34:35 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
07/31/2024 17:34:35 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
07/31/2024 17:34:35 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
07/31/2024 17:34:35 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
07/31/2024 17:34:35 - INFO - llamafactory.model.model_utils.misc - Found linear modules: k_proj,gate_proj,o_proj,down_proj,v_proj,up_proj,q_proj
07/31/2024 17:34:36 - INFO - llamafactory.model.loader - trainable params: 9,805,824 || all params: 2,515,978,240 || trainable%: 0.3897
[INFO|trainer.py:648] 2024-07-31 17:34:36,180 >> Using auto half precision backend
[INFO|trainer.py:2134] 2024-07-31 17:34:36,526 >> ***** Running training *****
[INFO|trainer.py:2135] 2024-07-31 17:34:36,526 >>   Num examples = 3,668
[INFO|trainer.py:2136] 2024-07-31 17:34:36,526 >>   Num Epochs = 3
[INFO|trainer.py:2137] 2024-07-31 17:34:36,526 >>   Instantaneous batch size per device = 2
[INFO|trainer.py:2140] 2024-07-31 17:34:36,526 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:2141] 2024-07-31 17:34:36,526 >>   Gradient Accumulation steps = 8
[INFO|trainer.py:2142] 2024-07-31 17:34:36,526 >>   Total optimization steps = 687
[INFO|trainer.py:2143] 2024-07-31 17:34:36,530 >>   Number of trainable parameters = 9,805,824
{'loss': 2.8161, 'grad_norm': 17.23887062072754, 'learning_rate': 4.9978830041808596e-05, 'epoch': 0.04, 'num_input_tokens_seen': 17600}
{'loss': 0.0738, 'grad_norm': 0.30729034543037415, 'learning_rate': 1.6727376094963222e-08, 'epoch': 2.97, 'num_input_tokens_seen': 1214928}
100%|█████████████████████████████████████████| 687/687 [13:29<00:00,  1.15s/it]
[INFO|trainer.py:3503] 2024-07-31 17:48:05,664 >> Saving model checkpoint to train_MRPC/checkpoint-687
[INFO|tokenization_utils_base.py:2702] 2024-07-31 17:48:05,739 >> tokenizer config file saved in train_MRPC/checkpoint-687/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-07-31 17:48:05,739 >> Special tokens file saved in train_MRPC/checkpoint-687/special_tokens_map.json
[INFO|trainer.py:2394] 2024-07-31 17:48:06,301 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 809.7713, 'train_samples_per_second': 13.589, 'train_steps_per_second': 0.848, 'train_loss': 0.19551172700684838, 'epoch': 3.0, 'num_input_tokens_seen': 1227456}
100%|█████████████████████████████████████████| 687/687 [13:29<00:00,  1.18s/it]
[INFO|trainer.py:3503] 2024-07-31 17:48:06,304 >> Saving model checkpoint to train_MRPC
[INFO|tokenization_utils_base.py:2702] 2024-07-31 17:48:06,369 >> tokenizer config file saved in train_MRPC/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-07-31 17:48:06,369 >> Special tokens file saved in train_MRPC/special_tokens_map.json
***** train metrics *****
  epoch                    =     2.9967
  num_input_tokens_seen    =    1227456
  total_flos               = 13660893GF
  train_loss               =     0.1955
  train_runtime            = 0:13:29.77
  train_samples_per_second =     13.589
  train_steps_per_second   =      0.848
Figure saved at: train_MRPC/training_loss.png
07/31/2024 17:48:06 - WARNING - llamafactory.extras.ploting - No metric eval_loss to plot.
07/31/2024 17:48:06 - WARNING - llamafactory.extras.ploting - No metric eval_accuracy to plot.
[INFO|modelcard.py:449] 2024-07-31 17:48:06,913 >> Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
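The "trainable params: 9,805,824" line in the log can be reproduced by hand, which is a useful check that LoRA was attached where we expect. The sketch below uses the layer shapes from the GemmaConfig printed in the log and the seven modules listed under "Found linear modules". The rank of 8 is an assumption: the configuration does not set a rank, and 8 is, to our knowledge, LLaMA Factory's default.

```python
# Shapes taken from the GemmaConfig in the log; lora_rank = 8 is an assumed default.
hidden_size = 2048
intermediate_size = 16384
num_heads = 8
num_kv_heads = 1
head_dim = 256
num_layers = 18
lora_rank = 8

# (in_features, out_features) for each linear module that receives an adapter.
linear_shapes = {
    "q_proj": (hidden_size, num_heads * head_dim),
    "k_proj": (hidden_size, num_kv_heads * head_dim),
    "v_proj": (hidden_size, num_kv_heads * head_dim),
    "o_proj": (num_heads * head_dim, hidden_size),
    "gate_proj": (hidden_size, intermediate_size),
    "up_proj": (hidden_size, intermediate_size),
    "down_proj": (intermediate_size, hidden_size),
}

# LoRA adds A (rank x in) and B (out x rank) per module: rank * (in + out) parameters.
per_layer = sum(lora_rank * (fan_in + fan_out) for fan_in, fan_out in linear_shapes.values())
trainable = per_layer * num_layers
all_params = 2_515_978_240  # "all params" reported in the training log

print(trainable)
print(round(100 * trainable / all_params, 4))
```

Both numbers agree with the log (9,805,824 trainable parameters, 0.3897%). Subtracting the adapter parameters from the total gives 2,506,172,416, which is the parameter count the inference log reports once the adapter has been merged into the base model.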


from IPython.display import Image

# Path to the loss curve saved during training
image_path = 'train_MRPC/training_loss.png'

# Display the image
Image(filename=image_path)


from llamafactory.chat import ChatModel
from llamafactory.extras.misc import torch_gc
from tqdm import tqdm
import json
import pandas as pd

# Load the evaluation set from its JSON file
with open('data/MRPC_test_data.json', 'r') as f:
    evaluation_data = json.load(f)

args = dict(
    model_name_or_path="/root/models/gemma-2b-it/AI-ModelScope/gemma-2b-it",  # base model path, as shown in the log
    adapter_name_or_path="train_MRPC",  # load the LoRA adapter saved earlier
    template="gemma",  # same as in training
    finetuning_type="lora"  # same as in training
)
chat_model = ChatModel(args)

results = []

# Wrap the loop in tqdm to show a progress bar
for sample in tqdm(evaluation_data, desc="Evaluating"):
    instruction = sample['instruction']
    input_text = sample['input']
    expected_output = sample['output']

    messages = [
        {"role": "user", "content": f"{instruction}\n{input_text}"}
    ]

    response = ""
    for new_text in chat_model.stream_chat(messages):
        response += new_text

    # Record the model output next to the expected output
    results.append({
        "instruction": instruction,
        "input": input_text,
        "expected_output": expected_output,
        "model_output": response.strip()
    })

# Collect the results in a DataFrame
df = pd.DataFrame(results)
# Save the DataFrame to a CSV file
df.to_csv('evaluation_results.csv', index=False)

[INFO|tokenization_utils_base.py:2287] 2024-07-31 22:19:25,520 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2287] 2024-07-31 22:19:25,520 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2287] 2024-07-31 22:19:25,520 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2287] 2024-07-31 22:19:25,521 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2287] 2024-07-31 22:19:25,521 >> loading file tokenizer_config.json
07/31/2024 22:19:27 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
07/31/2024 22:19:28 - INFO - llamafactory.model.adapter - Merged 1 adapter(s).
07/31/2024 22:19:28 - INFO - llamafactory.model.adapter - Loaded adapter(s): train_MRPC
07/31/2024 22:19:28 - INFO - llamafactory.model.loader - all params: 2,506,172,416

Evaluating: 100%|██████████| 388/388 [00:16<00:00, 23.67it/s]


!pip install scikit-learn
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Requirement already satisfied: scikit-learn in /root/miniconda3/lib/python3.10/site-packages (1.5.1)
Requirement already satisfied: threadpoolctl>=3.1.0 in /root/miniconda3/lib/python3.10/site-packages (from scikit-learn) (3.5.0)
Requirement already satisfied: scipy>=1.6.0 in /root/miniconda3/lib/python3.10/site-packages (from scikit-learn) (1.14.0)
Requirement already satisfied: numpy>=1.19.5 in /root/miniconda3/lib/python3.10/site-packages (from scikit-learn) (1.26.3)
Requirement already satisfied: joblib>=1.2.0 in /root/miniconda3/lib/python3.10/site-packages (from scikit-learn) (1.4.2)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Read the CSV file
df = pd.read_csv('evaluation_results.csv')

# Make sure the model_output and expected_output columns are strings
df['model_output'] = df['model_output'].astype(str)
df['expected_output'] = df['expected_output'].astype(str)

# Map the model outputs and the expected outputs to binary labels
df['model_output_binary'] = df['model_output'].apply(lambda x: 1 if x.strip() == '1' else 0)
df['expected_output_binary'] = df['expected_output'].apply(lambda x: 1 if x.strip() == '1' else 0)

# Compute the evaluation metrics
accuracy = accuracy_score(df['expected_output_binary'], df['model_output_binary'])
precision = precision_score(df['expected_output_binary'], df['model_output_binary'])
recall = recall_score(df['expected_output_binary'], df['model_output_binary'])
f1 = f1_score(df['expected_output_binary'], df['model_output_binary'])

# Print the evaluation metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
Accuracy: 0.8686
Precision: 0.8873
Recall: 0.9242
F1 Score: 0.9054
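One caveat about the label mapping in the code above: any model output other than '1' is mapped to 0, so a malformed generation is silently counted as a "not equivalent" prediction. The sketch below shows a stricter alternative that flags such outputs instead. `parse_label` is a hypothetical helper and the sample outputs are made up; whether the fine-tuned model ever produces anything other than '0' or '1' is not something the results above tell us.

```python
from typing import Optional

def parse_label(raw) -> Optional[int]:
    """Hypothetical stricter parser: return 0 or 1, or None for anything else."""
    text = str(raw).strip()
    if text in ("0", "1"):
        return int(text)
    return None

# Made-up model outputs, including two that are not valid labels.
outputs = ["1", " 0 ", "yes", "1\n", ""]
parsed = [parse_label(o) for o in outputs]
invalid = sum(p is None for p in parsed)
print(parsed, invalid)
```

Counting the `None` values before computing the metrics makes it clear how much of the reported error, if any, comes from formatting rather than from wrong judgments.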

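Accuracy, precision, recall and the F1 score are the metrics most commonly used to evaluate a classification model. Accuracy is the share of all predictions that are correct. Precision is the share of predicted positives that are truly positive. Recall is the share of actual positives that the model finds. The F1 score is the harmonic mean of precision and recall.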
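To make these definitions concrete, the sketch below computes the four metrics from confusion-matrix counts. The counts (TP = 244, FP = 31, FN = 20, TN = 93) are not given anywhere in the output above; they were inferred by working backwards from the four reported scores and the 388 evaluation examples, so read them as one set of counts consistent with the results rather than as measured values.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)   # of the predicted positives, how many are right
    recall = tp / (tp + fn)      # of the actual positives, how many were found
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Inferred counts for the 388 evaluation examples (an assumption, see the note above).
metrics = classification_metrics(tp=244, fp=31, fn=20, tn=93)
print({name: round(value, 4) for name, value in metrics.items()})
```

Under these counts the model misses 20 of the 264 equivalent pairs and wrongly marks 31 of the 124 non-equivalent pairs as equivalent, which is why recall comes out higher than precision.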





