Fraud Text Classification Fine-Tuning (6): LoRA Single-GPU Training

Author: 沉下心来学鲁班

Last modified: 2024-09-26 10:51:39

First published: 2024-08-23 01:16:48

Article link: https://blog.csdn.net/xiaojia1001/article/details/141440847

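1. Introduction

The earlier post Fraud Text Classification Fine-Tuning (4): Building the Training/Test Datasets produced the datasets, and before that Fraud Text Classification Fine-Tuning (1): Choosing a Base Model selected the base model. This post fine-tunes that model on those datasets for fraud text classification.

For the fine-tuning method we use the widely adopted LoRA, which injects low-rank matrices into the model.
For the trainer we use the Trainer class provided by the transformers library.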

2. Data preparation

2.1 Loading the data

Import the basic packages we need.

import os
import json
import torch
from datasets import Dataset
import pandas as pd
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForSeq2Seq, TrainingArguments, Trainer, EarlyStoppingCallback
from peft import LoraConfig, TaskType, get_peft_model

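AutoModelForCausalLM: loads the model
AutoTokenizer: loads the tokenizer
TrainingArguments: configures the training parameters
Trainer: trains the model
EarlyStoppingCallback: stops training early once the evaluation loss no longer decreases

Declare the paths of the datasets and the base model, plus the output path for the fine-tuned model parameters.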

traindata_path = '/data2/anti_fraud/dataset/train0819.jsonl'
evaldata_path = '/data2/anti_fraud/dataset/eval0819.jsonl'
model_path = '/data2/anti_fraud/models/modelscope/hub/Qwen/Qwen2-1___5B-Instruct'
output_path = '/data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0819_1'

Define a helper function load_jsonl to load a dataset, and use view_data_distribution to inspect its label distribution.

def load_jsonl(path):
    # Each line of the file is one JSON object
    with open(path, 'r', encoding='utf-8') as file:
        data = [json.loads(line) for line in file]
    return pd.DataFrame(data)

def view_data_distribution(data_path, show_first=False):
    df = load_jsonl(data_path)
    print(f"total_count:{df.shape[0]}, true_count: {df['label'].sum()}, false_count: {(df['label']==False).sum()}")
    # Optionally print the first record to see what a sample looks like
    if show_first:
        print(json.dumps(df.iloc[0].to_dict(), indent=4, ensure_ascii=False))

view_data_distribution(traindata_path, show_first=True)

    total_count:18787, true_count: 9377, false_count: 9410
    {
        "input": "发言人3: 现在我所在这个哪里能够工艺能够去把屈光做得很好的，去到这个省级医院是自治区医院跟广西医科大学这个附属医院他们还可以，他们一直保持比较好的一个一个手术量。\n发言人1: 就是",
        "label": false,
        "fraud_speaker": "",
        "instruction": "\n下面是一段对话文本, 请分析对话内容是否有诈骗风险，以json格式输出你的判断结果(is_fraud: true/false)。\n"
    }

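2.2 Tokenizing the data

As shown above, the raw training data is text, while the model takes numbers as input, so a tokenizer has to convert the text into numeric IDs.

Every language model maintains a vocabulary that maps each token the model knows to an integer ID. Vocabularies differ between models, so the tokenizer must be created from the vocabulary of the chosen base model.

A tokenizer first splits the text into a list of tokens with a tokenization algorithm, then uses the vocabulary to map each token to an integer ID the language model can process. For details see Deconstructing Language Models: Tokenizer.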

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False, trust_remote_code=True)
tokenizer

Qwen2Tokenizer(name_or_path='/data2/anti_fraud/models/modelscope/hub/Qwen/Qwen2-1___5B-Instruct', vocab_size=151643, model_max_length=32768, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>']}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
    	151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    	151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    	151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    }

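The tokenizer output above shows that the base vocabulary has 151643 tokens (the special tokens are added on top of it, starting at ID 151643), that the model supports sequences of up to 32768 tokens, and that it defines the start marker <|im_start|>, the end marker <|im_end|> and the padding token <|endoftext|>. These special tokens must be added to the text correctly during preprocessing.

Let's tokenize a simple text to see what the result looks like.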

tokenizer("你是谁")

{'input_ids': [105043, 100165], 'attention_mask': [1, 1]}

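input_ids is the text 你是谁 ("Who are you") converted to a list of token IDs. attention_mask is an array of the same length as input_ids that tells the model which tokens to attend to and which to ignore; padding tokens should normally be ignored.

Note: attention_mask values are 0 or 1. 1 means the token at that position is real input the model should attend to, 0 means it is padding the model should ignore.

Define the preprocessing function, which converts an input record into the three sequences the model needs: input IDs, attention mask and labels.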

def preprocess(item, tokenizer, max_length=2048):
    system_message = "You are a helpful assistant."
    user_message = item['instruction'] + item['input']
    assistant_message = json.dumps({"is_fraud":item["label"]}, ensure_ascii=False)
    
    input_ids, attention_mask, labels = [], [], []
    # Build the prompt in the chat format marked by <|im_start|> and <|im_end|>
    instruction = tokenizer(f"<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{user_message}<|im_end|>\n<|im_start|>assistant\n", add_special_tokens=False)
    response = tokenizer(assistant_message, add_special_tokens=False)
    # Append the pad token (<|endoftext|> for this tokenizer) to mark the end of the response
    input_ids = instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
    attention_mask = instruction["attention_mask"] + response["attention_mask"] + [1]
    # -100 is a special label value: positions labeled -100 (the instruction tokens) are excluded from the loss
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"] + [tokenizer.pad_token_id]

    # Guard against overly long inputs: truncate anything beyond max_length
    return {
        "input_ids": input_ids[:max_length],
        "attention_mask": attention_mask[:max_length],
        "labels": labels[:max_length]
    }
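
To make the role of -100 concrete, here is a minimal pure-Python sketch of a loss that averages only over positions whose label is not -100 (this is also the default ignore_index of PyTorch's cross-entropy loss). The function name masked_nll and the probabilities are invented for illustration; a real training step works on logits and shifts the labels by one position.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips

def masked_nll(token_probs, labels):
    """Average negative log-likelihood over positions whose label is not IGNORE_INDEX.

    token_probs[i] is the (made-up) probability the model assigned to the
    correct token at position i.
    """
    kept = [-math.log(p) for p, y in zip(token_probs, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept)

# Toy sequence: 3 instruction tokens (masked) followed by 2 response tokens.
labels = [IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX, 4913, 151643]
token_probs = [0.01, 0.02, 0.03, 0.5, 0.25]

loss = masked_nll(token_probs, labels)
print(round(loss, 4))  # only the last two positions contribute
```

However badly the model predicts the instruction tokens, they do not affect the loss; only the response tokens and the end marker are learned.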

Wrap dataset loading in load_dataset, which applies preprocess to every record of both datasets.

def load_dataset(train_path, eval_path, tokenizer):
    train_df = load_jsonl(train_path)
    train_ds = Dataset.from_pandas(train_df)
    train_dataset = train_ds.map(lambda x: preprocess(x, tokenizer), remove_columns=train_ds.column_names)
    
    eval_df = load_jsonl(eval_path)
    eval_ds = Dataset.from_pandas(eval_df)
    eval_dataset = eval_ds.map(lambda x: preprocess(x, tokenizer),  remove_columns=eval_ds.column_names)
    return train_dataset, eval_dataset

train_dataset, eval_dataset = load_dataset(traindata_path, evaldata_path, tokenizer)

print(train_dataset)


    Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 18787
    })

Inspect the tokenized result:

f"Input IDs: {train_dataset[0]['input_ids']}, Attention Mask: {train_dataset[0]['attention_mask']}, Labels: {train_dataset[0]['labels']}"

'Input IDs: [151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 
872, 271, 100431, 99639, 37474, 105051, 108704, 11, 220, 14880, 101042, 105051, 
43815, 107189, 106037, 101052, 3837, 23031, 2236, 68805, 66017, 103929, 104317, 
59151, 9623, 761, 97957, 25, 830, 91233, 8, 8997, 110395, 18, 25, 10236, 236, 108, 
102865, 101393, 99487, 101314, 100006, 101189, 100006, 85336, 99360, 102683, 99225, 
106630, 104528, 3837, 85336, 26939, 99487, 104671, 100634, 20412, 104917, 100634, 
99557, 104366, 115203, 99487, 108398, 100634, 99650, 104468, 3837, 99650, 99725, 
100662, 99792, 99692, 46944, 46944, 104160, 32757, 8997, 110395, 16, 25, 58230, 109, 
20412, 151645, 198, 151644, 77091, 198, 4913, 285, 761, 97957, 788, 895, 92, 151643], 
Attention Mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1], Labels: [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 
-100, -100, 4913, 285, 761, 97957, 788, 895, 92, 151643]'

The output is just a pile of numbers, which is what the model computes on. For human eyes, we can decode it back into text.

tokenizer.decode(train_dataset[0]['input_ids'])


 '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n\n下面是一
 段对话文本, 请分析对话内容是否有诈骗风险，以json格式输出你的判断结果(is_fraud: 
 true/false)。\n发言人3: 现在我所在这个哪里能够工艺能够去把屈光做得很好的，去到这个
 省级医院是自治区医院跟广西医科大学这个附属医院他们还可以，他们一直保持比较好的一
 个一个手术量。\n发言人1: 就是<|im_end|>\n<|im_start|>assistant\n{"is_fraud": false}
 <|endoftext|>'

The labels contain a large number of -100 markers; filter them out before decoding.

tokenizer.decode(list(filter(lambda x: x != -100, train_dataset[0]["labels"])))

    '{"is_fraud": false}<|endoftext|>'

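3. Model preparation

3.1 Loading the model

Specify the device. We start with a single GPU on a single machine.

The environment variable CUDA_VISIBLE_DEVICES restricts which GPUs the current process can use.
device is the device the model runs on; since we use only one device, cuda is enough.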

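# Select the GPU this process may use (must be set before CUDA is initialized)
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
# With several GPUs, device_map="auto" would split the model across them; to avoid that, call to(device) afterwards instead
device = 'cuda'

First load the model into host memory, then move it to the chosen GPU with model.to(device). The model is small and resources are limited, so the 16-bit bfloat16 data type is sufficient.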

def load_model(model_path, device='cuda'):
    model = AutoModelForCausalLM.from_pretrained(model_path,torch_dtype=torch.bfloat16)
    model.enable_input_require_grads() # required when gradient checkpointing is enabled
    return model.to(device)

model = load_model(model_path, device)
model

    Qwen2ForCausalLM(
      (model): Qwen2Model(
        (embed_tokens): Embedding(151936, 1536)
        (layers): ModuleList(
          (0-27): 28 x Qwen2DecoderLayer(
            (self_attn): Qwen2SdpaAttention(
              (q_proj): Linear(in_features=1536, out_features=1536, bias=True)
              (k_proj): Linear(in_features=1536, out_features=256, bias=True)
              (v_proj): Linear(in_features=1536, out_features=256, bias=True)
              (o_proj): Linear(in_features=1536, out_features=1536, bias=False)
              (rotary_emb): Qwen2RotaryEmbedding()
            )
            (mlp): Qwen2MLP(
              (gate_proj): Linear(in_features=1536, out_features=8960, bias=False)
              (up_proj): Linear(in_features=1536, out_features=8960, bias=False)
              (down_proj): Linear(in_features=8960, out_features=1536, bias=False)
              (act_fn): SiLU()
            )
            (input_layernorm): Qwen2RMSNorm()
            (post_attention_layernorm): Qwen2RMSNorm()
          )
        )
        (norm): Qwen2RMSNorm()
      )
      (lm_head): Linear(in_features=1536, out_features=151936, bias=False)
    )

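The structure of the Qwen2 model is clearly visible here: an embedding layer first, followed by 28 decoder layers (Qwen2DecoderLayer), each made of attention and an MLP, and finally an output layer (lm_head) that maps hidden states to scores over the vocabulary.

3.2 Inserting the fine-tuning parameters

Fine-tuning with LoRA modifies the model structure: a pair of low-rank matrices with rank r=8 is inserted alongside each targeted linear module in every decoder layer. Only these low-rank matrices are trained; the original model parameters stay frozen.

target_modules: which modules of the model to modify, that is, which modules get low-rank matrices inserted.
r: the rank of the low-rank matrices. The smaller the value, the fewer parameters the model can learn. We use the default of 8.
lora_alpha: a scaling factor. The LoRA update is multiplied by lora_alpha / r, which controls how much weight the LoRA parameters carry relative to the original weights. We set it to 2 times r, as commonly recommended.
lora_dropout: the dropout probability applied to the input of the LoRA branch during training, which adds randomness to improve generalization.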

def build_peft_model(model):
    config = LoraConfig(
        task_type=TaskType.CAUSAL_LM, 
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        inference_mode=False, # training mode
        r=8, 
        lora_alpha=16,   
        lora_dropout=0.05
    )
    return get_peft_model(model, config)

peft_model = build_peft_model(model)
peft_model

    PeftModelForCausalLM(
      (base_model): LoraModel(
        (model): Qwen2ForCausalLM(
          (model): Qwen2Model(
            (embed_tokens): Embedding(151936, 1536)
            (layers): ModuleList(
              (0-27): 28 x Qwen2DecoderLayer(
                (self_attn): Qwen2SdpaAttention(
                  (q_proj): lora.Linear(
                    (base_layer): Linear(in_features=1536, out_features=1536, bias=True)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.1, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=1536, out_features=8, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=8, out_features=1536, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
                    (lora_embedding_B): ParameterDict()
                    (lora_magnitude_vector): ModuleDict()
                  )
                  (k_proj): lora.Linear(
                    (base_layer): Linear(in_features=1536, out_features=256, bias=True)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.1, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=1536, out_features=8, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=8, out_features=256, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
                    (lora_embedding_B): ParameterDict()
                    (lora_magnitude_vector): ModuleDict()
                  )
                  (v_proj): lora.Linear(
                    (base_layer): Linear(in_features=1536, out_features=256, bias=True)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.1, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=1536, out_features=8, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=8, out_features=256, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
                    (lora_embedding_B): ParameterDict()
                    (lora_magnitude_vector): ModuleDict()
                  )
                  (o_proj): lora.Linear(
                    (base_layer): Linear(in_features=1536, out_features=1536, bias=False)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.1, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=1536, out_features=8, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=8, out_features=1536, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
                    (lora_embedding_B): ParameterDict()
                    (lora_magnitude_vector): ModuleDict()
                  )
                  (rotary_emb): Qwen2RotaryEmbedding()
                )
                (mlp): Qwen2MLP(
                  (gate_proj): lora.Linear(
                    (base_layer): Linear(in_features=1536, out_features=8960, bias=False)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.1, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=1536, out_features=8, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=8, out_features=8960, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
                    (lora_embedding_B): ParameterDict()
                    (lora_magnitude_vector): ModuleDict()
                  )
                  (up_proj): lora.Linear(
                    (base_layer): Linear(in_features=1536, out_features=8960, bias=False)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.1, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=1536, out_features=8, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=8, out_features=8960, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
                    (lora_embedding_B): ParameterDict()
                    (lora_magnitude_vector): ModuleDict()
                  )
                  (down_proj): lora.Linear(
                    (base_layer): Linear(in_features=8960, out_features=1536, bias=False)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.1, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=8960, out_features=8, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=8, out_features=1536, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
                    (lora_embedding_B): ParameterDict()
                    (lora_magnitude_vector): ModuleDict()
                  )
                  (act_fn): SiLU()
                )
                (input_layernorm): Qwen2RMSNorm()
                (post_attention_layernorm): Qwen2RMSNorm()
              )
            )
            (norm): Qwen2RMSNorm()
          )
          (lm_head): Linear(in_features=1536, out_features=151936, bias=False)
        )
      )
    )

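As you can see, each targeted module such as q, k, v and o now has the small lora_A and lora_B matrices plus a lora_dropout layer. (The printout shows Dropout(p=0.1) rather than the lora_dropout=0.05 configured above, so it was presumably captured with an earlier configuration.)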
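
As a rough illustration of what a lora.Linear module computes, the sketch below implements y = W x + (lora_alpha / r) * B (A x) in pure Python on toy-sized matrices. The helper names matvec and lora_linear and all numbers are made up, and dropout is left out; it is a sketch of the idea, not the peft implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_linear(x, W, A, B, r, lora_alpha):
    """Frozen base projection plus the scaled low-rank update."""
    scaling = lora_alpha / r
    base = matvec(W, x)                 # W x, the original layer
    update = matvec(B, matvec(A, x))    # B (A x), the low-rank branch
    return [b + scaling * u for b, u in zip(base, update)]

# Toy sizes: in_features=3, out_features=2, r=1
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]  # frozen base weight (out x in)
A = [[0.5, -0.5, 1.0]]                  # lora_A (r x in)
B_zero = [[0.0], [0.0]]                 # lora_B (out x r) at zero initialization
B_trained = [[1.0], [2.0]]              # lora_B after some hypothetical training
x = [1.0, 2.0, 3.0]

y_init = lora_linear(x, W, A, B_zero, r=1, lora_alpha=2)
y_trained = lora_linear(x, W, A, B_trained, r=1, lora_alpha=2)
print(y_init, y_trained)
```

With lora_B at zero the output equals that of the original layer, so fine-tuning starts from the base model's behavior; the ratio lora_alpha / r then sets how strongly the learned update is added.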

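Check how many parameters will be trained. The numbers show that only a small fraction of the parameters (the inserted LoRA low-rank matrices) need to be learned.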

peft_model.print_trainable_parameters()

trainable params: 9,232,384 || all params: 1,552,946,688 || trainable%: 0.5945
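
The trainable parameter count can be checked by hand: every targeted linear layer with shape (in_features, out_features) gains r * in_features parameters in lora_A and r * out_features in lora_B. The sketch below does this arithmetic with the shapes from the model printout above; the function name lora_param_count is only illustrative.

```python
R = 8
NUM_LAYERS = 28

# (in_features, out_features) of each targeted module, from the model printout
TARGET_SHAPES = {
    "q_proj": (1536, 1536),
    "k_proj": (1536, 256),
    "v_proj": (1536, 256),
    "o_proj": (1536, 1536),
    "gate_proj": (1536, 8960),
    "up_proj": (1536, 8960),
    "down_proj": (8960, 1536),
}

def lora_param_count(shapes, r, num_layers):
    """lora_A has r * in_features weights, lora_B has r * out_features weights."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers

trainable = lora_param_count(TARGET_SHAPES, R, NUM_LAYERS)
percent = 100 * trainable / 1_552_946_688  # total taken from the output above
print(trainable, round(percent, 4))
```

This reproduces the 9,232,384 trainable parameters and the 0.5945% reported by print_trainable_parameters().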

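3.3 Building the trainer

Configure the training arguments. This is the part that will need the most attention and tuning later; since this is the first fine-tuning run, we start with mostly default values, see how they do, then adjust. Notes on some of the arguments:

per_device_train_batch_size: the mini-batch size each device processes per forward/backward pass, set to 4 here (the library default is 8).
gradient_accumulation_steps: the number of gradient accumulation steps. Without it the parameters would be updated every 4 samples; with accumulation of 4 they are updated every 16 samples, which effectively increases the batch size.
num_train_epochs: total number of training epochs; the default of 3 means the whole dataset is trained on 3 times.
eval_strategy: the evaluation strategy, one of no, steps or epoch.
eval_steps: how many steps between evaluations, where one step is one parameter update (16 samples with the settings here). Applies when eval_strategy=steps.
save_steps: how many steps between automatic checkpoints of the model parameters.
learning_rate: the learning rate, set to 1e-4 here (the library default is 5e-5).
load_best_model_at_end: automatically load the best model when training ends.
gradient_checkpointing: whether to enable gradient checkpointing. When enabled, most intermediate activations are not kept for the backward pass but recomputed when needed, which saves GPU memory at the cost of extra computation.

In actual testing, for the 1.5B model with batch_size=4, training used about 22G of GPU memory without gradient checkpointing and about 17G with it, a clearly noticeable saving.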

def build_train_arguments(output_path):
    return TrainingArguments(
        output_dir=output_path,
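        per_device_train_batch_size=4,  # training batch size per device (e.g. per GPU)
        gradient_accumulation_steps=4,  # gradient accumulation steps, effectively a larger batch size
        logging_steps=10,
        num_train_epochs=3,
        eval_strategy="steps",
        eval_steps=10, # evaluation interval in steps, kept equal to save_steps
        save_steps=10, # set to 10 for a quick demo; 100 is recommended
        learning_rate=1e-4,
        save_on_each_node=True,
        load_best_model_at_end=True, # load the best model when training ends
        gradient_checkpointing=True  # enable gradient checkpointing to save memory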
    )
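
To see what these settings mean in terms of data, the following sketch works out the effective batch size and the approximate number of parameter updates per epoch for the 18787 training samples. The function steps_summary is a hypothetical helper, and the per-epoch figure is an approximation of how the Trainer counts steps, not its exact internal logic.

```python
import math

def steps_summary(num_samples, per_device_batch_size, grad_accum_steps, num_devices=1):
    """Return (effective batch size, batches per epoch, approximate updates per epoch)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    batches_per_epoch = math.ceil(num_samples / (per_device_batch_size * num_devices))
    updates_per_epoch = batches_per_epoch / grad_accum_steps
    return effective_batch, batches_per_epoch, updates_per_epoch

effective_batch, batches_per_epoch, updates_per_epoch = steps_summary(18787, 4, 4)

eval_steps = 10
samples_between_evals = eval_steps * effective_batch

print(effective_batch, batches_per_epoch, updates_per_epoch, samples_between_evals)
```

With eval_steps=10 the model is evaluated and checkpointed after every 160 training samples, which is very frequent compared with roughly 1174 updates per epoch.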

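Next, build the trainer. It takes few arguments; the key ones to understand are:

eval_dataset: the evaluation dataset. Only when it is set is the model evaluated automatically during training and the Validation Loss reported, i.e. validation runs alongside training.
data_collator: controls how individual samples are merged into a batch. DataCollatorForSeq2Seq pads the sequences automatically: input_ids are padded with the tokenizer's padding token and labels with -100, so that sequences of different lengths in a batch end up with the same length.

Note: the texts in a batch usually have different lengths, but the model's matrix operations need all sequences in a batch to have the same length, otherwise an error is raised. That is why padding=True is specified to pad the inputs to a common length.
EarlyStoppingCallback: a callback that stops training early. early_stopping_patience=3 means training stops once the validation metric has failed to improve for 3 consecutive evaluations.

Note: by default, training runs through all the data in train_dataset for all num_train_epochs epochs. In some situations (for example when overfitting sets in) training should end earlier; the early-stopping callback prevents the model from getting worse with more training and, just as importantly, avoids wasting GPU compute.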

def build_trainer(model, tokenizer, args, train_dataset, eval_dataset):
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # early-stopping callback
    )
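
The padding performed by the data collator can be pictured with a simplified pure-Python version. pad_batch below is a hypothetical stand-in, not the DataCollatorForSeq2Seq code: it right-pads input_ids with the pad token, attention_mask with 0 and labels with -100, and returns plain lists instead of tensors.

```python
def pad_batch(features, pad_token_id, label_pad_id=-100):
    """Right-pad every sample in the batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)  # 0 = ignore padding
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)       # padding adds no loss
    return batch

# Two toy samples of different lengths (token IDs are made up)
features = [
    {"input_ids": [1, 2, 3, 4], "attention_mask": [1, 1, 1, 1], "labels": [-100, -100, 3, 4]},
    {"input_ids": [5, 6], "attention_mask": [1, 1], "labels": [-100, 6]},
]
batch = pad_batch(features, pad_token_id=151643)
print(batch)
```

After padding, every list in the batch has the same length, so the batch can be stacked into a rectangular tensor.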

4. Start training

Build a trainer from the dataset and model prepared above, and call trainer.train() to start training.

trainer = build_trainer(peft_model, tokenizer, build_train_arguments(output_path), train_dataset, eval_dataset)
trainer.train()

TrainOutput(global_step=90, training_loss=0.0323091435763571, metrics={'train_runtime': 762.4751, 'train_samples_per_second': 73.918, 'train_steps_per_second': 4.619, 'total_flos': 2991778113798144.0, 'train_loss': 0.0323091435763571, 'epoch': 0.07664466680860124})

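Training ended rather quickly: global_step=90 is only about 0.077 of an epoch, most likely because the early-stopping callback kicked in after 3 consecutive evaluations without improvement in the validation loss.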
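
The sketch below replays the early-stopping rule on a list of validation losses to show how a stop at step 90 can come about. The loss values are invented for illustration (the real ones were only shown in a screenshot), and early_stop_step is a hypothetical helper that mimics the patience counter, not the callback's actual code.

```python
import math

def early_stop_step(eval_losses, eval_steps, patience):
    """Return the step at which training stops, or None if it never stops early."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses, start=1):
        if loss < best:
            best = loss
            bad_evals = 0        # improvement resets the patience counter
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i * eval_steps
    return None

# Made-up validation losses, one per evaluation (every 10 steps)
eval_losses = [0.30, 0.12, 0.08, 0.05, 0.04, 0.03, 0.035, 0.04, 0.031]
stop_step = early_stop_step(eval_losses, eval_steps=10, patience=3)

# Fraction of an epoch completed: micro-batches consumed / micro-batches per epoch
batches_per_epoch = math.ceil(18787 / 4)
epoch = stop_step * 4 / batches_per_epoch
print(stop_step, round(epoch, 4))
```

Under these assumed losses the best value occurs at step 60 and training stops at step 90, and the computed epoch fraction agrees with the 'epoch': 0.0766 in the TrainOutput above.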

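After training finished, the checkpoints were saved under output_path.

Next, evaluate the fine-tuned model. We use the Jupyter magic command %run to bring in the evaluation code built in an earlier post, and measure the model's performance on the designated test set.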

%run evaluate.py

testdata_path = '/data2/anti_fraud/dataset/test0819.jsonl'
evaluate_with_model(peft_model, tokenizer, testdata_path, device, debug=True)

progress: 100%|██████████| 2349/2349 [19:22<00:00,  2.02it/s]

tn：1107, fp:60, fn:322, tp:860
precision: 0.9347826086956522, recall: 0.727580372250423

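Compared with the base model results from Fraud Text Classification Fine-Tuning (5): Model Evaluation (precision: 0.8805, recall: 0.4576), this run improves both precision (0.9347) and recall (0.7275), and the gain in recall is substantial.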
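
For reference, precision and recall follow directly from the confusion-matrix counts printed above. The helper precision_recall is only an illustration of the arithmetic, not the code in evaluate.py.

```python
def precision_recall(tp, fp, fn):
    """precision = tp / (tp + fp); recall = tp / (tp + fn)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Counts from this run: tn=1107, fp=60, fn=322, tp=860
precision, recall = precision_recall(tp=860, fp=60, fn=322)
print(precision, recall)
```

Precision says how many of the texts flagged as fraud really were fraud, while recall says how many of the actual fraud texts were caught; the 322 false negatives are what keeps recall at about 0.73.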

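Doesn't this result look decent? In fact things did not go that smoothly. Two training runs came before this one, and their results were noticeably worse, but I still want to record them here so that we can learn from them.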

Failed attempt 1

One run before the training above differed from it mainly in two LoRA parameters, lora_alpha=32 and lora_dropout=0.1:

def build_peft_model(model):
    config = LoraConfig(
        task_type=TaskType.CAUSAL_LM, 
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        inference_mode=False, # training mode
        r=8, 
        lora_alpha=32,   
        lora_dropout=0.1
    )
    return get_peft_model(model, config)

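# get_peft_model modifies the base model in place, so load a fresh copy instead of reusing the model already adapted above
peft_model2 = build_peft_model(load_model(model_path, device))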

output_path2 = '/data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0819_2'
trainer2 = build_trainer(peft_model2, tokenizer, build_train_arguments(output_path2), train_dataset, eval_dataset)
trainer2.train()

TrainOutput(global_step=80, training_loss=0.06814795173704624, metrics={'train_runtime': 673.4243, 'train_samples_per_second': 83.693, 'train_steps_per_second': 5.23, 'total_flos': 2662222676484096.0, 'train_loss': 0.06814795173704624, 'epoch': 0.06812859271875665})

Run the evaluation.

testdata_path = '/data2/anti_fraud/dataset/test0819.jsonl'
evaluate_with_model(peft_model2, tokenizer, testdata_path, device, debug=True)

progress: 100%|██████████| 2349/2349 [18:50<00:00,  2.08it/s]

tn：1148, fp:19, fn:595, tp:587
precision: 0.9686468646864687, recall: 0.4966159052453469

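In this result precision is fine, but the recall of 0.4966 is far below the 0.7275 above, and only slightly better than the base model's 0.4576.

The large gap is probably due to the different ratio lora_alpha/r (2 versus 4), although lora_dropout also changed between the two runs, so the effect is not isolated. The article Practical Tips for Finetuning LLMs Using LoRA reports from its experiments that a ratio of 2 gave the best results.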

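Failed attempt 2

What happens if we train only on positive samples?

When first learning a technique, I always want to try out the various possibilities myself and see their effects, to build a basic intuition for it.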

traindata_path3 = '/data2/anti_fraud/dataset/train.jsonl'
testdata_path3 = '/data2/anti_fraud/dataset/test.jsonl'

view_data_distribution(traindata_path3)
view_data_distribution(testdata_path3)

    total_count:21792, true_count: 21792, false_count: 0
    total_count:5464, true_count: 5464, false_count: 0

The distribution above shows that this dataset contains only positive samples and no negative ones. We now train on it.

output_path3 = '/data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0819_3'

train_dataset3, eval_dataset3 = load_dataset(traindata_path3, testdata_path3, tokenizer)
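# Continues training peft_model2 from the previous attempt; only the first 1000 test records are used for evaluation
trainer3 = build_trainer(peft_model2, tokenizer, build_train_arguments(output_path3), train_dataset3, eval_dataset3.select(range(1000)))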
trainer3.train()

    TrainOutput(global_step=80, training_loss=2.9383825705053824e-06, metrics={'train_runtime': 445.6354, 'train_samples_per_second': 146.703, 'train_steps_per_second': 9.169, 'total_flos': 3645473460867072.0, 'train_loss': 2.9383825705053824e-06, 'epoch': 0.05873715124816446})

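This training run is peculiar: the loss dropped to 0 in fewer than 10 steps. It is as if the model were clever enough to discover, after only a few steps, a shortcut leading straight to the goal, and that shortcut works extremely well on the given training data.

How it performs on unseen data can only be settled by an evaluation.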

testdata_path = '/data2/anti_fraud/dataset/test0819.jsonl'
evaluate_with_model(peft_model2, tokenizer, testdata_path, device, debug=True)

    progress: 100%|██████████| 2349/2349 [19:12<00:00,  2.04it/s]

    tn：0, fp:1167, fn:0, tp:1182
    precision: 0.5031928480204342, recall: 1.0

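The evaluation result shows that the shortcut the model found is to classify everything as positive (tn and fn both being 0 means it never predicts the negative class).

The lesson here: models are very good at finding shortcuts, so when preparing training data we must not leave misleading features for the model to learn. This is why the data preparation stage earlier involved so much work on things like length alignment and balanced distribution: the goal was to remove misleading features so that the model learns the features we actually want it to learn.

Summary: following the LoRA approach, this post inserted separate low-rank matrices into the model structure and fine-tuned them, with promising first results. However, because of the configuration, training ended very early; the configuration needs to be adjusted so that the model trains on the data more fully and can learn more features. In addition, most training arguments are still at or near their defaults, so there is plenty of room for tuning.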
