Automated hyperparameter tuning ("alchemy") built on LLaMA-Factory: study notes

Article link: https://blog.csdn.net/qq_43691827/article/details/140353563

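I've recently been fine-tuning LLMs with LLaMA-Factory, but changing basic settings such as the learning rate and the number of epochs by hand, one run at a time, is too tedious. While looking for a way to tune them automatically, I saw on GitHub that LLaMA-Factory supports logging experiment data with Weights & Biases for automated tuning; I tried it but, being a beginner, couldn't get it working (*꒦ິ⌓꒦ີ). Besides, I need more than a low loss: the predicted outputs also have to meet a certain condition. So I decided to add a bit of my own code on top of LLaMA-Factory to reach that goal. If anything here is wrong, I'd appreciate corrections from more experienced readers.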

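Requirements:

I need to use LLaMA-Factory to fine-tune a llama3-8B model with suitable hyperparameters. According to the LLaMA-Factory README, I need two command-line commands. Command 1: supervised instruction fine-tuning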

llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml

Command 2: batch prediction with BLEU and ROUGE scoring

llamafactory-cli train examples/train_lora/llama3_lora_predict.yaml

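The rough plan is:

1. Modify the parameter file for supervised fine-tuning, then run command 1 to train the model.

2. Use the trained model to run prediction and collect the predicted outputs.

3. Score the outputs, and record the good results together with the training parameters that produced them.

Then repeat steps 1, 2, and 3 in a loop.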

———————————————————————————————————————————

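Getting started:

Step 1: modify the training YAML parameters

My training YAML file is shown below. I only change learning_rate and num_train_epochs; the plan is to find good values for these two first.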

### model
model_name_or_path: LLM-Research/Meta-Llama-3-8B  # my model directory from ModelScope

### method
stage: sft  # fine-tuning stage (supervised fine-tuning)
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: validationself  # must be defined by you in dataset_info.json under the data directory
template: llama3
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/test  # output directory
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4  # the learning rate I want to tune
num_train_epochs: 3.0  # number of training epochs
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

I first wrote the following loop to batch-modify learning_rate and num_train_epochs.

Admittedly the loop is crude: it is a brute-force search.

import yaml

# Read the original YAML file
with open('trainyear.yaml', 'r') as file:
    config = yaml.safe_load(file)

# Parameter ranges
num_train_epochs_range = list(range(50, 101))  # 50 to 100, i.e. 5.0 to 10.0 epochs in steps of 0.1
learning_rate_range = [0.0001 * (10 ** x) for x in range(4)] + [0.001 * (1 + y) for y in range(90)]  # 0.0001 to 0.1; uneven steps, but the whole range is covered

# Generate every combination and overwrite the parameters in the source file
for epochs in num_train_epochs_range:
    num_train_epochs = epochs / 10.0  # convert to a float
    for lr in learning_rate_range:
        # Update the parameters in the config
        config['learning_rate'] = lr
        config['num_train_epochs'] = num_train_epochs

        # Write the YAML file back, replacing the previous parameters
        with open('trainyear.yaml', 'w') as file:
            yaml.safe_dump(config, file)

        print(f'Updated trainyear.yaml with epochs={num_train_epochs} and lr={lr}')
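To get a feel for how large this brute-force search is, the sketch below counts the combinations the two ranges above produce and contrasts them with a much smaller coarse grid. The helper name build_grid and the coarse values are illustrative assumptions, not part of LLaMA-Factory or of the original script.

```python
import itertools


def build_grid(epoch_values, lr_values):
    """Return every (num_train_epochs, learning_rate) pair to try."""
    return list(itertools.product(epoch_values, lr_values))


# The same ranges as in the loop above.
full_epochs = [e / 10.0 for e in range(50, 101)]
full_lrs = [0.0001 * (10 ** x) for x in range(4)] + [0.001 * (1 + y) for y in range(90)]
full_grid = build_grid(full_epochs, full_lrs)
print(len(full_grid))  # 51 epoch values * 94 learning rates = 4794 training runs

# A hypothetical coarse grid for a first pass (values chosen only for illustration).
coarse_grid = build_grid([3.0, 5.0, 7.0], [1e-5, 5e-5, 1e-4])
print(len(coarse_grid))  # 3 * 3 = 9 training runs
```

Since every pair means one full fine-tuning run, narrowing the grid before looping is probably worth the effort.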

Step 2: write a function that runs the original llamafactory-cli train <yaml> command and shows its output.

Implementing command 1:

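import subprocess

def run_command_train():
    """Run command 1 (supervised fine-tuning) and stream its output."""
    command = "llamafactory-cli train autoFinetuning/trainyear.yaml"

    # Merge stderr into stdout: reading the two pipes one after the other
    # can deadlock once the unread stderr pipe fills up.
    process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, bufsize=1, universal_newlines=True)
    for output in process.stdout:  # ends when the command closes its output
        print(output.strip())

    # Wait for the command to exit and report a failure, if any
    returncode = process.wait()
    if returncode != 0:
        print(f"Training command exited with code {returncode}")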

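At this point the model has been trained. Next I use it to run prediction on the validation set and collect the predicted outputs.

Step 3: prediction on the validation set

Prediction function, implementing command 2: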

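import subprocess

def run_command_predict():
    """Run command 2 (batch prediction) and stream its output."""
    command = "llamafactory-cli train autoFinetuning/predictyear.yaml"

    # Merge stderr into stdout: reading the two pipes one after the other
    # can deadlock once the unread stderr pipe fills up.
    process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, bufsize=1, universal_newlines=True)
    for output in process.stdout:  # ends when the command closes its output
        print(output.strip())

    # Wait for the command to exit and report a failure, if any
    returncode = process.wait()
    if returncode != 0:
        print(f"Prediction command exited with code {returncode}")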
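The training and prediction functions differ only in the YAML path, so they could share one helper. Below is a minimal sketch of that idea; the name run_and_stream is my own, and the demo runs a harmless Python one-liner as a stand-in so that it works without LLaMA-Factory installed.

```python
import subprocess
import sys


def run_and_stream(args):
    """Run a command, echo its combined stdout/stderr line by line,
    and return (exit_code, captured_lines)."""
    lines = []
    with subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # one pipe only, so nothing is left unread
        text=True,
        bufsize=1,
    ) as process:
        for line in process.stdout:
            line = line.rstrip()
            print(line)
            lines.append(line)
    # Leaving the with-block waits for the process, so returncode is set here.
    return process.returncode, lines


# Stand-in command for the demo.
code, lines = run_and_stream([sys.executable, "-c", "print('hello')"])
```

With LLaMA-Factory installed, the two steps would then become run_and_stream(["llamafactory-cli", "train", "autoFinetuning/trainyear.yaml"]) and the same call with autoFinetuning/predictyear.yaml. Passing the arguments as a list also avoids shell=True.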

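Next comes the YAML file used for prediction; I wrote it with reference to the command preview shown in the web UI.

The prediction YAML I wrote is as follows: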

### model
model_name_or_path: LLM-Research/Meta-Llama-3-8B  # Llama path from the ModelScope community
adapter_name_or_path: saves/test

### method
stage: sft
do_predict: true
finetuning_type: lora

### dataset
eval_dataset: validationself
dataset_dir: data
template: llama3
cutoff_len: 1024
max_samples: 50  # set this according to the size of your validation set
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/testwandb/lora/predict
overwrite_output_dir: true

### eval
per_device_eval_batch_size: 1
predict_with_generate: true
ddp_timeout: 180000000

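Note: the reference script provided by LLaMA-Factory does two things: batch prediction, and computing BLEU and ROUGE scores. I only need the batch prediction. As a beginner, the only approach I found was to borrow command 2 to generate the predictions in bulk.

The predictions are saved to generated_predictions.jsonl in the directory set by output_dir in the prediction YAML (here, output_dir: saves/testwandb/lora/predict).

Reference: LLaMA-Factory/examples/README_zh.md at main · hiyouga/LLaMA-Factory (github.com)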
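As a quick way to inspect that file, here is a small sketch that loads the predictions into a list in memory. It assumes only that each line is a JSON object with a predict field, which is what the processing code in the next step reads. The sample records, including their label field, are made up for the demo.

```python
import json
import os
import tempfile


def load_predictions(jsonl_path):
    """Return the 'predict' field of every non-empty line in a JSONL file."""
    predictions = []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                predictions.append(json.loads(line).get("predict", ""))
    return predictions


# Made-up records standing in for real model output.
sample = [
    {"label": "reference 1", "predict": "first answer"},
    {"label": "reference 2", "predict": "second answer"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "generated_predictions.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for record in sample:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    predictions = load_predictions(path)

print(predictions)
```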

Step 4: read the predictions and call an API to get scores

Next, read the batch prediction results from generated_predictions.jsonl and call an API to score them.

import json
import os

import requests


def process_jsonl_file(input_file, output_dir):
    """
    Process a JSONL file, saving the 'predict' field of each JSON object as its own text file.
    """
    os.makedirs(output_dir, exist_ok=True)  # make sure the output directory exists
    index = 1  # file index, starting at 1
    with open(input_file, 'r', encoding='utf-8') as file:  # open the input JSONL file
        for line in file:  # one JSON object per line
            if not line.strip():  # skip blank lines
                continue
            item = json.loads(line)  # parse the line into a dict
            predict_text = item.get('predict', '')  # value of the 'predict' field
            filename = f"{output_dir}/{index}.txt"  # output file name
            index += 1
            with open(filename, 'w', encoding='utf-8') as out_file:
                out_file.write(predict_text)  # write the 'predict' text
    print("All text files saved.")


def send_requests_and_save_results(text_files_dir, json_files_dir, api_url, headers, num_files=20):
    """
    Send each text file to the API and save the response as a JSON file.
    """
    os.makedirs(json_files_dir, exist_ok=True)  # make sure the output directory exists
    for i in range(1, num_files + 1):  # files 1.txt to num_files.txt
        file_name = f"{text_files_dir}/{i}.txt"  # input file name
        json_file_name = f"{json_files_dir}/{i}.json"  # output file name

        with open(file_name, 'r', encoding='utf-8') as file:
            input_text = file.read()  # read the file content

        payload = {"input_text": input_text}  # request body
        response = requests.post(api_url, json=payload, headers=headers)  # send the POST request
        response.raise_for_status()  # stop on HTTP errors instead of saving an error page
        result = response.json()  # JSON data from the response body

        with open(json_file_name, 'w', encoding='utf-8') as file:
            json.dump(result, file, ensure_ascii=False, indent=4)  # write the JSON data


def calculate_average_and_sort_values(json_files_dir, num_files=20):
    """
    Compute the average of the 'is_human_written' field across the JSON files.
    Returns None when no value could be read.
    """
    is_human_written_values = []  # collected 'is_human_written' values
    for i in range(1, num_files + 1):  # same file range as the request step
        file_name = f"{json_files_dir}/{i}.json"
        try:
            with open(file_name, 'r', encoding='utf-8') as file:
                data = json.load(file)
                is_human_written_values.append(data['data']['is_human_written'])
        except FileNotFoundError:  # the file does not exist
            print(f"File {file_name} does not exist.")
        except json.JSONDecodeError:  # the file is not valid JSON
            print(f"File {file_name} is not valid JSON.")
        except KeyError:  # the 'is_human_written' field is missing
            print(f"File {file_name} has no 'is_human_written' field.")

    if is_human_written_values:
        return sum(is_human_written_values) / len(is_human_written_values)  # average value
    print("Not enough data to compute an average.")
    return None
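The functions above walk a fixed range of file indices, while the prediction YAML sets max_samples: 50, so the counts can drift apart. One alternative is to average over whatever JSON files are actually present. The sketch below assumes the response shape the code above reads (data.is_human_written); the function name average_score and the sample scores are made up for illustration.

```python
import glob
import json
import os
import tempfile


def average_score(json_files_dir):
    """Average data.is_human_written over every *.json file in a directory.
    Returns None when no valid score is found."""
    values = []
    for path in sorted(glob.glob(os.path.join(json_files_dir, "*.json"))):
        try:
            with open(path, "r", encoding="utf-8") as f:
                values.append(json.load(f)["data"]["is_human_written"])
        except (json.JSONDecodeError, KeyError, TypeError):
            print(f"Skipping {path}: missing or malformed score.")
    return sum(values) / len(values) if values else None


# Made-up API responses, written to a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp:
    for i, score in enumerate([0.5, 1.0, 0.0], start=1):
        with open(os.path.join(tmp, f"{i}.json"), "w", encoding="utf-8") as f:
            json.dump({"data": {"is_human_written": score}}, f)
    mean = average_score(tmp)
    no_data = average_score(os.path.join(tmp, "missing"))  # nothing to read

print(mean, no_data)
```

Returning None for "no data" lets the caller skip that run rather than compare a missing score against real ones.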

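Step 5: always keep the best scores and their corresponding parameters

Rewrite the loop from step 1: initialize a list that holds the best scores and record the parameters that go with them.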

import yaml

# input_file, output_dir, json_files_dir, api_url and headers are expected
# to be defined earlier in the script.

# List of (maxnum, num_train_epochs, lr) tuples
top_5_results = []

# Read the original YAML file
with open('/mnt/workspace/LLaMA-Factory/autoFinetuning/trainyear.yaml', 'r') as file:
    config = yaml.safe_load(file)

# Parameter ranges
num_train_epochs_range = list(range(73, 76))  # 73 to 75, i.e. 7.3 to 7.5 epochs in steps of 0.1 (range(50, 101) gives the full 5.0 to 10.0 sweep)
# learning_rate_range = [0.0001 * (10 ** x) for x in range(4)] + [0.001 * (1 + y) for y in range(90)]  # 0.0001 to 0.1; uneven steps, but the whole range is covered
learning_rate_range = [5e-5]  # 5e-5, 3e-5

for epochs in num_train_epochs_range:
    num_train_epochs = epochs / 10.0  # convert to a float
    for lr in learning_rate_range:
        # Update the parameters in the config
        config['learning_rate'] = lr
        config['num_train_epochs'] = num_train_epochs

        # Write the YAML file back, replacing the previous parameters
        with open('/mnt/workspace/LLaMA-Factory/autoFinetuning/trainyear.yaml', 'w') as file:
            yaml.safe_dump(config, file)

        print(f'Updated trainyear.yaml with epochs={num_train_epochs} and lr={lr}')

        # Step 1: train the model with the updated YAML
        run_command_train()
        print("Step 1 done")

        # Step 2: run batch prediction with the trained model
        run_command_predict()
        print("Step 2 done")

        # Step 3: process the prediction JSONL file and save the text files
        process_jsonl_file(input_file, output_dir)
        print("Step 3 done")

        # Step 4: send the text files to the API and save the results as JSON files
        send_requests_and_save_results(output_dir, json_files_dir, api_url, headers)
        print("Step 4 done")

        # Step 5: average the 'is_human_written' values in the JSON files
        maxnum = calculate_average_and_sort_values(json_files_dir)
        print("Step 5 done")

        # Keep only the five best results, sorted by maxnum in descending order
        if maxnum is not None:
            top_5_results.append((maxnum, num_train_epochs, lr))
            top_5_results.sort(key=lambda x: x[0], reverse=True)
            del top_5_results[5:]

# Write top_5_results to a file
with open('top_5_results.txt', 'w') as f:
    for result in top_5_results:
        f.write(f'maxnum: {result[0]}, num_train_epochs: {result[1]}, lr: {result[2]}\n')
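The "keep the best five" bookkeeping can also be done after the fact with the standard library's heapq.nlargest, provided every run's result is collected first. This is only a sketch of that alternative: the function name top_k and the sample scores are invented, and a score of None stands for a run where no average could be computed.

```python
import heapq


def top_k(results, k=5):
    """Return the k highest-scoring (score, num_train_epochs, lr) tuples, best first."""
    scored = [r for r in results if r[0] is not None]  # drop runs without a score
    return heapq.nlargest(k, scored, key=lambda r: r[0])


# Made-up results: (score, num_train_epochs, lr)
results = [
    (0.61, 7.3, 5e-5),
    (0.72, 7.4, 5e-5),
    (None, 7.5, 5e-5),
    (0.55, 7.3, 3e-5),
    (0.80, 7.4, 3e-5),
    (0.40, 7.5, 3e-5),
    (0.66, 7.6, 3e-5),
]
best = top_k(results)
for score, epochs, lr in best:
    print(f"maxnum: {score}, num_train_epochs: {epochs}, lr: {lr}")
```

Collecting all results and ranking at the end also leaves a complete log of every run, which may help when choosing the next, narrower search range.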

Summary:

To be completed.