[LLM Instruction Fine-Tuning: Learning "Alchemy" from Scratch] Chapter 2: Dataset Preprocessing

gaoboy0316

Last modified 2024-10-04 21:00:59

Category: Fine-tune. Tags: Artificial Intelligence

First published 2024-09-30 15:21:21

Article link: https://blog.csdn.net/weixin_42980968/article/details/142655226

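LLM Instruction Fine-Tuning: Learning "Alchemy" from Scratch

Series Contents

Chapter 1: Building the Fine-Tuning Dataset
Chapter 2: Dataset Preprocessing
Chapter 3: Fine-Tuning Phi-3.5-mini with Q-LoRA
Chapter 4: Deploying the Fine-Tuned Model with Ollama

Chapter 2: Dataset Preprocessing

Environment Setup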

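# functools is part of the Python standard library, so it is not installed with pip.
# openpyxl is needed later by DataFrame.to_excel; sentencepiece is typically required by the slow tokenizer (use_fast=False).
pip install datasets transformers pandas duckdb openpyxl sentencepiece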

Imports

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    set_seed
)
import pandas as pd
import os
import duckdb
from functools import partial

seed = 42
set_seed(seed)

os.environ['HTTP_PROXY'] = 'http://127.0.0.1:7890'
os.environ['HTTPS_PROXY'] = 'http://127.0.0.1:7890'

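The HTTP_PROXY and HTTPS_PROXY variables in the code set a proxy for downloading model files from Hugging Face. Change the address to match your own proxy, or remove the two lines if you can reach Hugging Face directly.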

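Reading the Data

Datasets are usually read and processed directly with pandas. Here I chose duckdb, which has become popular recently: it can read or modify the data in a DataFrame through SQL.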

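# Read the data from a duckdb database
# Connect to the duckdb database file
conn = duckdb.connect('./preprocess/data_labeler.db')

# Run a query and fetch the result as a pandas DataFrame
df = conn.execute("SELECT * FROM labeled_data_after_augmentation").fetchdf()

# Close the database connection
conn.close()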

dataset = Dataset.from_pandas(df)

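Creating the Preprocessing Function and Helper Functions

We create a preprocessing function and a set of helper functions that turn the dataset into a form the LLM can consume, covering prompt formatting and tokenization.

Creating the Instruction Fine-Tuning Template

The create_prompt_template helper makes the dataset suitable for fine-tuning by explicitly converting each sample's input and output fields into an instruction-style prompt for the LLM.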

# Format each sample as an explicit instruction for the LLM
def create_prompt_template(data):
    INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    INPUT_KEY = "### Input:"
    RESPONSE_KEY = "### Output:"
    END_KEY = "### End"

    # Samples only carry 'input' and 'output' fields, so the intro stands on its own
    blurb = f"\n{INTRO_BLURB}"
    # Named input_context to avoid shadowing the built-in input()
    input_context = f"{INPUT_KEY}\n{data['input']}"
    response = f"{RESPONSE_KEY}\n{data['output']}"
    end = f"{END_KEY}"

    # Drop empty parts, then separate the remaining ones with blank lines
    parts = [part for part in [blurb, input_context, response, end] if part]

    formatted_prompt = "\n\n".join(parts)
    data["text"] = formatted_prompt

    return data
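
To see what the resulting prompt text looks like, here is a minimal standalone sketch of the same formatting idea. The function name format_sample and the example row are hypothetical, and the sketch leaves out the leading newline for readability.

```python
# Hypothetical standalone version of the formatting step, for illustration only.
INTRO_BLURB = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)


def format_sample(sample: dict) -> str:
    """Join the intro, input, output, and end marker with blank lines."""
    parts = [
        INTRO_BLURB,
        f"### Input:\n{sample['input']}",
        f"### Output:\n{sample['output']}",
        "### End",
    ]
    return "\n\n".join(parts)


# Made-up sample; real rows come from the labeled data table.
example = {"input": "Translate 'hello' to French.", "output": "bonjour"}
text = format_sample(example)
print(text)
```

The "### End" marker gives the model a consistent stop pattern to learn, which is useful when deciding where to cut generation at inference time.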

Getting the Maximum Length Supported by the Model

Read the model's config to find the longest sequence the model supports.

def get_max_length(model):
    # Different architectures store the context length under different config attributes
    conf = model.config
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(conf, length_setting, None)
        if max_length:
            print(f"Found max length: {max_length}")
            break
    if not max_length:
        # None of the known attributes exist, so fall back to a conservative default
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length
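
The attribute lookup can be tried without downloading a model. The sketch below assumes stand-in config objects built with SimpleNamespace; the helper name lookup_max_length and the numeric values are hypothetical.

```python
from types import SimpleNamespace


def lookup_max_length(config, default=1024):
    """Return the first context-length attribute found on config, else default."""
    for name in ("n_positions", "max_position_embeddings", "seq_length"):
        value = getattr(config, name, None)
        if value:
            return value
    return default


# Stand-in config objects with made-up values, so no model download is needed.
gpt2_style = SimpleNamespace(n_positions=2048)
llama_style = SimpleNamespace(max_position_embeddings=4096)
unknown_style = SimpleNamespace(hidden_size=768)

print(lookup_max_length(gpt2_style))     # 2048
print(lookup_max_length(llama_style))    # 4096
print(lookup_max_length(unknown_style))  # 1024
```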

Tokenization Helper Function

Define a helper function that calls the tokenizer on a batch of samples.

def preprocess_batch(batch, tokenizer, max_length):
    return tokenizer(
        batch["text"],
        max_length=max_length,
        truncation=True,
    )
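
When map is called with batched=True, the function receives a dict of column lists rather than a single row. The sketch below shows that calling convention together with functools.partial; toy_tokenizer is a hypothetical stand-in that counts whitespace-separated words, not a real tokenizer.

```python
from functools import partial


def toy_tokenizer(texts, max_length, truncation):
    """Hypothetical stand-in for a real tokenizer: one 'token' per word."""
    input_ids = []
    for text in texts:
        ids = list(range(len(text.split())))
        if truncation:
            ids = ids[:max_length]
        input_ids.append(ids)
    return {"input_ids": input_ids}


def preprocess_batch(batch, tokenizer, max_length):
    return tokenizer(batch["text"], max_length=max_length, truncation=True)


# Bind tokenizer and max_length once; the resulting callable takes only the batch.
tokenize_fn = partial(preprocess_batch, tokenizer=toy_tokenizer, max_length=4)

# A batch is a dict mapping each column name to a list of values.
batch = {"text": ["one two three", "one two three four five six"]}
encoded = tokenize_fn(batch)
lengths = [len(ids) for ids in encoded["input_ids"]]
print(lengths)  # [3, 4]
```

Note that the second sample comes out at exactly max_length tokens: with truncation enabled, an over-long sample never exceeds the limit, it just reaches it.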

Preprocessing Function

This function calls the helpers defined above and is what we use to process the dataset from here on.

def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int, seed: int, dataset: Dataset):
    """Format and tokenize the dataset so it is ready for training.
    :param tokenizer (AutoTokenizer): Model tokenizer
    :param max_length (int): Maximum number of tokens to emit from the tokenizer
    :param seed (int): Random seed used when shuffling
    :param dataset (Dataset): Dataset with 'id', 'input', 'output' and 'timestamp' columns
    """

    # Add the prompt text to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt_template)

    # Bind tokenizer and max_length so that map() only has to pass the batch
    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)

    # Tokenize in batches; truncation=True caps input_ids at max_length tokens
    dataset = dataset.map(
        _preprocessing_function,
        batched=True,
    )
    # Save samples whose input_ids reach max_length (truncated or exactly full) to an xlsx file for review
    df = dataset.filter(lambda sample: len(sample["input_ids"]) >= max_length).to_pandas()
    df = df[['id','input','output']]
    df.to_excel('long_samples.xlsx', index=False)

    # Remove the 'id', 'input', 'output' and 'timestamp' columns from the dataset
    dataset = dataset.remove_columns(['id','input','output','timestamp'])
    # Keep only samples shorter than max_length
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)
    # Shuffle dataset
    dataset = dataset.shuffle(seed=seed)

    return dataset

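Because max_length is set by hand later on, this function writes the samples that reach max_length (and were therefore most likely truncated) to a local file. Check what share of the data they make up before accepting that they are dropped.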
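
One simple way to check that share is to compare token counts against max_length. This is a minimal sketch with made-up token counts; the helper name long_sample_ratio is hypothetical.

```python
def long_sample_ratio(token_lengths, max_length):
    """Fraction of samples whose token count reaches max_length."""
    if not token_lengths:
        return 0.0
    flagged = sum(1 for n in token_lengths if n >= max_length)
    return flagged / len(token_lengths)


# Made-up token counts after tokenizing with truncation at max_length=4096.
lengths = [512, 4096, 1300, 4096, 2048, 900, 3000, 100]
ratio = long_sample_ratio(lengths, max_length=4096)
print(f"{ratio:.1%} of samples would be dropped")  # 25.0% of samples would be dropped
```

If the ratio is large, it may be worth raising max_length or shortening the source texts rather than discarding that much data.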

Loading the Tokenizer

model_id = "microsoft/Phi-3.5-mini-instruct"


# Create the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
    padding_side="left",
    add_eos_token=True,
    add_bos_token=True,
    use_fast=False
)

Processing the Data

max_length = 4096
dataset = preprocess_dataset(tokenizer, max_length, seed, dataset)

# Save the processed dataset to local disk
dataset.save_to_disk('./preprocess/data/processed_dataset')