A Complete Guide to Fine-Tuning GPT-3.5 with Knowledge Distillation

Latest recommended article published on 2024-10-12 12:26:23

llzwxh888

Tags: gpt-3, python, machine learning

Article link: https://blog.csdn.net/ppoojjj/article/details/140922627

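Introduction

Recent research shows that when GPT-4 evaluates text generated by large language models, its judgments align closely with those of human reviewers. In this article, we show how to use the llama_index library to distill that judging ability into GPT-3.5 by fine-tuning it on GPT-4's evaluations, so that its judgments move closer to GPT-4's and, by proxy, to those of human reviewers.

The main steps are:

Generate the datasets: train_dataset and test_dataset
Perform knowledge distillation (using train_dataset)
Evaluate the fine-tuned model on test_dataset

Generating the datasets: train_dataset and test_dataset

First, we use WikipediaReader to load the "History of ..." pages for several cities. These documents are the source material for our question dataset.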

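# Notebook cell: install the Wikipedia client and the LlamaIndex reader that wraps it
!pip install wikipedia llama-index-readers-wikipedia -q
from llama_index.readers.wikipedia import WikipediaReader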

train_cities = ["San Francisco", "Toronto", "New York", "Vancouver", "Montreal", "Boston"]
test_cities = ["Tokyo", "Singapore", "Paris"]

train_documents = WikipediaReader().load_data(pages=[f"History of {x}" for x in train_cities])
test_documents = WikipediaReader().load_data(pages=[f"History of {x}" for x in test_cities])

Next, we use DatasetGenerator to generate questions from these documents.

from llama_index.core.evaluation import DatasetGenerator
from llama_index.llms.openai import OpenAI

QUESTION_GEN_PROMPT = (
    "You are a Teacher/ Professor. Your task is to setup "
    "a quiz/examination. Using the provided context, formulate "
    "a single question that captures an important fact from the "
    "context. Restrict the question to the context information provided."
)
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.3)

train_dataset_generator = DatasetGenerator.from_documents(
    train_documents, question_gen_query=QUESTION_GEN_PROMPT, llm=llm, show_progress=True, num_questions_per_chunk=25
)
test_dataset_generator = DatasetGenerator.from_documents(
    test_documents, question_gen_query=QUESTION_GEN_PROMPT, llm=llm, show_progress=True, num_questions_per_chunk=25
)

train_questions = train_dataset_generator.generate_questions_from_nodes(num=200)
test_questions = test_dataset_generator.generate_questions_from_nodes(num=150)

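Generating answers

We will generate answers with two LLMs, Llama-2 and Mistral. To do that, we first build a vector index and a retriever over each document set.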

from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever

train_index = VectorStoreIndex.from_documents(documents=train_documents)
train_retriever = VectorIndexRetriever(index=train_index, similarity_top_k=2)

test_index = VectorStoreIndex.from_documents(documents=test_documents)
test_retriever = VectorIndexRetriever(index=test_index, similarity_top_k=2)
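
To make `similarity_top_k=2` concrete, here is a minimal sketch of top-k retrieval by cosine similarity in plain Python. It is not how VectorStoreIndex is implemented internally; the toy vectors and the helper names `cosine` and `top_k` are assumptions made purely for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunks most similar to the query, best first."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical 3-dimensional embeddings for four document chunks.
chunks = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.05, 0.0]
nearest = top_k(query, chunks, k=2)
```

With k=2, only the two closest chunks are handed to the LLM as context, which keeps the prompt short at the cost of possibly missing relevant text further down the ranking.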

Next, we create a retriever query engine for each model. These engines produce the answers.

import os

from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.llms.huggingface import HuggingFaceInferenceAPI

# Read the Hugging Face token from the environment instead of passing a literal placeholder string.
HUGGING_FACE_TOKEN = os.environ["HUGGING_FACE_TOKEN"]

def create_query_engine(hf_name: str, retriever: VectorIndexRetriever, hf_llm_generators: dict) -> RetrieverQueryEngine:
    # Build a query engine that answers with the named Hugging Face model over the given retriever.
    if hf_name not in hf_llm_generators:
        raise KeyError("model not listed in hf_llm_generators")
    llm = HuggingFaceInferenceAPI(model_name=hf_llm_generators[hf_name], context_window=2048, token=HUGGING_FACE_TOKEN)
    return RetrieverQueryEngine.from_args(retriever=retriever, llm=llm)

hf_llm_generators = {"mistral-7b-instruct": "mistralai/Mistral-7B-Instruct-v0.1", "llama2-7b-chat": "meta-llama/Llama-2-7b-chat-hf"}

train_query_engines = {mdl: create_query_engine(mdl, train_retriever, hf_llm_generators) for mdl in hf_llm_generators.keys()}
test_query_engines = {mdl: create_query_engine(mdl, test_retriever, hf_llm_generators) for mdl in hf_llm_generators.keys()}
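
The evaluation loop later in this article iterates over a `train_dataset` whose entries carry a `question`, two `answers`, and a `source`, but the step that builds it is not shown. A minimal sketch of how such entries could be assembled from the questions and query engines above follows. The helper name `build_pairwise_dataset` and the 1000-character source truncation are assumptions, and the stub classes stand in for real query engines so the sketch runs offline.

```python
import random
from types import SimpleNamespace


def build_pairwise_dataset(questions, query_engines, seed=0):
    """For each question, collect answers from two engines plus the top retrieved text.

    Each engine only needs a .query(question) method whose result converts to
    str and exposes .source_nodes[0].node.text.
    """
    rng = random.Random(seed)
    dataset = []
    for question in questions:
        # Pick two engines in random order so neither model always goes first.
        contenders = rng.sample(list(query_engines.items()), 2)
        answers, source = [], None
        for name, engine in contenders:
            response = engine.query(question)
            answers.append({"model": name, "text": str(response)})
            if source is None:
                # Assumed truncation length, chosen to keep the reference short.
                source = response.source_nodes[0].node.text[:1000]
        dataset.append({"question": question, "answers": answers, "source": source})
    return dataset


class StubResponse:
    """Hypothetical response object mimicking the attributes used above."""

    def __init__(self, text, source_nodes):
        self.text, self.source_nodes = text, source_nodes

    def __str__(self):
        return self.text


class StubEngine:
    """Hypothetical stand-in for a query engine, used only to exercise the helper."""

    def __init__(self, tag):
        self.tag = tag

    def query(self, question):
        node = SimpleNamespace(node=SimpleNamespace(text="retrieved context"))
        return StubResponse(f"{self.tag}: {question}", [node])


demo = build_pairwise_dataset(
    ["When was Toronto founded?"],
    {"a": StubEngine("a"), "b": StubEngine("b")},
)
```

With the real objects, the call would look like `build_pairwise_dataset(train_questions, train_query_engines)`, and likewise for the test split.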

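Evaluation and knowledge distillation

Next, we use GPT-4 to judge the generated answers. The fine-tuning handler records each GPT-4 call so that the records can later be used to fine-tune GPT-3.5.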

from llama_index.llms.openai import OpenAI
from llama_index.finetuning.callbacks import OpenAIFineTuningHandler
from llama_index.core.callbacks import CallbackManager
from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.core import Settings

main_finetuning_handler = OpenAIFineTuningHandler()
callback_manager = CallbackManager([main_finetuning_handler])
Settings.callback_manager = callback_manager

llm_4 = OpenAI(temperature=0, model="gpt-4", callback_manager=callback_manager)
gpt4_judge = PairwiseComparisonEvaluator(llm=llm_4)

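# Judge each question's two answers with GPT-4 (top-level await requires a notebook or another async context).
# train_dataset is a list of dicts with "question", "answers" (two entries) and "source" keys.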
for data_entry in train_dataset:
    final_eval_result = await gpt4_judge.aevaluate(query=data_entry["question"], response=data_entry["answers"][0]["text"], second_response=data_entry["answers"][1]["text"], reference=data_entry["source"])
    judgement = {"llm": "gpt_4", "score": final_eval_result.score, "text": final_eval_result.response, "source": final_eval_result.pairwise_source}
    data_entry["evaluations"] = [judgement]

main_finetuning_handler.save_finetuning_events("pairwise_finetuning_events.jsonl")
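
The fine-tuning step below reads `resolved_pairwise_finetuning_events.jsonl`, not the file just saved, and the article does not show how one becomes the other. One plausible route is sketched here under two assumptions that should be checked against your llama_index version: that the handler logs two events per evaluation (the original ordering first, then the flipped ordering used to counter position bias), and that `pairwise_source` compares equal to "original", "flipped", or "neither". The helper names `resolve_events` and `write_jsonl` are hypothetical.

```python
import json


def resolve_events(events, sources):
    """Keep, for each evaluation, the logged event whose ordering decided the verdict.

    events  -- logged events, assumed to alternate original, flipped, original, ...
    sources -- one label per evaluation: "original", "flipped", or anything else to skip
    """
    if len(events) != 2 * len(sources):
        raise ValueError("expected exactly two logged events per evaluation")
    kept = []
    for i, source in enumerate(sources):
        if source == "original":
            kept.append(events[2 * i])
        elif source == "flipped":
            kept.append(events[2 * i + 1])
        # Inconclusive evaluations contribute no training example.
    return kept


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


# Toy events standing in for the real chat-format records.
toy_events = [
    {"id": "q0-orig"}, {"id": "q0-flip"},
    {"id": "q1-orig"}, {"id": "q1-flip"},
    {"id": "q2-orig"}, {"id": "q2-flip"},
]
toy_sources = ["original", "flipped", "neither"]
resolved = resolve_events(toy_events, toy_sources)
```

With the real data, `events` would be the parsed lines of `pairwise_finetuning_events.jsonl`, `sources` would be `[e["evaluations"][0]["source"] for e in train_dataset]`, and the result would be written out with `write_jsonl("resolved_pairwise_finetuning_events.jsonl", resolved)`.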

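Fine-tuning and testing

Finally, we fine-tune GPT-3.5 on the recorded events. The fine-tuned model can then be evaluated as a judge and compared with the baseline model.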

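from llama_index.finetuning import OpenAIFinetuneEngine

# Note: this path differs from the "pairwise_finetuning_events.jsonl" file saved above.
# It is expected to hold the subset of those events kept for training, so it must exist before this step runs.
finetune_engine = OpenAIFinetuneEngine("gpt-3.5-turbo", "resolved_pairwise_finetuning_events.jsonl")
finetune_engine.finetune()

# Check the status of the current fine-tuning job
finetune_engine.get_current_job()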
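
The third step listed at the start, evaluating the fine-tuned judge on test_dataset, comes down to comparing its verdicts with GPT-4's on the same test pairs. A minimal sketch of one such metric follows. It assumes each judge yields a score of 1.0 (first answer wins), 0.0 (second answer wins), 0.5 (tie), or None when no verdict is reached; the helper name `agreement_rate` and the sample scores are made up for illustration.

```python
def agreement_rate(judge_scores, reference_scores):
    """Fraction of comparable pairs on which two judges give the same score.

    Pairs where either judge has no verdict (None) are excluded.
    Returns None if no pair is comparable.
    """
    if len(judge_scores) != len(reference_scores):
        raise ValueError("score lists must have the same length")
    comparable = [
        (j, r)
        for j, r in zip(judge_scores, reference_scores)
        if j is not None and r is not None
    ]
    if not comparable:
        return None
    return sum(j == r for j, r in comparable) / len(comparable)


# Made-up scores for five test pairs.
gpt4_scores = [1.0, 0.0, 0.5, 1.0, None]
finetuned_scores = [1.0, 0.0, 1.0, 1.0, 0.0]
rate = agreement_rate(finetuned_scores, gpt4_scores)
```

Running the same comparison for an untuned gpt-3.5-turbo judge gives the baseline against which the fine-tuned judge can be measured.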