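Introduction
Recent research suggests that GPT-4's judgments of text generated by large language models align closely with those of human reviewers. In this post, we show how to use the llama_index library to distill that ability into GPT-3.5, so that its judgments move closer to GPT-4's and, by proxy, closer to those of human reviewers.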
The main steps are:
- Generate the datasets: train_dataset and test_dataset
- Perform knowledge distillation (using train_dataset)
- Evaluate the fine-tuned model on test_dataset
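Generating the datasets: train_dataset and test_dataset
First, we use WikipediaReader to load the "History of ..." pages for several cities. These documents are the basis for our question dataset.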
!pip install wikipedia -q
from llama_index.readers.wikipedia import WikipediaReader
train_cities = ["San Francisco", "Toronto", "New York", "Vancouver", "Montreal", "Boston"]
test_cities = ["Tokyo", "Singapore", "Paris"]
train_documents = WikipediaReader().load_data(pages=[f"History of {x}" for x in train_cities])
test_documents = WikipediaReader().load_data(pages=[f"History of {x}" for x in test_cities])
Next, we use DatasetGenerator to generate questions from these documents.
from llama_index.core.evaluation import DatasetGenerator
from llama_index.llms.openai import OpenAI
QUESTION_GEN_PROMPT = (
    "You are a Teacher/ Professor. Your task is to setup "
    "a quiz/examination. Using the provided context, formulate "
    "a single question that captures an important fact from the "
    "context. Restrict the question to the context information provided."
)
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.3)
train_dataset_generator = DatasetGenerator.from_documents(
    train_documents,
    question_gen_query=QUESTION_GEN_PROMPT,
    llm=llm,
    show_progress=True,
    num_questions_per_chunk=25,
)
test_dataset_generator = DatasetGenerator.from_documents(
    test_documents,
    question_gen_query=QUESTION_GEN_PROMPT,
    llm=llm,
    show_progress=True,
    num_questions_per_chunk=25,
)
train_questions = train_dataset_generator.generate_questions_from_nodes(num=200)
test_questions = test_dataset_generator.generate_questions_from_nodes(num=150)
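Generating answers
We generate answers with two LLMs, Llama-2 and Mistral. First, we build a vector index and a retriever for each set of documents.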
from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
train_index = VectorStoreIndex.from_documents(documents=train_documents)
train_retriever = VectorIndexRetriever(index=train_index, similarity_top_k=2)
test_index = VectorStoreIndex.from_documents(documents=test_documents)
test_retriever = VectorIndexRetriever(index=test_index, similarity_top_k=2)
Next, we create a retriever query engine for each model. These engines are what generate the answers.
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.llms.huggingface import HuggingFaceInferenceAPI
def create_query_engine(hf_name: str, retriever: VectorIndexRetriever, hf_llm_generators: dict) -> RetrieverQueryEngine:
    """Create a RetrieverQueryEngine backed by a Hugging Face Inference API LLM."""
    if hf_name not in hf_llm_generators:
        raise KeyError("model not listed in hf_llm_generators")
    # "HUGGING_FACE_TOKEN" is a placeholder: replace it with your own Hugging Face access token
    llm = HuggingFaceInferenceAPI(model_name=hf_llm_generators[hf_name], context_window=2048, token="HUGGING_FACE_TOKEN")
    return RetrieverQueryEngine.from_args(retriever=retriever, llm=llm)
hf_llm_generators = {"mistral-7b-instruct": "mistralai/Mistral-7B-Instruct-v0.1", "llama2-7b-chat": "meta-llama/Llama-2-7b-chat-hf"}
train_query_engines = {mdl: create_query_engine(mdl, train_retriever, hf_llm_generators) for mdl in hf_llm_generators.keys()}
test_query_engines = {mdl: create_query_engine(mdl, test_retriever, hf_llm_generators) for mdl in hf_llm_generators.keys()}
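The code above sets up the query engines, but the step that collects their answers into train_dataset (which the evaluation loop iterates over) is not shown. The following is a minimal sketch of one way to do it, not the author's code. The helper name `build_pairwise_dataset`, the 1000-character truncation of the source passage, and the stub classes are assumptions for illustration; the dictionary keys (`question`, `answers`, `source`) match what the evaluation loop reads.

```python
import random
from types import SimpleNamespace


def build_pairwise_dataset(questions, query_engines, seed=0):
    """Answer each question with two engines and record both answers.

    `query_engines` maps a model name to any object with a `.query(str)` method,
    for example the RetrieverQueryEngine instances created above.
    """
    rng = random.Random(seed)
    dataset = []
    for question in questions:
        # Pick two engines in random order so neither model always comes first
        contenders = rng.sample(list(query_engines.items()), 2)
        answers = []
        source = None
        for name, engine in contenders:
            response = engine.query(question)
            answers.append({"model": name, "text": str(response)})
            # Keep the top retrieved passage as reference context (assumed truncation)
            if source is None and getattr(response, "source_nodes", None):
                source = response.source_nodes[0].node.text[:1000]
        dataset.append({"question": question, "answers": answers, "source": source})
    return dataset


class StubResponse:
    """Hypothetical stand-in for a query response, used to run the sketch offline."""

    def __init__(self, text, source_nodes):
        self.text = text
        self.source_nodes = source_nodes

    def __str__(self):
        return self.text


class StubEngine:
    """Hypothetical stand-in for a query engine, used to run the sketch offline."""

    def __init__(self, name):
        self.name = name

    def query(self, question):
        node = SimpleNamespace(node=SimpleNamespace(text="context for: " + question))
        return StubResponse(f"{self.name} answer to {question}", [node])


demo_engines = {"model-a": StubEngine("model-a"), "model-b": StubEngine("model-b")}
demo_dataset = build_pairwise_dataset(["Q1", "Q2"], demo_engines)
print(demo_dataset[0]["question"], len(demo_dataset[0]["answers"]))
```

With the real engines, the equivalent call would be `train_dataset = build_pairwise_dataset(train_questions, train_query_engines)`, and likewise for the test set.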
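Evaluation and knowledge distillation
Next, we use GPT-4 to judge the generated answers. The fine-tuning handler records these GPT-4 calls, and the recorded events become the training data for fine-tuning GPT-3.5.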
from llama_index.llms.openai import OpenAI
from llama_index.finetuning.callbacks import OpenAIFineTuningHandler
from llama_index.core.callbacks import CallbackManager
from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.core import Settings
main_finetuning_handler = OpenAIFineTuningHandler()
callback_manager = CallbackManager([main_finetuning_handler])
Settings.callback_manager = callback_manager
llm_4 = OpenAI(temperature=0, model="gpt-4", callback_manager=callback_manager)
gpt4_judge = PairwiseComparisonEvaluator(llm=llm_4)
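# Judge each pair of answers asynchronously with GPT-4. train_dataset is expected to be a list of
# dicts with "question", "answers" (two entries, each with a "text" field), and "source" keys.
# Top-level await works in a notebook; in a script, wrap the loop in an async function
# and run it with asyncio.run().
for data_entry in train_dataset:
    final_eval_result = await gpt4_judge.aevaluate(query=data_entry["question"], response=data_entry["answers"][0]["text"], second_response=data_entry["answers"][1]["text"], reference=data_entry["source"])
    judgement = {"llm": "gpt_4", "score": final_eval_result.score, "text": final_eval_result.response, "source": final_eval_result.pairwise_source}
    data_entry["evaluations"] = [judgement]
# Save the recorded GPT-4 calls as fine-tuning data
main_finetuning_handler.save_finetuning_events("pairwise_finetuning_events.jsonl")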
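Pairwise judges can favor whichever answer is presented first. One common way to check for this position bias is to judge each pair twice, once in each order, and keep only the verdicts that agree. The sketch below shows that resolution logic in plain Python. It assumes the score convention 1.0 = first answer preferred, 0.0 = second answer preferred, 0.5 = tie; the function name `resolve_pairwise_scores` is hypothetical and not part of llama_index.

```python
def resolve_pairwise_scores(forward_score, flipped_score):
    """Combine two verdicts on the same answer pair, judged in opposite orders.

    forward_score: score with the answers presented as (A, B)
    flipped_score: score with the answers presented as (B, A)
    Returns (score, status), with score expressed in the (A, B) ordering.
    """
    # Re-express the flipped verdict in the forward ordering
    flipped_as_forward = 1.0 - flipped_score
    if forward_score == flipped_as_forward:
        return forward_score, "consistent"
    # The verdict changed when the order changed: treat the pair as inconclusive
    return 0.5, "inconclusive"


print(resolve_pairwise_scores(1.0, 0.0))
print(resolve_pairwise_scores(1.0, 1.0))
```

Pairs that come back as inconclusive are natural candidates to drop before fine-tuning, since the judge's preference there depended on answer order rather than content.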
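Fine-tuning and testing
Finally, we fine-tune GPT-3.5 on the recorded events. The fine-tuned model can then be used as a judge and compared against the baseline model.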
from llama_index.finetuning import OpenAIFinetuneEngine
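# NOTE: "resolved_pairwise_finetuning_events.jsonl" is not written by the code shown earlier, which
# saves "pairwise_finetuning_events.jsonl". Point this path at the events file you actually produced.
finetune_engine = OpenAIFinetuneEngine("gpt-3.5-turbo", "resolved_pairwise_finetuning_events.jsonl")
finetune_engine.finetune()
# Check the status of the current fine-tuning job
finetune_engine.get_current_job()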
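To compare the fine-tuned judge with the baseline GPT-3.5 judge on test_dataset, one simple metric is how often each judge's verdict matches GPT-4's on the same answer pairs. The sketch below computes that agreement rate from two lists of scores. The function name `agreement_rate` and the choice to skip pairs where a judge produced no score (`None`) are assumptions for illustration, not the author's evaluation code.

```python
def agreement_rate(judge_scores, reference_scores):
    """Fraction of pairs on which a judge's score equals the reference score.

    Pairs where either score is None (no usable verdict) are skipped.
    Returns None if no pair can be compared.
    """
    if len(judge_scores) != len(reference_scores):
        raise ValueError("score lists must have the same length")
    compared = [
        (j, r)
        for j, r in zip(judge_scores, reference_scores)
        if j is not None and r is not None
    ]
    if not compared:
        return None
    matches = sum(1 for j, r in compared if j == r)
    return matches / len(compared)


# Made-up scores purely to show the call; these are not results from this article
gpt4_scores = [1.0, 0.0, 0.5, 1.0]
candidate_scores = [1.0, 0.0, 1.0, None]
demo_rate = agreement_rate(candidate_scores, gpt4_scores)
print(demo_rate)
```

Running the same function once with the fine-tuned judge's scores and once with the baseline judge's scores gives two numbers that can be compared directly.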
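Possible errors
- API call failures: make sure your API key is correct and that the API service is available.
- Model permission issues: if you use gated LLMs (such as Llama-2), make sure you have been granted access to the model.
- Dataset loading failures: check that the data source is reachable and that the data format is correct.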
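For transient API failures, a common mitigation is to retry with exponential backoff. The following is a generic sketch that is not tied to any particular client library. `with_retries` and its parameters are hypothetical names, and in practice you would narrow `exceptions` to the error types your client actually raises.

```python
import time


def with_retries(func, max_attempts=4, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Call func() and retry on failure, doubling the wait after each failed attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


# Offline demonstration: a function that fails twice and then succeeds.
# A list's append method stands in for time.sleep so the demo does not wait.
demo_calls = {"count": 0}


def flaky_call():
    demo_calls["count"] += 1
    if demo_calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


demo_delays = []
demo_result = with_retries(flaky_call, sleep=demo_delays.append)
print(demo_result, demo_delays)
```

A real call would wrap the network operation, for example `with_retries(lambda: engine.query(question))`.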
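Conclusion
With the steps above, we distilled GPT-4's judging behavior into GPT-3.5, bringing its performance closer to GPT-4's. The fine-tuned model performs better on the evaluation task and shows less position bias.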