使用 LlamaIndex 进行文档处理和评估
在本文中,我们将探讨如何使用 LlamaIndex 库进行文档处理和评估。LlamaIndex 提供了一整套工具用于加载、处理、和评估文档,特别是在处理 AI 生成内容的时候非常有用。我们将通过几步简单的操作展示如何使用该库,并提供一些代码示例。
安装依赖
首先,我们需要安装 LlamaIndex 以及其他相关依赖。
%pip install llama-index-readers-file
%pip install llama-index-llms-openai
%pip install llama-index-embeddings-openai
!pip install llama-index
加载数据和设置
我们将下载 Tesla 的 10-K 文件,并将其加载为 pandas DataFrame 进行处理。
import pandas as pd
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)
pd.set_option("display.max_colwidth", None)
!wget "https://www.dropbox.com/scl/fi/mlaymdy1ni1ovyeykhhuk/tesla_2021_10k.htm?rlkey=qf9k4zn0ejrbm716j0gg7r802&dl=1" -O tesla_2021_10k.htm
!wget "https://www.dropbox.com/scl/fi/rkw0u959yb4w8vlzz76sa/tesla_2020_10k.htm?rlkey=tfkdshswpoupav5tqigwz1mp7&dl=1" -O tesla_2020_10k.htm
from llama_index.readers.file import FlatReader
from pathlib import Path
reader = FlatReader()
docs = reader.load_data(Path("./tesla_2020_10k.htm"))
生成评估数据集并定义评估函数
接下来,我们将生成“黄金”评估数据集,并定义用于评估的函数。
from llama_index.core.evaluation import DatasetGenerator, QueryResponseDataset
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.readers.file import FlatReader
from llama_index.core.node_parser import HTMLNodeParser, SentenceSplitter
from llama_index.core.ingestion import IngestionPipeline
from pathlib import Path
import nest_asyncio
nest_asyncio.apply()
reader = FlatReader()
docs = reader.load_data(Path("./tesla_2020_10k.htm"))
pipeline = IngestionPipeline(
documents=docs,
transformations=[
HTMLNodeParser.from_defaults(),
SentenceSplitter(chunk_size=1024, chunk_overlap=200),
OpenAIEmbedding(),
],
)
eval_nodes = pipeline.run(documents=docs)
eval_llm = OpenAI(model="gpt-3.5-turbo")
dataset_generator = DatasetGenerator(
eval_nodes[:100],
llm=eval_llm,
show_progress=True,
num_questions_per_chunk=3,
)
eval_dataset = dataset_generator.agenerate_dataset_from_nodes(num=100)
eval_dataset.save_json("data/tesla10k_eval_dataset.json")
eval_dataset = QueryResponseDataset.from_json(
"data/tesla10k_eval_dataset.json"
)
eval_qs = eval_dataset.questions
qr_pairs = eval_dataset.qr_pairs
ref_response_strs = [r for (_, r) in qr_pairs]
运行评估
现在我们将运行评估程序。
from llama_index.core.evaluation import (
CorrectnessEvaluator,
SemanticSimilarityEvaluator,
)
from llama_index.core.evaluation.eval_utils import (
get_responses,
get_results_df,
)
from llama_index.core.evaluation import BatchEvalRunner
evaluator_c = CorrectnessEvaluator(llm=eval_llm)
evaluator_s = SemanticSimilarityEvaluator(llm=eval_llm)
evaluator_dict = {
"correctness": evaluator_c,
"semantic_similarity": evaluator_s,
}
batch_eval_runner = BatchEvalRunner(
evaluator_dict, workers=2, show_progress=True
)
from llama_index.core import VectorStoreIndex
async def run_evals(
pipeline, batch_eval_runner, docs, eval_qs, eval_responses_ref
):
nodes = pipeline.run(documents=docs)
vector_index = VectorStoreIndex(nodes)
query_engine = vector_index.as_query_engine()
pred_responses = get_responses(eval_qs, query_engine, show_progress=True)
eval_results = await batch_eval_runner.aevaluate_responses(
eval_qs, responses=pred_responses, reference=eval_responses_ref
)
return eval_results
实验不同的处理方法
我们可以尝试不同的处理方法并评估其质量。下面是尝试不同句子分割策略的示例:
from llama_index.core.node_parser import HTMLNodeParser, SentenceSplitter
sent_parser_o0 = SentenceSplitter(chunk_size=1024, chunk_overlap=0)
sent_parser_o200 = SentenceSplitter(chunk_size=1024, chunk_overlap=200)
sent_parser_o500 = SentenceSplitter(chunk_size=1024, chunk_overlap=600)
html_parser = HTMLNodeParser.from_defaults()
parser_dict = {
"sent_parser_o0": sent_parser_o0,
"sent_parser_o200": sent_parser_o200,
"sent_parser_o500": sent_parser_o500,
}
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.ingestion import IngestionPipeline
pipeline_dict = {}
for k, parser in parser_dict.items():
pipeline = IngestionPipeline(
documents=docs,
transformations=[
html_parser,
parser,
OpenAIEmbedding(),
],
)
pipeline_dict[k] = pipeline
eval_results_dict = {}
for k, pipeline in pipeline_dict.items():
eval_results = await run_evals(
pipeline, batch_eval_runner, docs, eval_qs, ref_response_strs
)
eval_results_dict[k] = eval_results
import pickle
pickle.dump(eval_results_dict, open("eval_results_1.pkl", "wb"))
eval_results_list = list(eval_results_dict.items())
results_df = get_results_df(
[v for _, v in eval_results_list],
[k for k, _ in eval_results_list],
["correctness", "semantic_similarity"],
)
display(results_df)
可能遇到的错误
- 网络问题:在下载 Tesla 的 10-K 文件时,如果网络连接不稳定,可能会导致下载失败。
- 路径问题:确保文件路径正确,否则可能会导致文件加载失败。
- API 调用问题:在使用 LlamaIndex 提供的 API 时,如果配置不正确或权限不足,可能会导致 API 调用失败。
如果你觉得这篇文章对你有帮助,请点赞,关注我的博客,谢谢!
参考资料: