In natural language processing (NLP), metadata extraction plays an important role in improving the quality of retrieval results. This article shows how to use LlamaIndex to perform automated metadata extraction and thereby improve retrieval quality. We demonstrate two extractors: QuestionsAnsweredExtractor, which generates a set of questions each text chunk can answer, and SummaryExtractor, which extracts summaries of the text.
Environment Setup
First, we configure LlamaIndex. If you are opening this notebook in Google Colab, you may need to install LlamaIndex and its dependencies first.
!pip install llama-index-llms-openai
!pip install llama-index-readers-web
!pip install llama-index
import nest_asyncio

# Allow nested event loops (needed to run the async extractors inside a notebook)
nest_asyncio.apply()

import os
import openai

from llama_index.core import set_global_handler

# Optional: trace LlamaIndex runs with Weights & Biases
set_global_handler("wandb", run_args={"project": "llamaindex"})

os.environ["OPENAI_API_KEY"] = "sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]
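If you keep the W&B tracing handler above, you will likely also need the corresponding callback integration and the wandb client installed; the package name below is assumed from the llama-index-callbacks-* naming convention and may differ in your version:
!pip install llama-index-callbacks-wandb wandb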
Defining the Metadata Extractors
We define two sets of extractors. extractors_1 contains only the QuestionsAnsweredExtractor, while extractors_2 contains both the SummaryExtractor and the QuestionsAnsweredExtractor.
from llama_index.llms.openai import OpenAI
from llama_index.core.schema import MetadataMode
llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo", max_tokens=512)  # uses the OpenAI API; a relay/proxy endpoint can also be configured (see the notes at the end)
Instantiate the extractors:
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.core.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
)

# Split documents into 256-token chunks with 128 tokens of overlap
node_parser = TokenTextSplitter(
    separator=" ", chunk_size=256, chunk_overlap=128
)

# extractors_1: questions-answered metadata only
extractors_1 = [
    QuestionsAnsweredExtractor(
        questions=3, llm=llm, metadata_mode=MetadataMode.EMBED
    ),
]

# extractors_2: summaries of the previous/current/next chunks, plus questions
extractors_2 = [
    SummaryExtractor(summaries=["prev", "self", "next"], llm=llm),
    QuestionsAnsweredExtractor(
        questions=3, llm=llm, metadata_mode=MetadataMode.EMBED
    ),
]
Loading Data and Running the Extractors
We load data from one of Eugene Yan's articles and run the extractors on it.
from llama_index.readers.web import SimpleWebPageReader
reader = SimpleWebPageReader(html_to_text=True)
docs = reader.load_data(urls=["https://eugeneyan.com/writing/llm-patterns/"])
orig_nodes = node_parser.get_nodes_from_documents(docs)
# take a slice of the parsed chunks (nodes 20-27) to enrich with metadata
nodes = orig_nodes[20:28]
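As an optional sanity check (not part of the original flow), you can inspect how many chunks were produced and what one of the selected chunks looks like before any metadata is attached:
print(len(orig_nodes))  # total number of parsed chunks
print(nodes[1].get_content(metadata_mode="all"))  # one chunk from the slice we will enrich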
Run the metadata extractors:
from llama_index.core.ingestion import IngestionPipeline

# Pipeline 1: re-split the nodes, then add questions-answered metadata
pipeline = IngestionPipeline(transformations=[node_parser, *extractors_1])
nodes_1 = pipeline.run(nodes=nodes, in_place=False, show_progress=True)

# Pipeline 2: re-split the nodes, then add summaries and questions-answered metadata
pipeline = IngestionPipeline(transformations=[node_parser, *extractors_2])
nodes_2 = pipeline.run(nodes=nodes, in_place=False, show_progress=True)
Inspecting Some of the Output
print(nodes_2[3].get_content(metadata_mode="all"))
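To see just the metadata each pipeline attached, you can also print the metadata dictionaries directly (the exact keys, such as questions_this_excerpt_can_answer and section_summary, are the extractor defaults and may vary by version):
print(nodes_1[3].metadata)  # questions-only metadata
print(nodes_2[3].metadata)  # summaries + questions metadata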
Setting Up RAG Query Engines and Comparing Results
from llama_index.core import VectorStoreIndex

# Three indexes: no extracted metadata, questions only, and summaries + questions
index0 = VectorStoreIndex(orig_nodes)
index1 = VectorStoreIndex(orig_nodes[:20] + nodes_1 + orig_nodes[28:])
index2 = VectorStoreIndex(orig_nodes[:20] + nodes_2 + orig_nodes[28:])

query_engine0 = index0.as_query_engine(similarity_top_k=1)
query_engine1 = index1.as_query_engine(similarity_top_k=1)
query_engine2 = index2.as_query_engine(similarity_top_k=1)
query_str = (
    "Can you describe metrics for evaluating text generation quality, compare"
    " them, and tell me about their downsides"
)
response0 = query_engine0.query(query_str)
response1 = query_engine1.query(query_str)
response2 = query_engine2.query(query_str)
print(response0.source_nodes[0].node.get_content())
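The line above only prints the chunk retrieved by the baseline index. To compare the three pipelines side by side, you can also print the retrieved chunk and the generated answer from each engine (a small extension of the original example):
for name, response in [("base", response0), ("questions", response1), ("summaries+questions", response2)]:
    print(f"=== {name} ===")
    print(response.source_nodes[0].node.get_content())  # the retrieved chunk
    print(str(response))  # the generated answer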
Possible Errors
- API key errors: make sure the API key is set correctly and is referenced correctly in the code.
- Network connectivity issues: if you are using a relay/proxy API endpoint, make sure the connection to it works (a configuration sketch follows this list).
- Data loading issues: check that the source URL is valid and that the page content is well formed.
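If you do route traffic through a relay/proxy OpenAI-compatible endpoint, one way to point LlamaIndex at it is the api_base argument of the OpenAI LLM wrapper; the URL below is a placeholder, not a real endpoint:
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    temperature=0.1,
    model="gpt-3.5-turbo",
    max_tokens=512,
    api_base="https://your-relay-endpoint.example.com/v1",  # hypothetical relay URL
    api_key=os.environ["OPENAI_API_KEY"],
)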
If you found this article helpful, please like and follow my blog. Thank you!