LlamaIndex Metadata Extraction

Latest recommended article published on 2024-07-31 16:04:10

需要重新演唱

Category: llamaindex  Tags: llamaindex RAG AI

Article link: https://blog.csdn.net/xycxycooo/article/details/140800516

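Metadata Extraction

Concept

In many cases, especially with long documents, a chunk of text lacks the context needed to distinguish it from other, similar chunks. To address this, we use large language models (LLMs) to extract contextual information about the document, which helps both the retriever and the language model tell similar-looking passages apart.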
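
To make the problem concrete, here is a minimal plain-Python sketch. It does not use LlamaIndex; the helper `with_context`, the two sample chunks, and the report titles are all hypothetical, made up for illustration. It only shows that two chunks with identical wording become distinguishable once a piece of document-level metadata is attached to each.

```python
# Plain-Python illustration; no LlamaIndex imports are needed.
# `with_context` is a hypothetical helper, not a library function.

def with_context(text: str, metadata: dict) -> str:
    """Prefix a chunk with its metadata rendered as 'key: value' lines."""
    header = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    return f"{header}\n\n{text}"

# Two chunks from different (made-up) reports that happen to read the same.
chunk_a = "Revenue grew 12% year over year."
chunk_b = "Revenue grew 12% year over year."

enriched_a = with_context(chunk_a, {"document_title": "Company A 2019 Annual Report"})
enriched_b = with_context(chunk_b, {"document_title": "Company B 2021 Annual Report"})

print(chunk_a == chunk_b)        # the raw chunks cannot be told apart
print(enriched_a == enriched_b)  # the enriched chunks can
```

In a real pipeline the title would come from an extractor rather than being typed by hand, but the effect on the text that gets embedded or shown to the LLM is the same in spirit.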

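Usage

First, we define the metadata extractors, which are processed in order. In the code below they are placed in a list of transformations after the node parser (SentenceSplitter), so each extractor adds its metadata to the nodes the parser produces.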

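from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
    KeywordExtractor,
)
# EntityExtractor ships in the separate llama-index-extractors-entity package.
from llama_index.extractors.entity import EntityExtractor

# Transformations run in list order: split into nodes first, then enrich them.
transformations = [
    SentenceSplitter(),
    TitleExtractor(nodes=5),  # document title inferred from the first 5 nodes
    QuestionsAnsweredExtractor(questions=3),  # 3 questions each node can answer
    SummaryExtractor(summaries=["prev", "self"]),  # previous and current section summaries
    KeywordExtractor(keywords=10),  # up to 10 keywords per node
    EntityExtractor(prediction_threshold=0.5),  # keep entities at or above this confidence
]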

We can then run these transformations on the input documents or nodes:

from llama_index.core.ingestion import IngestionPipeline

pipeline = IngestionPipeline(transformations=transformations)

# `documents` is assumed to be a list of Document objects loaded earlier.
nodes = pipeline.run(documents=documents)
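
Because the transformations run in order, a later step can rely on metadata that an earlier step has already written. The sketch below imitates that idea in plain Python. `Node`, `add_word_count`, `add_size_label`, and `run_pipeline` are hypothetical names invented for illustration; they are not part of LlamaIndex, and the real IngestionPipeline does considerably more than this loop.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stand-in for a node: {"text": str, "metadata": dict}.
Node = Dict[str, Any]
Transformation = Callable[[List[Node]], List[Node]]

def add_word_count(nodes: List[Node]) -> List[Node]:
    """First step: record how many whitespace-separated words each node has."""
    for node in nodes:
        node["metadata"]["word_count"] = len(node["text"].split())
    return nodes

def add_size_label(nodes: List[Node]) -> List[Node]:
    """Second step: depends on 'word_count' written by the earlier step."""
    for node in nodes:
        count = node["metadata"]["word_count"]
        node["metadata"]["size"] = "long" if count > 5 else "short"
    return nodes

def run_pipeline(nodes: List[Node], transformations: List[Transformation]) -> List[Node]:
    """Apply each transformation in list order, feeding the output forward."""
    for transform in transformations:
        nodes = transform(nodes)
    return nodes

nodes = [{"text": "Uber operates in 69 countries.", "metadata": {}}]
result = run_pipeline(nodes, [add_word_count, add_size_label])
print(result[0]["metadata"])
```

Swapping the two steps in the list would raise a `KeyError`, which is the plain-Python analogue of placing an extractor before the step that produces the metadata it reads.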

Here is an example of the extracted metadata:

{'page_label': '2',
 'file_name': '10k-132.pdf',
 'document_title': 'Uber Technologies, Inc. 2019 Annual Report: Revolutionizing Mobility and Logistics Across 69 Countries and 111 Million MAPCs with $65 Billion in Gross Bookings',
 'questions_this_excerpt_can_answer': '\n\n1. How many countries does Uber Technologies, Inc. operate in?\n2. What is the total number of MAPCs served by Uber Technologies, Inc.?\n3. How much gross bookings did Uber Technologies, Inc. generate in 2019?',
 'prev_section_summary': "\n\nThe 2019 Annual Report provides an overview of the key topics and entities that have been important to the organization over the past year. These include financial performance, operational highlights, customer satisfaction, employee engagement, and sustainability initiatives. It also provides an overview of the organization's strategic objectives and goals for the upcoming year.",
 'section_summary': '\nThis section discusses a global tech platform that serves multiple multi-trillion dollar markets with products leveraging core technology and infrastructure. It enables consumers and drivers to tap a button and get a ride or work. The platform has revolutionized personal mobility with ridesharing and is now leveraging its platform to redefine the massive meal delivery and logistics industries. The foundation of the platform is its massive network, leading technology, operational excellence, and product expertise.',
 'excerpt_keywords': '\nRidesharing, Mobility, Meal Delivery, Logistics, Network, Technology, Operational Excellence, Product Expertise, Point A, Point B'}

Custom Extractors

If the provided extractors do not fit your needs, you can also define a custom extractor, as shown below:

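from typing import Dict, List

from llama_index.core.extractors import BaseExtractor

class CustomExtractor(BaseExtractor):
    async def aextract(self, nodes) -> List[Dict]:
        # Relies on "document_title" and "excerpt_keywords" already being set,
        # so this extractor must run after TitleExtractor and KeywordExtractor.
        metadata_list = [
            {
                "custom": node.metadata["document_title"]
                + "\n"
                + node.metadata["excerpt_keywords"]
            }
            for node in nodes
        ]
        return metadata_list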

extractor.extract() automatically calls aextract() under the hood, so both a synchronous and an asynchronous entry point are available.
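
The sync-over-async pattern described above can be sketched in a few lines of plain Python. `SketchExtractor` and `LengthExtractor` are hypothetical classes written for this illustration; this is a simplified stand-in and not LlamaIndex's actual BaseExtractor code. The only assumption taken from the text is that the subclass implements `aextract()` and the base class supplies `extract()`.

```python
import asyncio
from typing import Dict, List

class SketchExtractor:
    """Hypothetical base class; a simplified stand-in, not LlamaIndex's BaseExtractor."""

    async def aextract(self, nodes: List[dict]) -> List[Dict]:
        raise NotImplementedError

    def extract(self, nodes: List[dict]) -> List[Dict]:
        # Synchronous entry point: run the coroutine to completion.
        return asyncio.run(self.aextract(nodes))

class LengthExtractor(SketchExtractor):
    """Subclasses only implement the async method."""

    async def aextract(self, nodes: List[dict]) -> List[Dict]:
        return [{"char_count": len(node["text"])} for node in nodes]

nodes = [{"text": "hello"}, {"text": "metadata"}]
sync_result = LengthExtractor().extract(nodes)
print(sync_result)
```

Note that `asyncio.run` cannot be called from inside an already running event loop (for example in some notebook environments); in that situation you would await `aextract()` directly instead.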

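In a more advanced example, a custom extractor can also use an LLM to extract features from the node content and the existing metadata. For more details, see the source code of the provided metadata extractors.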
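
As a rough sketch of what such an LLM-backed extractor might look like, the plain-Python example below builds a prompt from each node's text and existing metadata and sends it to a completion function. Everything here is hypothetical: `fake_llm` is a stub that stands in for a real model call, and `extract_topics`, the prompt wording, and the `topic` key are invented for illustration. It does not use LlamaIndex's LLM classes.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

async def fake_llm(prompt: str) -> str:
    """Hypothetical stub for an LLM call: echoes part of the prompt's last line."""
    return "TOPIC: " + prompt.splitlines()[-1][:20]

async def extract_topics(
    nodes: List[dict],
    complete: Callable[[str], Awaitable[str]],
) -> List[Dict]:
    """Ask the completion function for one topic per node, concurrently."""

    async def one(node: dict) -> Dict:
        # Combine existing metadata with the node content in the prompt.
        title = node["metadata"].get("document_title", "unknown")
        prompt = (
            f"Title: {title}\n"
            "Name the main topic of this excerpt:\n"
            f"{node['text']}"
        )
        answer = await complete(prompt)
        return {"topic": answer.strip()}

    return list(await asyncio.gather(*(one(node) for node in nodes)))

nodes = [
    {
        "text": "Ridesharing and meal delivery.",
        "metadata": {"document_title": "Annual Report"},
    }
]
topics = asyncio.run(extract_topics(nodes, fake_llm))
print(topics)
```

Passing the completion function in as an argument keeps the sketch testable with a stub; in a real extractor the same logic would sit inside `aextract()` and call an actual model.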

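Further Notes

Metadata extraction in LlamaIndex attaches document-level context, such as titles, summaries, keywords, and answerable questions, to each node. This helps both retrieval and the language model tell similar passages apart. How well it works depends on choosing and configuring extractors that suit your documents.

In practice, metadata extraction is best suited to long documents, complex documents, and scenarios that need strong context awareness. Examples include legal document analysis, academic paper retrieval, and corporate report processing, where the added metadata makes individual chunks easier to interpret and retrieve.

I hope these explanations and examples help you understand and use metadata extraction. If you have any questions or need further explanation, feel free to ask.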
