Exploring the Doctran Library: How to Extract Key Information from Documents

Latest recommended article published on 2025-05-22 16:41:47

stjklkjhgffxw

Tags: python

Article link: https://blog.csdn.net/stjklkjhgffxw/article/details/142408751

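Introduction

In an age of information overload, extracting useful features from documents matters more than ever. Such features support document classification and data mining, as well as tasks like style transfer. This article shows how to use the Doctran library, together with OpenAI's function calling feature, to extract specific metadata.

Main Content

Introduction to the Doctran Library

Doctran is a library for extracting structured data from text. It identifies key information in a document so that developers can analyze and process it further. Before using Doctran, make sure it is installed: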

%pip install --upgrade --quiet doctran
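
The `%pip` form above is a notebook magic; in a regular shell, drop the leading `%`. To confirm the package is importable before running the later steps, a minimal standard-library sketch follows. The helper name `is_installed` is hypothetical and not part of Doctran.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(module_name) is not None


# Print a hint instead of failing later with an ImportError.
if not is_installed("doctran"):
    print("doctran is not installed; run: pip install --upgrade doctran")
```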

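Use Cases

Document classification: sort documents into categories.
Data mining: extract structured data for analysis.
Style transfer: rewrite text in a style closer to what users expect, which can improve vector search results.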

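Extracting Document Properties with Doctran

This section shows how to use Doctran to extract information from a document.

First, import the required modules and load the environment variables: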

import json
from langchain_community.document_transformers import DoctranPropertyExtractor
from langchain_core.documents import Document
from dotenv import load_dotenv

load_dotenv()
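
`load_dotenv()` copies the entries of a local `.env` file into the process environment. The sketch below checks that the expected variables are present before any request is made. It assumes the extractor reads `OPENAI_API_KEY` and `OPENAI_API_MODEL` from the environment when they are not passed to the constructor. That is how the `langchain_community` versions known to me behave, but verify it against your installed version. The helper `missing_env_vars` is hypothetical.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Assumed variable names; verify against your installed langchain_community version.
REQUIRED_VARS = ("OPENAI_API_KEY", "OPENAI_API_MODEL")


def missing_env_vars(
    names: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```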

Prepare the document to extract properties from:

sample_text = """... (omitted) ..."""
documents = [Document(page_content=sample_text)]

Define the properties to extract:

properties = [
    {
        "name": "category",
        "description": "What type of email this is.",
        "type": "string",
        "enum": ["update", "action_item", "customer_feedback", "announcement", "other"],
        "required": True,
    },
    {
        "name": "mentions",
        "description": "A list of all people mentioned in this email.",
        "type": "array",
        "items": {
            "name": "full_name",
            "description": "The full name of the person mentioned.",
            "type": "string",
        },
        "required": True,
    },
    {
        "name": "eli5",
        "description": "Explain this email to me like I'm 5 years old.",
        "type": "string",
        "required": True,
    },
]
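
Each property above is a plain dictionary, so a typo in a key only surfaces once the request is sent. The sketch below runs a few local sanity checks first. The rules it applies (required keys, `items` for arrays, non-empty `enum`, unique names) are this sketch's own conventions, not Doctran's validation logic, and `check_properties` is a hypothetical helper.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("name", "description", "type")


def check_properties(properties: List[Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    seen = set()
    for index, prop in enumerate(properties):
        label = prop.get("name", f"<property #{index}>")
        for key in REQUIRED_KEYS:
            if key not in prop:
                problems.append(f"{label}: missing '{key}'")
        if prop.get("type") == "array" and "items" not in prop:
            problems.append(f"{label}: array property has no 'items'")
        if "enum" in prop and not prop["enum"]:
            problems.append(f"{label}: 'enum' is empty")
        if label in seen:
            problems.append(f"{label}: duplicate name")
        seen.add(label)
    return problems


example = [
    {"name": "category", "description": "What type of email this is.",
     "type": "string", "enum": ["update", "other"], "required": True},
    {"name": "mentions", "type": "array"},  # deliberately incomplete
]
for problem in check_properties(example):
    print(problem)
```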

Create the property extractor and extract the information:

property_extractor = DoctranPropertyExtractor(properties=properties)
extracted_document = property_extractor.transform_documents(documents, properties=properties)
print(json.dumps(extracted_document[0].metadata, indent=2))
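
Once the metadata is printed, downstream code usually needs the individual fields. The sketch below assumes the extractor stores its results under an `extracted_properties` key in `metadata`; check the printed output to confirm this for your version. The sample values are placeholders made up for illustration, not real output, and `get_extracted` is a hypothetical helper.

```python
from typing import Any, Dict

# Hypothetical metadata, shaped like what the extractor is assumed to produce.
sample_metadata = {
    "extracted_properties": {
        "category": "update",
        "mentions": ["Person A", "Person B"],
        "eli5": "A short, simple summary would appear here.",
    }
}


def get_extracted(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return the extracted properties, or an empty dict if the key is absent."""
    return metadata.get("extracted_properties") or {}


props = get_extracted(sample_metadata)
# Fall back to safe defaults in case a field is missing.
print(props.get("category", "other"))
print(len(props.get("mentions", [])))
```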

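Code Example

The complete code example below shows how to extract document properties with the Doctran library. The original setup routes requests through an API proxy service for more stable access. The code itself sets no proxy endpoint, so any such configuration has to live outside it, for example in the environment loaded by `load_dotenv()`: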

import json
from langchain_community.document_transformers import DoctranPropertyExtractor
from langchain_core.documents import Document
from dotenv import load_dotenv

load_dotenv()

sample_text = """... (omitted) ..."""
documents = [Document(page_content=sample_text)]
properties = [
    {
        "name": "category",
        "description": "What type of email this is.",
        "type": "string",
        "enum": ["update", "action_item", "customer_feedback", "announcement", "other"],
        "required": True,
    },
    {
        "name": "mentions",
        "description": "A list of all people mentioned in this email.",
        "type": "array",
        "items": {
            "name": "full_name",
            "description": "The full name of the person mentioned.",
            "type": "string",
        },
        "required": True,
    },
    {
        "name": "eli5",
        "description": "Explain this email to me like I'm 5 years old.",
        "type": "string",
        "required": True,
    },
]
property_extractor = DoctranPropertyExtractor(properties=properties)
extracted_document = property_extractor.transform_documents(documents, properties=properties)
print(json.dumps(extracted_document[0].metadata, indent=2))
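
Because the values come from a language model, they may not always match the schema, so checking them before use is a reasonable precaution. The following is a minimal sketch of such a check against the `properties` definitions. `validate_extracted` is a hypothetical helper, and the sample values are invented for illustration.

```python
from typing import Any, Dict, List


def validate_extracted(
    schema: List[Dict[str, Any]], values: Dict[str, Any]
) -> List[str]:
    """Compare extracted values with the property definitions; return any errors."""
    errors = []
    for prop in schema:
        name = prop["name"]
        if name not in values:
            if prop.get("required"):
                errors.append(f"{name}: required but missing")
            continue
        value = values[name]
        if "enum" in prop and value not in prop["enum"]:
            errors.append(f"{name}: {value!r} is not an allowed value")
        if prop.get("type") == "array" and not isinstance(value, list):
            errors.append(f"{name}: expected a list")
    return errors


schema = [
    {"name": "category", "type": "string",
     "enum": ["update", "action_item", "other"], "required": True},
    {"name": "mentions", "type": "array", "required": True},
]
# Hypothetical extracted values with one out-of-schema category.
values = {"category": "newsletter", "mentions": ["Person A"]}
for error in validate_extracted(schema, values):
    print(error)
```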