Exploring Azure AI Document Intelligence: Document Processing Has Never Been Easier

stjklkjhgffxw

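Introduction

Azure AI Document Intelligence (formerly Azure Form Recognizer) is a machine-learning-based service that extracts text (including handwriting), tables, document structure (such as titles and section headings), and key-value pairs from PDFs, images, Office files, and HTML. This post walks through using the service to process documents and convert them into LangChain documents, and covers the different ways the loader can be used.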

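Main Content

Feature Overview

Azure AI Document Intelligence supports a range of file formats: PDF, JPEG/JPG, PNG, BMP, TIFF, HEIF, DOCX, XLSX, PPTX, and HTML. The loader's mode setting controls whether text is extracted page by page or as a single document. The default output format is Markdown, which pairs well with MarkdownHeaderTextSplitter for semantic document chunking.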
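
To make concrete what header-based chunking does with Markdown output, here is a stdlib-only sketch. It is not LangChain's MarkdownHeaderTextSplitter, which is the tool to prefer in practice; it only illustrates the idea of cutting text at headings and tagging each chunk with the headings above it. The function name split_by_headers and the sample text are made up for this illustration, and the sketch ignores edge cases such as "#" lines inside code blocks.

```python
import re


def split_by_headers(markdown_text, max_level=2):
    """Split Markdown into chunks, each tagged with the headers above it."""
    header_re = re.compile(r"^(#{1,6})\s+(.*)$")
    chunks = []
    current_headers = {}  # header level -> header title
    buffer = []           # body lines collected since the last header

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"headers": dict(current_headers), "text": text})
        buffer.clear()

    for line in markdown_text.splitlines():
        match = header_re.match(line)
        if match and len(match.group(1)) <= max_level:
            flush()
            level = len(match.group(1))
            # A new header replaces any header at the same or a deeper level.
            for old_level in [lvl for lvl in current_headers if lvl >= level]:
                del current_headers[old_level]
            current_headers[level] = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return chunks


# Made-up sample resembling Markdown returned by a layout model.
sample = "# Invoice\n\nIssued by Contoso.\n\n## Line Items\n\nWidget x 2\n"
chunks = split_by_headers(sample)
for chunk in chunks:
    print(chunk["headers"], "->", chunk["text"])
```

Each chunk carries its heading path, so a retrieval step can later show which section a passage came from.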

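Prerequisites

To use Azure AI Document Intelligence you need a resource in one of the three preview regions: East US, West US2, or West Europe. See the Azure documentation on creating a resource for details. Once the resource exists, you will have an <endpoint> and a <key>, which are passed to the loader as parameters.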
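
Rather than pasting the endpoint and key into a script, one common option is to read them from environment variables. The sketch below shows one way to do that; the variable names AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT and AZURE_DOCUMENT_INTELLIGENCE_KEY and the helper load_credentials are assumptions made for this example, not names required by Azure or LangChain.

```python
import os

# Hypothetical variable names; use whatever your deployment defines.
ENDPOINT_VAR = "AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT"
KEY_VAR = "AZURE_DOCUMENT_INTELLIGENCE_KEY"


def load_credentials(env=None):
    """Return (endpoint, key) from the environment, failing early if unset."""
    env = os.environ if env is None else env
    endpoint = env.get(ENDPOINT_VAR, "").strip()
    key = env.get(KEY_VAR, "").strip()
    missing = [
        name
        for name, value in ((ENDPOINT_VAR, endpoint), (KEY_VAR, key))
        if not value
    ]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return endpoint, key


# Demonstration with a stand-in mapping instead of the real environment.
fake_env = {ENDPOINT_VAR: "https://example.invalid/", KEY_VAR: "not-a-real-key"}
endpoint, key = load_credentials(fake_env)
print(endpoint)
```

The returned pair can then be passed as api_endpoint and api_key when constructing the loader.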

Installing the Required Packages

Before you start, make sure the necessary Python packages are installed:

%pip install --upgrade --quiet langchain langchain-community azure-ai-documentintelligence

Code Examples

Example 1: Using a local file

The following example processes a document from a local file:

from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader

file_path = "<filepath>"
endpoint = "<endpoint>"
key = "<key>"
loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint=endpoint, api_key=key, file_path=file_path, api_model="prebuilt-layout"
)

documents = loader.load()
print(documents)

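Example 2: Using a public URL

A publicly accessible URL can also serve as input (this snippet reuses the import, endpoint, and key from Example 1):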

url_path = "<url>"
loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint=endpoint, api_key=key, url_path=url_path, api_model="prebuilt-layout"
)

documents = loader.load()
print(documents)

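Example 3: Loading a document page by page

Set mode="page" to load the document one page at a time, so that each page becomes its own LangChain document: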

from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader

file_path = "<filepath>"
endpoint = "<endpoint>"
key = "<key>"
loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint=endpoint,
    api_key=key,
    file_path=file_path,
    api_model="prebuilt-layout",
    mode="page",
)

documents = loader.load()
for document in documents:
    print(f"Page Content: {document.page_content}")
    print(f"Metadata: {document.metadata}")
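
Once the pages are separate documents, simple per-page processing becomes straightforward. The sketch below finds the pages that mention a keyword. It uses a stand-in PageDoc class so that it runs without an Azure resource, and it assumes the page number is stored under the metadata key "page"; check the metadata printed by your own run before relying on that key. The helper name pages_mentioning is made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Stand-in for a LangChain Document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_mentioning(documents, keyword):
    """Return page numbers whose text contains keyword (case-insensitive)."""
    hits = []
    for position, doc in enumerate(documents, start=1):
        if keyword.lower() in doc.page_content.lower():
            # Assumed metadata key "page"; fall back to the list position.
            hits.append(doc.metadata.get("page", position))
    return hits


sample_pages = [
    PageDoc("Invoice summary", {"page": 1}),
    PageDoc("Terms and conditions", {"page": 2}),
    PageDoc("Invoice line items"),  # no metadata: falls back to position 3
]
print(pages_mentioning(sample_pages, "invoice"))
```

The same function should work on the list returned by loader.load() in page mode, since it only reads page_content and metadata.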

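Example 4: Enabling high-resolution OCR

Passing "ocrHighResolution" in analysis_features turns on high-resolution OCR, which can improve recognition: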

from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader

file_path = "<filepath>"
endpoint = "<endpoint>"
key = "<key>"
analysis_features = ["ocrHighResolution"]
loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint=endpoint,
    api_key=key,
    file_path=file_path,
    api_model="prebuilt-layout",
    analysis_features=analysis_features,
)

documents = loader.load()
print(documents)