A Practical Guide to Document Parsing with Azure AI Document Intelligence

VYSAHF

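As information technology advances, more and more companies and organizations want to process documents in a variety of formats more efficiently. Azure AI Document Intelligence (formerly Azure Form Recognizer) is a machine-learning-based service that extracts text (including handwriting), tables, document structure (such as titles and section headings) and key-value pairs from digital or scanned PDFs, images, Office files and HTML files.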

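Technical Background

Azure AI Document Intelligence supports many file formats, including PDF, JPEG/JPG, PNG, BMP, TIFF, HEIF, DOCX, XLSX, PPTX and HTML, which makes it a general-purpose choice when documents come from mixed sources. Combined with the LangChain framework, it lets us turn a document's content, page by page, into LangChain documents and then split those documents into semantic chunks.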
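To make the idea of semantic chunking concrete, here is a minimal sketch that splits Markdown text at its headings using only the standard library. It assumes the loaded content is Markdown text, and it is a simplified stand-in for a header-based splitter such as LangChain's MarkdownHeaderTextSplitter, not that class's implementation. The function name `split_by_headings` and the sample text are hypothetical.

```python
import re


def split_by_headings(markdown_text: str) -> list[dict]:
    """Split Markdown into chunks, one per heading section (illustrative helper)."""
    chunks = []
    current = {"heading": None, "lines": []}

    def has_content(chunk: dict) -> bool:
        return chunk["heading"] is not None or any(l.strip() for l in chunk["lines"])

    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            # A new heading closes the chunk collected so far.
            if has_content(current):
                chunks.append(current)
            current = {"heading": match.group(2).strip(), "lines": []}
        else:
            current["lines"].append(line)

    if has_content(current):
        chunks.append(current)

    return [
        {"heading": c["heading"], "text": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]


# Hypothetical sample standing in for text returned by the loader.
sample = "# Invoice\nTotal due: 42 USD\n\n## Line items\n- Widget x 2\n"
chunks = split_by_headings(sample)
for chunk in chunks:
    print(chunk)
```

Each chunk keeps its heading next to its text, so a downstream retrieval step can tell which section a passage came from.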

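Core Principles

Azure AI Document Intelligence uses OCR and natural language processing to automatically recognize the different kinds of information in a document and return them in structured form. Developers can integrate with it through its API to automate document processing.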

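Code Walkthrough

The following code examples show how to use the Azure AI Document Intelligence service:

Example 1: Processing a Local File

First, create an AzureAIDocumentIntelligenceLoader instance; it initializes the document analysis client from the endpoint and key you pass in.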

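# Requires the langchain-community and azure-ai-documentintelligence packages.
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader

file_path = "path/to/your/document.pdf"
endpoint = "https://your-endpoint.cognitiveservices.azure.com/"
key = "your-api-key"
loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint=endpoint,
    api_key=key,
    file_path=file_path,
    api_model="prebuilt-layout",  # layout model: text, tables and structure
)

# Send the file to the service and collect the resulting LangChain documents.
documents = loader.load()
print(documents)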

Note: replace file_path, endpoint and key in the code above with your actual file path and service access parameters.
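Rather than hard-coding the key in source, one option is to read the endpoint and key from environment variables. The sketch below shows one way to do that. The variable names `AZURE_DOCINTEL_ENDPOINT` and `AZURE_DOCINTEL_KEY` and the helper `read_azure_settings` are assumptions made for illustration, not names defined by Azure or LangChain.

```python
import os


def read_azure_settings(env=os.environ) -> dict:
    """Read endpoint and key from (hypothetical) environment variables."""
    names = ("AZURE_DOCINTEL_ENDPOINT", "AZURE_DOCINTEL_KEY")
    values = {name: env.get(name, "").strip() for name in names}

    # Fail early with a clear message instead of sending an empty key.
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))

    # Keys match the loader's api_endpoint / api_key keyword arguments.
    return {
        "api_endpoint": values["AZURE_DOCINTEL_ENDPOINT"],
        "api_key": values["AZURE_DOCINTEL_KEY"],
    }


# Demo with an in-memory mapping instead of the real environment.
settings = read_azure_settings(
    {
        "AZURE_DOCINTEL_ENDPOINT": "https://your-endpoint.cognitiveservices.azure.com/",
        "AZURE_DOCINTEL_KEY": "your-api-key",
    }
)
print(sorted(settings))
```

The returned dictionary can be unpacked into the loader call with `**settings`, alongside file_path and api_model.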

Example 2: Processing an Online File

You can also pass a publicly accessible URL instead of a local path.

# endpoint and key are the values defined in Example 1.
url_path = "https://your-url-path/to/document.png"
loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint=endpoint,
    api_key=key,
    url_path=url_path,  # used in place of file_path
    api_model="prebuilt-layout",
)

documents = loader.load()
print(documents)
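Examples 1 and 2 differ only in whether the source is passed as file_path or url_path. When the source comes from user input, a small helper can pick the right keyword. This is a minimal sketch; `source_kwargs` is a hypothetical helper, and it assumes that anything with an http or https scheme should be treated as a URL.

```python
from urllib.parse import urlparse


def source_kwargs(source: str) -> dict:
    """Return {'url_path': ...} for http(s) sources, else {'file_path': ...}."""
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https"):
        return {"url_path": source}
    # Everything else (relative paths, absolute paths, drive letters) is a file.
    return {"file_path": source}


print(source_kwargs("https://your-url-path/to/document.png"))
print(source_kwargs("path/to/your/document.pdf"))
```

The result can be unpacked into the loader call with `**source_kwargs(source)`, so the rest of the call stays the same for both kinds of source.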

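Example 3: Loading a Document Page by Page

Setting mode="page" loads the document page by page, returning one LangChain document per page.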

from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader

# endpoint and key are the values defined in Example 1.
file_path = "path/to/your/document.pdf"
loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint=endpoint,
    api_key=key,
    file_path=file_path,
    api_model="prebuilt-layout",
    mode="page",  # one document per page
)

documents = loader.load()

# Print the text and metadata of each page.
for document in documents:
    print(f"Page Content: {document.page_content}")
    print(f"Metadata: {document.metadata}")
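Once documents are loaded page by page, a common next step is to find the pages that mention a given term. The sketch below does this with a stand-in `PageDoc` class so it runs without an Azure account. `PageDoc` and `pages_containing` are hypothetical names, and the code assumes each document's metadata may carry a `page` entry, falling back to the position in the list when it does not.

```python
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Stand-in for a LangChain Document: page text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_containing(documents, keyword: str) -> list:
    """Return the page numbers whose text contains keyword (case-insensitive)."""
    keyword = keyword.lower()
    hits = []
    for index, doc in enumerate(documents, start=1):
        if keyword in doc.page_content.lower():
            # Prefer the page number from metadata; otherwise use list position.
            hits.append(doc.metadata.get("page", index))
    return hits


# Hypothetical pages standing in for the output of loader.load().
docs = [
    PageDoc("Invoice number 1001", {"page": 1}),
    PageDoc("Terms and conditions", {"page": 2}),
    PageDoc("Invoice total: 42 USD"),
]
hits = pages_containing(docs, "invoice")
print(hits)
```

Because the function only reads `page_content` and `metadata`, the same code should work on the objects returned by the loader in page mode.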

Example 4: Using Add-on Features

Setting analysis_features=["ocrHighResolution"] enables the add-on high-resolution OCR capability.

from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader

# endpoint and key are the values defined in Example 1.
file_path = "path/to/your/document.pdf"
analysis_features = ["ocrHighResolution"]  # add-on capabilities to enable
loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint=endpoint,
    api_key=key,
    file_path=file_path,
    api_model="prebuilt-layout",
    analysis_features=analysis_features,
)

documents = loader.load()
print(documents)