Ollama-OCR Core Code Walkthrough: Extracting Text from Images and PDFs with Advanced LLMs

AI仙人掌

Last modified: 2025-04-01 13:06:55

Column: RAG Infrastructure: PDF Parsing. Tags: ocr, pdf, LLM, deep learning

First published: 2025-03-30 12:00:00

Article link: https://blog.csdn.net/qq_36603091/article/details/146718098

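Ollama-OCR is an optical character recognition (OCR) toolkit that uses vision-language models served through Ollama to extract text from images and PDFs. It can be used either as a Python package or through a Streamlit web application. What follows is a detailed look at the repository.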

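Main Features

Multiple file formats: handles PDFs and common image formats such as PNG, JPG, JPEG, TIFF, and BMP.
Multiple models:
- LLaVA 7B: an efficient vision-language model suited to real-time processing.
- Llama 3.2 Vision: a more advanced model with higher recognition accuracy on complex documents.
- Granite 3.2 Vision: a compact, efficient model designed for visual document understanding; it can automatically extract content from tables, charts, and similar elements.
- Moondream: a lightweight model designed for edge devices.
- Minicpm-v: handles images of any aspect ratio at up to 1.8 million pixels.
Multiple output formats:
- Markdown: preserves text formatting, including headings and lists.
- Plain Text: simple plain-text extraction.
- JSON: structured data format.
- Structured: for extracting tables and organized data.
- Key-Value Pairs: extracts labeled information.
- Table: extracts all tabular data.
Batch processing: processes multiple images in parallel and tracks progress per image.
Custom prompts: a custom prompt can override the default prompt for more targeted extraction.
Image preprocessing: before OCR, images are preprocessed, which includes converting PDFs to images, enhancing contrast, and reducing noise.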
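
To make the custom-prompt rule concrete, here is a minimal sketch of how the override can work. It mirrors the selection logic in the core code further down (a non-blank custom prompt wins, otherwise the prompt for the requested format is used, with "text" as the fallback). The names `choose_prompt` and `demo_prompts` are hypothetical and exist only for illustration, and the two demo prompts are shortened stand-ins rather than the project's real prompts.

```python
from typing import Dict, Optional


def choose_prompt(format_type: str,
                  custom_prompt: Optional[str],
                  default_prompts: Dict[str, str]) -> str:
    """Return the custom prompt if it is non-blank, else the default for format_type."""
    if custom_prompt and custom_prompt.strip():
        return custom_prompt
    # Unknown format types fall back to the plain-text prompt
    return default_prompts.get(format_type, default_prompts["text"])


# Shortened stand-in prompts (assumption: not the project's actual wording)
demo_prompts = {
    "markdown": "Extract all text and format it as markdown.",
    "text": "Extract all visible text without changes.",
}

print(choose_prompt("markdown", None, demo_prompts))
print(choose_prompt("markdown", "Only extract dates.", demo_prompts))
print(choose_prompt("no_such_format", "   ", demo_prompts))
```

Note that a prompt consisting only of whitespace counts as "no custom prompt", so the default is used in that case.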

Installation and Usage

Install dependencies

pip install -r requirements.txt

Quick start

Install Ollama: follow the official documentation.
Pull the required models:

ollama pull llama3.2-vision:11b
ollama pull granite3.2-vision
ollama pull moondream
ollama pull minicpm-v
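
Before running OCR it can help to confirm that the models were actually pulled. The sketch below is one possible check, not part of Ollama-OCR. It assumes Ollama's `GET /api/tags` endpoint returns a JSON object with a `models` list whose entries carry a `name` field, and that a model pulled without an explicit tag is listed as `name:latest`. The helper names `fetch_tags` and `missing_models` are hypothetical.

```python
import json
from typing import Any, Dict, List
from urllib.request import urlopen


def fetch_tags(base: str = "http://localhost:11434") -> Dict[str, Any]:
    """Ask a running Ollama server which models are available locally."""
    with urlopen(f"{base}/api/tags", timeout=5) as resp:
        return json.load(resp)


def missing_models(tags_response: Dict[str, Any], required: List[str]) -> List[str]:
    """Return the required model names that do not appear in the tags response."""
    installed = {m.get("name", "") for m in tags_response.get("models", [])}

    def is_installed(name: str) -> bool:
        # A name without a tag is assumed to be stored as "<name>:latest"
        candidates = {name} if ":" in name else {name, f"{name}:latest"}
        return bool(candidates & installed)

    return [name for name in required if not is_installed(name)]


# Offline example using a hand-written response in the assumed shape
sample = {"models": [{"name": "llama3.2-vision:11b"}, {"name": "moondream:latest"}]}
wanted = ["llama3.2-vision:11b", "granite3.2-vision", "moondream", "minicpm-v"]
print(missing_models(sample, wanted))
```

With a server running, `missing_models(fetch_tags(), wanted)` would perform the same check against the live model list.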

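Using the Python package

Single-file processing:

from ollama_ocr import OCRProcessor

# Initialize the OCR processor
# This base_url targets an Ollama server on the Docker host; the default is http://localhost:11434/api/generate
ocr = OCRProcessor(model_name='llama3.2-vision:11b', base_url="http://host.docker.internal:11434/api/generate")

# Process an image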
result = ocr.process_image(
    image_path="path/to/your/image.png",
    format_type="markdown",
    custom_prompt="Extract all text, focusing on dates and names.",
    language="English"
)
print(result)
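
Under the hood, `process_image` (shown in the core code section) posts a JSON body to the Ollama generate endpoint with the model name, the prompt, `stream` set to `False`, and the image as a base64 string. The sketch below rebuilds that body in isolation so its shape is easy to inspect. `build_generate_payload` is a hypothetical helper, and the image bytes are a placeholder rather than a real PNG.

```python
import base64
from typing import Any, Dict


def build_generate_payload(model_name: str, prompt: str, image_bytes: bytes) -> Dict[str, Any]:
    """Build the request body with the same four keys the walkthrough's code sends."""
    return {
        "model": model_name,
        "prompt": prompt,
        "stream": False,  # ask for one complete response instead of a stream
        "images": [base64.b64encode(image_bytes).decode("utf-8")],
    }


# Placeholder bytes standing in for the contents of an image file
payload = build_generate_payload(
    "llama3.2-vision:11b",
    "Extract all text, focusing on dates and names.",
    b"\x89PNG fake bytes",
)
print(sorted(payload.keys()))
```

The base64 step is what lets binary image data travel inside a JSON document. Decoding `payload["images"][0]` gives back the original bytes.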

Batch processing:

from ollama_ocr import OCRProcessor

# Initialize the OCR processor
ocr = OCRProcessor(model_name='llama3.2-vision:11b', max_workers=4)

# Process multiple images
batch_results = ocr.process_batch(
    input_path="path/to/images/folder",
    format_type="markdown",
    recursive=True,
    preprocess=True,
    custom_prompt="Extract all text, focusing on dates and names.",
    language="English"
)

# Access the results
for file_path, text in batch_results['results'].items():
    print(f"\nFile: {file_path}")
    print(f"Extracted Text: {text}")

# View the statistics
print("\nProcessing Statistics:")
print(f"Total images: {batch_results['statistics']['total']}")
print(f"Successfully processed: {batch_results['statistics']['successful']}")
print(f"Failed: {batch_results['statistics']['failed']}")
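
One caveat when reading these statistics: in the core code below, `process_image` catches its own exceptions and returns a string beginning with "Error processing image:", so a failed file tends to land in `results` rather than `errors`. The sketch below is one way to separate the two after the fact. `split_batch_results` is a hypothetical helper, and the prefix check is a heuristic that would misclassify a document whose real text starts with the same words.

```python
from typing import Dict, Tuple

# Prefix taken from the error string returned by process_image in the core code
ERROR_PREFIX = "Error processing image:"


def split_batch_results(results: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split a results mapping into (successful, failed) by looking for the error prefix."""
    ok: Dict[str, str] = {}
    failed: Dict[str, str] = {}
    for path, text in results.items():
        target = failed if text.startswith(ERROR_PREFIX) else ok
        target[path] = text
    return ok, failed


# Hand-written example in the shape of batch_results['results']
demo_results = {
    "scan1.png": "Invoice 2024-01-15",
    "scan2.png": "Error processing image: Could not read image at scan2.png",
}
ok, failed = split_batch_results(demo_results)
print(f"Successful: {len(ok)}, failed: {len(failed)}")
```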

Using the Streamlit web application

Clone the repository:

git clone https://github.com/imanoop7/Ollama-OCR.git
cd Ollama-OCR

Install dependencies:

pip install -r requirements.txt

cd src/ollama_ocr

Run the Streamlit app:

streamlit run app.py

Core Code Walkthrough

"""
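Heave-ho! This code is an OCR wizard that pulls the text out of images and PDFs.
It handles single images, batches of images, and PDF files, and can output different formats on request.
Let's start the tour of the annotated code.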
"""
import json
from typing import Dict, Any, List, Union
import os
import base64
import requests
from tqdm import tqdm
import concurrent.futures
from pathlib import Path
import cv2
import pymupdf 
import numpy as np

class OCRProcessor:
    def __init__(self, model_name: str = "llama3.2-vision:11b", 
                 base_url: str = "http://localhost:11434/api/generate",
                 max_workers: int = 1):
        """
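        Initialize the OCR processor and pack your magic toolkit.
        :param model_name: name of the model to use; defaults to the 11B Llama 3.2 Vision model
        :param base_url: URL of the Ollama generate endpoint; defaults to port 11434 on localhost
        :param max_workers: maximum number of worker threads for batch processing; go easy on the CPU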
        """
        self.model_name = model_name
        self.base_url = base_url
        self.max_workers = max_workers

    def _encode_image(self, image_path: str) -> str:
        """Encode the image file as a base64 string so it can be sent inside a JSON request."""
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode("utf-8")

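    def _pdf_to_images(self, pdf_path: str) -> List[str]:
        """
        Render every page of a PDF to an image, like photographing a book page by page.
        :param pdf_path: path to the PDF file
        :return: list of image paths, one per page
        :raises ValueError: if the PDF cannot be converted
        """
        try:
            image_paths = []
            # The context manager closes the document even if rendering fails
            with pymupdf.open(pdf_path) as doc:
                for page_num in range(doc.page_count):
                    page = doc[page_num]
                    pix = page.get_pixmap()  # render the page to a raster image
                    temp_path = f"{pdf_path}_page{page_num}.png"  # temporary image path
                    pix.save(temp_path)  # write the image to disk
                    image_paths.append(temp_path)
            return image_paths
        except Exception as e:
            # Chain the original exception so the root cause stays visible
            raise ValueError(f"Could not convert PDF to images: {e}") from e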

    def _preprocess_image(self, image_path: str, language: str = "en") -> str:
        """
        Give the image a spa treatment so the text is easier to recognize.
        :param image_path: path to the image
        :param language: language name or code; selects the binarization method
        :return: path to the preprocessed image
        :raises ValueError: if the image cannot be read
        """
        # Read the image
        image = cv2.imread(image_path)
        if image is None:
            raise ValueError(f"Could not read image at {image_path}")

        # Convert to grayscale
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

        # Enhance contrast with CLAHE (contrast-limited adaptive histogram equalization)
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
        enhanced = clahe.apply(gray)

        # Denoise to clean up the image
        denoised = cv2.fastNlMeansDenoising(enhanced)

        # Choose the binarization method based on the language
        if language.lower() in ["japanese", "chinese", "zh", "korean"]:
            # Chinese, Japanese, and Korean text: Gaussian adaptive thresholding
            thresh = cv2.adaptiveThreshold(
                denoised, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                cv2.THRESH_BINARY, 11, 2)
            thresh = cv2.bitwise_not(thresh)
        else:
            # Other languages: Otsu's method
            thresh = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
            thresh = cv2.bitwise_not(thresh)

        # Save the preprocessed image
        preprocessed_path = f"{image_path}_preprocessed.jpg"
        cv2.imwrite(preprocessed_path, thresh)

        return preprocessed_path

    def process_image(self, image_path: str, format_type: str = "markdown", preprocess: bool = True, 
                      custom_prompt: str = None, language: str = "en") -> str:
        """
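        The core magic: process a single image or PDF and extract its text.
        :param image_path: path to an image or PDF
        :param format_type: output format, one of ["markdown", "text", "json", "structured", "key_value", "table"]; any other value falls back to the "text" prompt
        :param preprocess: whether to preprocess the image first
        :param custom_prompt: custom prompt; when empty, the default prompt for format_type is used
        :param language: language name or code; affects preprocessing and the prompt
        :return: extracted text, formatted according to format_type; on failure, a string starting with "Error processing image:"
        """
        try:
            # PDF input: convert to page images and process each page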
            if image_path.lower().endswith('.pdf'):
                image_pages = self._pdf_to_images(image_path)
                print("No. of pages in the PDF", len(image_pages))
                responses = []
                for idx, page_file in enumerate(image_pages):
                    # Preprocess the page image if requested
                    if preprocess:
                        preprocessed_path = self._preprocess_image(page_file, language)
                    else:
                        preprocessed_path = page_file

                    image_base64 = self._encode_image(preprocessed_path)

                    if custom_prompt and custom_prompt.strip():
                        prompt = custom_prompt
                        print("Using custom prompt:", prompt)  # Debug print
                    else:
                        prompts = {
                            "markdown": f"""Extract all text content from this image in {language} **exactly as it appears**, without modification, summarization, or omission.
                                Format the output in markdown:
                                - Use headers (#, ##, ###) **only if they appear in the image**
                                - Preserve original lists (-, *, numbered lists) as they are
                                - Maintain all text formatting (bold, italics, underlines) exactly as seen
                                - **Do not add, interpret, or restructure any content**
                            """,
                            "text": f"""Extract all visible text from this image in {language} **without any changes**.
                                - **Do not summarize, paraphrase, or infer missing text.**
                                - Retain all spacing, punctuation, and formatting exactly as in the image.
                                - If text is unclear or partially visible, extract as much as possible without guessing.
                                - **Include all text, even if it seems irrelevant or repeated.** 
                                """,


                           "json": f"""Extract all text from this image in {language} and format it as JSON, **strictly preserving** the structure.
                                - **Do not summarize, add, or modify any text.**
                                - Maintain hierarchical sections and subsections as they appear.
                                - Use keys that reflect the document's actual structure (e.g., "title", "body", "footer").
                                - Include all text, even if fragmented, blurry, or unclear.
                                """,


                            "structured": f"""Extract all text from this image in {language}, **ensuring complete structural accuracy**:
                                - Identify and format tables **without altering content**.
                                - Preserve list structures (bulleted, numbered) **exactly as shown**.
                                - Maintain all section headings, indents, and alignments.
                                - **Do not add, infer, or restructure the content in any way.**
                                """,


                           "key_value": f"""Extract all key-value pairs from this image in {language} **exactly as they appear**:
                                - Identify and extract labels and their corresponding values without modification.
                                - Maintain the exact wording, punctuation, and order.
                                - Format each pair as 'key: value' **only if clearly structured that way in the image**.
                                - **Do not infer missing values or add any extra text.**
                                """,

                            "table": f"""Extract all tabular data from this image in {language} **exactly as it appears**, without modification, summarization, or omission.
                                - **Preserve the table structure** (rows, columns, headers) as closely as possible.
                                - **Do not add missing values or infer content**—if a cell is empty, leave it empty.
                                - Maintain all numerical, textual, and special character formatting.
                                - If the table contains merged cells, indicate them clearly without altering their meaning.
                                - Output the table in a structured format such as Markdown, CSV, or JSON, based on the intended use.
                                """,


                        }
                        prompt = prompts.get(format_type, prompts["text"])
                        print("Using default prompt:", prompt)  # Debug print

                    # Build the API request payload
                    payload = {
                        "model": self.model_name,
                        "prompt": prompt,
                        "stream": False,
                        "images": [image_base64]
                    }

                    # Send the request to Ollama
                    response = requests.post(self.base_url, json=payload)
                    response.raise_for_status()
                    res = response.json().get("response", "")
                    print("Page No. Processed", idx)
                    # Prefix the result with its page number
                    responses.append(f"Page {idx + 1}:\n{res}")

                    # Clean up temporary files
                    if preprocess and preprocessed_path.endswith('_preprocessed.jpg'):
                        os.remove(preprocessed_path)
                    if page_file.endswith('.png'):
                        os.remove(page_file)

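                final_result = "\n".join(responses)
                # Every page is prefixed with "Page N:", so the joined text is not valid JSON
                # and the json.loads call below falls through to returning the raw string
                if format_type == "json":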
                    try:
                        json_data = json.loads(final_result)
                        return json.dumps(json_data, indent=2)
                    except json.JSONDecodeError:
                        return final_result
                return final_result

            # Regular image input
            if preprocess:
                image_path = self._preprocess_image(image_path, language)

            image_base64 = self._encode_image(image_path)

            # Remove the temporary preprocessed image
            if image_path.endswith(('_preprocessed.jpg', '_temp.jpg')):
                os.remove(image_path)

            if custom_prompt and custom_prompt.strip():
                prompt = custom_prompt
                print("Using custom prompt:", prompt)
            else:
                prompts = {
                            "markdown": f"""Extract all text content from this image in {language} **exactly as it appears**, without modification, summarization, or omission.
                                Format the output in markdown:
                                - Use headers (#, ##, ###) **only if they appear in the image**
                                - Preserve original lists (-, *, numbered lists) as they are
                                - Maintain all text formatting (bold, italics, underlines) exactly as seen
                                - **Do not add, interpret, or restructure any content**
                            """,
                            "text": f"""Extract all visible text from this image in {language} **without any changes**.
                                - **Do not summarize, paraphrase, or infer missing text.**
                                - Retain all spacing, punctuation, and formatting exactly as in the image.
                                - If text is unclear or partially visible, extract as much as possible without guessing.
                                - **Include all text, even if it seems irrelevant or repeated.** 
                                """,


                           "json": f"""Extract all text from this image in {language} and format it as JSON, **strictly preserving** the structure.
                                - **Do not summarize, add, or modify any text.**
                                - Maintain hierarchical sections and subsections as they appear.
                                - Use keys that reflect the document's actual structure (e.g., "title", "body", "footer").
                                - Include all text, even if fragmented, blurry, or unclear.
                                """,


                            "structured": f"""Extract all text from this image in {language}, **ensuring complete structural accuracy**:
                                - Identify and format tables **without altering content**.
                                - Preserve list structures (bulleted, numbered) **exactly as shown**.
                                - Maintain all section headings, indents, and alignments.
                                - **Do not add, infer, or restructure the content in any way.**
                                """,


                           "key_value": f"""Extract all key-value pairs from this image in {language} **exactly as they appear**:
                                - Identify and extract labels and their corresponding values without modification.
                                - Maintain the exact wording, punctuation, and order.
                                - Format each pair as 'key: value' **only if clearly structured that way in the image**.
                                - **Do not infer missing values or add any extra text.**
                                """,

                            "table": f"""Extract all tabular data from this image in {language} **exactly as it appears**, without modification, summarization, or omission.
                                - **Preserve the table structure** (rows, columns, headers) as closely as possible.
                                - **Do not add missing values or infer content**—if a cell is empty, leave it empty.
                                - Maintain all numerical, textual, and special character formatting.
                                - If the table contains merged cells, indicate them clearly without altering their meaning.
                                - Output the table in a structured format such as Markdown, CSV, or JSON, based on the intended use.
                                """,
                }
                prompt = prompts.get(format_type, prompts["text"])
                print("Using default prompt:", prompt)  # Debug print

            payload = {
                "model": self.model_name,
                "prompt": prompt,
                "stream": False,
                "images": [image_base64]
            }

            response = requests.post(self.base_url, json=payload)
            response.raise_for_status()

            result = response.json().get("response", "")

            if format_type == "json":
                try:
                    json_data = json.loads(result)
                    return json.dumps(json_data, indent=2)
                except json.JSONDecodeError:
                    return result

            return result
        except Exception as e:
            # Failures are returned as a string rather than raised to the caller
            return f"Error processing image: {str(e)}"

    def process_batch(
        self,
        input_path: Union[str, List[str]],
        format_type: str = "markdown",
        recursive: bool = False,
        preprocess: bool = True,
        custom_prompt: str = None,
        language: str = "en"
    ) -> Dict[str, Any]:
        """
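        The magic factory for batch image processing.
        :param input_path: a folder path, a single file path, or a list of image paths
        :param format_type: output format
        :param recursive: whether to search subfolders recursively
        :param preprocess: whether to preprocess images first
        :param custom_prompt: custom prompt
        :param language: language name or code
        :return: dictionary with results, errors, and statistics
        """
        # Collect all image paths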
        image_paths = []
        if isinstance(input_path, str):
            base_path = Path(input_path)
            if base_path.is_dir():
                pattern = '**/*' if recursive else '*'
                for ext in ['.png', '.jpg', '.jpeg', '.pdf', '.tiff']:
                    image_paths.extend(base_path.glob(f'{pattern}{ext}'))
            else:
                image_paths = [base_path]
        else:
            image_paths = [Path(p) for p in input_path]

        results = {}
        errors = {}
        
        # Process images in parallel with a progress bar
        with tqdm(total=len(image_paths), desc="Processing images") as pbar:
            with concurrent.futures.ThreadPoolExecutor(max_workers=self.max_workers) as executor:
                future_to_path = {
                    executor.submit(self.process_image, str(path), format_type, preprocess, custom_prompt, language): path
                    for path in image_paths
                }
                
                for future in concurrent.futures.as_completed(future_to_path):
                    path = future_to_path[future]
                    try:
                        results[str(path)] = future.result()
                    except Exception as e:
                        errors[str(path)] = str(e)
                    pbar.update(1)

        return {
            "results": results,
            "errors": errors,
            "statistics": {
                "total": len(image_paths),
                "successful": len(results),
                "failed": len(errors)
            }
        }
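
One detail in the PDF branch of `process_image` is worth a closer look. Each page response is prefixed with "Page N:" before the pages are joined, so when `format_type` is "json" the joined string cannot be parsed by `json.loads` and the raw text is returned instead. The sketch below shows one possible alternative, not the project's behavior: parse each page separately and wrap the pages in a list. `parse_pages_as_json` is a hypothetical helper, and the sample responses are invented for the demonstration.

```python
import json
from typing import List


def parse_pages_as_json(page_responses: List[str]) -> str:
    """Parse each page's model output on its own and return one JSON document."""
    parsed = []
    for number, raw in enumerate(page_responses, start=1):
        try:
            content = json.loads(raw)
        except json.JSONDecodeError:
            # Keep the raw text when the model did not return valid JSON
            content = raw
        parsed.append({"page": number, "content": content})
    return json.dumps(parsed, indent=2, ensure_ascii=False)


# Invented page responses: the first is valid JSON, the second is not
pages = ['{"title": "Report"}', "some text the model returned without JSON"]

# The joined form used in the original code fails to parse
joined = "\n".join(f"Page {i + 1}:\n{p}" for i, p in enumerate(pages))
try:
    json.loads(joined)
    joined_is_json = True
except json.JSONDecodeError:
    joined_is_json = False

report = parse_pages_as_json(pages)
print(joined_is_json)
print(report)
```

Keeping the page number as a field rather than a text prefix lets downstream code load the whole result with a single `json.loads` call.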