Ollama-OCR Core Code Walkthrough: Extracting Text from Images and PDFs with Advanced LLMs

My blog homepage: https://lizheng.blog.csdn.net


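Ollama-OCR is a powerful optical character recognition (OCR) toolkit that uses vision language models served through Ollama to extract text from images and PDFs. It is available both as a Python package and as a Streamlit web application. The repository is described in detail below.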

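Key Features

  1. Multiple file formats: handles PDFs and common image formats such as PNG, JPG, JPEG, TIFF, and BMP.
  2. Multiple models
    • LLaVA 7B: an efficient vision language model suited to real-time processing.
    • Llama 3.2 Vision: an advanced model with high recognition accuracy on complex documents.
    • Granite 3.2 Vision: a compact, efficient model designed for visual document understanding; it can extract content from tables, charts, and similar elements automatically.
    • Moondream: a lightweight model designed for edge devices.
    • Minicpm-v: handles images of any aspect ratio, up to 1.8 million pixels.
  3. Multiple output formats
    • Markdown: preserves text formatting, including headings and lists.
    • Plain Text: simple plain-text extraction.
    • JSON: structured data format.
    • Structured: for extracting tables and organized data.
    • Key-Value Pairs: extracts labeled information.
    • Table: extracts all tabular data.
  4. Batch processing: processes multiple images in parallel and tracks progress for each image.
  5. Custom prompts: a custom prompt can override the default prompt for more targeted text extraction.
  6. Image preprocessing: before OCR runs, images are preprocessed, which includes converting PDFs to images, enhancing contrast, and reducing noise.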
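The interplay between output formats (feature 3) and custom prompts (feature 5) comes down to a small piece of selection logic. The sketch below mirrors the rule used in the implementation walked through later in this post: a non-blank custom prompt wins, otherwise the prompt is looked up by format type, and an unknown format type falls back to the plain-text prompt. The function name `choose_prompt` and the shortened prompt texts are illustrative assumptions, not the library's actual wording.

```python
from typing import Optional

# Shortened placeholder prompts (assumption); the library's defaults are much longer.
DEFAULT_PROMPTS = {
    "markdown": "Extract all text from this image in {language} and format it as markdown.",
    "text": "Extract all visible text from this image in {language} without any changes.",
    "json": "Extract all text from this image in {language} and format it as JSON.",
}


def choose_prompt(format_type: str, language: str = "en",
                  custom_prompt: Optional[str] = None) -> str:
    """Return the custom prompt if it is non-blank, else a default for format_type."""
    if custom_prompt and custom_prompt.strip():
        return custom_prompt
    # Unknown format types fall back to the plain-text prompt
    template = DEFAULT_PROMPTS.get(format_type, DEFAULT_PROMPTS["text"])
    return template.format(language=language)


print(choose_prompt("markdown", "English"))
print(choose_prompt("unknown", "English"))
print(choose_prompt("json", custom_prompt="Extract only the invoice number."))
```

One consequence worth noting: when a custom prompt is supplied, the format type no longer influences the prompt at all, so any formatting requirements need to be stated in the custom prompt itself.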

Installation and Usage

Install dependencies
pip install -r requirements.txt
Quick start
  1. Install Ollama: follow the official documentation.
  2. Pull the models you need
ollama pull llama3.2-vision:11b
ollama pull granite3.2-vision
ollama pull moondream
ollama pull minicpm-v
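Before running OCR it can help to confirm that the pulled models are actually installed. The sketch below is a minimal check against Ollama's `/api/tags` endpoint, which lists local models. The helper names `missing_models` and `fetch_tags`, and the assumption that the server listens on `http://localhost:11434`, are illustrative; the example at the bottom uses a hand-written response so it runs without a server.

```python
import json
import urllib.request
from typing import Dict, List


def missing_models(tags: Dict, wanted: List[str]) -> List[str]:
    """Return the entries of `wanted` that are absent from an /api/tags response."""
    installed = {m.get("name", "") for m in tags.get("models", [])}
    missing = []
    for name in wanted:
        # A name without a tag (e.g. "moondream") is stored as "moondream:latest"
        candidates = {name} if ":" in name else {name, f"{name}:latest"}
        if not candidates & installed:
            missing.append(name)
    return missing


def fetch_tags(host: str = "http://localhost:11434") -> Dict:
    """Query a running Ollama server for its installed models."""
    with urllib.request.urlopen(f"{host}/api/tags", timeout=5) as resp:
        return json.load(resp)


# Offline example with a hand-written response in the same shape
sample = {"models": [{"name": "llama3.2-vision:11b"}, {"name": "moondream:latest"}]}
print(missing_models(sample, ["llama3.2-vision:11b", "moondream", "minicpm-v"]))
```

With a server running, `missing_models(fetch_tags(), [...])` would perform the same check against the real model list.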
Using the Python package
  • Single-file processing
from ollama_ocr import OCRProcessor

# Initialize the OCR processor
ocr = OCRProcessor(model_name='llama3.2-vision:11b', base_url="http://host.docker.internal:11434/api/generate")

# Process an image
result = ocr.process_image(
    image_path="path/to/your/image.png",
    format_type="markdown",
    custom_prompt="Extract all text, focusing on dates and names.",
    language="English"
)
print(result)
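Under the hood, `process_image` (walked through later in this post) sends one non-streaming POST request to the URL given as `base_url`, with the image attached as a base64 string. The sketch below reproduces that request shape with the standard library only. The function names `build_payload` and `send` are assumptions for illustration, and `send` requires a running Ollama server, so the example only builds and inspects the payload.

```python
import base64
import json
import urllib.request


def build_payload(image_bytes: bytes, prompt: str,
                  model: str = "llama3.2-vision:11b") -> dict:
    """Build the JSON body for a non-streaming /api/generate request."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one complete JSON response
        "images": [base64.b64encode(image_bytes).decode("utf-8")],
    }


def send(payload: dict,
         url: str = "http://localhost:11434/api/generate") -> str:
    """POST the payload and return the model's text (needs a running server)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("response", "")


payload = build_payload(b"fake image bytes", "Extract all text.")
print(sorted(payload))
```

When Ollama runs on the host and the Python code runs inside a Docker container, the URL typically needs to point at the host (as in the `host.docker.internal` example above) rather than at `localhost`.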
  • Batch processing
from ollama_ocr import OCRProcessor

# Initialize the OCR processor
ocr = OCRProcessor(model_name='llama3.2-vision:11b', max_workers=4)

# Process multiple images
batch_results = ocr.process_batch(
    input_path="path/to/images/folder",
    format_type="markdown",
    recursive=True,
    preprocess=True,
    custom_prompt="Extract all text, focusing on dates and names.",
    language="English"
)

# Access the results
for file_path, text in batch_results['results'].items():
    print(f"\nFile: {file_path}")
    print(f"Extracted Text: {text}")

# View statistics
print("\nProcessing Statistics:")
print(f"Total images: {batch_results['statistics']['total']}")
print(f"Successfully processed: {batch_results['statistics']['successful']}")
print(f"Failed: {batch_results['statistics']['failed']}")
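One caveat about the statistics: in the implementation walked through later in this post, `process_image` catches every exception and returns a string that starts with `Error processing image:` instead of raising. As a result, failed files can end up in `batch_results['results']` and the `failed` counter may stay at zero. The sketch below separates the two cases by that prefix; `split_results` is a hypothetical helper, not part of the package.

```python
from typing import Dict, Tuple

# Prefix produced by the except branch of process_image
ERROR_PREFIX = "Error processing image:"


def split_results(results: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Separate successful extractions from returned error strings."""
    ok, failed = {}, {}
    for path, text in results.items():
        if text.startswith(ERROR_PREFIX):
            failed[path] = text
        else:
            ok[path] = text
    return ok, failed


# Hand-written stand-in for batch_results['results']
sample = {
    "a.png": "# Invoice\nDate: 2024-01-01",
    "b.png": "Error processing image: Could not read image at b.png",
}
ok, failed = split_results(sample)
print(len(ok), len(failed))
```

The prefix test is a heuristic: a document whose extracted text happens to begin with the same words would be misclassified, so raising an exception inside the library would be the more robust design.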
Using the Streamlit web application
  1. Clone the repository:
git clone https://github.com/imanoop7/Ollama-OCR.git
cd Ollama-OCR
  2. Install dependencies:
pip install -r requirements.txt
  3. Change to the directory containing app.py:
cd src/ollama_ocr
  4. Run the Streamlit app:
streamlit run app.py

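Core Code Walkthrough

"""
Heave-ho! This code is an OCR wizard that pulls the text right out of images and PDFs!
It handles single images, batches of images, and even PDF files, and it can output several formats on demand.
Let's begin this tour of the annotated code!
"""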
import json
from typing import Dict, Any, List, Union
import os
import base64
import requests
from tqdm import tqdm
import concurrent.futures
from pathlib import Path
import cv2
import pymupdf 
import numpy as np

class OCRProcessor:
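    def __init__(self, model_name: str = "llama3.2-vision:11b",
                 base_url: str = "http://localhost:11434/api/generate",
                 max_workers: int = 1):
        """
        Initialize the OCR processor and pack your magic toolkit!
        :param model_name: name of the AI model to use; defaults to the 11B Llama 3.2 Vision model
        :param base_url: URL of the API endpoint; defaults to port 11434 on localhost
        :param max_workers: maximum number of worker threads; take care not to wear out the CPU
        """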
        self.model_name = model_name
        self.base_url = base_url
        self.max_workers = max_workers

    def _encode_image(self, image_path: str) -> str:
        """Encode the image file as a base64 string, like wrapping it in an invisibility cloak"""
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode("utf-8")

    def _pdf_to_images(self, pdf_path: str) -> List[str]:
        """
        Turn every page of the PDF into an image, like tearing out each page and photographing it
        :param pdf_path: path to the PDF file
        :return: list of image paths, one per page
        :raises ValueError: if the PDF cannot be converted
        """
        try:
            doc = pymupdf.open(pdf_path)
            image_paths = []
            for page_num in range(doc.page_count):
                page = doc[page_num]
                pix = page.get_pixmap()  # Render the page to an image
                temp_path = f"{pdf_path}_page{page_num}.png"  # Temporary image path, next to the PDF
                pix.save(temp_path)  # Save the image
                image_paths.append(temp_path)
            doc.close()
            return image_paths
        except Exception as e:
            raise ValueError(f"Could not convert PDF to images: {e}")

    def _preprocess_image(self, image_path: str, language: str = "en") -> str:
        """
        Give the image a spa treatment so its text is easier to recognize
        :param image_path: path to the image
        :param language: language name or code; different languages get different treatments
        :return: path to the preprocessed image
        :raises ValueError: if the image cannot be read
        """
        # Read the image
        image = cv2.imread(image_path)
        if image is None:
            raise ValueError(f"Could not read image at {image_path}")

        # Convert to grayscale
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

        # Enhance contrast with CLAHE
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
        enhanced = clahe.apply(gray)

        # Denoise for a cleaner image
        denoised = cv2.fastNlMeansDenoising(enhanced)

        # Choose the binarization method based on the language
        if language.lower() in ["japanese", "chinese", "zh", "korean"]:
            # Chinese, Japanese, and Korean text: Gaussian adaptive thresholding
            thresh = cv2.adaptiveThreshold(
                denoised, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, 
                cv2.THRESH_BINARY, 11, 2)
            thresh = cv2.bitwise_not(thresh)
        else:
            # Other languages: Otsu's method
            thresh = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
            thresh = cv2.bitwise_not(thresh)

        # Save the preprocessed image
        preprocessed_path = f"{image_path}_preprocessed.jpg"
        cv2.imwrite(preprocessed_path, thresh)

        return preprocessed_path

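    def process_image(self, image_path: str, format_type: str = "markdown", preprocess: bool = True, 
                      custom_prompt: str = None, language: str = "en") -> str:
        """
        The core magic! Process a single image or PDF and extract its text
        :param image_path: path to the image or PDF
        :param format_type: output format, one of ["markdown", "text", "json", "structured", "key_value", "table"]; any other value falls back to "text"
        :param preprocess: whether to preprocess the image first
        :param custom_prompt: custom spell (prompt); the default prompt is used when this is empty
        :param language: language name or code; affects preprocessing and the prompt
        :return: the extracted text, in the format selected by format_type
        """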
        try:
            # Handle PDF files
            if image_path.lower().endswith('.pdf'):
                image_pages = self._pdf_to_images(image_path)
                print("No. of pages in the PDF", len(image_pages))
                responses = []
                for idx, page_file in enumerate(image_pages):
                    # Preprocess the page image if requested
                    if preprocess:
                        preprocessed_path = self._preprocess_image(page_file, language)
                    else:
                        preprocessed_path = page_file

                    image_base64 = self._encode_image(preprocessed_path)

                    if custom_prompt and custom_prompt.strip():
                        prompt = custom_prompt
                        print("Using custom prompt:", prompt)  # Debug print
                    else:
                        prompts = {
                            "markdown": f"""Extract all text content from this image in {language} **exactly as it appears**, without modification, summarization, or omission.
                                Format the output in markdown:
                                - Use headers (#, ##, ###) **only if they appear in the image**
                                - Preserve original lists (-, *, numbered lists) as they are
                                - Maintain all text formatting (bold, italics, underlines) exactly as seen
                                - **Do not add, interpret, or restructure any content**
                            """,
                            "text": f"""Extract all visible text from this image in {language} **without any changes**.
                                - **Do not summarize, paraphrase, or infer missing text.**
                                - Retain all spacing, punctuation, and formatting exactly as in the image.
                                - If text is unclear or partially visible, extract as much as possible without guessing.
                                - **Include all text, even if it seems irrelevant or repeated.** 
                                """,


                           "json": f"""Extract all text from this image in {language} and format it as JSON, **strictly preserving** the structure.
                                - **Do not summarize, add, or modify any text.**
                                - Maintain hierarchical sections and subsections as they appear.
                                - Use keys that reflect the document's actual structure (e.g., "title", "body", "footer").
                                - Include all text, even if fragmented, blurry, or unclear.
                                """,


                            "structured": f"""Extract all text from this image in {language}, **ensuring complete structural accuracy**:
                                - Identify and format tables **without altering content**.
                                - Preserve list structures (bulleted, numbered) **exactly as shown**.
                                - Maintain all section headings, indents, and alignments.
                                - **Do not add, infer, or restructure the content in any way.**
                                """,


                           "key_value": f"""Extract all key-value pairs from this image in {language} **exactly as they appear**:
                                - Identify and extract labels and their corresponding values without modification.
                                - Maintain the exact wording, punctuation, and order.
                                - Format each pair as 'key: value' **only if clearly structured that way in the image**.
                                - **Do not infer missing values or add any extra text.**
                                """,

                            "table": f"""Extract all tabular data from this image in {language} **exactly as it appears**, without modification, summarization, or omission.
                                - **Preserve the table structure** (rows, columns, headers) as closely as possible.
                                - **Do not add missing values or infer content**—if a cell is empty, leave it empty.
                                - Maintain all numerical, textual, and special character formatting.
                                - If the table contains merged cells, indicate them clearly without altering their meaning.
                                - Output the table in a structured format such as Markdown, CSV, or JSON, based on the intended use.
                                """,


                        }
                        prompt = prompts.get(format_type, prompts["text"])
                        print("Using default prompt:", prompt)  # Debug print

                    # Build the payload for the API request
                    payload = {
                        "model": self.model_name,
                        "prompt": prompt,
                        "stream": False,
                        "images": [image_base64]
                    }

                    # Send the API request to Ollama
                    response = requests.post(self.base_url, json=payload)
                    response.raise_for_status()
                    res = response.json().get("response", "")
                    print("Page No. Processed", idx)
                    # Prefix the result with its page number
                    responses.append(f"Page {idx + 1}:\n{res}")

                    # Clean up temporary files
                    if preprocess and preprocessed_path.endswith('_preprocessed.jpg'):
                        os.remove(preprocessed_path)
                    if page_file.endswith('.png'):
                        os.remove(page_file)

                final_result = "\n".join(responses)
                if format_type == "json":
                    try:
                        json_data = json.loads(final_result)
                        return json.dumps(json_data, indent=2)
                    except json.JSONDecodeError:
                        return final_result
                return final_result

            # Handle regular image files
            if preprocess:
                image_path = self._preprocess_image(image_path, language)

            image_base64 = self._encode_image(image_path)

            # Clean up the temporary preprocessed image
            if image_path.endswith(('_preprocessed.jpg', '_temp.jpg')):
                os.remove(image_path)

            if custom_prompt and custom_prompt.strip():
                prompt = custom_prompt
                print("Using custom prompt:", prompt)
            else:
                prompts = {
                            "markdown": f"""Extract all text content from this image in {language} **exactly as it appears**, without modification, summarization, or omission.
                                Format the output in markdown:
                                - Use headers (#, ##, ###) **only if they appear in the image**
                                - Preserve original lists (-, *, numbered lists) as they are
                                - Maintain all text formatting (bold, italics, underlines) exactly as seen
                                - **Do not add, interpret, or restructure any content**
                            """,
                            "text": f"""Extract all visible text from this image in {language} **without any changes**.
                                - **Do not summarize, paraphrase, or infer missing text.**
                                - Retain all spacing, punctuation, and formatting exactly as in the image.
                                - If text is unclear or partially visible, extract as much as possible without guessing.
                                - **Include all text, even if it seems irrelevant or repeated.** 
                                """,


                           "json": f"""Extract all text from this image in {language} and format it as JSON, **strictly preserving** the structure.
                                - **Do not summarize, add, or modify any text.**
                                - Maintain hierarchical sections and subsections as they appear.
                                - Use keys that reflect the document's actual structure (e.g., "title", "body", "footer").
                                - Include all text, even if fragmented, blurry, or unclear.
                                """,


                            "structured": f"""Extract all text from this image in {language}, **ensuring complete structural accuracy**:
                                - Identify and format tables **without altering content**.
                                - Preserve list structures (bulleted, numbered) **exactly as shown**.
                                - Maintain all section headings, indents, and alignments.
                                - **Do not add, infer, or restructure the content in any way.**
                                """,


                           "key_value": f"""Extract all key-value pairs from this image in {language} **exactly as they appear**:
                                - Identify and extract labels and their corresponding values without modification.
                                - Maintain the exact wording, punctuation, and order.
                                - Format each pair as 'key: value' **only if clearly structured that way in the image**.
                                - **Do not infer missing values or add any extra text.**
                                """,

                            "table": f"""Extract all tabular data from this image in {language} **exactly as it appears**, without modification, summarization, or omission.
                                - **Preserve the table structure** (rows, columns, headers) as closely as possible.
                                - **Do not add missing values or infer content**—if a cell is empty, leave it empty.
                                - Maintain all numerical, textual, and special character formatting.
                                - If the table contains merged cells, indicate them clearly without altering their meaning.
                                - Output the table in a structured format such as Markdown, CSV, or JSON, based on the intended use.
                                """,
                }
                prompt = prompts.get(format_type, prompts["text"])
                print("Using default prompt:", prompt)  # Debug print

            payload = {
                "model": self.model_name,
                "prompt": prompt,
                "stream": False,
                "images": [image_base64]
            }

            response = requests.post(self.base_url, json=payload)
            response.raise_for_status()

            result = response.json().get("response", "")

            if format_type == "json":
                try:
                    json_data = json.loads(result)
                    return json.dumps(json_data, indent=2)
                except json.JSONDecodeError:
                    return result

            return result
        except Exception as e:
            return f"Error processing image: {str(e)}"

    def process_batch(
        self,
        input_path: Union[str, List[str]],
        format_type: str = "markdown",
        recursive: bool = False,
        preprocess: bool = True,
        custom_prompt: str = None,
        language: str = "en"
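    ) -> Dict[str, Any]:
        """
        The magic factory for processing images in batches
        :param input_path: either a folder path or a list of image paths
        :param format_type: output format type
        :param recursive: whether to search subfolders recursively
        :param preprocess: whether to preprocess the images first
        :param custom_prompt: custom spell (prompt)
        :param language: language name or code
        :return: dictionary containing the results and statistics
        """
        # Collect all image paths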
        image_paths = []
        if isinstance(input_path, str):
            base_path = Path(input_path)
            if base_path.is_dir():
                pattern = '**/*' if recursive else '*'
                for ext in ['.png', '.jpg', '.jpeg', '.pdf', '.tiff']:
                    image_paths.extend(base_path.glob(f'{pattern}{ext}'))
            else:
                image_paths = [base_path]
        else:
            image_paths = [Path(p) for p in input_path]

        results = {}
        errors = {}
        
        # Process the images in parallel, with a progress bar
        with tqdm(total=len(image_paths), desc="Processing images") as pbar:
            with concurrent.futures.ThreadPoolExecutor(max_workers=self.max_workers) as executor:
                future_to_path = {
                    executor.submit(self.process_image, str(path), format_type, preprocess, custom_prompt, language): path
                    for path in image_paths
                }
                
                for future in concurrent.futures.as_completed(future_to_path):
                    path = future_to_path[future]
                    try:
                        results[str(path)] = future.result()
                    except Exception as e:
                        errors[str(path)] = str(e)
                    pbar.update(1)

        return {
            "results": results,
            "errors": errors,
            "statistics": {
                "total": len(image_paths),
                "successful": len(results),
                "failed": len(errors)
            }
        }
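The `json` branch above calls `json.loads` directly on the model's reply and falls back to the raw text when parsing fails. Vision models may wrap JSON in a markdown code fence or surround it with commentary, in which case that direct parse would fail. The sketch below is one tolerant alternative, assuming that behavior; `parse_model_json` is a hypothetical helper and not part of Ollama-OCR.

```python
import json
import re
from typing import Any

FENCE = "`" * 3  # a markdown code fence, built programmatically
# Matches a fenced block, optionally labeled "json", anywhere in the text
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE,
                      re.DOTALL | re.IGNORECASE)


def parse_model_json(text: str) -> Any:
    """Parse JSON from model output, trying the raw text and then fenced blocks."""
    candidates = [text.strip()]
    candidates += [block.strip() for block in FENCE_RE.findall(text)]
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    raise ValueError("No parseable JSON found in model output")


wrapped = f'Here is the result:\n{FENCE}json\n{{"title": "Invoice"}}\n{FENCE}'
print(parse_model_json(wrapped))
```

For PDFs, the code above joins per-page replies with `Page N:` prefixes before parsing, so the combined string is not valid JSON as a whole; parsing each page's reply separately would be needed to get structured output per page.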
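### Ollama OCR: Techniques, Usage, and Implementation

Ollama is a framework built around large language models (LLMs) that focuses on efficient local deployment. Although Ollama is mainly used to run and manage LLMs[^2], it can be integrated with other technologies to support specific scenarios such as optical character recognition (OCR). The following covers tools and techniques for implementing OCR alongside Ollama.

#### Tools and Techniques Overview

To implement OCR and combine it with Ollama, consider the following approaches:

1. **Tesseract OCR integration**: Tesseract is an open-source OCR engine that handles many image formats and extracts the text they contain. It can be combined with Ollama through Python or another programming language, so that the model analyzes the extracted text or generates content from it.

```python
import pytesseract
from PIL import Image

def extract_text_from_image(image_path):
    image = Image.open(image_path)
    text = pytesseract.image_to_string(image)
    return text

extracted_text = extract_text_from_image('example.png')
print(extracted_text)
```

2. **Google Cloud Vision API**: when more advanced features or higher accuracy are needed, the Google Cloud Vision API can perform the OCR step. Its recognized text can then be passed to a model served by Ollama for further processing[^1].

3. **EasyOCR**: EasyOCR is another popular library. It supports more than 80 languages and works without relying on an external server, which makes it a good fit for offline applications.

```python
import easyocr

reader = easyocr.Reader(['en'])
result = reader.readtext('image.jpg', detail=0)
print(result)
```

4. **Pytesseract with OpenCV**: for documents with more complex structure, computer vision techniques can be used to preprocess the image before Pytesseract performs the OCR extraction.

#### Implementation Notes

Several key points deserve emphasis:

- Data preparation involves collecting sample images and annotating the text regions in them;
- Model selection depends on the requirements of the application: lightweight options suit speed-sensitive cases, while more complex algorithms suit cases where accuracy matters most;
- Backend design should account for the communication protocol between the frontend and backend and for an error-handling mechanism.

#### Example Code

The following simple example shows how some of the components mentioned above can be chained into a complete pipeline:

```python
import cv2
import pytesseract

def perform_ocr(image):
    # Stand-in for a custom OCR routine; this version delegates to Tesseract
    return pytesseract.image_to_string(image)

def process_document(file_name):
    img = cv2.imread(file_name)
    if img is None:
        raise ValueError(f"Could not read image at {file_name}")
    gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Inverse binary threshold: pixels above 150 become 0, the rest become 255
    _, thresholded_img = cv2.threshold(gray_img, 150, 255, cv2.THRESH_BINARY_INV)
    detected_texts = perform_ocr(thresholded_img)
    return detected_texts

if __name__ == "__main__":
    file_input = 'test_doc.jpg'
    output = process_document(file_input)
    print(output)
```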