【BUG】程序卡死，无法捕获异常，无法设置超时，无法使用线程池管理

内卷焦虑人士

于 2024-08-21 17:54:15 发布

阅读量96

点赞数

文章标签： bug pymupdf

本文链接：https://blog.csdn.net/weixin_46398647/article/details/141393539

版权

报错内容

在使用pymupdf解析PDF时，出现报错

MuPDF error: format error: object is not a stream
MuPDF error: syntax error: invalid ICC colorspace
MuPDF error: syntax error: unknown cid font type
MuPDF error: format error: object is not a stream
MuPDF error: syntax error: invalid ICC colorspace
MuPDF error: syntax error: unknown cid font type
MuPDF error: syntax error: unknown cid font type
MuPDF error: syntax error: unknown cid font type
MuPDF error: library error: zlib error: invalid stored block lengths

试错方案

这些是尝试解决问题的过程，都是无效方案，可以跳过
因为我是记录bug笔记，所以写在这里

捕获更具体的异常

失败

import fitz
def pdf2text(pdf_file):
    try:
        data = dict()
        texts = ''
        doc = fitz.open(pdf_file) # open a document
        for page in doc: # iterate the document pages
            text = page.get_text() # get plain text (is in UTF-8)
            texts += text
        data['document'] = texts
        data['metadata'] =  pdf_file
        data['format'] =  "pdf_text"
        return data
    except fitz.FitzError as fe:
        print("MuPDF specific error:", fe)
        return False
    except Exception as e:
        print(e)
        return False

设置超时

失败

import fitz
import signal

def handler(signum, frame):
    raise TimeoutError("Operation timed out")

signal.signal(signal.SIGALRM, handler)

def pdf2text(pdf_file):
    try:
        data = dict()
        texts = ''
        doc = fitz.open(pdf_file) # open a document
        for page in doc: # iterate the document pages
            signal.alarm(10)  # 设置超时为 10 秒
            text = page.get_text() # get plain text (is in UTF-8)
            signal.alarm(0)  # 取消超时
            texts += text
        data['document'] = texts
        data['metadata'] =  pdf_file
        data['format'] =  "pdf_text"
        return data
    except TimeoutError as te:
        print("Timeout error:", te)
        return False
    except Exception as e:
        print(e)
        return False

使用子进程

单独运行文件成功了，但是在服务调用时仍然卡死

import fitz
import multiprocessing

def extract_text(pdf_file):
    try:
        data = dict()
        texts = ''
        doc = fitz.open(pdf_file)
        for page in doc:
            texts += page.get_text()
        data['document'] = texts
        data['metadata'] = pdf_file
        data['format'] = "pdf_text"
        return data
    except Exception as e:
        print(f"Error in subprocess: {e}")
        return None

def pdf2text(pdf_file):
    try:
        with multiprocessing.Pool(1) as pool:
            result = pool.apply_async(extract_text, (pdf_file,))
            return result.get(timeout=5)  # 设置超时为30秒
    except multiprocessing.TimeoutError:
        print("Subprocess timed out.")
        return False
    except Exception as e:
        print(f"Main process error: {e}")
        return False

使用线程池

失败

from concurrent.futures import ThreadPoolExecutor, TimeoutError
import fitz

def parse_pdf_page(pdf_file):
    try:
        data = dict()
        texts = ''
        doc = fitz.open(pdf_file)
        for page in doc:
            text = page.get_text()
            texts += text
        data['document'] = texts
        data['metadata'] = pdf_file
        data['format'] = "pdf_text"
        return data
    except Exception as e:
        return str(e)

def pdf2text_with_threadpool(pdf_file, timeout=30):
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(parse_pdf_page, pdf_file)
        try:
            return future.result(timeout=timeout)
        except TimeoutError:
            print("PDF parsing timed out.")
            return False

# 服务中调用示例
result = pdf2text_with_threadpool('your_pdf_file.pdf')

成功方案

使用 subprocess 调用外部脚本
脚本中设置子进程超时反馈

主进程

main.py

import subprocess
import json
pdf_file_path = "your_pdf_file"
result = subprocess.run([
        'python', 
        'pdf_demo.py',
          pdf_file_path
          ], 
          capture_output=True, 
          text=True)
try:
    result = json.loads(result.stdout)
    print(result)
except json.JSONDecodeError:
    print("未解析",repr(result.stdout))

外部脚本

pdf_demo.py

import fitz
import multiprocessing
import sys
import json
def pdf2text(pdf_file):
    data = dict()
    texts = ''

    doc = fitz.open(pdf_file)
    for page in doc:
        text = page.get_text()
        texts += text

    data['document'] = texts
    data['metadata'] =  pdf_file
    data['format'] =  "pdf_text"
    return data


if __name__ == "__main__":
    pdf_file_path = sys.argv[1]
    pool = multiprocessing.Pool(1)
    result = pool.apply_async(pdf2text, (pdf_file_path,))
    try:
        data = result.get(timeout=5)
        print(json.dumps(data, ensure_ascii=False, indent=4))
    except multiprocessing.TimeoutError:
        pool.terminate()
        print("Subprocess timed out.")
    except Exception as e:
        print(f"Main process error: {e}")
    finally:
        pool.close()
        pool.join()

内卷焦虑人士

关注

0
点赞
踩
2

收藏

觉得还不错? 一键收藏
0
评论
【BUG】程序卡死，无法捕获异常，无法设置超时，无法使用线程池管理

MuPDF error: format error: object is not a streamMuPDF error: syntax error: invalid ICC colorspaceMuPDF error: syntax error: unknown cid font typeMuPDF error: library error: zlib error: invalid stored block lengths
复制链接

扫一扫