[BUG] Program hangs: exceptions cannot be caught, timeouts do not fire, and a thread pool cannot manage it

Error output

Parsing a PDF with pymupdf produces the following errors:

MuPDF error: format error: object is not a stream
MuPDF error: syntax error: invalid ICC colorspace
MuPDF error: syntax error: unknown cid font type
MuPDF error: format error: object is not a stream
MuPDF error: syntax error: invalid ICC colorspace
MuPDF error: syntax error: unknown cid font type
MuPDF error: syntax error: unknown cid font type
MuPDF error: syntax error: unknown cid font type
MuPDF error: library error: zlib error: invalid stored block lengths

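Failed attempts

These are the approaches I tried while working on the problem. None of them worked, so feel free to skip this part.
I keep them here because this is a bug log.

Catch a more specific exception

Failed.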

import fitz
def pdf2text(pdf_file):
    try:
        data = dict()
        texts = ''
        doc = fitz.open(pdf_file) # open a document
        for page in doc: # iterate the document pages
            text = page.get_text() # get plain text (is in UTF-8)
            texts += text
        data['document'] = texts
        data['metadata'] =  pdf_file
        data['format'] =  "pdf_text"
        return data
    except fitz.FitzError as fe:
        print("MuPDF specific error:", fe)
        return False
    except Exception as e:
        print(e)
        return False

Set a timeout

Failed.

import fitz
import signal

def handler(signum, frame):
    raise TimeoutError("Operation timed out")

signal.signal(signal.SIGALRM, handler)

def pdf2text(pdf_file):
    try:
        data = dict()
        texts = ''
        doc = fitz.open(pdf_file) # open a document
        for page in doc: # iterate the document pages
            signal.alarm(10)  # arm a 10-second timeout
            text = page.get_text() # get plain text (is in UTF-8)
            signal.alarm(0)  # cancel the timeout
            texts += text
        data['document'] = texts
        data['metadata'] =  pdf_file
        data['format'] =  "pdf_text"
        return data
    except TimeoutError as te:
        print("Timeout error:", te)
        return False
    except Exception as e:
        print(e)
        return False
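
A possible explanation, stated as an assumption rather than something verified against MuPDF: Python runs signal handlers in the main thread, and only between bytecode instructions. If page.get_text() never returns from native code, the SIGALRM handler never gets the chance to raise. In addition, signal.signal refuses to install a handler outside the main thread, which matters if the service calls pdf2text from a worker thread. The sketch below (Unix only; run_with_alarm and busy_loop are hypothetical names) shows the same mechanism working when the slow code is pure Python, which is the case the attempt above implicitly relies on.

```python
import signal
import time


def _raise_timeout(signum, frame):
    raise TimeoutError("Operation timed out")


def run_with_alarm(func, seconds):
    """Run func() and raise TimeoutError if it takes longer than `seconds`.

    Unix only, main thread only. The handler can fire only while the
    interpreter is executing Python bytecode.
    """
    old_handler = signal.signal(signal.SIGALRM, _raise_timeout)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        return func()
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel the timer
        signal.signal(signal.SIGALRM, old_handler)  # restore the old handler


def busy_loop():
    # Pure-Python loop: the interpreter checks for pending signals here,
    # so the alarm can interrupt it.
    deadline = time.monotonic() + 5
    while time.monotonic() < deadline:
        pass
    return "finished"


if __name__ == "__main__":
    try:
        print(run_with_alarm(busy_loop, 0.2))
    except TimeoutError as exc:
        print("Timeout error:", exc)
```

If the hang really is inside native code, no variant of this pattern will help, and the work has to move to a process that can be killed from outside.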

Use a child process

Running the file on its own worked, but it still hung when called from the service.

import fitz
import multiprocessing

def extract_text(pdf_file):
    try:
        data = dict()
        texts = ''
        doc = fitz.open(pdf_file)
        for page in doc:
            texts += page.get_text()
        data['document'] = texts
        data['metadata'] = pdf_file
        data['format'] = "pdf_text"
        return data
    except Exception as e:
        print(f"Error in subprocess: {e}")
        return None

def pdf2text(pdf_file):
    try:
        with multiprocessing.Pool(1) as pool:
            result = pool.apply_async(extract_text, (pdf_file,))
            return result.get(timeout=5)  # time out after 5 seconds
    except multiprocessing.TimeoutError:
        print("Subprocess timed out.")
        return False
    except Exception as e:
        print(f"Main process error: {e}")
        return False

Use a thread pool

Failed.

from concurrent.futures import ThreadPoolExecutor, TimeoutError
import fitz

def parse_pdf_page(pdf_file):
    try:
        data = dict()
        texts = ''
        doc = fitz.open(pdf_file)
        for page in doc:
            text = page.get_text()
            texts += text
        data['document'] = texts
        data['metadata'] = pdf_file
        data['format'] = "pdf_text"
        return data
    except Exception as e:
        return str(e)

def pdf2text_with_threadpool(pdf_file, timeout=30):
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(parse_pdf_page, pdf_file)
        try:
            return future.result(timeout=timeout)
        except TimeoutError:
            print("PDF parsing timed out.")
            return False

# Example call from the service
result = pdf2text_with_threadpool('your_pdf_file.pdf')
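
One reason this pattern cannot work as a hard timeout, independent of PyMuPDF: future.result(timeout=...) only stops waiting. It does not stop the worker thread, and leaving the with block calls executor.shutdown(wait=True), which blocks until the submitted task has finished. A minimal sketch with a hypothetical slow_task standing in for the PDF parsing:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def slow_task(duration):
    # Stand-in for a call that takes a long time (hypothetical).
    time.sleep(duration)
    return "done"


def call_with_threadpool(duration, timeout):
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(slow_task, duration)
        try:
            return future.result(timeout=timeout)
        except TimeoutError:
            # Reached after `timeout` seconds, but leaving the `with`
            # block calls shutdown(wait=True), which blocks until
            # slow_task has actually finished.
            return False


if __name__ == "__main__":
    start = time.monotonic()
    result = call_with_threadpool(duration=1.0, timeout=0.1)
    elapsed = time.monotonic() - start
    # The call reports a timeout, yet returns only after the task ends.
    print(result, f"{elapsed:.1f}s")
```

With a task that never returns, the function never returns either, which matches the hang seen in the service.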

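Working solution

Call an external script with subprocess.
Inside the script, run the parsing in a child process with a timeout and report the result back.

Main process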

main.py

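import subprocess
import json

pdf_file_path = "your_pdf_file"
# Run the parser in a separate Python process and capture its stdout.
completed = subprocess.run(
    ['python', 'pdf_demo.py', pdf_file_path],
    capture_output=True,
    text=True,
)
try:
    result = json.loads(completed.stdout)
    print(result)
except json.JSONDecodeError:
    # The script printed something other than JSON (e.g. a timeout message).
    print("Could not parse:", repr(completed.stdout))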

External script

pdf_demo.py

import fitz
import multiprocessing
import sys
import json
def pdf2text(pdf_file):
    data = dict()
    texts = ''

    doc = fitz.open(pdf_file)
    for page in doc:
        text = page.get_text()
        texts += text

    data['document'] = texts
    data['metadata'] =  pdf_file
    data['format'] =  "pdf_text"
    return data


if __name__ == "__main__":
    pdf_file_path = sys.argv[1]
    pool = multiprocessing.Pool(1)
    result = pool.apply_async(pdf2text, (pdf_file_path,))
    try:
        data = result.get(timeout=5)
        print(json.dumps(data, ensure_ascii=False, indent=4))
    except multiprocessing.TimeoutError:
        pool.terminate()
        print("Subprocess timed out.")
    except Exception as e:
        print(f"Main process error: {e}")
    finally:
        pool.close()
        pool.join()
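
As a variation on the approach above (not what the original scripts do), subprocess.run accepts its own timeout argument: when it expires, the child is killed and subprocess.TimeoutExpired is raised. The sketch below assumes the external script prints a single JSON object to stdout. run_json_script is a hypothetical helper, and sys.executable is used in place of the bare 'python' so the child runs under the same interpreter.

```python
import json
import subprocess
import sys


def run_json_script(args, timeout):
    """Run a Python child process and parse its stdout as JSON.

    Returns the decoded object, or None on timeout, non-zero exit,
    or unparseable output.
    """
    try:
        completed = subprocess.run(
            [sys.executable, *args],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        print("Child process timed out.")
        return None
    if completed.returncode != 0:
        print("Child process failed:", completed.stderr.strip())
        return None
    try:
        return json.loads(completed.stdout)
    except json.JSONDecodeError:
        print("Could not parse:", repr(completed.stdout))
        return None


if __name__ == "__main__":
    # With the scripts above, the call would look like:
    # run_json_script(["pdf_demo.py", "your_pdf_file"], timeout=10)
    demo = 'import json; print(json.dumps({"format": "pdf_text"}))'
    print(run_json_script(["-c", demo], timeout=10))
```

Note that the timeout kills only the direct child. If the script itself starts a multiprocessing worker, as pdf_demo.py does, that worker may outlive it, so this variation fits best when the script does the parsing directly.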