Error: pytesseract.pytesseract.TesseractError

Latest recommended article published on 2025-03-27 19:59:31

爱吃油淋鸡的莫何

Article tags: front end, server

Article link: https://blog.csdn.net/qq_42835363/article/details/142523045


Setting up the Tesseract-OCR environment, and how to handle the error pytesseract.pytesseract.TesseractError:

(1, 'Error opening data file C:\Program Files\Tesseract-OCR/tessdata/chi_sim.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'chi_sim' Tesseract couldn't load any languages! Could not initialize tesseract.')

import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

def Tesseract_to_str(image):
    """Extract the text in an image and return it as a string."""
    # Use pytesseract to extract the text; recognizing Chinese requires lang='chi_sim'
    text_from_image = pytesseract.image_to_string(image, lang='chi_sim')
    print(text_from_image)
    return text_from_image

Running the code above produces the following error:

  File "D:\ProgramFiles\miniconda3\envs\env_myenv\Lib\site-packages\pytesseract\pytesseract.py", line 489, in
    Output.STRING: lambda: run_and_get_output(*args),
                           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ProgramFiles\miniconda3\envs\env_myenv\Lib\site-packages\pytesseract\pytesseract.py", line 352, in run_and_get_output
    run_tesseract(**kwargs)
  File "D:\ProgramFiles\miniconda3\envs\env_myenv\Lib\site-packages\pytesseract\pytesseract.py", line 284, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, 'Error opening data file C:\Program Files\Tesseract-OCR/tessdata/chi_sim.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'chi_sim' Tesseract couldn't load any languages! Could not initialize tesseract.')

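For the cause, check the following points about setting up the environment:

1 Set up the Tesseract-OCR environment.

1.1 Tesseract-OCR must first be installed manually. Download: https://digi.bib.uni-mannheim.de/tesseract/?C=M;O=D

Note: select the Chinese language pack during installation (tick every option in the installer).

Install it on the same drive as the code you will run.

After installing Tesseract-OCR, add its installation path to the system environment variables.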

After installation, run the following commands to check the version number and the supported languages:
cd D:\Program Files\Tesseract-OCR

tesseract -v
tesseract --list-langs
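
The same check can also be run from Python, which is useful when the script runs in an environment (for example a conda env) whose PATH may differ from the terminal's. This is only a minimal sketch: it assumes the executable is named `tesseract`, and the helper name `tesseract_info` is made up for illustration and is not part of pytesseract.

```python
import shutil
import subprocess


def tesseract_info(executable="tesseract"):
    """Return version and language listings, or None if the executable is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None

    def run(flag):
        # Some Tesseract builds print to stderr, so capture both streams.
        proc = subprocess.run(
            [path, flag], capture_output=True, text=True, errors="replace"
        )
        return (proc.stdout + proc.stderr).strip()

    return {
        "path": path,
        "version": run("--version"),
        "langs": run("--list-langs"),
    }


info = tesseract_info()
if info is None:
    print("tesseract was not found on PATH")
else:
    print(info["version"])
    print(info["langs"])
```

If `chi_sim` does not appear in the language listing, the language file is most likely missing from the tessdata folder.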

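If a language-related error appears, download the Chinese language pack chi_sim.traineddata into the tessdata folder of the local installation, C:\Program Files\Tesseract-OCR\tessdata (the directory named in the error message).

Language pack download: https://tesseract-ocr.github.io/tessdoc/Data-Files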
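
Before rerunning the OCR code, it can help to confirm that the language file really sits where Tesseract looks for it. The error message shows it searching the `tessdata` folder for `chi_sim.traineddata`. The sketch below only checks the file system; the helper name `missing_traineddata` is hypothetical, and the path is taken from the error message and should be adjusted to your own installation.

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, langs):
    """Return the language codes whose .traineddata file is absent from tessdata_dir."""
    folder = Path(tessdata_dir)
    return [lang for lang in langs if not (folder / f"{lang}.traineddata").is_file()]


# Path taken from the error message; adjust to your own installation.
missing = missing_traineddata(
    r"C:\Program Files\Tesseract-OCR\tessdata", ["chi_sim", "eng"]
)
print("Missing language files:", missing)
```

An empty list means every requested language file is present in that folder.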

1.2 Then install the Python library pytesseract

pip install pytesseract

1.3 Running the code below should now no longer raise the error

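import pytesseract

def Tesseract_to_str(image):
    """Extract the text in an image and return it as a string."""
    # If the Tesseract installation directory is not in the system environment
    # variables, the path to the executable must be set explicitly.
    pytesseract.pytesseract.tesseract_cmd = r"D:\Program_Files\Tesseract-OCR\tesseract.exe"
    # Tell Tesseract which folder holds the language data (*.traineddata) files.
    tessdata_dir_config = '--tessdata-dir D:/Program_Files/Tesseract-OCR/tessdata'
    # Use pytesseract to extract the text; recognizing Chinese requires lang='chi_sim'
    print('-'*20, 'Extracting text from image', '-'*20)
    try:
        text_from_image = pytesseract.image_to_string(image, config=tessdata_dir_config, lang='chi_sim')
    except Exception as e:
        print('Failed to extract text!         ', e)
        return ''
    print(text_from_image)
    return text_from_image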
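
As an alternative to passing `--tessdata-dir` on every call, the error message itself suggests setting the `TESSDATA_PREFIX` environment variable. The following is a minimal sketch, assuming the variable is set in the same Python process before the first OCR call; the helper name `use_tessdata_dir` is hypothetical, and the path is the one used in the code above.

```python
import os
from pathlib import Path


def use_tessdata_dir(tessdata_dir):
    """Point TESSDATA_PREFIX at tessdata_dir for Tesseract processes started later."""
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        raise NotADirectoryError(f"tessdata directory not found: {folder}")
    # Child processes inherit os.environ, so a Tesseract process launched
    # after this point sees the new value.
    os.environ["TESSDATA_PREFIX"] = str(folder)
    return folder


try:
    use_tessdata_dir("D:/Program_Files/Tesseract-OCR/tessdata")
except NotADirectoryError as err:
    print(err)
```

Raising early when the folder does not exist gives a clearer message than waiting for Tesseract to fail on a missing language file.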