Setting up a Tesseract-OCR environment, and how to fix the error pytesseract.pytesseract.TesseractError:
(1, 'Error opening data file C:\Program Files\Tesseract-OCR/tessdata/chi_sim.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'chi_sim' Tesseract couldn't load any languages! Could not initialize tesseract.')
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

def Tesseract_to_str(image):
    """Extract the text in an image and return it as a string."""
    # Extract text with pytesseract; recognizing Chinese requires lang='chi_sim'.
    text_from_image = pytesseract.image_to_string(image, lang='chi_sim')
    print(text_from_image)
    return text_from_image
Running the code above produces the following error:
  File "D:\ProgramFiles\miniconda3\envs\env_myenv\Lib\site-packages\pytesseract\pytesseract.py", line 489, in
    Output.STRING: lambda: run_and_get_output(*args),
                           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ProgramFiles\miniconda3\envs\env_myenv\Lib\site-packages\pytesseract\pytesseract.py", line 352, in run_and_get_output
    run_tesseract(**kwargs)
  File "D:\ProgramFiles\miniconda3\envs\env_myenv\Lib\site-packages\pytesseract\pytesseract.py", line 284, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, 'Error opening data file C:\Program Files\Tesseract-OCR/tessdata/chi_sim.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'chi_sim' Tesseract couldn't load any languages! Could not initialize tesseract.')
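The cause lies in how the environment was set up. The points to watch are:
1 Set up the Tesseract-OCR environment.
1.1 Tesseract-OCR itself must be installed manually first. Download: https://digi.bib.uni-mannheim.de/tesseract/?C=M;O=D
Note: select the Chinese language pack in the installer (the simplest way is to tick every option).
Install it on the same drive as the code you are going to run.
After installing Tesseract-OCR, add its install directory to the system PATH environment variable.
Once installation finishes, check the version and the supported languages from a command prompt: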
cd D:\Program Files\Tesseract-OCR
tesseract -v
tesseract --list-langs
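The same check can be scripted from Python. The following is a minimal sketch using only the standard library; the helper names find_tesseract and tesseract_version are hypothetical (they are not part of pytesseract), and the sketch assumes the executable prints its version banner when called with -v, as in the command above.

```python
import shutil
import subprocess


def find_tesseract(name="tesseract"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


def tesseract_version(exe_path):
    """Run `<exe> -v` and return the first line of whatever it prints."""
    result = subprocess.run(
        [exe_path, "-v"], capture_output=True, text=True, check=False
    )
    # Depending on the build, the banner may go to stdout or stderr.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    exe = find_tesseract()
    if exe is None:
        # Not on PATH: pytesseract.pytesseract.tesseract_cmd must be set by hand.
        print("tesseract was not found on PATH")
    else:
        print(exe, "->", tesseract_version(exe))
```

If find_tesseract() returns None, the install directory is not on PATH, and pytesseract will only work when tesseract_cmd is set explicitly.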
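If a language-related error appears, download the Chinese pack chi_sim.traineddata into the tessdata folder of the install directory (the error above looks for it at C:\Program Files\Tesseract-OCR/tessdata/chi_sim.traineddata).
Language pack download: https://tesseract-ocr.github.io/tessdoc/Data-Files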
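Before running any OCR, it can help to confirm that the language file is where Tesseract will look for it. This is a rough sketch, assuming the usual layout in which the .traineddata files sit in a tessdata folder directly under the install directory; missing_languages is a hypothetical helper name and the path in the usage line is only an example.

```python
from pathlib import Path


def missing_languages(install_dir, langs=("chi_sim",)):
    """Return the languages whose <lang>.traineddata is absent from <install_dir>/tessdata."""
    tessdata = Path(install_dir) / "tessdata"
    return [
        lang for lang in langs
        if not (tessdata / f"{lang}.traineddata").is_file()
    ]


if __name__ == "__main__":
    # Example path only; substitute the real install directory.
    print(missing_languages(r"C:\Program Files\Tesseract-OCR"))
```

An empty list means every requested language file is present; any name in the list still needs to be downloaded into tessdata.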
1.2 Then install the Python library pytesseract
pip install pytesseract
1.3 The code below should now run without the error
import pytesseract

def Tesseract_to_str(image):
    """Extract the text in an image and return it as a string."""
    # If the Tesseract install directory is not on the system PATH, point pytesseract at the executable.
    pytesseract.pytesseract.tesseract_cmd = r"D:\Program_Files\Tesseract-OCR\tesseract.exe"
    # Tell Tesseract explicitly where the language data lives.
    tessdata_dir_config = '--tessdata-dir D:/Program_Files/Tesseract-OCR/tessdata'
    # Extract text with pytesseract; recognizing Chinese requires lang='chi_sim'.
    print('-' * 20, 'Extracting text from image', '-' * 20)
    try:
        text_from_image = pytesseract.image_to_string(image, config=tessdata_dir_config, lang='chi_sim')
    except Exception as e:
        print('Failed to extract text! ', e)
        return ''
    print(text_from_image)
    return text_from_image
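The error message itself points to another route: setting TESSDATA_PREFIX. Below is a minimal sketch of that alternative, under two assumptions: that the variable should name the tessdata directory itself (as the error text says), and that the tesseract process started by pytesseract inherits the environment of the Python process. use_tessdata_dir is a hypothetical helper name, not a pytesseract function.

```python
import os
from pathlib import Path


def use_tessdata_dir(tessdata_dir):
    """Set TESSDATA_PREFIX for this process and return the value that was set."""
    tessdata = Path(tessdata_dir)
    if not tessdata.is_dir():
        # Fail early with a clear message instead of a later TesseractError.
        raise FileNotFoundError(f"tessdata directory not found: {tessdata}")
    os.environ["TESSDATA_PREFIX"] = str(tessdata)
    return os.environ["TESSDATA_PREFIX"]
```

Called once before the first image_to_string call, for example use_tessdata_dir("D:/Program_Files/Tesseract-OCR/tessdata"), this may make the --tessdata-dir option unnecessary. Passing the option explicitly, as in the function above, has the advantage of not depending on process-wide state.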