Manually installing pyocr for Python: extracting text from a PDF with Python (on Windows 10)

weixin_39531780

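Environment: Windows 10 | Python 3.6 | ImageMagick-6.9.9-38-Q8-x64-dll | Ghostscript 9.22 for Windows

Overall approach: 1. Convert the PDF to images and run OCR on them | 2. Parse the PDF file with pdfminer (more accurate)

1. Download and install Tesseract

Download the tesseract-xxx.exe installer from github.com/UB-Mannheim/tesseract/wiki and run it. Note that when choosing the components to install, you need to expand "Language data" and tick the languages you want to recognize; if you skip this, only English can be recognized.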
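
To confirm which languages actually got installed, something like the following can help. It is only a sketch: it assumes the tesseract executable is reachable on PATH (otherwise pass its full path as `cmd`), and the helper names `parse_language_list` and `installed_languages` are made up for this example. It relies on Tesseract's `--list-langs` command-line option.

```python
import shutil
import subprocess


def parse_language_list(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Drop the header line, which looks like "List of available languages (2):".
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_languages(cmd="tesseract"):
    """Return the installed language codes, or None if `cmd` cannot be found."""
    exe = shutil.which(cmd)
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some versions print the list to stderr
        universal_newlines=True,
    )
    return parse_language_list(result.stdout)


if __name__ == "__main__":
    langs = installed_languages()
    if langs is None:
        print("tesseract was not found on PATH")
    else:
        print("Installed languages:", ", ".join(langs))
```

If the language you need (for example chi_sim) is missing from the list, re-run the installer and tick it under "Language data".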

2. Install pyocr, Wand, and Pillow

pip install pyocr

pip install Wand

pip install Pillow

If you are on Python 2.x, download the Python Imaging Library (PIL) .exe installer from http://pythonware.com/products/pil/ and install it.
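
A quick way to check that the three packages ended up in the interpreter you are actually using is to look them up by import name. This is a small sketch; `missing_modules` is a helper name invented here. Note that the import names differ from the pip names: Wand is imported as `wand` and Pillow as `PIL`.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # pip package -> import name: pyocr -> pyocr, Wand -> wand, Pillow -> PIL
    missing = missing_modules(["pyocr", "wand", "PIL"])
    if missing:
        print("Not installed in this interpreter:", ", ".join(missing))
    else:
        print("pyocr, wand and PIL are all importable")
```

This only checks that the packages can be located; Wand can still fail later if ImageMagick itself is missing, which is covered in the next step.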

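3. Download and install ImageMagick and Ghostscript

Wand depends on ImageMagick, and ImageMagick in turn relies on Ghostscript to read PDF files, so download and install both (the Ghostscript download link is given in section 7).

4. Set the TESSDATA_PREFIX environment variable

Set its value to the Tesseract installation directory. After changing it, restart your project so the new value is picked up.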
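
Before going further it can be worth checking that the variable really points somewhere useful. The sketch below looks for `*.traineddata` files; `find_traineddata` is a hypothetical helper, and checking both locations is my own assumption, made because some Tesseract versions expect the variable to point at the folder containing "tessdata" and others at "tessdata" itself.

```python
import os


def find_traineddata(prefix):
    """Return the sorted language-data file names reachable from `prefix`."""
    # Check <prefix>/tessdata first, then <prefix> itself (assumption: which
    # of the two is expected depends on the Tesseract version).
    for folder in (os.path.join(prefix, "tessdata"), prefix):
        if os.path.isdir(folder):
            files = sorted(
                name for name in os.listdir(folder)
                if name.endswith(".traineddata")
            )
            if files:
                return files
    return []


if __name__ == "__main__":
    prefix = os.environ.get("TESSDATA_PREFIX")
    if not prefix:
        print("TESSDATA_PREFIX is not set in this process")
    else:
        print("Language data found:", find_traineddata(prefix) or "none")
```

If the variable shows as unset even though you just created it, the program was probably started before the change; that is why the restart matters.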

5. Edit the tesseract.py file inside the pyocr package

# CHANGE THIS IF TESSERACT IS NOT IN YOUR PATH, OR IS NAMED DIFFERENTLY

#TESSERACT_CMD = 'tesseract.exe' if os.name == 'nt' else 'tesseract'

#TESSERACT_CMD = 'D:/Program Files (x86)/Tesseract-OCR/tesseract.exe' if os.name == 'nt' else 'tesseract'

TESSERACT_CMD = os.environ["TESSDATA_PREFIX"] + '/tesseract.exe' if os.name == 'nt' else 'tesseract'
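
One thing to be aware of: on Windows the edited line raises KeyError as soon as it is evaluated if TESSDATA_PREFIX is not set. A slightly more forgiving version could look like the sketch below. The function name `tesseract_cmd` and the fallback behaviour are my own choices, not part of pyocr.

```python
import os


def tesseract_cmd(environ=os.environ, os_name=os.name):
    """Build the Tesseract command like the edited line, with a fallback."""
    if os_name != "nt":
        return "tesseract"
    prefix = environ.get("TESSDATA_PREFIX")
    if not prefix:
        # Variable missing: use pyocr's original default and rely on PATH.
        return "tesseract.exe"
    # Strip a trailing slash or backslash so the path has a single separator.
    return prefix.rstrip("/\\") + "/tesseract.exe"


if __name__ == "__main__":
    print(tesseract_cmd())
```

Assuming your pyocr version still has the module-level TESSERACT_CMD shown above, the result could be assigned to it in tesseract.py in place of the hard-coded expression.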

6. Write and run the program

With the preparation done, let's run the program!

# -*- coding: utf-8 -*-  (not needed on Python 3)
from wand.image import Image
from PIL import Image as PI
import pyocr
import pyocr.builders
import io
import sys


def main():
    # Find the OCR tools pyocr can use (Tesseract in this setup)
    tools = pyocr.get_available_tools()
    if len(tools) == 0:
        print("No OCR tool found")
        sys.exit(1)
    tool = tools[0]
    print("Will use tool '%s'" % (tool.get_name()))

    langs = tool.get_available_languages()
    print("Available languages: %s" % ", ".join(langs))
    lang = langs[0]  # simply takes the first installed language
    print("Will use lang '%s'" % (lang))

    req_image = []
    final_text = []

    # Render the PDF at a resolution of 400 and convert it to JPEG
    image_pdf = Image(filename="./pdf_file/stackoverflow.pdf", resolution=400)
    image_jpeg = image_pdf.convert('jpeg')
    image_jpeg.save(filename='./pdf2img/stackoverflow.jpeg')

    # Turn each page into an in-memory JPEG blob
    for img in image_jpeg.sequence:
        img_page = Image(image=img)
        req_image.append(img_page.make_blob('jpeg'))

    # Run OCR on each page image
    for img in req_image:
        txt = tool.image_to_string(
            PI.open(io.BytesIO(img)),
            lang=lang,
            builder=pyocr.builders.TextBuilder()
        )
        final_text.append(txt)

    print(final_text)
    # for text in final_text:
    #     print(text)


if __name__ == '__main__':
    main()
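
The program uses whatever language happens to come first in the list returned by `tool.get_available_languages()`, which may not be the language of your PDF. A small helper can make the choice explicit. This is a sketch: `pick_language` and its default preference order are my own invention, not part of pyocr.

```python
def pick_language(available, preferred=("chi_sim", "eng")):
    """Return the first preferred language that is installed.

    Falls back to the first available language when none of the
    preferred ones is installed.
    """
    for lang in preferred:
        if lang in available:
            return lang
    if not available:
        raise ValueError("no Tesseract language data is installed")
    return available[0]


if __name__ == "__main__":
    # Illustrative list only; in the program it would come from
    # tool.get_available_languages().
    print(pick_language(["afr", "chi_sim", "eng"]))
```

In the program, the language selection line would then become `lang = pick_language(langs)`.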

Generated image

Program output

7. Problems encountered and how to fix them

① Running the program raises OSError: cannot find library; tried paths:

‘D:\Program Files\ImageMagick-7.0.7-Q16\CORE_RL_wand_.dll’,

‘D:\Program Files\ImageMagick-7.0.7-Q16\libMagickWand.dll’,

…

I went back to the official Wand site and found this passage:

Wand yet doesn’t support ImageMagick 7 which has several incompatible APIs with previous versions. For more details, see the issue #287.

Uninstalling ImageMagick 7 and installing ImageMagick-6.9.9-38-Q8-x64-dll instead fixed the problem.
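
If you are not sure which ImageMagick you have, the major version can be read from its version string. The sketch below only does the parsing; `imagemagick_major` is a hypothetical helper and the sample string is illustrative.

```python
import re


def imagemagick_major(version_text):
    """Return the major version found in an ImageMagick version string, or None."""
    match = re.search(r"ImageMagick\s+(\d+)\.", version_text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Illustrative sample; paste the first line of your own version output.
    sample = "Version: ImageMagick 6.9.9-38 Q8 x64"
    major = imagemagick_major(sample)
    if major is not None and major >= 7:
        print("ImageMagick %d found; the Wand note above says 7 is unsupported" % major)
    else:
        print("ImageMagick major version:", major)
```

The string can come from ImageMagick's own `-version` output. If Wand does load, I believe `wand.version.MAGICK_VERSION` carries the same text, but treat that as an assumption for your Wand version.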

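② DelegateError: PDFDelegateFailed `(followed by garbled characters) (this item can be skipped)

After a long search I found someone who had hit the same problem on Python 3.6 but had no solution, so I decided to try switching to Python 2.7. If you don't need to switch Python versions, go straight to the next item.

(Running multiple Python versions side by side with Spyder in Anaconda3)

Everything below is done in the Anaconda command prompt.

1) First create a conda environment named python2 with Python 2.7 installed in it

conda create --name python2 python=2.7

2) Activate the python2 environment

activate python2

3) Install Spyder and Jupyter Notebook in the python2 environment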

conda install spyder

#conda install jupyter

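③ DelegateError: PDFDelegateFailed `The system cannot find the file specified.' @ error/pdf.c/ReadPDFImage/

I had been running the program in PyCharm. After switching to Python 2.7 I first tried it in Spyder, and the garbled characters from before turned into a readable message (in Chinese on my system: 系统找不到指定的文件。). A search showed that the cause was that Ghostscript was not installed; installing it fixed the problem. I then switched back to Python 3.6 and the error was gone there too.

Ghostscript: https://ghostscript.com/download/gsdnld.html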
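
A rough way to see whether Ghostscript is installed is to look for its command-line executable. This is a sketch with a hypothetical helper name, `find_ghostscript`; on Windows the console executables are named gswin64c or gswin32c, and on other systems gs.

```python
import shutil


def find_ghostscript(candidates=("gswin64c", "gswin32c", "gs")):
    """Return the path of the first Ghostscript executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    gs = find_ghostscript()
    print("Ghostscript:", gs if gs else "not found on PATH")
```

Keep in mind that this only searches PATH. The Windows installer does not necessarily add Ghostscript to PATH, so "not found" here is a hint to check the installation, not proof that it is missing.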

④ No OCR tool found, or pytesseract.TesseractError:

(1, ‘Error opening data file \Program Files (x86)\Tesseract-OCR\tessdata/chi_sim.traineddata’)

After installing Tesseract you need to set the TESSDATA_PREFIX environment variable and edit the tesseract.py file:

# CHANGE THIS IF TESSERACT IS NOT IN YOUR PATH, OR IS NAMED DIFFERENTLY

#TESSERACT_CMD = 'tesseract.exe' if os.name == 'nt' else 'tesseract'

#TESSERACT_CMD = 'D:/Program Files (x86)/Tesseract-OCR/tesseract.exe' if os.name == 'nt' else 'tesseract'

TESSERACT_CMD = os.environ["TESSDATA_PREFIX"] + '/tesseract.exe' if os.name == 'nt' else 'tesseract'

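D:\Program Files (x86)\Tesseract-OCR (for reference only; use your actual installation path)

If the environment variable is not set, running tesseract from a path outside the drive it is installed on prints:

Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory

Note that you need to restart the project after setting the environment variable.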

8. Parsing the PDF file with pdfminer (more accurate)

① Install pdfminer3k

pip install pdfminer3k

② Write and run the program

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal, LAParams
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed

''' Parse the text of a PDF and save it to a txt file '''

path = r'./pdf_file/stackoverflow.pdf'


def parse():
    # Open the PDF in binary read mode
    with open(path, 'rb') as fp:
        # Create a PDF parser from the file object
        parser = PDFParser(fp)
        # Create a PDF document
        doc = PDFDocument()
        # Connect the parser and the document object
        parser.set_document(doc)
        doc.set_parser(parser)
        # Supply the initial password;
        # with no argument an empty string is used
        doc.initialize()

        # Check whether the document allows text extraction; if not, stop
        if not doc.is_extractable:
            raise PDFTextExtractionNotAllowed

        # Create a PDF resource manager to manage shared resources
        rsrcmgr = PDFResourceManager()
        # Create a PDF device object
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        # Create a PDF interpreter object
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        # Open the output file once; utf-8 avoids encoding errors on Windows
        with open(r'./pdf_file/stackoverflow.txt', 'a', encoding='utf-8') as f:
            # Process one page at a time; doc.get_pages() yields the pages
            for page in doc.get_pages():
                interpreter.process_page(page)
                # Receive the LTPage object for this page
                layout = device.get_result()
                # layout is an LTPage holding the objects parsed from the page,
                # typically LTTextBox, LTFigure, LTImage, LTTextBoxHorizontal
                # and so on; to get text, call get_text() on the text boxes
                for x in layout:
                    if isinstance(x, LTTextBoxHorizontal):
                        results = x.get_text()
                        print(results)
                        f.write(results + '\n')


if __name__ == '__main__':
    parse()
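
pdfminer reads the text stored inside the PDF, so for a scanned PDF that contains only page images it returns little or nothing, and that is when the OCR route from section 6 is needed. One way to combine the two approaches is sketched below. The function `extract_text`, the `ocr` callback and the `min_chars` threshold are all assumptions made for illustration.

```python
def extract_text(text_layer, ocr, min_chars=20):
    """Prefer the parsed text; call `ocr()` only when it looks empty.

    text_layer: text obtained from the PDF parser (may be empty).
    ocr:        a zero-argument callable that runs OCR and returns text.
    min_chars:  arbitrary threshold below which the parsed text is
                treated as missing.
    """
    cleaned = text_layer.strip()
    if len(cleaned) >= min_chars:
        return cleaned
    return ocr()


if __name__ == "__main__":
    # Stand-in values; in practice text_layer would be the text collected
    # by the pdfminer program and ocr would wrap the section 6 program.
    print(extract_text("", lambda: "text recognised by OCR"))
```

Because `ocr` is only called when needed, the slower image conversion and recognition step is skipped for PDFs that already have a usable text layer.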
