OpenCV ROI Extraction: Extracting Text from Scanned PDFs Using PyTesseract and OpenCV

Extracting information from a digital copy of an invoice can be tricky. Various tools on the market can perform this task, but for a number of reasons many people prefer to solve the problem with open-source libraries.

I ran into a similar problem a few days ago and want to share the approach I used to solve it. The solution is written in Python and uses pdf2image (to convert PDFs to images), OpenCV (for image pre-processing), and PyTesseract (for OCR).

Converting PDF to Image

pdf2image is a Python library that converts a PDF into a sequence of PIL Image objects by wrapping the pdftoppm utility from Poppler. It can be installed with pip using the following command.

pip install pdf2image

Note: pdf2image depends on Poppler, a PDF rendering library based on the xpdf-3.0 code base, and will not work without it. See the resources below for Poppler download and installation instructions.

https://anaconda.org/conda-forge/poppler

https://stackoverflow.com/questions/18381713/how-to-install-poppler-on-windows
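
Once pdf2image and Poppler are installed, the conversion step might look roughly like the sketch below. It is only an illustration, not necessarily the exact code used for this solution: the helper names (poppler_available, page_filename, pdf_to_images), the output folder "pages", and the 300 DPI setting are all assumptions made for the example. The only pdf2image call used is convert_from_path, which returns one PIL Image per page.

```python
import shutil
from pathlib import Path


def poppler_available():
    """Return True if Poppler's pdftoppm executable is on the PATH."""
    return shutil.which("pdftoppm") is not None


def page_filename(pdf_path, page_number):
    """Build an output name such as 'invoice_page_001.png' (hypothetical scheme)."""
    return f"{Path(pdf_path).stem}_page_{page_number:03d}.png"


def pdf_to_images(pdf_path, out_dir="pages", dpi=300):
    """Convert each page of a PDF to a PNG file and return the saved paths.

    The dpi default of 300 is an assumption; higher values usually give
    OCR more detail to work with at the cost of larger images.
    """
    # Imported here so the helpers above work even without pdf2image installed.
    from pdf2image import convert_from_path

    if not poppler_available():
        raise RuntimeError("pdftoppm not found; install Poppler first")

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # convert_from_path returns a list of PIL Image objects, one per page.
    pages = convert_from_path(str(pdf_path), dpi=dpi)

    saved = []
    for number, page in enumerate(pages, start=1):
        target = out / page_filename(pdf_path, number)
        page.save(target, "PNG")
        saved.append(target)
    return saved
```

Calling pdf_to_images("invoice.pdf") on a hypothetical two-page file would write pages/invoice_page_001.png and pages/invoice_page_002.png, which can then be loaded with OpenCV for pre-processing.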