OpenCV ROI Extraction: Extracting Text from Scanned PDFs Using PyTesseract and OpenCV

Extracting information from a digital copy of an invoice can be tricky. Various commercial tools are available for this task. However, for a number of reasons, many people prefer to solve the problem with open-source libraries.

I came across a similar problem a few days ago and want to share the approach I used to solve it. The solution is written in Python and uses pdf2image (to convert PDFs to images), OpenCV (for image pre-processing), and PyTesseract (for OCR).

Converting PDF to Image

pdf2image is a Python library that converts a PDF into a sequence of PIL Image objects using the pdftoppm tool. It can be installed with pip:

pip install pdf2image
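
Once the library is installed, the conversion step might look roughly like the sketch below. It assumes Poppler is already available, and the file name invoice.pdf, the output folder pages, the 300 dpi setting, and the helper names page_image_path and pdf_to_images are all illustrative choices rather than part of the original solution.

```python
from pathlib import Path


def page_image_path(pdf_path, page_number, out_dir="pages"):
    """Hypothetical naming scheme: <out_dir>/<pdf stem>_page_<n>.png."""
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}_page_{page_number}.png"


def pdf_to_images(pdf_path, out_dir="pages", dpi=300):
    """Render every page of a PDF to a PNG file and return the saved paths."""
    # Imported lazily so the naming helper works even without pdf2image.
    from pdf2image import convert_from_path

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # convert_from_path returns a list of PIL Image objects, one per page.
    pages = convert_from_path(str(pdf_path), dpi=dpi)
    saved = []
    for number, page in enumerate(pages, start=1):
        target = page_image_path(pdf_path, number, out_dir)
        page.save(target, "PNG")
        saved.append(target)
    return saved
```

Calling pdf_to_images("invoice.pdf") would then write one PNG per page into the pages folder. A higher dpi value generally gives OCR more detail to work with, at the cost of larger images and slower processing.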

Note: pdf2image relies on Poppler, a PDF rendering library based on the xpdf-3.0 code base, and will not work without it. See the resources below for Poppler download and installation instructions.

https://anaconda.org/conda-forge/poppler

https://stackoverflow.com/questions/18381713/how-to-install-poppler-on-windows
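
After installing Poppler, a quick way to confirm that it is visible to Python is to look for the pdftoppm executable on the system PATH. The sketch below is only a sanity check under that assumption, and the function name poppler_available is made up for illustration.

```python
import shutil


def poppler_available():
    """Return True if the pdftoppm executable can be found on PATH."""
    return shutil.which("pdftoppm") is not None


if poppler_available():
    print("Poppler found: pdf2image should be able to render PDFs.")
else:
    print("Poppler not found on PATH.")
```

If Poppler is installed somewhere that is not on PATH, which is common on Windows, pdf2image's convert_from_path also accepts a poppler_path argument pointing at the folder that contains the Poppler binaries.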
