Extracting Text from PDF Files
In NLP projects, input documents often arrive as PDFs. Some PDFs already contain underlying text information, which makes it possible to extract text without OCR tools. Below I present some open-source PDF tools available in Python that can be used to extract text, compare their features, and point out some drawbacks.
Those tools are PyPDF2, pdfminer and PyMuPDF.
There are other Python PDF libraries that either cannot extract text or focus on other tasks. There are also tools that can extract text from PDF documents but are not available in Python. Neither group is discussed here.
Introduction
We have already discussed different OCR tools for automatically extracting text from documents. Although some of these tools perform well, they still make errors. So, when the goal is to extract information from documents, one either has to build robust models that can cope with small errors or look for alternative ways of extracting text. For images and documents with no underlying text information, there is no alternative to OCR. But for PDF documents with underlying text, the question arises whether one can access this text directly and so avoid possible OCR errors. I want to discuss this and share insights from our experience in recent projects.
First of all, it should be mentioned that PDF was not made for retrieving text information. PDF stands for Portable Document Format and was developed by Adobe. The main goal was to exchange information independently of the platform while preserving and protecting the content and layout of a document. As a result, PDFs are hard to edit, and extracting information from them is difficult, although not impossible.
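To make the layout-first design a bit more concrete: inside a PDF, page text is stored as drawing instructions in a content stream, not as running prose. The following is only a toy sketch under strong assumptions. The stream fragment is hand-written and uncompressed, and the parser (`parse_text_ops`, a hypothetical helper) understands just the `BT`, `Td` and `Tj` operators with unescaped strings. Real PDFs need a proper library. It shows how the order in the stream can differ from the order a reader sees on the page.

```python
import re

# Hand-written, uncompressed content stream fragment (an illustrative assumption,
# not taken from any real document). "World" is drawn first, then the position
# moves 20 units up and "Hello" is drawn above it.
CONTENT_STREAM = """\
BT
/F1 12 Tf
72 700 Td
(World) Tj
0 20 Td
(Hello) Tj
ET
"""

TD_OP = re.compile(r"^(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s+Td$")
TJ_OP = re.compile(r"^\((.*)\)\s+Tj$")


def parse_text_ops(stream):
    """Return (x, y, text) for every Tj operator, tracking Td offsets."""
    x = y = 0.0
    shown = []
    for raw_line in stream.splitlines():
        line = raw_line.strip()
        if line == "BT":
            # A text object starts at the origin.
            x = y = 0.0
            continue
        move = TD_OP.match(line)
        if move:
            # Td moves relative to the start of the current line.
            x += float(move.group(1))
            y += float(move.group(2))
            continue
        show = TJ_OP.match(line)
        if show:
            shown.append((x, y, show.group(1)))
    return shown


ops = parse_text_ops(CONTENT_STREAM)
stream_order = [text for _, _, text in ops]
# PDF y coordinates grow upward, so reading order is top (large y) to bottom.
reading_order = [text for _, _, text in sorted(ops, key=lambda op: (-op[1], op[0]))]

print(stream_order)   # ['World', 'Hello']
print(reading_order)  # ['Hello', 'World']
```

An extraction tool that simply follows the stream would return "World Hello" here, which is why some tools also use the position information.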
Second, one has to decide how much information is actually needed. Do you only need the plain text, do you also need the position of the text, or do you perhaps want font information as well? The same questions matter when choosing a suitable OCR tool. Everything is possible, but the task gets more complex and messier with each additional layer of information.
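As a rough illustration of these three layers, here is a minimal sketch using PyMuPDF, one of the libraries compared in this article. It assumes a reasonably recent PyMuPDF version in which the page method is called `get_text` (older releases used `getText`). The helper names `flatten_spans` and `extract_layers` are hypothetical and exist only for this example.

```python
def flatten_spans(page_dict):
    """Flatten the nested "dict" output into (text, bbox, font, size) tuples."""
    spans = []
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" entry, so they contribute nothing here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                spans.append(
                    (span["text"], tuple(span["bbox"]), span["font"], span["size"])
                )
    return spans


def extract_layers(pdf_path, page_number=0):
    """Return plain text, positioned words, and font-annotated spans of one page."""
    import fitz  # PyMuPDF; imported lazily so flatten_spans works without it

    doc = fitz.open(pdf_path)
    try:
        page = doc[page_number]
        # Layer 1: plain text only.
        plain_text = page.get_text("text")
        # Layer 2: one tuple per word, starting with its bounding box
        # (x0, y0, x1, y1) followed by the word itself.
        words = page.get_text("words")
        # Layer 3: nested blocks -> lines -> spans, including font name and size.
        spans = flatten_spans(page.get_text("dict"))
    finally:
        doc.close()
    return plain_text, words, spans
```

Calling `extract_layers("sample.pdf")` (the file name is a placeholder) returns all three layers for the first page. In practice one would request only the layer that is needed, since each richer layer requires more post-processing.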
We will test the three libraries on three simple sample PDFs:
![Image for post](https://miro.medium.com/max/9999/1*yEP0vXt-R3Ft6nv3yRFu3Q.png)
![Image for post](https://miro.medium.com/max/9999/1*EDdLM4mhRmW9u8pu5IfXcg.png)
![Image for post](https://miro.medium.com/max/9999/1*3CtLX-PByDdkcfIzind-Mg.png)
PyPDF2
PyPDF2 is a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can retrieve metadata from PDFs, such as author, creator, and creation date. It can also retrieve the PDF text as found in the content stream. This means the text may not come out in logical order if it is not stored that way in the stream object associated with the PDF. Illogical ordering should not happen in general, but as the documents get more complex