How to Extract Text from PDF Files

In NLP projects, the input documents often come as PDFs. Sometimes the PDFs already contain underlying text information, which makes it possible to extract text without OCR tools. Below I present some open-source PDF tools available in Python that can be used to extract text, compare their features, and point out some drawbacks.

Those tools are PyPDF2, pdfminer and PyMuPDF.

There are other Python PDF libraries that either cannot extract text or focus on other tasks. There are also tools that can extract text from PDF documents but are not available in Python. Neither group is discussed here.

Introduction

We have already discussed different OCR tools for automatically extracting text from documents. Although some of these tools perform well, they still make errors. So, to extract information from documents, one either has to build robust models that can cope with small errors or look for alternative ways of extracting text. For images and documents with no underlying text information, there is no alternative to OCR tools. But for PDF documents with underlying text, the question arises whether one can access this text directly and so avoid possible OCR errors. I want to discuss this and share insights from our experience in recent projects.

First of all, it should be mentioned that PDF was not designed for retrieving text information. PDF stands for Portable Document Format and was developed by Adobe. The main goal was to exchange information independently of platform while preserving and protecting the content and layout of a document. As a result, PDFs are hard to edit, and extracting information from them is difficult, though not impossible.

Second, one has to decide how much information is actually needed. Do you only need the plain text, do you also need the position of the text, or do you also want some font information? These questions are also important when choosing a suitable OCR tool. Everything is possible, but the task gets more complex and messier with each additional layer of information needed.
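To make these layers of information concrete, here is a minimal sketch of how extracted text could be represented. The `TextSpan` class, its fields, and the `plain_text` helper are hypothetical names chosen for illustration; none of them comes from the libraries discussed here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TextSpan:
    """One piece of extracted text; every field beyond `text` is optional."""
    text: str                                                   # layer 1: plain text
    bbox: Optional[Tuple[float, float, float, float]] = None    # layer 2: position (x0, y0, x1, y1)
    font: Optional[str] = None                                  # layer 3: font name
    size: Optional[float] = None                                # layer 3: font size


def plain_text(spans: List[TextSpan]) -> str:
    """Discard the position and font layers and keep only the text."""
    return " ".join(span.text for span in spans)
```

If only the first layer is needed, the optional fields stay empty and `plain_text` is all that is used downstream. Each further layer means more fields to fill and to validate.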

We will test the three libraries on three simple sample PDFs:

Sample PDFs 1, 2 and 3 (from left to right).

PyPDF2

PyPDF2 is a pure Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can retrieve metadata from PDFs, such as author, creator, and creation date. It can also retrieve the PDF text as found in the content stream. This means that the text might not be ordered logically if it is not ordered that way in the stream object associated with the PDF. Illogical ordering should not happen in general, but as the documents get more complex
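The metadata and text retrieval described above could look roughly like the following sketch. It assumes a recent PyPDF2 release that provides `PdfReader`, `reader.metadata`, `reader.pages`, and `page.extract_text()`; older releases used different names such as `PdfFileReader` and `extractText`. The helper names and the file name in the usage note are hypothetical.

```python
from typing import Any, Dict, List


def read_metadata(reader: Any) -> Dict[str, Any]:
    """Collect a few document-information fields; missing ones become None."""
    info = reader.metadata  # may be None if the PDF has no info dictionary
    fields = ("author", "creator", "creation_date")
    # getattr with a default guards against attributes absent in some versions.
    return {name: getattr(info, name, None) for name in fields}


def read_page_texts(reader: Any) -> List[str]:
    """Return the text of each page, in the order given by the content stream."""
    return [page.extract_text() or "" for page in reader.pages]


def describe_pdf(path: str) -> Dict[str, Any]:
    """Open a PDF with PyPDF2 and return its metadata and per-page text."""
    # Imported here so the helpers above can be used with any reader-like object.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return {"metadata": read_metadata(reader), "pages": read_page_texts(reader)}
```

Calling `describe_pdf("sample.pdf")` on a PDF with underlying text would return the document information and one string per page. The strings follow content-stream order, so they are not necessarily in reading order.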
