slate is a Python package that simplifies extracting text
from PDF files. It depends on the PDFMiner package.
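slate provides one class, PDF. PDF takes a file-like object and
extracts all text from the document, presenting each page
as a string of text:

>>> import slate
>>> with open('example.pdf', 'rb') as f:
...     doc = slate.PDF(f)
...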
>>> doc
[..., ..., ...]
>>> doc[1]
'Text from page 2...'
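
Since the document object shown above can be indexed like a list of page strings, ordinary list operations should apply to it. The sketch below is a minimal illustration of that idea: `pages_containing` is a hypothetical helper (not part of slate), and a plain list stands in for a `slate.PDF` object so the example runs without a PDF file.

```python
def pages_containing(doc, keyword):
    """Return the 1-based page numbers whose text contains keyword.

    `doc` is assumed to be any sequence with one string per page,
    which is how the slate.PDF example above behaves.
    The match is case-insensitive.
    """
    needle = keyword.lower()
    return [
        number
        for number, text in enumerate(doc, start=1)
        if needle in text.lower()
    ]


# Stand-in for slate.PDF(f): a plain list with one string per page.
sample_doc = ["Introduction to slate", "Text from page 2...", "Closing text"]
matches = pages_containing(sample_doc, "text")
```

With a real document, the same helper would be called as `pages_containing(doc, 'keyword')` after `doc = slate.PDF(f)`.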
If your PDF is password protected, pass the password as the
second argument:

>>> with open('secrets.pdf', 'rb') as f:
...     doc = slate.PDF(f, 'password')
...
>>> doc[0]
"My mother doesn't know this, but..."
More complex operations
If you would like access to the images, font files and other
information, then take some time to learn the PDFMiner API.
What is wrong with PDFMiner?
1. Getting simple things done, like extracting the text,
   is quite complex. The program is not designed to return
   Python objects, which makes interfacing with it irritating.
2. It’s an extremely complete set of tools, with multiple
   and moderately steep learning curves.
3. It’s not written with hackability in mind.
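
To give a sense of the boilerplate the first point refers to, here is a rough sketch of per-page text extraction written directly against the PDFMiner API. It assumes the pdfminer.six fork, whose module layout differs from the older PDFMiner releases that slate was built on. `extract_page_texts` is a hypothetical helper name, not part of either library.

```python
def extract_page_texts(path, password=""):
    """Return a list with one text string per page of the PDF at `path`.

    Assumption: pdfminer.six is installed. The imports are done lazily
    so this module still loads when it is not.
    """
    from io import StringIO

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    pages = []
    with open(path, "rb") as fh:
        parser = PDFParser(fh)
        document = PDFDocument(parser, password=password)
        manager = PDFResourceManager()
        for page in PDFPage.create_pages(document):
            # A fresh buffer and converter per page keeps each page's text separate.
            buffer = StringIO()
            device = TextConverter(manager, buffer, laparams=LAParams())
            PDFPageInterpreter(manager, device).process_page(page)
            device.close()
            pages.append(buffer.getvalue())
    return pages
```

Seven imports and four cooperating objects for what slate expresses as `slate.PDF(f)` is roughly the gap the list above describes.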