Finding the page that contains PDF content in Python: retrieving page numbers from a document using pyPDF

Latest recommended article published on 2022-07-19 23:23:26

weixin_39883079

Tags: finding the page that contains PDF content in Python

Article link: https://blog.csdn.net/weixin_39883079/article/details/111514204

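When merging PDF files with pyPdf, the input files are sometimes in the wrong order, so the author plans to scrape each page for its page number to determine the correct order. The question is how to extract page number information from a PDF. 1) Some PDFs store page numbers in the document data, but the author cannot find that information after reading the file with pyPdf; 2) if it cannot be obtained directly, the fallback is to iterate over the objects on each page to look for the page number. The answer points to Adobe's PDF Reference and suggests using pyPdf's trailer and IndirectObject to get at the page number information, or inspecting the text objects to identify the page number.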

Summary generated by CSDN using automated technology

At the moment I'm looking into doing some PDF merging with pyPdf, but sometimes the inputs are not in the right order, so I'm looking into scraping each page for its page number to determine the order it should go in (e.g. if someone split up a book into 20 10-page PDFs and I want to put them back together).

I have two questions - 1.) I know that sometimes the page number is stored in the document data somewhere, as I've seen PDFs that render on Adobe as something like [1243] (10 of 150), but I've read documents of this sort into pyPDF and I can't find any information indicating the page number - where is this stored?

2.) If avenue #1 isn't available, I think I could iterate through the objects on a given page to try to find a page number - likely it would be its own object that has a single number in it. However, I can't seem to find any clear way to determine the contents of objects. If I run:

pdf.getPage(0).getContents()

This usually either returns:

{'/Filter': '/FlateDecode'}

or it returns a list of IndirectObject(num, num) objects. I don't really know what to do with either of these and there's no real documentation on it as far as I can tell. Is anyone familiar with this kind of thing that could point me in the right direction?

Solution

For full documentation, see Adobe's 978-page PDF Reference. :-)

More specifically, the PDF file contains metadata that indicates how the PDF's physical pages are mapped to logical page numbers and how page numbers should be formatted. This is where you go for canonical results. Example 2 of this page shows how this looks in the PDF markup. You'll have to fish that out, parse it, and perform a mapping yourself.

In PyPDF, to get at this information, try, as a starting point:

pdf.trailer["/Root"]["/PageLabels"]["/Nums"]

By the way, when you see an IndirectObject instance, you can call its getObject() method to retrieve the actual object being pointed to.
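As a rough sketch of the "perform a mapping yourself" step, the code below turns a /Nums array into one label per physical page. It assumes the array has already been read and resolved (with getObject() where needed) into a flat list of the form [start_index, label_dict, start_index, label_dict, ...], and that the dictionary keys /S (style), /P (prefix) and /St (start value) compare equal to plain strings and integers. The function names page_labels, to_roman and to_letters are made up for this illustration and are not part of PyPDF.

```python
def to_roman(n):
    """Convert a positive integer to an uppercase Roman numeral."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
             (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
             (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def to_letters(n):
    """PDF letter style: A..Z for 1..26, then AA..ZZ for 27..52, and so on."""
    repeat, offset = divmod(n - 1, 26)
    return chr(ord("A") + offset) * (repeat + 1)


def format_number(style, n):
    """Format n according to a page label style name (/D, /R, /r, /A, /a)."""
    if style == "/D":
        return str(n)
    if style == "/R":
        return to_roman(n)
    if style == "/r":
        return to_roman(n).lower()
    if style == "/A":
        return to_letters(n)
    if style == "/a":
        return to_letters(n).lower()
    return ""  # no style: the label consists of the prefix only


def page_labels(nums, page_count):
    """Return one label per physical page (0-based index) from a /Nums list."""
    # Pair up the flat list: (first page index of the range, label dictionary).
    ranges = sorted(((nums[i], nums[i + 1]) for i in range(0, len(nums), 2)),
                    key=lambda pair: pair[0])
    labels = []
    for page_index in range(page_count):
        current = None
        for start, spec in ranges:
            if start <= page_index:
                current = (start, spec)  # last range starting at or before this page
        if current is None:
            # Assumption: pages not covered by any range get plain 1-based numbers.
            labels.append(str(page_index + 1))
            continue
        start, spec = current
        number = spec.get("/St", 1) + (page_index - start)
        labels.append(spec.get("/P", "") + format_number(spec.get("/S"), number))
    return labels
```

With a document that has four front-matter pages in lowercase Roman numerals, three decimal pages, and an appendix prefixed "A-" starting at 8, page_labels([0, {"/S": "/r"}, 4, {"/S": "/D"}, 7, {"/S": "/D", "/P": "A-", "/St": 8}], 9) yields the labels i, ii, iii, iv, 1, 2, 3, A-8, A-9. Many PDFs have no /PageLabels entry at all, so the lookup shown earlier should be guarded accordingly.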

Your alternative is, as you say, to check the text objects and try to figure out which is the page number. You could use extractText() of the page object for this, but you'll get one string back and have to try to fish out the page number from that. (And of course the page number might be Roman or alphabetic instead of numeric, and some pages may not be numbered.) Instead, have a look at how extractText() actually does its job—PyPDF is written in Python, after all—and use it as a basis of a routine that checks each text object on the page individually to see if it's like a page number. Be wary of TOC/index pages that have lots of page numbers on them!
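The per-object check could look something like the sketch below. It assumes you have already adapted extractText() to hand back the text fragments of a page as a list of strings; that adaptation is not shown here. The names looks_like_page_number and page_number_candidates, and the max_candidates threshold, are made up for this illustration.

```python
import re

# A short run of digits, e.g. "12" or "1243".
ARABIC = re.compile(r"^\d{1,4}$")

# A well-formed Roman numeral in either case; the lookahead rejects the empty string.
ROMAN = re.compile(
    r"^(?=[ivxlcdm]+$)m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$",
    re.IGNORECASE,
)


def looks_like_page_number(fragment):
    """True if the whole fragment is a short Arabic or Roman numeral."""
    text = fragment.strip()
    return bool(ARABIC.match(text) or ROMAN.match(text))


def page_number_candidates(fragments, max_candidates=3):
    """Return the fragments that could be a page number.

    Returns an empty list when there are more than max_candidates matches,
    on the assumption that such a page is a table of contents or an index.
    """
    candidates = [f.strip() for f in fragments if looks_like_page_number(f)]
    if len(candidates) > max_candidates:
        return []
    return candidates
```

This is only a heuristic: a fragment such as "I" or "mix" is also a valid Roman numeral, alphabetic page numbers are not handled, and unnumbered pages simply produce no candidates. Comparing the candidates across consecutive pages (they should increase by one) would be a sensible next filter.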
