Searching for text in a PDF file with Python: finding which page of a PDF document a search string is on...

Latest recommended article published on 2024-01-15 09:20:41

阿莱克西斯

Article tags: python, searching for text in PDF files

Which Python packages can I use to find out on which page a specific “search string” is located?

I looked into several Python PDF packages but couldn't figure out which one I should use.

PyPDF does not seem to have this functionality, and PDFMiner seems to be overkill for such a simple task.

Any advice?

More precisely:

I have several PDF documents and I would like to extract the pages that lie between a string “Begin” and a string “End”.

Solution

I finally figured out that pyPdf can help. I am posting my code in case it helps somebody else.

(1) a function to locate the string

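def fnPDF_FindText(xFile, xString):
    # xFile   : the PDF file in which to look
    # xString : the string to look for (used as a regular expression and
    #           matched case-insensitively)
    # Returns the 0-based index of the first matching page, or -1 if none.
    import pyPdf, re
    PageFound = -1
    pdfFile = open(xFile, "rb")
    pdfDoc = pyPdf.PdfFileReader(pdfFile)
    for i in range(0, pdfDoc.getNumPages()):
        content = pdfDoc.getPage(i).extractText() + "\n"
        # Drop non-ASCII characters and lower-case the page text
        content1 = content.encode('ascii', 'ignore').lower()
        # IGNORECASE so that a mixed-case xString such as "Begin" still
        # matches the lower-cased page text
        ResSearch = re.search(xString, content1, re.IGNORECASE)
        if ResSearch is not None:
            PageFound = i
            break
    pdfFile.close()
    return PageFound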
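For a current Python 3 setup, the same page search can be sketched with `pypdf`, the maintained descendant of pyPdf. This is only an illustration, not the code from the answer: the names `find_first_page` and `pdf_page_texts` are made up here, the sketch assumes `pypdf` is installed and that the PDF has an extractable text layer, and it uses a plain case-insensitive substring test instead of a regular expression.

```python
def find_first_page(page_texts, needle):
    """Return the 0-based index of the first page containing needle, or -1."""
    needle = needle.lower()
    for index, text in enumerate(page_texts):
        if needle in text.lower():
            return index
    return -1


def pdf_page_texts(path):
    """Yield the extracted text of each page (requires the pypdf package)."""
    from pypdf import PdfReader  # lazy import: find_first_page works without it

    reader = PdfReader(path)
    for page in reader.pages:
        # Scanned or image-only pages may produce no text at all
        yield page.extract_text() or ""


# Plain strings stand in for page texts, so no PDF file is needed here.
sample_pages = ["Title page", "Chapter 1\nBEGIN of section", "More text", "The End"]
print(find_first_page(sample_pages, "begin"))    # 1
print(find_first_page(sample_pages, "missing"))  # -1
```

With a real document the two pieces combine as `find_first_page(pdf_page_texts("report.pdf"), "Begin")`, where `report.pdf` is a placeholder file name.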

(2) a function to extract the pages of interest

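def fnPDF_ExtractPages(xFileNameOriginal, xFileNameOutput, xPageStart, xPageEnd):
    # Copies pages xPageStart .. xPageEnd - 1 (0-based; xPageEnd itself is
    # excluded) from xFileNameOriginal into the new file xFileNameOutput.
    from pyPdf import PdfFileReader, PdfFileWriter
    output = PdfFileWriter()
    inputStream = open(xFileNameOriginal, "rb")
    pdfOne = PdfFileReader(inputStream)
    for i in range(xPageStart, xPageEnd):
        output.addPage(pdfOne.getPage(i))
    outputStream = open(xFileNameOutput, "wb")
    output.write(outputStream)
    outputStream.close()
    # The input must stay open until the output has been written
    inputStream.close()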
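Note that `range(xPageStart, xPageEnd)` leaves out the page at `xPageEnd`. The sketch below makes that choice explicit and ports the extraction to `pypdf` on Python 3. It is an assumption-laden illustration rather than the answer's code: `page_indices` and `extract_pages` are hypothetical names, and the `inclusive` flag is an addition not present in the original.

```python
def page_indices(start, end, num_pages, inclusive=False):
    """Return the valid 0-based page indices from start to end.

    With inclusive=False the end page is left out, matching range(start, end)
    in the function above; inclusive=True keeps the end page as well.
    """
    stop = end + 1 if inclusive else end
    return list(range(max(start, 0), min(stop, num_pages)))


def extract_pages(src_path, dst_path, start, end, inclusive=False):
    """Copy the selected pages of src_path into a new PDF at dst_path."""
    from pypdf import PdfReader, PdfWriter  # lazy import: only needed for real PDFs

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for index in page_indices(start, end, len(reader.pages), inclusive):
        writer.add_page(reader.pages[index])
    with open(dst_path, "wb") as out_file:  # closed automatically, even on error
        writer.write(out_file)


print(page_indices(2, 5, 10))                  # [2, 3, 4]
print(page_indices(2, 5, 10, inclusive=True))  # [2, 3, 4, 5]
print(page_indices(8, 20, 10))                 # [8, 9]
```

Clamping the indices to the document length avoids an index error when the requested end page lies beyond the last page.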

I hope this will be helpful to somebody else.
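To tie this back to the original question (pages between “Begin” and “End”), one pitfall is that searching for “End” from the first page may hit an occurrence that comes before “Begin”. The following sketch, which assumes the page texts have already been extracted into a list of strings, looks for the end marker only from the begin page onwards. The function name `marker_page_range` is hypothetical.

```python
def marker_page_range(page_texts, begin_marker, end_marker):
    """Return (first, last) 0-based page indices, or None if a marker is missing.

    first is the first page containing begin_marker; last is the first page at
    or after it that contains end_marker. Matching is case-insensitive.
    """
    begin_marker, end_marker = begin_marker.lower(), end_marker.lower()
    first = None
    for index, text in enumerate(page_texts):
        lowered = text.lower()
        if first is None and begin_marker in lowered:
            first = index
        if first is not None and end_marker in lowered:
            return first, index
    return None


pages = ["Cover", "Begin: scope", "Details", "Summary. End", "Notes"]
print(marker_page_range(pages, "Begin", "End"))  # (1, 3)
```

The resulting pair could then be passed to `fnPDF_ExtractPages` as `xPageStart` and `xPageEnd + 1` if the end page should be included. Because this is a substring test, a marker such as “End” also matches inside longer words like “Appendix”, so distinctive marker strings work best.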
