Python: searching for text in a PDF file, using Python to find which page of a PDF document a search string is on...

Which Python packages can I use to find out on which page a specific “search string” is located?

I looked into several python pdf packages but couldn't figure out which one I should use.

PyPDF does not seem to have this functionality, and PDFMiner seems to be overkill for such a simple task.

Any advice?

More precisely:

I have several PDF documents and I would like to extract the pages which are between a string “Begin” and a string “End”.

Solution

I finally figured out that pyPDF can help. I am posting it in case it can help somebody else.

(1) a function to locate the string

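    # Note: this uses the legacy pyPdf package and the Python 2 built-in file()
    def fnPDF_FindText(xFile, xString):
        # xFile   : the PDF file in which to look
        # xString : the string (treated as a regular expression) to look for
        # Returns the 0-based index of the first matching page, or -1 if none
        import pyPdf, re
        PageFound = -1
        pdfDoc = pyPdf.PdfFileReader(file(xFile, "rb"))
        for i in range(0, pdfDoc.getNumPages()):
            content = pdfDoc.getPage(i).extractText() + "\n"
            content1 = content.encode('ascii', 'ignore')
            # Match case-insensitively, so that searching for "Begin" also finds "begin"
            ResSearch = re.search(xString, content1, re.IGNORECASE)
            if ResSearch is not None:
                PageFound = i
                break
        return PageFound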
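For readers on Python 3, the same page-by-page search might look like the sketch below. It assumes the third-party `pypdf` package (the maintained successor of pyPdf) is installed; the names `find_page` and `find_page_in_pdf` are made up for this illustration and are not part of the original answer. The matching logic is kept separate from the PDF reading so it can be tried on plain strings.

```python
import re


def find_page(page_texts, pattern):
    """Return the 0-based index of the first page whose text matches, or -1."""
    regex = re.compile(pattern, re.IGNORECASE)
    for index, text in enumerate(page_texts):
        # Guard against pages that yield no text at all
        if regex.search(text or ""):
            return index
    return -1


def find_page_in_pdf(path, pattern):
    """Search a PDF file page by page (assumes pypdf is installed)."""
    from pypdf import PdfReader  # third-party dependency, imported lazily

    reader = PdfReader(path)
    # Text is extracted lazily, so the search stops at the first match
    return find_page((page.extract_text() for page in reader.pages), pattern)
```

As in the function above, the pattern is treated as a regular expression, so a literal string containing characters such as `.` or `(` should be passed through `re.escape` first.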

(2) a function to extract the pages of interest

    def fnPDF_ExtractPages(xFileNameOriginal, xFileNameOutput, xPageStart, xPageEnd):
        # Copies pages xPageStart .. xPageEnd - 1 (0-based) into a new PDF file
        from pyPdf import PdfFileReader, PdfFileWriter
        output = PdfFileWriter()
        pdfOne = PdfFileReader(file(xFileNameOriginal, "rb"))
        for i in range(xPageStart, xPageEnd):
            output.addPage(pdfOne.getPage(i))
        outputStream = file(xFileNameOutput, "wb")
        output.write(outputStream)
        outputStream.close()
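To tie this back to the original question (pages between “Begin” and “End”), here is a rough Python 3 sketch that combines the search and the extraction. It assumes the `pypdf` package is installed, uses a plain case-insensitive substring match rather than a regular expression, and assumes the page containing the end marker should be included in the output. The names `page_range` and `extract_between` are hypothetical.

```python
def page_range(page_texts, begin, end):
    """Return (start, stop) as a half-open 0-based range, or None if a marker is missing."""
    texts = [(text or "").lower() for text in page_texts]
    start = next((i for i, t in enumerate(texts) if begin.lower() in t), -1)
    if start < 0:
        return None
    # Look for the end marker on the start page or any later page
    stop = next((i for i in range(start, len(texts)) if end.lower() in texts[i]), -1)
    if stop < 0:
        return None
    return start, stop + 1  # +1 so the page holding the end marker is included


def extract_between(src_path, dst_path, begin, end):
    """Write the pages between the two markers to dst_path (assumes pypdf is installed)."""
    from pypdf import PdfReader, PdfWriter  # third-party dependency

    reader = PdfReader(src_path)
    bounds = page_range([page.extract_text() for page in reader.pages], begin, end)
    if bounds is None:
        return False
    writer = PdfWriter()
    for index in range(*bounds):
        writer.add_page(reader.pages[index])
    with open(dst_path, "wb") as stream:
        writer.write(stream)
    return True
```

Note that the range passed to `fnPDF_ExtractPages` above excludes `xPageEnd`, so when feeding it the page found for “End”, add 1 if that page should be kept.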

I hope this will be helpful to somebody else.
