Splitting specific pages out of a PDF in Python: splitting a PDF page by page

Refactoring

I refactored the code as follows:

    import os

    import PyPDF2


    # Written against the PyPDF2 1.x API (PdfFileReader, PdfFileWriter);
    # these names were removed in PyPDF2 3.0.
    def split_pdf_pages(input_pdf_path, target_dir, fname_fmt=u"{num_page:04d}.pdf"):
        # Create the output directory if it does not exist yet.
        if not os.path.exists(target_dir):
            os.makedirs(target_dir)

        with open(input_pdf_path, "rb") as input_stream:
            input_pdf = PyPDF2.PdfFileReader(input_stream)

            if input_pdf.flattenedPages is None:
                # flatten the file using getNumPages()
                input_pdf.getNumPages()  # or call input_pdf._flatten()

            # Write each page to its own single-page PDF file.
            for num_page, page in enumerate(input_pdf.flattenedPages):
                output = PyPDF2.PdfFileWriter()
                output.addPage(page)

                file_name = os.path.join(target_dir, fname_fmt.format(num_page=num_page))
                with open(file_name, "wb") as output_stream:
                    output.write(output_stream)

Note: it is hard to do much better than this.
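The output file names come from the fname_fmt template, which is filled in with str.format. The sketch below isolates that step so the naming is easy to see. page_file_name is a hypothetical helper, not part of the function above, and the "page-{num_page:03d}.pdf" template is only an illustration.

```python
import os


def page_file_name(target_dir, num_page, fname_fmt="{num_page:04d}.pdf"):
    """Build the output path for one page (hypothetical helper)."""
    # "04d" pads the page number with zeros to four digits.
    return os.path.join(target_dir, fname_fmt.format(num_page=num_page))


# enumerate() counts from 0, so the first page is numbered 0.
first = page_file_name("pages", 0)
# Pass num_page + 1 and a custom template for 1-based, prefixed names.
custom = page_file_name("pages", 0 + 1, fname_fmt="page-{num_page:03d}.pdf")

print(os.path.basename(first))
print(os.path.basename(custom))
```

Zero-padded names sort correctly in a file manager, which is presumably why the default template uses four digits.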

Profiling

With this split_pdf_pages function, you can profile the run:

    import cProfile
    import io
    import os
    import pstats

    pdf_path = "path/to/file.pdf"
    directory = os.path.join(os.path.dirname(pdf_path), "pages")

    # Collect profiling data only while the split is running.
    pr = cProfile.Profile()
    pr.enable()
    split_pdf_pages(pdf_path, directory)
    pr.disable()

    # Print the statistics, sorted by cumulative time.
    s = io.StringIO()
    ps = pstats.Stats(pr, stream=s).sort_stats('cumulative')
    ps.print_stats()
    print(s.getvalue())

Run the profiling on your own PDF file and analyze the results.
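To make the profiling step reusable, it can be wrapped in a small function. This is only a sketch: profile_call and its top parameter are hypothetical names, and sum over a range stands in for the real workload so that the example runs without a PDF file.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        # Stop collecting even if func raises.
        profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # An integer argument limits the listing to the first `top` entries.
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


# Stand-in workload; swap in the real call when profiling a PDF.
result, report = profile_call(sum, range(1000))
print(report)
```

For the PDF case the call would look like profile_call(split_pdf_pages, pdf_path, directory).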

Profiling results

The profiling results are as follows:

    159696614 function calls (155047949 primitive calls) in 57.818 seconds

    Ordered by: cumulative time

    ncalls        tottime  percall  cumtime  percall  filename:lineno(function)
    1             0.899    0.899    57.818   57.818   $HOME/workspace/pypdf2_demo/src/pypdf2_demo/split_pdf_pages.py:14(split_pdf_pages)
    2136          0.501    -.---    53.851   0.025    $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/pdf.py:445(write)
    103229/96616  1.113    -.---    36.924   -.---    $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/generic.py:544(writeToStream)
    27803         9.066    -.---    25.381   0.001    $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/generic.py:445(writeToStream)
    4185807/2136  5.054    -.---    14.635   0.007    $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/pdf.py:541(_sweepIndirectReferences)
    50245/41562   0.117    -.---    9.028    -.---    $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/pdf.py:1584(getObject)
    31421489      6.898    -.---    8.193    -.---    $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/utils.py:231(b_)
    56779         2.070    -.---    7.882    -.---    $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/generic.py:142(writeToStream)
    8683          0.322    -.---    7.020    0.001    $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/pdf.py:1531(_getObjectFromStream)
    459978/20068  1.098    -.---    6.490    -.---    $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/generic.py:54(readObject)
    26517/19902   0.484    -.---    6.360    -.---    $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/generic.py:553(readFromStream)
    27803         3.893    -.---    5.565    -.---    $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/generic.py:1162(encode_pdfdocencoding)
    15735379      4.173    -.---    5.412    -.---    $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/utils.py:268(chr_)
    3617738       2.105    -.---    4.956    -.---    $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/generic.py:265(writeToStream)
    18882076      3.856    -.---    3.856    -.---    {method 'write' of '_io.BufferedWriter' objects}

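It appears that: the writeToStream functions are called a very large number of times, but I don't know how to optimize them.

The write method writes directly to the output stream rather than building the result in memory => this can be optimized.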
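A note on reading the ncalls column: an entry such as 4185807/2136 means the function recursed. The first number is the total number of calls and the second is the number of primitive (non-recursive) calls. The sketch below reproduces that notation on a toy recursive function. countdown is a hypothetical stand-in and has nothing to do with PyPDF2.

```python
import cProfile
import io
import pstats


def countdown(n):
    """Toy recursive function: one outer call plus n nested calls."""
    return 0 if n == 0 else 1 + countdown(n - 1)


profiler = cProfile.Profile()
profiler.enable()
countdown(5)
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
# A string argument is a regular expression that restricts the listing
# to entries whose "filename:lineno(function)" text matches it.
stats.print_stats("countdown")
report = stream.getvalue()
print(report)
```

The countdown row shows 6/1 in the ncalls column: six calls in total, of which one is primitive. The same filter, for example print_stats("writeToStream"), narrows a long report down to the functions of interest.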

Improvement

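Serialize each PDF page into an in-memory buffer, then write the buffer to the file in one go:

    # Replaces the final "with open(...)" block inside the loop.
    buffer = io.BytesIO()
    output.write(buffer)

    with open(file_name, "wb") as output_stream:
        output_stream.write(buffer.getvalue())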

With this change, I processed 2135 pages in 35 seconds instead of 40 seconds.

The gain is really small :-(
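One possible explanation for the small gain, offered as an assumption rather than a measured fact: open(file_name, "wb") already returns a buffered writer (the profile lists _io.BufferedWriter), so the many small writes were never hitting the disk one by one. The sketch below compares the two strategies on synthetic data. CHUNKS, write_direct, write_via_buffer and timed are hypothetical names, and the timings will vary from machine to machine.

```python
import io
import os
import tempfile
import time

# Many tiny byte strings, imitating the small writes done during serialization.
CHUNKS = [b"%d 0 obj\n" % i for i in range(200_000)]


def write_direct(path):
    # open() in binary mode already wraps the file in a BufferedWriter.
    with open(path, "wb") as stream:
        for chunk in CHUNKS:
            stream.write(chunk)


def write_via_buffer(path):
    # Collect everything in memory first, then write the file once.
    buffer = io.BytesIO()
    for chunk in CHUNKS:
        buffer.write(chunk)
    with open(path, "wb") as stream:
        stream.write(buffer.getvalue())


def timed(func, path):
    start = time.perf_counter()
    func(path)
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    direct_path = os.path.join(tmp, "direct.bin")
    buffered_path = os.path.join(tmp, "buffered.bin")
    t_direct = timed(write_direct, direct_path)
    t_buffer = timed(write_via_buffer, buffered_path)
    with open(direct_path, "rb") as a, open(buffered_path, "rb") as b:
        same = a.read() == b.read()

print(f"direct: {t_direct:.3f}s, via BytesIO: {t_buffer:.3f}s, identical: {same}")
```

If the two timings are close, the remaining cost is more likely in the serialization itself (the writeToStream calls) than in the file I/O.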
