Merging multiple PDFs in Python: combining many PDF files into one with pypdf

If I have 1000+ PDF files that need to be merged into one PDF:

    output = PdfFileWriter()

    # filenames holds filename0000 ... filename1000
    for filename in filenames:
        input = PdfFileReader(file(filename, "rb"))
        pageCount = input.getNumPages()
        for iPage in range(0, pageCount):
            output.addPage(input.getPage(iPage))

    outputStream = file("document-output.pdf", "wb")
    output.write(outputStream)
    outputStream.close()

Running the above code fails once it reaches input = PdfFileReader(file(filename500+, "rb")), with this error message:

IOError: [Errno 24] Too many open files:

I think this is a bug. If not, what should I do?

Solution

I recently came across this exact same problem, so I dug into PyPDF2 to see what's going on, and how to resolve it.

Note: I am assuming that filename is a well-formed file path string, here and in all of my code below.

The Short Answer

Use the PdfFileMerger() class instead of the PdfFileWriter() class. I've tried to make the following resemble your code as closely as I could:

    from PyPDF2 import PdfFileMerger, PdfFileReader

    [...]

    merger = PdfFileMerger()
    for filename in filenames:
        merger.append(PdfFileReader(file(filename, 'rb')))

    merger.write("document-output.pdf")

The Long Answer

The way you're using PdfFileReader and PdfFileWriter keeps each file open, which eventually causes Python to raise IOError 24 (too many open files). More specifically, when you add a page to the PdfFileWriter, you are adding a reference to the page in the still-open PdfFileReader (which is why closing the file early produces an I/O error). Because each file is still referenced, Python does not garbage-collect or automatically close it, even though the same variable is reused for the next file. The files remain open until the PdfFileWriter no longer needs access to them, which is at output.write(outputStream) in your code.
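Errno 24 is the operating system refusing to hand out another file descriptor. As a rough illustration, assuming a Unix-like system (the standard-library resource module is not available on Windows), the per-process limit can be inspected as below. The helper name open_file_limit is made up for this sketch and is not part of PyPDF2.

```python
import resource  # standard library, Unix-only


def open_file_limit():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = open_file_limit()
# Every reader that still holds an open file counts against the soft limit.
print("soft limit:", soft, "hard limit:", hard)
```

Raising the limit would only postpone the error, which is why the rest of this answer focuses on closing the files instead.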

To solve this, create copies in memory of the content, and allow the file to be closed. I noticed in my adventures through the PyPDF2 code that the PdfFileMerger() class already has this functionality, so instead of re-inventing the wheel, I opted to use it instead. I learned, though, that my initial look at PdfFileMerger wasn't close enough, and that it only created copies in certain conditions.

My initial attempts looked like the following, and resulted in the same I/O problems:

    merger = PdfFileMerger()
    for filename in filenames:
        merger.append(filename)

    merger.write(output_file_path)

Looking at the PyPDF2 source code, we see that append() requires fileobj to be passed, and then calls merge(), passing its current last page as the new file's position. merge() does the following with fileobj (before opening it with PdfFileReader(fileobj)):

    if type(fileobj) in (str, unicode):
        fileobj = file(fileobj, 'rb')
        my_file = True
    elif type(fileobj) == file:
        fileobj.seek(0)
        filecontent = fileobj.read()
        fileobj = StringIO(filecontent)
        my_file = True
    elif type(fileobj) == PdfFileReader:
        orig_tell = fileobj.stream.tell()
        fileobj.stream.seek(0)
        filecontent = StringIO(fileobj.stream.read())
        fileobj.stream.seek(orig_tell)
        fileobj = filecontent
        my_file = True

We can see that append() does accept a string and, when given one, assumes it is a file path and creates a file object at that location. The end result is exactly what we're trying to avoid: a PdfFileReader() object holding a file open until the output is eventually written!

However, if we create either a file object or a PdfFileReader object (see Edit 2) from the path string before passing it to append(), merge() will automatically create a copy for us as a StringIO object, allowing Python to close the file.
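The copy step in the quoted merge() code is written for Python 2 (file, unicode, StringIO). As a minimal sketch of the same idea on Python 3, assuming the input is any seekable binary stream, io.BytesIO can stand in for StringIO. The function name copy_stream is hypothetical and does not exist in PyPDF2.

```python
import io


def copy_stream(stream):
    """Return an in-memory copy of a binary stream, restoring its position afterwards."""
    orig_tell = stream.tell()         # remember where the caller was reading
    stream.seek(0)
    copy = io.BytesIO(stream.read())  # the whole content now lives in memory
    stream.seek(orig_tell)            # put the original stream back as we found it
    return copy


# Usage with a stand-in stream; a real call would pass an open PDF file instead.
source = io.BytesIO(b"%PDF-1.4 example bytes")
source.seek(5)
duplicate = copy_stream(source)
```

Once the copy exists, the original file can be closed without affecting anything that reads from the copy.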

I would recommend the simpler merger.append(file(filename, 'rb')), as others have reported that a PdfFileReader object may stay open in memory, even after calling writer.close().

Hope this helped!

EDIT: I assumed you were using PyPDF2, not PyPDF. If you aren't, I highly recommend switching, as PyPDF is no longer maintained with the author giving his official blessings to Phaseit in developing PyPDF2.

If for some reason you cannot switch to PyPDF2 (licensing, system restrictions, etc.), then PdfFileMerger won't be available to you. In that situation you can reuse the code from PyPDF2's merge function (provided above) to create a copy of the file as a StringIO object, and use that in your code in place of the file object.
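A minimal sketch of that fallback, written for Python 3 and assuming the reader and writer expose getNumPages(), getPage() and addPage() as in the question's code. The function merge_via_memory and its parameters are hypothetical names for illustration. The reader class and writer object are passed in so the sketch does not depend on a particular library version.

```python
import io


def merge_via_memory(paths, writer, make_reader):
    """Add every page of every file in `paths` to `writer`.

    Each file is read into memory and closed before its pages are added,
    so at most one input file is open at any time.
    """
    for path in paths:
        with open(path, "rb") as handle:
            buffer = io.BytesIO(handle.read())  # in-memory copy of the file
        # The file is closed here; the reader only sees the copy.
        reader = make_reader(buffer)
        for index in range(reader.getNumPages()):
            writer.addPage(reader.getPage(index))
    return writer


# Hypothetical usage with the legacy API from the question:
#   from pyPdf import PdfFileReader, PdfFileWriter
#   output = merge_via_memory(filenames, PdfFileWriter(), PdfFileReader)
#   with open("document-output.pdf", "wb") as stream:
#       output.write(stream)
```

The trade-off is memory: every input stays in RAM until the output is written, so this suits many small files better than a few very large ones.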

EDIT 2: Previous recommendation of using merger.append(PdfFileReader(file(filename, 'rb'))) changed based on comments (Thanks @Agostino).

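### Answer 1:

You can use the PyPDF2 library to merge multiple PDF files into a single PDF file. First, install the library with the following command:

```shell
pip install pypdf2
```

Then you can merge the files with the following code:

```python
import glob

import PyPDF2

OUTPUT = "merged.pdf"

# Create a PDF merger (PdfFileMerger is the name used by PyPDF2 releases before 3.0)
merger = PyPDF2.PdfFileMerger()

# Read every PDF in the current directory, skipping the output of an earlier run
for pdf in sorted(glob.glob("*.pdf")):
    if pdf == OUTPUT:
        continue
    with open(pdf, "rb") as f:
        merger.append(f)

# Write all appended PDFs into one new PDF file
with open(OUTPUT, "wb") as f:
    merger.write(f)
```

This code merges all PDF files in the current directory into one new PDF file named "merged.pdf".

Hope this helps!

### Answer 2:

To merge multiple PDF files into a single PDF file, you can use Python's PyPDF2 library.

First, install PyPDF2 with the following command:

```
pip install PyPDF2
```

Next, you can use the following code to merge the files:

```python
import PyPDF2


def merge_pdf(input_paths, output_path):
    merger = PyPDF2.PdfFileMerger()
    for path in input_paths:
        with open(path, 'rb') as file:
            merger.append(file)
    with open(output_path, 'wb') as file:
        merger.write(file)


if __name__ == '__main__':
    input_paths = ['file1.pdf', 'file2.pdf', 'file3.pdf']  # paths of the PDFs to merge
    output_path = 'merged.pdf'  # path of the merged PDF
    merge_pdf(input_paths, output_path)
```

The code first imports the PyPDF2 library. It then defines a function named `merge_pdf` that takes two parameters: `input_paths`, the list of paths of the PDF files to merge, and `output_path`, the path of the merged PDF file.

Inside `merge_pdf`, a `PdfFileMerger` object is created to do the merging. The function then iterates over `input_paths`, opens each PDF file, and adds it to the merger with the `append` method.

Finally, it opens the output path in `wb` mode and writes the merger's content to that file with the `write` method.

With this code you can merge multiple PDF files into one PDF file and save it to the path you specify.

### Answer 3:

To merge multiple PDF files into one PDF file with Python, you can use the PyPDF2 library. Here is a simple example:

```python
from PyPDF2 import PdfMerger

# List of the PDF files to merge
pdf_files = ['file1.pdf', 'file2.pdf', 'file3.pdf']

# Create the PdfMerger object
merger = PdfMerger()

# Append the PDF files one by one
for pdf_file in pdf_files:
    merger.append(pdf_file)

# Name of the merged output file
output_file = 'merged_file.pdf'

# Save the merged PDF to the output file
merger.write(output_file)

# Close the PdfMerger object
merger.close()

print('PDF files merged!')
```

In this code, we first import the `PdfMerger` class. We then define the list of PDF files to merge and create a `PdfMerger` object named `merger`.

Next, we use the `append()` method to add each PDF file to the `merger` object.

We then define the name of the merged PDF, `output_file`, and save the merged PDF to that file with the `write()` method.

Finally, we close the `merger` object and print a message saying the merge is complete.

Hopefully the code above helps you merge multiple PDF files into one PDF file!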