Reading RAR files in Python: Python - extracting files from a large (6GB+) zip file

I have a Python script where I need to extract the contents of a ZIP file. However, the zip file is over 6GB in size.

There is a lot of information about the zlib and zipfile modules; however, I can't find a single approach that works in my case.

I have the code:

    with zipfile.ZipFile(fname, "r") as z:
        try:
            log.info("Extracting %s " % fname)
            head, tail = os.path.split(fname)
            z.extractall(folder + "/" + tail)
        except zipfile.BadZipfile:
            log.error("Bad Zip file")
        except zipfile.LargeZipFile:
            log.error("Zip file requires ZIP64 functionality but that has not been enabled (i.e., too large)")
        except zipfile.error:
            log.error("Error decompressing ZIP file")

I know that I need to set allowZip64 to True, but I'm unsure how to do this. Even as is, though, the LargeZipFile exception is not raised; the BadZipfile exception is raised instead. I have no idea why.

Also, is this the best approach to extracting a 6GB zip archive?

Update:

Modifying the BadZipfile exception to this:

    except zipfile.BadZipfile as inst:
        log.error("Bad Zip file")
        print type(inst)  # the exception instance
        print inst.args   # arguments stored in .args
        print inst

shows:

('Bad magic number for file header',)

Update #2:

The full traceback shows

    BadZipfile                                Traceback (most recent call last)
    in ()
          6 for member in z.infolist():
          7     print member.filename[-70:],
    ----> 8     f = z.open(member, 'r')
          9     size = 0
         10     while True:

    /Users/brspurri/anaconda/python.app/Contents/lib/python2.7/zipfile.pyc in open(self, name, mode, pwd)
        965             fheader = struct.unpack(structFileHeader, fheader)
        966             if fheader[_FH_SIGNATURE] != stringFileHeader:
    --> 967                 raise BadZipfile("Bad magic number for file header")
        968
        969             fname = zef_file.read(fheader[_FH_FILENAME_LENGTH])

    BadZipfile: Bad magic number for file header

Running the code:

    import sys
    import zipfile

    with open(zip_filename, 'rb') as zf:
        z = zipfile.ZipFile(zf, allowZip64=True)
        z.testzip()

doesn't output anything.

Solution

The problem is that you have a corrupted zip file. I can add more details about the corruption below, but first the practical stuff:

You can use this code snippet to tell you which member within the archive is corrupted. However, print z.testzip() would already tell you the same thing. And zip -T or unzip on the command line should also give you that info with the appropriate verbosity.
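As a minimal sketch of what testzip() reports, the example below builds a tiny archive in memory, deliberately overwrites the local header signature of one member, and then asks testzip() for the first bad member. The member names, the helper names, and the corruption step are all made up for illustration; with a real archive only first_bad_member() would be needed, pointed at the file.

```python
import io
import zipfile

def make_corrupted_archive():
    """Build a two-member zip in memory and damage the second member's header.

    The member names (a.txt, b.txt) are hypothetical, chosen for this demo.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as z:
        z.writestr("a.txt", "first member")
        z.writestr("b.txt", "second member")
        offset = z.getinfo("b.txt").header_offset
    data = bytearray(buf.getvalue())
    # Overwrite the 4-byte local file header signature (PK\x03\x04).
    data[offset:offset + 4] = b"XXXX"
    return io.BytesIO(bytes(data))

def first_bad_member(fileobj):
    """Return the name of the first member that fails testzip(), or None."""
    with zipfile.ZipFile(fileobj, allowZip64=True) as z:
        return z.testzip()

print(first_bad_member(make_corrupted_archive()))  # b.txt
```

Note that testzip() returns its result rather than printing it, so calling it without print shows nothing in a script.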

So, what do you do about it?

Well, obviously, if you can get an uncorrupted copy of the file, do that.

If not, if you want to just skip over the bad file and extract everything else, that's pretty easy—mostly the same code as the snippet linked above:

    import sys
    import zipfile

    with open(sys.argv[1], 'rb') as zf:
        z = zipfile.ZipFile(zf, allowZip64=True)
        for member in z.infolist():
            try:
                z.extract(member)
            except zipfile.error as e:
                # log the error, the member.filename, whatever
                print("skipping %s: %s" % (member.filename, e))

The Bad magic number for file header exception message means that zipfile was able to successfully open the zipfile, parse its directory, find the information for a member, seek to that member within the archive, and read the header of that member—all of which means you probably have no zip64-related problems here. However, when it read that header, it did not have the expected "magic" signature of PK\003\004. That means the archive is corrupted.
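The check described above can be reproduced by hand. The sketch below is an illustration, not a copy of how zipfile is structured internally: it seeks to each member's recorded header offset and compares the four bytes found there against the expected signature. The function name members_with_bad_signature is made up for this example.

```python
import zipfile

# The signature every local file header is expected to start with.
LOCAL_HEADER_SIGNATURE = b"PK\x03\x04"

def members_with_bad_signature(path):
    """Return (filename, header_offset, bytes_found) for each member whose
    local file header does not start with the expected signature."""
    bad = []
    with open(path, "rb") as f:
        with zipfile.ZipFile(f, allowZip64=True) as z:
            for info in z.infolist():
                # header_offset comes from the archive's central directory.
                f.seek(info.header_offset)
                found = f.read(4)
                if found != LOCAL_HEADER_SIGNATURE:
                    bad.append((info.filename, info.header_offset, found))
    return bad
```

Running this against the damaged archive should show which members sit at offsets where something other than a header was found, which can help tell a single damaged member from offsets that are wrong from some point onward.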

The fact that the corruption happens to be at exactly 4294967296 implies very strongly that you had a 64-bit problem somewhere along the chain, because that's exactly 2**32.
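To make the arithmetic concrete: a classic (non-zip64) zip record stores offsets in 32-bit fields, so any tool that drops the upper bits of a 64-bit offset makes it wrap around at 2**32. The sketch below only illustrates that wrap-around. Whether this is what actually happened to this archive is an assumption, and the helper name is made up.

```python
FOUR_GIB = 2 ** 32  # the first offset that no longer fits in 32 bits

def truncate_to_32_bits(offset):
    """Keep only the low 32 bits of an offset, as a 32-bit field would."""
    return offset & 0xFFFFFFFF

print(FOUR_GIB)                             # 4294967296
print(truncate_to_32_bits(FOUR_GIB - 1))    # 4294967295
print(truncate_to_32_bits(FOUR_GIB))        # 0
print(truncate_to_32_bits(FOUR_GIB + 100))  # 100
```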

The command-line zip/unzip tool has some workarounds to deal with common causes of corruption that lead to problems like this. It looks like those workarounds may be working for this archive, given that you get a warning, but all of the files are apparently recovered. Python's zipfile library does not have those workarounds, and I doubt you want to write your own zip-handling code…

However, that does open the door for two more possibilities:

First, zip might be able to repair the zipfile for you, using the -F or -FF flag. (Read the manpage, or zip -h, or ask at a site like SuperUser if you need help with that.)
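As a rough sketch of how that could be driven from Python, assuming Info-ZIP zip 3.0 or later is on the PATH: that version's manual describes -F and -FF as reading the damaged archive and writing the repaired copy to the file named with --out. The function names and file names below are made up, and the flags are worth checking against your own zip -h before relying on them.

```python
import subprocess

def build_repair_command(damaged, repaired, aggressive=False):
    """Build the zip command line; -FF is the more aggressive variant of -F."""
    flag = "-FF" if aggressive else "-F"
    return ["zip", flag, damaged, "--out", repaired]

def repair_archive(damaged, repaired, aggressive=False):
    """Run zip and return its exit status without raising on failure."""
    return subprocess.call(build_repair_command(damaged, repaired, aggressive))
```

For example, repair_archive("big.zip", "big_fixed.zip") would leave the original untouched and write the repaired copy alongside it, so the original can still be handed to other recovery tools if the repair does not help.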

And if all else fails, you can run the unzip tool from Python, instead of using zipfile, like this:

subprocess.check_output(['unzip', fname])

That gives you a lot less flexibility and power than the zipfile module, of course—but you're not using any of that flexibility anyway; you're just calling extractall.
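One practical wrinkle, sketched below under the assumption that the Info-ZIP unzip tool is installed: check_output raises CalledProcessError on any non-zero exit status, and unzip's manual documents status 1 as warnings with processing completed anyway. Since this archive apparently extracts with a warning, it may be worth treating 1 as a soft failure instead of an exception. The function names and the dest argument are made up for illustration.

```python
import subprocess

def classify_unzip_status(status):
    """Map an unzip exit status to 'ok', 'warning', or 'error'."""
    if status == 0:
        return "ok"
    if status == 1:
        return "warning"
    return "error"

def extract_with_unzip(fname, dest):
    """Extract fname into dest with the command-line unzip tool.

    -o overwrites existing files without prompting; -d sets the target directory.
    """
    status = subprocess.call(["unzip", "-o", fname, "-d", dest])
    return classify_unzip_status(status)
```

A caller could then log a "warning" result and carry on, and stop only on "error".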
