Reading rar files with Python, Python - Extracting files from a large (6GB+) zip file

Latest recommended article published on 2023-07-28 14:41:05

张北晨



I have a Python script where I need to extract the contents of a ZIP file. However, the zip file is over 6GB in size.

There is a lot of information about zlib and zipfile modules, however, I can't find a single approach that works in my case.

I have the code:

    with zipfile.ZipFile(fname, "r") as z:
        try:
            log.info("Extracting %s " % fname)
            head, tail = os.path.split(fname)
            z.extractall(folder + "/" + tail)
        except zipfile.BadZipfile:
            log.error("Bad Zip file")
        except zipfile.LargeZipFile:
            log.error("Zip file requires ZIP64 functionality but that has not been enabled (i.e., too large)")
        except zipfile.error:
            log.error("Error decompressing ZIP file")

I know that I need to set allowZip64 to True, but I'm unsure how to do this. Yet even as is, the LargeZipFile exception is not raised; the BadZipfile exception is raised instead. I have no idea why.

Also, is this the best approach for extracting a 6GB zip archive?

Update:

Modifying the BadZipfile exception handler to this:

        except zipfile.BadZipfile as inst:
            log.error("Bad Zip file")
            print(type(inst))  # the exception instance
            print(inst.args)   # arguments stored in .args
            print(inst)

shows:

('Bad magic number for file header',)

Update #2:

The full traceback shows

    BadZipfile                                Traceback (most recent call last)
    in ()
          6 for member in z.infolist():
          7     print member.filename[-70:],
    ----> 8     f = z.open(member, 'r')
          9     size = 0
         10     while True:

    /Users/brspurri/anaconda/python.app/Contents/lib/python2.7/zipfile.pyc in open(self, name, mode, pwd)
        965 fheader = struct.unpack(structFileHeader, fheader)
        966 if fheader[_FH_SIGNATURE] != stringFileHeader:
    --> 967     raise BadZipfile("Bad magic number for file header")
        968
        969 fname = zef_file.read(fheader[_FH_FILENAME_LENGTH])

    BadZipfile: Bad magic number for file header

Running the code:

    import sys
    import zipfile

    with open(zip_filename, 'rb') as zf:
        z = zipfile.ZipFile(zf, allowZip64=True)
        z.testzip()

doesn't output anything.

Solution

The problem is that you have a corrupted zip file. I can add more details about the corruption below, but first the practical stuff:

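You can use this code snippet to find out which member of the archive is corrupted. However, print z.testzip() would already tell you the same thing: testzip() returns the name of the first bad member (or None if it finds none), so calling it without printing the result shows nothing. zip -T or unzip on the command line should also give you that information, given the appropriate verbosity.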
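As a minimal sketch of how the return value of testzip() behaves, the following Python 3 example builds a tiny archive in memory, damages one member, and checks both copies. The helper name first_bad_member and the demo file names are hypothetical and not part of the question; the asker's real archive would be opened by filename instead.

```python
import io
import zipfile


def first_bad_member(zip_bytes):
    """Return the name of the first member that fails testzip(), or None."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as z:
        return z.testzip()


# Build a small uncompressed archive in memory (hypothetical demo data).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as z:
    z.writestr("good.txt", b"all fine here")
    z.writestr("bad.txt", b"hello world")
good_archive = buf.getvalue()

# Change one byte of bad.txt's stored data so its CRC-32 no longer matches.
damaged_archive = good_archive.replace(b"hello world", b"jello world")

print(first_bad_member(good_archive))     # no bad member found
print(first_bad_member(damaged_archive))  # name of the first failing member
```

Note that testzip() stops at the first failure, so it will not list every damaged member in one call.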

So, what do you do about it?

Well, obviously, if you can get an uncorrupted copy of the file, do that.

If not, if you want to just skip over the bad file and extract everything else, that's pretty easy—mostly the same code as the snippet linked above:

    import sys
    import zipfile

    with open(sys.argv[1], 'rb') as zf:
        z = zipfile.ZipFile(zf, allowZip64=True)
        for member in z.infolist():
            try:
                z.extract(member)
            except zipfile.error as e:
                # Log the error and the member name, then carry on with the next member.
                print("Skipping %s: %s" % (member.filename, e))
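The skip-and-continue idea above can be tried end to end on a deliberately damaged archive. This is only a sketch in Python 3: the function name extract_good_members, the temporary directory, and the demo file names are assumptions made for illustration, and the damage is simulated by overwriting the second member's local header signature.

```python
import os
import tempfile
import zipfile


def extract_good_members(zip_path, dest_dir):
    """Extract every readable member; return (extracted, failed) lists."""
    extracted, failed = [], []
    with zipfile.ZipFile(zip_path, allowZip64=True) as z:
        for member in z.infolist():
            try:
                z.extract(member, dest_dir)
                extracted.append(member.filename)
            except zipfile.BadZipFile as e:
                # Record the failure and move on to the next member.
                failed.append((member.filename, str(e)))
    return extracted, failed


# Create a two-member archive on disk (hypothetical demo data).
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "demo.zip")
with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as z:
    z.writestr("first.txt", b"first payload")
    z.writestr("second.txt", b"second payload")

# Simulate corruption: overwrite the second local header signature.
with open(zip_path, "rb") as f:
    raw = bytearray(f.read())
second_header = raw.find(b"PK\x03\x04", 4)  # skip the header at offset 0
raw[second_header:second_header + 4] = b"XXXX"
with open(zip_path, "wb") as f:
    f.write(raw)

extracted, failed = extract_good_members(zip_path, os.path.join(workdir, "out"))
print("extracted:", extracted)
print("failed:", failed)
```

Returning the failures instead of only logging them makes it easy to decide afterwards whether the partial extraction is acceptable.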

The Bad magic number for file header exception message means that zipfile was able to successfully open the zipfile, parse its directory, find the information for a member, seek to that member within the archive, and read the header of that member—all of which means you probably have no zip64-related problems here. However, when it read that header, it did not have the expected "magic" signature of PK\003\004. That means the archive is corrupted.
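To make the signature check concrete, here is a rough Python 3 sketch that repeats it by hand: it reads the four bytes at each member's recorded header offset and compares them with PK\003\004. The function name members_with_bad_magic and the demo data are hypothetical; ZipInfo.header_offset is the attribute zipfile itself uses to locate the local header.

```python
import io
import zipfile

LOCAL_HEADER_MAGIC = b"PK\x03\x04"


def members_with_bad_magic(fileobj):
    """Return (filename, header_offset, found_bytes) for each bad local header."""
    with zipfile.ZipFile(fileobj) as z:
        infos = z.infolist()  # parsed from the central directory
    bad = []
    for info in infos:
        fileobj.seek(info.header_offset)
        found = fileobj.read(4)
        if found != LOCAL_HEADER_MAGIC:
            bad.append((info.filename, info.header_offset, found))
    return bad


# Build a small archive in memory (hypothetical demo data).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as z:
    z.writestr("a.txt", b"aaa")
    z.writestr("b.txt", b"bbb")
raw = bytearray(buf.getvalue())

# Overwrite the signature of the second local header to simulate corruption.
pos = raw.find(LOCAL_HEADER_MAGIC, 4)
raw[pos:pos + 4] = b"XXXX"

bad = members_with_bad_magic(io.BytesIO(bytes(raw)))
print(bad)
```

On a real archive, the reported offsets show where in the file the damage sits, which is the kind of clue discussed next.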

The fact that the corruption happens to be at exactly 4294967296 implies very strongly that you had a 64-bit problem somewhere along the chain, because that's exactly 2**32.
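The arithmetic behind that suspicion can be checked in a few lines of Python 3. This sketch only shows why the number stands out (the classic, non-ZIP64 zip fields for offsets and sizes are 4-byte unsigned integers); it does not diagnose what actually went wrong with this archive. The helper name fits_in_32_bits is hypothetical.

```python
import struct

offset = 4294967296


def fits_in_32_bits(value):
    """True if value can be packed into a 4-byte unsigned little-endian field."""
    try:
        struct.pack("<L", value)
        return True
    except struct.error:
        return False


is_power = offset == 2 ** 32     # the offset is exactly 2**32
wrapped = offset & 0xFFFFFFFF    # what survives a truncation to 32 bits

print(is_power)
print(wrapped)
print(fits_in_32_bits(offset - 1), fits_in_32_bits(offset))
```

In other words, 2**32 is the first offset that no longer fits in a 4-byte field, so a tool that truncates offsets to 32 bits would start misbehaving at exactly that point.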

The command-line zip/unzip tools have some workarounds for common causes of corruption that lead to problems like this. It looks like those workarounds may be working for this archive, given that you get a warning but all of the files are apparently recovered. Python's zipfile library does not have those workarounds, and I doubt you want to write your own zip-handling code…

However, that does open the door for two more possibilities:

First, zip might be able to repair the zipfile for you, using the -F or -FF flag. (Read the manpage, or zip -h, or ask at a site like SuperUser if you need help with that.)

And if all else fails, you can run the unzip tool from Python, instead of using zipfile, like this:

subprocess.check_output(['unzip', fname])

That gives you a lot less flexibility and power than the zipfile module, of course—but you're not using any of that flexibility anyway; you're just calling extractall.
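A slightly more defensive version of the unzip call might look like the sketch below (Python 3.7+). It assumes the Info-ZIP unzip tool is on the PATH, and it relies on that tool's documented behaviour of exiting with status 1 when it only hit warnings but still finished, which is the situation described above; check_output would raise on that status. The function names and the choice of the -o (overwrite) and -d (target directory) options are assumptions for illustration.

```python
import shutil
import subprocess


def build_unzip_command(zip_path, dest_dir):
    """Build the unzip argument list: -o overwrites, -d sets the target dir."""
    return ["unzip", "-o", zip_path, "-d", dest_dir]


def extract_with_unzip(zip_path, dest_dir):
    """Run unzip; tolerate exit status 1 (warnings), raise on anything worse."""
    if shutil.which("unzip") is None:
        raise RuntimeError("unzip was not found on PATH")
    cmd = build_unzip_command(zip_path, dest_dir)
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode > 1:
        raise subprocess.CalledProcessError(
            result.returncode, cmd, output=result.stdout, stderr=result.stderr
        )
    return result
```

A caller could then use extract_with_unzip(fname, folder) and inspect result.stderr to log any warnings unzip printed while recovering the files.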
