I needed to do this myself. The naive approach is:

    import sys
    import zipfile

    def unzip(file, dir):
        with zipfile.ZipFile(file) as zips:
            for info in zips.infolist():
                # Undo the cp437 decoding, then decode the raw bytes as shift-JIS.
                info.filename = info.filename.encode("cp437").decode("shift-jis")
                print("Extracting: " + info.filename.encode(sys.stdout.encoding, errors='replace').decode(sys.stdout.encoding))
                zips.extract(info, dir)
        print("")

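ZipFile seems to treat all filenames internally as DOS (code page 437). Unlike Python 2, Python 3 stores all strings internally as some form of Unicode. So we convert the filename back to a byte array and decode the raw byte string as shift-JIS to get the final filename.

The print line does something similar, but encodes to the default stdout encoding and back. This prevents errors on Windows, whose terminal almost never supports Unicode. (If it does, the name should display correctly.)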
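
The round trip described above can be tried without any zip file. This is only a minimal sketch: the helper names `zipfile_view`, `recover` and `console_safe` are made up for illustration, and it assumes the archive's names really are shift-JIS.

```python
def zipfile_view(name, real_encoding="shift_jis"):
    """Simulate what ZipFile hands back: the raw name bytes decoded as cp437."""
    return name.encode(real_encoding).decode("cp437")


def recover(garbled, real_encoding="shift_jis"):
    """Undo the cp437 decoding, then decode the raw bytes with the real encoding."""
    return garbled.encode("cp437").decode(real_encoding)


def console_safe(text, encoding):
    """Swap characters the console encoding cannot represent for '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)


original = "テスト.txt"
garbled = zipfile_view(original)   # mojibake, as ZipFile would report it
restored = recover(garbled)        # the original name again
print(console_safe(restored, "ascii"))
```

This works because Python's cp437 codec maps every byte value to a character and back, so no information is lost by the wrong decoding.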
This worked well for two zip files, until bam...
Bonus content! It took some head-scratching to figure out, but the problem is that some valid shift-JIS characters contain a backslash byte, which ZipFile converts to a forward slash! For example, 十 is encoded as 8F 5C in shift-JIS. This gets converted to 8F 2F, which is an illegal sequence. If an error occurs, the (possibly overcomplicated) code below checks for this case and tries to fix it. But there may be other characters where this happens and the resulting sequence is still valid, so you get a wrong character instead of an error. :(

    import sys
    import zipfile

    def convert_filename(inname):
        err_ctr = 0
        keep_going = True
        # Undo ZipFile's cp437 decoding to get the raw filename bytes back.
        trans_filename = bytearray(inname.encode("cp437"))
        while keep_going:
            keep_going = False
            try:
                outname = trans_filename.decode("shift-jis")
            except UnicodeDecodeError as e:
                keep_going = True
                if e.reason != "illegal multibyte sequence":
                    raise
                # p0 is the lead byte, p1 the trail byte right after it.
                p0, p1 = e.start, e.start + 1
                if p1 >= len(trans_filename):
                    raise
                print("Trying to fix encoding error at positions " + str(p0) + ", " + str(p1) + " caused by shift-jis sequence " + hex(trans_filename[p0]) + ", " + hex(trans_filename[p1]))
                if trans_filename[p0] > 127 and trans_filename[p1] == 0x2f:
                    # The trail byte was a backslash before ZipFile rewrote it.
                    trans_filename[p1] = 0x5c
                else:
                    print("Don't know how to fix this error. Quitting. :(")
                    raise
                err_ctr += 1
                print("This is error #" + str(err_ctr) + " for this filename.")
            if err_ctr > 50:
                print("More than 50 iterations. Are we stuck in an endless loop? Quitting...")
                sys.exit(1)
        return outname

    def unzip(file, dir):
        with zipfile.ZipFile(file) as zips:
            for info in zips.infolist():
                info.filename = convert_filename(info.filename)
                print("Extracting: " + info.filename.encode(sys.stdout.encoding, errors='replace').decode(sys.stdout.encoding))
                zips.extract(info, dir)
        print("")
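
As an alternative to retrying on errors, the bytes can be walked once with shift-JIS lead bytes in mind, restoring any trail byte that was turned into a slash. This is only a sketch, not what the code above does: `is_lead_byte`, `restore_backslashes` and `convert_filename_scan` are hypothetical names, the lead-byte ranges are the usual shift-JIS/cp932 ones, and it assumes the slash substitution is the only damage to the name.

```python
def is_lead_byte(b):
    """True for bytes that start a two-byte shift-JIS character (assumed ranges)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def restore_backslashes(raw):
    """Turn a trail byte of 0x2f ('/') back into 0x5c ('\\')."""
    fixed = bytearray(raw)
    i = 0
    while i < len(fixed):
        if is_lead_byte(fixed[i]) and i + 1 < len(fixed):
            # 0x2f is never a valid trail byte, so it must have been 0x5c.
            if fixed[i + 1] == 0x2F:
                fixed[i + 1] = 0x5C
            i += 2  # skip lead and trail byte together
        else:
            i += 1  # single-byte character, e.g. a real path separator
    return bytes(fixed)


def convert_filename_scan(inname):
    """Recover a shift-JIS name from the cp437 string ZipFile produced."""
    return restore_backslashes(inname.encode("cp437")).decode("shift_jis")


# Simulate the damage: 十 is 8F 5C, and the 5C gets rewritten to 2F.
damaged = "十.txt".encode("shift_jis").decode("cp437").replace("\\", "/")
print(convert_filename_scan(damaged) == "十.txt")
```

Because a slash that follows a single-byte character is left alone, genuine directory separators in the member name survive the repair.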