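After downloading a lot of material, I ended up with duplicate files scattered across many folders because I had not sorted them properly, so I wanted to write a small Python tool to find duplicate files.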
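The main idea is as follows:
1. Find files that share the same name.
2. Use crc32: first pick out the files that have the same size, then compute the crc32 of each one to get the list of identical files.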
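Step 1 can be sketched roughly as follows. This is only an illustration under my own assumptions: the helper name `group_by_name` is hypothetical, and it simply groups paths by their base filename with `os.walk`. A shared name is only a hint, so the size and crc32 check from step 2 is still needed to confirm that the contents match.

```python
import os
from collections import defaultdict

def group_by_name(root):
    """Map each base filename to the paths under root that carry it.

    Hypothetical helper for illustration; only names that occur
    more than once are returned.
    """
    names = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            names[name].append(os.path.join(dirpath, name))
    # Keep only the names shared by at least two files.
    return {n: sorted(p) for n, p in names.items() if len(p) > 1}
```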
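Below is a piece of code I reposted from elsewhere. It does the job, but it is very slow when searching through a large number of files; I will tune it when I find the time.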
Code:
#!/usr/bin/env python3
#coding=utf-8
import binascii, os

filesizes = {}   # file size -> list of paths with that size
samefiles = []   # groups of paths whose size and crc32 both match

def filesize(path):
    # Walk the tree recursively and group every file by its size.
    if os.path.isdir(path):
        files = os.listdir(path)
        for file in files:
            filesize(os.path.join(path, file))
    else:
        size = os.path.getsize(path)
        if size not in filesizes:
            filesizes[size] = []
        filesizes[size].append(path)

def filecrc(files):
    # Among files of equal size, group by the crc32 of their contents.
    filecrcs = {}
    for file in files:
        # Open in binary mode: crc32 needs the raw bytes of the file.
        with open(file, "rb") as f:
            crc = binascii.crc32(f.read())
        if crc not in filecrcs:
            filecrcs[crc] = []
        filecrcs[crc].append(file)
    for filecrclist in filecrcs.values():
        if len(filecrclist) > 1:
            samefiles.append(filecrclist)

if __name__ == '__main__':
    path = r"J:\My Work"
    filesize(path)
    # Only files that share a size with another file need a crc32.
    for sizesamefilelist in filesizes.values():
        if len(sizesamefilelist) > 1:
            filecrc(sizesamefilelist)
    for samefile in samefiles:
        print("****** same file group ******")
        for file in samefile:
            print(file)
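
One possible direction for the tuning, as a minimal sketch and not a measured result: the script above reads every candidate file into memory in one go, so the version below reads in chunks and first compares the crc32 of only the beginning of each file before checksumming the whole thing. The names `crc32_of`, `split_by` and `find_duplicates`, the 1 MiB chunk size and the 4 KiB prefix are all my own assumptions. It uses `zlib.crc32`, which accepts a running checksum as its second argument.

```python
import os
import zlib
from collections import defaultdict

CHUNK = 1 << 20  # read 1 MiB at a time (arbitrary choice)

def crc32_of(path, limit=None):
    """CRC32 of a file, read in chunks; stop after `limit` bytes if given."""
    crc = 0
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            n = CHUNK if remaining is None else min(CHUNK, remaining)
            data = f.read(n)
            if not data:
                break
            crc = zlib.crc32(data, crc)  # continue the running checksum
            if remaining is not None:
                remaining -= len(data)
    return crc

def split_by(paths, key):
    """Partition paths by key(path); drop groups with a single member."""
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]

def find_duplicates(root):
    """Return groups of paths under root with equal size and equal CRC32."""
    all_files = [os.path.join(d, name)
                 for d, _dirs, names in os.walk(root) for name in names]
    result = []
    for same_size in split_by(all_files, os.path.getsize):
        # Cheap pre-filter: CRC32 of the first 4 KiB only.
        for same_head in split_by(same_size, lambda p: crc32_of(p, 4096)):
            # Full-file CRC32 only for files that survived the pre-filter.
            result.extend(split_by(same_head, crc32_of))
    return result
```

Calling `find_duplicates(r"J:\My Work")` would return a list of groups that can be printed the same way as `samefiles` above. A matching crc32 does not strictly prove that two files are identical; if that matters, `filecmp.cmp(a, b, shallow=False)` from the standard library can confirm a pair byte by byte.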