Deleting duplicate images with Python on Windows

Latest recommended article published on 2024-08-01 13:25:03

Blue summer

Category: Python. Tags: python, image deduplication, directory traversal, MD5

Article link: https://blog.csdn.net/u010039418/article/details/80835775

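Note: this article was written against Python 2.7.13.

I have long had the habit of copying the photos from my phone's album to my computer every so often. Because I sometimes lose track of when I last copied them, some photos end up duplicated. Removing the duplicates by hand would take far too long, and with more than 8,000 photos, finding them manually is hardly feasible, so I decided to do it with a Python script.

The idea is to use the MD5 value to decide whether two files are identical. Visually similar images are not considered here.

Without further ado, here is the code: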

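#!/usr/bin/env python
# -*- coding:utf-8 -*-

import os
import time
import hashlib

def getmd5(file):
    # Return the MD5 hex digest of a file, or None if it is not a regular file
    if not os.path.isfile(file):
        return None
    md5 = hashlib.md5()
    # The with block closes the file even if reading fails
    with open(file, 'rb') as fd:
        md5.update(fd.read())
    return md5.hexdigest()

if __name__ == "__main__":
    allfile = []
    md5set = set()
    identicallist = []

    start = time.time()
    # The backslash is doubled so it cannot be read as an escape sequence
    inpath = "D:\\照片集"
    # Python 2: decode the UTF-8 byte string into a unicode path
    uipath = unicode(inpath, "utf8")

    # Collect the full path of every file under the root directory
    for path,dir,filelist in os.walk(uipath):
        for filename in filelist:
            allfile.append(os.path.join(path,filename))

    # Compare by MD5 value: a digest seen before marks a duplicate
    for photo in allfile:
        md5sum = getmd5(photo)
        if md5sum is None:
            continue
        if md5sum not in md5set:
            md5set.add(md5sum)
        else:
            identicallist.append(photo)

    end = time.time()
    last = end - start

    print("identical photos: " + str(len(identicallist)))
    print("time: " + str(last) + "s")
    print("count: " + str(len(allfile)))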

The script does not actually delete the duplicate images. It only appends the path of each duplicate to identicallist.
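To actually remove the duplicates, one option is to loop over identicallist and call os.remove on each path. The following is only a minimal sketch and not part of the original script: the helper name remove_duplicates and its dry_run flag are hypothetical, and by default it only reports what would be deleted.

```python
import os

def remove_duplicates(paths, dry_run=True):
    """Delete each file in `paths`; with dry_run=True nothing is deleted.

    Hypothetical helper, not part of the original script.
    Returns the paths that were (or would be) removed.
    """
    removed = []
    for p in paths:
        if not os.path.isfile(p):
            continue  # already gone, or not a regular file
        if not dry_run:
            os.remove(p)
        removed.append(p)
    return removed
```

A cautious way to use it is to call remove_duplicates(identicallist) first, inspect the returned list, and only then call it again with dry_run=False.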

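A few points deserve attention:
1. Character encoding
My photo directory lives on a hard disk, the directory name is in Chinese, and some photo file names are in Chinese too, so Chinese text has to be handled properly.

The script file itself is saved as UTF-8, so under Python 2 the path literal is a UTF-8 byte string. Before using the Chinese directory, decode that byte string into a unicode object:

unicode(inpath, "utf8")

Here inpath is the Chinese path.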
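The difference between the byte string and the decoded text can be seen in a small sketch. It is written so that it should run on both Python 2.7 and Python 3; the variable names raw and text are illustrative assumptions, and decode("utf-8") is used here as the portable equivalent of unicode(inpath, "utf8").

```python
# -*- coding: utf-8 -*-
# The path as UTF-8 bytes, which is how it is stored in a UTF-8 source file.
raw = u"D:\\照片集".encode("utf-8")

# Decoding turns the UTF-8 bytes back into a text (unicode) string.
# In Python 2 this is what unicode(inpath, "utf8") does.
text = raw.decode("utf-8")

print(len(raw))   # length in bytes
print(len(text))  # length in characters
```

Each Chinese character takes several bytes in UTF-8, so the two lengths differ. File system functions such as os.walk return unicode names when they are given a unicode path under Python 2.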

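2. Directory traversal
The photo directory contains many levels of subdirectories, so the script uses the walk function from the os module to collect directories and files. It is quite convenient.

for path,dir,filelist in os.walk(uipath):

On each iteration, filelist holds the files directly inside the current directory path. Note that these are bare file names without the directory part, which is why the script joins them with path.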
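As an illustration of how os.walk visits each level, the sketch below builds a small throwaway directory tree and collects every file in it. The tree layout and file names are made up for the example.

```python
import os
import tempfile

# Build a small throwaway tree: root/a.jpg and root/sub/b.jpg
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "sub"))
for rel in ("a.jpg", os.path.join("sub", "b.jpg")):
    open(os.path.join(root, rel), "wb").close()

found = []
for path, dirs, filelist in os.walk(root):
    for filename in filelist:
        # filename is a bare name; join it with path to get a usable path
        found.append(os.path.join(path, filename))

# Print paths relative to root so the output does not depend on the machine
for full in sorted(found):
    print(os.path.relpath(full, root))
```

The loop body runs once per directory, so the file in sub is only seen on the iteration where path points at sub.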

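3. Getting a file's MD5 value
To compute a file's MD5 value, the script has to open the file and feed its contents to the hash object, unlike the Linux md5sum command, which takes a file name directly. The functions used here come from the hashlib module.

The script above assumes the files are small and reads each one in a single call. For large files, reading everything at once may use too much memory, so the file should be read in chunks: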

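def get_big_file_md5(file):
    # Return the MD5 hex digest of a file, or None if it is not a regular file
    if not os.path.isfile(file):
        return None
    md5 = hashlib.md5()
    with open(file, 'rb') as fd:
        while True:
            # Read 8 KB at a time so memory use stays constant
            buff = fd.read(8192)
            if not buff:
                break
            md5.update(buff)
    return md5.hexdigest()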
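Feeding the data to update() piece by piece gives the same digest as hashing it in one call, so the chunked version can replace the simple one without changing which files count as duplicates. The following sketch checks this on a temporary file; the function names md5_whole and md5_chunked and the sample data are assumptions made for the example.

```python
import hashlib
import os
import tempfile

def md5_whole(path):
    # Read the entire file into memory and hash it in one call
    with open(path, "rb") as fd:
        return hashlib.md5(fd.read()).hexdigest()

def md5_chunked(path, chunk_size=8192):
    # Hash the file piece by piece; memory use is bounded by chunk_size
    md5 = hashlib.md5()
    with open(path, "rb") as fd:
        while True:
            buff = fd.read(chunk_size)
            if not buff:
                break
            md5.update(buff)
    return md5.hexdigest()

# Write 25000 bytes so the file spans several 8192-byte chunks
handle, sample = tempfile.mkstemp()
with os.fdopen(handle, "wb") as out:
    out.write(b"photo" * 5000)

print(md5_whole(sample) == md5_chunked(sample))
os.remove(sample)
```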