Java: reading files with duplicate content, checking for duplicate file content with Java

We have a 150 GB data folder containing files of any format (doc, jpg, png, txt, etc.). We need to compare all files against each other to check whether any of them have duplicate content and, if so, print the list of file paths. I first stored all files in an ArrayList and then used the FileUtils.contentEquals(file1, file2) method. This works for a small folder, but for the 150 GB data folder it does not show any result. I suspect that storing all files in an ArrayList first causes the problem, possibly a JVM heap issue, but I am not sure.

Does anyone have better advice and sample code for handling this amount of data? Please help me.

Solution

Calculate the MD5 hash of each file and store it in a HashMap with the MD5 hash as the key and the file path as the value. Before you add a new file to the HashMap, you can easily check whether there is already a file with that MD5 hash. This way each file is read once, instead of being compared against every other file.

The chance of a false match is very small, but if you want you can use FileUtils.contentEquals to confirm the match.

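e.g.:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    void findMatchingFiles(List<String> filepaths)
    {
        // Maps each MD5 hash to the first file path seen with that hash
        Map<String, String> hashmap = new HashMap<>();
        for (String filepath : filepaths)
        {
            String md5 = getFileMD5(filepath); // see linked answer
            if (hashmap.containsKey(md5))
            {
                String original = hashmap.get(md5);
                String duplicate = filepath;
                // found a match between original and duplicate
                System.out.println(duplicate + " duplicates " + original);
            }
            else
            {
                hashmap.put(md5, filepath);
            }
        }
    }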
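The getFileMD5 helper called in the example is not shown here. A minimal sketch of what it might look like, assuming it takes a path string and returns the digest as lowercase hex. The class name FileHasher, the 8 KB buffer size, and the choice to wrap checked exceptions are illustrative assumptions, not taken from the linked answer. The file is streamed through the digest in chunks, so memory use stays constant regardless of file size:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class; the name is an assumption for illustration.
class FileHasher {
    // Returns the MD5 digest of the file's bytes as a lowercase hex string.
    static String getFileMD5(String filepath) {
        try (InputStream in = Files.newInputStream(Paths.get(filepath))) {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            byte[] buffer = new byte[8192];
            int read;
            // Feed the file to the digest chunk by chunk instead of loading it whole.
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException e) {
            // Wrapped so callers do not need a throws clause.
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }
}
```

Because only the hash string and the path are kept per file, the HashMap stays small even for a very large folder.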

If there are multiple identical files, this will match each of them with the first one, but will not match all of them to each other. If you want the latter, you can store a map from the MD5 string to a list of file paths instead of just to the first one.
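The list-per-hash variant could be sketched as follows. The names DuplicateGrouper and groupByHash are hypothetical, and the hash function is passed in as a parameter (an assumption made so the sketch is self-contained) so that any MD5 helper can be plugged in:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical class name, for illustration only.
class DuplicateGrouper {
    // Groups file paths by their hash and keeps only groups with at least two files.
    static Map<String, List<String>> groupByHash(List<String> filepaths,
                                                 Function<String, String> hasher) {
        Map<String, List<String>> groups = new HashMap<>();
        for (String filepath : filepaths) {
            String hash = hasher.apply(filepath);
            // Create the list on first sight of this hash, then append the path.
            groups.computeIfAbsent(hash, k -> new ArrayList<>()).add(filepath);
        }
        // A hash seen only once has no duplicates, so drop it.
        groups.values().removeIf(paths -> paths.size() < 2);
        return groups;
    }
}
```

Each remaining list then holds all paths that share one hash, and these can be printed as the duplicate list or passed pairwise to FileUtils.contentEquals if a byte-for-byte confirmation is wanted.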
