[Leetcode] 609. Find Duplicate File in System: Solution Report

Article link: https://blog.csdn.net/magicbean2/article/details/78986975

Problem:

Given a list of directory info including directory path, and all the files with contents in this directory, you need to find out all the groups of duplicate files in the file system in terms of their paths.

A group of duplicate files consists of at least two files that have exactly the same content.

A single directory info string in the input list has the following format:

"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"

It means there are n files (f1.txt, f2.txt ... fn.txt with content f1_content, f2_content ... fn_content, respectively) in directory root/d1/d2/.../dm. Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.

The output is a list of group of duplicate file paths. For each group, it contains all the file paths of the files that have the same content. A file path is a string that has the following format:

"directory_path/file_name.txt"

Example 1:

Input:
["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
Output:  
[["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]

Note:

No order is required for the final output.
You may assume the directory name, file name and file content only has letters and digits, and the length of file content is in the range of [1,50].
The number of files given is in the range of [1,20000].
You may assume no files or directories share the same name in the same directory.
You may assume each given directory info represents a unique directory. Directory path and file info are separated by a single blank space.

Follow-up beyond contest:

1) Imagine you are given a real file system, how will you search files? DFS or BFS?
2) If the file content is very large (GB level), how will you modify your solution?
3) If you can only read the file by 1kb each time, how will you modify your solution?
4) What is the time complexity of your modified solution? What is the most time-consuming part and memory consuming part of it? How to optimize?
5) How to make sure the duplicated files you find are not false positive?

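Approach:

The algorithm itself is not hard; I think the main challenge is answering the follow-up questions well. So I first summarize the approach briefly and then focus on how to answer the follow-ups (my answers may not be entirely accurate and are for reference only).

The key to the algorithm is a hash table that maps each fn_content to an array of full file paths (folder_path + "/" + fn). For each string in paths, we parse it with istringstream: the first token is the folder path, and every following token describes one file in that folder, in the form fn(fn_content). We then split each token into fn and fn_content and add the entry to the hash table. Finally, we traverse the hash table once and add every entry whose content appears in more than one file to the result.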

On the follow-up questions:

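1) On a real file system, either DFS or BFS works. The difference is that DFS may need a lot of stack memory when the directory tree is very deep, while BFS may build up a long queue, and therefore also use a lot of memory, when each folder contains many subfolders.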

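2) If the file content is very large, it obviously cannot be used directly as the hash table key. I see a few possible improvements: a) use MD5 (or a similar algorithm) to generate a digest of the file content and use that digest as the key; b) hash in multiple passes, for example using the length of fn_content as the key in the first pass, and then, in the second pass, using MD5 only within each list of files of the same length to decide whether their contents are the same. The benefit is that the first pass already rules out a large number of files of different sizes, so far fewer files need MD5, which improves efficiency.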

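3) Then compute an MD5 for every 1 KB of data read, and finally either collect all the MD5 results into a vector or hash them again with MD5, and use the final result as the hash table key.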

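4) Assuming each folder holds a constant number of files (and folders) on average, the modified algorithm runs in O(file_counts * avg_file_length) time, where file_counts is the total number of files in the file system and avg_file_length is the average file length. The most time-consuming part is likely one of two: a) traversing the files (if there are very many of them); b) reading each file's content (if each file is very large). To optimize, besides the method in 3), we can parallelize. For example, the traversal can be parallelized with multiple threads or in a MapReduce style, merging all results at the end. Reading file contents could also be multithreaded (though that needs support from the disk; are there disks like that?).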

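5) Even the best hash algorithm cannot completely rule out collisions, so to make sure, the only option is a final double check of the results with the most naive method, that is, comparing the candidate files' contents directly.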

Code:

#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

class Solution {
public:
    vector<vector<string>> findDuplicate(vector<string>& paths) {
        unordered_map<string, vector<string>> hash;     // {fn_content, full paths with that content}
        for (const string &info : paths) {
            string folder_path;
            vector<string> files;
            resolve(info, folder_path, files);
            for (const string &file : files) {
                string name, content;
                getNameContent(file, name, content);
                hash[content].push_back(folder_path + "/" + name);
            }
        }
        vector<vector<string>> ret;
        for (auto it = hash.begin(); it != hash.end(); ++it) {
            if (it->second.size() > 1) {                // keep only content shared by 2+ files
                ret.push_back(it->second);
            }
        }
        return ret;
    }
private:
    // Splits "dir f1.txt(c1) f2.txt(c2)" into the directory and its "name(content)" tokens.
    void resolve(const string &s, string &folder_path, vector<string> &files) {
        istringstream iss(s);
        iss >> folder_path;
        string file;
        // Test the extraction itself: looping on !eof() can push a stale
        // token when the input ends with whitespace.
        while (iss >> file) {
            files.push_back(file);
        }
    }
    // Splits "name(content)" into name and content, dropping the parentheses.
    void getNameContent(const string &file, string &name, string &content) {
        size_t index = file.find('(');
        name = file.substr(0, index);
        content = file.substr(index + 1, file.size() - index - 2);
    }
};