LeetCode Strings (2)

Today I mainly worked through LeetCode 609. Find Duplicate File in System.

Problem

> Given a list of directory info including directory path, and all the files with contents in this directory, you need to find out all the groups of duplicate files in the file system in terms of their paths.

A group of duplicate files consists of at least two files that have exactly the same content.


It means there are n files (f1.txt, f2.txt ... fn.txt with content f1_content, f2_content ... fn_content, respectively) in directory root/d1/d2/.../dm. Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.

The output is a list of group of duplicate file paths. For each group, it contains all the file paths of the files that have the same content. A file path is a string that has the following format:"directory_path/file_name.txt"

Example 1:

Input:
["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
Output:
[["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]

Note:

1.No order is required for the final output.
2.You may assume the directory name, file name and file content only has letters and digits, and the length of file content is in the range of [1,50].
3.The number of files given is in the range of [1,20000].
4.You may assume no files or directories share the same name in the same directory.
5.You may assume each given directory info represents a unique directory. Directory path and file info are separated by a single blank space.

Follow-up beyond contest:

1.Imagine you are given a real file system, how will you search files? DFS or BFS?
2.If the file content is very large (GB level), how will you modify your solution?
3.If you can only read the file by 1kb each time, how will you modify your solution?
4.What is the time complexity of your modified solution? What is the most time-consuming part and memory consuming part of it? How to optimize?
5.How to make sure the duplicated files you find are not false positive?

Analysis:

This problem is a bit tricky; it took reading quite a few references before I fully understood it. I started with a brute-force solution and then switched to an associative-container (hash map) approach. The problem also ends with five follow-up questions, which I found deep and interesting.

Brute-force approach:

First use a stringstream to split each input entry into the directory path, the file names, and the file contents. Save "path + file name" as the output-format string, and pair each such string with its file content to form a list of (path, content) records. Then traverse this list; whenever two records share the same content, group their paths together in the output.
Complexity Analysis

Time complexity: O(n·x + f²·s). Building the list of (path, content) records takes O(n·x), where n is the number of directories and x is the average string length. Every file is then compared with every other file: with f files of average size s, the pairwise comparisons take O(f²·s), since each equality check can cost O(s). The worst case occurs when all files are unique.

Space complexity: O(n·x). The sizes of res and the intermediate list can grow up to n·x.

Hash-map approach:

First use a stringstream to split each input entry into the directory path, the file names, and the file contents. Use the file content as the key and "path + file name" as the value. Then traverse the unordered_map and output every value whose size is greater than 1.

Complexity Analysis

Time complexity: O(n·x). n strings of average length x are parsed.

Space complexity: O(n·x). The sizes of the map and res grow up to n·x.

Brute-force code:

class Solution {
public:
    vector<vector<string>> findDuplicate(vector<string>& paths) {
        vector<vector<string>> files;   // each entry: {full path, content}
        vector<vector<string>> result;
        for (const auto& path : paths) {
            stringstream ss(path);
            string root, s;
            getline(ss, root, ' ');                 // directory path
            while (getline(ss, s, ' ')) {           // "name.txt(content)"
                string fileName = root + '/' + s.substr(0, s.find('('));
                string fileContent = s.substr(s.find('(') + 1,
                                              s.find(')') - s.find('(') - 1);
                files.push_back({fileName, fileContent});
            }
        }
        vector<bool> visited(files.size(), false);  // replaces the leaked raw new[]
        for (int i = 0; i + 1 < (int)files.size(); i++) {
            if (visited[i])
                continue;
            vector<string> temp;
            for (int j = i + 1; j < (int)files.size(); j++) {
                if (files[i][1] == files[j][1]) {   // same content
                    visited[j] = true;
                    temp.push_back(files[j][0]);
                }
            }
            if (!temp.empty()) {
                temp.push_back(files[i][0]);
                result.push_back(temp);
            }
        }
        return result;
    }
};

Hash-map code:

class Solution {
public:
    vector<vector<string>> findDuplicate(vector<string>& paths) {
        unordered_map<string, vector<string>> files;  // content -> paths
        vector<vector<string>> result;
        for (const auto& path : paths) {
            stringstream ss(path);
            string root, s;
            getline(ss, root, ' ');                   // directory path
            while (getline(ss, s, ' ')) {             // "name.txt(content)"
                string fileName = root + '/' + s.substr(0, s.find('('));
                string fileContent = s.substr(s.find('(') + 1,
                                              s.find(')') - s.find('(') - 1);
                files[fileContent].push_back(fileName);
            }
        }
        for (const auto& file : files) {
            if (file.second.size() > 1)               // at least two duplicates
                result.push_back(file.second);
        }
        return result;
    }
};

Takeaways:

In the brute-force solution, using a visited array inside the double loop to cut down the number of comparisons was a trick I had not thought of before; it is quite clever. The hash solution also got me looking at the STL more closely, so here I summarize some standard-library basics.
A string is a variable-length sequence of characters; a vector is a variable-length sequence of objects of some type; a map is an associative container.

string

Initialization:
- string s1;
- string s2(s1);
- string s2 = s1;
- string s3("value");
- string s3 = "value";
- string s4(n, 'c'); // initialize s4 to n copies of the character 'c'

Reading into a string with >> stops at a space or newline, while getline stops only at a newline.
When concatenating with +, at least one operand of every + must be a string. For example, string s6 = s1 + "," + "world"; is correct, while string s7 = "hello" + "," + s2; is an error, because "hello" + "," tries to add two string literals.
for (auto c : str) iterates over the characters of str, copying each into c; you can also bind c by reference to modify the characters in place. auto deduces a variable's type from its initializer, while decltype yields the declared type of its operand, such as a function's return type.

String streams:

istringstream reads data from a string, ostringstream writes data to a string, and stringstream can both read from and write to a string. All three are defined in the header <sstream>.
Operations specific to stringstream:
stringstream strm;    // strm is an unbound stringstream object
stringstream strm(s); // strm is a stringstream holding a copy of the string s
strm.str();           // returns a copy of the string strm holds
strm.str(s);          // copies the string s into strm; returns void

vector

Initialization:
- vector<T> v1;
- vector<T> v2(v1);
- vector<T> v2 = v1;
- vector<T> v3(n, val); // n copies of val
- vector<T> v4(n);      // n value-initialized elements
- vector<T> v5{a, b, c};
- vector<T> v5 = {a, b, c};
Operations supported by vector:

  • v.empty() // true if v contains no elements
  • v.size()
  • v.push_back(t); // append t at the back
  • v[n]; // reference to the element at position n
  • v1 = v2; // copy assignment
  • v1 == v2, etc.

Associative containers:

Types of associative containers:

Ordered by key:
map: associative array; holds key-value pairs
set: the key is the value; holds only keys
multimap: a map in which a key can appear multiple times
multiset: a set in which a key can appear multiple times
Unordered collections:
unordered_map: a map organized by a hash function
unordered_set: a set organized by a hash function
unordered_multimap: a hash-organized map; keys can repeat
unordered_multiset: a hash-organized set; keys can repeat

Follow-up answers

  1. Imagine you are given a real file system, how will you search files? DFS or BFS?
    In general, BFS will use more memory than DFS. However, BFS can take advantage of the locality of files inside directories, and therefore will probably be faster.
  2. If the file content is very large (GB level), how will you modify your solution?
    In a real-life solution we would not hash the entire file content, since that is not practical. Instead, we would first group all files by size; files with different sizes are guaranteed to be different. We would then hash a small part of the files with equal sizes (using MD5, for example). Only if the digests match would we compare the files byte by byte.
  3. If you can only read the file by 1kb each time, how will you modify your solution?
    This won't change the solution. We can compute the hash from the 1kb chunks, and then read the entire file if a full byte-by-byte comparison is required.
  4. What is the time complexity of your modified solution? What is the most time-consuming part and memory-consuming part of it? How to optimize?
    Time complexity is O(n² · k), since in the worst case we might need to compare every file against all others, where k is the file size.
  5. How to make sure the duplicated files you find are not false positives?
    We apply several filters in sequence: file size, hash, and a byte-by-byte comparison.
