LeetCode Repeated DNA Sequences
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", Return: ["AAAAACCCCC", "CCCCCAAAAA"].
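From the problem statement, the most direct approach comes to mind easily: take every possible 10-letter substring and compare them one by one.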
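// Requires <algorithm> for std::find.
vector<string> res;
if (s.length() <= 10)
{
    return res;
}
int len = s.length();
for (int i = 0; i < len - 10; i++)
{
    string str1 = s.substr(i, 10);
    // Skip substrings already reported, so a sequence that occurs
    // three or more times is not added twice.
    if (find(res.begin(), res.end(), str1) != res.end())
        continue;
    // j must reach len - 10, the start of the last 10-letter window.
    for (int j = i + 1; j <= len - 10; j++)
    {
        string str2 = s.substr(j, 10);
        if (str1 == str2)
        {
            res.push_back(str1);
            break;
        }
    }
}
return res;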
The time complexity is O(n^2), and the result is predictable: Time Limit Exceeded.
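Next, another approach comes to mind, trading space for time: hashing. Store every 10-letter substring that occurs in a map<string,int> table (the key is the substring, the value is the number of times it occurs), then iterate over the map. Any entry whose count is greater than 1 is a repeated substring and is added to the result array.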
vector<string> res;
if (s.length() < 11)
    return res;
int len = s.length();
map<string, int> table;
// Use i <= len - 10 so the last 10-letter window is counted too.
for (int i = 0; i <= len - 10; i++)
{
    string str = s.substr(i, 10);
    if (table.find(str) != table.end())
    {
        table[str]++;
    }
    else
    {
        table[str] = 1;
    }
}
for (map<string, int>::iterator it = table.begin(); it != table.end(); ++it)
{
    if (it->second > 1)
        res.push_back(it->first);
}
return res;
Result: Memory Limit Exceeded...
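What causes the memory overrun? Surely it is the keys of table: they are strings, and strings take a lot of space. Can we think differently and hash each string into something that fits in much less space? Only four characters can occur (A, C, G, T), so design the hash function as A->0, C->1, G->2, T->3. Each character then takes 2 bits, and a 10-letter string maps (hashes) to a 20-bit integer.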
int hash_fun1(string s)
{
    int n = 0;
    for (int i = 0; i < s.length(); i++){
        n <<= 2;
        char c = s[i];
        if (c == 'C'){
            n += 1;
        }
        else if (c == 'G'){
            n += 2;
        }
        else if (c == 'T'){
            n += 3;
        }
    }
    return n;
}
vector<string> findRepeatedDnaSequences(string s)
{
    vector<string> res;
    if (s.length() < 11)
        return res;
    int len = s.length();
    map<int, int> table;
    for (int i = 0; i <= len - 10; i++)
    {
        string str = s.substr(i, 10);
        int val = hash_fun1(str); // hash the string to an int
        map<int, int>::iterator it = table.find(val);
        if (it != table.end())
        {
            if (it->second == 1) // seen exactly once before, so report it now
                res.push_back(str);
            it->second++;
        }
        else
        {
            table[val] = 1;
        }
    }
    return res;
}
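The 2-bit encoding above also lends itself to a sliding-window update, which avoids re-hashing all ten characters at every position. The following is only a sketch of that idea, not part of the original solution: the names nucleotideCode and findRepeatedDnaSequencesRolling are hypothetical, the input is assumed to contain only A, C, G, and T, and unordered_map is swapped in for map.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Map one nucleotide to its 2-bit code: A->0, C->1, G->2, T->3.
// Assumption: the input contains only the characters A, C, G, and T.
static int nucleotideCode(char c)
{
    switch (c)
    {
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return 0; // 'A'
    }
}

// Hypothetical variant that updates the 20-bit key incrementally
// instead of hashing each 10-letter substring from scratch.
std::vector<std::string> findRepeatedDnaSequencesRolling(const std::string& s)
{
    const int kLen = 10;
    const int kMask = (1 << (2 * kLen)) - 1; // keeps the low 20 bits
    std::vector<std::string> res;
    std::unordered_map<int, int> seen; // key -> number of occurrences so far
    int key = 0;
    for (int i = 0; i < static_cast<int>(s.size()); i++)
    {
        // Shift in the new character; the mask drops the one leaving the window.
        key = ((key << 2) | nucleotideCode(s[i])) & kMask;
        if (i < kLen - 1)
            continue; // the window is not full yet
        // Report a sequence exactly once, when its second occurrence is seen.
        if (seen[key]++ == 1)
            res.push_back(s.substr(i - kLen + 1, kLen));
    }
    return res;
}
```

With the example string from the problem statement, this sketch reports "AAAAACCCCC" first and "CCCCCAAAAA" second, since results are appended in the order in which the second occurrences end.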
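Of course, other hashing schemes work too. For example, here is code from someone online that uses bit operations (the low 3 bits of the ASCII codes of A, G, C, and T are all different):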
int str2int(string s) {
    int str = 0;
    for (int i = 0; i < s.size(); ++i)
        str = (str << 3) + (s[i] & 7);
    return str;
}
vector<string> findRepeatedDnaSequences(string s) {
    vector<string> res;
    unordered_map<int, int> coll;
    for (int i = 0; i + 10 <= s.size(); ++i)
        if (coll[str2int(s.substr(i, 10))]++ == 1)
            res.push_back(s.substr(i, 10));
    return res;
}
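To see why `s[i] & 7` in str2int works as a per-character code, it may help to check the low three bits of each nucleotide directly. The helper below is a hypothetical illustration (lowThreeBits is not part of the original code) and assumes the characters are ASCII-encoded.

```cpp
// Return the low three bits of a character's code, as str2int does with s[i] & 7.
// Assumption: ASCII encoding, where 'A' = 65, 'C' = 67, 'G' = 71, 'T' = 84.
int lowThreeBits(char c)
{
    return c & 7;
}
```

Under ASCII this gives 1 for 'A', 3 for 'C', 7 for 'G', and 4 for 'T', so the four codes are distinct. Ten characters at 3 bits each use 30 bits, which still fits in a 32-bit int.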