All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", Return: ["AAAAACCCCC", "CCCCCAAAAA"].
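Basic idea:
1) Traverse the string and hash every substring of length 10 into a hash table.
2) Traverse the hash table and output every substring that occurs more than once.
This gives the following algorithm: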
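vector<string> findRepeatedDnaSequences(string s) {
    vector<string> result;
    // substring -> number of occurrences
    // (std::map is an ordered tree; unordered_map would be the true hash table)
    map<string,int> tbl;
    // Use <= so that the last window, which ends exactly at s.size(), is counted too.
    for (int i = 0; (i + 10) <= s.size(); i++) {
        tbl[s.substr(i, 10)]++;
    }
    map<string,int>::iterator iter;
    for (iter = tbl.begin(); iter != tbl.end(); iter++) {
        if (iter->second > 1)
            result.push_back(iter->first);
    }
    return result;
}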
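To check the string-keyed approach against the sample input, here is a minimal self-contained sketch. The names `repeatedTenMers` and `checkSample` are hypothetical and introduced only for this illustration. The sketch swaps in `std::unordered_map` and sorts the result so that the output order is deterministic; it is not the code that was submitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: count every 10-letter window, using the substring itself as the key.
std::vector<std::string> repeatedTenMers(const std::string& s) {
    std::unordered_map<std::string, int> counts;
    for (std::size_t i = 0; i + 10 <= s.size(); ++i) {
        ++counts[s.substr(i, 10)];
    }
    std::vector<std::string> result;
    for (const auto& entry : counts) {
        if (entry.second > 1) result.push_back(entry.first);
    }
    std::sort(result.begin(), result.end());  // hash iteration order is unspecified
    return result;
}

// Runs the sample from the problem statement.
void checkSample() {
    const std::vector<std::string> expected = {"AAAAACCCCC", "CCCCCAAAAA"};
    assert(repeatedTenMers("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT") == expected);
}
```

Calling `checkSample()` passes silently when the two repeated sequences from the example are found. Note that every distinct window is stored as a full `std::string` key.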
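LeetCode reports Memory Limit Exceeded for this string-keyed solution: for a very long input there are too many substrings, and storing each one costs memory.
In other words, using the substrings themselves as hash keys uses too much memory.
Since only four characters occur, 'A', 'C', 'G', and 'T', we can represent them as 0b00, 0b01, 0b10, and 0b11.
A 10-character substring then needs only 20 bits. The algorithm is the same as above, except that each 10-character substring must be converted to its integer bit representation and back.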
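As a small illustration of the 2-bit encoding just described, the following sketch packs a 10-letter window into an integer and unpacks it again. The helper names `baseCode`, `encode`, and `decode` are hypothetical and chosen for this example only, and the sketch assumes the input contains nothing but A, C, G, and T.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helpers for illustration only.
// Map one nucleotide to its 2-bit code: A=00, C=01, G=10, T=11.
inline std::uint32_t baseCode(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;  // assumption: anything else is 'T'
    }
}

// Pack the 10 characters starting at `start` into the low 20 bits.
// The first character of the window ends up in the lowest two bits.
inline std::uint32_t encode(const std::string& s, std::size_t start) {
    assert(start + 10 <= s.size());
    std::uint32_t val = 0;
    for (int i = 9; i >= 0; --i) {
        val = (val << 2) | baseCode(s[start + i]);
    }
    return val;
}

// Inverse of encode: read two bits at a time, lowest bits first.
inline std::string decode(std::uint32_t val) {
    static const char letters[4] = {'A', 'C', 'G', 'T'};
    std::string out;
    for (int i = 0; i < 10; ++i) {
        out.push_back(letters[val & 0x3]);
        val >>= 2;
    }
    return out;
}
```

With these helpers, `encode("CAAAAAAAAA", 0)` is 1 and ten T's encode to 2^20 - 1 = 1048575, so every window fits in 20 bits and `decode(encode(s, i))` returns the original window.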
<pre name="code" class="cpp">class Solution {
public:
    /* Algorithm: hashing.
       The naive approach takes each 10-letter substring and searches for it
       in the rest of the string, which needs many comparisons.
       Since there are only four letters A, C, G, T, 2 bits are enough for each,
       so a 10-letter string fits in one 32-bit integer (20 bits are used).
       We therefore only need to hash each substring to a number and count it in a map;
       every map entry with a count > 1 is part of the output.
       A: 00, C: 01, G: 10, T: 11
    */
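    // Decode a 20-bit value back into its 10-letter sequence.
    string int2seq(int num){
        string s;
        char tbl[4]={'A','C','G','T'};//0b00,0b01,0b10,0b11
        for(int i = 0;i < 10;i++){
            s.append(1,tbl[num&0x3]); // the lowest two bits hold the next letter
            num >>= 2;
        }
        return s;
    }
    // Encode the 10 letters starting at s[start] as a 20-bit value.
    int seq2int(const string &s,int start){
        // Build the lookup table once instead of on every call.
        static const unordered_map<char,int>table={
            {'A',0b00},{'C',0b01},{'G',0b10},{'T',0b11}
        };
        int val = 0;
        // Walk backwards so that s[start] lands in the lowest two bits, matching int2seq.
        for(int i = 9;i >= 0;i--){
            val <<= 2;
            val |= table.at(s[start+i]);
        }
        return val;
    }
    vector<string> findRepeatedDnaSequences(string s) {
        vector<string>result;
        unordered_map<int,int>table; // encoded window -> number of occurrences
        for(int i = 0;(i+10) <= s.size();i++){
            table[seq2int(s,i)]++;
        }
        for(auto it = table.begin();it != table.end();it++){
            if(it->second > 1)
                result.push_back(int2seq(it->first));
        }
        return result;
    }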
};