All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example, given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", return ["AAAAACCCCC", "CCCCCAAAAA"].
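Each nucleotide can be represented with two bits:
A: 00 -> 0
C: 01 -> 1
G: 10 -> 2
T: 11 -> 3
A 10-character sequence therefore has 2^20 possible encodings. Since 2^20 < 2^32, each encoding fits in an int.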
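To make the two-bit encoding concrete, here is a minimal sketch that packs a sequence into a 32-bit integer. It assumes the mapping A=00, C=01, G=10, T=11 and input that contains only those four letters. The helper names `baseCode` and `packSequence` are made up for illustration and are not part of the solution.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: map one nucleotide to its 2-bit code
// (A=00, C=01, G=10, T=11). Returns -1 for any other character.
int baseCode(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// Hypothetical helper: pack up to 16 nucleotides into a 32-bit integer,
// two bits per letter, with the first letter in the most significant position.
// Assumes seq contains only A, C, G, T.
std::uint32_t packSequence(const std::string& seq) {
    std::uint32_t code = 0;
    for (char c : seq) {
        code = (code << 2) | static_cast<std::uint32_t>(baseCode(c));
    }
    return code;
}
```

With this mapping, "ACGT" packs to binary 00011011 (27), and ten T's pack to 2^20 - 1, the largest value a 10-letter sequence can take.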
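#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        vector<string> res;
        int len = s.size();
        if (len < 10) return res;
        unordered_map<int, int> m;  // encoded sequence -> number of occurrences
        for (int i = 0; i <= len - 10; i++) {
            string sub = s.substr(i, 10);
            int code = encode(sub);
            // Report a sequence exactly once, when its second occurrence is seen.
            if (++m[code] == 2) res.push_back(sub);
        }
        return res;
    }

private:
    // Packs the sequence into an int, two bits per nucleotide (A=0, C=1, G=2, T=3).
    int encode(const string& sub) {
        int code = 0;
        for (char c : sub) {
            code <<= 2;
            switch (c) {
                case 'A': code |= 0; break;
                case 'C': code |= 1; break;
                case 'G': code |= 2; break;
                case 'T': code |= 3; break;
            }
        }
        return code;
    }
};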
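The solution above re-encodes all 10 letters at every position. One possible refinement, sketched below under the same two-bit mapping (A=0, C=1, G=2, T=3), is to keep a rolling code: shift in the new letter and mask off everything above the low 20 bits. The function name `findRepeatedRolling` is hypothetical, and the sketch assumes the input contains only A, C, G, and T (any other character is treated as A).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of a rolling-window variant; not the original author's code.
std::vector<std::string> findRepeatedRolling(const std::string& s) {
    const std::size_t kLen = 10;                 // window length in letters
    const std::uint32_t kMask = (1u << 20) - 1;  // keeps the low 20 bits
    std::vector<std::string> res;
    if (s.size() < kLen) return res;

    std::unordered_map<std::uint32_t, int> seen;  // window code -> occurrences
    std::uint32_t code = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        std::uint32_t v = 0;
        switch (s[i]) {
            case 'C': v = 1; break;
            case 'G': v = 2; break;
            case 'T': v = 3; break;
            default:  v = 0; break;  // 'A' (and, by assumption, nothing else)
        }
        // Shift in the new letter and drop the one that left the window.
        code = ((code << 2) | v) & kMask;
        if (i + 1 < kLen) continue;  // window not yet full
        // Record the sequence once, on its second occurrence.
        if (++seen[code] == 2) res.push_back(s.substr(i + 1 - kLen, kLen));
    }
    return res;
}
```

Called on the example string "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", this returns "AAAAACCCCC" followed by "CCCCCAAAAA". Each step does constant work, so the encoding cost no longer depends on the window length.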