All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = “AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT”,
Return:
[“AAAAACCCCC”, “CCCCCAAAAA”].
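There are 4^10 possible 10-letter strings, so searching for each one with KMP is not feasible.
At first I used two HashSets, one for sequences seen once and one for sequences seen more than once, but it kept hitting Memory Limit Exceeded. Switching to a single HashMap passed. My guess is that automatic resizing is to blame: the first set grows, then elements are removed from it and added to the second, but the first set never shrinks.
The tags include bit manipulation, so I suspect the Memory Limit Exceeded error means the sequences should be encoded to save space. First, consider encoding A, C, G, and T in binary: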
A -> 00
C -> 01
G -> 10
T -> 11
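A 10-letter string then needs a 20-bit code. An int is normally 4 bytes (32 bits), which is enough. A char is 2 bytes, so a string that originally needed 20 bytes now needs only 4. For example: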
ACGTACGTAC -> 00011011000110110001
AAAAAAAAAA -> 00000000000000000000
But since the HashMap version already passed, I didn't bother to write it...
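For reference, the 2-bit encoding idea described above could be sketched roughly as follows. This is not the submitted solution, only an illustration under assumptions: the class name `EncodedDnaSketch` and the helpers `code` and `encode` are made up for this example, and the input is assumed to contain only the letters A, C, G, and T. It keeps a rolling 20-bit key for the current window instead of storing each 10-letter substring.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the bit-encoding approach; names are illustrative only.
class EncodedDnaSketch {
    // Map a nucleotide to its 2-bit code: A=00, C=01, G=10, T=11.
    static int code(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("not a nucleotide: " + c);
        }
    }

    // Pack the 10-letter window starting at 'start' into the low 20 bits of an int.
    static int encode(String s, int start) {
        int key = 0;
        for (int i = start; i < start + 10; i++) {
            key = (key << 2) | code(s.charAt(i));
        }
        return key;
    }

    static List<String> findRepeatedDnaSequences(String s) {
        List<String> result = new ArrayList<>();
        if (s.length() <= 10) return result;     // no room for two windows
        Set<Integer> seen = new HashSet<>();     // keys seen at least once
        Set<Integer> reported = new HashSet<>(); // keys already added to result
        final int mask = (1 << 20) - 1;          // keep only the low 20 bits
        int key = 0;
        for (int i = 0; i < s.length(); i++) {
            // Shift in the new letter and drop the letter that left the window.
            key = ((key << 2) | code(s.charAt(i))) & mask;
            if (i < 9) continue;                 // first full window ends at index 9
            if (!seen.add(key) && reported.add(key)) {
                result.add(s.substring(i - 9, i + 1));
            }
        }
        return result;
    }
}
```

Note that `Set<Integer>` still boxes each key, so the saving over storing strings is real but not the full 20-bytes-to-4 ratio; since there are only 2^20 possible keys, a fixed-size bit array indexed by the key would be a further option.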
[code]
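import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

public class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        // Count how many times each 10-letter window occurs.
        HashMap<String, Integer> map = new HashMap<String, Integer>();
        // A string of length 10 or less cannot contain a repeated window.
        if (s.length() <= 10) return new ArrayList<String>();
        for (int i = 0; i < s.length() - 9; i++) {
            String temp = s.substring(i, i + 10);
            if (!map.containsKey(temp)) map.put(temp, 1);
            else map.put(temp, map.get(temp) + 1);
        }
        // Collect every window that was seen more than once.
        ArrayList<String> r = new ArrayList<String>();
        for (String str : map.keySet()) {
            if (map.get(str) > 1) r.add(str);
        }
        return r;
    }
}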