Problem Description
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",
Return:
["AAAAACCCCC", "CCCCCAAAAA"].
Analysis
This problem tests bitmap-style bit manipulation. Each of A, C, G, and T is represented by the following two bits:
A 00
C 01
G 10
T 11
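So 10 consecutive characters need only 20 bits, which fit in a single int (32 bits). Define a variable hash whose low 20 bits encode the current 10-character sequence and whose remaining bits are set to 0.
Define a set to hold the hashes that have already appeared. When a new hash is computed and it has appeared before, add the corresponding sequence to the result set. Because the 2-bit encoding is one-to-one, equal hashes always mean equal sequences, so no collision check is needed.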
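The 2-bit packing described above can be sketched in isolation as follows. The class name `DnaWindowCodec` and the helpers `code`, `encode`, and `decode` are hypothetical names introduced for illustration; they are not part of the solution in the Code section.

```java
final class DnaWindowCodec {
    // 2-bit code for one nucleotide, following the table above.
    static int code(char c) {
        switch (c) {
            case 'A': return 0; // 00
            case 'C': return 1; // 01
            case 'G': return 2; // 10
            case 'T': return 3; // 11
            default: throw new IllegalArgumentException("not a nucleotide: " + c);
        }
    }

    // Packs exactly 10 nucleotides into the low 20 bits of an int.
    static int encode(String window) {
        if (window.length() != 10) {
            throw new IllegalArgumentException("window must have 10 letters");
        }
        int hash = 0;
        for (int i = 0; i < 10; i++) {
            hash = (hash << 2) | code(window.charAt(i));
        }
        return hash;
    }

    // Inverse of encode: reads the low 20 bits back into 10 letters.
    static String decode(int hash) {
        char[] letters = {'A', 'C', 'G', 'T'};
        char[] out = new char[10];
        for (int i = 9; i >= 0; i--) {
            out[i] = letters[hash & 3]; // lowest 2 bits are the last letter
            hash >>>= 2;
        }
        return new String(out);
    }

    public static void main(String[] args) {
        int h = encode("AAAAACCCCC");
        System.out.println(h);         // prints 341 (binary 0101010101)
        System.out.println(decode(h)); // prints AAAAACCCCC
    }
}
```

Since `decode` recovers the window exactly, two different 10-letter windows can never share a hash.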
Code
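import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public static List<String> findRepeatedDnaSequences(String s) {
    // With fewer than 11 characters there is at most one 10-letter window, so nothing can repeat.
    if (s == null || s.length() < 11) {
        return new ArrayList<String>();
    }
    int hash = 0;
    Set<Integer> appear = new HashSet<Integer>(); // hashes of windows seen so far
    Set<String> set = new HashSet<String>();      // repeated sequences, deduplicated
    Map<Character, Integer> map = new HashMap<Character, Integer>();
    map.put('A', 0);
    map.put('C', 1);
    map.put('G', 2);
    map.put('T', 3);
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        // Shift in the 2-bit code of the new character, then keep only the low 20 bits.
        hash = (hash << 2) + map.get(c);
        hash &= (1 << 20) - 1;
        // From index 9 onward, hash covers the full window s[i-9..i].
        if (i >= 9) {
            if (appear.contains(hash)) {
                set.add(s.substring(i - 9, i + 1));
            } else {
                appear.add(hash);
            }
        }
    }
    return new ArrayList<String>(set);
}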
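
Since the analysis frames this as a bitmap problem, one possible variant replaces both hash sets with `java.util.BitSet` instances of 2^20 bits each, which avoids boxing every hash into an `Integer`. This is a sketch under that assumption, not the solution above; the class name `RepeatedDnaBitSet` and the method names `findRepeated` and `code` are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class RepeatedDnaBitSet {
    // 2-bit code for one nucleotide: A=00, C=01, G=10, T=11.
    static int code(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("not a nucleotide: " + c);
        }
    }

    static List<String> findRepeated(String s) {
        List<String> result = new ArrayList<>();
        if (s == null || s.length() < 11) {
            return result;
        }
        BitSet seen = new BitSet(1 << 20);     // one bit per possible 10-letter window
        BitSet reported = new BitSet(1 << 20); // windows already added to result
        int mask = (1 << 20) - 1;
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = ((hash << 2) | code(s.charAt(i))) & mask;
            if (i < 9) {
                continue; // window not full yet
            }
            if (!seen.get(hash)) {
                seen.set(hash);
            } else if (!reported.get(hash)) {
                reported.set(hash);
                result.add(s.substring(i - 9, i + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample input from the problem description.
        System.out.println(findRepeated("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"));
        // prints [AAAAACCCCC, CCCCCAAAAA]
    }
}
```

Unlike the `HashSet<String>` result above, this variant returns sequences in the order their first repeat is detected. Each `BitSet` occupies about 128 KB regardless of input length, so it mainly pays off on long inputs.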