All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",
Return:
["AAAAACCCCC", "CCCCCAAAAA"].
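[Analysis]
The basic idea is easy to come up with: iterate over every length-10 substring of the input and use a HashMap to count occurrences; any substring that occurs two or more times goes into the result.
Adding a substring to the result only at its second occurrence keeps duplicates out of the result. A direct implementation with String keys, however, gets Memory Limit Exceeded, meaning the program uses too much memory.
The key is to convert each candidate substring to an int to save memory. How can a substring be encoded efficiently? There are only four characters, A, C, G and T, so two bits are enough to tell them apart: 00, 01, 10 and 11. A 10-letter substring therefore fits in 20 bits.
The mask trick in the reference solution is worth learning. It uses the 18-bit mask 0x3ffff, called eraser. To slide the window forward by one character, compute hint & eraser on the old code (this drops the two bits of the oldest character), shift the result left by two bits, and add the code of the new character.
That yields the code of the new substring, which is quite neat.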
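To make the encoding and the eraser step concrete, here is a minimal sketch. The class name DnaEncodingDemo and the helpers code, encode and roll are hypothetical names chosen for illustration; only the 2-bit codes and the 0x3ffff mask come from the analysis above.

```java
public class DnaEncodingDemo {
    // 18 one-bits: keeps the codes of the 9 most recent characters.
    static final int ERASER = 0x3ffff;

    // 2-bit code per nucleotide: A=00, C=01, G=10, T=11.
    static int code(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("not a nucleotide: " + c);
        }
    }

    // Encodes a 10-character window from scratch, 2 bits per character.
    static int encode(String window) {
        int hint = 0;
        for (int i = 0; i < window.length(); i++) {
            hint = (hint << 2) + code(window.charAt(i));
        }
        return hint;
    }

    // Slides a 10-character window by one: drop the oldest character, append the next.
    static int roll(int hint, char next) {
        return ((hint & ERASER) << 2) + code(next);
    }

    public static void main(String[] args) {
        String s = "AAAAACCCCCA";
        int first = encode(s.substring(0, 10));  // window "AAAAACCCCC"
        int rolled = roll(first, s.charAt(10));  // window "AAAACCCCCA"
        System.out.println(first);                                 // prints 341
        System.out.println(rolled == encode(s.substring(1, 11)));  // prints true
    }
}
```

The rolled code matches a from-scratch encoding of the shifted window, so each step costs O(1) instead of re-reading 10 characters.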
[ref]
[url]http://blog.csdn.net/coderhuhy/article/details/43647731[/url]
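import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

public class Solution {
    // Method 2: store int codes in the HashMap instead of Strings to avoid MLE.
    // Mask that keeps the low 18 bits, i.e. the codes of the 9 most recent characters.
    public static final int eraser = 0x3ffff;
    // Maps each nucleotide to its 2-bit code.
    public static HashMap<Character, Integer> ati = new HashMap<Character, Integer>();
    static {
        ati.put('A', 0);
        ati.put('C', 1);
        ati.put('G', 2);
        ati.put('T', 3);
    }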
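    public List<String> findRepeatedDnaSequences(String s) {
        List<String> result = new ArrayList<String>();
        // A string of length 10 or less cannot contain a repeated 10-letter substring.
        if (s == null || s.length() <= 10)
            return result;
        int N = s.length();
        // Encode the first 10-letter window, 2 bits per character.
        int hint = 0;
        for (int i = 0; i < 10; i++) {
            hint = (hint << 2) + ati.get(s.charAt(i));
        }
        // Maps a window code to the number of times it has been seen (capped at 2).
        HashMap<Integer, Integer> checker = new HashMap<Integer, Integer>();
        checker.put(hint, 1);
        for (int i = 10; i < N; i++) {
            // Drop the oldest character, then append the new one.
            hint = ((hint & eraser) << 2) + ati.get(s.charAt(i));
            Integer value = checker.get(hint);
            if (value == null) {
                checker.put(hint, 1);
            } else if (value == 1) {
                // Second occurrence: report the substring exactly once.
                checker.put(hint, value + 1);
                result.add(s.substring(i - 9, i + 1));
            }
        }
        return result;
    }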
    // Method 1: straightforward String-keyed version; gets Memory Limit Exceeded.
    // Results come out in HashMap iteration order, not in order of appearance.
    public List<String> findRepeatedDnaSequences1(String s) {
        HashMap<String, Integer> map = new HashMap<String, Integer>();
        int last = s.length() - 10;
        // Count every 10-letter substring.
        for (int i = 0; i <= last; i++) {
            String key = s.substring(i, i + 10);
            if (map.containsKey(key)) {
                map.put(key, map.get(key) + 1);
            } else {
                map.put(key, 1);
            }
        }
        // Keep the substrings that occurred more than once.
        List<String> result = new ArrayList<String>();
        for (String key : map.keySet()) {
            if (map.get(key) > 1)
                result.add(key);
        }
        return result;
    }
}
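
As a possible variation on Method 2, the "report on the second occurrence" rule can also be written with two sets of int codes instead of a counting map. This is only a sketch under that assumption, not the reference solution; the class name RepeatedDnaSets and the names findRepeated, code, seen and reported are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RepeatedDnaSets {
    // Keeps the low 18 bits, i.e. the 9 most recent characters.
    private static final int MASK = 0x3ffff;

    // 2-bit code per nucleotide: A=00, C=01, G=10, T=11.
    private static int code(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("not a nucleotide: " + c);
        }
    }

    public static List<String> findRepeated(String s) {
        List<String> result = new ArrayList<>();
        if (s == null || s.length() <= 10) {
            return result;
        }
        Set<Integer> seen = new HashSet<>();      // codes seen at least once
        Set<Integer> reported = new HashSet<>();  // codes already added to result
        int hint = 0;
        for (int i = 0; i < s.length(); i++) {
            // Drop the oldest character (a no-op while the window is filling), append the new one.
            hint = ((hint & MASK) << 2) | code(s.charAt(i));
            if (i < 9) {
                continue; // window does not hold 10 characters yet
            }
            // Set.add returns false when the element was already present.
            if (!seen.add(hint) && reported.add(hint)) {
                result.add(s.substring(i - 9, i + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [AAAAACCCCC, CCCCCAAAAA]
        System.out.println(findRepeated("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"));
    }
}
```

Like Method 2, this returns the repeated substrings in the order in which their second occurrence is found.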