Problem link: https://leetcode.com/problems/repeated-dna-sequences/
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
Example:
Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC", "CCCCCAAAAA"]
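Approach 1:
Use two sets. Slide over the string from left to right, take each 10-character window with substring, and check whether that window has been seen before.
AC, first version, 21 ms: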
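import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        Set<String> setDict = new HashSet<>();   // every window seen so far
        Set<String> setJudge = new HashSet<>();  // windows already reported
        List<String> list = new ArrayList<>();
        // A repeat needs at least two windows, i.e. at least 11 characters.
        if (s == null || s.length() < 11)
            return list;
        for (int i = 0; i <= s.length() - 10; i++) {
            String sub = s.substring(i, i + 10);
            if (!setDict.contains(sub))
                setDict.add(sub);
            else {
                // Seen before: report it only the first time it repeats.
                if (!setJudge.contains(sub))
                    list.add(sub);
                setJudge.add(sub);
            }
        }
        return list;
    }
}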
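Later I found the same approach in the discussion section, written to take advantage of the return value of add. The optimized version:
19 ms: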
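import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        Set<String> setDict = new HashSet<>(), setJudge = new HashSet<>();
        for (int i = 0; i <= s.length() - 10; i++) {
            String sub = s.substring(i, i + 10);
            // add returns false when sub is already in setDict, i.e. it is a repeat.
            if (!setDict.add(sub))
                setJudge.add(sub);
        }
        return new ArrayList<>(setJudge);
    }
}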
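The optimization above relies on `Set.add` returning `false` when the element is already in the set. A minimal sketch of that behavior follows; the class name `SetAddDemo` and the helper `findRepeats` are illustrative names, not part of the original solution.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SetAddDemo {
    // Hypothetical helper: returns the elements that occur more than once
    // (in unspecified order, since the result comes from a HashSet).
    static List<String> findRepeats(List<String> items) {
        Set<String> seen = new HashSet<>();
        Set<String> repeated = new HashSet<>();
        for (String item : items) {
            // add() returns false when the set already contains the element
            if (!seen.add(item)) {
                repeated.add(item);
            }
        }
        return new ArrayList<>(repeated);
    }

    public static void main(String[] args) {
        Set<String> seen = new HashSet<>();
        System.out.println(seen.add("AAAAACCCCC")); // true: first insertion
        System.out.println(seen.add("AAAAACCCCC")); // false: already present
    }
}
```

Because the second set also deduplicates, a window that appears three or more times is still reported only once.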
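Approach 2:
Bit manipulation, mainly to save space. The complexity is the same.
The input string consists of only four characters: A, C, G, and T.
Their ASCII codes in binary are:
A: 0100 0001 C: 0100 0011 G: 0100 0111 T: 0101 0100
The goal is to tell the characters apart using as few bits as possible. The lowest three bits of each code are all different (001, 011, 111, 100), so those three bits are enough to distinguish the four characters. The problem asks for 10-character sequences, and at three bits per character a window needs 30 bits, which fits in a 32-bit int. To keep the window at 30 bits we also need a mask, 0x7ffffff: it keeps the low 27 bits (the 9 most recent characters), and shifting the result left by three bits makes room for the next character's three bits.
AC, 15 ms: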
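The encoding described above can be sketched in isolation. This is a minimal illustration, assuming the same 3-bit codes and mask as the text; the class name `DnaBitsDemo` and the helpers `code` and `encode` are hypothetical names introduced here.

```java
class DnaBitsDemo {
    // Keeps the low 27 bits, i.e. the codes of the 9 most recent characters.
    static final int MASK = 0x7ffffff;

    // Low three bits of the ASCII code: A -> 1, C -> 3, G -> 7, T -> 4.
    static int code(char c) {
        return c & 7;
    }

    // Packs the 10 characters starting at start into one int, 3 bits each.
    static int encode(String s, int start) {
        int cur = 0;
        for (int i = start; i < start + 10; i++) {
            // Drop anything above 27 bits, shift, then append the new code.
            cur = ((cur & MASK) << 3) | code(s.charAt(i));
        }
        return cur;
    }

    public static void main(String[] args) {
        for (char c : "ACGT".toCharArray()) {
            System.out.println(c + " -> " + code(c));
        }
        String s = "AAAAACCCCCAAAAACCCCC";
        System.out.println(encode(s, 0) == encode(s, 10)); // same window text, same key
        System.out.println(encode(s, 0) == encode(s, 1));  // different window text, different key
    }
}
```

Since the four codes are distinct and each occupies its own 3-bit slot, two windows map to the same int only when they contain the same 10 characters, so the int can stand in for the substring as the set key.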
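import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        Set<Integer> set = new HashSet<>();    // 30-bit keys of the windows seen so far
        Set<String> repeat = new HashSet<>();  // windows that occur more than once
        int cur = 0;
        int mask = 0x7ffffff;                  // low 27 bits
        if (s.length() < 11)
            return new ArrayList<String>();
        // Preload the first 9 characters, 3 bits each.
        for (int i = 0; i < 9; i++) {
            cur = (cur << 3) | (s.charAt(i) & 7);
        }
        for (int i = 9; i < s.length(); i++) {
            // Drop the oldest character, then append the new one.
            cur = ((cur & mask) << 3) | (s.charAt(i) & 7);
            if (!set.add(cur))
                repeat.add(s.substring(i - 9, i + 1));
        }
        return new ArrayList<>(repeat);
    }
}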