Problem Statement
(Source) All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example, given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", return ["AAAAACCCCC", "CCCCCAAAAA"].
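To see where the two answers come from, here is a minimal sketch that lists the start index of every 10-letter window in the example and keeps only the windows that appear at more than one index. The helper name window_positions is made up for illustration and is not part of the expected solution.

```python
from collections import defaultdict


def window_positions(s, k=10):
    """Map each k-letter substring that repeats to the list of its start indices."""
    positions = defaultdict(list)
    for i in range(len(s) - k + 1):
        positions[s[i:i + k]].append(i)
    # Keep only substrings seen at two or more start positions.
    return {sub: idx for sub, idx in positions.items() if len(idx) > 1}


print(window_positions("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"))
# {'AAAAACCCCC': [0, 10], 'CCCCCAAAAA': [5, 16]}
```

Note that the windows are allowed to overlap: "CCCCCAAAAA" starts at index 5, inside the first occurrence of "AAAAACCCCC".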
Solution
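Naive first: slide a 10-letter window across s, remember every substring seen so far, and collect the ones that show up again in a set: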
class Solution(object):
    def findRepeatedDnaSequences(self, s):
        """
        :type s: str
        :rtype: List[str]
        """
        if not s:
            return []
        n = len(s)
        if n < 11:
            return []
        res = set()
        dic = {}
        for i in range(n - 9):
            sub = s[i : i + 10]
            if sub not in dic:
                dic[sub] = 1
            else:
                res.add(sub)
        return list(res)
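Tweak it a little bit: count the occurrences of each substring and append it to the result exactly when its count reaches 2, so the result can be a plain list with no duplicates: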
class Solution(object):
    def findRepeatedDnaSequences(self, s):
        """
        :type s: str
        :rtype: List[str]
        """
        if not s:
            return []
        n = len(s)
        if n < 11:
            return []
        res = []
        dic = {}
        for i in range(n - 9):
            sub = s[i : i + 10]
            dic[sub] = dic.get(sub, 0) + 1
            if dic[sub] == 2:
                res.append(sub)
        return res
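Final solution, using bit manipulation to save space by converting strings to integers. Each nucleotide is encoded in 2 bits, so a 10-letter window fits in 20 bits; the mask 0xFFFFF keeps only the low 20 bits, which drops the oldest letter as the window slides: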
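class Solution(object):
    def findRepeatedDnaSequences(self, s):
        """
        :type s: str
        :rtype: List[str]
        """
        n = len(s)
        if n <= 10:
            return []
        res = []
        y = 0  # integer encoding of the current window
        m = {'A': 0, 'C': 1, 'G': 2, 'T': 3}  # 2 bits per nucleotide
        counter = dict()
        # range works in both Python 2 and 3 (xrange is Python 2 only)
        for i in range(n):
            # shift in the new letter and keep only the last 10 letters (20 bits)
            y = (y * 4 + m[s[i]]) & 0xFFFFF
            if i < 9:
                continue  # the first full window ends at index 9
            counter[y] = counter.get(y, 0) + 1
            if counter[y] == 2:
                res.append(s[i - 9 : i + 1])
        return res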
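The encoding step can be checked in isolation. The sketch below is only an illustration: the helpers encode and decode and the constants CODE, LETTERS and MASK are names introduced here, not part of the solution above. It packs a 10-letter string into an integer directly, then confirms that the rolling update with the 0xFFFFF mask yields the same integer for every window of the example string.

```python
CODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
LETTERS = "ACGT"
MASK = 0xFFFFF  # 20 bits = 10 letters * 2 bits each


def encode(seq):
    """Pack a DNA string into an integer, 2 bits per letter."""
    y = 0
    for ch in seq:
        y = y * 4 + CODE[ch]
    return y


def decode(y, length=10):
    """Unpack an integer produced by encode back into a DNA string."""
    out = []
    for _ in range(length):
        out.append(LETTERS[y & 3])  # lowest 2 bits are the last letter
        y >>= 2
    return "".join(reversed(out))


s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
y = 0
for i, ch in enumerate(s):
    y = (y * 4 + CODE[ch]) & MASK  # same rolling update as the solution
    if i >= 9:
        window = s[i - 9:i + 1]
        assert y == encode(window)
        assert decode(y) == window

print(encode("AAAAACCCCC"))  # 341, i.e. 0b0101010101
```

Because the mapping between 10-letter windows and 20-bit integers is one-to-one, two windows get the same key only when they are the same sequence, so counting the integers cannot produce false positives.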