187. Repeated DNA Sequences

Original post, 2016-05-31 20:44:25

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].



1. My answer

A straightforward solution using map, but it runs slowly.

class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        vector<string> vec;
        if (s.size() < 10)
            return vec;
        map<string, int> mp;  // substring -> number of occurrences so far
        for (size_t i = 0; i + 10 <= s.size(); i++) {
            string str = s.substr(i, 10);
            // operator[] value-initializes a missing count to 0.
            if (++mp[str] == 2)  // report each repeated substring exactly once
                vec.push_back(str);
        }
        return vec;
    }
};


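2. Someone else's answer (8 ms)

1048576 = 4^10

It maps 'A' -> 1, 'C' -> 3, 'G' -> 2, 'T' -> 0 via (s[i] - 'A' + 1) % 5,

then uses bit operations to pack ten letters into a 20-bit key.

The hash table counts with char, which holds only 256 distinct values (0~255 if unsigned). That is a hidden hazard: what if the same substring occurs more than 256 times? The counter typically wraps around, passes through 1 again, and the substring gets reported a second time.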

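vector<string> findRepeatedDnaSequences(string s) {
    char hashMap[1048576] = {0};  // one counter per 20-bit key (4^10 keys); 1 MB on the stack
    vector<string> ans;
    int len = s.size(), hashNum = 0;
    if (len < 11) return ans;  // fewer than two windows, so nothing can repeat
    // Prime the key with the first 9 letters.
    for (int i = 0; i < 9; ++i)
        hashNum = hashNum << 2 | (s[i] - 'A' + 1) % 5;
    // Slide the window: append 2 bits, keep the low 20, report on the second hit.
    for (int i = 9; i < len; ++i)
        if (hashMap[hashNum = (hashNum << 2 | (s[i] - 'A' + 1) % 5) & 0xfffff]++ == 1)
            ans.push_back(s.substr(i - 9, 10));
    return ans;
}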
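
To make the arithmetic above concrete, here is a minimal sketch of the 2-bit encoding and the 20-bit rolling key. The helper names encodeBase, rollKey and windowKey are hypothetical, introduced only for illustration; they are not part of the quoted solution, and the sketch assumes the input contains only A, C, G and T.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: map a nucleotide to a 2-bit code with the same
// arithmetic as the solution above: 'A'->1, 'C'->3, 'G'->2, 'T'->0.
inline std::uint32_t encodeBase(char c) {
    return static_cast<std::uint32_t>((c - 'A' + 1) % 5);
}

// Hypothetical helper: append one base to a rolling key and keep only the
// low 20 bits (10 bases x 2 bits), so the key always lies in [0, 4^10).
inline std::uint32_t rollKey(std::uint32_t key, char c) {
    return ((key << 2) | encodeBase(c)) & 0xFFFFFu;
}

// Hypothetical helper: key of the 10-letter window starting at pos.
// Assumes pos + 10 <= s.size().
inline std::uint32_t windowKey(const std::string& s, std::size_t pos) {
    std::uint32_t key = 0;
    for (std::size_t i = pos; i < pos + 10; ++i)
        key = rollKey(key, s[i]);
    return key;
}
```

Because the mask drops everything above bit 19, feeding an 11th letter into rollKey yields the key of the window that starts one position later, which is what lets the solution slide along the string in constant time per step.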


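For the case the code above overlooks, someone proposed a further improvement (I did not fully understand it myself):

A simple solution is to allow each hashMap entry only three states: 0 for none, 1 for seen once, 3 for seen multiple times.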

vector<string> findRepeatedDnaSequences(string s) {
    unsigned char flag[262144] = {0};  // 2 state bits per key, 4 keys per byte
    vector<string> result;
    int len, DNA = 0, i;
    if ((len = s.length()) < 11) return result;
    for (i = 0; i < 9; i++)
        DNA = DNA << 2 | (s[i] - 'A' + 1) % 5;
    for (i = 9; i < len; i++)
    {
        DNA = (DNA << 2 & 0xFFFFF) | (s[i] - 'A' + 1) % 5;
        if (!(flag[DNA >> 2] & (1 << ((DNA & 3) << 1))))        // state 0: first occurrence
            flag[DNA >> 2] |= (1 << ((DNA & 3) << 1));
        else if (!(flag[DNA >> 2] & (2 << ((DNA & 3) << 1))))   // state 1: second occurrence
        {
            result.push_back(s.substr(i - 9, 10));
            flag[DNA >> 2] |= (2 << ((DNA & 3) << 1));           // state 3: already reported
        }
    }
    return result;
}
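
As far as I can tell, the trick is that each 20-bit key owns two bits inside a byte: the low bit means "seen at least once" and the high bit means "already reported", so nothing is ever incremented and nothing can overflow. The sketch below spells that out with named variables. The function name findRepeatedPacked and the heap-allocated std::vector are my own assumptions for illustration, not part of the quoted code, and the input is assumed to contain only A, C, G and T.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical rewrite of the three-state idea with named intermediate values.
std::vector<std::string> findRepeatedPacked(const std::string& s) {
    std::vector<std::string> result;
    if (s.size() < 11) return result;  // a repeat needs at least two windows

    // 4^10 keys, 2 state bits each, 4 keys per byte -> 262144 bytes.
    // Allocated on the heap here (an assumption of this sketch).
    std::vector<std::uint8_t> flag(262144, 0);

    std::uint32_t key = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        key = ((key << 2) | static_cast<std::uint32_t>((s[i] - 'A' + 1) % 5)) & 0xFFFFFu;
        if (i < 9) continue;  // window not full yet

        std::uint8_t& cell = flag[key >> 2];     // byte holding this key's state
        const unsigned shift = (key & 3u) << 1;  // bit offset 0, 2, 4 or 6
        const std::uint8_t seenBit = static_cast<std::uint8_t>(1u << shift);
        const std::uint8_t reportedBit = static_cast<std::uint8_t>(2u << shift);

        if (!(cell & seenBit)) {
            cell |= seenBit;                     // state 0 -> 1: first occurrence
        } else if (!(cell & reportedBit)) {
            cell |= reportedBit;                 // state 1 -> 3: second occurrence
            result.push_back(s.substr(i - 9, 10));
        }                                        // state 3: nothing left to do
    }
    return result;
}
```

A string of 300 'A' characters contains 291 identical windows, enough to wrap a char counter, yet this version reports "AAAAAAAAAA" only once.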


3. Another expert's answer

The main idea is to store each substring as an int in a map to stay within the memory limit.

There are only four possible characters, A, C, G, and T, but I want to use 3 bits per letter instead of 2.

Why? It's easier to code.

A is 0x41, C is 0x43, G is 0x47, T is 0x54. Still don't see it? Let me write it in octal.

A is 0101, C is 0103, G is 0107, T is 0124. The last octal digit is different for all four letters. That's all we need!

We can simply use s[i] & 7 to get the last octal digit, which is just the last 3 bits. It's much easier than a lookup table, a switch, or a bunch of if and else, right?

We don't really need to generate the substring from the int. While counting the number of occurrences, we can push the substring into result as soon as the count becomes 2, so there won't be any duplicates in the result.


vector<string> findRepeatedDnaSequences(string s) {
    unordered_map<int, int> m;
    vector<string> r;
    int t = 0, i = 0, ss = s.size();
    if (ss < 10) return r;  // guard added: the priming loop below reads s[0..8]
    while (i < 9)
        t = t << 3 | s[i++] & 7;
    while (i < ss)
        if (m[t = t << 3 & 0x3FFFFFFF | s[i++] & 7]++ == 1)
            r.push_back(s.substr(i - 10, 10));
    return r;
}

Update:

I realised that I can use s[i] >> 1 & 3 to get 2 bits, but then I won't be able to remove the first loop as 1337c0d3r suggested.
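
Here is a rough sketch of what that 2-bit variant might look like. With (c >> 1) & 3 the codes come out as 'A'->0, 'C'->1, 'G'->3, 'T'->2, and because 'A' maps to 0, a partially filled key cannot be told apart from a full window that starts with A's, which is presumably why the priming loop has to stay. The names code2 and findRepeatedTwoBit are hypothetical, and the input is assumed to contain only A, C, G and T.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: 2-bit code from bits 1 and 2 of the ASCII value.
inline unsigned code2(char c) {
    return static_cast<unsigned>((c >> 1) & 3);
}

// Hypothetical 2-bit version of the solution above; keeps the priming loop.
std::vector<std::string> findRepeatedTwoBit(const std::string& s) {
    std::vector<std::string> r;
    if (s.size() < 10) return r;       // no full window at all

    std::unordered_map<unsigned, int> m;
    unsigned t = 0;
    std::size_t i = 0;
    while (i < 9)                      // prime the key with the first 9 letters
        t = (t << 2) | code2(s[i++]);
    while (i < s.size()) {
        t = ((t << 2) & 0xFFFFFu) | code2(s[i++]);  // keep 10 letters x 2 bits
        if (m[t]++ == 1)               // second occurrence: report once
            r.push_back(s.substr(i - 10, 10));
    }
    return r;
}
```

The keys now fit in 20 bits instead of 30, at the cost of the extra loop.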


4. Simplifying the code from 3

Another observation is the mapped value need not be an integer counter, and could simply be a boolean to further save space. This requires some extra logic though:

vector<string> findRepeatedDnaSequences(string s) {
    unordered_map<int, bool> m;
    vector<string> r;
    for (int t = 0, i = 0; i < s.size(); i++) {
        t = t << 3 & 0x3FFFFFFF | s[i] & 7;
        if (m.find(t) != m.end()) {
            if (m[t]) {
                r.push_back(s.substr(i - 9, 10));
                m[t] = false;
            }
        } else {
            m[t] = true;
        }
    }
    return r;
}
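
The loop above starts at i = 0 with no priming loop, so keys built from fewer than 10 letters also land in the map. My understanding of why this is safe: every 3-bit code (s[i] & 7) is nonzero, so a key built from fewer than 10 letters still has a zero group at the top and can never equal the key of a full window. The small sketch below checks that property; code3, keyAfter and hasTenNonzeroGroups are hypothetical helper names, and unsigned arithmetic is used here only to keep the shifts well defined.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// 3-bit code used by the solution above: 'A'->1, 'C'->3, 'G'->7, 'T'->4.
inline std::uint32_t code3(char c) {
    return static_cast<std::uint32_t>(c & 7);
}

// Hypothetical helper: key after feeding the first n letters of s,
// using the same update rule as the loop above.
inline std::uint32_t keyAfter(const std::string& s, std::size_t n) {
    std::uint32_t t = 0;
    for (std::size_t i = 0; i < n && i < s.size(); ++i)
        t = ((t << 3) & 0x3FFFFFFFu) | code3(s[i]);
    return t;
}

// Hypothetical helper: true when all ten 3-bit groups of the key are nonzero,
// which only happens once a full 10-letter window has been fed in.
inline bool hasTenNonzeroGroups(std::uint32_t key) {
    for (int g = 0; g < 10; ++g)
        if (((key >> (3 * g)) & 7u) == 0) return false;
    return true;
}
```

For any valid DNA string, hasTenNonzeroGroups(keyAfter(s, n)) is false for n below 10 and true from n = 10 onward, so the s.substr(i - 9, 10) call in the loop is only reached when i is at least 9.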


