[LeetCode] Repeated DNA Sequences

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].

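Basic idea:

1) Walk through the string and insert every substring of length 10 into a hash table, counting its occurrences.

2) Walk through the hash table and output every substring whose count is greater than 1.

This gives the following algorithm: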

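    #include <map>
    #include <string>
    #include <vector>
    using namespace std;

    vector<string> findRepeatedDnaSequences(string s) {
        vector<string> result;
        map<string, int> tbl;  // substring -> number of occurrences
        // Use i + 10 <= s.size() so that the last window is counted too.
        for (size_t i = 0; i + 10 <= s.size(); i++) {
            tbl[s.substr(i, 10)]++;
        }
        for (auto iter = tbl.begin(); iter != tbl.end(); iter++) {
            if (iter->second > 1)
                result.push_back(iter->first);
        }
        return result;
    }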

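LeetCode reports Memory Limit Exceeded. For a very long input there are a great many distinct substrings, and each one is stored as its own 10-character string key, so the table consumes a lot of memory.

In other words, using the substrings themselves as keys costs too much memory.

Since only the four characters 'A', 'C', 'G', 'T' occur, we can represent them as 0b00, 0b01, 0b10, 0b11.

A 10-character substring then needs only 20 bits, so it fits in a single int. The algorithm is the same as above, except that we need a conversion between a 10-character string and its integer bit representation.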
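As a small worked example of this encoding, the sketch below packs a short sequence into an integer two bits at a time. The helper names `nucleotideCode` and `packKmer` are hypothetical, and two choices are assumptions of this sketch: the first character goes into the lowest two bits, and any character other than C, G, T is treated as A.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Map one nucleotide to its 2-bit code: A=00, C=01, G=10, T=11.
// Any other character is treated as 'A' (an assumption of this sketch).
inline std::uint32_t nucleotideCode(char c) {
    switch (c) {
        case 'C': return 0b01;
        case 'G': return 0b10;
        case 'T': return 0b11;
        default:  return 0b00;  // 'A', and by assumption anything unexpected
    }
}

// Pack a short sequence into an integer, first character in the lowest two bits.
// packKmer is a hypothetical helper; at most 16 characters fit in 32 bits.
inline std::uint32_t packKmer(const std::string& kmer) {
    assert(kmer.size() <= 16);
    std::uint32_t value = 0;
    // Walk from the last character to the first so that kmer[0] ends up lowest.
    for (std::size_t i = kmer.size(); i-- > 0;) {
        value = (value << 2) | nucleotideCode(kmer[i]);
    }
    return value;
}
```

With this bit order, `packKmer("ACGT")` is 0b11100100 = 228, and `packKmer("AAAAACCCCC")` is 349184. Every 10-letter sequence maps to a distinct value below 2^20 = 1048576, which is why an int key is enough.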

    #include <string>
    #include <unordered_map>
    #include <vector>
    using namespace std;

    class Solution {
    public:
        /* Algorithm: hashing
           A naive search takes each 10-letter substring and looks for it again
           in the string, which needs many comparisons.
           Since there are only A, C, G, T, each letter fits in 2 bits,
           so a 10-letter string can be represented by one int32 number.
           We only need to hash each substring to a number and count it in a map;
           all map elements with value > 1 are output candidates.
           A: 00, C: 01, G: 10, T: 11
        */
        // Decode a 20-bit key back into its 10-letter sequence (lowest bits first).
        string int2seq(int num) {
            string s;
            char tbl[4] = {'A', 'C', 'G', 'T'};  // 0b00, 0b01, 0b10, 0b11
            for (int i = 0; i < 10; i++) {
                s.append(1, tbl[num & 0x3]);
                num >>= 2;
            }
            return s;
        }
        // Encode s[start .. start+9]; s[start] ends up in the lowest two bits.
        int seq2int(const string &s, int start) {
            // Built once instead of on every call; at() throws on an invalid letter.
            static const unordered_map<char, int> table = {
                {'A', 0b00}, {'C', 0b01}, {'G', 0b10}, {'T', 0b11}
            };
            int val = 0;
            for (int i = 9; i >= 0; i--) {
                val <<= 2;
                val |= table.at(s[start + i]);
            }
            return val;
        }
        vector<string> findRepeatedDnaSequences(string s) {
            vector<string> result;
            unordered_map<int, int> table;  // 20-bit key -> number of occurrences
            for (size_t i = 0; i + 10 <= s.size(); i++) {
                table[seq2int(s, i)]++;
            }
            for (auto it = table.begin(); it != table.end(); it++) {
                if (it->second > 1)
                    result.push_back(int2seq(it->first));
            }
            return result;
        }
    };
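One possible refinement, sketched here as a variant and not as the solution above, is to avoid re-encoding all 10 characters for every window. The key of each window can be derived from the previous one by shifting in the new character and masking off the oldest. The name `findRepeatedRolling` and the flat 2^20-entry counter array are assumptions of this sketch. Unlike `seq2int`, it places the newest character in the lowest bits, which is harmless because the reported text is copied straight from `s` and never decoded from the key.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Sliding-window sketch: each 20-bit key is updated in O(1) from the previous one.
// findRepeatedRolling is a hypothetical name, not part of the original solution.
std::vector<std::string> findRepeatedRolling(const std::string& s) {
    constexpr std::size_t kLen = 10;
    constexpr std::uint32_t kMask = (1u << (2 * kLen)) - 1;  // keep the low 20 bits

    std::vector<std::string> result;
    if (s.size() < kLen) return result;

    auto code = [](char c) -> std::uint32_t {
        switch (c) {
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:  return 0;  // 'A'; other characters are assumed not to occur
        }
    };

    // One small counter per possible key: 2^20 bytes, about 1 MiB.
    std::vector<std::uint8_t> seen(static_cast<std::size_t>(kMask) + 1);
    std::uint32_t key = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Shift in the new character on the right; the mask drops the oldest one.
        key = ((key << 2) | code(s[i])) & kMask;
        if (i + 1 < kLen) continue;  // window not full yet
        std::uint8_t& count = seen[key];
        if (count == 1) {
            // Second occurrence: report once, copying the text straight from s.
            result.push_back(s.substr(i + 1 - kLen, kLen));
        }
        if (count < 2) ++count;  // saturate at 2 so the counter cannot overflow
    }
    return result;
}
```

For the example input "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT" this returns "AAAAACCCCC" followed by "CCCCCAAAAA", in the order in which the second occurrences are found. The fixed-size array trades a constant 1 MiB for the per-entry overhead of a hash map.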


 


