Leetcode: Repeated DNA Sequence

denisewu

Category: Algorithms

Article link: https://blog.csdn.net/denisewu/article/details/44277209

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].

Approach:

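Storing every 10-character substring as a string in an unordered_set exceeds the memory limit. Because the alphabet is small and fixed, each character can be represented with a few bits instead. There are only four characters, so two bits per character are enough, and any 10-character substring fits in a 20-bit integer; in a 32-bit integer the remaining 12 high bits are all 0. Storing these integers in the unordered_set instead of the strings greatly reduces memory usage. In addition, a substring that occurs more than twice must still appear only once in the result, so a second unordered_set records the substrings that have already been added to the result array and prevents duplicates.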
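To make the encoding concrete, here is a minimal sketch of packing a 10-letter window into a 20-bit integer and then sliding it forward by one character. The helper names `nucleotideCode`, `encodeWindow` and `slideWindow` are made up for this illustration; they are not part of the submitted solution.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: 2-bit code for a nucleotide (A=0, C=1, G=2, T=3).
inline std::uint32_t nucleotideCode(char c) {
    switch (c) {
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return 0;  // 'A'
    }
}

// Hypothetical helper: pack exactly 10 letters into the low 20 bits.
inline std::uint32_t encodeWindow(const std::string& window) {
    assert(window.size() == 10);
    std::uint32_t code = 0;
    for (char c : window)
        code = (code << 2) | nucleotideCode(c);
    return code;
}

// Hypothetical helper: drop the oldest letter (the top 2 of the 20 bits)
// and append `next` at the low end.
inline std::uint32_t slideWindow(std::uint32_t code, char next) {
    return ((code & 0x3ffffu) << 2) | nucleotideCode(next);
}
```

Sliding the code for "AAAAACCCCC" with 'A' gives the same value as encoding "AAAACCCCCA" from scratch, which is why each new window costs only a mask, a shift and an OR.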

Code:

#include <string>
#include <unordered_set>
#include <vector>
using namespace std;

class Solution {
    // Map a nucleotide to its 2-bit code: A=0, C=1, G=2, T=3.
    int char2int(char c)
    {
        switch(c)
        {
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:  return 0;   // 'A'
        }
    }
    // Keeps the low 18 bits (9 characters), dropping the oldest character of the window.
    unsigned eraser = 0x3ffff;
public:
    vector<string> findRepeatedDnaSequences(string s) {
        int n = s.length();
        vector<string> ret;
        if(n < 10)
            return ret;

        unordered_set<unsigned> wordset;    // codes of windows seen at least once
        unordered_set<unsigned> resultset;  // codes of windows already reported
        unsigned num = 0;
        // Encode the first 9 characters; the loop below appends the 10th.
        for(int j = 0; j < 9; j++)
            num = (num << 2) | char2int(s[j]);
        for(int i = 9; i < n; i++)
        {
            // Slide the window: drop the oldest character, append s[i].
            num = ((num & eraser) << 2) | char2int(s[i]);
            // Report a window the first time it is seen again.
            if(!wordset.insert(num).second && resultset.insert(num).second)
                ret.push_back(s.substr(i - 9, 10));
        }
        return ret;
    }
};

Improvement:

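The code above runs in about 105 ms on average. Someone suggested dropping the STL containers and using plain bool arrays, indexed by the 20-bit code, to mark the strings already found; this makes the algorithm much faster. (The suggestion mentioned an array of length pow(2, 21); the code below uses two arrays of 1 << 20 entries each, one entry per possible 20-bit code.) One caveat: these arrays are large, and as local variables they can exceed the stack size configured for the program. The program then crashes with a segmentation fault (a stack overflow) as soon as the function is entered, because the whole stack frame, including both arrays, is reserved on entry and exceeds the limit. The code below crashed this way when I ran it under VS 2010, where the default stack size is 1 MB, but it runs fine on LeetCode and is accepted. Its average running time on LeetCode is 16 ms, a large improvement.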

#include <cstring>
#include <string>
#include <vector>
using namespace std;

class Solution {
    // Map a nucleotide to its 2-bit code: A=0, C=1, G=2, T=3.
    int char2int(char c)
    {
        switch(c)
        {
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:  return 0;   // 'A'
        }
    }
    // Keeps the low 18 bits (9 characters), dropping the oldest character of the window.
    unsigned eraser = 0x3ffff;
public:
    vector<string> findRepeatedDnaSequences(string s) {
        int n = s.length();
        vector<string> ret;
        if(n < 10)
            return ret;

        // One flag per possible 20-bit code. Both arrays are locals, so
        // together they need roughly 2 MB of stack.
        bool wordset[1 << 20];      // windows seen at least once
        bool resultset[1 << 20];    // windows already reported
        memset(wordset, 0, sizeof(wordset));
        memset(resultset, 0, sizeof(resultset));
        unsigned num = 0;
        // Encode the first 9 characters; the loop below appends the 10th.
        for(int j = 0; j < 9; j++)
            num = (num << 2) | char2int(s[j]);
        for(int i = 9; i < n; i++)
        {
            // Slide the window: drop the oldest character, append s[i].
            num = ((num & eraser) << 2) | char2int(s[i]);
            if(!wordset[num])
                wordset[num] = true;
            else if(!resultset[num])
            {
                // Seen before and not yet reported.
                ret.push_back(s.substr(i - 9, 10));
                resultset[num] = true;
            }
        }
        return ret;
    }
};
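If the stack limit is a problem, as it was under VS 2010, one option is to keep the same table lookup but move the two tables to the heap. The following is a sketch of that variation, not the code that was submitted. The free function name `findRepeatedDnaHeap` is made up here, and using `std::vector<char>` for the flags is my own assumption; `std::vector<bool>` or a heap-allocated `std::bitset` would likely work as well.

```cpp
#include <string>
#include <vector>

// Hypothetical variant: same algorithm, but the flag tables live on the heap.
std::vector<std::string> findRepeatedDnaHeap(const std::string& s) {
    std::vector<std::string> ret;
    const int n = static_cast<int>(s.size());
    if (n < 10) return ret;

    // 2-bit code for a nucleotide (A=0, C=1, G=2, T=3).
    auto code = [](char c) -> unsigned {
        switch (c) {
            case 'C': return 1u;
            case 'G': return 2u;
            case 'T': return 3u;
            default:  return 0u;  // 'A'
        }
    };

    // One flag per possible 20-bit code; std::vector stores its elements on the heap.
    std::vector<char> seen(1u << 20, 0);
    std::vector<char> reported(1u << 20, 0);

    unsigned num = 0;
    // Encode the first 9 characters; the loop below appends the 10th.
    for (int j = 0; j < 9; j++)
        num = (num << 2) | code(s[j]);
    for (int i = 9; i < n; i++) {
        // Slide the window: keep the low 18 bits, then append s[i].
        num = ((num & 0x3ffffu) << 2) | code(s[i]);
        if (!seen[num]) {
            seen[num] = 1;
        } else if (!reported[num]) {
            reported[num] = 1;
            ret.push_back(s.substr(i - 9, 10));
        }
    }
    return ret;
}
```

Calling `findRepeatedDnaHeap("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT")` returns the two sequences from the problem statement. The function's own stack frame stays small regardless of the table size, at the cost of two heap allocations per call.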
