[leetcode] 187. Repeated DNA Sequences

TstsUgeg

Category: leetcode. Tags: leetcode, Bit Manipulation, Hash Table

Article link: https://blog.csdn.net/TstsUgeg/article/details/50736013

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].

This problem asks for every length-10 sequence that occurs more than once in a DNA string. Its difficulty is Medium.

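Comparing substrings directly as strings is relatively slow. I did not try it, so I do not know whether it would pass; interested readers can try it themselves.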
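For readers who want to try the direct approach, here is a minimal sketch. It assumes that "string comparison" means using each 10-character substring itself as the hash key; the function name findRepeatedBruteForce is hypothetical and does not come from the original post. Whether this is fast enough for the judge is not verified here.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not from the original post): counts every
// 10-character substring directly, using the substring itself as the key.
std::vector<std::string> findRepeatedBruteForce(const std::string& s) {
    std::vector<std::string> result;
    std::unordered_map<std::string, int> count;
    for (std::size_t i = 0; i + 10 <= s.size(); ++i) {
        std::string window = s.substr(i, 10);
        // Report a window exactly once: when it is seen for the second time.
        if (++count[window] == 2) {
            result.push_back(window);
        }
    }
    return result;
}
```

Each window costs a 10-character copy plus a string hash, which is the overhead the 2-bit encoding avoids.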

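Each character position has only four possible states ('A', 'C', 'G', 'T'), so two bits are enough to represent one character, and a 20-bit number can represent a sequence of length 10. To check for repeats, a hash table is the natural choice. unordered_set is not used here because some sequences repeat more than twice and must not be added to the result multiple times. Instead, an unordered_map keeps a count, and a sequence is added to the result when it appears for the second time. The code: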

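#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        vector<string> rst;
        int curStr = 0;                // rolling 20-bit encoding of the last 10 characters
        unordered_map<int, int> hash;  // encoded window -> occurrence count
        for (int i = 0; i < (int)s.size(); ++i) {
            // Shift in 2 bits for s[i]; the mask keeps only the low 20 bits.
            curStr = ((curStr << 2) & 0xfffff) | ((s[i] - 'A' + 1) % 5);
            if (i < 9) continue;  // the first full window ends at index 9
            // Add the window to the result only on its second occurrence.
            if (hash[curStr]++ == 1)
                rst.push_back(s.substr(i - 9, 10));
        }
        return rst;
    }
};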

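The encoding (s[i] - 'A' + 1) % 5 is something I picked up from reading other people's code. With ASCII character codes it maps 'A' to 1, 'C' to 3, 'G' to 2, and 'T' to 0, so every nucleotide gets a distinct 2-bit value. I could not think of a suitable encoding myself and used a switch at first; this trick is worth learning.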
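To make the encoding concrete, the sketch below puts the arithmetic form next to a switch-based equivalent. The function names encodeArithmetic, encodeSwitch, and packSequence are hypothetical, and the switch values are chosen here to match the arithmetic form; the post does not show the mapping the author's original switch used. An ASCII-compatible character set is assumed.

```cpp
#include <string>

// Arithmetic encoding from the post: maps each nucleotide to a 2-bit value.
// 'A' -> 1, 'C' -> 3, 'G' -> 2, 'T' -> 0 (assumes ASCII character codes).
int encodeArithmetic(char c) {
    return (c - 'A' + 1) % 5;
}

// Hypothetical switch-based equivalent, with values picked to match the
// arithmetic form above. Returns -1 for any character other than A, C, G, T.
int encodeSwitch(char c) {
    switch (c) {
        case 'A': return 1;
        case 'C': return 3;
        case 'G': return 2;
        case 'T': return 0;
        default:  return -1;
    }
}

// Packs a sequence into an integer, 2 bits per character, keeping only the
// low 20 bits (that is, the last 10 characters).
int packSequence(const std::string& seq) {
    int packed = 0;
    for (char c : seq) {
        packed = ((packed << 2) & 0xfffff) | encodeArithmetic(c);
    }
    return packed;
}
```

For example, packSequence("AAAAAAAAAA") yields 0x55555 (the bit pair 01 repeated ten times), and packSequence("TTTTTTTTTT") yields 0.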