Leetcode Repeated DNA Sequences_[leetcode] repeated dna sequences 求重复的dna序列-CSDN博客

本文链接：https://blog.csdn.net/coderhuhy/article/details/43647731

本文探讨LeetCode上的一个挑战：寻找DNA分子中长度为10且重复出现的子序列。通过采用散列表和位操作优化算法，实现高效地识别重复序列。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

Leetcode Repeated DNA Sequences

今天突然看到Leetcode上出了一个新题, 就立刻着手试试;

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",
Return:
["AAAAACCCCC", "CCCCCAAAAA"].

题意: 所有DNA都是由一系列碱基构成, 分别为ACGT, 题目要求找出所有长度为10的子串, 这些子串在原串中出现次数必须大于1次(重复出现);

思路历程:

-- 这是一个子串对自身匹配的问题;

-- 第一反应有3种方法: 其一, 使用hash(c++11的unordered_set)来存储所有长度为10的子串; 具体步骤是构造unordered_set<string> repeated, 遍历输入的原串, 对s[i]到s[i+9]的序列构成的子串, 如未出现在repeated中, 则存入repeated; 如出现在repeated中, 则说明该子串曾出现过, 符合题意要求, 将其存入答案vector<string> answer;

当然, 这个代码最后是memory limit exceeded; 先看代码, 后面分析;

class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
    	vector<string> answer;
        unordered_set<string> repeated;
        for (int i = 0; i + 9 < s.size(); ++i) {
        	string t(s, i, 10);
        	unordered_set<string>::iterator it = repeated.find(t);
        	if (it != repeated.end())
        		answer.push_back(t);//在repeated中查找成功, 说明曾出现过该子串, 则存入answer
        	else
        		repeated.insert(t);//repeated中查找失败, 说明未曾出现过, 则存入repeated
        }
        return answer;
    }
};//memory limit exceeded

分析:

时间复杂度是O(n), 其中n为原串长度;

MLE的可能原因: unordered_set<string>对于超长的输入串, 会消耗大量的存储空间; 另外, 上面代码并未考虑answer中的重复答案, 因为每次只要出现在repeated中就放入answer, 这显然有问题;

改进方法:

-- 由于碱基无非ACGT四种类型, 可以使用00 01 10 11四个状态代替ACGT四种碱基, 比如AAGCT就是00 00 10 01 11, 对任意一个长度为10的子串都可以转化使用20个位的int值hint; 这样就可将unordered_set<string> repeated转变为unordered_set<int> repeated, 一定程度上减少了所需的存储空间;

-- 对于如何去重, 其一可以先收集所有答案, 再sort, unique去重, 当然这样很慢也很麻烦; 其二, 可以再构造一个unordered_set<int> check, 用于存储已经存入answer中的重复子串对应的hint值;

-- 值得一提的是, 每次从s[i]->s[i+9]变为s[i+1]->s[i+10], 使用了这样一个方法:

hint = ((hint & eraser) << 2) + ati[s[i]];

其中eraser是一个宏定义, 值为0x3ffff, 二进制是00111111111111111111, 用于擦除hint中的最高2位s[i]碱基对应的值, 再左移2, 最后加上s[i+10]的碱基对应的值;

代码如下:

class Solution {
public:
	#define eraser 0x3ffff
	vector<string> findRepeatedDnaSequences(string s) {
	    vector<string> answer;
	    int hint = 0;//存储长度为10的子串翻译后的int值
	    if (s.size() < 10)
	    	return answer;
	    unordered_set<unsigned int> repeated, check;//repeated存储已出现的子串, check用于防止重复答案
	    unordered_map<char, unsigned int> ati{{'A', 0}, {'C', 1}, {'G', 2}, {'T', 3}};//此处ati是存储各碱基对应的值00 01 10 11(c++11新语法)
	    for (int i = 0; i != 10; ++i) {
	    	hint = (hint << 2) + ati[s[i]];//用s的前10个碱基构造初始hint值
	    }
	    repeated.insert(hint);
	    for (int i = 10; i != s.size(); ++i) {
	    	hint = ((hint & eraser) << 2) + ati[s[i]]; //子串变化对应hint值变化
	    	unordered_set<unsigned int>::iterator it = repeated.find(hint);
	    	if (it != repeated.end()) {
	    		it = check.find(hint);
	    		if (it == check.end()) {
	    			answer.push_back(string(s, i - 9, 10));
	    			check.insert(hint);
	    		}
	    	}
	    	else
	    		repeated.insert(hint);
	    }
	    return answer;
	}
};

分析:

一开始由于忽略了移位与其他运算符的优先级关系, 一直出问题, 后来才发现

hint = ((hint & eraser) << 2) + ati[s[i]];

里面的外面那层括号没加上, 浪费了不少时间, 确实不应该

最后:

-- 在代码中用到了unordered_set和unordered_map等散列容器, 不知道是否有O(n)算法可以不适用这些方法, 如果有请大家告诉我;

-- 一开始其实还有一种想法, 就是枚举每个长度为10的子串, 使用KMP算法, 时间复杂度应该是O(n2), 但占用的存储空间为O(1), 不知道效果会怎么样, 如果有朋友试过, 也请与我讨论一下~