LeetCode 187. Repeated DNA Sequences: counting repeated substrings with an integer encoding + sliding window

Article link: https://blog.csdn.net/JackZhang_123/article/details/78037187

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = “AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT”,

Return:
[“AAAAACCCCC”, “CCCCCAAAAA”].

This problem is about finding substrings that occur more than once, and I learned a lot from it.

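The most direct approach is brute force, but it times out. A HashMap also works (and is accepted as is). I also found an encoding-based approach online that I think is really nice, so let's go straight to the code.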

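The thing to watch out for in this problem is operator precedence: always parenthesize the shift and bitwise operations, because addition binds more tightly than they do. Without the parentheses, hash<<2 + x is parsed as hash<<(2+x), which is wrong.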
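To make the precedence pitfall concrete, here is a minimal sketch comparing the two forms. The class and method names (PrecedenceDemo, withParens, withoutParens) are made up for illustration and do not appear in the solution.

```java
// Hypothetical demo class; the names are illustrative only.
class PrecedenceDemo {
    // Intended update: make room for 2 bits, then add the new letter's code.
    static int withParens(int hash, int code) {
        return (hash << 2) + code;
    }

    // Without parentheses, + binds tighter than <<,
    // so this is really hash << (2 + code).
    static int withoutParens(int hash, int code) {
        return hash << 2 + code;
    }

    public static void main(String[] args) {
        System.out.println(withParens(1, 3));    // 7  = (1 << 2) + 3
        System.out.println(withoutParens(1, 3)); // 32 = 1 << (2 + 3)
    }
}
```

The same rule applies to the mask: (1<<20)-1 needs its parentheses, because 1<<20-1 would be parsed as 1<<19.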

The code is as follows:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

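/*
 * http://blog.csdn.net/xudli/article/details/43666725
 * 
 * Since there are only 4 letters, we can build our own hash key:
 * 
 * every 2 bits correspond to one incoming character. Once the key grows past
 * 20 bits (i.e. 10 characters), keep only the low 20 bits.
 * 
 * (hash<<2) + map.get(c): mind operator precedence, the << must be parenthesized.
 * 
 * */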
public class Solution 
{
    /*
     * The encoding-based approach found online. A very nice idea that is
     * worth studying, although it is a bit more involved.
     * */
    public List<String> findRepeatedDnaSequencesByCode(String s) 
    {
        List<String> res=new ArrayList<>();
        if(s==null || s.length()<=10)
            return res;

        Map<Character, Integer> map=new HashMap<Character, Integer>();
        map.put('A', 0);
        map.put('C', 1);
        map.put('G', 2);
        map.put('T', 3);

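        /*
         * set holds the code of every 10-letter window seen so far;
         * uniqueSet holds the codes already identified as repeated.
         * uniqueSet prevents the same repeated sequence from being added twice.
         * */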
        Set<Integer> set=new HashSet<>();
        Set<Integer> uniqueSet=new HashSet<>();
        int hash=0;
        for(int i=0;i<s.length();i++)
        {
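            char one=s.charAt(i);
            // Shift in the 2-bit code of the current letter. The mask keeps only
            // the last 10 letters (20 bits); it has no effect while i<9.
            hash = ((hash<<2) + map.get(one)) & ((1<<20)-1);
            if(i<9)
                continue;   // the first full window ends at index 9

            // Set.add returns false if the element was already present, so this
            // fires exactly once per code: on its second occurrence.
            if(!set.add(hash) && uniqueSet.add(hash))
                res.add(s.substring(i-9,i+1));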
        }
        return res;
    }

    /*
     * HashMap-based approach. Also a good one, and it is accepted as is.
     * 
     * */
    public List<String> findRepeatedDnaSequences(String s) 
    {
        List<String> res=new ArrayList<>();
        if(s==null || s.length()<=10)
            return res;

        Map<String, Integer> map=new HashMap<>();
        for(int i=10;i<=s.length();i++)
        {
            String key=s.substring(i-10, i);
            map.put(key, map.getOrDefault(key, 0)+1);
        }

        Set<String> set=map.keySet();
        for(String key : set)
        {
            if(map.get(key)>=2)
                res.add(key);
        }
        return res;
    }


    /*
     * Brute force: compare every window against all later windows.
     * This will certainly time out.
     * */
    public List<String> findRepeatedDnaSequencesByLoop(String s) 
    {
        List<String> res=new ArrayList<>();
        if(s==null || s.length()<=10)
            return res;

        for(int i=0;i<s.length();i++)
        {
            if(i+10<=s.length())
            {
                String one=s.substring(i,i+10);
                for(int j=i+1;j<s.length();j++)
                {
                    if(j+10<=s.length())
                    {
                        if(one.equals(s.substring(j, j+10)))
                        {
                            if(!res.contains(one))
                                res.add(one);
                            break;
                        }
                    }else
                        break;
                }
            }
            else
                break;
        }
        return res;
    }
}
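
The 2-bit encoding used by findRepeatedDnaSequencesByCode can be checked in isolation. The sketch below is only an illustration under assumptions: the class DnaCodec and its methods encode, roll and decode are hypothetical helpers that are not part of the solution above. It packs a 10-letter sequence into 20 bits with the same letter codes (A=0, C=1, G=2, T=3), slides the window by one letter, and unpacks a code back into its string.

```java
// Hypothetical helper class, not part of the original solution.
class DnaCodec {
    // The index of a letter in this string is its 2-bit code: A=0, C=1, G=2, T=3.
    private static final String LETTERS = "ACGT";
    private static final int MASK = (1 << 20) - 1; // low 20 bits = 10 letters

    private static int bitsOf(char c) {
        int bits = LETTERS.indexOf(c);
        if (bits < 0) {
            throw new IllegalArgumentException("not a nucleotide: " + c);
        }
        return bits;
    }

    // Pack a 10-letter sequence into an int, 2 bits per letter.
    static int encode(String seq) {
        int code = 0;
        for (int i = 0; i < seq.length(); i++) {
            code = (code << 2) | bitsOf(seq.charAt(i));
        }
        return code & MASK;
    }

    // Slide the window one letter to the right: shift the new letter in,
    // then mask so the oldest letter falls off the top.
    static int roll(int code, char next) {
        return ((code << 2) | bitsOf(next)) & MASK;
    }

    // Unpack a 20-bit code back into its 10-letter sequence.
    static String decode(int code) {
        char[] out = new char[10];
        for (int i = 9; i >= 0; i--) {
            out[i] = LETTERS.charAt(code & 3);
            code >>= 2;
        }
        return new String(out);
    }

    public static void main(String[] args) {
        int code = encode("AAAAACCCCC");
        System.out.println(code);                    // 341
        System.out.println(decode(code));            // AAAAACCCCC
        System.out.println(decode(roll(code, 'A'))); // AAAACCCCCA
    }
}
```

Because each window maps to a distinct integer below 2^20, comparing windows becomes integer comparison, and the sets store Integer values instead of 10-character strings.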

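Below is the C++ version. The most direct approach is to look substrings up in a map, but that may time out. The version here uses the encoding-based approach I found online, which is a very nice technique and worth learning.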

The code is as follows:

#include <iostream>
#include <algorithm>
#include <climits>
#include <vector>
#include <stack>
#include <queue>
#include <map>
#include <set>
#include <string>
#include <unordered_map>

using namespace std;

class Solution 
{
public:
    vector<string> findRepeatedDnaSequences(string s) 
    {
        vector<string> res;
        map<char, int> mmp;
        mmp['A'] = 0;
        mmp['C'] = 1;
        mmp['G'] = 2;
        mmp['T'] = 3;
        set<int> st;
        set<int> uniqueSt;
        int hash = 0;
        for (int i = 0; i < s.length(); i++)
        {
            if (i < 9)
                hash = (hash << 2) + mmp[s[i]];
            else
            {
                hash = (hash << 2) + mmp[s[i]];
                hash = hash & ((1 << 20) - 1);
                if (st.find(hash) == st.end())
                    st.insert(hash);
                else if (uniqueSt.find(hash) == uniqueSt.end())
                {
                    uniqueSt.insert(hash);
                    res.push_back(s.substr(i-9,10));
                }
            }
        }
        return res;
    }
};