String Matching: the Rabin-Karp Algorithm

Latest recommended article published 2023-12-01 11:13:37

__anonymous_

Category: Algorithms

Article link: https://blog.csdn.net/k909397116/article/details/107829758

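Problem:

Given an arbitrary string str ("123abc123abc00abc")
and a keyword key ("abc"),
return the starting index of every occurrence of key in str. For this example the answer is 3, 9 and 14.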
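
As a reference point for the expected output, here is a minimal sketch built on the standard `String.indexOf(String, int)`. The class name `ReferenceMatch` and the method `findAll` are illustrative names, not part of the original post, and returning an empty list for an empty key is an assumption made here.

```java
import java.util.ArrayList;
import java.util.List;

final class ReferenceMatch {
    // Collects every index at which key occurs in str, overlapping occurrences included.
    static List<Integer> findAll(String str, String key) {
        List<Integer> result = new ArrayList<>();
        if (key.isEmpty()) {
            return result; // assumption: an empty key is treated as "no match"
        }
        int from = str.indexOf(key);
        while (from >= 0) {
            result.add(from);
            // Restart one position later so overlapping occurrences are also found.
            from = str.indexOf(key, from + 1);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(findAll("123abc123abc00abc", "abc")); // prints [3, 9, 14]
    }
}
```

Any hand-written matcher for this problem can be checked against this reference on the same inputs.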

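Solution 1 (brute force)

Idea:

Slide a window whose length equals that of key across str.
If the substring str[startIndex, endIndex] equals key, record a match.

Complexity: sliding the window over a string of length n takes O(n) steps, and each step compares a substring of length m with key in O(m), so brute force is O(n*m) overall.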

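    // Requires: import java.util.ArrayList;
    public static ArrayList<Integer> match(String str, String key) {
        ArrayList<Integer> list = new ArrayList<>();

        for (int startIndex = 0; startIndex < str.length(); startIndex++) {
            int endIndex = startIndex + key.length() - 1; // inclusive end of the window
            if (endIndex > str.length() - 1) break;       // window would run past the end of str
            // Compare the window with key character by character, O(m) per window
            if (str.substring(startIndex, endIndex + 1).equals(key)) {
                list.add(startIndex);
            }
        }
        return list;
    }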

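Solution 2 (Rabin-Karp algorithm: hashing)

Idea:

Slide a window whose length equals that of key across str.
Compute the hash of key with a hash function.
If the hash of the substring str[startIndex, endIndex] equals the hash of key, treat the window as a match. Strictly speaking, equal hashes only make the window a candidate, because different strings can share a hash (see the note on collisions below), so a direct comparison is needed to confirm it.

Complexity: sliding the window over a string of length n takes O(n) steps. Comparing the hash of a length-m substring with the hash of key is O(1), but computing the hash of each length-m substring from scratch is O(m), so this version is still O(n*m) overall.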

    private static final int seed = 31;

    public static ArrayList<Integer> match2(String str, String key) {
        ArrayList<Integer> list = new ArrayList<>();
        long hashOfKey = hash(key);

        for (int startIndex = 0; startIndex < str.length(); startIndex++) {
            int endIndex = startIndex + key.length() - 1; // inclusive end of the window
            if (endIndex > str.length() - 1) break;       // window would run past the end of str
            String window = str.substring(startIndex, endIndex + 1);
            // Equal hashes only mark a candidate; equals() rules out a hash collision
            if (hash(window) == hashOfKey && window.equals(key)) list.add(startIndex);
        }
        return list;
    }

    // Polynomial hash: hash = hash * seed + c for each character c.
    // The accumulator is a long, so overflow wraps modulo 2^64.
    public static long hash(String str) {
        long hash = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = hash * seed + str.charAt(i);
        }
        return hash;
    }

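Purpose, principle, and collisions of the hash function:

Purpose: a hash value acts as a compact fingerprint of a string. Equal strings always have equal hashes, and different strings have different hashes in the vast majority of cases, though not always (see hash collisions below).
Principle: the hash of the string "abc" is $seed^2 * a + seed^1 * b + seed^0 * c$, where a, b and c are the character codes, with the result reduced to the range of the fixed-width integer type used in the code. It can be read as a recurrence: $a_{n+1} = seed * a_n + str(n)$ (n = 0, 1, 2, 3, …), with $a_0 = 0$.
Hash collisions: a collision means that two different strings end up with the same hash value under this hash function. It is an error that occurs with some probability; the author reports that hashing 1 million strings with this function produces roughly 110 collisions.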
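
To make the formula and the collision risk concrete, the sketch below evaluates the same polynomial hash (seed 31, 64-bit accumulator) on a few short strings. The class name `HashDemo` is an illustrative assumption; "Aa" and "BB" are a well-known colliding pair for base 31.

```java
final class HashDemo {
    private static final int SEED = 31;

    // Polynomial hash: h = h * SEED + c for each character c, wrapping modulo 2^64.
    static long hash(String s) {
        long h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = h * SEED + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // 31^2 * 97 + 31 * 98 + 99 = 93217 + 3038 + 99
        System.out.println(hash("abc")); // prints 96354
        // Two different strings with the same hash: 65 * 31 + 97 == 66 * 31 + 66 == 2112
        System.out.println(hash("Aa") + " " + hash("BB")); // prints 2112 2112
    }
}
```

Because such pairs exist, a hash match is best confirmed with a direct string comparison before it is reported.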

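Optimization:

Use a rolling hash to compute the hash of every substring of str whose length equals that of key, and store the values in an array.
How the rolling hash works:
1. Compute the hash of the first window directly, which is O(m).
2. Apply the update rule: hash of this window = hash of the previous window * seed + last character of this window - first character of the previous window * pow(seed, lengthOfKey). For example, with a window of length 3, if $C_0*seed^2+C_1*seed^1+C_2*seed^0$ is the hash of the first window, then the hash of the second window is (hash of the first window * seed) $+ C_3 - C_0*seed^3 = C_1*seed^2+C_2*seed^1+C_3*seed^0$. Each update is O(1), so this loop is O(n).
3. Overall, the rolling hash costs O(m+n).
Finally, slide the window once more and compare the hash of key against the stored array of substring hashes.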
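
The update rule can be checked by hand on a small case. The derivation below is a sketch under the notation $H_i$ for the hash of the window starting at index $i$ and $m$ for the window length (both symbols are introduced here, not in the original post), using str = "abcd", $m = 3$ and seed = 31, and ignoring overflow since the numbers are small. It assumes the amsmath package for `align*`.

```latex
\begin{align*}
H_i &= H_{i-1} \cdot seed + C_{i+m-1} - C_{i-1} \cdot seed^{m} \\
H_0 &= 97 \cdot 31^{2} + 98 \cdot 31 + 99 = 93217 + 3038 + 99 = 96354 \\
H_1 &= H_0 \cdot 31 + 100 - 97 \cdot 31^{3} \\
    &= 2986974 + 100 - 2889727 = 97347 \\
\text{direct hash of ``bcd''} &= 98 \cdot 31^{2} + 99 \cdot 31 + 100 = 94178 + 3069 + 100 = 97347
\end{align*}
```

The rolled value and the directly computed value agree, which is what lets each new window be hashed in O(1).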

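    public static ArrayList<Integer> match3(String str, String key) {

        ArrayList<Integer> list = new ArrayList<>();
        // A key longer than str cannot match (and would give hashes() a negative array size)
        if (key.length() > str.length()) return list;
        // Rolling hash of every window, O(m+n)
        long[] hashes = hashes(str, key.length());
        // Hash of key, O(m)
        long hashOfKey = hash(key);
        // Scan the windows and compare hash values, O(n)
        for (int startIndex = 0; startIndex < str.length(); startIndex++) {
            int endIndex = startIndex + key.length() - 1;
            if (endIndex > str.length() - 1) break;
            // Equal hashes only mark a candidate; startsWith() rules out a hash collision
            if (hashOfKey == hashes[startIndex] && str.startsWith(key, startIndex)) list.add(startIndex);
        }
        return list;
    }


    public static long[] hashes(String str, int lengthOfKey) {
        long[] hashes = new long[str.length() - lengthOfKey + 1];
        hashes[0] = hash(str.substring(0, lengthOfKey));

        // seed^lengthOfKey in long arithmetic, wrapping modulo 2^64 exactly like hash();
        // Math.pow works in double and loses precision for longer keys
        long highPower = 1;
        for (int i = 0; i < lengthOfKey; i++) highPower *= seed;

        for (int startIndex = 1; startIndex < str.length(); startIndex++) {
            int endIndex = startIndex + lengthOfKey - 1;
            if (endIndex > str.length() - 1) break;
            // Rolling update: hash of this window = hash of the previous window * seed
            // + last character of this window - first character of the previous window * seed^lengthOfKey
            hashes[startIndex] = hashes[startIndex - 1] * seed + str.charAt(endIndex)
                    - str.charAt(startIndex - 1) * highPower;
        }
        return hashes;
    }

    // Polynomial hash; the long accumulator wraps modulo 2^64 on overflow
    public static long hash(String str) {
        long hash = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = hash * seed + str.charAt(i);
        }
        return hash;
    }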
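
For comparison, here is a self-contained sketch of the same rolling-hash idea that keeps only the current window's hash instead of an array, and confirms every hash hit with a direct comparison. The class name `RabinKarpSketch`, the method `search`, and the choice to return an empty list for an empty key are assumptions of this sketch, not the original author's code; overflow is left to wrap modulo 2^64 rather than reduced by an explicit modulus.

```java
import java.util.ArrayList;
import java.util.List;

final class RabinKarpSketch {
    private static final long SEED = 31;

    // Returns the start index of every occurrence of key in str.
    static List<Integer> search(String str, String key) {
        List<Integer> result = new ArrayList<>();
        int n = str.length();
        int m = key.length();
        if (m == 0 || m > n) {
            return result; // assumption: empty or over-long keys yield no matches
        }

        long highPower = 1;  // SEED^m, wrapping modulo 2^64
        long keyHash = 0;    // hash of key
        long windowHash = 0; // hash of str[0, m)
        for (int i = 0; i < m; i++) {
            highPower *= SEED;
            keyHash = keyHash * SEED + key.charAt(i);
            windowHash = windowHash * SEED + str.charAt(i);
        }

        for (int start = 0; ; start++) {
            // A hash hit is only a candidate; regionMatches filters out collisions.
            if (windowHash == keyHash && str.regionMatches(start, key, 0, m)) {
                result.add(start);
            }
            int next = start + m; // index of the character entering the window
            if (next >= n) {
                break;
            }
            // Roll: shift the window left by one power, add the new character,
            // and remove the contribution of the character that left.
            windowHash = windowHash * SEED + str.charAt(next) - str.charAt(start) * highPower;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(search("123abc123abc00abc", "abc")); // prints [3, 9, 14]
    }
}
```

Keeping a single running hash drops the extra O(n) array; the expected running time stays O(m+n) as long as collisions are rare, while the verification step keeps the result exact.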