Strings (Sequence)

AnEra

Category: Data Structures; Tags: KMP

Article link: https://blog.csdn.net/qq_38975553/article/details/108759250


Prefixes, proper prefixes, suffixes, and proper suffixes of the string thank
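As a small illustration, the sketch below enumerates these four sets for thank. The class and method names are made up for this example, and the empty string is left out, which is a convention assumed here (some texts also count it as a prefix and a suffix).

```java
import java.util.ArrayList;
import java.util.List;

class AffixDemo {
    // All non-empty prefixes of s, shortest first; the last entry is s itself.
    static List<String> prefixes(String s) {
        List<String> result = new ArrayList<>();
        for (int end = 1; end <= s.length(); end++) {
            result.add(s.substring(0, end));
        }
        return result;
    }

    // All non-empty suffixes of s, shortest first; the last entry is s itself.
    static List<String> suffixes(String s) {
        List<String> result = new ArrayList<>();
        for (int start = s.length() - 1; start >= 0; start--) {
            result.add(s.substring(start));
        }
        return result;
    }

    // A proper prefix or suffix excludes the whole string, so drop the last entry.
    static List<String> withoutWhole(List<String> affixes) {
        return affixes.subList(0, Math.max(0, affixes.size() - 1));
    }

    public static void main(String[] args) {
        List<String> p = prefixes("thank");
        List<String> s = suffixes("thank");
        System.out.println(p);               // [t, th, tha, than, thank]
        System.out.println(withoutWhole(p)); // [t, th, tha, than]
        System.out.println(s);               // [k, nk, ank, hank, thank]
        System.out.println(withoutWhole(s)); // [k, nk, ank, hank]
    }
}
```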

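String matching algorithms

Brute Force
KMP
Boyer-Moore
Karp-Rabin
Sunday

Brute Force

Slide the pattern from left to right one character at a time until a match is found.

Brute force, version 1

Brute force 1 – execution

Brute force 1 – implementation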

/**
 * Brute-force matching, version 1
 */
public static int indexOf1(String text, String pattern){
    if(text == null || pattern == null) return -1;
    char[] tChars = text.toCharArray();
    char[] pChars = pattern.toCharArray();
    int tlen = tChars.length;
    int plen = pChars.length;
    if(tlen == 0 || plen == 0 || tlen < plen) return -1;
    int ti = 0, pi = 0;
    // Matching is still in progress while neither ti nor pi has reached the end of its string
    while(ti < tlen && pi < plen){
        if(pChars[pi] == tChars[ti]){
            pi++;
            ti++;
        }else{
            ti = ti - pi + 1;  // ti - pi is where this round started in text; the next round starts one position later
            pi = 0;
        }
    }
    return pi == plen ? ti - pi : -1;
}

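Brute force 1 – optimization

The brute-force algorithm above can exit early, which reduces the number of comparisons: once fewer than plen characters remain in text from the start of the current round, no match is possible.

So the loop condition on ti can be changed from ti < tlen to:

ti - pi <= tlen - plen
where ti - pi is the position in text of the first character compared in the current round

Brute force 1 – optimized implementation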

/**
 * Brute-force matching, version 1 (optimized)
 */
public static int indexOf1_1(String text, String pattern){
    if(text == null || pattern == null) return -1;
    char[] tChars = text.toCharArray();
    char[] pChars = pattern.toCharArray();
    int tlen = tChars.length;
    int plen = pChars.length;
    if(tlen == 0 || plen == 0 || tlen < plen) return -1;
    int ti = 0, pi = 0;
    int tmax = tlen-plen;  // last start position in text where the whole pattern still fits
    // Keep going while the current round starts no later than tmax and the pattern is not fully matched
    while(ti-pi <= tmax && pi < plen){
        if(pChars[pi] == tChars[ti]){
            pi++;
            ti++;
        }else{
            ti = ti - pi + 1;  // start the next round one position after the start of this round
            pi = 0;
        }
    }
    return pi == plen ? ti - pi : -1;
}

Brute force, version 2

Brute force 2 – execution

Brute force 2 – implementation

/**
 * Brute-force matching, version 2
 */
public static int indexOf2(String text, String pattern){
    if(text == null || pattern == null) return -1;
    char[] tChars = text.toCharArray();
    char[] pChars = pattern.toCharArray();
    int tlen = tChars.length;
    int plen = pChars.length;
    if(tlen == 0 || plen == 0 || tlen < plen) return -1;
    int ti = 0, tmax = tlen-plen;
    // ti is the start of the current round; tmax is the last start where the whole pattern
    // still fits, so the bound must be inclusive (ti <= tmax) or a match at the very end is missed
    for (;ti <= tmax;ti++){
       int pi = 0;
       for(;pi < plen; pi++){
           if(tChars[ti + pi] != pChars[pi]) break;
       }
       if(pi == plen) return ti;
    }
    return -1;
}

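Brute force – performance analysis

Best case:
The first round of comparison matches completely, after m comparisons (m is the length of the pattern)
Time complexity is O(m)

Worst case (the larger the character set, the less likely it is to occur):

n - m + 1 rounds of comparison are performed (n is the length of the text)
Every round fails only at the last character of the pattern (m - 1 successful comparisons, then 1 failure)
Time complexity is O(m * (n - m + 1)); since m is usually much smaller than n, this is O(mn)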
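To make these counts concrete, here is a minimal sketch that counts character comparisons in a brute-force search. The class name and the counting helper are assumptions added for illustration; the loop has the same shape as the version 2 implementation, with an inclusive upper bound on the start position.

```java
class BruteForceCount {
    // Counts character comparisons made by a brute-force search that stops at the first match.
    static long countComparisons(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        long comparisons = 0;
        for (int ti = 0; ti <= n - m; ti++) {
            int pi = 0;
            while (pi < m) {
                comparisons++;
                if (text.charAt(ti + pi) != pattern.charAt(pi)) break;
                pi++;
            }
            if (pi == m) break; // full match found, stop searching
        }
        return comparisons;
    }

    public static void main(String[] args) {
        // Best case: the first round matches, m = 4 comparisons.
        System.out.println(countComparisons("aaabaaaaaa", "aaab")); // 4
        // Worst case: n = 10, m = 4, every round fails on the last character,
        // so m * (n - m + 1) = 4 * 7 = 28 comparisons.
        System.out.println(countComparisons("aaaaaaaaaa", "aaab")); // 28
    }
}
```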

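Brute force vs KMP

Compared with brute force, the elegance of KMP is that it makes full use of the characters it has already compared, so it can skip alignments that cannot possibly match.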

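KMP

KMP – using the next table

KMP precomputes a next table (usually an array) from the contents of the pattern.

KMP – core principle

In the description below, d is the text character and e the pattern character at the mismatch; A is a proper prefix and B a proper suffix of the part of the pattern to the left of e; c is the pattern character immediately after A.

When d and e mismatch, we would like to shift pattern to the right by a large distance in one step and then compare d directly with c.
The precondition is that A must equal B.
So KMP must find such A and B within the substring to the left of the mismatched character e, which gives the shift distance.
Shift distance: (length of the substring to the left of e) - (length of A), which is equivalent to (index of e) - (index of c).
Since index of c == next[index of e], the shift distance is (index of e) - next[index of e].

Summary:
If a mismatch occurs at position pi, the pattern shifts right by pi - next[pi], so the smaller next[pi] is, the larger the shift.
next[pi] is the maximum common length of the proper prefixes and proper suffixes of the substring to the left of pi.

KMP – maximum common length of proper prefixes and suffixes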
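The quantity itself can be computed directly from its definition. The following is a deliberately naive sketch (the class and method names are assumptions for this example), not the method KMP uses to build its table.

```java
class BorderLength {
    // Length of the longest string that is both a proper prefix and a proper suffix of s.
    // Tries candidate lengths from longest to shortest, so the first hit is the maximum.
    static int maxCommonLength(String s) {
        for (int len = s.length() - 1; len > 0; len--) {
            String suffix = s.substring(s.length() - len);
            if (s.startsWith(suffix)) return len;
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(maxCommonLength("ABCDAB")); // 2 ("AB")
        System.out.println(maxCommonLength("AAAA"));   // 3 ("AAA")
        System.out.println(maxCommonLength("ABCD"));   // 0
    }
}
```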

KMP – computing the next table

Shift the maximum common lengths one position to the right and set the first entry to -1; the result is the next table.
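A sketch of this shift, using the pattern ABCDABCE as an assumed example. The lengths are computed naively from the definition; the helper names are hypothetical and only serve the illustration.

```java
import java.util.Arrays;

class NextByShift {
    // Longest proper prefix of s that is also a proper suffix of s (naive check).
    static int maxCommonLength(String s) {
        for (int len = s.length() - 1; len > 0; len--) {
            if (s.startsWith(s.substring(s.length() - len))) return len;
        }
        return 0;
    }

    // lengths[i] is the maximum common length of pattern[0..i] (inclusive).
    static int[] lengths(String pattern) {
        int[] result = new int[pattern.length()];
        for (int i = 0; i < result.length; i++) {
            result[i] = maxCommonLength(pattern.substring(0, i + 1));
        }
        return result;
    }

    // Shift every entry one slot to the right and put -1 in front.
    static int[] shiftToNext(int[] lengths) {
        int[] next = new int[lengths.length];
        if (next.length == 0) return next;
        next[0] = -1;
        System.arraycopy(lengths, 0, next, 1, lengths.length - 1);
        return next;
    }

    public static void main(String[] args) {
        int[] lens = lengths("ABCDABCE");
        System.out.println(Arrays.toString(lens));              // [0, 0, 0, 0, 1, 2, 3, 0]
        System.out.println(Arrays.toString(shiftToNext(lens))); // [-1, 0, 0, 0, 0, 1, 2, 3]
    }
}
```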

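KMP – the elegance of -1

It acts as an imaginary wildcard character (a sentinel) at position -1: it matches any character, so after this "match" both ti++ and pi++ are executed as usual.

KMP – main algorithm implementation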

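/**
 * KMP
 */
public static int indexOf(String text, String pattern){
    if(text == null || pattern == null) return -1;
    char[] tChars = text.toCharArray();
    char[] pChars = pattern.toCharArray();
    int tlen = tChars.length;
    int plen = pChars.length;
    if(tlen == 0 || plen == 0 || tlen < plen) return -1;
    int[] next = next_better(pChars);
    int ti = 0, pi = 0;
    int imax = tlen-plen;  // last start position in text where the whole pattern still fits
    // Keep going while the current alignment starts no later than imax and the pattern is not fully matched
    while(ti-pi <= imax && pi < plen){
        // pi < 0 means the sentinel at -1 "matched": move on to the next text character
        if(pi < 0 || pChars[pi] == tChars[ti]){
            pi++;
            ti++;
        }else{
            // Mismatch: ti stays put; the pattern slides right so that position next[pi] lines up with ti
            pi = next[pi];
        }
    }
    return pi == plen ? ti - pi : -1;
}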

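KMP – why the "maximum" common length?

Suppose the text is AAAAABCDEF and the pattern is AAAAB. The first mismatch occurs at pi = 4, and the substring AAAA to its left has common lengths 1, 2 and 3. Using the maximum, 3, shifts the pattern by only 4 - 3 = 1 and finds the match at index 1; using a smaller value would shift further and skip that match.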
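The effect of picking a smaller common length can be checked mechanically. This is only an illustrative sketch: the class and the nextAlignment helper are assumptions, and the helper simply applies the shift rule pi - (common length) to the current alignment.

```java
class WhyMaximum {
    // New start position of the pattern in text after a mismatch at pattern index pi,
    // when the shift is computed from the given common length.
    static int nextAlignment(int alignment, int pi, int commonLength) {
        return alignment + (pi - commonLength);
    }

    public static void main(String[] args) {
        String text = "AAAAABCDEF";
        String pattern = "AAAAB";
        int pi = 4; // text[4] = 'A' mismatches pattern[4] = 'B' at alignment 0

        // "AAAA" has common lengths 3, 2 and 1; try each one.
        for (int common = 3; common >= 1; common--) {
            int start = nextAlignment(0, pi, common);
            boolean matchHere = text.startsWith(pattern, start);
            System.out.println("common=" + common + " -> start=" + start + ", match=" + matchHere);
        }
        // Only common = 3 lands on start = 1, where the pattern actually occurs;
        // common = 2 and common = 1 jump to starts 2 and 3 and pass over the match.
    }
}
```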

KMP – how the next table is built

Given next[i] == n
(1) If pattern.charAt(i) == pattern.charAt(n)
    then next[i + 1] == n + 1
(2) If pattern.charAt(i) != pattern.charAt(n)
    Given next[n] == k
    If pattern.charAt(i) == pattern.charAt(k)
        then next[i + 1] == k + 1
    If pattern.charAt(i) != pattern.charAt(k)
        substitute k for n and repeat (2)
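To see rule (2) in action, the sketch below builds the table while logging every fallback from n to next[n]. The pattern AABAAAB and the logging parameter are assumptions chosen for this example, because that pattern triggers a fallback that succeeds at k.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NextTrace {
    // Builds the next table and records each fallback n -> next[n] taken under rule (2).
    static int[] build(String pattern, List<String> log) {
        int len = pattern.length();
        int[] next = new int[len];
        if (len == 0) return next;
        int i = 0;
        int n = next[0] = -1;
        while (i < len - 1) {
            if (n < 0 || pattern.charAt(i) == pattern.charAt(n)) {
                next[++i] = ++n; // rule (1), or nothing left to fall back to
            } else {
                log.add("i=" + i + ": n " + n + " -> " + next[n]);
                n = next[n];     // rule (2): retry with k = next[n]
            }
        }
        return next;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        int[] next = build("AABAAAB", log);
        System.out.println(Arrays.toString(next)); // [-1, 0, 1, 0, 1, 2, 2]
        // At i = 5: next[5] = 2 but pattern[5] = 'A' differs from pattern[2] = 'B',
        // so n falls back to next[2] = 1; pattern[5] == pattern[1], hence next[6] = 1 + 1 = 2.
        log.forEach(System.out::println);
    }
}
```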

KMP – next table implementation

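private static int[] next(char[] pChars) {
    int len = pChars.length;
    int[] next = new int[len];
    int i = 0;
    // n always holds next[i]; next[0] is the sentinel -1
    int n = next[0] = -1;
    int imax = len-1;
    while(i < imax){
        if(n < 0 || pChars[i] == pChars[n]){
            // Rule (1), or nothing left to fall back to: next[i + 1] = n + 1
            next[++i] = ++n;
        }else{
            // Rule (2): fall back to k = next[n] and compare again
            n = next[n];
        }
    }
    return next;
}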

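KMP – a shortcoming of the next table

Suppose the text is AAABAAAAB and the pattern is AAAAB.

In this case KMP is rather clumsy: after the text character B mismatches at pattern index 3, KMP falls back to indices 2, 1 and 0 in turn and compares the same B against another A each time, although every one of those comparisons is bound to fail.

KMP – optimizing the next table

Given: next[i] == n, next[n] == k (d is the text character at the mismatch)

If pattern[i] != d, the pattern slides so that position next[i] (that is, n) is compared with d
If pattern[n] != d, the pattern slides so that position next[n] (that is, k) is compared with d
If pattern[i] == pattern[n], then a mismatch at position i inevitably ends with the pattern sliding to position k to be compared with d
So next[i] can simply store next[n] (that is, k)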

KMP – optimized next table implementation

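private static int[] next_better(char[] pChars) {
    int len = pChars.length;
    int[] next = new int[len];
    int i = 0;
    // n tracks the unoptimized value (the maximum common length) for position i
    int n = next[0] = -1;
    int imax = len-1;
    while(i < imax){
        if(n < 0 || pChars[i] == pChars[n]){
            i++;
            n++;
            if(pChars[i] == pChars[n]){
                // Falling back to n would compare the same character again and fail, so skip straight to next[n]
                next[i] = next[n];
            }else{
                next[i] = n;
            }
        }else{
            n = next[n];
        }
    }
    return next;
}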

KMP – effect of the next table optimization
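One way to observe the effect is to build both tables for AAAAB and count the character comparisons the KMP main loop performs on AAABAAAAB. The sketch below restates the two table builders so that it is self-contained; the class name and the countComparisons helper are assumptions for this example, and a non-empty pattern no longer than the text is assumed.

```java
import java.util.Arrays;

class NextCompare {
    // Basic table: next[i] is the maximum common length of the part of the pattern left of i.
    static int[] next(char[] p) {
        int[] next = new int[p.length];
        int i = 0, n = next[0] = -1;
        while (i < p.length - 1) {
            if (n < 0 || p[i] == p[n]) {
                next[++i] = ++n;
            } else {
                n = next[n];
            }
        }
        return next;
    }

    // Optimized table: skips fallback positions that hold the same character as position i.
    static int[] nextBetter(char[] p) {
        int[] next = new int[p.length];
        int i = 0, n = next[0] = -1;
        while (i < p.length - 1) {
            if (n < 0 || p[i] == p[n]) {
                i++;
                n++;
                next[i] = (p[i] == p[n]) ? next[n] : n;
            } else {
                n = next[n];
            }
        }
        return next;
    }

    // Runs the KMP main loop with the given table and counts character comparisons.
    static int countComparisons(String text, String pattern, int[] next) {
        char[] t = text.toCharArray(), p = pattern.toCharArray();
        int ti = 0, pi = 0, comparisons = 0;
        int imax = t.length - p.length;
        while (ti - pi <= imax && pi < p.length) {
            if (pi < 0) {          // sentinel: advance without comparing
                pi++;
                ti++;
                continue;
            }
            comparisons++;
            if (p[pi] == t[ti]) {
                pi++;
                ti++;
            } else {
                pi = next[pi];
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        char[] p = "AAAAB".toCharArray();
        System.out.println(Arrays.toString(next(p)));       // [-1, 0, 1, 2, 3]
        System.out.println(Arrays.toString(nextBetter(p))); // [-1, -1, -1, -1, 3]
        System.out.println(countComparisons("AAABAAAAB", "AAAAB", next(p)));       // 12
        System.out.println(countComparisons("AAABAAAAB", "AAAAB", nextBetter(p))); // 9
    }
}
```

With the basic table, the text character B is compared four times (pattern indices 3, 2, 1, 0); with the optimized table it is compared once.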

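KMP – performance analysis

KMP main logic
Best-case time complexity: O(m)
Worst-case time complexity: O(n), with no more than about 2n comparisons

Building the next table is similar to the KMP main logic; its time complexity is O(m)

KMP overall
Best-case time complexity: O(m)
Worst-case time complexity: O(n + m)
Space complexity: O(m)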
