String Matching

lyh20021209

Last modified 2024-02-26 19:20:51


Category: Data Structures and Algorithms. Tags: java, algorithms, leetcode

First published 2024-01-16 14:25:34

Article link: https://blog.csdn.net/lyh20021209/article/details/135623896


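Templates:

KMP:

The details are in the code comments.

If the code is hard to follow, see the Zhihu answer by 阮行止 to the question "如何更好地理解和掌握 KMP 算法?" (How to better understand and master the KMP algorithm?)
https://www.zhihu.com/question/21923021/answer/1032665486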

package StringMatch.KMP;

import java.util.ArrayList;
import java.util.List;

public class KMP {

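    /**
     * Finds the start index of every occurrence of p in s.
     * @param s the text
     * @param p the pattern
     * @return start indices of all matches, in increasing order
     */
    public List<Integer> search(String s,String p){
        ArrayList<Integer> ans = new ArrayList<>();
        char[] sch = s.toCharArray();
        char[] pch = p.toCharArray();
        // Guard (assumption): an empty pattern is treated as having no matches
        if(pch.length == 0){
            return ans;
        }
        int[] next = buildNext(p);

        // Number of pattern characters matched so far
        int count = 0;
        for (int i = 0; i < sch.length; i++) {
            while(count > 0 && pch[count] != sch[i]){
                // Fall back to the length of the longest prefix that can still be reused
                count = next[count-1];
            }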

            if( pch[count] == sch[i]){
                count++;
            }

            if(count == pch.length){
                ans.add(i-pch.length+1);
                // Reuse the longest prefix of the whole pattern that is also its suffix, so overlapping matches are found
                count = next[count-1];
            }
        }

        return ans;
    }

    private int[] buildNext(String p){
        char[] ch = p.toCharArray();
        int plen = ch.length;
        // next[i] = length of the longest proper prefix of ch[0,i] that is also its suffix
        int[] next = new int[plen];
        /*
        next[i] can never be i+1: taking the whole substring as its own border does not help skip the failed position.
        Example: abcabcd fails at d. next[5]=3 says the first six characters matched and their 3-prefix equals their 3-suffix,
        so the pattern shifts by three and its first abc lines up with where its second abc just matched.
        If the whole substring were allowed, next[5] would be 6 and the pattern would shift by six,
        restarting at the position of d and skipping the second abc, which can miss matches.
         */
        for(int i=1;i<plen;i++){
            // Start from the border length k of the previous position
            int k = next[i - 1];
            /*
            Fallback: if ch[k] != ch[i], the border cannot be extended, but we need not restart from 0.
            Example: abcabdddabcabc, i=13, ch[13] = c
            c != ch[next[12]] = ch[5] = d, but c == ch[next[4]] = ch[2]. Where does the 4 come from?
            next[12] = 5 means ch[0,4] = ch[8,12]. We want the largest k with ch[0,k-1] = ch[13-k+1,13],
            and k < 6, because k = 6 is the extension case and would need ch[13] = ch[5].
            next[4] = 2 means ch[0,1] = ab = ch[3,4]; since ch[0,4] = ch[8,12], also ch[0,1] = ch[11,12].
            So next[next[12]-1] = 2 is the longest usable border before ch[13] is appended, and next[13] = 2+1 = 3.
            No k in between works; for instance k = 4 would require abca = cabc, a contradiction.
            One fallback step is not always enough (aabaabaaa needs two at i=8), hence the loop.
             */
            while(k > 0 && ch[k] != ch[i]){
                k = next[k - 1];
            }
            /*
            Extension: if the character right after the border equals the current one, the border grows by one.
            Example: abcdabcd, i=7, ch[7] = d. The previous k=3 means ch[0,2] = ch[4,6];
            since ch[3] = ch[7] = d, this extends to ch[0,3] = ch[4,7].
             */
            if(ch[k] == ch[i]){
                k++;
            }
            next[i] = k;
        }

        return next;
    }

}
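
To sanity-check a next array, it can help to compute it straight from the definition. The sketch below is a brute-force reference (slow, meant only for tiny inputs). The class name NextByDefinition and the method nextBruteForce are made up for illustration and are not part of the template above.

```java
import java.util.Arrays;

class NextByDefinition {
    // next[i] = length of the longest proper prefix of p[0..i] that is also a suffix of p[0..i].
    // Hypothetical helper: tries every candidate length from longest to shortest.
    static int[] nextBruteForce(String p) {
        int n = p.length();
        int[] next = new int[n];
        for (int i = 0; i < n; i++) {
            String sub = p.substring(0, i + 1);
            // len starts at i, never i + 1, so the border is always a proper one
            for (int len = i; len > 0; len--) {
                if (sub.startsWith(sub.substring(sub.length() - len))) {
                    next[i] = len;
                    break;
                }
            }
        }
        return next;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(nextBruteForce("abcabcd")));
        System.out.println(Arrays.toString(nextBruteForce("aabaabaaa")));
    }
}
```

The second string is worth keeping around as a test case: its last position needs two fallback steps, so it separates a fallback loop from a single fallback step.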

Z-Function

The Z-function is also known as extended KMP.

Visualization:

<https://personal.utdallas.edu/~besp/demo/John2010/z-algorithm.htm>

Video walkthrough (Bilibili, 0x3f):

<https://www.bilibili.com/video/BV1it421W7D8/?spm_id_from=333.999.0.0&vd_source=b408ab4c35f1aa86e5d9431d34e3aeac>

package StringMatch.Z_Function;

public class Z_Function {
    /*
    Test strings:
    1. abababzabababab
    2. aabcaabxaaaz
    visualization: <https://personal.utdallas.edu/~besp/demo/John2010/z-algorithm.htm>
     */
    public int[] z_algorithm(String str){
        char[] ch = str.toCharArray();
        int n = ch.length;
        int[] z = new int[n];

        int l,r;
        l = r = 0;

        for(int i=1;i<n;i++){
            if(i<=r){
                z[i] = Math.min(z[i-l],r-i+1);
            }

            while(i+z[i]<n && ch[z[i]] == ch[i+z[i]]){
                l = i;
                r = i+z[i];
                z[i]+=1;
            }
        }

        return z;
    }

    public int LongestSuffixIndex(int[] z){
        for (int i = 1; i < z.length; i++) {
            if(z[i]+i==z.length){
                return i;
            }
        }
        return -1;
    }
}

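Here z[i] is the length of the longest substring starting at i that is also a prefix of the original string. When z[i] is nonzero, ch[i:i+z[i]-1] equals the prefix ch[0:z[i]-1]. Two examples:

First example: abababzabababab

i	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14
ch	a	b	a	b	a	b	z	a	b	a	b	a	b	a	b
z	/	0	4	0	2	0	0	6	0	6	0	4	0	2	0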

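i = 0: matching the whole string against itself carries no information, so i = 0 is skipped.
i = 1: b ≠ a, so z[1] = 0.
i = 2: ch[0] = ch[2] = a, so z[2]++; ch[1] = ch[3] = b, so z[2]++; and so on, until z[2] = 4, where ch[4] = a ≠ ch[6] = z and the scan stops.
i = 3: b ≠ a, so z[3] = 0.
i = 4: this is the core step. z[2] = 4 tells us ch[0:3] = ch[2:5] (this segment is the interval marked by l and r in the code, the z-box). Since ch[4] = ch[2] and ch[2] = ch[0], we get ch[4] = ch[0]. Likewise ch[5] = ch[3] and ch[3] = ch[1], so ch[5] = ch[1]. So z[4] can start at 2 without comparing those characters again.

The puzzling part: where do ch[2] = ch[4] and ch[5] = ch[3] come from, given that the code never compares them explicitly? The second example covers this.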
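
The table for the first example can be reproduced directly from the definition of z[i], without any z-box bookkeeping. This is a minimal quadratic sketch for checking values by hand; ZByDefinition and zNaive are illustrative names, not part of the template.

```java
import java.util.Arrays;

class ZByDefinition {
    // z[i] = length of the longest common prefix of s and s[i..]; z[0] is left as 0.
    static int[] zNaive(String s) {
        int n = s.length();
        int[] z = new int[n];
        for (int i = 1; i < n; i++) {
            // Compare character by character, with no reuse of earlier results
            while (i + z[i] < n && s.charAt(z[i]) == s.charAt(i + z[i])) {
                z[i]++;
            }
        }
        return z;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(zNaive("abababzabababab")));
    }
}
```

The linear-time version has to return exactly the same array; only the number of character comparisons differs.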

Second example: aabcaabxaaaz

i	0	1	2	3	4	5	6	7	8	9	10	11
ch	a	a	b	c	a	a	b	x	a	a	a	z
z	/	1	0	0	3

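This example explains the statement z[i] = min(z[i-l], r-i+1). Everything up to z[4] = 3 is straightforward, so look at i = 5.

At this point z[4] = 3, l = 4, r = 6. This tells us ch[0:2] = ch[4:6], so position 5 inside the box behaves like position 1 in the prefix. z[1] = 1 tells us ch[0:0] = ch[1:1], but that is not the main point. The main point is that it also tells us ch[1] ≠ ch[2]. Since ch[2] = ch[6], it follows that ch[1] ≠ ch[6], so the match at i = 5 stops after one character.

In other words, while i lies inside the z-box, z[i] is governed by z[i-l]. z[i-l] = len tells us ch[0:len-1] = ch[i-l:i-l+len-1], but, as above, the more important fact is ch[len] ≠ ch[i-l+len], and as long as i+len is still inside the box, ch[i-l+len] = ch[i+len]. So ch[i+j] can equal ch[j] only for j < len; otherwise the comparison is blocked by ch[len] ≠ ch[i+len]. That is why z[i-l] appears in the min. The other term, r-i+1, caps the value at the part of the box that has already been verified; nothing beyond r has been compared yet.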
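
To see the min in action, the sketch below records, for each i, the value z[i] starts with (taken from the z-box) before any explicit comparison. It is an illustrative variant, assuming the same inclusive [l, r] box as the template; the class ZBoxTrace and the seed array are hypothetical additions for tracing only.

```java
import java.util.Arrays;

class ZBoxTrace {
    // Returns {seed, z}: seed[i] is the value copied from the z-box, z[i] the final value.
    static int[][] zWithSeeds(String s) {
        char[] ch = s.toCharArray();
        int n = ch.length;
        int[] z = new int[n];
        int[] seed = new int[n];
        int l = 0, r = 0; // inclusive z-box [l, r]
        for (int i = 1; i < n; i++) {
            if (i <= r) {
                z[i] = Math.min(z[i - l], r - i + 1);
            }
            seed[i] = z[i];
            // Explicit comparisons only happen past what the box already guarantees
            while (i + z[i] < n && ch[z[i]] == ch[i + z[i]]) {
                z[i]++;
            }
            // Move the box only when this match reaches further right
            if (i + z[i] - 1 > r) {
                l = i;
                r = i + z[i] - 1;
            }
        }
        return new int[][]{seed, z};
    }

    public static void main(String[] args) {
        int[][] res = zWithSeeds("aabcaabxaaaz");
        System.out.println("seed = " + Arrays.toString(res[0]));
        System.out.println("z    = " + Arrays.toString(res[1]));
    }
}
```

For the second example, seed[5] is 1 and the very first explicit comparison at i = 5 (ch[1] against ch[6]) fails, which is the situation described above.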

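So how is the Z-function used for string matching? Since z[i] is the length of the longest prefix of the string that also starts at position i, prepend the pattern to the text and scan the z array starting at index equal to the pattern length: every position whose value is ≥ the pattern length is a match.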

public List<Integer> search(String s,String p){
    ArrayList<Integer> ans = new ArrayList<>();

    String tmp = p+s;
    int[] z = z_algorithm(tmp);

    int start = p.length();
    for(int i=start;i<z.length;i++){
        if(z[i]>=start){
            ans.add(i-start);
        }
    }

    return ans;
}
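
One cheap way to gain confidence in the Z-based search is to compare it with repeated calls to String.indexOf. The sketch below is self-contained, so it carries its own copy of a Z-function; ZSearchCheck and its method names are illustrative, and it assumes a non-empty pattern.

```java
import java.util.ArrayList;
import java.util.List;

class ZSearchCheck {
    static int[] z(String str) {
        char[] ch = str.toCharArray();
        int n = ch.length;
        int[] z = new int[n];
        int l = 0, r = 0; // inclusive z-box [l, r]
        for (int i = 1; i < n; i++) {
            if (i <= r) z[i] = Math.min(z[i - l], r - i + 1);
            while (i + z[i] < n && ch[z[i]] == ch[i + z[i]]) z[i]++;
            if (i + z[i] - 1 > r) { l = i; r = i + z[i] - 1; }
        }
        return z;
    }

    // Same idea as the search above: run z on p + s and keep positions with z >= |p|
    static List<Integer> searchZ(String s, String p) {
        List<Integer> ans = new ArrayList<>();
        int m = p.length();
        int[] z = z(p + s);
        for (int i = m; i < z.length; i++) {
            if (z[i] >= m) ans.add(i - m);
        }
        return ans;
    }

    // Reference answer: advance by one after each hit so overlapping matches are kept.
    // Assumes p is non-empty (an empty p would make this loop spin forever).
    static List<Integer> searchIndexOf(String s, String p) {
        List<Integer> ans = new ArrayList<>();
        for (int from = s.indexOf(p); from >= 0; from = s.indexOf(p, from + 1)) {
            ans.add(from);
        }
        return ans;
    }

    public static void main(String[] args) {
        System.out.println(searchZ("abababa", "aba"));
        System.out.println(searchIndexOf("abababa", "aba"));
    }
}
```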

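1. LC 3008 Find Beautiful Indices in the Given Array II

The idea is fairly simple:

Use KMP to find the start index of every match of pattern a
… same for pattern b …
KMP scans left to right, so both index lists come out sorted; for each index ∈ kmp(a), a binary search over kmp(b) is enough

This problem is basically a KMP template exercise (in the weekly contest I did not know the template and got TLE right away).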
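
The binary-search step can also be phrased as "find the first b-position ≥ i - k and check that it is ≤ i + k". This is a sketch of that alternative formulation, not the search used in the solution below; NearestMatchSketch, lowerBound and beautiful are hypothetical names, and both position lists are assumed to be sorted in increasing order.

```java
import java.util.ArrayList;
import java.util.List;

class NearestMatchSketch {
    // Index of the first element >= target in a sorted list (sorted.size() if none)
    static int lowerBound(List<Integer> sorted, int target) {
        int lo = 0, hi = sorted.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted.get(mid) < target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // Keeps every i in aPos that has some j in bPos with |i - j| <= k
    static List<Integer> beautiful(List<Integer> aPos, List<Integer> bPos, int k) {
        List<Integer> ans = new ArrayList<>();
        for (int i : aPos) {
            int idx = lowerBound(bPos, i - k);
            if (idx < bPos.size() && bPos.get(idx) <= i + k) {
                ans.add(i);
            }
        }
        return ans;
    }

    public static void main(String[] args) {
        System.out.println(beautiful(List.of(0, 10, 50), List.of(12, 30), 5));
    }
}
```

Searching for a window bound instead of the closest element avoids tracking a running best candidate, at the cost of one extra comparison per query.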

import java.util.ArrayList;
import java.util.List;

class Solution {
    static int interval;
    public List<Integer> beautifulIndices(String s, String a, String b, int k) {
        interval = k;
        char[] sch = s.toCharArray();
        char[] ach = a.toCharArray();
        char[] bch = b.toCharArray();
        ArrayList<Integer> ans = new ArrayList<>();
        List<Integer> ares = kmp(sch, ach);
        List<Integer> bres = kmp(sch, bch);
        if(bres.isEmpty()){
            return ans;
        }
        for (Integer num : ares) {
            int bs = bs(num, bres);
            if(check(bs,num)){
                ans.add(num);
            }
        }
        return ans;
    }

    private int bs(int index,List<Integer> bres){
        int lp,rp,mid,ans;
        lp = 0;
        rp = bres.size();
        ans = -interval-1;
        while(lp<rp){
            mid = ((rp-lp)>>>1)+lp;
            Integer num = bres.get(mid);
            ans = Math.abs(num-index)<Math.abs(ans-index)?num:ans;
            if(num==index){
                return index;
            }else if(num<index){
                lp = mid+1;
            }else{
                rp = mid;
            }
        }
        return ans;
    }

    private boolean check(int i,int j){
        return Math.abs(i-j)<=interval;
    }

    private List<Integer> kmp(char[] sch,char[] pch){
        ArrayList<Integer> ans = new ArrayList<>();
        int[] next = buildNext(pch);

        int count = 0;
        for (int i = 0; i < sch.length; i++) {
            while(count > 0 && pch[count] != sch[i]){
                count = next[count-1];
            }

            if( pch[count] == sch[i]){
                count++;
            }

            if(count == pch.length){
                ans.add(i-pch.length+1);
                count = next[count-1];
            }
        }

        return ans;
    }

    private int[] buildNext(char[] pch){
        int[] next = new int[pch.length];
        for(int i=1;i<pch.length;i++){
            int k = next[i - 1];
            // Keep falling back through shorter borders until one can be extended
            // (a single fallback step is not enough, e.g. for aabaabaaa at i=8)
            while(k > 0 && pch[k] != pch[i]){
                k = next[k - 1];
            }
            if(pch[k] == pch[i]){
                k++;
            }
            next[i] = k;
        }
        return next;
    }
}

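2. LC 3031 Minimum Time to Revert Word to Initial State II

The approach to this problem has already been written up. What matters here is the matching algorithm: when is the first time that some suffix of the original string is also one of its prefixes?

That is exactly what the Z-function answers.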
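
Before the Z-function version, the condition itself can be checked by brute force: after t steps the first t*k characters are gone, so the remaining suffix has to be a prefix of the word. A minimal sketch, fine for small inputs only; RevertBruteForce and minimumTime are illustrative names.

```java
class RevertBruteForce {
    // Smallest t >= 1 such that word[t*k..] is a prefix of word;
    // if no such suffix exists, the whole word must be replaced: ceil(n / k) steps.
    static int minimumTime(String word, int k) {
        int n = word.length();
        for (int t = 1; t * k < n; t++) {
            if (word.startsWith(word.substring(t * k))) {
                return t;
            }
        }
        return (n + k - 1) / k;
    }

    public static void main(String[] args) {
        System.out.println(minimumTime("abacaba", 3));
        System.out.println(minimumTime("abcbabcd", 2));
    }
}
```

Each startsWith call costs up to O(n), which is what the Z array replaces with a single lookup z[i] + i == n.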

class Solution {
    public int minimumTimeToInitialState(String word, int k) {
        char[] ch = word.toCharArray();
        int n = ch.length;

        int[] z = new int[n];
        int l,r;
        l = r = 0;
        for(int i=1;i<n;i++){
            if(i<=r){
                z[i] = Math.min(z[i-l],r-i+1);
            }

            while(i+z[i]<n && ch[z[i]]==ch[i+z[i]]){
                l = i;
                r = i+z[i];
                z[i]+=1;
            }

            if(i%k==0 && z[i]+i == n){
                return i/k;
            }
        }
        
        return (n-1)/k+1;
    }

}