String Matching Algorithms

DouMiaoO_Oo

Published 2019-08-10 10:33:33

Column: Data Structures | Algorithms. Tags: algorithms

Article link: https://blog.csdn.net/DouMiaoO_Oo/article/details/78510135

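Naive method

The most naive idea: in each round $epoch_i$, start matching the pattern from S[i] in the main string S. If the match fails, the next round $epoch_{i+1}$ starts from the next position S[i+1] of the main string and matches the pattern again from its beginning.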

#include <string>
using namespace std;

int strStr(string s, string p){
    /*
        input:
            s: main string
            p: pattern
        output:
            -1 if pattern p is not found in s
            otherwise, the index where p first appears in s
    */
    int n = s.length(), m = p.length();  // signed lengths avoid signed/unsigned comparisons
    int i = 0, j = 0;
    while(i < n && j < m){
        if(s[i] == p[j]) {
            ++i;
            ++j;
        } else{
            i = i-j+1;  // go back to the position right after this round's starting point
            j = 0;
        }
    }
    if(j == m) return i-j;
    return -1;
}

The worst-case time complexity is O(n*m), where n and m are the lengths of the main string and the pattern, respectively.
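To make the O(n*m) bound concrete, here is a minimal sketch that counts character comparisons in the naive scan. The helper name `naive_comparisons` is hypothetical and not part of the original post; it mirrors the loop of `strStr` but returns the number of comparisons instead of the match position.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (an assumption, not from the original post):
// runs the naive scan and returns how many comparisons s[i] == p[j] it made.
std::size_t naive_comparisons(const std::string& s, const std::string& p) {
    std::size_t i = 0, j = 0, comparisons = 0;
    while (i < s.size() && j < p.size()) {
        ++comparisons;
        if (s[i] == p[j]) {
            ++i;
            ++j;
        } else {
            assert(i >= j);  // i - j is the start of the current round
            i = i - j + 1;   // restart one position after that start
            j = 0;
        }
    }
    return comparisons;
}
```

For a main string of eight `a`s followed by `b` (n = 9) and the pattern `aaab` (m = 4), every one of the 6 possible starting positions costs 4 comparisons, so the helper reports 24 = (n-m+1)*m.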

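KMP

A more sophisticated method is the KMP algorithm, whose time complexity is O(m+n). I recently found that a textbook for the postgraduate entrance exam explains it in great detail; you can go straight to that book, whose ISBN is given at the end of this article. Here I restate the book's content, correct some of its errors, and add some of my own understanding.
In the general case, denote the main string by $S$, of length $n$: $S_{0}S_{1}S_{2}\dots S_{n-1}$; and the pattern by $P$, of length $m$: $P_{0}P_{1}\dots P_{m-1}$. Suppose that while matching the pattern against the main string we have reached the following position: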

$$\begin{array}{llll}
S_{0}S_{1}S_{2}\dots & S_{i-j}S_{i-j+1}\dots S_{i-1} & S_{i} & S_{i+1}\dots S_{n-1} \\
 & P_{0}P_{1}\dots P_{j-1} & P_{j} & \dots P_{m-1}
\end{array}$$

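Suppose that $S_{i-j}S_{i-j+1}\dots S_{i-1}$ and $P_{0}P_{1}\dots P_{j-1}$ have matched, where $j \le i$, and that the match fails exactly at $S_{i}$ versus $P_{j}$. We want the index $i$ into the main string $S$ never to move backwards ($i$ never decreases; it either stays the same or increases by 1), and instead update $j$ so that the pattern $P$ shifts right relative to $S$. The way to do this is to find the largest $k$ with $k \lt j$ such that $S_{i-k}S_{i-k+1}\dots S_{i-1} = P_{j-k}P_{j-k+1}\dots P_{j-1} = P_{0}P_{1}\dots P_{k-1}$. The first equality holds automatically, because the last $k$ matched characters of $S$ are exactly the last $k$ characters of $P_{0}P_{1}\dots P_{j-1}$. So the problem involves only the pattern: we look for the longest part $P_{j-k}P_{j-k+1}\dots P_{j-1}$ ending just before position $j$ that coincides with the prefix $P_{0}P_{1}\dots P_{k-1}$ of $P$, and $k$ is the length of this longest common part.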
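As a small worked example of this rule (the strings below are chosen only for illustration and do not come from the book), take $S = \texttt{abcabcabd}$ and $P = \texttt{abcabd}$. The first five characters match and the mismatch occurs at $i = j = 5$:

```latex
\begin{aligned}
S_{0}\dots S_{4} &= P_{0}\dots P_{4} = \texttt{abcab}, \qquad S_{5} = \texttt{c} \ne P_{5} = \texttt{d} \\
k = 4 &: \quad P_{0}\dots P_{3} = \texttt{abca} \ne \texttt{bcab} = P_{1}\dots P_{4} \\
k = 3 &: \quad P_{0}\dots P_{2} = \texttt{abc} \ne \texttt{cab} = P_{2}\dots P_{4} \\
k = 2 &: \quad P_{0}P_{1} = \texttt{ab} = P_{3}P_{4} \quad\Rightarrow\quad j \leftarrow 2,\ i \text{ stays at } 5
\end{aligned}
```

Matching then resumes by comparing $S_{5} = \texttt{c}$ with $P_{2} = \texttt{c}$, and the pattern is eventually found at index $i - j = 9 - 6 = 3$ without $i$ ever moving backwards.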

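Let $next[j]$ be the position that the pattern $P$ must be readjusted to when matching fails at position $j$. The index of that position is $k$ (that is, $next[j] = k$), and the element at index $k$ is $P_{k}$. This means that the beginning of the pattern contains a longest common part $P_{0}P_{1}\dots P_{k-1}$ of length $k$; this part does not need to be compared again, and we only have to try matching $P_{k}$. This agrees with the discussion above.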

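Trivially, for $j = 1$ we have $next[1] = 0$, because when matching the second element fails, the pattern must in any case be moved back to its first element. For $j = 0$ we set $next[0] = -1$ by convention: the very first character of the pattern already differs from the current character of the main string, and in the matching algorithm below this means the index into the main string S must move right. For the general case, consider the following exercise.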

Example

Taken from Section 6.5.3, single-answer multiple-choice questions

Index	0	1	2	3	4	5	6	7	8	9	10	11
P	a	b	a	b	a	a	a	b	a	b	a	a
next	-1	0	0	1	2	3	1	1	2	3	4	5

Let us try to understand what the numbers in the next row mean:

Index	next	Common part with the prefix
2	0	_
3	1	a_
4	2	ab_
5	3	aba_
6	1	a_
7	1	a_
8	2	ab_
9	3	aba_
10	4	abab_
11	5	ababa_

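It is worth emphasizing that the last element of P, the a at P[11], takes no part in building the next array. In other words, the next array can be built correctly even if the last element of P is not given.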
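The table and the remark about the last element can be checked with a brute-force sketch that builds next directly from the definition. Both function names below are hypothetical and not part of the original post; the sketch tries every candidate length k from largest to smallest, so it is much slower than the recurrence used in the KMP code and is meant only as a sanity check.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (an assumption, not from the original post):
// next[j] is the largest k < j with P[0..k-1] == P[j-k..j-1], and next[0] = -1.
std::vector<int> next_by_definition(const std::string& p) {
    std::vector<int> next(p.size(), 0);
    if (p.empty()) return next;
    next[0] = -1;  // convention used in this post
    for (std::size_t j = 1; j < p.size(); ++j) {
        for (std::size_t k = j - 1; k > 0; --k) {
            // compare the prefix of length k with the k characters ending at P[j-1]
            if (p.compare(0, k, p, j - k, k) == 0) {
                next[j] = static_cast<int>(k);
                break;
            }
        }
    }
    return next;
}

// Replaces the last character of p and reports whether next stays the same.
bool last_char_irrelevant(std::string p, char other) {
    assert(!p.empty());
    const std::vector<int> before = next_by_definition(p);
    p.back() = other;
    return next_by_definition(p) == before;
}
```

Since next[j] only looks at characters up to P[j-1], and the largest j is the last index, the final character is never read; `last_char_irrelevant("ababaaababaa", 'z')` therefore returns true.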

Finally, here is the code for the KMP algorithm:


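#include <iostream>
#include <string>
#include <vector>
using namespace std;

vector<int> get_next(string p){
    /*
        Build the next array from the pattern.
        The next array that KMP consults depends only on the pattern, not on the main string.
    */
    if(p.size() == 0) return vector<int>();
    int m = p.length();
    vector<int> next(m);  // next has the same length as the pattern
    int i = 0, j = -1;
    next[i] = j;  // next[0] = -1; next[1] = 0 could also be initialized directly
    while (i+1 < m){  // derive next[i+1]
        // whenever i has just advanced, j == next[i]
        if(j == -1 || p[i] == p[j]){
            next[++i] = ++j;
        } else j = next[j];  // fall back to a shorter common part
    }
    return next;
}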
int KMP(string s, string p, int pos=0){
    /*
        input:
            s: main string
            p: pattern
            pos: start searching for the pattern at s[pos]
        output:
            -1 if pattern p is not found in s
            otherwise, the index where p first appears in s at or after pos
    */
    int n = s.length(), m = p.length();  // signed lengths, because j can be -1
    if(!(0 <= pos && pos < n)){
        cout << "pos must be in [0, s.length())" << endl;
        return -1;
    }
    vector<int> next = get_next(p);
    //cout << "next: "; for (int i = 0; i < next.size(); ++i) cout << next[i]+1 << ' '; cout << endl;
    int i = pos, j = 0;
    int step = 0;
    while(i < n && j < m){
        // j == -1 means p[0] already differs from s[i]: advance i and restart p
        if(j == -1 || (s[i] == p[j])) {
            cout << "step=" << step++ << ' ';
            cout << "match. s[" << i << "]=p[" << j << "]" << endl;
            ++i;
            ++j;
        }
        else{
            cout << "step=" << step++ << ' ';
            cout << "not match. current j=" << j << ",";
            cout << "new_j=next[j]=next[" << j << "]=" << next[j] << endl;
            j = next[j];
        }
    }
    cout << "step=" << step++ << '\n';
    if(j == m) return i-j;
    return -1;
}

int main(){
    //vector<int> res = get_next("abaabcac");
    //for (int i = 0; i < res.size(); ++i) cout << res[i]+1 << endl;
    //cout << strStr("ababcabcacbab", "abcac") << endl;
    //cout << KMP("ababcabcacbab", "abcac") << endl;
    cout << "result of finding pattern: " << KMP("abcabaaabaabcac", "abaabcac") << endl;
    cout << "result of finding pattern: " << KMP("", "") << endl;
    cout << "result of finding pattern: " << KMP("mississippi", "issipi") << endl;
    cout << "result of finding pattern: " << KMP("mississippi", "issip") << endl;


    //cout << "result of finding pattern: " << strStr("", "") << endl;

    return 0;
}
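The version above prints a trace at every step, which is handy for studying the algorithm but noisy for ordinary use. As a minimal sketch, a quiet variant with the same next convention (next[0] = -1) might look like the following; the names `kmp_find` and `agrees_with_std_find` are hypothetical, and returning 0 for an empty pattern is an assumption chosen to match `std::string::find`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical quiet variant of the post's KMP (an assumption, not the original code):
// no tracing output, arguments passed by const reference.
int kmp_find(const std::string& s, const std::string& p) {
    const int n = static_cast<int>(s.size());
    const int m = static_cast<int>(p.size());
    if (m == 0) return 0;  // assumption: an empty pattern matches at index 0

    // Build next with the same recurrence as get_next.
    std::vector<int> next(m);
    next[0] = -1;
    for (int i = 0, j = -1; i + 1 < m;) {
        if (j == -1 || p[i] == p[j]) next[++i] = ++j;
        else j = next[j];
    }

    // Scan the main string; i never moves backwards.
    int i = 0, j = 0;
    while (i < n && j < m) {
        if (j == -1 || s[i] == p[j]) { ++i; ++j; }
        else j = next[j];
    }
    assert(j >= 0);  // j == -1 is always consumed by the following iteration
    return j == m ? i - j : -1;
}

// Cross-check against the standard library's substring search.
bool agrees_with_std_find(const std::string& s, const std::string& p) {
    const std::size_t pos = s.find(p);
    const int expected = (pos == std::string::npos) ? -1 : static_cast<int>(pos);
    return kmp_find(s, p) == expected;
}
```

Comparing against `std::string::find` on the test strings from `main` is a cheap way to gain confidence in the next array and the fallback logic.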