String Matching (Part 2): KMP Algorithm

那后生

Category: Algorithms. Tags: java, algorithms

Article link: https://blog.csdn.net/wangrui1605/article/details/104314739

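Overview

Given two character arrays, str1[0…n-1] (the text) and str2[0…m-1] (the pattern), write a function search(char str1[], char str2[]) that prints every position at which str2[] occurs in str1[]. Assume n > m. (The examples and code below call them txt[] and pat[].)

Examples: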

Input:  txt[] = "THIS IS A TEST TEXT"
        pat[] = "TEST"
Output: Pattern found at index 10

Input:  txt[] =  "AABAACAADAABAABA"
        pat[] =  "AABA"
Output: Pattern found at index 0
        Pattern found at index 9
        Pattern found at index 12

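String matching is an important problem in computer science. Whenever we search for a string in a text editor, a document, a browser, or a database, a string matching algorithm is used to produce the search results.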

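The previous article discussed the basic (naive) string matching algorithm, whose worst-case complexity is O(m(n-m+1)). The KMP algorithm runs in O(n) time in the worst case (O(n+m) once preprocessing of the pattern is included, which is still O(n) here because n > m).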
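For reference, the naive algorithm mentioned above can be sketched as follows. This is a minimal illustration, not the code from the previous article: the class name NaiveSearchSketch and the choice to return a list of indices (instead of printing) are assumptions made for this example, as is the decision to report no matches for an empty pattern.

```java
import java.util.ArrayList;
import java.util.List;

class NaiveSearchSketch {
    // Returns every index in txt at which pat starts (brute force).
    static List<Integer> search(String txt, String pat) {
        List<Integer> hits = new ArrayList<>();
        int n = txt.length();
        int m = pat.length();
        if (m == 0) {
            return hits; // assumption: an empty pattern reports no matches
        }
        // Try each of the n - m + 1 alignments; each costs up to m comparisons,
        // which is where the O(m(n-m+1)) worst case comes from.
        for (int start = 0; start + m <= n; start++) {
            int j = 0;
            while (j < m && txt.charAt(start + j) == pat.charAt(j)) {
                j++;
            }
            if (j == m) {
                hits.add(start);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(search("THIS IS A TEST TEXT", "TEST"));  // prints [10]
        System.out.println(search("AABAACAADAABAABA", "AABA"));     // prints [0, 9, 12]
    }
}
```

The two calls in main use the inputs from the overview, so the printed indices should agree with the expected output listed there.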

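What Is the KMP (Knuth-Morris-Pratt) Pattern Searching Algorithm?

The naive string matching algorithm performs poorly when many matching characters are followed by a mismatching character.
Examples: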

   str1[] = "AAAAAAAAAAAAAAAAAB"
   str2[] = "AAAAB"

   str1[] = "ABABABCABABABCABABABC"
   str2[] =  "ABABAC" (not a worst case, but a bad case for Naive)
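To make the cost difference on such inputs concrete, the sketch below counts character comparisons for both approaches. It is only an illustration under stated assumptions: the class and method names are hypothetical, the pattern is assumed to be non-empty, and the KMP counter includes only text-versus-pattern comparisons (the work of building lps[] is not counted).

```java
class ComparisonCountSketch {
    // Counts character comparisons made by the brute-force search.
    static long naiveComparisons(String txt, String pat) {
        long count = 0;
        int n = txt.length(), m = pat.length();
        for (int start = 0; start + m <= n; start++) {
            for (int j = 0; j < m; j++) {
                count++;
                if (txt.charAt(start + j) != pat.charAt(j)) {
                    break;
                }
            }
        }
        return count;
    }

    // Counts text-vs-pattern comparisons made by KMP (pattern must be non-empty).
    static long kmpComparisons(String txt, String pat) {
        int n = txt.length(), m = pat.length();
        // Build the lps[] table for the pattern.
        int[] lps = new int[m];
        for (int i = 1, len = 0; i < m; ) {
            if (pat.charAt(i) == pat.charAt(len)) {
                lps[i++] = ++len;
            } else if (len != 0) {
                len = lps[len - 1];
            } else {
                lps[i++] = 0;
            }
        }
        long count = 0;
        int i = 0, j = 0;
        while (i < n) {
            count++; // one comparison of txt[i] with pat[j]
            if (txt.charAt(i) == pat.charAt(j)) {
                i++;
                j++;
                if (j == m) {
                    j = lps[j - 1]; // full match: keep the reusable prefix
                }
            } else if (j != 0) {
                j = lps[j - 1];     // mismatch: fall back without moving i
            } else {
                i++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[][] cases = {
            {"AAAAAAAAAAAAAAAAAB", "AAAAB"},
            {"ABABABCABABABCABABABC", "ABABAC"},
        };
        for (String[] c : cases) {
            System.out.println(c[1] + ": naive=" + naiveComparisons(c[0], c[1])
                    + ", kmp=" + kmpComparisons(c[0], c[1]));
        }
    }
}
```

Every KMP comparison either advances i or shrinks j, so the KMP count can never exceed 2n, whereas the naive count can grow with the product of the two lengths.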

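First, we need to understand a few concepts.

A worked example:

Step 1, the first match: we find an occurrence of str2 in str1. This step is the same as in the naive algorithm.
Step 2: as in the naive algorithm, shift str2 one position to the right.

This is where KMP improves on the naive algorithm. In the second window, we only compare the fourth character of str2 to decide whether the current window matches. The first three characters are already known to match, so we skip comparing them.

This raises a question: how do we know which characters can be skipped, and how many? Answering it requires some preprocessing.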
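Step 0: the partial match table (the lps[] array)

KMP preprocesses str2[] and builds an auxiliary array lps[] of size m (the same length as str2; M in the code below), which is used to skip characters during matching.
The name lps stands for the longest proper prefix that is also a suffix. Two notions are involved: a "proper prefix" of a string is any prefix that leaves off at least the last character, and a "proper suffix" is any suffix that leaves off at least the first character. For example, the proper prefixes of "ABC" are "", "A", and "AB", and its proper suffixes are "", "C", and "BC".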
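In more detail:
We look at the sub-patterns str2[0..i] and, within each one, at prefixes that are also suffixes.
lps[i] stores the length of the longest proper prefix of str2[0..i] that is also a suffix of str2[0..i].

lps[i] = length of the longest proper prefix of pat[0..i]
              which is also a suffix of pat[0..i].

Note: lps[i] can equivalently be defined as the longest prefix that is also a proper suffix. The word "proper" has to be applied on at least one side so that the whole substring itself is never counted.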
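The definition can be checked directly with a brute-force sketch like the one below. It is deliberately inefficient (roughly cubic in the pattern length) and is not how KMP builds the table; it only restates the definition in code. The class and method names are hypothetical.

```java
import java.util.Arrays;

class LpsByDefinitionSketch {
    // lps[i] = length of the longest proper prefix of pat[0..i]
    // that is also a suffix of pat[0..i], found by trying every length.
    static int[] lpsByDefinition(String pat) {
        int m = pat.length();
        int[] lps = new int[m];
        for (int i = 0; i < m; i++) {
            String sub = pat.substring(0, i + 1);
            // Candidate lengths from longest to shortest; k <= i keeps the
            // prefix proper (the whole of sub is never considered).
            for (int k = i; k > 0; k--) {
                String suffix = sub.substring(sub.length() - k);
                if (sub.startsWith(suffix)) {
                    lps[i] = k;
                    break;
                }
            }
        }
        return lps;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(lpsByDefinition("AABAACAABAA")));
    }
}
```

Because the candidate length k never reaches the full length of the sub-pattern, a string such as "AAAA" yields 3 at the last index rather than 4.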

Examples of lps[] construction:
For the pattern “AAAA”, 
lps[] is [0, 1, 2, 3]

For the pattern “ABCDE”, 
lps[] is [0, 0, 0, 0, 0]

For the pattern “AABAACAABAA”, 
lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]

For the pattern “AAACAAAAAC”, 
lps[] is [0, 1, 2, 0, 1, 2, 3, 3, 3, 4] 

For the pattern “AAABAAA”, 
lps[] is [0, 1, 2, 0, 1, 2, 3]
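The tables above can be reproduced with a short linear-time sketch. It follows the same logic as the computeLPSArray function shown later in this article, but returns the array instead of filling one passed in; the class name LpsTableSketch and the method name computeLps are assumptions made for this example.

```java
import java.util.Arrays;

class LpsTableSketch {
    // Builds the lps[] table for pat in O(m) time.
    static int[] computeLps(String pat) {
        int m = pat.length();
        int[] lps = new int[m]; // lps[0] stays 0
        int len = 0;            // length of the previous longest prefix-suffix
        int i = 1;
        while (i < m) {
            if (pat.charAt(i) == pat.charAt(len)) {
                lps[i++] = ++len;       // extend the current prefix-suffix
            } else if (len != 0) {
                len = lps[len - 1];     // fall back, do not advance i
            } else {
                lps[i++] = 0;           // no prefix-suffix ends at i
            }
        }
        return lps;
    }

    public static void main(String[] args) {
        String[] patterns = {"AAAA", "ABCDE", "AABAACAABAA", "AAACAAAAAC", "AAABAAA"};
        for (String p : patterns) {
            System.out.println(p + " -> " + Arrays.toString(computeLps(p)));
        }
    }
}
```

Running main prints one table per pattern, which should agree with the five examples listed above.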


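Matching Algorithm

The naive algorithm slides str2 one position at a time and compares all characters at every shift. KMP instead uses the values stored in lps[] to decide which characters to compare next. The idea is to never re-compare a character that we already know will match.
This raises a question:
how do we use lps[] to decide the next position (or the number of characters to skip)?

We start by comparing str2[j] with j = 0 against str1[i] with i = 0.
While str1[i] and str2[j] match, we keep incrementing both i and j.
When we see a mismatch:
- We know that the characters str2[0…j-1] match str1[i-j…i-1] (note that j starts at 0 and is incremented only on a match).
- From the definition above, we also know that lps[j-1] is the number of characters of str2[0…j-1] that form both a proper prefix and a suffix.
- From these two points, we conclude that
  we do not need to compare the first lps[j-1] characters of str2 with the end of str1[i-j…i-1] again, because we know they match anyway. So we set j = lps[j-1] and continue comparing at the same i; if j is already 0, we advance i instead.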
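The mismatch rule can be watched in action with the tracing sketch below. The inputs "AAAAABAAABA" and "AAAA" are chosen here for illustration and are an assumption, since the original figure is not available; the class name, the log format, and the decision to return indices rather than print them are likewise hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class KmpShiftTraceSketch {
    static int[] computeLps(String pat) {
        int[] lps = new int[pat.length()];
        for (int i = 1, len = 0; i < pat.length(); ) {
            if (pat.charAt(i) == pat.charAt(len)) {
                lps[i++] = ++len;
            } else if (len != 0) {
                len = lps[len - 1];
            } else {
                lps[i++] = 0;
            }
        }
        return lps;
    }

    // Returns match positions and appends every change j -> lps[j-1] to log.
    static List<Integer> search(String txt, String pat, List<String> log) {
        List<Integer> hits = new ArrayList<>();
        int n = txt.length(), m = pat.length();
        if (m == 0) {
            return hits; // assumption: an empty pattern reports no matches
        }
        int[] lps = computeLps(pat);
        int i = 0, j = 0;
        while (i < n) {
            if (txt.charAt(i) == pat.charAt(j)) {
                i++;
                j++;
                if (j == m) {
                    hits.add(i - j);
                    log.add("match at " + (i - j) + ": j " + j + " -> " + lps[j - 1]);
                    j = lps[j - 1];
                }
            } else if (j != 0) {
                // pat[0..j-1] matched txt[i-j..i-1]; keep lps[j-1] of them.
                log.add("mismatch at i=" + i + ": j " + j + " -> " + lps[j - 1]);
                j = lps[j - 1];
            } else {
                i++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        System.out.println(search("AAAAABAAABA", "AAAA", log)); // prints [0, 1]
        for (String line : log) {
            System.out.println(line);
        }
    }
}
```

In this trace, the second match at index 1 costs a single comparison, because j drops from 4 to 3 after the first match and only the fourth pattern character is checked again.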


C++ Implementation

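// C++ program implementing the KMP pattern searching
// algorithm
#include <stdio.h>   // printf
#include <string.h>  // strlen
#include <vector>    // std::vector

void computeLPSArray(char* pat, int M, int* lps);

// Prints occurrences of pat[] in txt[]
void KMPSearch(char* pat, char* txt)
{
    int M = strlen(pat);
    int N = strlen(txt);

    // An empty pattern has no lps[] table; nothing to search for
    if (M == 0)
        return;

    // create lps[] that will hold the longest prefix suffix
    // values for pattern (std::vector instead of a
    // non-standard variable-length array)
    std::vector<int> lps(M);

    // Preprocess the pattern (calculate lps[] array)
    computeLPSArray(pat, M, lps.data());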
  
    int i = 0; // index for txt[] 
    int j = 0; // index for pat[] 
    while (i < N) { 
        if (pat[j] == txt[i]) { 
            j++; 
            i++; 
        } 
  
        if (j == M) { 
            printf("Found pattern at index %d\n", i - j);
            j = lps[j - 1]; 
        } 
  
        // mismatch after j matches 
        else if (i < N && pat[j] != txt[i]) { 
            // Do not match lps[0..lps[j-1]] characters, 
            // they will match anyway 
            if (j != 0) 
                j = lps[j - 1]; 
            else
                i = i + 1; 
        } 
    } 
} 
  
// Fills lps[] for given pattern pat[0..M-1]
void computeLPSArray(char* pat, int M, int* lps) 
{ 
    // length of the previous longest prefix suffix 
    int len = 0; 
  
    lps[0] = 0; // lps[0] is always 0 
  
    // the loop calculates lps[i] for i = 1 to M-1 
    int i = 1; 
    while (i < M) { 
        if (pat[i] == pat[len]) { 
            len++; 
            lps[i] = len; 
            i++; 
        } 
        else // (pat[i] != pat[len]) 
        { 
            // This is tricky. Consider the example. 
            // AAACAAAA and i = 7. The idea is similar 
            // to search step. 
            if (len != 0) { 
                len = lps[len - 1]; 
  
                // Also, note that we do not increment 
                // i here 
            } 
            else // if (len == 0) 
            { 
                lps[i] = 0; 
                i++; 
            } 
        } 
    } 
} 
  
// Driver program to test above function 
int main() 
{ 
    char txt[] = "ABABDABACDABABCABAB"; 
    char pat[] = "ABABCABAB"; 
    KMPSearch(pat, txt); 
    return 0; 
}

Java Implementation

// JAVA program for implementation of KMP pattern 
// searching algorithm 
  
class KMP_String_Matching { 
    void KMPSearch(String pat, String txt) 
    { 
        int M = pat.length();
        int N = txt.length();

        // An empty pattern has no lps[] table; nothing to search for
        if (M == 0)
            return;
  
        // create lps[] that will hold the longest 
        // prefix suffix values for pattern 
        int lps[] = new int[M]; 
        int j = 0; // index for pat[] 
  
        // Preprocess the pattern (calculate lps[] 
        // array) 
        computeLPSArray(pat, M, lps); 
  
        int i = 0; // index for txt[] 
        while (i < N) { 
            if (pat.charAt(j) == txt.charAt(i)) { 
                j++; 
                i++; 
            } 
            if (j == M) { 
                System.out.println("Found pattern "
                                   + "at index " + (i - j)); 
                j = lps[j - 1]; 
            } 
  
            // mismatch after j matches 
            else if (i < N && pat.charAt(j) != txt.charAt(i)) { 
                // Do not match lps[0..lps[j-1]] characters, 
                // they will match anyway 
                if (j != 0) 
                    j = lps[j - 1]; 
                else
                    i = i + 1; 
            } 
        } 
    } 
  
    void computeLPSArray(String pat, int M, int lps[]) 
    { 
        // length of the previous longest prefix suffix 
        int len = 0; 
        int i = 1; 
        lps[0] = 0; // lps[0] is always 0 
  
        // the loop calculates lps[i] for i = 1 to M-1 
        while (i < M) { 
            if (pat.charAt(i) == pat.charAt(len)) { 
                len++; 
                lps[i] = len; 
                i++; 
            } 
            else // (pat[i] != pat[len]) 
            { 
                // This is tricky. Consider the example. 
                // AAACAAAA and i = 7. The idea is similar 
                // to search step. 
                if (len != 0) { 
                    len = lps[len - 1]; 
  
                    // Also, note that we do not increment 
                    // i here 
                } 
                else // if (len == 0) 
                { 
                    lps[i] = len; 
                    i++; 
                } 
            } 
        } 
    } 
  
    // Driver program to test above function 
    public static void main(String args[]) 
    { 
        String txt = "ABABDABACDABABCABAB"; 
        String pat = "ABABCABAB"; 
        new KMP_String_Matching().KMPSearch(pat, txt); 
    } 
} 
// This code has been contributed by Amit Khandelwal.

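Preprocessing Algorithm

Preprocessing computes the values of lps[]. The variable len holds the length of the longest proper prefix that is also a suffix for the previous index. If pat[len] and pat[i] match, len is incremented and stored in lps[i]; otherwise, if len is not 0, len falls back to lps[len-1] and the comparison is retried with the same i; if len is 0, lps[i] is set to 0. The walkthrough below uses the following pattern: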

pat[] = “AAACAAAA”

len = 0, i = 0.
lps[0] is always 0, we move
to i = 1

len = 0, i = 1.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 1, lps[1] = 1, i = 2

len = 1, i = 2.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 2, lps[2] = 2, i = 3

len = 2, i = 3.
Since pat[len] and pat[i] do not match, and len > 0,
set len = lps[len-1] = lps[1] = 1

len = 1, i = 3.
Since pat[len] and pat[i] do not match and len > 0,
len = lps[len-1] = lps[0] = 0

len = 0, i = 3.
Since pat[len] and pat[i] do not match and len = 0,
Set lps[3] = 0 and i = 4.

len = 0, i = 4.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 1, lps[4] = 1, i = 5

len = 1, i = 5.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 2, lps[5] = 2, i = 6

len = 2, i = 6.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 3, lps[6] = 3, i = 7

len = 3, i = 7.
Since pat[len] and pat[i] do not match and len > 0,
set len = lps[len-1] = lps[2] = 2

len = 2, i = 7.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 3, lps[7] = 3, i = 8

We stop here as we have constructed the whole lps[].
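The trace above takes ten loop steps for a pattern of length 8, which hints at why preprocessing is linear: every step either advances i or shrinks len, and len can shrink only as often as it has grown. The sketch below counts the steps; the class and method names are hypothetical and exist only for this illustration.

```java
class LpsStepCountSketch {
    // Runs the lps[] construction loop and returns how many times its body executed.
    static int countSteps(String pat) {
        int m = pat.length();
        int[] lps = new int[m];
        int steps = 0;
        int len = 0;
        int i = 1;
        while (i < m) {
            steps++;
            if (pat.charAt(i) == pat.charAt(len)) {
                lps[i++] = ++len;       // i advances, len grows by one
            } else if (len != 0) {
                len = lps[len - 1];     // i stays, len shrinks
            } else {
                lps[i++] = 0;           // i advances, len stays 0
            }
        }
        return steps;
    }

    public static void main(String[] args) {
        String pat = "AAACAAAA";
        System.out.println("steps = " + countSteps(pat)
                + ", bound 2(m-1) = " + 2 * (pat.length() - 1));
    }
}
```

For "AAACAAAA" the count is 10, one per step in the trace after the initial lps[0] = 0, and it stays below the bound of 2(m-1) = 14.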

