The KMP Algorithm

Latest recommended article published on 2021-07-31 17:59:26

K.Sun

Category: Algorithm. Tags: kmp, algorithm, search


Original article: Searching for Patterns | Set 2 (KMP Algorithm)

Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char txt[]) that prints every position at which pat[] occurs in txt[].

Examples:

Input:  txt[] = "THIS IS A TEST TEXT"
        pat[] = "TEST"
Output: Pattern found at index 10

Input:  txt[] =  "AABAACAADAABAAABAA"
        pat[] =  "AABA"
Output: Pattern found at index 0
        Pattern found at index 9
        Pattern found at index 13

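Pattern searching is an important problem in computer science. When we search for a string in Notepad, Word, a browser, or a database, pattern searching algorithms are used to produce the search results.

We discussed the Naive pattern searching algorithm in an earlier post. The worst-case time complexity of the Naive algorithm is O(m(n-m+1)). The worst-case time complexity of the KMP search is O(n), plus O(m) to preprocess the pattern.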

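KMP (Knuth Morris Pratt) Pattern Searching

The Naive pattern searching algorithm does not work well when many matching characters are followed by a mismatching character. Here are some examples.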

   txt[] = "AAAAAAAAAAAAAAAAAB"
   pat[] = "AAAAB"

   txt[] = "ABABABCABABABCABABABC"
   pat[] = "ABABAC" (not a worst case, but a bad case for Naive)
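
To make the cost of these cases concrete, here is a minimal sketch that counts the character comparisons the Naive algorithm performs. The class name NaiveComparisonCount and the method countComparisons are hypothetical helpers introduced only for this illustration; they are not part of the original article.

```java
// Hypothetical helper for illustration; not part of the original article.
class NaiveComparisonCount {
    // Counts character comparisons made by the Naive search: every window
    // is compared from its first character until a mismatch or a full match.
    static long countComparisons(String pat, String txt) {
        int m = pat.length();
        int n = txt.length();
        long comparisons = 0;
        for (int start = 0; start + m <= n; start++) {
            int j = 0;
            while (j < m) {
                comparisons++;
                if (txt.charAt(start + j) != pat.charAt(j)) {
                    break; // mismatch: give up on this window
                }
                j++;
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        System.out.println(countComparisons("AAAAB", "AAAAAAAAAAAAAAAAAB"));
        System.out.println(countComparisons("ABABAC", "ABABABCABABABCABABABC"));
    }
}
```

In the first example every one of the n-m+1 windows costs the full m comparisons, because the mismatch (or the match) is only decided at the last character of the window. That is exactly the O(m(n-m+1)) worst case.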

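The KMP matching algorithm uses the degenerating property of the pattern (the pattern contains the same sub-patterns more than once) and improves the worst-case complexity to O(n). The basic idea behind KMP is this: whenever we detect a mismatch (after some matches), we already know some of the characters in the text of the next window. We take advantage of this information to avoid re-matching characters that we know will match anyway. Consider the example below to understand this.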

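Matching overview
txt = "AAAAABAAABA" 
pat = "AAAA"

We first compare pat with the first window of txt
txt = "AAAAABAAABA" 
pat = "AAAA"  [Initial position]
We find a match. This is the same as Naive string matching.

In the next step, we compare pat with the next window of txt
txt = "AAAAABAAABA" 
pat =  "AAAA" [Pattern shifted one position]

This is where KMP improves on the Naive algorithm. In this second window, we only compare the fourth A of the pattern with the fourth character of the current window of text to decide whether the current window matches. Since we know the first three characters will match anyway, we skip matching them.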

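Why is preprocessing needed?

The explanation above raises an important question: how do we know how many characters can be skipped? To answer it, we preprocess the pattern and prepare an integer array lps[] that tells us the number of characters to skip.

Preprocessing overview: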

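KMP preprocesses pat[] and builds an auxiliary array lps[] of size m (the same as the size of the pattern), which is used to skip characters while matching.
The name lps stands for the longest proper prefix which is also a suffix. A proper prefix is a prefix that is not the whole string. For example, the prefixes of "ABC" are "", "A", "AB" and "ABC". The proper prefixes are "", "A" and "AB". The suffixes of the string are "", "C", "BC" and "ABC".
For each sub-pattern pat[0..i], where i runs from 0 to m-1, lps[i] stores the length of the longest proper prefix of pat[0..i] that is also a suffix of pat[0..i].

lps[i] = the length of the longest proper prefix of pat[0..i] 
              which is also a suffix of pat[0..i].

Note: lps[i] could equally be defined as the longest prefix which is also a proper suffix. The word "proper" is needed in one of the two places to make sure that the whole substring is not considered.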

Examples of lps[] construction:
For the pattern “AAAA”, 
lps[] is [0, 1, 2, 3]

For the pattern “ABCDE”, 
lps[] is [0, 0, 0, 0, 0]

For the pattern “AABAACAABAA”, 
lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]

For the pattern “AAACAAAAAC”, 
lps[] is [0, 1, 2, 0, 1, 2, 3, 3, 3, 4] 

For the pattern “AAABAAA”, 
lps[] is [0, 1, 2, 0, 1, 2, 3]
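
The examples above can be checked directly against the definition. The following is a minimal sketch that computes lps[] by brute force, trying every candidate length for every prefix. The class name LpsByDefinition and the method lpsBruteForce are hypothetical names chosen for this illustration, and the approach is deliberately slow (roughly cubic in m); it is only meant for verifying small examples, not as the preprocessing step KMP actually uses.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the original article.
class LpsByDefinition {
    // lps[i] = length of the longest proper prefix of pat[0..i]
    // that is also a suffix of pat[0..i], found by trying every length.
    static int[] lpsBruteForce(String pat) {
        int[] lps = new int[pat.length()];
        for (int i = 0; i < pat.length(); i++) {
            String sub = pat.substring(0, i + 1);
            // Try the longest candidate first; k < sub.length() keeps the prefix proper.
            for (int k = sub.length() - 1; k > 0; k--) {
                String suffix = sub.substring(sub.length() - k);
                if (sub.startsWith(suffix)) {
                    lps[i] = k;
                    break;
                }
            }
        }
        return lps;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(lpsBruteForce("AABAACAABAA")));
        // prints [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
    }
}
```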

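Searching algorithm:

Unlike the Naive algorithm, where we slide the pattern by one position and compare all characters at each shift, we use a value from lps[] to decide the next characters to be matched. The idea is to avoid matching a character that we know will match anyway.

How do we use lps[] to decide the next position (or to know the number of characters to skip)?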

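We start comparing pat[j], with j = 0, against the characters of the current window of text.
We keep matching txt[i] against pat[j], incrementing both i and j for as long as txt[i] and pat[j] match.
When we see a mismatch
– We know that the characters pat[0..j-1] match txt[i-j…i-1] (note: j starts at 0 and is incremented only when there is a match).
– We also know (from the definition above) that lps[j-1] is the number of characters of pat[0…j-1] that form both a proper prefix and a suffix.
– From these two points we can conclude that we do not need to match the first lps[j-1] characters of the pattern against the end of txt[i-j…i-1], because we know these characters will match anyway. Let us consider the example above to understand this.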

txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
lps[] = {0, 1, 2, 3} 

i = 0, j = 0
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 1, j = 1
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 2, j = 2
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 3, j = 3
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 4, j = 4
Since j == M, print pattern found and reset j,
j = lps[j-1] = lps[3] = 3

Here unlike Naive algorithm, we do not match first three 
characters of this window. Value of lps[j-1] (in above 
step) gave us index of next character to match.
i = 4, j = 3
txt[] = "AAAAABAAABA" 
pat[] =  "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 5, j = 4
Since j == M, print pattern found and reset j,
j = lps[j-1] = lps[3] = 3

Again unlike Naive algorithm, we do not match first three 
characters of this window. Value of lps[j-1] (in above 
step) gave us index of next character to match.
i = 5, j = 3
txt[] = "AAAAABAAABA" 
pat[] =   "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[2] = 2

i = 5, j = 2
txt[] = "AAAAABAAABA" 
pat[] =    "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[1] = 1 

i = 5, j = 1
txt[] = "AAAAABAAABA" 
pat[] =     "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[0] = 0

i = 5, j = 0
txt[] = "AAAAABAAABA" 
pat[] =      "AAAA"
txt[i] and pat[j] do NOT match and j is 0, we do i++.

i = 6, j = 0
txt[] = "AAAAABAAABA" 
pat[] =       "AAAA"
txt[i] and pat[j] match, do i++ and j++

i = 7, j = 1
txt[] = "AAAAABAAABA" 
pat[] =       "AAAA"
txt[i] and pat[j] match, do i++ and j++

We continue this way...

// JAVA program for implementation of KMP pattern
// searching algorithm

class KMP_String_Matching
{
    void KMPSearch(String pat, String txt)
    {
        int M = pat.length();
        int N = txt.length();

        // An empty pattern has no lps[0]; there is nothing to search for
        if (M == 0)
            return;

        // create lps[] that will hold the longest
        // prefix suffix values for pattern
        int lps[] = new int[M];
        int j = 0;  // index for pat[]

        // Preprocess the pattern (calculate lps[]
        // array)
        computeLPSArray(pat,M,lps);

        int i = 0;  // index for txt[]
        while (i < N)
        {
            if (pat.charAt(j) == txt.charAt(i))
            {
                j++;
                i++;
            }
            if (j == M)
            {
                System.out.println("Found pattern "+
                              "at index " + (i-j));
                j = lps[j-1];
            }

            // mismatch after j matches
            else if (i < N && pat.charAt(j) != txt.charAt(i))
            {
                // Do not match lps[0..lps[j-1]] characters,
                // they will match anyway
                if (j != 0)
                    j = lps[j-1];
                else
                    i = i+1;
            }
        }
    }

    void computeLPSArray(String pat, int M, int lps[])
    {
        // length of the previous longest prefix suffix
        int len = 0;
        int i = 1;
        lps[0] = 0;  // lps[0] is always 0

        // the loop calculates lps[i] for i = 1 to M-1
        while (i < M)
        {
            if (pat.charAt(i) == pat.charAt(len))
            {
                len++;
                lps[i] = len;
                i++;
            }
            else  // (pat[i] != pat[len])
            {
                // This is tricky. Consider the example.
                // AAACAAAA and i = 7. The idea is similar 
                // to search step.
                if (len != 0)
                {
                    len = lps[len-1];

                    // Also, note that we do not increment
                    // i here
                }
                else  // if (len == 0)
                {
                    lps[i] = len;
                    i++;
                }
            }
        }
    }

    // Driver program to test above function
    public static void main(String args[])
    {
        String txt = "ABABDABACDABABCABAB";
        String pat = "ABABCABAB";
        new KMP_String_Matching().KMPSearch(pat,txt);
    }
}
// This code has been contributed by Amit Khandelwal.

Output:

Found pattern at index 10
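
The program above prints each match as it is found. As a rough sketch of an alternative, the same idea can be written so that the match positions are returned in a list, which makes the examples from the start of the article easy to check. The class name KmpIndices and the methods search and computeLps are hypothetical names for this illustration, and the loops are a more compact formulation than the listing above, not the original author's code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical variant for illustration; not part of the original article.
class KmpIndices {
    // Builds lps[] for the pattern: lps[i] is the length of the longest
    // proper prefix of pat[0..i] that is also a suffix of pat[0..i].
    static int[] computeLps(String pat) {
        int[] lps = new int[pat.length()];
        int len = 0;
        for (int i = 1; i < pat.length(); i++) {
            // Fall back through shorter prefix-suffixes until one can be extended.
            while (len > 0 && pat.charAt(i) != pat.charAt(len)) {
                len = lps[len - 1];
            }
            if (pat.charAt(i) == pat.charAt(len)) {
                len++;
            }
            lps[i] = len;
        }
        return lps;
    }

    // Returns the start index of every occurrence of pat in txt.
    static List<Integer> search(String pat, String txt) {
        List<Integer> hits = new ArrayList<>();
        int m = pat.length();
        int n = txt.length();
        if (m == 0 || m > n) {
            return hits; // assumption: an empty pattern reports no matches
        }
        int[] lps = computeLps(pat);
        int j = 0; // number of pattern characters currently matched
        for (int i = 0; i < n; i++) {
            while (j > 0 && txt.charAt(i) != pat.charAt(j)) {
                j = lps[j - 1]; // mismatch: reuse the longest prefix-suffix
            }
            if (txt.charAt(i) == pat.charAt(j)) {
                j++;
            }
            if (j == m) {
                hits.add(i - m + 1);
                j = lps[j - 1]; // keep going to find overlapping matches
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(search("AABA", "AABAACAADAABAAABAA")); // prints [0, 9, 13]
    }
}
```

Note that i only ever moves forward in the text; only j falls back, which is why the search never re-reads text characters.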

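Preprocessing algorithm:

In the preprocessing part, we calculate the values in lps[]. To do that, we keep track of the length of the longest prefix that is also a suffix for the previous index (we use the variable len for this). We initialize lps[0] and len to 0. If pat[len] and pat[i] match, we increment len by 1 and assign the incremented value to lps[i]. If pat[i] and pat[len] do not match and len is not 0, we update len to lps[len-1] without advancing i. If they do not match and len is 0, we set lps[i] to 0 and move on to the next i. See computeLPSArray() in the code above for details.

Illustration of preprocessing (construction of lps[])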

pat[] = "AAACAAAA"

len = 0, i  = 0.
lps[0] is always 0, we move 
to i = 1

len = 0, i  = 1.
Since pat[len] and pat[i] match, do len++, 
store it in lps[i] and do i++.
len = 1, lps[1] = 1, i = 2

len = 1, i  = 2.
Since pat[len] and pat[i] match, do len++, 
store it in lps[i] and do i++.
len = 2, lps[2] = 2, i = 3

len = 2, i  = 3.
Since pat[len] and pat[i] do not match, and len > 0, 
set len = lps[len-1] = lps[1] = 1

len = 1, i  = 3.
Since pat[len] and pat[i] do not match and len > 0, 
len = lps[len-1] = lps[0] = 0

len = 0, i  = 3.
Since pat[len] and pat[i] do not match and len = 0, 
Set lps[3] = 0 and i = 4.

len = 0, i  = 4.
Since pat[len] and pat[i] match, do len++, 
store it in lps[i] and do i++.
len = 1, lps[4] = 1, i = 5

len = 1, i  = 5.
Since pat[len] and pat[i] match, do len++, 
store it in lps[i] and do i++.
len = 2, lps[5] = 2, i = 6

len = 2, i  = 6.
Since pat[len] and pat[i] match, do len++, 
store it in lps[i] and do i++.
len = 3, lps[6] = 3, i = 7

len = 3, i  = 7.
Since pat[len] and pat[i] do not match and len > 0,
set len = lps[len-1] = lps[2] = 2

len = 2, i  = 7.
Since pat[len] and pat[i] match, do len++, 
store it in lps[i] and do i++.
len = 3, lps[7] = 3, i = 8

We stop here as we have constructed the whole lps[].
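
The walkthrough above can be reproduced mechanically. Below is a minimal sketch of the same construction that prints one line per step, so its output can be compared with the illustration. The class name LpsTrace and the method computeLpsWithTrace are hypothetical names for this illustration, and the wording of the printed messages is an assumption of this sketch.

```java
import java.util.Arrays;

// Hypothetical tracing helper for illustration; not part of the original article.
class LpsTrace {
    // Same construction as computeLPSArray, but returns the array
    // and reports each decision as it is made.
    static int[] computeLpsWithTrace(String pat) {
        int m = pat.length();
        int[] lps = new int[m]; // lps[0] stays 0
        int len = 0;            // length of the previous longest prefix-suffix
        int i = 1;
        while (i < m) {
            if (pat.charAt(i) == pat.charAt(len)) {
                len++;
                lps[i] = len;
                System.out.println("i = " + i + ": match, lps[" + i + "] = " + len);
                i++;
            } else if (len != 0) {
                int shorter = lps[len - 1];
                System.out.println("i = " + i + ": mismatch, len " + len + " -> " + shorter);
                len = shorter; // i is deliberately not incremented here
            } else {
                lps[i] = 0;
                System.out.println("i = " + i + ": mismatch with len = 0, lps[" + i + "] = 0");
                i++;
            }
        }
        return lps;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(computeLpsWithTrace("AAACAAAA")));
    }
}
```

For "AAACAAAA" the returned array is [0, 1, 2, 0, 1, 2, 3, 3], matching the steps listed above.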
