The KMP Algorithm

Original article: Searching for Patterns | Set 2 (KMP Algorithm)

Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char txt[]) that prints all occurrences of pat[] in txt[].

Examples:

Input:  txt[] = "THIS IS A TEST TEXT"
        pat[] = "TEST"
Output: Pattern found at index 10

Input:  txt[] =  "AABAACAADAABAAABAA"
        pat[] =  "AABA"
Output: Pattern found at index 0
        Pattern found at index 9
        Pattern found at index 13

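Pattern searching is an important problem in computer science. When we search for a string in Notepad, a Word document, a browser, or a database, pattern searching algorithms are used to show the search results.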

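We discussed the Naive pattern searching algorithm in an earlier post. Its worst-case time complexity is O(m(n-m+1)). The worst-case time complexity of the KMP search is O(n), after an O(m) preprocessing pass over the pattern.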

KMP (Knuth Morris Pratt) Pattern Searching

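The Naive pattern searching algorithm does not work well when many matching characters are followed by a mismatching character. Following are some examples.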

   txt[] = "AAAAAAAAAAAAAAAAAB"
   pat[] = "AAAAB"

   txt[] = "ABABABCABABABCABABABC"
   pat[] = "ABABAC" (not a worst case, but a bad case for Naive)
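
To make the cost of these inputs concrete, here is a minimal sketch of the Naive search that counts character comparisons instead of printing matches. The names NaiveComparisonCount and countComparisons are illustrative and do not come from the original article.

```java
// Illustrative sketch: NaiveComparisonCount and countComparisons are
// hypothetical names, not part of the original article.
class NaiveComparisonCount
{
    // Returns the number of character comparisons the Naive search
    // performs while checking every window of txt against pat.
    static int countComparisons(String pat, String txt)
    {
        int M = pat.length();
        int N = txt.length();
        int comparisons = 0;

        // Slide the pattern over the text one position at a time
        for (int i = 0; i + M <= N; i++)
        {
            int j = 0;
            while (j < M)
            {
                comparisons++;
                if (txt.charAt(i + j) != pat.charAt(j))
                    break;  // mismatch: give up on this window
                j++;
            }
        }
        return comparisons;
    }

    public static void main(String args[])
    {
        // First example above: a long run of A's followed by B
        System.out.println(
            countComparisons("AAAAB", "AAAAAAAAAAAAAAAAAB"));
    }
}
```

On a text shaped like the first example, every window costs m comparisons, because the mismatch (or the full match) is only discovered at the last character of the pattern. That is exactly the O(m(n-m+1)) behaviour KMP avoids.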

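The KMP matching algorithm uses the degenerating property of the pattern (a pattern in which the same sub-patterns appear more than once) and improves the worst-case complexity to O(n). The basic idea behind KMP is this: whenever we detect a mismatch (after some matches), we already know some of the characters in the text of the next window. We take advantage of this information to avoid re-matching characters that we know will match anyway. Consider the example below to understand this.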

Matching Overview
txt = "AAAAABAAABA" 
pat = "AAAA"

We compare the first window of txt with pat
txt = "AAAAABAAABA" 
pat = "AAAA"  [Initial position]
We find a match. This is the same as Naive string matching.

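In the next step, we compare the next window of txt with pat
txt = "AAAAABAAABA" 
pat =  "AAAA" [Pattern shifted one position]

This is where KMP optimizes over the Naive algorithm. In this second window, we only compare the fourth A of the pattern with the fourth character of the current window of text to decide whether the current window matches. Since we know the first three characters will match anyway, we skip matching them.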

Need for Preprocessing?

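An important question arises from the above explanation: how do we know how many characters can be skipped? To answer this, we preprocess the pattern and prepare an integer array lps[] that tells us the count of characters to be skipped.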

Preprocessing Overview:

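  • The KMP algorithm preprocesses pat[] and constructs an auxiliary array lps[] of size m (same as the size of the pattern), which is used to skip characters while matching.
  • The name lps stands for the longest proper prefix which is also a suffix. A proper prefix is a prefix that is not the whole string. For example, the prefixes of "ABC" are "", "A", "AB" and "ABC". The proper prefixes are "", "A" and "AB". The suffixes of the string are "", "C", "BC" and "ABC".
  • For each sub-pattern pat[0..i], where i runs from 0 to m-1, lps[i] stores the length of the longest proper prefix of pat[0..i] that is also a suffix of pat[0..i].
lps[i] = length of the longest proper prefix of pat[0..i] 
              which is also a suffix of pat[0..i]. 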

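Note: lps[i] could equally be defined as the longest prefix that is also a proper suffix. The word "proper" is needed in one of the two places to make sure that the whole substring is not counted.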
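
The prefix and suffix terminology can be checked with a few lines of code. The following is only a sketch to illustrate the definitions; the class ProperPrefixDemo and its methods are hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: ProperPrefixDemo is a hypothetical helper,
// not part of the original article.
class ProperPrefixDemo
{
    // All proper prefixes of s: every prefix except s itself
    static List<String> properPrefixes(String s)
    {
        List<String> result = new ArrayList<>();
        for (int len = 0; len < s.length(); len++)
            result.add(s.substring(0, len));
        return result;
    }

    // All suffixes of s, including the empty string and s itself
    static List<String> suffixes(String s)
    {
        List<String> result = new ArrayList<>();
        for (int len = 0; len <= s.length(); len++)
            result.add(s.substring(s.length() - len));
        return result;
    }

    public static void main(String args[])
    {
        System.out.println(properPrefixes("ABC")); // prints [, A, AB]
        System.out.println(suffixes("ABC"));       // prints [, C, BC, ABC]
    }
}
```

The empty string shows up as the blank first entry in both lists. For "ABC" it is the only string the two lists share, so the lps value for the full string "ABC" is 0.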

Examples of lps[] construction:
For the pattern "AAAA", 
lps[] is [0, 1, 2, 3]

For the pattern "ABCDE", 
lps[] is [0, 0, 0, 0, 0]

For the pattern "AABAACAABAA", 
lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]

For the pattern "AAACAAAAAC", 
lps[] is [0, 1, 2, 0, 1, 2, 3, 3, 3, 4] 

For the pattern "AAABAAA", 
lps[] is [0, 1, 2, 0, 1, 2, 3]
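
The values above can be reproduced straight from the definition, without the linear-time construction. The sketch below tries every candidate length for every sub-pattern. It is deliberately slow and is meant only as a cross-check; LpsByDefinition and lpsBruteForce are hypothetical names.

```java
import java.util.Arrays;

// Illustrative sketch: a brute-force cross-check of the lps[] definition.
// LpsByDefinition and lpsBruteForce are hypothetical names.
class LpsByDefinition
{
    // lps[i] = length of the longest proper prefix of pat[0..i]
    // that is also a suffix of pat[0..i], found by trying every length.
    static int[] lpsBruteForce(String pat)
    {
        int M = pat.length();
        int lps[] = new int[M];
        for (int i = 0; i < M; i++)
        {
            String sub = pat.substring(0, i + 1);
            // Try candidate lengths from the longest (i) down to 1.
            // Length i+1 would be the whole string, so it is excluded.
            for (int len = i; len > 0; len--)
            {
                String suffix = sub.substring(sub.length() - len);
                if (sub.startsWith(suffix))
                {
                    lps[i] = len;
                    break;
                }
            }
        }
        return lps;
    }

    public static void main(String args[])
    {
        System.out.println(Arrays.toString(lpsBruteForce("AABAACAABAA")));
        // prints [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
    }
}
```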

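Searching Algorithm:

Unlike the Naive algorithm, where we slide the pattern by one position and compare all characters at each shift, we use a value from lps[] to decide the next characters to be matched. The idea is to never re-match a character that we know will match anyway.

How do we use lps[] to decide the next position (or to know the number of characters to be skipped)?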

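  • We start comparing pat[j], with j = 0, against the characters of the current window of text.
  • We keep matching characters txt[i] and pat[j], and keep incrementing i and j while pat[j] and txt[i] keep matching.
  • When we see a mismatch:
    – We know that characters pat[0..j-1] match txt[i-j..i-1] (note that j starts at 0 and is incremented only when there is a match).
    – We also know (from the definition above) that lps[j-1] is the count of characters of pat[0..j-1] that are both a proper prefix and a suffix.
    – From the above two points, we can conclude that we do not need to match the first lps[j-1] characters of the pattern against the last lps[j-1] characters of txt[i-j..i-1], because we know these characters will match anyway. Let us consider the above example to understand this.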
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
lps[] = {0, 1, 2, 3} 

i = 0, j = 0
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 1, j = 1
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 2, j = 2
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 3, j = 3
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 4, j = 4
Since j == M, print pattern found and reset j,
j = lps[j-1] = lps[3] = 3

Here unlike Naive algorithm, we do not match first three 
characters of this window. Value of lps[j-1] (in above 
step) gave us index of next character to match.
i = 4, j = 3
txt[] = "AAAAABAAABA" 
pat[] =  "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 5, j = 4
Since j == M, print pattern found and reset j,
j = lps[j-1] = lps[3] = 3

Again unlike Naive algorithm, we do not match first three 
characters of this window. Value of lps[j-1] (in above 
step) gave us index of next character to match.
i = 5, j = 3
txt[] = "AAAAABAAABA" 
pat[] =   "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[2] = 2

i = 5, j = 2
txt[] = "AAAAABAAABA" 
pat[] =    "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[1] = 1 

i = 5, j = 1
txt[] = "AAAAABAAABA" 
pat[] =     "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[0] = 0

i = 5, j = 0
txt[] = "AAAAABAAABA" 
pat[] =      "AAAA"
txt[i] and pat[j] do NOT match and j is 0, we do i++.

i = 6, j = 0
txt[] = "AAAAABAAABA" 
pat[] =       "AAAA"
txt[i] and pat[j] match, do i++ and j++

i = 7, j = 1
txt[] = "AAAAABAAABA" 
pat[] =       "AAAA"
txt[i] and pat[j] match, do i++ and j++

We continue this way...
// JAVA program for implementation of KMP pattern
// searching algorithm

class KMP_String_Matching
{
    void KMPSearch(String pat, String txt)
    {
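        int M = pat.length();
        int N = txt.length();

        // An empty pattern has nothing to match; returning early also
        // avoids indexing lps[0] and pat.charAt(0) on an empty string
        if (M == 0)
            return;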

        // create lps[] that will hold the longest
        // prefix suffix values for pattern
        int lps[] = new int[M];
        int j = 0;  // index for pat[]

        // Preprocess the pattern (calculate lps[]
        // array)
        computeLPSArray(pat,M,lps);

        int i = 0;  // index for txt[]
        while (i < N)
        {
            if (pat.charAt(j) == txt.charAt(i))
            {
                j++;
                i++;
            }
            if (j == M)
            {
                System.out.println("Found pattern "+
                              "at index " + (i-j));
                j = lps[j-1];
            }

            // mismatch after j matches
            else if (i < N && pat.charAt(j) != txt.charAt(i))
            {
                // Do not re-match pat[0..lps[j-1]-1]; those
                // characters will match anyway
                if (j != 0)
                    j = lps[j-1];
                else
                    i = i+1;
            }
        }
    }

    void computeLPSArray(String pat, int M, int lps[])
    {
        // length of the previous longest prefix suffix
        int len = 0;
        int i = 1;
        lps[0] = 0;  // lps[0] is always 0

        // the loop calculates lps[i] for i = 1 to M-1
        while (i < M)
        {
            if (pat.charAt(i) == pat.charAt(len))
            {
                len++;
                lps[i] = len;
                i++;
            }
            else  // (pat[i] != pat[len])
            {
                // This is tricky. Consider the example.
                // AAACAAAA and i = 7. The idea is similar 
                // to search step.
                if (len != 0)
                {
                    len = lps[len-1];

                    // Also, note that we do not increment
                    // i here
                }
                else  // if (len == 0)
                {
                    lps[i] = len;
                    i++;
                }
            }
        }
    }

    // Driver program to test above function
    public static void main(String args[])
    {
        String txt = "ABABDABACDABABCABAB";
        String pat = "ABABCABAB";
        new KMP_String_Matching().KMPSearch(pat,txt);
    }
}
// This code has been contributed by Amit Khandelwal.

Output:

Found pattern at index 10
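
For testing it is often handier to collect the match positions than to print them. The variant below is a sketch under that assumption: it follows the same search loop as KMPSearch but returns the indices in a List. The names KmpIndices and findAll are hypothetical, and treating an empty pattern as having no matches is a choice made here, not something the original article specifies.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: KmpIndices and findAll are hypothetical names.
class KmpIndices
{
    static List<Integer> findAll(String pat, String txt)
    {
        List<Integer> found = new ArrayList<>();
        int M = pat.length();
        int N = txt.length();
        if (M == 0)
            return found;  // assumption: empty pattern has no matches

        // Build lps[] with the same logic as computeLPSArray
        int lps[] = new int[M];
        int len = 0;
        int k = 1;
        while (k < M)
        {
            if (pat.charAt(k) == pat.charAt(len))
            {
                len++;
                lps[k] = len;
                k++;
            }
            else if (len != 0)
                len = lps[len - 1];
            else
            {
                lps[k] = 0;
                k++;
            }
        }

        int i = 0;  // index for txt
        int j = 0;  // index for pat
        while (i < N)
        {
            if (pat.charAt(j) == txt.charAt(i))
            {
                i++;
                j++;
                if (j == M)
                {
                    found.add(i - j);  // full match ends at i-1
                    j = lps[j - 1];
                }
            }
            else if (j != 0)
                j = lps[j - 1];        // fall back in the pattern only
            else
                i++;                   // nothing matched, advance the text
        }
        return found;
    }

    public static void main(String args[])
    {
        // The two examples from the start of the article
        System.out.println(findAll("TEST", "THIS IS A TEST TEXT"));
        // prints [10]
        System.out.println(findAll("AABA", "AABAACAADAABAAABAA"));
        // prints [0, 9, 13]
    }
}
```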

Preprocessing Algorithm:

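In the preprocessing part, we calculate the values in lps[]. To do that, we keep track of the length of the longest prefix suffix value for the previous index (we use the variable len for this purpose). We initialize lps[0] and len as 0. If pat[len] and pat[i] match, we increment len by 1 and assign the incremented value to lps[i]. If pat[i] and pat[len] do not match and len is not 0, we update len to lps[len-1]. If they do not match and len is 0, we set lps[i] to 0 and move on to the next i. See computeLPSArray() in the code above for details.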

Illustration of preprocessing (construction of lps[])

pat[] = "AAACAAAA"

len = 0, i  = 0.
lps[0] is always 0, we move 
to i = 1

len = 0, i  = 1.
Since pat[len] and pat[i] match, do len++, 
store it in lps[i] and do i++.
len = 1, lps[1] = 1, i = 2

len = 1, i  = 2.
Since pat[len] and pat[i] match, do len++, 
store it in lps[i] and do i++.
len = 2, lps[2] = 2, i = 3

len = 2, i  = 3.
Since pat[len] and pat[i] do not match, and len > 0, 
set len = lps[len-1] = lps[1] = 1

len = 1, i  = 3.
Since pat[len] and pat[i] do not match and len > 0, 
len = lps[len-1] = lps[0] = 0

len = 0, i  = 3.
Since pat[len] and pat[i] do not match and len = 0, 
Set lps[3] = 0 and i = 4.

len = 0, i  = 4.
Since pat[len] and pat[i] match, do len++, 
store it in lps[i] and do i++.
len = 1, lps[4] = 1, i = 5

len = 1, i  = 5.
Since pat[len] and pat[i] match, do len++, 
store it in lps[i] and do i++.
len = 2, lps[5] = 2, i = 6

len = 2, i  = 6.
Since pat[len] and pat[i] match, do len++, 
store it in lps[i] and do i++.
len = 3, lps[6] = 3, i = 7

len = 3, i  = 7.
Since pat[len] and pat[i] do not match and len > 0,
set len = lps[len-1] = lps[2] = 2

len = 2, i  = 7.
Since pat[len] and pat[i] match, do len++, 
store it in lps[i] and do i++.
len = 3, lps[7] = 3, i = 8

We stop here as we have constructed the whole lps[].
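
The walkthrough above can be replayed mechanically. The sketch below runs the same loop as computeLPSArray but records the (len, i) pair seen at the start of every iteration, so the recorded steps can be compared with the trace line by line. LpsTrace and computeLps are hypothetical names introduced for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: LpsTrace and computeLps are hypothetical names.
class LpsTrace
{
    // Same loop as computeLPSArray, but appends the (len, i) pair
    // seen at the start of every iteration to steps.
    static int[] computeLps(String pat, List<String> steps)
    {
        int M = pat.length();
        int lps[] = new int[M];
        int len = 0;
        int i = 1;  // lps[0] is always 0, so start at i = 1

        while (i < M)
        {
            steps.add("len = " + len + ", i = " + i);
            if (pat.charAt(i) == pat.charAt(len))
            {
                len++;
                lps[i] = len;
                i++;
            }
            else if (len != 0)
            {
                len = lps[len - 1];  // fall back; i is not incremented
            }
            else
            {
                lps[i] = 0;
                i++;
            }
        }
        return lps;
    }

    public static void main(String args[])
    {
        List<String> steps = new ArrayList<>();
        int lps[] = computeLps("AAACAAAA", steps);
        for (String step : steps)
            System.out.println(step);
        System.out.println(Arrays.toString(lps));
        // last line printed: [0, 1, 2, 0, 1, 2, 3, 3]
    }
}
```

Each iteration either advances i or shrinks len, and len can only shrink as often as it has grown, so the loop runs fewer than 2M times. That is why the preprocessing stays linear in the pattern length even though i is sometimes held in place.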