Original article: Searching for Patterns | Set 2 (KMP Algorithm)
Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char txt[]) that prints all occurrences of pat[] in txt[].
Examples:
Input: txt[] = "THIS IS A TEST TEXT"
pat[] = "TEST"
Output: Pattern found at index 10
Input: txt[] = "AABAACAADAABAAABAA"
pat[] = "AABA"
Output: Pattern found at index 0
Pattern found at index 9
Pattern found at index 13
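Pattern searching is an important problem in computer science. When we search for a string in Notepad/Word, in a browser, or in a database, pattern searching algorithms are used to show the search results.
We discussed the Naive pattern searching algorithm in the previous post. The worst-case time complexity of the Naive algorithm is O(m(n-m+1)). The worst-case time complexity of the KMP algorithm is O(n) (O(m) to preprocess the pattern plus O(n) to scan the text, with m ≤ n).
KMP (Knuth Morris Pratt) Pattern Searching
The Naive pattern searching algorithm does not work well when many matching characters are followed by a mismatching character. Here are some examples.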
txt[] = "AAAAAAAAAAAAAAAAAB"
pat[] = "AAAAB"
txt[] = "ABABABCABABABCABABABC"
pat[] = "ABABAC" (not a worst case, but a bad case for Naive)
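To make the cost of these cases concrete, here is a minimal sketch of the Naive search that also counts character comparisons. The class name NaiveSearchSketch and the comparisons counter are hypothetical names introduced for this illustration; they are not part of the original article.

```java
import java.util.ArrayList;
import java.util.List;

class NaiveSearchSketch {
    // Number of character comparisons made by the most recent call to search().
    // (Hypothetical helper field, added only to illustrate the cost.)
    static long comparisons;

    // Returns every index at which pat occurs in txt.
    // Each window of the text is checked from scratch.
    static List<Integer> search(String pat, String txt) {
        int m = pat.length();
        int n = txt.length();
        List<Integer> hits = new ArrayList<>();
        comparisons = 0;
        for (int s = 0; s + m <= n; s++) {
            int j = 0;
            while (j < m) {
                comparisons++;
                if (txt.charAt(s + j) != pat.charAt(j)) {
                    break; // mismatch: give up on this window
                }
                j++;
            }
            if (j == m) {
                hits.add(s);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String txt = "AAAAAAAAAAAAAAAAAB";
        String pat = "AAAAB";
        System.out.println(search(pat, txt) + " after " + comparisons + " comparisons");
    }
}
```

For the first example every window needs all m comparisons before it is accepted or rejected, so the counter reaches exactly m(n-m+1), the worst-case bound quoted earlier.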
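The KMP matching algorithm uses the degenerating property of the pattern (the same sub-patterns appear more than once in the pattern) and improves the worst-case complexity to O(n). The basic idea behind KMP is: whenever we detect a mismatch (after some matches), we already know some of the characters in the text of the next window. We take advantage of this information to avoid re-matching characters that we know will match anyway. Consider the following example to understand this.
Matching overview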
txt = "AAAAABAAABA"
pat = "AAAA"
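First we compare the first window of txt with pat
txt = "AAAAABAAABA"
pat = "AAAA" [Initial position]
We find a match. This is the same as Naive string matching.
In the next step, we compare the next window of txt with pat.
txt = "AAAAABAAABA"
pat =  "AAAA" [Pattern shifted one position]
This is where KMP optimizes over Naive. In this second window, we only compare the fourth A of the pattern with the fourth character of the current window of text to decide whether the current window matches. Since we know the first three characters will match anyway, we skip matching them.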
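Need of preprocessing?
The explanation above raises an important question: how do we know how many characters can be skipped? To answer it, we preprocess the pattern and prepare an integer array lps[] that tells us the number of characters to skip.
Preprocessing overview:
- The KMP algorithm preprocesses pat[] and constructs an auxiliary array lps[] of size m (the same as the size of the pattern), which is used to skip characters while matching.
- The name lps indicates the longest proper prefix which is also a suffix. A proper prefix is a prefix that is not the whole string. For example, the prefixes of “ABC” are “”, “A”, “AB” and “ABC”. The proper prefixes are “”, “A” and “AB”. The suffixes of the string are “”, “C”, “BC” and “ABC”.
- For each sub-pattern pat[0..i], where i = 0 to m-1, lps[i] stores the length of the longest proper prefix that is also a suffix of the sub-pattern pat[0..i].
lps[i] = the length of the longest proper prefix of pat[0..i]
which is also a suffix of pat[0..i].
Note: lps[i] could also be defined as the longest prefix which is also a proper suffix. The word "proper" is needed in one of the two places so that the whole sub-pattern itself is not counted.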
Examples of lps[] construction:
For the pattern “AAAA”,
lps[] is [0, 1, 2, 3]
For the pattern “ABCDE”,
lps[] is [0, 0, 0, 0, 0]
For the pattern “AABAACAABAA”,
lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
For the pattern “AAACAAAAAC”,
lps[] is [0, 1, 2, 0, 1, 2, 3, 3, 3, 4]
For the pattern “AAABAAA”,
lps[] is [0, 1, 2, 0, 1, 2, 3]
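The examples above can be checked directly against the definition. The following is a deliberately slow sketch that, for each i, tries every proper prefix of pat[0..i] from longest to shortest. The class name LpsByDefinition is a hypothetical name used only here, and this is not the linear-time construction that KMP itself uses.

```java
import java.util.Arrays;

class LpsByDefinition {
    // Computes lps[] straight from the definition:
    // lps[i] = length of the longest proper prefix of pat[0..i]
    //          that is also a suffix of pat[0..i].
    static int[] lps(String pat) {
        int[] lps = new int[pat.length()];
        for (int i = 0; i < pat.length(); i++) {
            String sub = pat.substring(0, i + 1);
            // k <= i keeps the prefix proper (strictly shorter than sub).
            for (int k = i; k > 0; k--) {
                if (sub.endsWith(sub.substring(0, k))) {
                    lps[i] = k;
                    break;
                }
            }
        }
        return lps;
    }

    public static void main(String[] args) {
        // prints [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
        System.out.println(Arrays.toString(lps("AABAACAABAA")));
    }
}
```

Because it rebuilds and compares substrings for every candidate length, this version is only suitable for verifying small examples by hand.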
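Searching algorithm:
Unlike the Naive algorithm, where we slide the pattern by one position and compare all characters at each shift, we use a value from lps[] to decide the next characters to be matched. The idea is to not match a character that we know will match anyway.
How do we use lps[] to decide the next positions (or to know the number of characters to skip)?
- We start comparing pat[j], with j = 0, against the characters of the current window of text.
- We keep matching characters txt[i] and pat[j], and keep incrementing i and j while pat[j] and txt[i] keep matching.
- When we see a mismatch
– We know that characters pat[0..j-1] match txt[i-j…i-1] (note that j starts at 0 and is incremented only when there is a match).
– We also know (from the definition above) that lps[j-1] is the number of characters of pat[0…j-1] that are both a proper prefix and a suffix.
– From the two points above, we can conclude that we do not need to match the first lps[j-1] characters of the pattern against the last lps[j-1] characters of txt[i-j…i-1], because we know they will match anyway. Let us consider the example above to understand this.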
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
lps[] = {0, 1, 2, 3}
i = 0, j = 0
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++
i = 1, j = 1
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++
i = 2, j = 2
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++
i = 3, j = 3
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++
i = 4, j = 4
Since j == M, print pattern found and reset j,
j = lps[j-1] = lps[3] = 3
Here unlike Naive algorithm, we do not match first three
characters of this window. Value of lps[j-1] (in above
step) gave us index of next character to match.
i = 4, j = 3
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++
i = 5, j = 4
Since j == M, print pattern found and reset j,
j = lps[j-1] = lps[3] = 3
Again unlike Naive algorithm, we do not match first three
characters of this window. Value of lps[j-1] (in above
step) gave us index of next character to match.
i = 5, j = 3
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[2] = 2
i = 5, j = 2
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[1] = 1
i = 5, j = 1
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[0] = 0
i = 5, j = 0
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do NOT match and j is 0, we do i++.
i = 6, j = 0
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++ and j++
i = 7, j = 1
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++ and j++
We continue this way...
// JAVA program for implementation of KMP pattern
// searching algorithm
class KMP_String_Matching
{
    void KMPSearch(String pat, String txt)
    {
        int M = pat.length();
        int N = txt.length();

        // create lps[] that will hold the longest
        // prefix suffix values for pattern
        int lps[] = new int[M];
        int j = 0; // index for pat[]

        // Preprocess the pattern (calculate lps[]
        // array)
        computeLPSArray(pat, M, lps);

        int i = 0; // index for txt[]
        while (i < N)
        {
            if (pat.charAt(j) == txt.charAt(i))
            {
                j++;
                i++;
            }
            if (j == M)
            {
                System.out.println("Found pattern " +
                                   "at index " + (i - j));
                j = lps[j - 1];
            }
            // mismatch after j matches
            else if (i < N && pat.charAt(j) != txt.charAt(i))
            {
                // Do not match lps[0..lps[j-1]] characters,
                // they will match anyway
                if (j != 0)
                    j = lps[j - 1];
                else
                    i = i + 1;
            }
        }
    }

    void computeLPSArray(String pat, int M, int lps[])
    {
        // length of the previous longest prefix suffix
        int len = 0;
        int i = 1;
        lps[0] = 0; // lps[0] is always 0

        // the loop calculates lps[i] for i = 1 to M-1
        while (i < M)
        {
            if (pat.charAt(i) == pat.charAt(len))
            {
                len++;
                lps[i] = len;
                i++;
            }
            else // (pat[i] != pat[len])
            {
                // This is tricky. Consider the example.
                // AAACAAAA and i = 7. The idea is similar
                // to search step.
                if (len != 0)
                {
                    len = lps[len - 1];
                    // Also, note that we do not increment
                    // i here
                }
                else // if (len == 0)
                {
                    lps[i] = len;
                    i++;
                }
            }
        }
    }

    // Driver program to test above function
    public static void main(String args[])
    {
        String txt = "ABABDABACDABABCABAB";
        String pat = "ABABCABAB";
        new KMP_String_Matching().KMPSearch(pat, txt);
    }
}
// This code has been contributed by Amit Khandelwal.
Output:
Found pattern at index 10
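Preprocessing algorithm:
In the preprocessing part, we calculate the values in lps[]. To do that, we keep track of the length of the longest prefix suffix value for the previous index (we use the variable len for this purpose). We initialize lps[0] and len as 0. If pat[len] and pat[i] match, we increment len by 1 and assign the incremented value to lps[i]. If pat[i] and pat[len] do not match and len is not 0, we update len to lps[len-1]. If they do not match and len is 0, we set lps[i] to 0 and move on to the next i. See computeLPSArray() in the code above for details.
Illustration of preprocessing (construction of lps[])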
pat[] = "AAACAAAA"
len = 0, i = 0.
lps[0] is always 0, we move
to i = 1
len = 0, i = 1.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 1, lps[1] = 1, i = 2
len = 1, i = 2.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 2, lps[2] = 2, i = 3
len = 2, i = 3.
Since pat[len] and pat[i] do not match, and len > 0,
set len = lps[len-1] = lps[1] = 1
len = 1, i = 3.
Since pat[len] and pat[i] do not match and len > 0,
len = lps[len-1] = lps[0] = 0
len = 0, i = 3.
Since pat[len] and pat[i] do not match and len = 0,
Set lps[3] = 0 and i = 4.
len = 0, i = 4.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 1, lps[4] = 1, i = 5
len = 1, i = 5.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 2, lps[5] = 2, i = 6
len = 2, i = 6.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 3, lps[6] = 3, i = 7
len = 3, i = 7.
Since pat[len] and pat[i] do not match and len > 0,
set len = lps[len-1] = lps[2] = 2
len = 2, i = 7.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 3, lps[7] = 3, i = 8
We stop here as we have constructed the whole lps[].
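With both the search steps and the lps[] construction traced, the two pieces can be combined into a compact variant that returns the match positions instead of printing them, which makes the examples from this article easy to check. This is only a sketch under a few assumptions: the names KmpSketch, buildLps and search are hypothetical, and returning an empty list for an empty pattern is a choice made here, not something the original article specifies.

```java
import java.util.ArrayList;
import java.util.List;

class KmpSketch {
    // Linear-time construction of lps[], same idea as the trace above:
    // on a mismatch fall back to lps[len-1] without advancing i.
    static int[] buildLps(String pat) {
        int[] lps = new int[pat.length()];
        int len = 0;
        for (int i = 1; i < pat.length(); i++) {
            while (len > 0 && pat.charAt(i) != pat.charAt(len)) {
                len = lps[len - 1];
            }
            if (pat.charAt(i) == pat.charAt(len)) {
                len++;
            }
            lps[i] = len;
        }
        return lps;
    }

    // Returns every index at which pat occurs in txt.
    static List<Integer> search(String pat, String txt) {
        List<Integer> hits = new ArrayList<>();
        int m = pat.length();
        int n = txt.length();
        if (m == 0 || m > n) {
            return hits; // assumption: an empty pattern yields no matches
        }
        int[] lps = buildLps(pat);
        int j = 0; // number of pattern characters matched so far
        for (int i = 0; i < n; i++) {
            // On a mismatch, shrink j using lps[]; i never moves backwards.
            while (j > 0 && txt.charAt(i) != pat.charAt(j)) {
                j = lps[j - 1];
            }
            if (txt.charAt(i) == pat.charAt(j)) {
                j++;
            }
            if (j == m) {
                hits.add(i - m + 1);
                j = lps[j - 1]; // keep the overlap for the next window
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // prints [0, 9, 13]
        System.out.println(search("AABA", "AABAACAADAABAAABAA"));
    }
}
```

The text index i only ever moves forward, and every fallback strictly decreases j, which is why the scan stays linear in the length of the text.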