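Knuth-Morris-Pratt, KMP for short, is an improvement on naive string matching. For any input it finds a pattern of length m in a text of length n in linear time, O(n + m), so its running time never degrades.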
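Given two strings, strings and sub_s, the task is to decide whether strings contains sub_s and, if so, to return the position where it occurs. The brute-force algorithm works as follows: compare strings[0] with sub_s[0]; if they are equal, move on to the next pair of characters. When a mismatch occurs, discard everything learned from the characters matched so far, shift sub_s one position, and start over by comparing strings[1] with sub_s[0]. Repeat until the main string is exhausted or a full match is found. Throwing away the earlier match information makes this approach inefficient: its worst-case time complexity is O(nm), where n is the length of strings and m is the length of sub_s.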
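The brute-force procedure described above can be sketched roughly as follows. This is a minimal illustration rather than a reference implementation; the function name `brute_force_search` is an assumption made for this example, and positions are 0-based as in the strings[0] notation.

```python
def brute_force_search(strings: str, sub_s: str) -> int:
    """Return the 0-based index of the first occurrence of sub_s, or -1."""
    n, m = len(strings), len(sub_s)
    for start in range(n - m + 1):
        matched = 0
        # Compare character by character from the current start position.
        while matched < m and strings[start + matched] == sub_s[matched]:
            matched += 1
        if matched == m:
            return start
        # On a mismatch, everything learned from the `matched` characters is
        # thrown away and the next attempt starts from scratch at start + 1.
    return -1


print(brute_force_search("BABCBABCABCAABCABCABCACAB", "ABCABCACAB"))  # 15
```

The inner loop may re-read up to m characters for each of the n start positions, which is where the O(nm) worst case comes from.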
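Compared with brute-force matching, KMP introduces a partial match table, from which the number of positions to shift sub_s after a mismatch can be read. The following example illustrates this (positions in the tables are numbered from 1):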
position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
strings | B | A | B | C | B | A | B | C | A | B | C | A | A | B | C | A | B | C | A | B | C | A | C | A | B |
sub_s | A | B | C | A | B | C | A | C | A | B |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
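1. First compare the first character of strings with the first character of sub_s. They do not match (B versus A), so shift sub_s one position to the right.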
position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
strings | B | A | B | C | B | A | B | C | A | B | C | A | A | B | C | A | B | C | A | B | C | A | C | A | B |
sub_s |   | A | B | C | A | B | C | A | C | A | B |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
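2. Now A matches A, so compare the following characters. The next two characters also match, but at position 5 of strings there is a mismatch (B in strings versus A in sub_s). The most natural reaction is to shift sub_s one position to the right (as shown below) and compare again one character at a time, starting from sub_s[0] and strings[2]. Doing so re-compares positions that have already been examined, which wastes a great deal of work.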
position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
strings | B | A | B | C | B | A | B | C | A | B | C | A | A | B | C | A | B | C | A | B | C | A | C | A | B |
sub_s |   |   | A | B | C | A | B | C | A | C | A | B |   |   |   |   |   |   |   |   |   |   |   |   |   |
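3. When B and A fail to match at position 5, we already know that the three characters just matched are "ABC". From this information we can work out how far to shift sub_s, so positions that have already been compared need not be compared again, which improves efficiency. Shift distance = number of matched characters - partial match value of the last matched character.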
4. For this example, the partial match table of sub_s is as follows:
j | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
sub_s[j] | A | B | C | A | B | C | A | C | A | B |
partial match value | 0 | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 | 2 |
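5. When B and A fail to match at position 5, the three preceding characters "ABC" have matched. The partial match value of the last matched character, C, is 0. Applying shift distance = number of matched characters - partial match value gives a shift of 3 (3 - 0 = 3).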
position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
strings | B | A | B | C | B | A | B | C | A | B | C | A | A | B | C | A | B | C | A | B | C | A | C | A | B |
sub_s |   |   |   |   | A | B | C | A | B | C | A | C | A | B |   |   |   |   |   |   |   |   |   |   |   |
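The shift rule used in step 5 can be written as a tiny helper. This is only a sketch: `shift_after_mismatch` and `PARTIAL_MATCH` are names invented for this example, and the table is simply hard-coded for sub_s = "ABCABCACAB".

```python
from typing import Sequence

# Partial match values for sub_s = "ABCABCACAB" (list index = j - 1).
# The entry for "ABCABCA" is 4 because "ABCA" is both a prefix and a suffix of it.
PARTIAL_MATCH = [0, 0, 0, 1, 2, 3, 4, 0, 1, 2]


def shift_after_mismatch(matched: int, table: Sequence[int]) -> int:
    """Positions to move sub_s right after `matched` characters have matched."""
    if matched == 0:
        return 1  # nothing matched, so simply try the next position
    # Shift distance = matched characters - partial match value of the last one.
    return matched - table[matched - 1]


print(shift_after_mismatch(3, PARTIAL_MATCH))  # 3, as in step 5 (3 - 0)
print(shift_after_mismatch(0, PARTIAL_MATCH))  # 1
```

The special case for zero matched characters covers situations like step 1, where the very first comparison fails and there is no last matched character to look up.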
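6. B and A do not match and no characters have matched, so shift one position to the right (shown below). Comparing one character at a time, positions 6 to 12 of strings match the first seven characters of sub_s, "ABCABCA", and then A at position 13 fails to match C. The partial match value of "ABCABCA" is 4, because "ABCA" is both a prefix and a suffix of it, so sub_s shifts right by 3 positions (7 - 4 = 3).
position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
strings | B | A | B | C | B | A | B | C | A | B | C | A | A | B | C | A | B | C | A | B | C | A | C | A | B |
sub_s |   |   |   |   |   | A | B | C | A | B | C | A | C | A | B |   |   |   |   |   |   |   |   |   |   |
7. After shifting right by 3 positions, sub_s starts at position 9 (shown below). Its first four characters "ABCA" are already known to match positions 9 to 12, so comparison resumes at position 13, where A fails to match B; the shift is 3 (4 - 1 = 3). Continue in the same way until the last character of strings is reached: if a complete match is found the search succeeds, otherwise it fails. The remaining steps are not repeated in detail here; in this example they shift sub_s by 1 and then by 3, and the complete match is found with sub_s starting at position 16.
position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
strings | B | A | B | C | B | A | B | C | A | B | C | A | A | B | C | A | B | C | A | B | C | A | C | A | B |
sub_s |   |   |   |   |   |   |   |   | A | B | C | A | B | C | A | C | A | B |   |   |   |   |   |   |   |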
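To check a walkthrough like this by machine, the whole search can be sketched in a few lines. The names `build_table` and `kmp_alignments` are assumptions made for this example. The table is built with the usual linear-time construction, and the search records every 1-based position at which sub_s is tried, so the output can be compared with the figures.

```python
def build_table(sub_s):
    """table[j] = length of the longest proper prefix of sub_s[:j + 1]
    that is also a suffix of it (the partial match value)."""
    table = [0] * len(sub_s)
    k = 0  # length of the prefix currently being extended
    for j in range(1, len(sub_s)):
        while k > 0 and sub_s[j] != sub_s[k]:
            k = table[k - 1]  # fall back to a shorter candidate prefix
        if sub_s[j] == sub_s[k]:
            k += 1
        table[j] = k
    return table


def kmp_alignments(strings, sub_s):
    """Return (index, alignments): the 0-based index of the first match (or -1)
    and the 1-based start positions at which sub_s was tried."""
    table = build_table(sub_s)
    n, m = len(strings), len(sub_s)
    start, matched = 0, 0
    alignments = []
    while start + m <= n:
        alignments.append(start + 1)
        while matched < m and strings[start + matched] == sub_s[matched]:
            matched += 1
        if matched == m:
            return start, alignments
        if matched == 0:
            start += 1  # first character failed: move one position
        else:
            keep = table[matched - 1]
            start += matched - keep  # shift = matched - partial match value
            matched = keep           # these characters need no re-comparison
    return -1, alignments


print(kmp_alignments("BABCBABCABCAABCABCABCACAB", "ABCABCACAB"))
# (15, [1, 2, 5, 6, 9, 12, 13, 16])
```

Because `matched` is carried over after each shift, the search never moves backwards in strings, which is what keeps the running time linear.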
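8. Computing the partial match values
A "prefix" of a string is any leading part of it that excludes the last character; a "suffix" is any trailing part that excludes the first character. The partial match value is the length of the longest element shared by the set of prefixes and the set of suffixes. Take "ABCABCACAB" as an example: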
sub_s[1..j] | prefixes | suffixes | longest common element: length |
A | [ ] | [ ] | [ ]: 0 |
AB | [A] | [B] | [ ]: 0 |
ABC | [A, AB] | [BC, C] | [ ]: 0 |
ABCA | [A, AB, ABC] | [BCA, CA, A] | [A]: 1 |
ABCAB | [A, AB, ABC, ABCA] | [BCAB, CAB, AB, B] | [AB]: 2 |
ABCABC | [A, AB, ABC, ABCA, ABCAB] | [BCABC, CABC, ABC, BC, C] | [ABC]: 3 |
ABCABCA | [A, AB, ABC, ABCA, ABCAB, ABCABC] | [BCABCA, CABCA, ABCA, BCA, CA, A] | [ABCA]: 4 |
ABCABCAC | [A, AB, ABC, ABCA, ABCAB, ABCABC, ABCABCA] | [BCABCAC, CABCAC, ABCAC, BCAC, CAC, AC, C] | [ ]: 0 |
ABCABCACA | [A, AB, ABC, ABCA, ABCAB, ABCABC, ABCABCA, ABCABCAC] | [BCABCACA, CABCACA, ABCACA, BCACA, CACA, ACA, CA, A] | [A]: 1 |
ABCABCACAB | [A, AB, ABC, ABCA, ABCAB, ABCABC, ABCABCA, ABCABCAC, ABCABCACA] | [BCABCACAB, CABCACAB, ABCACAB, BCACAB, CACAB, ACAB, CAB, AB, B] | [AB]: 2 |
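The definition above translates almost literally into code. The sketch below builds the prefix and suffix sets exactly as in the table and takes the longest shared element. The names `partial_match_value` and `partial_match_table` are assumptions made for this example, and this direct approach favours clarity over speed (practical implementations usually build the table in linear time instead).

```python
def partial_match_value(s):
    """Length of the longest string that is both a prefix and a suffix of s,
    where prefixes exclude the last character and suffixes exclude the first."""
    prefixes = {s[:i] for i in range(1, len(s))}  # A, AB, ABC, ...
    suffixes = {s[i:] for i in range(1, len(s))}  # ..., AB, B
    common = prefixes & suffixes
    return max((len(c) for c in common), default=0)


def partial_match_table(sub_s):
    """Partial match value for every leading part sub_s[:1], sub_s[:2], ..."""
    return [partial_match_value(sub_s[:j + 1]) for j in range(len(sub_s))]


print(partial_match_table("ABCABCACAB"))  # [0, 0, 0, 1, 2, 3, 4, 0, 1, 2]
```

Note that "ABCABCA" shares two elements between its prefix and suffix sets, "A" and "ABCA"; the partial match value uses the longer one.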