The KMP Algorithm

11D_Beyonder

Category: Data Structures and Algorithms; Tags: Algorithms

Article link: https://blog.csdn.net/weixin_45775050/article/details/122974435

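This article discusses the shortcoming of the naive pattern-matching algorithm, namely its high time complexity of O((n−m+1)×m). It then introduces the KMP algorithm, which builds a Next array to avoid unnecessary backtracking and improves the running time to O(n+m). The Next array records the longest equal prefix and suffix of the pattern, so that on a mismatch the already-matched part can be skipped directly. KMP is widely used in string-searching problems, for example in text processing and search algorithms.

Summary generated automatically by CSDN.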

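The Naive Pattern-Matching Algorithm

Let the main string be $S=\text{“abcdefgab”}$ and the pattern to match be $T=\text{“abcdex”}$. Define two pointers $i$ and $j$ that index into $S$ and $T$ respectively (both strings are 1-indexed).
Initially $i$ and $j$ point to the heads of the two strings, which are aligned at their first characters, and both pointers advance in step. At $i = j = 6$ we find $S[i]\ne T[j]$, so $T$ is shifted one position to the right relative to $S$: $i$ goes back to $2$ and $j$ returns to $1$. Now the very first character of $T$ fails to match, so $T$ is shifted again, and the process continues in the same way.
In the worst case the naive algorithm takes $O((n-m+1)\times m)$ time, where $n=|S|$ and $m=|T|$, which is unacceptable for long inputs.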
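The procedure above can be sketched as follows. This is only a minimal illustration, not code from the original post: the function name `naive_find` and the `comparisons` out-parameter are hypothetical, and the strings are 0-indexed internally while the returned position is 1-based.

```cpp
#include <string>

// Naive matching: try every alignment of t against s.
// Returns the 1-based position of the first occurrence of t in s, or 0 if none.
// `comparisons` (hypothetical out-parameter) receives the number of
// character comparisons that were performed.
int naive_find(const std::string& s, const std::string& t, long long& comparisons) {
    int n = (int)s.size(), m = (int)t.size();
    comparisons = 0;
    for (int start = 0; start + m <= n; ++start) {   // n-m+1 alignments
        int j = 0;
        while (j < m) {                              // up to m comparisons each
            ++comparisons;
            if (s[start + j] != t[j]) break;         // mismatch: shift t by one
            ++j;
        }
        if (j == m) return start + 1;                // all m characters matched
    }
    return 0;
}
```

For $S=\text{“abcdefgab”}$ and $T=\text{“abcdex”}$, `naive_find` returns 0 after 9 comparisons: 6 for the first alignment and 1 for each of the next three.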

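Overview of the KMP Algorithm

In computer science, the $\text{Knuth-Morris-Pratt}$ string-searching algorithm ($\text{KMP}$ for short) finds the positions at which a word $W$ occurs in a main text string $S$. It exploits the observation that, when a mismatch occurs, the word itself carries enough information to determine where the next match could begin, and so it avoids re-examining characters that have already been matched.
The algorithm was conceived in 1974 by Donald Knuth and Vaughan Pratt and was independently designed in the same year by James H. Morris; the three published it jointly in 1977.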

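The Waste in the Naive Method

In the naive matching example above, the first letter $\text{‘a’}$ of $T$ differs from every character in the rest of the pattern, $\text{“bcdex”}$. Since step ① (the first alignment) already established that the first five characters of $T$ and $S$ are equal, the first character of $T$ cannot equal any of the characters of $S$ in positions $2$ through $5$. The comparisons in steps ②③④⑤, which align the head of $T$ with those positions, are therefore redundant.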

Using Information Already Obtained

Consider another example: $S=\text{“abcababca”}$, $T=\text{“abcabx”}$.
[Figure unavailable: matching steps for this example. Source image: https://i.niupic.com/images/2020/04/02/7ges.jpg]
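In the first pass the first $5$ characters match and a mismatch occurs at $i = j = 6$. We already know that $S[1, 2, 3, 4, 5] = T[1, 2, 3, 4, 5]$ and that $T[1]$ differs from both the second and the third character of $T$, so steps ② and ③ are redundant and $T$ can be shifted right by a large step, to $i = 4, j = 1$, before matching resumes. But we can also see that $S[4, 5] = T[4, 5] = T[1, 2]$, so $T[1, 2]$ and $S[4, 5]$ are certain to match. We can therefore set $i = 6, j = 3$ directly, which makes steps ④ and ⑤ redundant as well.
In the naive pattern-matching algorithm the main-string pointer $i$ backtracks repeatedly, whereas the process above shows that this backtracking can be avoided entirely. This is the essence of the $\text{KMP}$ algorithm: by exploiting the similarity between prefixes and suffixes of the pattern, together with the information gained from the characters already matched, it stops the pointer from jumping back and forth.
The time complexity of the $\text{KMP}$ algorithm is $O(n + m)$.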

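Defining the Array $\text{Next}[j]$

Definition

In general, $\text{Next}[j]$ is the length of the longest proper prefix of the first $j - 1$ characters of $T$ that is also a suffix of those characters, plus one. By convention, $\text{Next}[1] = 0$.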
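To make the definition concrete, the sketch below computes $\text{Next}$ by checking the definition directly. It is a deliberately slow illustration (cubic time in the worst case), and the name `next_by_definition` is hypothetical. The returned vector is 1-based, with index 0 unused.

```cpp
#include <string>
#include <vector>

// Computes Next[1..m+1] straight from the definition.
// Index 0 of the result is unused; Next[1] = 0 by convention.
std::vector<int> next_by_definition(const std::string& t) {
    int m = (int)t.size();
    std::vector<int> next(m + 2, 0);
    for (int j = 2; j <= m + 1; ++j) {
        int len = j - 1;                 // length of the prefix t[1..j-1]
        int best = 0;                    // longest proper prefix == suffix
        for (int k = 1; k < len; ++k) {
            // prefix of length k versus suffix of length k (0-based substr)
            if (t.substr(0, k) == t.substr(len - k, k)) best = k;
        }
        next[j] = best + 1;
    }
    return next;
}
```

For the pattern $\text{“abcabx”}$ this yields $\text{Next}[1..7] = 0, 1, 1, 1, 2, 3, 1$.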

Computing the Array $\text{Next}[j]$

Here is an example of computing $\text{Next}[j]$.

j	1	2	3	4	5	6	7
Pattern T	a	b	c	a	b	x	(end)
Next[j]	0	1	1	1	2	3	1

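Take $j = 6$: since $T[1, 2] = T[4, 5]$, we get $\text{Next}[6] = 3$. When a mismatch occurs at $j = 6$ in the earlier example, the head of $T$ moves to $S[4]$; $T[1, 2]$ and $S[4, 5]$ are then certain to match, so we move $j$ directly to $j = 3$, that is, $j=\text{Next}[j]$. This step is called falling back.

Template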

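// Assumes globals declared elsewhere: char t[] (pattern, 1-indexed),
// int lt (pattern length), int Next[] (at least lt+2 entries).
void get_Next()
{
	int i=1,j=0;
	Next[1]=0;
	while(i<=lt)
	{
		// t[i] is the end of the suffix, t[j] is the end of the prefix
		if(j==0||t[i]==t[j])
		{
			i++;
			j++;
			Next[i]=j; // the last iteration also fills Next[lt+1]
		}
		else j=Next[j]; // fall back
	}
}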
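As a rough, self-contained illustration of how the template is used, the sketch below wraps it in a function and adds a first-occurrence search that records every fallback. The names `compute_next` and `kmp_find_first` and the `fallbacks` out-parameter are hypothetical and not part of the original template; both strings are shifted to 1-based indexing to match the text.

```cpp
#include <string>
#include <utility>
#include <vector>

// Linear-time Next computation following the template (index 0 unused).
std::vector<int> compute_next(const std::string& pat) {
    int m = (int)pat.size();
    std::string t = " " + pat;              // shift to 1-based indexing
    std::vector<int> next(m + 2, 0);        // next[1] = 0
    int i = 1, j = 0;
    while (i <= m) {
        if (j == 0 || t[i] == t[j]) { ++i; ++j; next[i] = j; }
        else j = next[j];                   // fall back
    }
    return next;
}

// Returns the 1-based position of the first occurrence of pat in text,
// or 0 if there is none. `fallbacks` (hypothetical out-parameter) collects
// the pair (i, new j) each time j falls back; i itself never decreases.
int kmp_find_first(const std::string& text, const std::string& pat,
                   std::vector<std::pair<int, int>>& fallbacks) {
    int n = (int)text.size(), m = (int)pat.size();
    std::vector<int> next = compute_next(pat);
    std::string s = " " + text, t = " " + pat;
    int i = 1, j = 1;
    while (i <= n && j <= m) {
        if (j == 0 || s[i] == t[j]) { ++i; ++j; }
        else { j = next[j]; fallbacks.push_back({i, j}); }
    }
    return j > m ? i - m : 0;
}
```

With $S=\text{“abcababca”}$ and $T=\text{“abcabx”}$ the search reports no occurrence, and the recorded fallbacks are $(6, 3)$ followed by $(6, 1)$: the text pointer stays at $i = 6$ while $j$ first jumps to $3$ as described above.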

Searching for a Substring

HDU 1686 Oulipo

Problem Description

The French author Georges Perec (1936–1982) once wrote a book, La disparition, without the letter ‘e’. He was a member of the Oulipo group. A quote from the book:
Tout avait l’air normal, mais tout s’affirmait faux. Tout avait l’air normal, d’abord, puis surgissait l’inhumain, l’affolant. Il aurait voulu savoir où s’articulait l’association qui l’unissait au roman : sur son tapis, assaillant à tout instant son imagination, l’intuition d’un tabou, la vision d’un mal obscur, d’un quoi vacant, d’un non-dit : la vision, l’avision d’un oubli commandant tout, où s’abolissait la raison : tout avait l’air normal mais…
Perec would probably have scored high (or rather, low) in the following contest. People are asked to write a perhaps even meaningful text on some subject with as few occurrences of a given “word” as possible. Our task is to provide the jury with a program that counts these occurrences, in order to obtain a ranking of the competitors. These competitors often write very long texts with nonsense meaning; a sequence of $500000$ consecutive ‘T’s is not unusual. And they never use spaces.
So we want to quickly find out how often a word, i.e., a given string, occurs in a text. More formally: given the alphabet $\{A, B, C, \dots, Z\}$ and two finite strings over that alphabet, a word $W$ and a text $T$, count the number of occurrences of $W$ in $T$. All the consecutive characters of $W$ must exactly match consecutive characters of $T$. Occurrences may overlap.

Input

The first line of the input file contains a single number: the number of test cases to follow. Each test case has the following format:
One line with the word $W$, a string over $\{A, B, C, \dots, Z\}$, with $1 \leq |W| \leq 10000$ (here $|W|$ denotes the length of the string $W$).
One line with the text $T$, a string over $\{A, B, C, \dots, Z\}$, with $|W| \leq |T| \leq 1000000$.

Output

For every test case in the input file, the output should contain a single number, on a single line: the number of occurrences of the word $W$ in the text $T$ .

Sample Input

3
BAPC
BAPC
AZA
AZAZAZA
VERDI
AVERDXIVYERDIAN

Sample Output

1
3
0

Translation

Count the number of occurrences of the substring $W$ in the string $T$; occurrences may overlap.

Idea

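Follow the simulation of the $\text{KMP}$ algorithm described above, skipping the unnecessary backtracking. After each full match, treat it as a mismatch and fall back with $\text{Next}$ so that overlapping occurrences are also counted.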

Code

#include<iostream>
#include<cstring>
#include<cstdio>
#define N 10003
using namespace std;
char a[N*100],b[N];
int Next[N];
int la,lb;
/*------compute the Next array------*/
void get_Next()
{
	int i=1,j=0;
	Next[1]=0;
	while(i<=lb)
	{
		if(j==0||b[i]==b[j])
		{
			i++;
			j++;
			Next[i]=j;
		}
		else j=Next[j];
	}
}
/*------------matching------------*/
int kmp()
{
	int i=1,j=1;
	int ans=0;
	while(i<=la&&j<=lb)
	{
		if(j==0||a[i]==b[j])
		{
			i++;
			j++;
		}
		else j=Next[j];
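		if(j>lb) // j is one past the end of the pattern: an occurrence was found
		{
			// treat the full match as a mismatch and keep searching,
			// so that overlapping occurrences are counted
			j=Next[j];
			ans++;
		}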
	}
	return ans;
}
void solve()
{
	int i;
	scanf("%s",b+1);
	lb=strlen(b+1);
	scanf("%s",a+1);
	la=strlen(a+1);
	get_Next();
	printf("%d\n",kmp());
}
int main()
{
	int t;
	cin>>t;
	while(t--) solve();
	return 0;
}

Minimal Repeating Unit of a String

POJ 1961 Period

Description

For each prefix of a given string $S$ with $N$ characters (each character has an ASCII code between $97$ and $126$, inclusive), we want to know whether the prefix is a periodic string. That is, for each $i$ $(2 \leqslant i \leqslant N)$ we want to know the largest $K > 1$ (if there is one) such that the prefix of $S$ with length $i$ can be written as $A^K$, that is $A$ concatenated $K$ times, for some string $A$. Of course, we also want to know the period $K$.

Input

The input consists of several test cases. Each test case consists of two lines. The first one contains $N$ $(2\leqslant N \leqslant 1\,000\,000)$ – the size of the string $S$. The second line contains the string $S$. The input file ends with a line, having the number zero on it.

Output

For each test case, output “Test case #” and the consecutive test case number on a single line; then, for each prefix with length $i$ that has a period $K > 1$ , output the prefix size $i$ and the period $K$ separated by a single space; the prefix sizes must be in increasing order. Print a blank line after each test case.

Sample Input

3
aaa
12
aabaabaabaab
0

Sample Output

Test case #1
2 2
3 3

Test case #2
2 2
6 2
9 3
12 4

Translation

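For each prefix of the string, find its minimal repeating unit and output the prefix length together with the number of times that unit repeats (the count must be at least 2).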

Idea

Theorem: $S[1,2,\cdots ,i]$ has a repeating unit of length $circle < i$ if and only if $circle$ divides $i$ and $S[circle + 1, circle + 2, \dots, i] = S[1, 2, \dots, i - circle]$.

For a prefix of length $i$, its last character sits at position $i$ of the string, and $\text{Next}[i + 1]$ describes the self-similarity of $S[1,2,\cdots ,i]$: $\text{Next}[i + 1] - 1$ is the length of the longest proper prefix of $S[1, 2, \dots, i]$ that is also a suffix of it. That is, $S[i-(\text{Next}[i+1]-1)+1,\cdots,i]=S[1,\cdots,\text{Next}[i+1]-1]$ $\Rightarrow$ $S[(i-\text{Next}[i+1]+1)+1,\cdots,i]=S[1,\cdots,i-(i-\text{Next}[i+1]+1)]$. If $(i-\text{Next}[i+1]+1)\mid i$, then $S[1,\cdots,i-\text{Next}[i+1]+1]$ is the minimal repeating unit of $S[1,2,\cdots ,i]$.
Furthermore, if $i - \text{Next}[\text{Next}[i + 1]] + 1$ divides $i$, then $S[1, \dots, i - \text{Next}[\text{Next}[i + 1]] + 1]$ is the next-smallest repeating unit of $S[1 \dots i]$. Continuing in this way, we can find every repeating unit of $S[1 \dots i]$.
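The idea of following the $\text{Next}$ chain to enumerate every repeating unit can be sketched as below. This goes slightly beyond what the problem asks (it only needs the smallest unit), and the names `compute_next` and `repeating_units` are hypothetical helpers introduced for illustration.

```cpp
#include <string>
#include <vector>

// Linear-time Next computation (1-based, index 0 unused, next[1] = 0).
std::vector<int> compute_next(const std::string& pat) {
    int m = (int)pat.size();
    std::string t = " " + pat;
    std::vector<int> next(m + 2, 0);
    int i = 1, j = 0;
    while (i <= m) {
        if (j == 0 || t[i] == t[j]) { ++i; ++j; next[i] = j; }
        else j = next[j];
    }
    return next;
}

// Returns, in increasing order, the length of every unit A (shorter than i)
// such that the prefix of length i equals A repeated i/|A| times.
std::vector<int> repeating_units(const std::string& str, int i) {
    std::vector<int> next = compute_next(str.substr(0, i));
    std::vector<int> units;
    // b runs over the prefix-suffix lengths of the prefix, longest first:
    // Next[i+1]-1, Next[Next[i+1]]-1, ...
    for (int b = next[i + 1] - 1; b > 0; b = next[b + 1] - 1) {
        int circle = i - b;
        if (i % circle == 0) units.push_back(circle);
    }
    return units;
}
```

For the sample string $\text{“aabaabaabaab”}$ and $i = 12$, the candidates are $3$, $6$ and $9$; only $3$ and $6$ divide $12$, so the function returns those two lengths, and the smallest one gives the answer $K = 12 / 3 = 4$.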

Code

#include<iostream>
#include<cstdio>
#include<cstring>
#define N 1000004
using namespace std;
int len;
char s[N];
int Next[N];
void get_Next()
{
	int i=1,j=0;
	while(i<=len)
	{
		if(j==0||s[i]==s[j])
		{
			i++;
			j++;
			Next[i]=j;
		}
		else j=Next[j];
	}
}
void solve()
{
	scanf("%s",s+1);
	get_Next();
	int i;
	for(i=2;i<=len;i++)
	{
		int circle=i-Next[i+1]+1;
		// test for the minimal repeating unit: it must divide i and be shorter than i
		if(!(i%circle)&&circle<i) printf("%d %d\n",i,i/circle); 
	}
	putchar('\n');
}
int main()
{
	int times=0;
	while(~scanf("%d",&len)&&len) 
	{
		printf("Test case #%d\n",++times);
		solve();
	}
	return 0;
}

Counting Occurrences of Every Prefix

HDU 3336 Count the string

Problem Description

It is well known that AekdyCoin is good at string problems as well as number theory problems. When given a string $s$, we can write down all the non-empty prefixes of this string. For example, $s=\text{“abab”}$. The prefixes are: $\text{“a”}, \text{“ab”}, \text{“aba”}, \text{“abab”}$. For each prefix, we can count the times it matches in $s$. So we can see that prefix $\text{“a”}$ matches twice, $\text{“ab”}$ matches twice too, $\text{“aba”}$ matches once, and $\text{“abab”}$ matches once. Now you are asked to calculate the sum of the match times for all the prefixes. For $\text{“abab”}$, it is $2 + 2 + 1 + 1 = 6$.
The answer may be very large, so output the answer mod $10007$ .

Input

The first line is a single integer $T$ , indicating the number of test cases.
For each case, the first line is an integer $n$ $(1\leqslant n \leqslant 200000)$ , which is the length of string $s$ . A line follows giving the string $s$ . The characters in the strings are all lower-case letters.

Output

For each case, output only one number: the sum of the match times for all the prefixes of $s$ mod $10007$ .

Sample Input

1
4
abab

Sample Output

6

Translation

Given a string $s$ $(1\leqslant|s|\leqslant 200000)$, compute the total number of occurrences in $s$ of all of its prefixes, modulo $10007$.

Idea

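Consider the value $\text{Next}[i + 1] - 1$ for position $i$. By definition, a prefix of $s$ of length $\text{Next}[i + 1] - 1$ occurs with its right end at position $i$, and no longer proper prefix does. Shorter prefixes may also end at that position, so we then look for the next smaller length $k < \text{Next}[i + 1] - 1$ such that the prefix of length $k$ is also a suffix ending at $i$. The prefixes ending at position $i$ therefore have lengths $\text{Next}[i + 1] - 1$, $\text{Next}[\text{Next}[i + 1]] - 1$, $\text{Next}[\text{Next}[\text{Next}[i + 1]]] - 1$, and so on, until the length reaches $0$. The prefix of length $i$ itself also ends at position $i$, which the code accounts for by initializing the answer to the length of the string.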

Code

#include<iostream>
#include<cstdio>
#include<cstring>
#define N 200004
#define ll long long
using namespace std;
const ll mod=10007;
int len;
char s[N];
int Next[N];
void get_Next()
{
	int i=1,j=0;
	while(i<=len)
	{
		if(j==0||s[i]==s[j])
		{
			i++;
			j++;
			Next[i]=j;
		}
		else j=Next[j];
	}
}
void solve()
{
	scanf("%d%s",&len,s+1);
	get_Next();
	ll ans=len%mod; // every prefix occurs at least once; reduce so the result is always below mod
	int i;
	for(i=1;i<=len;i++) 
	{
		int temp=Next[i+1]-1;
		while(temp)// iterate until the length reaches 0
		{
			ans=(ans+1)%mod;
			temp=Next[temp+1]-1;
		}
	}
	printf("%lld\n",ans);
}
int main()
{
	int t;
	cin>>t;
	while(t--) solve();
	return 0;
}
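Walking the whole $\text{Next}$ chain for every position can take roughly $n^2/2$ steps on inputs such as a string of $n$ identical letters. A commonly used alternative, which is not the approach of the code above, is a small dynamic program: if $cnt[i]$ is the number of prefixes that end at position $i$, then $cnt[i] = cnt[\text{Next}[i+1]-1] + 1$ with $cnt[0] = 0$. The sketch below assumes this recurrence; `compute_next` and `count_prefix_matches` are hypothetical names.

```cpp
#include <string>
#include <vector>

// Linear-time Next computation (1-based, index 0 unused, next[1] = 0).
std::vector<int> compute_next(const std::string& pat) {
    int m = (int)pat.size();
    std::string t = " " + pat;
    std::vector<int> next(m + 2, 0);
    int i = 1, j = 0;
    while (i <= m) {
        if (j == 0 || t[i] == t[j]) { ++i; ++j; next[i] = j; }
        else j = next[j];
    }
    return next;
}

// Sum over all prefixes of their number of occurrences in str, mod 10007.
int count_prefix_matches(const std::string& str) {
    const int mod = 10007;
    int n = (int)str.size();
    std::vector<int> next = compute_next(str);
    std::vector<int> cnt(n + 1, 0);   // cnt[i]: prefixes ending at position i
    int ans = 0;
    for (int i = 1; i <= n; ++i) {
        // the prefix of length i itself, plus everything ending at i that
        // is also counted for the longest prefix-suffix of length Next[i+1]-1
        cnt[i] = (cnt[next[i + 1] - 1] + 1) % mod;
        ans = (ans + cnt[i]) % mod;
    }
    return ans;
}
```

For $\text{“abab”}$ the values of $cnt$ are $1, 1, 2, 2$, which sum to $6$ and agree with the problem statement.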