Strings: KMP

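The KMP algorithm is an efficient way to find how many times, and where, a pattern string occurs in a text string, avoiding the running time of brute-force matching. A precomputed next array lets the matcher jump straight to the correct position after a mismatch and continue from there, giving a time complexity of $\Theta(n+m)$. This article explains how the KMP algorithm works and provides example matching code.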


Purpose: find how many times, and where, a pattern string occurs in a text string.
Prerequisites: $\color{#60d000}\texttt{strings}$
Origin of the name: its inventors, $\texttt{Knuth(D.E.Knuth)\&Morris(J.H.Morris)\&Pratt(V.R.Pratt)}$

Explanation:

For example, to find the pattern string $b=\texttt{abaabab}$ in the text string $a=\texttt{ababaababaabab}$, the brute-force approach enumerates every $i$ with $a[i]==b[1]$ and then matches $a[i\sim i+len(b)-1]$ against $b[1\sim len(b)]$. Code:

#include <bits/stdc++.h>
using namespace std;
const int N=1e6+10;
int n,m,ans;
char a[N],b[N];
int main(){
	scanf("%s%s",a+1,b+1);
	n=strlen(a+1),m=strlen(b+1);
	for(int i=1;i<=n-m+1;i++)
		if(a[i]==b[1]){
			bool ok=1;
			for(int j=2;j<=m;j++)
				if(a[i+j-1]!=b[j]){ok=0;break;} //#
			if(ok) ans++;
		}
	printf("%d\n",ans);
	return 0;
}

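The worst-case time complexity is $O(n\times m)$, which is certain to exceed the time limit on large inputs. The essence of KMP, which runs in $\Theta(n+m)$, is this: each time the line marked # in the code above mismatches (the match fails, $a[i+j-1]!=b[j]$), the pattern string $b$ does not have to restart from its beginning. Instead it jumps to a position determined in advance and continues matching from there.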
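To make the brute-force cost concrete, here is a minimal sketch that counts character comparisons instead of matches. The function name `brute_force_comparisons` and the 0-based `std::string` interface are assumptions made for illustration; they are not part of the code above.

```cpp
#include <cassert>
#include <string>

// Counts the character comparisons made by a brute-force matcher that
// tries every start position and stops at the first mismatch.
// (Hypothetical helper, 0-based indexing.)
long long brute_force_comparisons(const std::string& text, const std::string& pattern) {
    assert(!pattern.empty());
    const std::size_t n = text.size(), m = pattern.size();
    long long comparisons = 0;
    for (std::size_t i = 0; i + m <= n; ++i) {      // every candidate start
        for (std::size_t j = 0; j < m; ++j) {
            ++comparisons;
            if (text[i + j] != pattern[j]) break;   // mismatch: restart from scratch
        }
    }
    return comparisons;
}
```

With a text of $n$ copies of `a` and a pattern of $m-1$ copies of `a` followed by `b`, every start position runs through the whole pattern before failing, so the count is $(n-m+1)\times m$.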

Below, gray means not yet matched, green means currently being compared (success), red means currently being compared (failure), and black means already matched:

$\color{gray}\texttt{ababaababaabab}$
$\color{gray}\texttt{abaabab}$

$\color{#60c000}\texttt{a}\color{gray}\texttt{babaababaabab}$
$\color{#60c000}\texttt{a}\color{gray}\texttt{baabab}$

$\color{black}\texttt{a}\color{#60c000}\texttt{b}\color{gray}\texttt{abaababaabab}$
$\color{black}\texttt{a}\color{#60c000}\texttt{b}\color{gray}\texttt{aabab}$

$\color{black}\texttt{ab}\color{#60c000}\texttt{a}\color{gray}\texttt{baababaabab}$
$\color{black}\texttt{ab}\color{#60c000}\texttt{a}\color{gray}\texttt{abab}$

$\color{black}\texttt{aba}\color{red}\texttt{b}\color{gray}\texttt{aababaabab}$
$\color{black}\texttt{aba}\color{red}\texttt{a}\color{gray}\texttt{bab}$

The text string and the pattern string mismatch here. There is no need to restart the pattern string $b$ from its beginning, like this:

$\color{black}\texttt{a}\color{red}\texttt{b}\color{gray}\texttt{abaababaabab}$
$\color{red}\texttt{~a}\color{gray}\texttt{baabab}$ ← wrong approach

Instead, it should go like this:

$\color{black}\texttt{aba}\color{#60c000}\texttt{b}\color{gray}\texttt{aababaabab}$
$\color{black}\texttt{~~a}\color{#60c000}\texttt{b}\color{gray}\texttt{aabab}$

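At this point I can sense your surprise. This is not magic; there is a reason behind it. Take the first three characters of the pattern string $b$ that matched successfully, $\texttt{aba}$: at most its first $1$ character equals its last $1$ character, while its first $2$ characters do not equal its last $2$ characters. This tells us two things:
1. The first $1$ character of $b$ matches characters $3\sim 3$ of $a$.
2. If the $1$st character of $b$ were aligned with characters $2\sim 2$ of $a$, the whole pattern could not possibly match.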
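The "most leading characters equal to the same number of trailing characters" quantity used above can be checked by brute force. The sketch below is only meant to pin down the definition; the name `longest_border` is an assumption, and this quadratic loop is not how KMP computes the value.

```cpp
#include <cassert>
#include <string>

// Largest len < s.size() such that the first len characters of s equal
// the last len characters of s; returns 0 if there is none.
// (Hypothetical helper, deliberately naive.)
int longest_border(const std::string& s) {
    for (std::size_t len = s.empty() ? 0 : s.size() - 1; len > 0; --len) {
        // Compare the prefix s[0, len) with the suffix s[size - len, size).
        if (s.compare(0, len, s, s.size() - len, len) == 0) return static_cast<int>(len);
    }
    return 0;
}
```

For `aba` this gives 1, which is why the pattern shifts so that its first character lines up with the third character of the text.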

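So, from the largest number of leading characters that equal the same number of trailing characters in the substring formed by the first $3$ characters of $b$, we can work out where to jump after a mismatch. For a more complete and concrete explanation, watch the matching continue: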

$\color{black}\texttt{abab}\color{#60c000}\texttt{a}\color{gray}\texttt{ababaabab}$
$\color{black}\texttt{~~ab}\color{#60c000}\texttt{a}\color{gray}\texttt{abab}$

$\color{black}\texttt{ababa}\color{#60c000}\texttt{a}\color{gray}\texttt{babaabab}$
$\color{black}\texttt{~~aba}\color{#60c000}\texttt{a}\color{gray}\texttt{bab}$

$\color{black}\texttt{ababaa}\color{#60c000}\texttt{b}\color{gray}\texttt{abaabab}$
$\color{black}\texttt{~~abaa}\color{#60c000}\texttt{b}\color{gray}\texttt{ab}$

$\color{black}\texttt{ababaab}\color{#60c000}\texttt{a}\color{gray}\texttt{baabab}$
$\color{black}\texttt{~~abaab}\color{#60c000}\texttt{a}\color{gray}\texttt{b}$

$\color{black}\texttt{ababaaba}\color{#60c000}\texttt{b}\color{gray}\texttt{aabab}$
$\color{black}\texttt{~~abaaba}\color{#60c000}\texttt{b}$

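As shown above, we have found one position where the pattern string $b$ occurs in the text string $a$. Matching cannot continue along $b$ any further, so this can also be treated as a mismatch. In the string formed by the $7$ matched characters of $b$, the first two characters equal the last two characters, both being $\texttt{ab}$, so the jump goes like this: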

$\color{black}\texttt{ababaabab}\color{#60c000}\texttt{a}\color{gray}\texttt{abab}$
$\color{black}\texttt{~~~~~~~ab}\color{#60c000}\texttt{a}\color{gray}\texttt{abab}$

$\color{black}\texttt{ababaababa}\color{#60c000}\texttt{a}\color{gray}\texttt{bab}$
$\color{black}\texttt{~~~~~~~aba}\color{#60c000}\texttt{a}\color{gray}\texttt{bab}$

$\color{black}\texttt{ababaababaa}\color{#60c000}\texttt{b}\color{gray}\texttt{ab}$
$\color{black}\texttt{~~~~~~~abaa}\color{#60c000}\texttt{b}\color{gray}\texttt{ab}$

$\color{black}\texttt{ababaababaab}\color{#60c000}\texttt{a}\color{gray}\texttt{b}$
$\color{black}\texttt{~~~~~~~abaab}\color{#60c000}\texttt{a}\color{gray}\texttt{b}$

$\color{black}\texttt{ababaababaaba}\color{#60c000}\texttt{b}$
$\color{black}\texttt{~~~~~~~abaaba}\color{#60c000}\texttt{b}$

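Then another position where the pattern string $b$ occurs in the text string $a$ is found, and every character of $a$ has now been processed, so matching ends. The final result: $b$ occurs $2$ times in $a$, and in these two occurrences the first character of $b$ corresponds to the $3$rd and the $8$th character of $a$ respectively.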
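The result of the walkthrough can be cross-checked without KMP. The sketch below leans on the standard `std::string::find` and advances by one position after each hit so that overlapping occurrences are kept; the name `find_all` and the 1-based return values are assumptions chosen to match the numbering used in this article.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the 1-based start position of every occurrence of pattern in text,
// overlapping occurrences included. (Hypothetical helper for checking only.)
std::vector<int> find_all(const std::string& text, const std::string& pattern) {
    assert(!pattern.empty());
    std::vector<int> starts;
    for (std::size_t pos = text.find(pattern); pos != std::string::npos;
         pos = text.find(pattern, pos + 1)) {       // step by 1 to allow overlaps
        starts.push_back(static_cast<int>(pos) + 1);
    }
    return starts;
}
```

Calling `find_all("ababaababaabab", "abaabab")` yields positions 3 and 8, the same two occurrences found step by step above.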

So suppose we already have an array $nex[x]$ meaning that, in the string formed by the first $x$ characters of $b$, at most the first $nex[x]$ characters are exactly the same as the last $nex[x]$ characters ($0\le nex[x]<x$). Then the matching code can be written like this:

#include <bits/stdc++.h>
using namespace std;
const int N=1e6+10;
class charstar{  // string
// I personally like using class; if it is unfamiliar, look up how class is used
public:char arr[N];
	int len;
	char& operator[](int x){return arr[x];}
	void leng(){len=strlen(arr+1);}
}a;
class KMP:public charstar{
public:
	int nex[N];
	void build(){
		// the function that builds the nex[] array is covered later
	}
	void found(charstar&book,queue<int>&q){// book is the text a; arr is the pattern b itself
		for(int i=1,j=0;i<=book.len;i++){
			while(j&&book[i]!=arr[j+1]) j=nex[j];
			if(book[i]==arr[j+1]) j++;
			if(j==len) q.push(i-len+1),j=nex[j];
		}
	}
}b;
queue<int> ans;
int main(){	 
	scanf("%s%s",&a[1],&b[1]);
	a.leng(),b.leng();
	b.build(),b.found(a,ans);
	while(ans.size()) printf("%d\n",ans.front()),ans.pop();// for each match, print the index in a that b[1] lines up with
	for(int i=1;i<=b.len;i++) printf("%d%c",b.nex[i],"\n "[i<b.len]);
	return 0;
}

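This algorithm runs in $\Theta(n+m)$ time. To keep that bound, the $nex[]$ array must also be computed in $\Theta(m)$. The three clever scientists came up with a subtle method: let $b$ match against itself. When $nex[i]$ is being computed, $j$ holds $nex[i-1]$; if $b[j+1]\ne b[i]$, $j$ falls back to $nex[j]$, the next shorter candidate, just as in matching. This is hard to explain in words, so here is the code: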

void build(){
	//nex[1]=0; arrays defined outside main are zero-initialized, and 0<=nex[i]<i
	for(int i=2,j=0;i<=len;i++){
		while(j&&arr[j+1]!=arr[i]) j=nex[j];
		if(arr[j+1]==arr[i]) j++;
		nex[i]=j;
	}
}

It is almost identical to the matching code above.
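As a worked example of the self-matching idea, the sketch below computes the same table for a `std::string`. It is an illustrative variant under stated assumptions: the name `border_table` and the 0-based layout (entry `i` describes the first `i + 1` characters) are not from the code above, which uses 1-based arrays.

```cpp
#include <cassert>
#include <string>
#include <vector>

// table[i] = largest len <= i such that the first len characters of
// p[0..i] equal its last len characters. (Hypothetical 0-based variant.)
std::vector<int> border_table(const std::string& p) {
    std::vector<int> table(p.size(), 0);
    int j = 0;                                   // border length of the previous prefix
    for (std::size_t i = 1; i < p.size(); ++i) {
        while (j > 0 && p[i] != p[j]) j = table[j - 1];  // fall back to a shorter border
        if (p[i] == p[j]) ++j;                   // extend the border by one character
        table[i] = j;
    }
    return table;
}
```

For the pattern `abaabab` used in this article, the table is `0 0 1 1 2 3 2`. The third entry (1) and the last entry (2) are exactly the two jumps taken in the walkthrough.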

If that makes sense, here is the complete code:

#include <bits/stdc++.h>
using namespace std;
const int N=1e6+10;
class charstar{
public:char arr[N];
	int len;
	char& operator[](int x){return arr[x];}
	void leng(){len=strlen(arr+1);} 
}a;
class KMP:public charstar{
public:
	int nex[N];
	void build(){ 
		for(int i=2,j=0;i<=len;i++){
			while(j&&arr[j+1]!=arr[i]) j=nex[j];
			if(arr[j+1]==arr[i]) j++;
			nex[i]=j;
		}
	}
	void found(charstar&book,queue<int>&q){
		for(int i=1,j=0;i<=book.len;i++){
			while(j&&book[i]!=arr[j+1]) j=nex[j];
			if(book[i]==arr[j+1]) j++;
			if(j==len) q.push(i-len+1),j=nex[j];
		}
	}
}b;
queue<int> ans;
int main(){	 
	scanf("%s%s",&a[1],&b[1]);
	a.leng(),b.leng();
	b.build(),b.found(a,ans);
	while(ans.size()) printf("%d\n",ans.front()),ans.pop();
	for(int i=1;i<=b.len;i++) printf("%d%c",b.nex[i],"\n "[i<b.len]);
	return 0;
}
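One way to see why the inner `while` loop does not break the $\Theta(n+m)$ bound: `j` grows by at most 1 per text character, and every fallback `j=nex[j]` shrinks it, so the total number of fallbacks during matching cannot exceed $n$. The sketch below instruments the same two loops to count them. The `SearchStats` struct, the function name `kmp_search_counted`, and the `std::string` interface are assumptions added for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct SearchStats {
    std::vector<int> starts;   // 1-based start positions of the matches
    long long fallbacks = 0;   // how many times j = nex[j] ran inside the while loop
};

// Same loops as build() and found(), rewritten over std::string
// (hypothetical instrumented variant; pattern[j] plays the role of arr[j+1]).
SearchStats kmp_search_counted(const std::string& text, const std::string& pattern) {
    assert(!pattern.empty());
    const int n = static_cast<int>(text.size()), m = static_cast<int>(pattern.size());
    std::vector<int> nex(m + 1, 0);            // nex[x] for the first x characters
    for (int i = 2, j = 0; i <= m; ++i) {
        while (j && pattern[j] != pattern[i - 1]) j = nex[j];
        if (pattern[j] == pattern[i - 1]) ++j;
        nex[i] = j;
    }
    SearchStats stats;
    for (int i = 1, j = 0; i <= n; ++i) {
        while (j && text[i - 1] != pattern[j]) { j = nex[j]; ++stats.fallbacks; }
        if (text[i - 1] == pattern[j]) ++j;
        if (j == m) { stats.starts.push_back(i - m + 1); j = nex[j]; }
    }
    return stats;
}
```

On the worst case for brute force (a text of ten `a` characters and the pattern `aaab`), the fallback count stays at or below the text length instead of multiplying with the pattern length.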

If you do not like this version, where matching uses a nested loop, here is another version:

#include <bits/stdc++.h>
using namespace std;
const int N=1e6+10;
class charstar{
public:char arr[N];
	int len;
	char& operator[](int x){return arr[x];}
	void leng(){len=strlen(arr+1);}
}s1;
class KMP:public charstar{
public:
	int nex[N];
	void build(){ 
		for(int i=1,j=0;i<=len;)
			if(!j||arr[i]==arr[j]) nex[++i]=++j;
			else j=nex[j];
	}
	void found(charstar&book,queue<int>&q){
		for(int i=1,j=1;i<=book.len;){
			if(!j||book[i]==arr[j]) i++,j++;
			else j=nex[j];
			if(j==len+1) q.push(i-len),j=nex[j];
		}
	}
}s2;
queue<int> ans;
int main(){	 
	scanf("%s%s",&s1[1],&s2[1]);
	s1.leng(),s2.leng();
	s2.build(),s2.found(s1,ans);
	while(ans.size()) printf("%d\n",ans.front()),ans.pop();
	for(int i=1;i<=s2.len;i++) printf("%d%c",s2.nex[i+1]-1,"\n "[i<s2.len]);
	return 0;
}
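The `nex[]` array in this second version is stored differently from the first: entry `i+1` holds the first version's `nex[i]` plus 1, which is why the final loop prints `nex[i+1]-1`. The sketch below rebuilds both tables over `std::string` and checks that relation. The function names are assumptions, and the loops are transcriptions of the two `build()` functions above rather than new algorithms.

```cpp
#include <cassert>
#include <string>
#include <vector>

// First version: nex[x] = border length of the first x characters (1-based x).
std::vector<int> nex_first_version(const std::string& p) {
    const int m = static_cast<int>(p.size());
    std::vector<int> nex(m + 2, 0);
    for (int i = 2, j = 0; i <= m; ++i) {
        while (j && p[j] != p[i - 1]) j = nex[j];
        if (p[j] == p[i - 1]) ++j;
        nex[i] = j;
    }
    return nex;
}

// Second version: single loop, table shifted by one index and one unit.
std::vector<int> nex_second_version(const std::string& p) {
    const int m = static_cast<int>(p.size());
    std::vector<int> nex(m + 2, 0);            // index m + 1 is written, so size m + 2
    for (int i = 1, j = 0; i <= m;) {
        if (!j || p[i - 1] == p[j - 1]) nex[++i] = ++j;
        else j = nex[j];
    }
    return nex;
}

// True if second[x + 1] - 1 == first[x] for every x in 1..m.
bool conventions_agree(const std::string& p) {
    const std::vector<int> v1 = nex_first_version(p), v2 = nex_second_version(p);
    for (int x = 1; x <= static_cast<int>(p.size()); ++x)
        if (v2[x + 1] - 1 != v1[x]) return false;
    return true;
}
```

Note that the second version writes one slot past the pattern length, so any array backing it needs room for index `len+1`.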

The string-learning path ($\texttt{★}$ marks the topic currently being studied):
$\color{#cccccc}\texttt{hash}\color{#aaaaff}\texttt{-}\color{#8888ff}\texttt{kmp}\color{#000000}\texttt{★}\color{#88cccc}\texttt{-}\color{#88ff88}\texttt{manacher}\color{#cccc88}\texttt{-}\color{#dddd44}\texttt{exkmp}\color{#eeaa44}\texttt{-}\color{#ffaa00}\texttt{trie}\color{#ff8800}\texttt{-}\color{#ee2200}\texttt{acam}\color{#ee0088}\texttt{-}\color{#cc00ff}\texttt{sa}\color{#660077}\texttt{-}\color{#555555}\texttt{sam}\color{#272727}\texttt{-}\color{#000000}\texttt{pam}$

Happy studying!
