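The naive pattern matching algorithm
Let the main string be $S=\text{“abcdefgab”}$ and the pattern to match be $T=\text{“abcdex”}$. Define two pointers $i,j$ that point into the strings $S,T$ respectively.
Initially, $i,j$ point to the heads of the two strings. Align the heads of the two strings and move both pointers forward in step. When $i=j=6$ we find $S[i]\ne T[j]$, so $T$ is shifted one position to the right relative to $S$: the pointer $i$ moves back to $2$ and $j$ returns to $1$. Now the very first character of $T$ fails to match, and the process continues as shown in the figure.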
The worst-case time complexity of the naive algorithm is $O((n-m+1)\times m)$, where $n=|S|$ and $m=|T|$, which is unacceptable for long inputs.
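The procedure above can be sketched as follows. This is a minimal illustration, not the author's code: the function name `naive_find` and the comparison counter are assumptions added here, and the sketch uses 0-based `std::string` indices instead of the 1-based pointers in the text.

```cpp
#include <string>
#include <utility>

// Naive matching: try every alignment of t against s.
// Returns {0-based start of the first occurrence or -1,
//          total number of character comparisons performed}.
std::pair<int, long long> naive_find(const std::string& s, const std::string& t) {
    const int n = static_cast<int>(s.size());
    const int m = static_cast<int>(t.size());
    long long comparisons = 0;
    for (int start = 0; start + m <= n; ++start) {  // at most n - m + 1 alignments
        int j = 0;
        while (j < m) {
            ++comparisons;
            if (s[start + j] != t[j]) break;        // mismatch: shift t by one
            ++j;
        }
        if (j == m) return {start, comparisons};    // all m characters matched
    }
    return {-1, comparisons};
}
```

For $S=\text{“abcdefgab”}$ and $T=\text{“abcdex”}$ this sketch reports no match after 9 comparisons (6 for the first alignment, then 1 for each of the next three). For `"aaaaaaaa"` and `"aab"` every one of the 6 alignments costs 3 comparisons, which is exactly the $(n-m+1)\times m$ worst case.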
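Overview of the KMP algorithm
In computer science, the $\text{Knuth-Morris-Pratt}$ string-searching algorithm ($\text{KMP}$ algorithm for short) finds the positions at which a word $W$ occurs within a main text string $S$. It relies on the observation that, when a mismatch occurs, the word itself carries enough information to determine where the next match could begin, so previously matched characters never have to be re-examined.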
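The algorithm was conceived in 1974 by Donald Knuth and Vaughan Pratt, and was independently designed in the same year by James H. Morris; the three published it jointly in 1977.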
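The waste in the naive method
In the naive matching example above, the first letter $\text{‘a’}$ of $T$ differs from every character of the rest of the pattern, $\text{“bcdex”}$. In other words, since step ① has already established that the first five characters are equal, the first character of $T$ cannot be equal to any of the 2nd to 5th characters of $S$, so the comparisons in steps ②③④⑤ are redundant.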
Using the information already obtained
Consider another example: $S=\text{“abcababca”}$, $T=\text{“abcabx”}$.
[Figure (image not available): https://i.niupic.com/images/2020/04/02/7ges.jpg]
In the first pass the first $5$ characters are equal, and the mismatch occurs at $i=j=6$. We already know that $S[1,2,3,4,5]=T[1,2,3,4,5]$ and that $T[1]$ differs from the 2nd and 3rd characters of $T$, so steps ②③ are redundant and $T$ can be shifted to the right by a large amount at once, i.e. $i=4,j=1$, after which matching restarts. But then we notice that, because $S[4,5]=T[4,5]=T[1,2]$, we can in fact set $i=6,j=3$ directly: $T[1,2]$ and $S[4,5]$ obviously match, so steps ④⑤ are redundant as well.
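In the naive pattern matching algorithm the main-string pointer $i$ keeps backtracking, whereas the matching process above shows that this backtracking can be avoided entirely. This is the essence of the $\text{KMP}$ algorithm: exploit the similarity between the prefixes and suffixes of the pattern, together with the information already gained from matching, so that the pointer never has to jump back and forth.
The time complexity of the $\text{KMP}$ algorithm is $O(n+m)$.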
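Defining the array $\text{Next}[j]$
Definition
In the general case, $Next[j]$ is the length of the longest equal proper prefix and suffix of the string formed by the first $j-1$ characters, plus one. By convention $Next[1]=0$.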
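To make the definition concrete, here is a brute-force sketch that evaluates it literally. It is not the efficient construction used later in this post: the name `next_by_definition` is hypothetical, the running time is cubic in the pattern length, and it is only meant for checking small examples by hand.

```cpp
#include <string>
#include <vector>

// Brute-force Next, straight from the definition (1-indexed as in the text).
// next[j] = 1 + length of the longest proper prefix of t[1..j-1]
//           that is also a suffix of t[1..j-1];  next[1] = 0 by convention.
// Index 0 of the returned vector is unused.
std::vector<int> next_by_definition(const std::string& t) {
    const int m = static_cast<int>(t.size());
    std::vector<int> next(m + 2, 0);
    for (int j = 2; j <= m + 1; ++j) {
        const std::string head = t.substr(0, j - 1);  // t[1..j-1] in 1-indexed terms
        int best = 0;
        for (int len = 1; len < j - 1; ++len) {       // proper: len < |head|
            // compare the prefix of length len with the suffix of length len
            if (head.compare(0, len, head, head.size() - len, len) == 0) best = len;
        }
        next[j] = best + 1;
    }
    return next;
}
```

For the pattern `"abcabx"` the sketch returns `0, 1, 1, 1, 2, 3, 1` for $j=1,\dots,7$.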
Computing the array $\text{Next}[j]$
Here is an example of computing $Next[j]$.
j | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
Pattern T | a | b | c | a | b | x | (empty) |
Next[j] | 0 | 1 | 1 | 1 | 2 | 3 | 1 |
Take $j=6$ as an example: $T[1,2]=T[4,5]$, so $Next[6]=3$. When a mismatch occurs at $j=6$, the head of $T$ moves to $S[4]$; at that point $T[1,2]$ and $S[4,5]$ obviously match, so we move $j$ directly to $j=3$, i.e. $j=Next[j]$. This step is called falling back.
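The fall-back step can be observed with a small tracing sketch. It assumes the $Next$ table is already known (for example, copied from the table above), and the names `Trace` and `kmp_trace` are hypothetical helpers introduced only for this illustration.

```cpp
#include <string>
#include <utility>
#include <vector>

// Result of one KMP scan with a precomputed 1-indexed Next table (next[0] unused).
struct Trace {
    int match_pos;                            // 1-based start of the first match, 0 if none
    std::vector<std::pair<int, int>> steps;   // every compared pair (i in S, j in T)
};

Trace kmp_trace(const std::string& s, const std::string& t, const std::vector<int>& next) {
    const int n = static_cast<int>(s.size());
    const int m = static_cast<int>(t.size());
    Trace tr{0, {}};
    int i = 1, j = 1;
    while (i <= n && j <= m) {
        if (j == 0) { ++i; ++j; continue; }   // fell off the front of T: advance both
        tr.steps.push_back({i, j});
        if (s[i - 1] == t[j - 1]) { ++i; ++j; }
        else j = next[j];                     // fall back; i is left untouched
    }
    if (j > m) tr.match_pos = i - m;
    return tr;
}
```

With $S=\text{“abcababca”}$, $T=\text{“abcabx”}$ and the table `{0, 0, 1, 1, 1, 2, 3, 1}` (index 0 unused), the recorded steps are $(1,1),\dots,(5,5),(6,6),(6,3),(6,1),(7,2),(8,3),(9,4)$: after the mismatch at $(6,6)$ only $j$ changes, and $i$ never decreases.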
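Template
// Assumes globals: char t[] (pattern, 1-indexed), int lt (its length), int Next[].
void get_Next()
{
    int i=1,j=0;
    Next[1]=0;
    while(i<=lt)
    {
        // t[i] is the end of the suffix, t[j] is the end of the prefix
        if(j==0||t[i]==t[j])
        {
            i++;
            j++;
            Next[i]=j;
        }
        else j=Next[j]; // fall back
    }
}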
Finding a substring
HDU 1686 Oulipo
Problem Description
The French author Georges Perec (1936–1982) once wrote a book, La disparition, without the letter ‘e’. He was a member of the Oulipo group. A quote from the book:
Tout avait Pair normal, mais tout s’affirmait faux. Tout avait Fair normal, d’abord, puis surgissait l’inhumain, l’affolant. Il aurait voulu savoir où s’articulait l’association qui l’unissait au roman : stir son tapis, assaillant à tout instant son imagination, l’intuition d’un tabou, la vision d’un mal obscur, d’un quoi vacant, d’un non-dit : la vision, l’avision d’un oubli commandant tout, où s’abolissait la raison : tout avait l’air normal mais…
Perec would probably have scored high (or rather, low) in the following contest. People are asked to write a perhaps even meaningful text on some subject with as few occurrences of a given “word” as possible. Our task is to provide the jury with a program that counts these occurrences, in order to obtain a ranking of the competitors. These competitors often write very long texts with nonsense meaning; a sequence of $500000$ consecutive 'T's is not unusual. And they never use spaces.
So we want to quickly find out how often a word, i.e., a given string, occurs in a text. More formally: given the alphabet {%raw%}$\{A, B, C, \ldots, Z\}${%endraw%} and two finite strings over that alphabet, a word $W$ and a text $T$, count the number of occurrences of $W$ in $T$. All the consecutive characters of $W$ must exactly match consecutive characters of $T$. Occurrences may overlap.
Input
The first line of the input file contains a single number: the number of test cases to follow. Each test case has the following format:
One line with the word $W$, a string over {%raw%}$\{A, B, C, \ldots, Z\}${%endraw%}, with $1 \le |W| \le 10000$ (here $|W|$ denotes the length of the string $W$).
One line with the text $T$, a string over {%raw%}$\{A, B, C, \ldots, Z\}${%endraw%}, with $|W| \le |T| \le 1000000$.
Output
For every test case in the input file, the output should contain a single number, on a single line: the number of occurrences of the word $W$ in the text $T$.
Sample Input
3
BAPC
BAPC
AZA
AZAZAZA
VERDI
AVERDXIVYERDIAN
Sample Output
1
3
0
Translation
Count how many times the word $W$ occurs as a substring of the text $T$; occurrences may overlap.
Idea
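Simulate the $\text{KMP}$ algorithm as described above, skipping the unnecessary backtracking. After each complete match, treat it as a mismatch (set $j=Next[j]$) and keep scanning, so that overlapping occurrences are counted too.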
Code
#include<iostream>
#include<cstring>
#include<cstdio>
#define N 10003
using namespace std;
char a[N*100],b[N]; // a: text, b: word (both stored 1-indexed)
int Next[N];
int la,lb;
/*------ build the Next array ------*/
void get_Next()
{
    int i=1,j=0;
    Next[1]=0;
    while(i<=lb)
    {
        if(j==0||b[i]==b[j])
        {
            i++;
            j++;
            Next[i]=j;
        }
        else j=Next[j];
    }
}
/*------------ matching ------------*/
int kmp()
{
    int i=1,j=1;
    int ans=0;
    while(i<=la&&j<=lb)
    {
        if(j==0||a[i]==b[j])
        {
            i++;
            j++;
        }
        else j=Next[j];
        if(j>lb) // j is one past the end of the word: an occurrence was found
        {
            // treat the full match as a mismatch and keep looking for the next one
            j=Next[j];
            ans++;
        }
    }
    return ans;
}
void solve()
{
    scanf("%s",b+1);
    lb=strlen(b+1);
    scanf("%s",a+1);
    la=strlen(a+1);
    get_Next();
    printf("%d\n",kmp());
}
int main()
{
    int t;
    cin>>t;
    while(t--) solve();
    return 0;
}
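For comparison, the same counting can be written with 0-indexed `std::string`s and the commonly used prefix function. This is an alternative formulation rather than the author's code: the names `prefix_function` and `count_occurrences` are assumptions, and `pi[k]` here corresponds to $Next[k+2]-1$ in the 1-indexed notation of this post.

```cpp
#include <string>
#include <vector>

// pi[k] = length of the longest proper prefix of w[0..k] that is also its suffix.
std::vector<int> prefix_function(const std::string& w) {
    std::vector<int> pi(w.size(), 0);
    for (std::size_t k = 1; k < w.size(); ++k) {
        int j = pi[k - 1];
        while (j > 0 && w[k] != w[j]) j = pi[j - 1];  // fall back to a shorter border
        if (w[k] == w[j]) ++j;
        pi[k] = j;
    }
    return pi;
}

// Counts the occurrences of w in text; overlapping occurrences are counted.
int count_occurrences(const std::string& w, const std::string& text) {
    if (w.empty()) return 0;
    const std::vector<int> pi = prefix_function(w);
    const int m = static_cast<int>(w.size());
    int j = 0, count = 0;                 // j = number of characters of w matched so far
    for (char c : text) {
        while (j > 0 && c != w[j]) j = pi[j - 1];
        if (c == w[j]) ++j;
        if (j == m) {                     // full match ending at this character
            ++count;
            j = pi[j - 1];                // continue as if a mismatch had occurred
        }
    }
    return count;
}
```

On the three sample cases (`BAPC`/`BAPC`, `AZA`/`AZAZAZA`, `VERDI`/`AVERDXIVYERDIAN`) this sketch yields 1, 3 and 0, matching the sample output.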
Minimal repeating unit of a string
POJ 1961 Period
Description
For each prefix of a given string $S$ with $N$ characters (each character has an ASCII code between $97$ and $126$, inclusive), we want to know whether the prefix is a periodic string. That is, for each $i$ $(2 \leqslant i \leqslant N)$ we want to know the largest $K > 1$ (if there is one) such that the prefix of $S$ with length $i$ can be written as $A^K$, that is $A$ concatenated $K$ times, for some string $A$. Of course, we also want to know the period $K$.
Input
The input consists of several test cases. Each test case consists of two lines. The first one contains $N$ $(2\leqslant N \leqslant 1000000)$, the size of the string $S$. The second line contains the string $S$. The input file ends with a line, having the number zero on it.
Output
For each test case, output “Test case #” and the consecutive test case number on a single line; then, for each prefix with length $i$ that has a period $K > 1$, output the prefix size $i$ and the period $K$ separated by a single space; the prefix sizes must be in increasing order. Print a blank line after each test case.
Sample Input
3
aaa
12
aabaabaabaab
0
Sample Output
Test case #1
2 2
3 3
Test case #2
2 2
6 2
9 3
12 4
Translation
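For each prefix of a string, find its minimal repeating unit; output the prefix length and the number of times the minimal repeating unit is repeated (the count must be at least 2).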
Idea
Theorem: $S[1,2,\cdots,i]$ has a repeating unit of length $circle<i$ if and only if $circle$ divides $i$ and $S[circle+1,circle+2,\cdots,i]=S[1,2,\cdots,i-circle]$.
For a prefix of length $i$, its last character is at position $i$ of the string, and $Next[i+1]$ describes the self-similarity of $S[1,2,\cdots,i]$: $Next[i+1]-1$ is the length of the longest proper prefix of $S[1,2,\cdots,i]$ that is also a suffix of $S[1,2,\cdots,i]$, i.e. $S[i-(Next[i+1]-1)+1,\cdots,i]=S[1,\cdots,Next[i+1]-1]$ $\Rightarrow$ $S[(i-Next[i+1]+1)+1,\cdots,i]=S[1,\cdots,i-(i-Next[i+1]+1)]$. This is exactly the condition of the theorem with $circle=i-Next[i+1]+1$. So if $(i-Next[i+1]+1)\mid i$, then $S[1,\cdots,i-Next[i+1]+1]$ is the minimal repeating unit of $S[1,2,\cdots,i]$.
Furthermore, if $i-Next[Next[i+1]]+1$ divides $i$, then $S[1,\cdots,i-Next[Next[i+1]]+1]$ is the second-smallest repeating unit of $S[1\cdots i]$. Continuing in the same way, we can find all possible repeating units of $S[1\cdots i]$.
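The divisibility test above can be tried out with the following sketch. It is an illustration under assumptions: the function name `periodic_prefixes` is hypothetical, and the array `border[i]` stores $Next[i+1]-1$ directly (the longest border of $S[1\cdots i]$) instead of the shifted $Next$ values used in this post.

```cpp
#include <string>
#include <utility>
#include <vector>

// For every prefix length i >= 2 that can be written as A^K with K > 1,
// returns the pair (i, largest K), in increasing order of i.
std::vector<std::pair<int, int>> periodic_prefixes(const std::string& s) {
    const int n = static_cast<int>(s.size());
    // border[i] = length of the longest proper prefix of s[1..i] that is also
    // its suffix; this equals Next[i+1] - 1 in the notation of the text.
    std::vector<int> border(n + 1, 0);
    for (int i = 2, j = 0; i <= n; ++i) {
        while (j > 0 && s[i - 1] != s[j]) j = border[j];
        if (s[i - 1] == s[j]) ++j;
        border[i] = j;
    }
    std::vector<std::pair<int, int>> out;
    for (int i = 2; i <= n; ++i) {
        const int circle = i - border[i];            // i - Next[i+1] + 1
        if (circle < i && i % circle == 0) out.push_back({i, i / circle});
    }
    return out;
}
```

For `"aabaabaabaab"` the borders of the prefixes of length 6, 9 and 12 are 3, 6 and 9, so $circle=3$ in each case and the sketch reports $(2,2),(6,2),(9,3),(12,4)$, in agreement with the sample output.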
Code
#include<iostream>
#include<cstdio>
#include<cstring>
#define N 1000004
using namespace std;
int len;
char s[N]; // the string, stored 1-indexed
int Next[N];
void get_Next()
{
    int i=1,j=0;
    Next[1]=0;
    while(i<=len)
    {
        if(j==0||s[i]==s[j])
        {
            i++;
            j++;
            Next[i]=j;
        }
        else j=Next[j];
    }
}
void solve()
{
    scanf("%s",s+1);
    get_Next();
    int i;
    for(i=2;i<=len;i++)
    {
        int circle=i-Next[i+1]+1;
        // check whether circle is the minimal repeating unit of the prefix of length i
        if(!(i%circle)&&circle<i) printf("%d %d\n",i,i/circle);
    }
    putchar('\n');
}
int main()
{
    int times=0;
    while(~scanf("%d",&len)&&len)
    {
        printf("Test case #%d\n",++times);
        solve();
    }
    return 0;
}
Counting the occurrences of every prefix
HDU 3336 Count the string
Problem Description
It is well known that AekdyCoin is good at string problems as well as number theory problems. When given a string $s$, we can write down all the non-empty prefixes of this string. For example, $s=\text{“abab”}$. The prefixes are: $\text{“a”},\text{“ab”},\text{“aba”},\text{“abab”}$. For each prefix, we can count the times it matches in $s$. So we can see that prefix $\text{“a”}$ matches twice, $\text{“ab”}$ matches twice too, $\text{“aba”}$ matches once, and $\text{“abab”}$ matches once. Now you are asked to calculate the sum of the match times for all the prefixes. For “abab”, it is $2 + 2 + 1 + 1 = 6$.
The answer may be very large, so output the answer mod $10007$.
Input
The first line is a single integer $T$, indicating the number of test cases.
For each case, the first line is an integer $n$ $(1\leqslant n \leqslant 200000)$, which is the length of string $s$. A line follows giving the string $s$. The characters in the strings are all lower-case letters.
Output
For each case, output only one number: the sum of the match times for all the prefixes of $s$ mod $10007$.
Sample Input
1
4
abab
Sample Output
6
Translation
Given a string $s$ $(1\leqslant|s|\leqslant 200000)$, compute the total number of occurrences in $s$ of all of its prefixes, modulo $10007$.
Idea
Consider the value $Next[i+1]-1$ for position $i$. By definition, it means that a prefix of $s$ of length $Next[i+1]-1$ occurs with its right end at position $i$, and that no longer proper prefix does so. Shorter prefixes may also end at this position. To find them, look for the next smaller length $k$ with $k<Next[i+1]-1$ such that the prefix of length $k$ is also a suffix ending at $i$. Hence the prefixes ending at position $i$ have lengths $Next[i+1]-1$, $Next[Next[i+1]]-1$, $Next[Next[Next[i+1]]]-1$, and so on, until the length becomes $0$. The prefix of length $i$ itself also ends at position $i$, which is why the answer starts from $len$.
Code
#include<iostream>
#include<cstdio>
#include<cstring>
#define N 200004
#define ll long long
using namespace std;
const ll mod=10007;
int len;
char s[N]; // the string, stored 1-indexed
int Next[N];
void get_Next()
{
    int i=1,j=0;
    Next[1]=0;
    while(i<=len)
    {
        if(j==0||s[i]==s[j])
        {
            i++;
            j++;
            Next[i]=j;
        }
        else j=Next[j];
    }
}
void solve()
{
    scanf("%d%s",&len,s+1);
    get_Next();
    // every prefix occurs at least once (as itself); reduce mod here so the
    // answer is also correct when no border exists and the loop adds nothing
    ll ans=len%mod;
    int i;
    for(i=1;i<=len;i++)
    {
        int temp=Next[i+1]-1;
        while(temp) // iterate until the length reaches 0
        {
            ans=(ans+1)%mod;
            temp=Next[temp+1]-1;
        }
    }
    printf("%lld\n",ans);
}
int main()
{
    int t;
    cin>>t;
    while(t--) solve();
    return 0;
}
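The inner `while` loop above walks the whole chain of borders for every position, which can take quadratic time on inputs such as a long run of one letter. A linear-time variant is sketched below as an alternative, not as the author's solution: it caches the chain length per position with the recurrence $ends[i]=ends[border[i]]+1$. The names `count_prefix_matches`, `border` and `ends` are assumptions introduced here, and `border[i]` again stands for $Next[i+1]-1$.

```cpp
#include <string>
#include <vector>

// Sum over all non-empty prefixes of s of their number of occurrences in s,
// taken modulo mod.
// ends[i] = number of prefixes of s that occur with their right end at position i
//         = 1 + ends[border[i]], where border[i] is the longest border of s[1..i].
int count_prefix_matches(const std::string& s, int mod = 10007) {
    const int n = static_cast<int>(s.size());
    std::vector<int> border(n + 1, 0), ends(n + 1, 0);
    int total = 0;
    for (int i = 1, j = 0; i <= n; ++i) {
        if (i > 1) {                                   // border[1] is always 0
            while (j > 0 && s[i - 1] != s[j]) j = border[j];
            if (s[i - 1] == s[j]) ++j;
            border[i] = j;
        }
        ends[i] = (ends[border[i]] + 1) % mod;         // reuse the shorter chain
        total = (total + ends[i]) % mod;
    }
    return total;
}
```

For `"abab"` the values of `ends` are 1, 1, 2, 2, which sum to 6 as in the sample. For a string made of one `a` followed by 19999 `b`s no prefix has a border, so the sum is $20000 \bmod 10007 = 9993$.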