字符串-KMP

最新推荐文章于 2024-04-14 17:48:07 发布

吾仄lo咚锵

最新推荐文章于 2024-04-14 17:48:07 发布

阅读量8k

点赞数 68

分类专栏：算法文章标签：算法字符串 c++ c语言数据结构

本文链接：https://blog.csdn.net/qq_45034708/article/details/108444332

版权

算法专栏收录该内容

38 篇文章 6 订阅

订阅专栏

文章目录

KMP
后记

KMP

KMP是单模式匹配算法，即在一个长度为 $n$ 的文本串S中查找一个长度 $m$ 的模式串P。它的复杂度是 $O (n + m)$ ，差不多是此类算法能达到的最优复杂度。
它是如何做到的？简单说，它通过分析P的特征对P进行预处理，从而在与S匹配的时候能够跳过一些字符串，达到快速匹配的目的。

原理

以 $S [] = " a b c a b c a b c d ", P [] = “ a b c d ”$ 为例， $i$ 指向 $S [i]$ ， $j$ 指向 $P [j]$ 。
在这里插入图片描述

图（c）说明KMP算法，指向 $S$ 的 $i$ 指针不会回溯，而是一直往后走到底；
同图（b）的朴素算法相比，大大减少了匹配次数。

那么KMP是如何做到 $i$ 不回溯，只回溯 $j$ 呢？ $j$ 应该怎么回溯？这就是KMP的核心—— $n e x t []$ 数组，当匹配失败后，用 $n e x t []$ 数组指出 $j$ 回溯的位置。

next
$n e x t []$ 数组是对串P预处理得到的， $n e x t$ 数组的值是除当前字符外的字符串的前缀与后缀相同的最大长度。

前缀：以 $j$ 为起点的一段字符串，终点不限(别越界)
后缀：以 $j$ 为终点的一段字符串，起点不限(别越界)

比如求串“abab”的next数组：初值赋-1；
第3个字符a之前的字符串ab中有长度为0的相同前缀后缀，所以第3个字符a对应的next值为0；
第4个字符b之前的字符串aba中有长度为1的相同前缀后缀a，所以第4个字符b对应的next值为1。

a	b	a	b
-1	0	0	1

如果还是很难消化next的原理，就先知道它的作用即可：指出 $j$ 匹配失败后回溯的位置。
知道原理后，对于代码编程实现求 $n e x t []$ 数组其实还是不好理解，虽然代码不长，自惭形秽，免得误人子弟，给出一篇大牛的详解参考，小伙伴们可以刨根究底：从头到尾彻底理解KMP

模板

void getnext(char* p, int lp) {
	nex[0] = nex[1] = 0;
	for (int i = 1; i < lp; i++) {
		int j = nex[i];
		while (j && p[i] != p[j])j = nex[j];
		nex[i + 1] = p[i] == p[j] ? j + 1 : 0;
	}
}
int kmp(char* s, char* p) {	//统计s中有多少个p
	int ans = 0;
	int ls = strlen(s), lp = strlen(p);
	getnext(p, lp);
	for (int i = 0, j = 0; i < ls; i++) {
		while (j && s[i] != p[j])j = nex[j];//失配则回溯j
		if (s[i] == p[j])j++;//匹配则继续
		if (j >= lp)ans++;//统计
	}
	return ans;
}

例题

HDU-1686Oulipo

Problem Description
The French author Georges Perec (1936–1982) once wrote a book, La disparition, without the letter ‘e’. He was a member of the Oulipo group. A quote from the book:
Tout avait Pair normal, mais tout s’affirmait faux. Tout avait Fair normal, d’abord, puis surgissait l’inhumain, l’affolant. Il aurait voulu savoir où s’articulait l’association qui l’unissait au roman : stir son tapis, assaillant à tout instant son imagination, l’intuition d’un tabou, la vision d’un mal obscur, d’un quoi vacant, d’un non-dit : la vision, l’avision d’un oubli commandant tout, où s’abolissait la raison : tout avait l’air normal mais…
Perec would probably have scored high (or rather, low) in the following contest. People are asked to write a perhaps even meaningful text on some subject with as few occurrences of a given “word” as possible. Our task is to provide the jury with a program that counts these occurrences, in order to obtain a ranking of the competitors. These competitors often write very long texts with nonsense meaning; a sequence of 500,000 consecutive 'T’s is not unusual. And they never use spaces.
So we want to quickly find out how often a word, i.e., a given string, occurs in a text. More formally: given the alphabet {‘A’, ‘B’, ‘C’, …, ‘Z’} and two finite strings over that alphabet, a word W and a text T, count the number of occurrences of W in T. All the consecutive characters of W must exactly match consecutive characters of T. Occurrences may overlap.
Input
The first line of the input file contains a single number: the number of test cases to follow. Each test case has the following format:
One line with the word W, a string over {‘A’, ‘B’, ‘C’, …, ‘Z’}, with 1 ≤ |W| ≤ 10,000 (here |W| denotes the length of the string W).
One line with the text T, a string over {‘A’, ‘B’, ‘C’, …, ‘Z’}, with |W| ≤ |T| ≤ 1,000,000.
Output
For every test case in the input file, the output should contain a single number, on a single line: the number of occurrences of the word W in the text T.
Sample Input
3
BAPC
BAPC
AZA
AZAZAZA
VERDI
AVERDXIVYERDIAN
Sample Output
1
3
0

分析：先来个模板题

#include<bits/stdc++.h>
using namespace std;
const int maxn = 10004;
int nex[maxn];
char s[maxn * 100], p[maxn];
void getnext() {
	int lp = strlen(p);
	nex[0] = nex[1] = 0;
	for (int i = 1; i < lp; i++) {
		int j = nex[i];
		while (j && p[i] != p[j])j = nex[j];
		nex[i + 1] = p[i] == p[j] ? j + 1 : 0;
	}
}
int kmp() {
	int ans = 0;
	int ls = strlen(s), lp = strlen(p);
	getnext();
	for (int i = 0, j = 0; i < ls; i++) {
		while (j && s[i] != p[j])j = nex[j];
		if (s[i] == p[j])j++;
		if (j >= lp)ans++;
	}
	return ans;
}
int main() {
	int t;
	scanf("%d", &t);
	while (t--) {
		scanf("%s%s", p, s);
		printf("%d\n", kmp());
	}
	return 0;
}

HDU-2087剪花布条

Problem Description
一块花布条，里面有些图案，另有一块直接可用的小饰条，里面也有一些图案。对于给定的花布条和小饰条，计算一下能从花布条中尽可能剪出几块小饰条来呢？
Input
输入中含有一些数据，分别是成对出现的花布条和小饰条，其布条都是用可见ASCII字符表示的，可见的ASCII字符有多少个，布条的花纹也有多少种花样。花纹条和小饰条不会超过1000个字符长。如果遇见#字符，则不再进行工作。
Output
输出能从花纹布中剪出的最多小饰条个数，如果一块都没有，那就老老实实输出0，每个结果之间应换行。
Sample Input
abcde a3
aaaaaa aa
#
Sample Output
0
3

分析：找到能分开的子串数量，套用KMP，统计的时候判断与上一个是否重合即可。

#include<bits/stdc++.h>
using namespace std;
const int maxn = 1003;
int nex[maxn];
char p[maxn], s[maxn];
void getnext() {
	nex[0] = nex[1] = 0;
	int lp = strlen(p);
	for (int i = 1; i < lp; i++) {
		int j = nex[i];
		while (j && p[i] != p[j])j = nex[j];
		nex[i + 1] = p[i] == p[j] ? j + 1 : 0;
	}
}
int kmp() { 
	int ls = strlen(s), lp = strlen(p);
	getnext();
	int last = -1; //指向上一个匹配的末尾
	int ans = 0;
	for (int i = 0, j = 0; i < ls; i++) {
		while (j && s[i] != p[j])j = nex[j];
		if (s[i] == p[j])j++;
		if (j >= lp) { //若完全匹配
			if (i - last >= lp) { //且与上一个不重合
				ans++;
				last = i;
			}
		}
	}
	return ans;
}
int main() {
	while (~scanf("%s", s)) {
		if (s[0] == '#')break;
		scanf("%s", p);
		printf("%d\n", kmp());
	}
	return 0;
}

POJ-2752Seek the Name, Seek the Fame

Description
The little cat is so famous, that many couples tramp over hill and dale to Byteland, and asked the little cat to give names to their newly-born babies. They seek the name, and at the same time seek the fame. In order to escape from such boring job, the innovative little cat works out an easy but fantastic algorithm:
Step1. Connect the father’s name and the mother’s name, to a new string S.
Step2. Find a proper prefix-suffix string of S (which is not only the prefix, but also the suffix of S).
Example: Father=‘ala’, Mother=‘la’, we have S = ‘ala’+‘la’ = ‘alala’. Potential prefix-suffix strings of S are {‘a’, ‘ala’, ‘alala’}. Given the string S, could you help the little cat to write a program to calculate the length of possible prefix-suffix strings of S? (He might thank you by giving your baby a name:)
Input
The input contains a number of test cases. Each test case occupies a single line that contains the string S described above.
Restrictions: Only lowercase letters may appear in the input. 1 <= Length of S <= 400000.
Output
For each test case, output a single line with integer numbers in increasing order, denoting the possible length of the new baby’s name.
Sample Input
ababcababababcabab
aaaaa
Sample Output
2 4 9 18
1 2 3 4 5

分析：问前缀和后缀相同的长度可能是多少？就是考察对 $n e x t []$ 的理解，设 $l e n$ 为字符串长度，首先原字符串肯定是自己前缀和后缀，len是答案，然后 $n e x [l e n]$ 就是代表这个字符串的最大前缀后缀相同长度，然后将这个长度为 $n e x [l e n]$ 的子串继续分析前缀后缀最大相同长度，递归求解即可。
眼尖选手可以直接打表套样例找规律。

#include<cstdio>
#include<cstring>
#include<vector>
using namespace std;
const int maxn = 400005;
int nex[maxn], ls;
char s[maxn];
void getnext() {
	nex[0] = nex[1] = 0;
	for (int i = 1; i < ls; i++) {
		int j = nex[i];
		while (j && s[i] != s[j])
			j = nex[j];
		nex[i + 1] = s[i] == s[j] ? j + 1 : 0;
	}
}
int main() {
	while (~scanf("%s", s)) {
		ls = strlen(s);
		getnext();
		int idx = ls ;
		vector<int>ans;
		while (nex[idx]) {
			ans.push_back(nex[idx] );
			idx = nex[idx];
		}
		int len = ans.size();
		for (int i = len - 1; i >= 0; i--)
			printf("%d ", ans[i]);
		printf("%d\n", ls);
	}
	return 0;
}

POJ-2406Power Strings

Description
Given two strings a and b we define ab to be their concatenation. For example, if a = “abc” and b = “def” then ab = “abcdef”. If we think of concatenation as multiplication, exponentiation by a non-negative integer is defined in the normal way: a^0 = “” (the empty string) and a^(n+1) = a*(a^n).
Input
Each test case is a line of input representing s, a string of printable characters. The length of s will be at least 1 and will not exceed 1 million characters. A line containing a period follows the last test case.
Output
For each s you should print the largest n such that s = a^n for some string a.
Sample Input
abcd
aaaa
ababab
.
Sample Output
1
4
3
Hint
This problem has huge input, use scanf instead of cin to avoid time limit exceed.

分析：问字符串是由多少个子串循环组成 ?还是对 $n e x t []$ 的考察， $n e x t []$ 是前缀和后缀最大相同长度，那么 $n - n e x t [n]$ 就是可能的循环节长度，再判断下是否为0及是否整除，否则循环节长度为1。举图说明，也可以打表观察。
~~居然不给n范围，我测出来是1e6~~
在这里插入图片描述

#include<cstdio>
#include<cstring>
using namespace std;
const int maxn = 1000006;
int nex[maxn], ls;
char s[maxn];
void getnext() {
	nex[0] = nex[1] = 0;
	for (int i = 1; i < ls; i++) {
		int j = nex[i];
		while (j && s[i] != s[j])
			j = nex[j];
		nex[i + 1] = s[i] == s[j] ? j + 1 : 0;
	}
}
int main() {
	while (~scanf("%s", s)) {
		if (s[0] == '.')break;
		ls = strlen(s);
		getnext();
		int len = ls - nex[ls];
		if (len && ls % len == 0)
			printf("%d\n", ls / len);
		else puts("1");
	}
	return 0;
}

后记

小结一下，KMP是单模式匹配算法，复杂度是 $O (n + m)$ ，多模匹配算法用AC自动机。其中需要重点理解的是 $n e x t []$ ，很多衍生题都是从 $n e x [] t$ 切入的，如果没思路可以试试打表套样例找规律，多观察 $n e x t []$ 。

原创不易，请勿转载（~~本不富裕的访问量雪上加霜~~ ）
博主首页：https://blog.csdn.net/qq_45034708
如果文章对你有帮助，记得一键三连❤

吾仄lo咚锵

关注

68
点赞
踩
8

收藏

觉得还不错? 一键收藏
打赏
15
评论
字符串-KMP

HDU-1686OulipoHDU-2087剪花布条POJ-2752Seek the Name, Seek the FamePOJ-2406Power StringsKMP模板KMP是单模式匹配算法，即在一个长度为nn的文本串S中查找一个长度mm的模式串P。它的复杂度是O(n+m)O(n+m)，差不多是此类算法能达到的最优复杂度。它是如何做到的？简单说，它通过分析P的特征对P进行预处理，从而在与S匹配的时候能够跳过一些字符串，达到快速匹配的目的。
复制链接

扫一扫