Data Structures · KMP · Notes

jmu_hjc

Last modified on 2024-07-17 10:56:13


Category: C++. Tags: data structures

First published on 2022-05-31 22:26:26

Article link: https://blog.csdn.net/qq_61179907/article/details/125065069


Data Structures · KMP

KMP
Algorithm problem: Oulipo
Summary

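KMP

1. Introduction

The KMP algorithm quickly finds a substring inside a main string.
Quickly: it does not compare character by character and then, on a mismatch, shift the pattern by a single position and start over.
Substring: from here on, the substring being searched for is called the pattern.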
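
For contrast, the one-position-at-a-time approach that KMP avoids can be sketched roughly as follows. This is only an illustration and not code from these notes; the name naiveFind and the 0-based return value are assumptions made for the example.

```cpp
#include <string>

// Hypothetical helper (not from the original notes): brute-force search.
// Returns the 0-based index of the first occurrence of pattern in text, or -1.
int naiveFind(const std::string& text, const std::string& pattern) {
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    // Try every alignment of the pattern against the text.
    for (int start = 0; start + m <= n; ++start) {
        int k = 0;
        while (k < m && text[start + k] == pattern[k]) ++k;
        if (k == m) return start;   // all m characters matched
        // On a mismatch everything learned so far is thrown away
        // and the pattern moves right by exactly one position.
    }
    return -1;
}
```

In the worst case every alignment re-compares almost the whole pattern, which is the repeated work KMP is designed to skip.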

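2. A worked example of KMP

The red box marks the fully matched part; blue marks the longest common prefix/suffix.
When a mismatch occurs (the first pair outside the red box, where the main string has B and the pattern has A), look for the longest common prefix/suffix inside the fully matched part.
Next, slide the pattern so that its prefix lands where the suffix was. At the next mismatch, again look for the longest common prefix/suffix of the part matched so far.
If the pattern extends past the end of the main string, the match fails.
[Figure]
Another example, this time a successful match:
[Figure]

Compared with shifting one position at a time, KMP greatly reduces the work needed to find a substring.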

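3. What is a common prefix/suffix, and how do we find it?

Common prefix/suffix:
A string that appears both at the start (as a prefix) and at the end (as a suffix) of the fully matched part of the pattern.
Prefixes are read from the front going right; suffixes are read from the back going left.
For example, ABCD has the prefixes A, AB, ABC, ABCD and the suffixes D, CD, BCD, ABCD. KMP only uses prefixes and suffixes that are shorter than the whole string, so ABCD itself does not count.

Longest common prefix/suffix:
As the name suggests, the longest of all the common prefix/suffixes.
Look for it to the left of the comparison pointer (the position where the main string and the pattern differ), that is, inside the part that has already matched.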
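
The definition can be checked directly with a brute-force sketch like the one below. It is not how KMP computes the value (the next array in section 4 does that far more efficiently); the function name longestCommonPrefixSuffix is an assumption made for this illustration.

```cpp
#include <string>

// Hypothetical helper (not from the original notes).
// Returns the length of the longest proper prefix of s that is also a suffix of s.
// "Proper" means shorter than s itself, matching the rule used by KMP.
int longestCommonPrefixSuffix(const std::string& s) {
    const int n = static_cast<int>(s.size());
    // Try the longest candidate first and shrink until one fits.
    for (int len = n - 1; len > 0; --len) {
        // Compare the prefix s[0, len) with the suffix s[n - len, n).
        if (s.compare(0, len, s, n - len, len) == 0) return len;
    }
    return 0;   // no common prefix/suffix
}
```

For ABCD the result is 0, because none of A, AB, ABC equals D, CD, BCD. For ABABA the result is 3 (ABA).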

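4. The next array

To implement KMP in code we also need the next array. As the name suggests, it records what to do next: how the pattern should be moved after a mismatch.
[Figure]
Some terminology:
Position 1, position 2: the first and second characters of the fully matched part.
Current position: where the comparison pointer is, i.e. the (first) position at which the characters differ.

Rule: when the longest common prefix/suffix has length n, position n+1 of the pattern is compared with the current position of the main string.

Note: positions here are conventionally numbered from 1.

The value stored in next is then the number of the pattern position to compare next.
For example:

Index  0       1  2  3  4  5  6  7  8  9  10  11  12
next   unused  0  1  1  2  3  4  2  2  3  4   5   6

If the next value of a position can be derived from the next values of the positions before it, that recurrence gives us the code for building the array.

Suppose next[j]=t is already known.
Analysis:
[Figure]
If the pattern characters at positions j and t are equal, the common prefix/suffix grows by one, so next[j+1]=t+1. Otherwise fall back with t=next[t] and compare again. Once t reaches 0 nothing is left to fall back to, and next[j+1]=1.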

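Code:

// Assumed string type (not shown in the original notes): characters are stored in ch[1..length]
typedef struct
{
	char *ch;
	int length;
} Str;

// Fill next[1..substr.length] for the pattern substr
void getNext(Str substr,int next[])
{
	int j=1,t=0;
	next[1]=0;
	while(j<substr.length)
	{
		if(t==0||substr.ch[j]==substr.ch[t])
		{
			// the common prefix/suffix grows by one character
			next[j+1]=t+1;
			++t;
			++j;
		}else t=next[t];	// fall back to a shorter common prefix/suffix
	}
}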
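
To try the recurrence without the Str type, here is a minimal sketch of the same loop on std::string. The wrapper name buildNext and the padding trick that keeps positions numbered from 1 are assumptions of this example, not part of the original notes.

```cpp
#include <string>
#include <vector>

// Hypothetical wrapper (not from the original notes): the getNext recurrence on std::string.
// Returns next[0..m]; slot 0 is unused so that positions are numbered from 1.
std::vector<int> buildNext(const std::string& pattern) {
    const int m = static_cast<int>(pattern.size());
    const std::string p = " " + pattern;   // pad so that p[1] is the first character
    std::vector<int> next(m + 1, 0);       // next[1] = 0 by initialization
    int j = 1, t = 0;
    while (j < m) {
        if (t == 0 || p[j] == p[t]) {
            next[j + 1] = t + 1;           // common prefix/suffix grows by one
            ++t;
            ++j;
        } else {
            t = next[t];                   // fall back to a shorter candidate
        }
    }
    return next;
}
```

With the assumed sample pattern ABABAAABABAA, positions 1 to 12 come out as 0 1 1 2 3 4 2 2 3 4 5 6, the same values as in the table in this section. The pattern behind that table is not given in the text, so treating it as this string is a guess.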

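5. The nextval array

In one special case KMP can be optimized further: when the character that next[j] points to equals the character at position j, comparing it after a mismatch at j is bound to fail again.
[Figure]
Recurrence (P is the pattern):

j=1: nextval[1]=0
j>1 and P[j]!=P[next[j]]: nextval[j]=next[j]
j>1 and P[j]==P[next[j]]: nextval[j]=nextval[next[j]]
Code: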

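// Fill nextval[1..substr.length] in one pass, without building next first
void getNextval(Str substr,int nextval[])
{
	int j=1,t=0;
	nextval[1]=0;
	while(j<substr.length)
	{
		if(t==0||substr.ch[j]==substr.ch[t])
		{
			// t+1 is what next[j+1] would be; keep it only if the two characters differ
			if(substr.ch[j+1]!=substr.ch[t+1])
				nextval[j+1]=t+1;
			else
				// the same character would mismatch again, so reuse its fallback
				nextval[j+1]=nextval[t+1];
			++t;++j;
		}else	t=nextval[t];
	}
}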
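
The three-case recurrence can also be applied literally to a next array that has already been computed. The sketch below does that; it is a different route from the one-pass getNextval above, and the name nextvalFromNext and the vector layout (slot 0 unused) are assumptions of this example.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the original notes).
// Derives nextval from an existing next array; both are 1-indexed with slot 0 unused,
// and next must have pattern.size() + 1 entries.
std::vector<int> nextvalFromNext(const std::string& pattern, const std::vector<int>& next) {
    const int m = static_cast<int>(pattern.size());
    const std::string p = " " + pattern;    // pad so that p[1] is the first character
    std::vector<int> nextval(m + 1, 0);     // nextval[1] = 0 by initialization
    for (int j = 2; j <= m; ++j) {
        if (p[j] != p[next[j]]) {
            nextval[j] = next[j];           // fallback character differs: keep next[j]
        } else {
            nextval[j] = nextval[next[j]];  // same character: skip straight to its fallback
        }
    }
    return nextval;
}
```

For the assumed pattern AAAAB, next is 0 1 2 3 4 and nextval becomes 0 0 0 0 4: a mismatch on any of the leading A characters no longer retries the earlier A characters one by one.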

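6. KMP in practice

With the next array:

// Fill next[1..substr.length] for the pattern substr
void getNext(Str substr,int next[])
{
	int j=1,t=0;
	next[1]=0;
	while(j<substr.length)
	{
		if(t==0||substr.ch[j]==substr.ch[t])
		{
			next[j+1]=t+1;
			++t;
			++j;
		}else t=next[t];
	}
}
// Returns the 1-based position of the first match of substr in str, or 0 if there is none
int KMP(Str str,Str substr,int next[])
{
	int i=1,j=1;
	// i walks the main string and never moves backwards; j walks the pattern
	while(i<=str.length&&j<=substr.length)
	{
		if(j==0||str.ch[i]==substr.ch[j])
		{++i;++j;}
		else {j=next[j];}	// mismatch: only the pattern pointer falls back
	}
	if(j>substr.length)
		return i-substr.length;
	else
		return 0;
}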

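With the nextval array:

// Fill nextval[1..substr.length] for the pattern substr
void getNextval(Str substr,int nextval[])
{
	int j=1,t=0;
	nextval[1]=0;
	while(j<substr.length)
	{
		if(t==0||substr.ch[j]==substr.ch[t])
		{
			if(substr.ch[j+1]!=substr.ch[t+1])
				nextval[j+1]=t+1;
			else
				nextval[j+1]=nextval[t+1];
			++t;++j;
		}else	t=nextval[t];
	}
}
// Returns the 1-based position of the first match of substr in str, or 0 if there is none
int KMP(Str str,Str substr,int nextval[])
{
	int i=1,j=1;
	// i walks the main string and never moves backwards; j walks the pattern
	while(i<=str.length&&j<=substr.length)
	{
		if(j==0||str.ch[i]==substr.ch[j])
		{++i;++j;}
		else {j=nextval[j];}	// mismatch: only the pattern pointer falls back
	}
	if(j>substr.length)
		return i-substr.length;
	else
		return 0;
}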
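
The functions in this section rely on a Str type and a caller-provided array. A self-contained way to experiment with the same 1-indexed logic is sketched below on std::string. The name kmpFind, the padding trick, and the choice to treat an empty pattern as "not found" are all assumptions of this example.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the original notes).
// Returns the 1-based position of the first occurrence of pattern in text, or 0 if absent.
int kmpFind(const std::string& text, const std::string& pattern) {
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    if (m == 0 || m > n) return 0;          // assumption: empty pattern counts as not found
    const std::string s = " " + text;       // pad so that s[1] is the first character
    const std::string p = " " + pattern;    // pad so that p[1] is the first character

    // Build next[1..m] with the recurrence from section 4.
    std::vector<int> next(m + 1, 0);
    for (int j = 1, t = 0; j < m; ) {
        if (t == 0 || p[j] == p[t]) { next[j + 1] = t + 1; ++t; ++j; }
        else t = next[t];
    }

    // Scan the text; i never moves backwards, only j falls back.
    int i = 1, j = 1;
    while (i <= n && j <= m) {
        if (j == 0 || s[i] == p[j]) { ++i; ++j; }
        else j = next[j];
    }
    return j > m ? i - m : 0;
}
```

For example, kmpFind("ABABCABAB", "ABCAB") gives 3, and kmpFind("AAAA", "B") gives 0.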

Algorithm problem: Oulipo

Original problem statement

Description

The French author Georges Perec (1936–1982) once wrote a book, La disparition, without the letter ‘e’. He was a member of the Oulipo group. A quote from the book:

Tout avait l’air normal, mais tout s’affirmait faux. Tout avait l’air normal, d’abord, puis surgissait l’inhumain, l’affolant. Il aurait voulu savoir où s’articulait l’association qui l’unissait au roman : sur son tapis, assaillant à tout instant son imagination, l’intuition d’un tabou, la vision d’un mal obscur, d’un quoi vacant, d’un non-dit : la vision, l’avision d’un oubli commandant tout, où s’abolissait la raison : tout avait l’air normal mais…

Perec would probably have scored high (or rather, low) in the following contest. People are asked to write a perhaps even meaningful text on some subject with as few occurrences of a given “word” as possible. Our task is to provide the jury with a program that counts these occurrences, in order to obtain a ranking of the competitors. These competitors often write very long texts with nonsense meaning; a sequence of 500,000 consecutive ‘T’s is not unusual. And they never use spaces.

So we want to quickly find out how often a word, i.e., a given string, occurs in a text. More formally: given the alphabet {‘A’, ‘B’, ‘C’, …, ‘Z’} and two finite strings over that alphabet, a word W and a text T, count the number of occurrences of W in T. All the consecutive characters of W must exactly match consecutive characters of T. Occurrences may overlap.

Input

The first line of the input file contains a single number: the number of test cases to follow. Each test case has the following format:

One line with the word W, a string over {‘A’, ‘B’, ‘C’, …, ‘Z’}, with 1 ≤ |W| ≤ 10,000 (here |W| denotes the length of the string W).
One line with the text T, a string over {‘A’, ‘B’, ‘C’, …, ‘Z’}, with |W| ≤ |T| ≤ 1,000,000.

Output

For every test case in the input file, the output should contain a single number, on a single line: the number of occurrences of the word W in the text T.

Problem summary

The story about Georges Perec and the writing contest is background only. The task itself: given a word W and a text T, both made of the uppercase letters A to Z and written without spaces, count how many times W occurs in T. Every character of W must match consecutive characters of T, and occurrences may overlap. Texts can be very long (a run of 500,000 consecutive T characters is not unusual), so the count has to be computed quickly.

Sample input

3
BAPC
BAPC
AZAZAZA
AZA
AVERDXIVYERDIAN
VERDI

Sample output

1
3
0
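
Before writing the KMP solution, a small brute-force counter is handy for checking answers on short inputs. The sketch below uses std::string::find and advances by one position after each hit so that overlapping occurrences are counted. It is a reference for testing only, and the name countOverlapping is an assumption of this example.

```cpp
#include <string>

// Hypothetical reference counter (not from the original notes).
// Counts occurrences of word in text, overlaps included.
int countOverlapping(const std::string& text, const std::string& word) {
    if (word.empty()) return 0;             // assumption: an empty word has no occurrences
    int count = 0;
    std::string::size_type pos = text.find(word);
    while (pos != std::string::npos) {
        ++count;
        pos = text.find(word, pos + 1);     // step by one, not by word length, to allow overlaps
    }
    return count;
}
```

On the three sample cases it returns 1, 3 and 0: BAPC occurs once in BAPC, AZA occurs three times in AZAZAZA (starting at the 1st, 3rd and 5th characters), and VERDI does not occur in AVERDXIVYERDIAN.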

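Code

#include<iostream>
#include<cstdio>
#include<string>
#include<vector>
using namespace std;

// arr[k]: length of the longest proper prefix of T[0..k-1] that is also its suffix; arr[0] = -1
vector<int> arr;

// Build the 0-indexed next array for the pattern T
void getNext(const string& T){
    int t=-1;int i=0;
    int m=(int)T.length();
    arr.assign(m+1,0);   // sized to the pattern; a fixed int[1000] overflows for patterns of up to 10,000 characters
    arr[i]=t;
    while(i<m){
        if(t==-1||T[i]==T[t]){
            arr[i+1]=t+1;
            i++;t++;
        }else t=arr[t];
    }
}

// Count the occurrences of pattern T in main string P; overlaps are allowed
int KMP(const string& P,const string& T){
    getNext(T);
    int n=(int)P.length(),m=(int)T.length();
    int i=0,j=0,number=0;
    while(i<n){
        if(j==-1||P[i]==T[j]){
            i++;j++;
        }else j=arr[j];
        if(j==m){
            number++;
            j=arr[j];   // fall back instead of restarting, so overlapping matches are found
        }
    }
    return number;
}

int main(int argc,char** argv){
    int n;
    scanf("%d",&n);
    while(n--){
        // Reads in the order of the sample above: main string P first, then pattern T.
        // The Input section of the statement lists the word before the text; swap the two reads for that order.
        string P,T;
        cin>>P>>T;
        printf("%d\n",KMP(P,T));
    }

    return 0;
}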

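Summary

Below, N is the length of the main string and M is the length of the pattern.

Naive pattern matching:
Every substring of the main string that has the same length as the pattern is compared against the pattern.
Time complexity: O(MN)
KMP:
During matching we already know which characters of the main string have matched; KMP uses them to skip work.
Find the longest common prefix/suffix; its length must be less than the length of the matched part.
Pointers i and j mark the positions being compared; after a mismatch, j moves to the position just after the longest common prefix.
Time complexity: O(M+N)
next array:
next[j] is the length of the longest common prefix/suffix of the characters before position j, plus 1.
next[1]=0 and next[2]=1 are fixed. The subscripts refer to the first and second characters of the string and can be adapted to the storage convention (0-based or 1-based).
Time complexity: O(M)
nextval array:
The character that next points to may equal the current character. If the current character fails to match, that one is bound to fail too, so nextval skips it and takes the nextval of the position that next points to.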
