Extended KMP Algorithm: A Detailed Explanation, with a Worked Problem (HDU 2328)

古城白衣少年i

Article link: https://blog.csdn.net/weixin_44757834/article/details/104143648

Reference blog: https://blog.csdn.net/qq_40160605/article/details/80407554

Understanding extended KMP in detail

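Extended KMP computes, for every suffix of the text S1, the length of its longest common prefix with the pattern S2. It uses two arrays, next[] and extend[]. Below, l1 and l2 denote the lengths of S1 and S2, and all indices start at 0.

next[i] is the length of the longest common prefix of the pattern S2 and the suffix of S2 that starts at position i.

In particular, next[0]=l2;

next[i]=max{ k | 0<=k<=l2-i && S2.substring(i,i+k-1) == S2.substring(0,k-1) }

Here str.substring(i, j) denotes the substring of str from position i to position j inclusive; if i>j, the substring is empty.

extend[i] is the length of the longest common prefix of the pattern S2 and the suffix of S1 that starts at position i.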
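Before looking at the fast algorithm, it can help to see both definitions written out directly. The following is a minimal brute-force sketch, not the method of this article; the names lcp_at, naive_next and naive_extend are introduced here for illustration only. It runs in O(l1*l2) time and is mainly useful as a reference to check a faster implementation against.

```cpp
#include <string>
#include <vector>

// Length of the longest common prefix of a.substr(i) and b, by direct comparison.
int lcp_at(const std::string& a, int i, const std::string& b) {
    int k = 0;
    while (i + k < static_cast<int>(a.size()) && k < static_cast<int>(b.size()) &&
           a[i + k] == b[k]) {
        ++k;
    }
    return k;
}

// next[i]: longest common prefix of s2.substr(i) and s2, straight from the definition.
std::vector<int> naive_next(const std::string& s2) {
    std::vector<int> nxt(s2.size());
    for (int i = 0; i < static_cast<int>(s2.size()); ++i) nxt[i] = lcp_at(s2, i, s2);
    return nxt;
}

// extend[i]: longest common prefix of s1.substr(i) and s2, straight from the definition.
std::vector<int> naive_extend(const std::string& s1, const std::string& s2) {
    std::vector<int> ext(s1.size());
    for (int i = 0; i < static_cast<int>(s1.size()); ++i) ext[i] = lcp_at(s1, i, s2);
    return ext;
}
```

For example, naive_next("abab") gives {4, 0, 2, 0}, and naive_extend("ababab", "aba") gives {3, 0, 3, 0, 2, 0}.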

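Let us first walk through an example to see how extended KMP proceeds. In the steps below, S1[i,j] is shorthand for S1.substring(i,j).

(1) First, compare S1 and S2 character by character from the start until a mismatch occurs. Here S1[0]=S2[0], S1[1]=S2[1], S1[2]=S2[2], and S1[3]≠S2[3], so matching stops. We get S1[0,2]=S2[0,2], i.e. extend[0]=3;

(2) extend[0] was computed as in step (1). Does extend[1] also have to be computed by comparing from position 1 of S1 against position 0 of S2? This repeated work is exactly what extended KMP avoids. From extend[0]=3 we know S1[0,2]=S2[0,2], hence S1[1,2]=S2[1,2]. Since next[1]=4, S2[1,4]=S2[0,3], so S2[1,2]=S2[0,1], which gives S1[1,2]=S2[1,2]=S2[0,1]. Matching then continues with the next pair: S1[3]≠S2[2], so it fails and extend[1]=2;

(3) Since extend[1]=2, S1[1,2]=S2[0,1], so S1[2,2]=S2[1,1]. Since next[1]=4, S2[1,4]=S2[0,3], so S2[1,1]=S2[0,0], which gives S1[2,2]=S2[0,0]. Matching continues with the next pair: S1[3]≠S2[1], so it fails and extend[2]=1;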

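(4) Next comes position 3 of S1, which is the position where matching failed, i.e. the farthest matched position so far plus 1. Nothing is known about it yet, so we fall back to the method of step (1): compare from position 3 of S1 against position 0 of S2 until a mismatch, and the number of matched characters is the extend[] value. Since S1[3]≠S2[0], the match length is 0, i.e. extend[3]=0;

(5) Compute extend[] for position 4 of S1. Position 4 has not been matched either, so we again use the method of step (1) and compare from position 4 of S1 against position 0 of S2. We find S1[4]=S2[0], S1[5]=S2[1], S1[6]=S2[2], S1[7]=S2[3], S1[8]=S2[4], and S1[9]≠S2[5]. The mismatch occurs at position 9 of S1, so the match reaches position 8 at most: S1[4,8]=S2[0,4], the match length is 5, i.e. extend[4]=5;

(6) Compute extend[] for position 5 of S1. Position 5 has already been matched (in step (5)). From extend[4]=5 we have S1[4,8]=S2[0,4], hence S1[5,8]=S2[1,4]. As in step (2), next[1]=4 gives S2[0,3]=S2[1,4], so S1[5,8]=S2[0,3]. We then continue with position 9 of S1 and position 4 of S2: S1[9]=S2[4], and then S1[10]=S2[5]. All characters of S1 are now matched, so S1[5,10]=S2[0,5] and extend[5]=6;

(7) The extend[] values for positions 6 to 10 of S1 are computed in much the same way as those for positions 1 to 3. From extend[5]=6, S1[6,10]=S2[1,5]. Since next[1]=4, S2[1,4]=S2[0,3], so S1[6,9]=S2[0,3]. No further character needs to be compared here, and extend[6]=4 directly (the reason is given in the summary below);

(8) S1[7,10]=S2[2,5]. Since next[2]=3, S2[2,4]=S2[0,2], so S1[7,9]=S2[0,2]. The match length is 3, i.e. extend[7]=3;

(9) S1[8,10]=S2[3,5]. Since next[3]=2, S2[3,4]=S2[0,1], so S1[8,9]=S2[0,1]. The match length is 2, i.e. extend[8]=2;

(10) S1[9,10]=S2[4,5]. Since next[4]=1, S2[4,4]=S2[0,0], so S1[9,9]=S2[0,0]. The match length is 1, i.e. extend[9]=1;

(11) S1[10,10]=S2[5,5]. Since next[5]=0, the match length is 0, i.e. extend[10]=0;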
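The example strings themselves are not shown in this copy of the article. The equalities and mismatches in steps (1) to (11) are consistent with S1 = "aaabaaaaaab" and S2 = "aaaaab"; these two strings are a reconstruction and an assumption, not necessarily the ones the author used. Under that assumption, the short sketch below recomputes both arrays by brute force with std::mismatch (the helper name lcp_table is hypothetical), so the values from the walkthrough can be checked.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Brute-force longest common prefix of every suffix of text with pat.
std::vector<int> lcp_table(const std::string& text, const std::string& pat) {
    std::vector<int> out;
    for (auto it = text.begin(); it != text.end(); ++it) {
        // Four-iterator std::mismatch (C++14) stops at the first difference
        // or when either range ends.
        auto p = std::mismatch(it, text.end(), pat.begin(), pat.end());
        out.push_back(static_cast<int>(p.second - pat.begin()));
    }
    return out;
}

// Assumed example strings, reconstructed from the steps of the walkthrough.
const std::string S1 = "aaabaaaaaab";
const std::string S2 = "aaaaab";
// lcp_table(S2, S2) -> 6 4 3 2 1 0            (next[])
// lcp_table(S1, S2) -> 3 2 1 0 5 6 4 3 2 1 0  (extend[])
```

The resulting extend[] values 3, 2, 1, 0, 5, 6, 4, 3, 2, 1, 0 agree with steps (1) to (11).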

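That completes the matching. If you followed the example closely, you should now have a fair picture of extended KMP: the computation consists mainly of the procedures in step (1) and step (2). We summarize them as follows:

Suppose extend[0..k] have already been computed, and let P be the farthest position of S1 reached by any match so far, where that match started at position Po; that is, Po+extend[Po]-1=P;

Step one: extend[k+1] is needed and position k+1 of S1 has not been matched yet (k+1>P). Compare S1 from position k+1 against S2 from position 0 until a mismatch; extend[k+1] is the match length. Po and the farthest matched position P are updated accordingly.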

Step two: extend[k+1] is needed and position k+1 of S1 has already been matched (k+1<=P). From Po+extend[Po]-1=P we know S1[Po,P]=S2[0,P-Po], so S1[k+1,P]=S2[k+1-Po,P-Po]. Let len=next[k+1-Po].

(1) When (k+1)+len-1=k+len<P, i.e. the copied match ends before P:

By the definition of len=next[k+1-Po], S2[0,len-1]=S2[k+1-Po,k+len-Po]. Therefore S1[k+1,k+len]=S2[k+1-Po,k+len-Po]=S2[0,len-1], i.e. extend[k+1]=len;

Could it happen that S1[k+len+1]=S2[len]? The answer is no.

Suppose S1[k+len+1]=S2[len]. Then S1[k+1,k+len+1]=S2[0,len].

Because k+len<P, we have k+len+1<=P,

so S1[k+1,k+len+1]=S2[k+1-Po,k+len+1-Po]=S2[0,len].

That would mean next[k+1-Po]>=len+1, which contradicts len=next[k+1-Po]. Hence S1[k+len+1]≠S2[len], and no further comparison is needed.

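(2) When (k+1)+len-1=k+len>=P, i.e. the copied match reaches P or goes beyond it:

From S1[Po,P]=S2[0,P-Po] we get S1[k+1,P]=S2[k+1-Po,P-Po]. With len=next[k+1-Po], S2[0,len-1]=S2[k+1-Po,k+len-Po],

and since k+len>=P this range covers S2[k+1-Po,P-Po], so S1[k+1,P]=S2[0,P-k-1].

Positions beyond P have not been matched yet, so we compare S1 from position P+1 against S2 from position P-k until a mismatch, and then update Po and the farthest matched position P. In this case extend[k+1]=P-k plus the number of characters matched afterwards (with P taken before the update).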
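The two steps above can be sketched as a single loop. This is a minimal sketch under stated assumptions, not the article's final listing: next[] for s2 is assumed to be available already and is passed in as nxt, the function name extend_with_next is made up for illustration, and p_end stands for P+1 so that "nothing matched yet" is simply p_end equal to 0.

```cpp
#include <string>
#include <vector>

// Computes extend[] for s1 against s2, given nxt = next[] of s2.
// po is the start of the match that reaches farthest;
// p_end = po + ext[po] is one past the farthest matched position P.
std::vector<int> extend_with_next(const std::string& s1, const std::string& s2,
                                  const std::vector<int>& nxt) {
    const int n = static_cast<int>(s1.size());
    const int m = static_cast<int>(s2.size());
    std::vector<int> ext(n, 0);
    int po = 0, p_end = 0;
    for (int i = 0; i < n; ++i) {
        if (i < p_end && i + nxt[i - po] < p_end) {
            // Step two, case (1): the copied value ends before P, no comparison needed.
            ext[i] = nxt[i - po];
        } else {
            // Step one (i not matched yet) or step two, case (2):
            // reuse what is known up to P, then compare character by character.
            int j = (i < p_end) ? p_end - i : 0;
            while (i + j < n && j < m && s1[i + j] == s2[j]) ++j;
            ext[i] = j;
            po = i;         // this match now reaches farthest
            p_end = i + j;
        }
    }
    return ext;
}
```

With s1 = "ababab", s2 = "aba" and nxt = {3, 0, 1}, the function returns {3, 0, 3, 0, 2, 0}.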

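In fact, next[] is computed in essentially the same way as extend[]: it can be seen as running extended KMP with S2 as both the text and the pattern. When next[k+1] is being computed, next[i] (0<=i<=k) are already known, and the extend[k+1] value obtained in this self-matching is exactly next[k+1]. The only difference is that next[0]=l2 is set directly and the scan starts from position 1.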
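As a minimal sketch of this self-matching idea (the function name build_next and the half-open window [po, p_end) are choices made here, not taken from the article), the routine below fills next[] from left to right and only ever reads entries that are already known.

```cpp
#include <string>
#include <vector>

// next[] for s2, computed by matching s2 against itself.
// Position 0 is set directly and kept out of the (po, p_end) window;
// otherwise every later position would just copy its own, still unknown, value.
std::vector<int> build_next(const std::string& s2) {
    const int m = static_cast<int>(s2.size());
    std::vector<int> nxt(m, 0);
    if (m == 0) return nxt;
    nxt[0] = m;
    int po = 0, p_end = 0;  // s2[po..p_end-1] == s2[0..p_end-po-1]
    for (int i = 1; i < m; ++i) {
        if (i < p_end && i + nxt[i - po] < p_end) {
            nxt[i] = nxt[i - po];                 // case (1): value can be copied
        } else {
            int j = (i < p_end) ? p_end - i : 0;  // case (2): resume after P
            while (i + j < m && s2[i + j] == s2[j]) ++j;
            nxt[i] = j;
            po = i;
            p_end = i + j;
        }
    }
    return nxt;
}
```

For the string "aaaaab" this returns {6, 4, 3, 2, 1, 0}, which agrees with the values next[1]=4, next[3]=2, next[4]=1 and next[5]=0 quoted in the walkthrough.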

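Time complexity analysis

From the algorithm above, each character of S1 is matched successfully at most once, because the farthest matched position P only moves forward, and extend[k+1] is derived from the earlier values extend[i] (0<=i<=k). The pattern S2 is preprocessed with the same method, so the total time complexity of extended KMP is O(l1+l2), where l1 is the length of S1 and l2 is the length of S2.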
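The bound can be made explicit with a short counting argument. This is a sketch; the symbols C_ok and C_fail (numbers of successful and failed character comparisons while computing extend[]) are introduced here. Every successful comparison moves P one position to the right and P never moves back, and each position k+1 causes at most one failed comparison, so:

```latex
\[
C_{\mathrm{ok}} \le l_1, \qquad C_{\mathrm{fail}} \le l_1
\quad\Longrightarrow\quad
C_{\mathrm{extend}} = C_{\mathrm{ok}} + C_{\mathrm{fail}} \le 2\,l_1 .
\]
\[
\text{Likewise } C_{\mathrm{next}} \le 2\,l_2,
\qquad\text{so}\qquad
C_{\mathrm{total}} \le 2\,(l_1 + l_2) = O(l_1 + l_2).
\]
```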

HDU - 2328

Corporate Identity

Beside other services, ACM helps companies to clearly state their “corporate identity”, which includes company logo but also other signs, like trademarks. One of such companies is Internet Building Masters (IBM), which has recently asked ACM for a help with their new identity. IBM do not want to change their existing logos and trademarks completely, because their customers are used to the old ones. Therefore, ACM will only change existing trademarks instead of creating new ones.

After several other proposals, it was decided to take all existing trademarks and find the longest common sequence of letters that is contained in all of them. This sequence will be graphically emphasized to form a new logo. Then, the old trademarks may still be used while showing the new identity.

Your task is to find such a sequence.

Input

The input contains several tasks. Each task begins with a line containing a positive integer N, the number of trademarks (2 ≤ N ≤ 4000). The number is followed by N lines, each containing one trademark. Trademarks will be composed only from lowercase letters, the length of each trademark will be at least 1 and at most 200 characters.

After the last trademark, the next task begins. The last task is followed by a line containing zero.

Output

For each task, output a single line containing the longest string contained as a substring in all trademarks. If there are several strings of the same length, print the one that is lexicographically smallest. If there is no such non-empty string, output the words “IDENTITY LOST” instead.

Sample Input
3
aabbaabb
abbababb
bbbbbabb
2
xyz
abc
0
Sample Output
abb
IDENTITY LOST

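Problem summary: given n strings, find their longest common substring.

Approach: there are several ways to solve this problem. Here we first find the shortest string, then enumerate each of its substrings and use extended KMP to match that substring against every other string, keeping track of the longest match and its position. Note that when there are several answers, the lexicographically smallest substring must be printed.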

#include<iostream>
#include<cstdio>
#include<cstring>
#include<algorithm>
#include<cmath>
using namespace std;
typedef long long ll;
const int maxn=1e5+5;
const int INF=1e9;
int nxt[maxn],ex[maxn];
char s1[maxn],s2[maxn];
char s[5010][210];
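// Precompute the next array of the pattern str
void get_next(char *str)
{
    int i=0,j,po,len=strlen(str);
    nxt[0]=len;// next[0] is the whole length
    while(i+1<len&&str[i]==str[i+1])// compute next[1] by direct comparison
        i++;
    nxt[1]=i;
    po=1;// initial position of po
    for(i=2;i<len;i++)
    {
        if(nxt[i-po]+i<nxt[po]+po)// case 1: the value can be copied
            nxt[i]=nxt[i-po];
        else{// case 2: keep matching past the farthest position
            j=nxt[po]+po-i;
            if(j<0)j=0;// position i has not been matched yet, start from 0
            while(i+j<len&&str[j]==str[j+i])
                j++;
            nxt[i]=j;
            po=i;// update po
        }
    }
}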
// Compute the extend array; returns true as soon as some ex[i]==l2, i.e. s2 occurs in s1
bool exkmp(char *s1,char *s2){
    int i=0,j,po,len=strlen(s1),l2=strlen(s2);
    get_next(s2);// compute the next array of the pattern
    while(i<l2&&i<len&&s1[i]==s2[i])// compute ex[0]
        i++;
    ex[0]=i;
    po=0;// initial position of po
    if(ex[0]==l2)return true;
    for(i=1;i<len;i++){
        if(nxt[i-po]+i<ex[po]+po)// case 1: the value can be copied
            ex[i]=nxt[i-po];
        else {// case 2: keep matching past the farthest position
            j=ex[po]+po-i;
            if(j<0)j=0;// position i has not been matched yet
            while(i+j<len&&j<l2&&s1[i+j]==s2[j])
                j++;
            ex[i]=j;
            po=i;// update po
        }
        if(ex[i]==l2)
            return true;
    }
    return false;
}
void char_s(char *str1,char *str,int l,int r)// copy str[l..r] into str1
{
    for(int i=l;i<=r;i++)
    {
        str1[i-l]=str[i];
    }
    str1[r-l+1]='\0';
    return ;
}
int main()
{
    int n;
    while(scanf("%d",&n)==1&&n){// a line containing 0 ends the input
        int Min=INF,pos=1;
        for(int i=1;i<=n;i++)
        {
            scanf("%s",s[i]);
            if((int)strlen(s[i])<Min){// find the shortest string
                Min=strlen(s[i]);
                pos=i;
            }
        }
        int Max=0;
        int l=strlen(s[pos]);
        int left=0,right=-1;
        for(int i=0;i<l;i++)// enumerate every substring s[pos][i..j]
        {
            for(int j=i;j<l;j++){
                if((j-i+1)<Max)continue;// pruning: shorter than the current best
                char_s(s1,s[pos],i,j);
                bool flag=true;
                for(int z=1;z<=n;z++)// match s1 against every other string
                {
                    if(z==pos)continue;
                    if(exkmp(s[z],s1))
                        continue;
                    flag=false;
                    break;
                }
                if(flag){
                    if(j-i+1==Max)// same length: keep the lexicographically smaller one
                    {
                        for(int z=0;z+left<=right;z++){
                            if(s[pos][z+left]<s[pos][z+i])
                                break;
                            else{
                                if(s[pos][z+left]>s[pos][z+i]){
                                    left=i;
                                    right=j;
                                    break;
                                }
                            }
                        }
                    }
                    if(j-i+1>Max){
                        left=i;right=j;
                        Max=j-i+1;
                    }
                }
            }
        }
        if(Max==0)printf("IDENTITY LOST\n");
        else{
            for(int i=left;i<=right;i++)
            {
                printf("%c",s[pos][i]);
            }
            printf("\n");
        }
    }
    return 0;
}
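To cross-check the program on small inputs, a brute-force reference can be handy. The sketch below is not the extended KMP solution: it swaps in std::string::find for the substring test, and the function name longest_common_substring is made up for illustration. An empty result corresponds to "IDENTITY LOST".

```cpp
#include <string>
#include <vector>

// Longest string contained in every word; on ties the lexicographically smallest.
std::string longest_common_substring(const std::vector<std::string>& words) {
    if (words.empty()) return "";
    // Only substrings of the shortest word need to be enumerated.
    std::size_t shortest = 0;
    for (std::size_t i = 1; i < words.size(); ++i)
        if (words[i].size() < words[shortest].size()) shortest = i;
    const std::string& base = words[shortest];

    std::string best;
    for (std::size_t i = 0; i < base.size(); ++i) {
        for (std::size_t len = 1; i + len <= base.size(); ++len) {
            if (len < best.size()) continue;                   // cannot improve the answer
            std::string cand = base.substr(i, len);
            if (len == best.size() && cand >= best) continue;  // tie: keep the smaller one
            bool in_all = true;
            for (const std::string& w : words)
                if (w.find(cand) == std::string::npos) { in_all = false; break; }
            if (!in_all) break;  // longer substrings starting at i contain cand, so they fail too
            best = cand;
        }
    }
    return best;
}
```

On the first sample, {"aabbaabb", "abbababb", "bbbbbabb"}, both "abb" and "bba" occur in all three strings, and the function returns "abb", the smaller of the two. On the second sample, {"xyz", "abc"}, it returns the empty string.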