Understanding and Implementing the KMP Pattern Matching Algorithm

楠之枫雪

Category: Data Structures and Algorithms

Article link: https://blog.csdn.net/u014614038/article/details/49334485

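After studying it for most of a day, I finally understand how the KMP algorithm works. The core difficulty is computing the maximum length of the common part of a string's prefixes and suffixes.

First, let's look at the prefixes and suffixes of a string: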

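- "A": the prefixes and suffixes are both empty sets, so the length of the common element is 0;
- "AB": the prefixes are [A] and the suffixes are [B], so the length of the common element is 0;
- "ABC": the prefixes are [A, AB] and the suffixes are [BC, C], so the length of the common element is 0;
- "ABCD": the prefixes are [A, AB, ABC] and the suffixes are [BCD, CD, D], so the length of the common element is 0;
- "ABCDA": the prefixes are [A, AB, ABC, ABCD] and the suffixes are [BCDA, CDA, DA, A]; the common element is "A", with length 1;
- "ABCDAB": the prefixes are [A, AB, ABC, ABCD, ABCDA] and the suffixes are [BCDAB, CDAB, DAB, AB, B]; the common element is "AB", with length 2;
- "ABCDABD": the prefixes are [A, AB, ABC, ABCD, ABCDA, ABCDAB] and the suffixes are [BCDABD, CDABD, DABD, ABD, BD, D], so the length of the common element is 0.

A "prefix" is any leading part of a string that excludes its last character; a "suffix" is any trailing part of a string that excludes its first character.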
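The list above can be reproduced by brute force. The following is a minimal sketch, not the method KMP itself uses: the class name `PrefixSuffixDemo` and the helper `longestCommon` are names made up for this illustration. It simply tries every candidate length from longest to shortest.

```java
class PrefixSuffixDemo {
    // Length of the longest string that is both a prefix (without the last
    // character) and a suffix (without the first character) of s.
    static int longestCommon(String s) {
        for (int len = s.length() - 1; len > 0; len--) {
            String suffix = s.substring(s.length() - len);
            if (s.startsWith(suffix)) {
                return len; // the first hit is the longest, since len counts down
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        String word = "ABCDABD";
        // Print the value for "A", "AB", ..., "ABCDABD", as in the list above.
        for (int end = 1; end <= word.length(); end++) {
            String part = word.substring(0, end);
            System.out.println(part + " -> " + longestCommon(part));
        }
    }
}
```

This check is quadratic in the length of the string, so it is only useful for verifying small examples by hand.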

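When a string has a prefix that equals one of its suffixes, whatever matches that suffix also matches that prefix, and vice versa. In that case the string necessarily contains repeated characters that appear in both the prefix and the suffix, so we can split the string into three parts:

【a】【b】【a】

【a】 is the part that repeats, and 【b】 is the part that differs.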

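The KMP matching process works as follows. Take the text s1 and the pattern s2 (positions are counted from 1):

Position: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Text s1: 【a】【b】【a】【b】【x】【a】【b】【a】【x】【g】【a】【b】【a】【x】【。】

Pattern s2: 【a】【b】【a】【x】【。】

With s2 aligned at position 1, the first three characters match and the comparison fails at position 4 (【b】 against 【x】). A naive search would now slide s2 by one position and start over. KMP does not need to: the matched part 【a】【b】【a】 begins and ends with the same 【a】, so the 【a】 at position 3 of s1 is already known to match the first character of s2, while aligning s2 at position 2 cannot succeed because 【b】 differs from 【a】. s2 therefore moves 2 positions, and the comparison resumes at position 4:

Pattern s2, aligned at position 3: 【a】【b】【a】【x】【。】

Position 4 matches (【b】), but position 5 does not (【x】 against 【a】). The matched part 【a】【b】 has no common prefix and suffix, so s2 moves 2 positions:

Pattern s2, aligned at position 5: 【a】【b】【a】【x】【。】

【x】 and 【a】 differ and nothing has been matched, so s2 moves 1 position:

Pattern s2, aligned at position 6: 【a】【b】【a】【x】【。】

Positions 6 to 9 match, but position 10 does not (【g】 against 【。】). The matched part 【a】【b】【a】【x】 has no common prefix and suffix, so s2 moves 4 positions:

Pattern s2, aligned at position 10: 【a】【b】【a】【x】【。】

【g】 and 【a】 differ and nothing has been matched, so s2 moves 1 position:

Pattern s2, aligned at position 11: 【a】【b】【a】【x】【。】

All five characters match, and the search ends.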
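To check a walkthrough like this one by machine, a small tracing matcher helps. The sketch below is an illustration under my own naming (`KmpTraceDemo`, `buildNext` and `trace` are not from the original code). It records every mismatch together with the number of characters matched so far and the resulting shift, using 1-based positions like the text.

```java
import java.util.ArrayList;
import java.util.List;

class KmpTraceDemo {
    // next[i] = longest common prefix/suffix length of p[0..i]
    static int[] buildNext(String p) {
        int[] next = new int[p.length()];
        for (int i = 1, k = 0; i < p.length(); i++) {
            while (k > 0 && p.charAt(k) != p.charAt(i)) k = next[k - 1];
            if (p.charAt(k) == p.charAt(i)) k++;
            next[i] = k;
        }
        return next;
    }

    // Returns one log line per mismatch, plus a final line with the result.
    static List<String> trace(String s1, String s2) {
        List<String> log = new ArrayList<>();
        if (s2.isEmpty()) {
            log.add("empty pattern");
            return log;
        }
        int[] next = buildNext(s2);
        int i = 0; // index into s1
        int j = 0; // index into s2 = number of characters matched so far
        while (i < s1.length()) {
            if (s1.charAt(i) == s2.charAt(j)) {
                i++;
                j++;
                if (j == s2.length()) {
                    log.add("match: s2 starts at position " + (i - j + 1));
                    return log;
                }
            } else {
                int shift = (j == 0) ? 1 : j - next[j - 1];
                log.add("mismatch at position " + (i + 1)
                        + ", matched " + j + ", shift " + shift);
                if (j == 0) i++; else j -= shift;
            }
        }
        log.add("no match");
        return log;
    }

    public static void main(String[] args) {
        for (String line : trace("ababxabaxgabax。", "abax。")) {
            System.out.println(line);
        }
    }
}
```

For the strings used above, the trace reports shifts of 2, 2, 1, 4 and 1 before the match at position 11.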

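As the walkthrough shows, the key is knowing how many positions s2 has to move after a mismatch.

Shift = number of characters already matched - maximum length of the common prefix and suffix of the matched part, that is, next(i) of the last matched character.

For example:

Position: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Text s1: 【a】【b】【a】【b】【x】【a】【b】【a】【x】【g】【a】【b】【a】【x】【。】

Pattern s2, aligned at position 6: 【a】【b】【a】【x】【。】

When the comparison fails at 【。】 (position 10 of s1), 4 characters have been matched. The matched part 【a】【b】【a】【x】 has a maximum common prefix and suffix length of 0, so s2 moves 4 - 0 = 4 positions:

Pattern s2, aligned at position 10: 【a】【b】【a】【x】【。】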
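Because the shift depends only on the pattern and on how many characters were matched, it can be tabulated in advance. The sketch below is an illustration only; `ShiftTableDemo` and `shiftTable` are names I made up, and the rule "shift by 1 when nothing matched" is the one used in the walkthrough.

```java
import java.util.Arrays;

class ShiftTableDemo {
    // next[i] = longest common prefix/suffix length of p[0..i]
    static int[] buildNext(String p) {
        int[] next = new int[p.length()];
        for (int i = 1, k = 0; i < p.length(); i++) {
            while (k > 0 && p.charAt(k) != p.charAt(i)) k = next[k - 1];
            if (p.charAt(k) == p.charAt(i)) k++;
            next[i] = k;
        }
        return next;
    }

    // shift[m] = how far the pattern moves when m characters matched
    // and the character after them did not.
    static int[] shiftTable(String pattern) {
        int[] shift = new int[pattern.length()];
        if (pattern.isEmpty()) {
            return shift;
        }
        int[] next = buildNext(pattern);
        shift[0] = 1; // nothing matched: move one position
        for (int m = 1; m < pattern.length(); m++) {
            shift[m] = m - next[m - 1]; // matched count minus next of the last matched character
        }
        return shift;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(shiftTable("abax。")));
    }
}
```

For the pattern 【a】【b】【a】【x】【。】 the entry for 4 matched characters is 4, which is the 4 - 0 computed above, and the entry for 3 matched characters is 2.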

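So matching reduces to computing next(i) for every character of the pattern:

A prefix always contains the leading characters and a suffix always contains the trailing ones. If next(i) is known for the character at index i, then the first next(i) characters of the string are the same as the last next(i) characters ending at index i:

【0】…【next(i)-1】【next(i)】…【i-next(i)+1】…【i】【i+1】

If the character at 【next(i)】 equals the character at 【i+1】, then next(i+1) = next(i) + 1.

If they differ, fall back to the shorter prefix 【0】…【next(i)-1】. Its own value, next(next(i)-1), tells how many leading characters of the string also end at 【i】. If the character at 【next(next(i)-1)】 equals the character at 【i+1】, then next(i+1) = next(next(i)-1) + 1. Otherwise keep falling back in the same way until a character matches or there is nothing shorter to fall back to, in which case next(i+1) = 0.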

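Below is my Java implementation for computing next; it uses recursion:

    public static int[] getNext(String target)
    {
        int[] next = new int[target.length()];
        char[] cs = target.toCharArray();
        for (int i = 0; i < next.length; i++)
        {
            // next[0..i-1] is already filled in when position i is computed
            next[i] = getnum(i, cs, next, cs[i]);
        }
        return next;
    }

    /**
     * @param id      length of the prefix currently being extended
     * @param targetc the character that has to follow the common part
     * @return the value of next for the position holding targetc
     */
    public static int getnum(int id, char[] cs, int[] next, char targetc)
    {
        if (id == 0)
            return 0; // nothing shorter to fall back to
        int prenext = next[id - 1];
        if (cs[prenext] == targetc)
        {
            // the common part of the previous position can be extended by one
            return prenext + 1;
        }
        else
        {
            // fall back to the shorter common part and try again
            return getnum(prenext, cs, next, targetc);
        }
    }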
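The recursion above can also be written as a loop. This is a sketch of the same recurrence rather than a replacement for the author's method; `NextIterativeDemo` and `buildNext` are names chosen for this example. A loop avoids deep recursion, which in the recursive version can reach the length of the pattern (for example with a long run of one character followed by a different one).

```java
import java.util.Arrays;

class NextIterativeDemo {
    static int[] buildNext(String target) {
        char[] cs = target.toCharArray();
        int[] next = new int[cs.length];
        int k = 0; // common prefix/suffix length for the previous position
        for (int i = 1; i < cs.length; i++) {
            // Fall back to shorter common parts until cs[i] can extend one.
            while (k > 0 && cs[k] != cs[i]) {
                k = next[k - 1];
            }
            if (cs[k] == cs[i]) {
                k++;
            }
            next[i] = k;
        }
        return next;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(buildNext("ABCDABD")));
    }
}
```

For "ABCDABD" this yields 0, 0, 0, 0, 1, 2, 0, the same lengths as in the prefix and suffix list at the beginning of the article.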

With next(i) available, the matching itself is straightforward. Here is my implementation:

    /**
     * 2015-10-27 11:21:22
     * @param matchStr the string to search in
     * @param targetStr the string to look for
     * @return the index of the first match, or -1 if there is none
     */
    public static int KmpMatch(String matchStr, String targetStr)
    {
        char[] matchChars = matchStr.toCharArray();
        char[] targetChars = targetStr.toCharArray();
        if (targetChars.length == 0)
            return 0; // an empty string matches at the start
        int[] next = getNext(targetStr);
        int matchIndex = 0;  // cursor in matchStr
        int targetIndex = 0; // cursor in targetStr, equal to the number of characters matched so far
        int movelength = 0;
        while (matchIndex < matchChars.length)
        {
            if (matchChars[matchIndex] == targetChars[targetIndex])
            {
                matchIndex++;
                targetIndex++;
                if (targetIndex == targetChars.length)
                    return matchIndex - targetIndex;
            }
            else if (targetIndex > 0)
            {
                // shift = matched length minus next of the last matched character
                movelength = targetIndex - next[targetIndex - 1];
                targetIndex = targetIndex - movelength;
            }
            else
            {
                // nothing matched at this position: advance in matchStr
                matchIndex++;
            }
        }
        return -1; // no match found
    }

A quick test:

    public static void main(String[] args)
    {
        String matchstr = "aass啊wwww";
        String targetstr = "啊";

        System.out.println("--：" + KmpMatch(matchstr, targetstr)); // search with KMP
    }

The output is: --：4
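A single test string says little about correctness, so it can be worth cross-checking a KMP search against `String.indexOf` on many random inputs. The sketch below is self-contained and uses its own compact matcher; `KmpCrossCheck`, `indexOfKmp`, `randomString` and `countMismatches` are names invented for this example. A two-letter alphabet is used so that partial matches and fallbacks occur often.

```java
import java.util.Random;

class KmpCrossCheck {
    // next[i] = longest common prefix/suffix length of p[0..i]
    static int[] buildNext(String p) {
        int[] next = new int[p.length()];
        for (int i = 1, k = 0; i < p.length(); i++) {
            while (k > 0 && p.charAt(k) != p.charAt(i)) k = next[k - 1];
            if (p.charAt(k) == p.charAt(i)) k++;
            next[i] = k;
        }
        return next;
    }

    // Index of the first occurrence of pattern in text, or -1.
    static int indexOfKmp(String text, String pattern) {
        if (pattern.isEmpty()) return 0;
        int[] next = buildNext(pattern);
        int j = 0; // number of pattern characters matched so far
        for (int i = 0; i < text.length(); i++) {
            while (j > 0 && text.charAt(i) != pattern.charAt(j)) j = next[j - 1];
            if (text.charAt(i) == pattern.charAt(j)) j++;
            if (j == pattern.length()) return i - j + 1;
        }
        return -1;
    }

    static String randomString(Random rnd, int length) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            sb.append((char) ('a' + rnd.nextInt(2))); // only 'a' and 'b'
        }
        return sb.toString();
    }

    // Number of random cases in which the KMP search and indexOf disagree.
    static int countMismatches(int rounds, long seed) {
        Random rnd = new Random(seed);
        int bad = 0;
        for (int r = 0; r < rounds; r++) {
            String text = randomString(rnd, rnd.nextInt(20));
            String pattern = randomString(rnd, 1 + rnd.nextInt(4));
            if (indexOfKmp(text, pattern) != text.indexOf(pattern)) bad++;
        }
        return bad;
    }

    public static void main(String[] args) {
        System.out.println("disagreements: " + countMismatches(10000, 42L));
    }
}
```

Any non-zero count points to a bug in the matcher; inputs such as "acab" with the pattern "ab" are typical of the cases this kind of check catches.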

The code above only finds the first match. The next step is to find every match and store the start indexes in a list. This version replaces the method above, since Java does not allow two methods that differ only in their return type:

    /**
     * 2015-10-27 11:21:22
     *
     * @param matchStr
     *            the string to search in
     * @param targetStr
     *            the string to look for
     * @return the start indexes of all non-overlapping matches, in ascending order
     */
    public static java.util.List<Integer> KmpMatch(String matchStr,
            String targetStr) {
        java.util.List<Integer> matchIndexs = new java.util.ArrayList<Integer>();
        char[] matchChars = matchStr.toCharArray();
        char[] targetChars = targetStr.toCharArray();
        if (targetChars.length == 0)
            return matchIndexs; // nothing to look for
        int[] next = getNext(targetStr);
        int matchIndex = 0;// cursor in matchStr
        int targetIndex = 0;// cursor in targetStr, equal to the number of characters matched so far
        int movelength = 0;
        while (matchIndex < matchChars.length) {
            if (matchChars[matchIndex] == targetChars[targetIndex]) {
                matchIndex++;
                targetIndex++;
                if (targetIndex == targetChars.length) {
                    // full match: record its start and begin afresh after it
                    matchIndexs.add(matchIndex - targetIndex);
                    targetIndex = 0;
                }
            } else if (targetIndex > 0) {
                // shift = matched length minus next of the last matched character
                movelength = targetIndex - next[targetIndex - 1];
                targetIndex = targetIndex - movelength;
            } else {
                // nothing matched at this position: advance in matchStr
                matchIndex++;
            }
        }
        return matchIndexs;
    }
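The list version above resets targetIndex to 0 after each hit, so matches that overlap an earlier one are skipped. If overlapping matches are wanted, one common variant keeps the longest common prefix and suffix of the whole pattern instead of starting from 0. The following is a sketch of that variant, not the author's code; `KmpOverlapDemo` and `findAll` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class KmpOverlapDemo {
    // next[i] = longest common prefix/suffix length of p[0..i]
    static int[] buildNext(String p) {
        int[] next = new int[p.length()];
        for (int i = 1, k = 0; i < p.length(); i++) {
            while (k > 0 && p.charAt(k) != p.charAt(i)) k = next[k - 1];
            if (p.charAt(k) == p.charAt(i)) k++;
            next[i] = k;
        }
        return next;
    }

    // Start indexes of all matches, including overlapping ones.
    static List<Integer> findAll(String text, String pattern) {
        List<Integer> hits = new ArrayList<>();
        if (pattern.isEmpty()) return hits;
        int[] next = buildNext(pattern);
        int j = 0; // number of pattern characters matched so far
        for (int i = 0; i < text.length(); i++) {
            while (j > 0 && text.charAt(i) != pattern.charAt(j)) j = next[j - 1];
            if (text.charAt(i) == pattern.charAt(j)) j++;
            if (j == pattern.length()) {
                hits.add(i - j + 1);
                // Keep the part of this match that can begin the next one.
                j = next[j - 1];
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll("abababa", "aba"));
        System.out.println(findAll("aaaa", "aa"));
    }
}
```

With "abababa" and "aba" this reports the starts 0, 2 and 4, whereas resetting to 0 after each hit would report only 0 and 4. Which behavior is right depends on what the search is used for.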

To test it, read the text to be searched from a file:

    // imports needed: java.io.File, java.io.FileReader, java.io.IOException
    public static void main(String[] args) {
        File f = new File("E:/test/test.txt");
        if (f.exists()) {
            String targetstr = "的";
            StringBuffer matcherstr = new StringBuffer();
            char[] readchar = new char[30];
            // try-with-resources closes the reader even if reading fails
            try (FileReader reader = new FileReader(f)) {
                int count;
                while ((count = reader.read(readchar)) > 0) {
                    // append only the characters actually read in this round
                    matcherstr.append(readchar, 0, count);
                }
            } catch (IOException e) {
                e.printStackTrace();
                return;
            }
            java.util.List<Integer> indexs = KmpMatch(matcherstr.toString(),
                    targetstr);
            for (int i = 0, size = indexs.size(); i < size; i++)
                // parentheses are required: "..." + i + 1 would print "01", "11", ...
                System.out.println("Match " + (i + 1) + " at position " + indexs.get(i));
        }
    }

With my test file the output is:

Match 1 at position 8
Match 2 at position 106
Match 3 at position 158
Match 4 at position 181
Match 5 at position 218
Match 6 at position 230
Match 7 at position 277
Match 8 at position 332
Match 9 at position 380
Match 10 at position 391

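The matches are correct. The above only shows one way to implement the idea, and the code is not optimal. Reading large amounts of text raises further issues that I will not go into here.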
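One property of KMP helps with large inputs: the cursor in the text never moves backwards, so the text can be consumed block by block instead of being loaded into memory first. The sketch below illustrates this with a `java.io.Reader`. The names `KmpStreamDemo`, `findAll` and `findAllInString` and the 4096-character buffer size are assumptions of this example, and positions are counted in Java chars from the start of the stream.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class KmpStreamDemo {
    // next[i] = longest common prefix/suffix length of p[0..i]
    static int[] buildNext(String p) {
        int[] next = new int[p.length()];
        for (int i = 1, k = 0; i < p.length(); i++) {
            while (k > 0 && p.charAt(k) != p.charAt(i)) k = next[k - 1];
            if (p.charAt(k) == p.charAt(i)) k++;
            next[i] = k;
        }
        return next;
    }

    // Start positions of all matches (overlapping ones included) in the stream.
    static List<Long> findAll(Reader in, String pattern) throws IOException {
        List<Long> hits = new ArrayList<>();
        if (pattern.isEmpty()) return hits;
        int[] next = buildNext(pattern);
        char[] buffer = new char[4096];
        int j = 0;       // pattern characters matched so far; survives across blocks
        long offset = 0; // characters consumed before the current block
        int count;
        while ((count = in.read(buffer)) != -1) {
            for (int b = 0; b < count; b++) {
                char c = buffer[b];
                while (j > 0 && c != pattern.charAt(j)) j = next[j - 1];
                if (c == pattern.charAt(j)) j++;
                if (j == pattern.length()) {
                    hits.add(offset + b - j + 1);
                    j = next[j - 1];
                }
            }
            offset += count;
        }
        return hits;
    }

    // Convenience wrapper for trying the method on an in-memory string.
    static List<Long> findAllInString(String text, String pattern) {
        try (Reader in = new StringReader(text)) {
            return findAll(in, pattern);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(findAllInString("abababa", "aba"));
    }
}
```

Only the pattern, its next array and one buffer are kept in memory, so a match that straddles two blocks is still found because `j` carries over from one block to the next. For a file, a `Reader` opened with an explicit charset would be passed to `findAll` in place of the `StringReader`.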