KMP: A Pattern-Matching Algorithm

Android路上的人

Category: JDK source code. Tags: kmp, jdk, algorithm

Article link: https://blog.csdn.net/Androidlushangderen/article/details/39828027


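Today let's talk about pattern matching, which here simply means substring search. For example, if the pattern is "abc", its first occurrence in "abcdse" starts at index 0, and that counts as a successful match. Before getting to KMP, consider the traditional approach: given two strings, compare them character by character and solve the problem by brute force. I wrote an example of that beforehand.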

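	/**
	 * Plain (brute-force) pattern matching.
	 * 
	 * @param s
	 *            the text (main string)
	 * @param t
	 *            the pattern
	 * @return index of the first occurrence of t in s, or -1 if there is none
	 */
	private static int strIndex(String s, String t) {
		int start = 0;
		// Last start position at which t still fits inside s
		int end = s.length() - t.length();
		int k = 0;
		int index = -1;

		for (int i = start; i <= end; i++) {
			// Current comparison position in the text
			k = i;
			// With the start position fixed, compare character by character
			for (int j = 0; j < t.length(); j++) {
				if (s.charAt(k) == t.charAt(j)) {
					k++;
				} else {
					break;
				}
			}

			// All t.length() characters matched
			if (k == i + t.length()) {
				index = i;
				break;
			}
		}

		return index;
	}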

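This works, but its efficiency leaves much to be desired: the worst-case time complexity is O(n*m), where n is the length of the text and m the length of the pattern. On a very long input, such as searching a whole article, who knows how long you would wait. We always stand on the shoulders of giants, and our predecessors thought about this problem long ago: the KMP pattern-matching algorithm was proposed to solve it. First, a word on where KMP comes from.
The KMP algorithm is named after the three people who discovered it, D.E. Knuth, J.H. Morris, and V.R. Pratt (hence the Knuth-Morris-Pratt algorithm, KMP for short): the name takes the initial of each surname. The difference from brute force is that KMP neatly eliminates the backtracking of the text pointer i. On a mismatch it only has to decide where the pattern pointer j goes next, which brings the complexity down from O(n*m) to O(m+n).
First, consider how the original brute-force matching proceeds: on every mismatch the text pointer i moves back to one position past where the current attempt started, and the pattern pointer j returns to 0.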
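To make the cost of that backtracking concrete, here is a minimal sketch that counts character comparisons during a brute-force search. The class name `BruteForceTrace` and the `comparisons` counter are hypothetical additions for illustration, not part of the original test class.

```java
// Hypothetical demo class; not part of the original article's code.
class BruteForceTrace {
    // Number of character comparisons made by the most recent search call.
    static int comparisons;

    // Returns the first index of t in s, or -1, counting every character comparison.
    static int search(String s, String t) {
        comparisons = 0;
        int n = s.length();
        int m = t.length();
        // i is the start of the current attempt; after a mismatch it advances by only one.
        for (int i = 0; i + m <= n; i++) {
            int j = 0;
            while (j < m) {
                comparisons++;
                if (s.charAt(i + j) != t.charAt(j)) {
                    break;
                }
                j++;
            }
            if (j == m) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int pos = search("aaaaaaab", "aab");
        // Six alignments are tried and each costs three comparisons.
        System.out.println("position=" + pos + ", comparisons=" + comparisons);
    }
}
```

On this input every alignment re-reads characters that an earlier alignment already examined, which is exactly the repeated work KMP avoids.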

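The biggest difference between KMP and plain matching is that KMP skips over the part that has already been matched: it uses the partial match obtained so far to slide the pattern as far to the right as possible instead of one position at a time. To do this it defines a next[] array. next[j] = k means that when the character at index j of the pattern (the (j+1)-th character) fails to match, k is the position in the pattern that should next be compared with the text character s[i]. After a mismatch j does not always have to restart from 0: if the first k characters of T[0...j-1] equal its last k characters, matching resumes at index k of the pattern, because those k characters are already known to match.
In other words, next[j] is the length of the longest proper prefix of T[0...j-1] that is also a suffix of it, where T is the pattern. If next[j] >= 0, the text pointer i stays where it is and the pattern pointer j moves to next[j]. Falling back to the longest such overlap, rather than straight to 0, is what keeps a match from being missed, because an overlap between the head and the tail of the matched part may be the start of a full match. If k = 0, matching restarts from j = 0 against the text position i where the mismatch occurred. If next[j] = -1, which happens only for j = 0, both pointers advance. So the remaining job is to compute the next array.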
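The definition can be checked directly with a deliberately naive sketch that tries every overlap length. This is only an illustration of what next[j] means, not the efficient construction; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical demo class: computes next[] straight from its definition.
class NextByDefinition {
    // next[0] = -1; for j > 0, next[j] is the length of the longest proper
    // prefix of t[0..j-1] that is also a suffix of t[0..j-1].
    static int[] nextByDefinition(String t) {
        int[] next = new int[t.length()];
        for (int j = 0; j < t.length(); j++) {
            if (j == 0) {
                next[j] = -1;
                continue;
            }
            int best = 0;
            // k < j keeps the prefix proper (shorter than the whole substring).
            for (int k = 1; k < j; k++) {
                if (t.substring(0, k).equals(t.substring(j - k, j))) {
                    best = k;
                }
            }
            next[j] = best;
        }
        return next;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(nextByDefinition("abcaa")));  // [-1, 0, 0, 0, 1]
        System.out.println(Arrays.toString(nextByDefinition("ababab"))); // [-1, 0, 0, 1, 2, 3]
    }
}
```

This version is far slower than the construction KMP actually uses, but it is handy as a reference when checking a faster implementation by hand.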
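So how do we compute the next array? By definition next[0] = -1. Suppose next[j] = k, that is, T[0...k-1] == T[j-k...j-1].
1) If T[j] == T[k], then T[0...k] == T[j-k...j], so clearly next[j+1] = next[j] + 1 = k + 1.
2) If T[j] != T[k], how should k move? The answer is k = next[k]. This was the hardest point for me to understand. My reading is that k falls back to the previous, shorter overlap. Say 3 characters at the head and the tail were equal, and then a newly added character fails the comparison: k becomes the length of the next-shorter overlap, perhaps 2, and we test whether that overlap can be extended by the new character. If that fails too, fall back again. In the end k may reach 0, which means that once the new character is added there is no common prefix and suffix left, and j has to start from 0 again. The code is as follows: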

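	/**
	 * Computes the next[] array.
	 * 
	 * @param t
	 *            the pattern
	 * @return the next[] array for t
	 */
	private static int[] getNext(String t) {
		int[] next = new int[t.length()];
		// An empty pattern has no next values
		if (t.isEmpty()) {
			return next;
		}
		next[0] = -1;
		int j = 0;
		int k = -1;

		while (j < t.length() - 1) {
			if (k == -1 || t.charAt(j) == t.charAt(k)) {
				j++;
				k++;

				next[j] = k;
			} else {
				// Fall back to the next-shorter overlap
				k = next[k];
			}
		}

		// Print the array for inspection
		for (int i : next) {
			System.out.print(i + ": ");
		}
		System.out.print("\n");

		return next;
	}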
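The fallback step k = next[k] is easier to follow when each fallback is logged. The sketch below repeats the same construction without the printing loop and records every fallback in a `StringBuilder`; the class name, the method name, and the log format are assumptions made for this illustration.

```java
import java.util.Arrays;

// Hypothetical demo class: the same next[] construction, with each fallback logged.
class NextFallbackTrace {
    static int[] getNextTraced(String t, StringBuilder log) {
        int[] next = new int[t.length()];
        if (t.isEmpty()) {
            return next;
        }
        next[0] = -1;
        int j = 0;
        int k = -1;
        while (j < t.length() - 1) {
            if (k == -1 || t.charAt(j) == t.charAt(k)) {
                j++;
                k++;
                next[j] = k;
            } else {
                // Record the fallback from the current overlap to the next-shorter one.
                log.append("j=").append(j).append(": k ").append(k)
                   .append(" -> ").append(next[k]).append('\n');
                k = next[k];
            }
        }
        return next;
    }

    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        int[] next = getNextTraced("aabaaac", log);
        System.out.println(Arrays.toString(next)); // [-1, 0, 1, 0, 1, 2, 2]
        System.out.print(log);
    }
}
```

For "aabaaac" the interesting step is at j = 5: the overlap "aa" cannot be extended because T[5] is 'a' and T[2] is 'b', so k falls back from 2 to 1, and the shorter overlap "a" can be extended, giving next[6] = 2.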

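Following this method, the next[] array for abcaa is -1, 0, 0, 0, 1. When the fifth character a fails to match, the first a equals the fourth a, so its next value is 1: the pattern slides right until its first a sits where the fourth a was, and comparison resumes at b (index 1). The resulting KMP matcher is: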

	/**
	 * KMP pattern matching.
	 * 
	 * @param s
	 *            the text (main string)
	 * @param t
	 *            the pattern
	 * @param next
	 *            the next[] array computed for t
	 * @return index of the first occurrence of t in s, or -1 if there is none
	 */
	private static int kmpStrIndex(String s, String t, int[] next) {
		int i = 0;
		int j = 0;

		while (i < s.length() && j < t.length()) {
			if (j == -1 || s.charAt(i) == t.charAt(j)) {
				i++;
				j++;
			} else {
				// i stays put; j falls back
				j = next[j];
			}

			if (j == t.length()) {
				return i - j;
			}
		}

		return -1;
	}
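As a rough check on the complexity claim, the following sketch counts character comparisons in a KMP search. It is a self-contained variant written for this illustration: the class name, the `comparisons` counter, and the early return for an empty pattern are assumptions, not part of the original listing.

```java
// Hypothetical demo class; names and the comparison counter are illustrative only.
class KmpCountingDemo {
    // Number of character comparisons made by the most recent search call.
    static int comparisons;

    static int[] buildNext(String t) {
        int[] next = new int[t.length()];
        if (t.isEmpty()) {
            return next;
        }
        next[0] = -1;
        int j = 0;
        int k = -1;
        while (j < t.length() - 1) {
            if (k == -1 || t.charAt(j) == t.charAt(k)) {
                j++;
                k++;
                next[j] = k;
            } else {
                k = next[k];
            }
        }
        return next;
    }

    // Returns the first index of t in s, or -1; the text pointer i never moves backward.
    static int search(String s, String t) {
        comparisons = 0;
        if (t.isEmpty()) {
            return 0; // assumption: mirror String.indexOf for an empty pattern
        }
        int[] next = buildNext(t);
        int i = 0;
        int j = 0;
        while (i < s.length()) {
            if (j == -1) {
                // Mismatch at pattern index 0: move on to the next text character.
                i++;
                j = 0;
                continue;
            }
            comparisons++;
            if (s.charAt(i) == t.charAt(j)) {
                i++;
                j++;
                if (j == t.length()) {
                    return i - j;
                }
            } else {
                j = next[j];
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int pos = search("aaaaaaab", "aab");
        System.out.println("position=" + pos + ", comparisons=" + comparisons);
    }
}
```

Because i only ever increases and every fallback of j is paid for by an earlier advance, the number of comparisons stays within about twice the text length, whereas a brute-force search can re-read the same text characters for every alignment.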

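The idea behind KMP is quite hard to grasp, at least that was my impression on first reading; it takes repeated thought and some scribbling on paper. That led me to another question: the JDK also has a substring-matching method, String.contains(), although it returns a boolean. What does it actually use underneath? Could it be KMP as well?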

    /**
     * Returns true if and only if this string contains the specified sequence of char values.
     *
     * @param s the sequence to search for
     * @return true if this string contains s, false otherwise
     * @throws NullPointerException if s is null
     * @since 1.5
     */
    public boolean contains(CharSequence s) {
        return indexOf(s.toString()) > -1;
    }

    public int indexOf(String str, int fromIndex) {
        return indexOf(value, offset, count, str.value, str.offset, str.count, fromIndex);
    }
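Before reading the implementation, a short usage sketch shows how these public methods behave on the strings used in this article. The class name `ContainsDemo` is made up for the example; the calls themselves are standard `java.lang.String` methods.

```java
// Hypothetical demo class exercising the public String search methods.
class ContainsDemo {
    public static void main(String[] args) {
        String s = "ababcaabcacbab";
        System.out.println(s.contains("abcaa"));   // true
        System.out.println(s.indexOf("abcaa"));    // 2
        System.out.println(s.indexOf("abcaa", 3)); // -1, no occurrence starts at index 3 or later
        System.out.println(s.contains(""));        // true, the empty string is found at index 0
    }
}
```

So contains() is just a boolean wrapper around indexOf(), and any question about its matching strategy is really a question about indexOf().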

This indexOf passes along a whole pile of arguments; the overload below should be where the mystery is revealed:

/**
     * Code shared by String and StringBuffer to do searches. The source is the character array being searched, and the
     * target is the string being searched for.
     * 
     * @param source
     *          the characters being searched.
     * @param sourceOffset
     *          offset of the source string.
     * @param sourceCount
     *          count of the source string.
     * @param target
     *          the characters being searched for.
     * @param targetOffset
     *          offset of the target string.
     * @param targetCount
     *          count of the target string.
     * @param fromIndex
     *          the index to begin searching from.
     */
    static int indexOf(char[] source, int sourceOffset, int sourceCount, char[] target, int targetOffset,
            int targetCount, int fromIndex) {
        // Early argument validation; source here is the text (main string)
        if (fromIndex >= sourceCount) {
            return (targetCount == 0 ? sourceCount : -1);
        }
        if (fromIndex < 0) {
            fromIndex = 0;
        }
        if (targetCount == 0) {
            return fromIndex;
        }
 
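        // Take the first character of the target and compute the largest start offset, sourceCount - targetCount;
        // computing max already hints that a brute-force comparison is coming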
        char first = target[targetOffset];
        int max = sourceOffset + (sourceCount - targetCount);
 
        for (int i = sourceOffset + fromIndex; i <= max; i++) {
            /* Look for first character. */
            // Scan for the first matching character to avoid needless work later
            if (source[i] != first) {
                while (++i <= max && source[i] != first);
            }
 
            /* Found first character, now look at the rest of v2 */
            if (i <= max) {
                int j = i + 1;
                int end = j + targetCount - 1;
                // Once it is found, compare the rest, again with a plain for loop: no trace of KMP here
                for (int k = targetOffset + 1; j < end && source[j] == target[k]; j++, k++);
 
                if (j == end) {
                    /* Found whole string. */
                    return i - sourceOffset;
                }
            }
        }
        return -1;
    }
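Since both approaches must return the same index, one way to gain confidence in a hand-written KMP is to cross-check it against String.indexOf on many random inputs. The sketch below does that over a two-letter alphabet, which makes repeated prefixes common; the class name, seed, string lengths, and the empty-pattern convention are all assumptions chosen for this example.

```java
import java.util.Random;

// Hypothetical cross-check of a KMP search against String.indexOf.
class KmpCrossCheck {
    static int[] buildNext(String t) {
        int[] next = new int[t.length()];
        if (t.isEmpty()) {
            return next;
        }
        next[0] = -1;
        int j = 0;
        int k = -1;
        while (j < t.length() - 1) {
            if (k == -1 || t.charAt(j) == t.charAt(k)) {
                j++;
                k++;
                next[j] = k;
            } else {
                k = next[k];
            }
        }
        return next;
    }

    static int kmpIndexOf(String s, String t) {
        if (t.isEmpty()) {
            return 0; // assumption: same convention as String.indexOf("")
        }
        int[] next = buildNext(t);
        int i = 0;
        int j = 0;
        while (i < s.length() && j < t.length()) {
            if (j == -1 || s.charAt(i) == t.charAt(j)) {
                i++;
                j++;
            } else {
                j = next[j];
            }
        }
        return j == t.length() ? i - j : -1;
    }

    // Builds a random string over the alphabet {a, b}.
    static String randomString(Random rnd, int len) {
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) {
            sb.append(rnd.nextBoolean() ? 'a' : 'b');
        }
        return sb.toString();
    }

    // Returns how many random cases disagree with String.indexOf.
    static int mismatches(long seed, int rounds) {
        Random rnd = new Random(seed);
        int bad = 0;
        for (int r = 0; r < rounds; r++) {
            String s = randomString(rnd, rnd.nextInt(30));
            String t = randomString(rnd, rnd.nextInt(5));
            if (kmpIndexOf(s, t) != s.indexOf(t)) {
                bad++;
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        System.out.println("disagreements: " + mismatches(42L, 10000));
    }
}
```

A count of zero does not prove correctness, but any nonzero count immediately pinpoints a bug in the next[] construction or in the matching loop.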

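The result is a bit disappointing: the JDK uses the plain approach as well. I don't know whether the people at Sun will improve this algorithm in the future; perhaps the author only had simple string comparisons in mind and did not weigh much else. That wraps up the analysis of KMP, and I hope you got something out of it. Finally, here is the example I used for testing today, a single test class: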

package Kmp;

/**
 * Pattern-matching algorithms
 * 
 * @author lyq
 * 
 */
public class Client {
	public static void main(String[] args) {
		// The text (main string)
		String s = "ababcaabcacbab";
		// The pattern
		String t = "abcaa";
		// Position of the first match
		int position = strIndex(s, t);

		System.out.println(position);

		int[] next = getNext(t);
		position = kmpStrIndex(s, t, next);

		System.out.println("kmp:" + position);
	}

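	/**
	 * Plain (brute-force) pattern matching.
	 * 
	 * @param s
	 *            the text (main string)
	 * @param t
	 *            the pattern
	 * @return index of the first occurrence of t in s, or -1 if there is none
	 */
	private static int strIndex(String s, String t) {
		int start = 0;
		// Last start position at which t still fits inside s
		int end = s.length() - t.length();
		int k = 0;
		int index = -1;

		for (int i = start; i <= end; i++) {
			// Current comparison position in the text
			k = i;
			// With the start position fixed, compare character by character
			for (int j = 0; j < t.length(); j++) {
				if (s.charAt(k) == t.charAt(j)) {
					k++;
				} else {
					break;
				}
			}

			// All t.length() characters matched
			if (k == i + t.length()) {
				index = i;
				break;
			}
		}

		return index;
	}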

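	/**
	 * Computes the next[] array.
	 * 
	 * @param t
	 *            the pattern
	 * @return the next[] array for t
	 */
	private static int[] getNext(String t) {
		int[] next = new int[t.length()];
		// An empty pattern has no next values
		if (t.isEmpty()) {
			return next;
		}
		next[0] = -1;
		int j = 0;
		int k = -1;

		while (j < t.length() - 1) {
			if (k == -1 || t.charAt(j) == t.charAt(k)) {
				j++;
				k++;

				next[j] = k;
			} else {
				// Fall back to the next-shorter overlap
				k = next[k];
			}
		}

		// Print the array for inspection
		for (int i : next) {
			System.out.print(i + ": ");
		}
		System.out.print("\n");

		return next;
	}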

	/**
	 * KMP pattern matching.
	 * 
	 * @param s
	 *            the text (main string)
	 * @param t
	 *            the pattern
	 * @param next
	 *            the next[] array computed for t
	 * @return index of the first occurrence of t in s, or -1 if there is none
	 */
	private static int kmpStrIndex(String s, String t, int[] next) {
		int i = 0;
		int j = 0;

		while (i < s.length() && j < t.length()) {
			if (j == -1 || s.charAt(i) == t.charAt(j)) {
				i++;
				j++;
			} else {
				// i stays put; j falls back
				j = next[j];
			}

			if (j == t.length()) {
				return i - j;
			}
		}

		return -1;
	}
	
	 /**
     * Code shared by String and StringBuffer to do searches. The source is the character array being searched, and the
     * target is the string being searched for.
     * 
     * @param source
     *          the characters being searched.
     * @param sourceOffset
     *          offset of the source string.
     * @param sourceCount
     *          count of the source string.
     * @param target
     *          the characters being searched for.
     * @param targetOffset
     *          offset of the target string.
     * @param targetCount
     *          count of the target string.
     * @param fromIndex
     *          the index to begin searching from.
     */
    static int indexOf(char[] source, int sourceOffset, int sourceCount, char[] target, int targetOffset,
            int targetCount, int fromIndex) {
        // Early argument validation; source here is the text (main string)
        if (fromIndex >= sourceCount) {
            return (targetCount == 0 ? sourceCount : -1);
        }
        if (fromIndex < 0) {
            fromIndex = 0;
        }
        if (targetCount == 0) {
            return fromIndex;
        }
 
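        // Take the first character of the target and compute the largest start offset, sourceCount - targetCount;
        // computing max already hints that a brute-force comparison is coming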
        char first = target[targetOffset];
        int max = sourceOffset + (sourceCount - targetCount);
 
        for (int i = sourceOffset + fromIndex; i <= max; i++) {
            /* Look for first character. */
            // Scan for the first matching character to avoid needless work later
            if (source[i] != first) {
                while (++i <= max && source[i] != first);
            }
 
            /* Found first character, now look at the rest of v2 */
            if (i <= max) {
                int j = i + 1;
                int end = j + targetCount - 1;
                // Once it is found, compare the rest, again with a plain for loop: no trace of KMP here
                for (int k = targetOffset + 1; j < end && source[j] == target[k]; j++, k++);
 
                if (j == end) {
                    /* Found whole string. */
                    return i - sourceOffset;
                }
            }
        }
        return -1;
    }
}