21. Strings (sequence)

含低调

Category column: Data Structures and Algorithms

Article link: https://blog.csdn.net/hanzong110/article/details/106349801

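1 Strings

The strings studied in this course are the character strings familiar from everyday development: finite sequences made up of characters.
Taking the string thank as an example, a string has prefixes, proper prefixes, suffixes, and proper suffixes
1. The proper prefixes are all prefixes except the string itself (for thank: t, th, tha, than); likewise, the proper suffixes are all suffixes except the string itself (k, nk, ank, hank)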
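
The definitions above can be checked with a minimal sketch. The class name PrefixSuffixDemo and its helper methods are illustrative names, not part of the original course material.

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixSuffixDemo {
    // All prefixes of s, shortest first; the last one is s itself.
    static List<String> prefixes(String s) {
        List<String> result = new ArrayList<>();
        for (int end = 1; end <= s.length(); end++) {
            result.add(s.substring(0, end));
        }
        return result;
    }

    // All suffixes of s, shortest first; the last one is s itself.
    static List<String> suffixes(String s) {
        List<String> result = new ArrayList<>();
        for (int start = s.length() - 1; start >= 0; start--) {
            result.add(s.substring(start));
        }
        return result;
    }

    // Proper prefixes or suffixes: drop the string itself (the last element).
    static List<String> withoutSelf(List<String> all) {
        return all.isEmpty() ? all : all.subList(0, all.size() - 1);
    }

    public static void main(String[] args) {
        System.out.println(prefixes("thank"));              // [t, th, tha, than, thank]
        System.out.println(withoutSelf(prefixes("thank"))); // [t, th, tha, than]
        System.out.println(suffixes("thank"));              // [k, nk, ank, hank, thank]
        System.out.println(withoutSelf(suffixes("thank"))); // [k, nk, ank, hank]
    }
}
```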

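2 String Matching Algorithms

String matching finds the position of a pattern string (pattern) within a text string (text). In this article, tlen denotes the length of text and plen denotes the length of pattern.
Several classic string matching algorithms
1. Brute Force
2. KMP
3. Boyer-Moore
4. Karp-Rabin
5. Sunday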

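2.1 Brute Force

Slide the pattern from left to right one character at a time until a match succeeds.
Brute force 1: execution process
1. Compare text[ti] with pattern[pi] one character at a time. On a mismatch, set ti = ti - pi + 1 and then reset pi to 0 (ti must be updated first, while pi still holds the number of characters matched in this round). This realigns the head of the pattern with the character right after the start of the current round
2. On a match, ti++ and pi++, until pi or ti goes out of bounds
3. If the loop ends because pi went out of bounds, the pattern was found, and ti - pi is its position in text. If it ends because ti went out of bounds, the whole text was scanned without finding the pattern
Code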

public static int indexOf(String text, String pattern) {
	if (text == null || pattern == null)
		return -1;
	char[] textChars = text.toCharArray();
	int tlen = textChars.length;
	char[] patternChars = pattern.toCharArray();
	int plen = patternChars.length;
	if (tlen == 0 || plen == 0 || plen > tlen)
		return -1;
	int pi = 0, ti = 0;
	while (pi < plen && ti < tlen) {
		if (textChars[ti] == patternChars[pi]) {
			ti++;
			pi++;
		} else {
			// Note: this is not ti++. ti moves to one past the start of the current round, not one past its current position
			ti = ti-pi+1;
			pi = 0;
		}
	}
	return (pi == plen) ? (ti - pi) : -1;
}

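2.2 Optimization 1

The brute-force implementation above can exit early: once fewer than plen characters remain from the start of the current round, no match is possible, so the remaining comparisons can be skipped.
Code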

public static int indexOf(String text, String pattern) {
	if (text == null || pattern == null)
		return -1;
	char[] textChars = text.toCharArray();
	int tlen = textChars.length;
	char[] patternChars = pattern.toCharArray();
	int plen = patternChars.length;
	if (tlen == 0 || plen == 0 || plen > tlen)
		return -1;
	int pi = 0, ti = 0;
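	// Only this condition changed. It cannot be written as ti <= tlen - plen, because ti advances within a round; that bound holds for ti only when pi == 0
	// So the condition could be pi == 0 && ti <= tlen - plen, or simply ti - pi <= tlen - plen: ti - pi is where ti started in this round, and that start must not exceed tlen - plen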
	while (pi < plen && ti - pi <= tlen - plen) {
		if (textChars[ti] == patternChars[pi]) {
			ti++;
			pi++;
		} else {
			ti = ti - pi + 1;
			pi = 0;
		}
	}
	return (pi == plen) ? (ti - pi) : -1;
}

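2.3 A Second Brute-Force Implementation

Within a round, ti stays fixed at the round's start position: text[ti + pi] is the text character being compared, and pattern[pi] is the pattern character being compared.
Code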

public static int indexOf(String text, String pattern) {
	if (text == null || pattern == null)
		return -1;
	char[] textChars = text.toCharArray();
	int tlen = textChars.length;
	char[] patternChars = pattern.toCharArray();
	int plen = patternChars.length;
	if (tlen == 0 || plen == 0 || plen > tlen)
		return -1;

	int tiMax = tlen - plen;
	for (int ti = 0; ti <= tiMax; ti++) {
		int pi = 0;
		for (; pi < plen; pi++) {
			if (textChars[ti + pi] != patternChars[pi])
				break;
		}
		if (pi == plen)
			return ti;
	}
	return -1;
}

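2.4 Brute-Force Performance Analysis

Best case
1. The first round matches completely, after m comparisons (m is the length of the pattern)
2. Time complexity is O(m)
Worst case (the larger the character set, the less likely it is)
1. n - m + 1 rounds of comparison are executed (n is the length of the text)
2. Every round fails only at the last character of the pattern (m - 1 successful comparisons, then 1 failure)
3. Time complexity is O(m * (n - m + 1)); since m is usually much smaller than n, this is O(mn)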
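
The best-case and worst-case counts above can be checked with a small sketch that counts character comparisons. BruteForceCount and countComparisons are illustrative names, and the inputs are chosen here only to trigger each case.

```java
public class BruteForceCount {
    // Counts the character comparisons made by a brute-force search
    // (same structure as the second implementation in 2.3).
    static int countComparisons(String text, String pattern) {
        int tlen = text.length();
        int plen = pattern.length();
        int comparisons = 0;
        for (int ti = 0; ti <= tlen - plen; ti++) {
            int pi = 0;
            while (pi < plen) {
                comparisons++;
                if (text.charAt(ti + pi) != pattern.charAt(pi)) {
                    break; // this round failed
                }
                pi++;
            }
            if (pi == plen) {
                break; // match found, stop searching
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        // Best case: the pattern matches in the first round, m = 3 comparisons.
        System.out.println(countComparisons("aabaaaaaaa", "aab")); // 3
        // Worst case: n = 10, m = 3, every round fails at the last character,
        // so m * (n - m + 1) = 3 * 8 = 24 comparisons.
        System.out.println(countComparisons("aaaaaaaaaa", "aab")); // 24
    }
}
```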

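3 The KMP Algorithm

Brute force vs KMP: KMP makes full use of the characters that have already been compared, so it can skip alignments that cannot possibly match.
KMP, using the next table: KMP first builds a next table (usually an array) from the contents of the pattern.
KMP, core principle (the letters refer to the original figure: e is the pattern character and d is the text character at the point of mismatch)
1. When e fails to match, everything in the pattern before e is identical to the text before d
2. If the part of the pattern before e begins with a substring A and ends with an identical substring B, then the pattern's A must also equal the B in the text. So the comparison can resume directly between c, the pattern character right after A, and d, the text character right after B
3. Therefore next[index of e] should be the index of c, and that value is the length of the longest common proper prefix and proper suffix of the substring to the left of e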
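Length of the longest common proper prefix and proper suffix
1. What the figure gives, for each character of the pattern, is that length for the substring ending at (and including) the character
2. What we need is that length for the substring before (and excluding) the character
3. So shift all the lengths one position to the right and set the first entry to -1; the result is the next table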
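
The shift described above can be sketched directly from the definition, without any KMP machinery. This version is deliberately naive (it rechecks every candidate length) and only illustrates what the table contains; NextByDefinition and its method names are illustrative.

```java
import java.util.Arrays;

public class NextByDefinition {
    // Length of the longest string that is both a proper prefix
    // and a proper suffix of s, found by trying the longest candidate first.
    static int longestCommon(String s) {
        for (int len = s.length() - 1; len > 0; len--) {
            String suffix = s.substring(s.length() - len);
            if (s.startsWith(suffix)) {
                return len;
            }
        }
        return 0;
    }

    // lengths[i] is the value for the substring ending at pattern[i] (inclusive).
    static int[] lengthsEndingAt(String pattern) {
        int[] lengths = new int[pattern.length()];
        for (int i = 0; i < pattern.length(); i++) {
            lengths[i] = longestCommon(pattern.substring(0, i + 1));
        }
        return lengths;
    }

    // Shift every length one position to the right and put -1 in front.
    static int[] shiftToNext(int[] lengths) {
        int[] next = new int[lengths.length];
        for (int i = lengths.length - 1; i > 0; i--) {
            next[i] = lengths[i - 1];
        }
        if (next.length > 0) {
            next[0] = -1;
        }
        return next;
    }

    public static void main(String[] args) {
        int[] lengths = lengthsEndingAt("ABCDABCE");
        System.out.println(Arrays.toString(lengths));              // [0, 0, 0, 0, 1, 2, 3, 0]
        System.out.println(Arrays.toString(shiftToNext(lengths))); // [-1, 0, 0, 0, 0, 1, 2, 3]
    }
}
```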
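Why -1 is used
1. We need a value that means: stop moving the pointer on the pattern and move the pointer on the text instead
2. Any number smaller than the pattern's minimum index 0, or larger than its maximum index pattern.length - 1, would serve this purpose
3. -1 is chosen for coding convenience: a match is normally handled with i++ and j++, so when j is -1 the same i++, j++ step leaves j at 0, and the next text character is compared with index 0 of the pattern. Hence next[0] is set to -1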
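Why the "longest" common length: with anything shorter than the longest, the pattern is shifted too far to the right and a successful match can be missed
Suppose the text is AAAAABCDEF and the pattern is AAAAB; the first mismatch occurs at pi = 4 (pattern B against text A)
Assigning 3 to pi: the pattern shifts right by 1 character, and the match then succeeds
Assigning 1 to pi: the pattern shifts right by 3 characters, and the successful match is missed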
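
The two fallback choices in this example can be compared with a short sketch. It relies on one observation: resuming at pattern index pi while the text index stays at ti means the pattern's first character sits at text index ti - pi. ShiftDemo and its methods are illustrative names.

```java
public class ShiftDemo {
    // Text index that the pattern's first character is aligned with
    // when matching resumes at text index ti and pattern index pi.
    static int alignmentStart(int ti, int pi) {
        return ti - pi;
    }

    // Whether pattern occurs in text starting exactly at index start.
    static boolean matchesAt(String text, String pattern, int start) {
        return text.startsWith(pattern, start);
    }

    public static void main(String[] args) {
        String text = "AAAAABCDEF";
        String pattern = "AAAAB";
        int ti = 4; // the mismatch happened at text[4] vs pattern[4]

        int startLongest = alignmentStart(ti, 3); // fall back to pi = 3
        int startShorter = alignmentStart(ti, 1); // fall back to pi = 1

        System.out.println(startLongest + " " + matchesAt(text, pattern, startLongest)); // 1 true
        System.out.println(startShorter + " " + matchesAt(text, pattern, startShorter)); // 3 false
    }
}
```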
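Constructing the next table
Use dynamic programming
1. State: dp[i] is the length of the longest common proper prefix and proper suffix of the substring ending at pattern[i-1]
2. Initial state: dp[0] = -1 (purely for computational convenience), dp[1] = 0
3. State transition:
  1. If pattern[i-1] == pattern[dp[i-1]], then dp[i] = dp[i-1] + 1
  2. If pattern[i-1] != pattern[dp[i-1]], fall back to the next shorter candidate
    1. If pattern[i-1] == pattern[dp[dp[i-1]]], then dp[i] = dp[dp[i-1]] + 1
    2. If they still differ, keep falling back in the same way; if the fallback reaches dp[0] = -1, then dp[i] = 0
next table code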

// Instructor's implementation
public static int indexOf(String text, String pattern) {
  if (text == null || pattern == null)
    return -1;
  char[] textChars = text.toCharArray();
  int tlen = textChars.length;
  char[] patternChars = pattern.toCharArray();
  int plen = patternChars.length;
  if (tlen == 0 || plen == 0 || plen > tlen)
    return -1;

  int[] next = next(pattern);
  int pi = 0, ti = 0, lendDelta = tlen - plen;
  while (pi < plen && ti - pi <= lendDelta) {
    if (pi < 0 || textChars[ti] == patternChars[pi]) {
      ti++;
      pi++;
    } else {
      pi = next[pi];
    }
  }
  return (pi == plen) ? (ti - pi) : -1;
}

private static int[] next(String pattern) {
  char[] chars = pattern.toCharArray();
  int[] next = new int[chars.length];
  next[0] = -1;
  int i = 0;
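  // Example ABCDABCE: at the moment next[7] (the entry for E) is assigned, i has advanced to E's index and n to the index of D, which is the length of the common part ABC
  // If n < 0, no common prefix/suffix can be extended by the current character, so the entry is set to 0 (++n turns -1 into 0)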
  int n = -1;
  int iMax = chars.length - 1;
  while (i < iMax) {
    if (n < 0 || chars[i] == chars[n]) {
      next[++i] = ++n;
    } else {
      // chars[n] differs from chars[i], so fall back and look for a shorter common prefix/suffix whose next character equals chars[i]
      n = next[n];
    }
  }
  return next;
}

// My own implementation based on the KMP idea
public int kmp(String t, String pattern) {
  int next[] = nextArray(pattern);
  int j = 0;
  int i = 0;
  // Convert to char arrays and cache the lengths up front, to avoid repeated charAt and length calls
  char[] text = t.toCharArray();
  char[] pat = pattern.toCharArray();
  int tlength = text.length;
  int plength = pat.length;
  while (i < tlength && j < plength) {
    if (j == -1 || text[i] == pat[j]) {
      i++;
      j++;
    } else {
      j = next[j];
    }
  }
  return j == plength ? i - j : -1;
}

public static int[] nextArray(String pattern) {
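  int[] next = new int[pattern.length()];
  // An empty pattern has no entries, and a one-character pattern has only next[0]
  if (next.length == 0)
    return next;
  next[0] = -1;
  if (next.length > 1)
    next[1] = 0;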
  char[] pat = pattern.toCharArray();
  int length = pat.length;
  for (int i = 2; i < length; i++) {
    int j = i - 1;
    while (next[j] != -1) {
      if (pat[i - 1] == pat[next[j]]) {
        next[i] = next[j] + 1;
        break;
      } else {
        j = next[j];
      }
    }
  }
  return next;
}

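Optimizing the next table: if pattern[i] equals pattern[next[i]], falling back to next[i] after a mismatch at i compares the same character again and fails again, so next[i] can take the value of next[next[i]] directly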

private static int[] next(String pattern) {
	char[] chars = pattern.toCharArray();
	int[] next = new int[chars.length];
	next[0] = -1;
	int i = 0;
	int n = -1;
	int iMax = chars.length - 1;
	while (i < iMax) {
		if (n < 0 || chars[i] == chars[n]) {
			i++;
			n++;
			if (chars[i] == chars[n]) {
				// Effectively n = next[n]: jump one extra step, because falling back to n would hit the same character again
				next[i] = next[n];
			} else {
				next[i] = n;
			}
		} else {
			n = next[n];
		}
	}
	return next;
}

Effect of the optimized next table
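
As a stand-in for the figure, the following sketch prints both tables for the pattern AAAAB used earlier. The two builders follow the same logic as the next methods listed in this article; the class name NextCompare, the method names plainNext and optimizedNext, and the empty-pattern guards are additions made for this example.

```java
import java.util.Arrays;

public class NextCompare {
    // Plain next table: next[i] is the fallback index after a mismatch at i.
    static int[] plainNext(String pattern) {
        char[] chars = pattern.toCharArray();
        int[] next = new int[chars.length];
        if (chars.length == 0) return next; // guard added for this sketch
        next[0] = -1;
        int i = 0;
        int n = -1;
        while (i < chars.length - 1) {
            if (n < 0 || chars[i] == chars[n]) {
                next[++i] = ++n;
            } else {
                n = next[n];
            }
        }
        return next;
    }

    // Optimized next table: skips fallbacks that land on the same character.
    static int[] optimizedNext(String pattern) {
        char[] chars = pattern.toCharArray();
        int[] next = new int[chars.length];
        if (chars.length == 0) return next; // guard added for this sketch
        next[0] = -1;
        int i = 0;
        int n = -1;
        while (i < chars.length - 1) {
            if (n < 0 || chars[i] == chars[n]) {
                i++;
                n++;
                // Same character at both positions: reuse the earlier fallback.
                next[i] = (chars[i] == chars[n]) ? next[n] : n;
            } else {
                n = next[n];
            }
        }
        return next;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(plainNext("AAAAB")));     // [-1, 0, 1, 2, 3]
        System.out.println(Arrays.toString(optimizedNext("AAAAB"))); // [-1, -1, -1, -1, 3]
    }
}
```

With the plain table, a mismatch on one of the leading A characters falls back through several positions that all hold A; with the optimized table the same mismatch jumps straight to -1.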
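KMP performance analysis
1. Main logic
  1. Best-case time complexity: O(m)
  2. Worst-case time complexity: O(n). The loop runs at most 2n iterations: the matching branch (red in the original highlighted code) advances ti and so runs at most n times, and the fallback branch (green) cannot run more often than pi has been incremented, so it also runs at most n times
2. Building the next table follows the same logic as the KMP main loop
  1. Time complexity: O(m)
3. KMP as a whole
  1. Best-case time complexity: O(m)
  2. Worst-case time complexity: O(m + n)
  3. Space complexity: O(m)
Difference between brute force and KMP
1. When a character fails to match
  1. Brute force: ti backtracks to a position on its left, and pi goes back to 0
  2. KMP: ti does not backtrack, and pi goes back to next[pi]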
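
The difference can be made concrete by counting character comparisons on the earlier example (text AAAAABCDEF, pattern AAAAB). This is a minimal sketch under the assumption that both inputs are non-null and the pattern is non-empty; the class name CompareCounts and the int[]{index, comparisons} return shape are choices made for this example.

```java
public class CompareCounts {
    // Brute force: returns {match index or -1, number of character comparisons}.
    static int[] bruteForce(String text, String pattern) {
        char[] t = text.toCharArray();
        char[] p = pattern.toCharArray();
        int comparisons = 0;
        int ti = 0;
        int pi = 0;
        while (pi < p.length && ti - pi <= t.length - p.length) {
            comparisons++;
            if (t[ti] == p[pi]) {
                ti++;
                pi++;
            } else {
                ti = ti - pi + 1; // ti backtracks
                pi = 0;
            }
        }
        return new int[] { pi == p.length ? ti - pi : -1, comparisons };
    }

    // KMP with the plain next table: same return shape.
    static int[] kmp(String text, String pattern) {
        char[] t = text.toCharArray();
        char[] p = pattern.toCharArray();
        int[] next = new int[p.length];
        next[0] = -1;
        for (int i = 0, n = -1; i < p.length - 1; ) {
            if (n < 0 || p[i] == p[n]) {
                next[++i] = ++n;
            } else {
                n = next[n];
            }
        }
        int comparisons = 0;
        int ti = 0;
        int pi = 0;
        while (pi < p.length && ti - pi <= t.length - p.length) {
            if (pi < 0) { // sentinel: advance the text pointer, no comparison
                ti++;
                pi++;
                continue;
            }
            comparisons++;
            if (t[ti] == p[pi]) {
                ti++;
                pi++;
            } else {
                pi = next[pi]; // ti stays where it is
            }
        }
        return new int[] { pi == p.length ? ti - pi : -1, comparisons };
    }

    public static void main(String[] args) {
        int[] bf = bruteForce("AAAAABCDEF", "AAAAB");
        int[] km = kmp("AAAAABCDEF", "AAAAB");
        System.out.println(bf[0] + " " + bf[1]); // 1 10
        System.out.println(km[0] + " " + km[1]); // 1 7
    }
}
```

Both searches find the match at index 1. Brute force repeats the four leading comparisons after backtracking, while KMP resumes at pattern index 3 and needs only two more.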
