String Matching Algorithms (Part 3)


bytxl


Category: Algorithms

Article link: https://blog.csdn.net/bytxl/article/details/42000875


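Note: this article is a loose translation of EXACT STRING MATCHING ALGORITHMS, with some filler removed and some explanations added.

Every algorithm in this article reports all match positions. In the code, the pattern is x (of length m) and the text is y (of length n). All strings are built over a finite alphabet Σ of size σ.

4. How Far Can We Slide?

Recall that in brute-force search, after each attempt, whether or not it succeeded, the pattern slides one position to the right and the comparison starts over. Is there a way to use the results of earlier comparisons so that the pattern can slide further?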
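As a baseline for the algorithms that follow, here is a minimal sketch of the brute-force search described above. The function name `brute_force` and the `out`/`cap` parameters (a caller-supplied buffer for match positions) are illustrative assumptions, not part of the original article, which reports matches through an OUTPUT macro instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Brute-force search: try every window position j = 0 .. n-m and slide by
 * exactly one position after each attempt, whether it matched or not.
 * Stores up to `cap` match positions in `out` and returns the total number
 * of matches. `brute_force`, `out` and `cap` are hypothetical names. */
size_t brute_force(const char *x, size_t m, const char *y, size_t n,
                   size_t *out, size_t cap) {
    size_t count = 0;
    assert(m > 0);
    for (size_t j = 0; j + m <= n; ++j) {
        if (memcmp(x, y + j, m) == 0) {
            if (count < cap)
                out[count] = j;
            ++count;
        }
    }
    return count;
}
```

Every attempt here restarts at the first pattern character and discards whatever the previous attempt learned; the algorithms below try to reuse that information.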

Before introducing the classic KMP algorithm, I will first present a few simple sliding algorithms.

Not So Naive

Features:

preprocessing phase in constant time and space;
searching phase in O(nm) time complexity;
slightly (by coefficient) sub-linear in the average case.

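As the name suggests, this algorithm really is a little naive. It decides how far to slide, slightly further than brute force, based on whether the first two characters of the pattern are equal:

If the first two pattern characters are equal, a text character that differs from the second pattern character must also differ from the first. If the first two pattern characters differ, a text character that equals the second pattern character must differ from the first.

In both cases we know, without comparing, that the text character cannot equal the first pattern character, so the pattern can slide one position further than in brute force.

During the searching phase of the Not So Naive algorithm the character comparisons are made with the pattern positions in the following order: 1, 2, ... , m-2, m-1, 0.

The code is as follows: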

 
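    #include <string.h>   /* memcmp */

    /* OUTPUT(j) is assumed to report a match at text position j.
       Requires m >= 2, because the algorithm inspects x[0] and x[1]. */
    void NSN(char *x, int m, char *y, int n) {
       int j, k, ell;

       /* Preprocessing */
       if (x[0] == x[1]) {
          k = 2;
          ell = 1;
       }
       else {
          k = 1;
          ell = 2;
       }

       /* Searching */
       j = 0;
       while (j <= n - m)
          if (x[1] != y[j + 1])
             j += k;
          else {
             /* positions 2 .. m-1 first, position 0 last */
             if (memcmp(x + 2, y + j + 2, m - 2) == 0 &&
                 x[0] == y[j])
                OUTPUT(j);
             j += ell;
          }
    }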
 
 
 
  

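Explanation of the code:

k is the shift used when the second pattern character differs from the second character of the current text window (x[1] != y[j+1]); ell is the shift used when they are equal.

When x[1] != y[j+1]: if the first two pattern characters are equal, the shift is 2 (for example, with the window at text position 0, x[0] = x[1] != y[1], so an attempt after shifting by 1 would certainly fail and we can shift by 2). Otherwise the shift is 1 (x[0] may equal y[1], so we can only shift by 1 and compare).

When x[1] == y[j+1]: if the first two pattern characters are equal, the shift is 1 (with the window at text position 0, x[0] = x[1] = y[1], so the alignment after shifting by 1 may match and must be checked). Otherwise the shift is 2 (x[0] != x[1] = y[1], so an attempt after shifting by 1 would certainly fail).

Summary of the shifts: look at the two comparisons "x[1] against y[j+1]" and "x[0] against x[1]". If exactly one of them is an equality, the shift is 2; if both are equalities or both are inequalities, the shift is 1.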
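The shift rule summarized above can be checked on small inputs with a variant of NSN that stores match positions instead of printing them. This is only a sketch: the name `nsn_search` and the `out`/`cap` buffer parameters are assumptions introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Not So Naive search that records matches in a caller-supplied buffer.
 * Requires m >= 2 because the shift rule inspects x[0] and x[1].
 * `nsn_search`, `out` and `cap` are hypothetical names. */
size_t nsn_search(const char *x, size_t m, const char *y, size_t n,
                  size_t *out, size_t cap) {
    size_t count = 0, j = 0;
    size_t k, ell;
    assert(m >= 2);

    /* Shift is 2 when exactly one of the two comparisons
     * (x[1] vs y[j+1], x[0] vs x[1]) is an equality, otherwise 1. */
    if (x[0] == x[1]) { k = 2; ell = 1; }
    else              { k = 1; ell = 2; }

    while (j + m <= n) {
        if (x[1] != y[j + 1]) {
            j += k;                      /* second character mismatched */
        } else {
            /* compare positions 2 .. m-1, then position 0 */
            if (memcmp(x + 2, y + j + 2, m - 2) == 0 && x[0] == y[j]) {
                if (count < cap)
                    out[count] = j;
                ++count;
            }
            j += ell;                    /* second character matched */
        }
    }
    return count;
}
```

With the pattern "aab" (first two characters equal) the mismatch shift is 2 and the match shift is 1; with "GCAGAGAG" (first two characters different) it is the other way round.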

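This algorithm needs only constant time and space for preprocessing. During the search it compares the second pattern character first and then the remaining positions, so that in some cases the comparison against the first character can be skipped and the pattern can slide further. The complexity is still O(mn), though; at best it is a slight improvement over brute force.

The idea really is naive: it looks at only two pattern characters, and the shifts are too small. Can we take more of the pattern into account? That leads to the Quick Search algorithm.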

Quick Search

Features:

simplification of the Boyer-Moore algorithm;
uses only the bad-character shift;
easy to implement;
preprocessing phase in O(m+σ) time and O(σ) space complexity;
searching phase in O(mn) time complexity;
very fast in practice for short patterns and large alphabets.

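The name inevitably brings quicksort to mind. Quicksort is quadratic in the worst case but extremely fast in typical cases; is Quick Search the same? Yes, exactly. This algorithm performs very well when the pattern is short and the alphabet is large, even though its search time complexity is O(mn).

The idea is as follows. Look at the text character immediately to the right of the current window. If that character never occurs in the pattern, no alignment that covers it can match, so the pattern can jump completely past it. If it does occur, then to avoid missing a possible match the pattern is shifted so that the last occurrence of that character in the pattern is aligned with it.

The code is as follows: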

 
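    #define ASIZE 256   /* assumed alphabet size: one entry per unsigned char value */

    void preQsBc(char *x, int m, int qsBc[]) {
       int i;

       /* characters that do not occur in x get the maximal shift m + 1 */
       for (i = 0; i < ASIZE; ++i)
          qsBc[i] = m + 1;
       /* later occurrences overwrite earlier ones */
       for (i = 0; i < m; ++i)
          qsBc[(unsigned char)x[i]] = m - i;
    }


    void QS(char *x, int m, char *y, int n) {
       int j, qsBc[ASIZE];

       /* Preprocessing */
       preQsBc(x, m, qsBc);

       /* Searching */
       j = 0;
       while (j <= n - m) {
          if (memcmp(x, y + j, m) == 0)
             OUTPUT(j);
          /* For the last window (j == n - m) this reads y[n], so y must be
             NUL-terminated or otherwise readable at index n. */
          j += qsBc[(unsigned char)y[j + m]];     /* shift */
       }
    }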
 
 
 
  

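To understand this algorithm, look at the line marked /* shift */. Whether or not the current attempt succeeded, the pattern is shifted, and the shift is determined by where the first character outside the window occurs in the pattern. You can think of "the first character outside the window can be matched" as a precondition for the next attempt.

Now you can see why this algorithm works best with short patterns and large alphabets. With a large alphabet and a short pattern, a text character is more likely not to occur in the pattern at all, so the maximal shift is taken more often. A short pattern also means less time is spent on each comparison, so most of the work is sliding.

Example: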

qsBc table used by the Quick Search algorithm (pattern x = GCAGAGAG, m = 8):

a	A	C	G	T
qsBc[a]	2	7	1	9

First attempt
G	C	A	T	C	G	C	A	G	A	G	A	G	T	A	T	A	C	A	G	T	A	C	G
1	2	3	4
G	C	A	G	A	G	A	G

Shift by: 1 (qsBc[G])

Second attempt
G	C	A	T	C	G	C	A	G	A	G	A	G	T	A	T	A	C	A	G	T	A	C	G
	1
	G	C	A	G	A	G	A	G

Shift by: 2 (qsBc[A])

Third attempt
G	C	A	T	C	G	C	A	G	A	G	A	G	T	A	T	A	C	A	G	T	A	C	G
			1
			G	C	A	G	A	G	A	G

Shift by: 2 (qsBc[A])

Fourth attempt
G	C	A	T	C	G	C	A	G	A	G	A	G	T	A	T	A	C	A	G	T	A	C	G
					1	2	3	4	5	6	7	8
					G	C	A	G	A	G	A	G

Shift by: 9 (qsBc[T])

Fifth attempt
G	C	A	T	C	G	C	A	G	A	G	A	G	T	A	T	A	C	A	G	T	A	C	G
														1
														G	C	A	G	A	G	A	G

Shift by: 7 (qsBc[C])

The Quick Search algorithm performs 15 character comparisons on the example.

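Walkthrough of the example:

In preQsBc, the first for loop initializes every entry of qsBc to m + 1 = 9. Since T does not occur in the pattern, qsBc[T] stays 9.

In preQsBc, the second for loop computes the value m - i for every character of the pattern. Because later positions overwrite earlier ones, the final value always comes from the last occurrence of the character.

For example, the last occurrence of A is at position 6, so qsBc[A] = 8 - 6 = 2.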
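The table values used in the example can be reproduced with a small sketch of the preprocessing step. The name `pre_qs_bc`, the use of `size_t`, and `ASIZE = 256` (one entry per unsigned char value) are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define ASIZE 256  /* assumed alphabet size: one entry per unsigned char value */

/* Fill qsBc so that qsBc[c] = m - (index of the last occurrence of c in x),
 * or m + 1 when c does not occur in x. `pre_qs_bc` is a hypothetical name. */
void pre_qs_bc(const char *x, size_t m, size_t qsBc[ASIZE]) {
    assert(m > 0);
    for (size_t c = 0; c < ASIZE; ++c)
        qsBc[c] = m + 1;                    /* default: character not in x */
    for (size_t i = 0; i < m; ++i)
        qsBc[(unsigned char)x[i]] = m - i;  /* last occurrence wins */
}
```

For x = GCAGAGAG this gives qsBc[A] = 2, qsBc[C] = 7, qsBc[G] = 1 and qsBc[T] = 9, which are the shifts that appear in the five attempts above. The cast to unsigned char avoids a negative index on platforms where char is signed.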

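The while loop in QS always compares the whole window against the pattern and reports a match if they are equal. It then advances the current text position by qsBc[y[j + m]].

About the shift:

In the next attempt, the first text character outside the current window is always aligned with the last occurrence of that character in the pattern. In other words, the character highlighted in red in the original figures is aligned in the next attempt with its last occurrence in the pattern (shown in blue in the original figures). T does not occur in the pattern, so after shifting by qsBc[T] the T is not aligned with any pattern character.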
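To see these shifts at work, the following sketch runs Quick Search and records the text position at which each attempt starts. The function `qs_trace` and its `starts`, `cap` and `matches` parameters are hypothetical names added for this illustration; the shift logic itself follows the QS code in the article.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ASIZE 256  /* assumed alphabet size: one entry per unsigned char value */

/* Quick Search that records the start position of every attempt.
 * Returns the number of attempts and stores the number of matches in
 * *matches. y must be readable at index n (for example NUL-terminated),
 * because the shift for the last window looks at y[n].
 * `qs_trace`, `starts`, `cap` and `matches` are hypothetical names. */
size_t qs_trace(const char *x, size_t m, const char *y, size_t n,
                size_t *starts, size_t cap, size_t *matches) {
    size_t qsBc[ASIZE];
    size_t attempts = 0, j = 0;
    assert(m > 0);

    for (size_t c = 0; c < ASIZE; ++c)
        qsBc[c] = m + 1;
    for (size_t i = 0; i < m; ++i)
        qsBc[(unsigned char)x[i]] = m - i;

    *matches = 0;
    while (j + m <= n) {
        if (attempts < cap)
            starts[attempts] = j;
        ++attempts;
        if (memcmp(x, y + j, m) == 0)
            ++*matches;
        /* shift by the entry of the first character after the window */
        j += qsBc[(unsigned char)y[j + m]];
    }
    return attempts;
}
```

On the example text and pattern, the attempts start at positions 0, 1, 3, 5 and 14, which corresponds to the shifts 1, 2, 2 and 9 listed in the example; the final shift of 7 moves the window past the end of the text.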

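The drawback is that the worst-case complexity of this algorithm is still O(mn), even though the preprocessing already uses every character of the pattern. Can sliding alone give a linear algorithm? Look closely at the comparison process: what is the root cause of the non-linear behavior? It is backtracking in the text. Let us now look at a truly linear algorithm, MP, and its refinement, KMP.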

MP/KMP

Features of MP:

performs the comparisons from left to right;
preprocessing phase in O(m) space and time complexity;
searching phase in O(n+m) time complexity (independent from the alphabet size);
performs at most 2n-1 text character comparisons during the searching phase;
delay bounded by m.

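The MP (Morris-Pratt) algorithm:

The MP algorithm was designed with the goal of never backtracking in the text. Its guiding principle is that a text character that has already been compared and found equal is never compared again. This makes sense: since that part of the text equals part of the pattern, its information is already contained in the pattern. During preprocessing the pattern is compared against parts of itself, and the resulting self-similarity information is stored in the Next array. That is, prefixes of the pattern are compared against other parts of the pattern to decide where the next comparison should resume. Later, when a mismatch occurs during the search, the preprocessing result lets us slide the pattern directly so that its self-similar part is aligned with the text, and the comparison continues from the mismatched text position without backtracking.

In short: never backtrack in the text, and move the pattern as far as possible to the next position that can still yield a match.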

The design of the Morris-Pratt algorithm follows a tight analysis of the Brute Force algorithm, and especially on the way this latter wastes the information gathered during the scan of the text.

Let us look more closely at the brute force algorithm. It is possible to improve the length of the shifts and simultaneously remember some portions of the text that match the pattern. This saves comparisons between characters of the pattern and characters of the text and consequently increases the speed of the search.

Consider an attempt at a left position j on y, that is when the window is positioned on the text factor y[j .. j+m-1]. Assume that the first mismatch occurs between x[i] and y[i+j] with 0 < i < m. Then, x[0 .. i-1] = y[j .. i+j-1] = u and a = x[i] ≠ y[i+j] = b.

When shifting, it is reasonable to expect that a prefix v of the pattern matches some suffix of the portion u of the text. The longest such prefix v is called the border of u (it occurs at both ends of u).

This introduces the notation: let mpNext[i] be the length of the longest border of x[0 .. i-1] for 0 < i ≤ m. Then, after a shift, the comparisons can resume between characters c = x[mpNext[i]] and y[i+j] = b without missing any occurrence of x in y, and avoiding a backtrack on the text (see figure 6.1). The value of mpNext[0] is set to -1.

Figure 6.1: Shift in the Morris-Pratt algorithm (v border of u).

The table mpNext can be computed in O(m) space and time before the searching phase, applying the same searching algorithm to the pattern itself, as if x=y.

Then the searching phase can be done in O(m+n) time. The Morris-Pratt algorithm performs at most 2n-1 text character comparisons during the searching phase. The delay (maximal number of comparisons for a single text character) is bounded by m.
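To make the definition of mpNext concrete, the table can be computed directly from the definition of a border. The sketch below is deliberately quadratic and is not the O(m) preprocessing described above; the names `longest_border` and `mp_next_by_definition` are assumptions introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the longest border of x[0 .. len-1], i.e. the longest proper
 * prefix that is also a suffix. Tries candidate lengths from long to short. */
static int longest_border(const char *x, int len) {
    for (int b = len - 1; b > 0; --b)
        if (memcmp(x, x + len - b, (size_t)b) == 0)
            return b;
    return 0;
}

/* mpNext[0] = -1; mpNext[i] = length of the longest border of x[0 .. i-1]
 * for 0 < i <= m. `next` must have room for m + 1 entries.
 * `mp_next_by_definition` is a hypothetical name. */
void mp_next_by_definition(const char *x, int m, int next[]) {
    assert(m > 0);
    next[0] = -1;
    for (int i = 1; i <= m; ++i)
        next[i] = longest_border(x, i);
}
```

For x = GCAGAGAG this yields -1, 0, 0, 0, 1, 0, 1, 0, 1 for i = 0 .. 8. For instance, the prefix GCAG has the border G, so mpNext[4] = 1.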

Code:

/* Next must have room for m + 1 entries (indices 0 .. m). */
void preMp(char *x, int m, int Next[]) {
   int i, j;
   i = 0;
   j = Next[0] = -1;
   while (i < m) {
      while (j > -1 && x[i] != x[j])
         j = Next[j];
      i++;
      j++;
      // Uncommenting the three commented-out lines below turns this into KMP
      //if (x[i] == x[j])
      //   Next[i] = Next[j];
      //else
         Next[i] = j;
   }
}

void MP(char *x, int m, char *y, int n) {
   int i, j, Next[XSIZE];   /* XSIZE must be larger than m */
   /* Preprocessing */
   preMp(x, m, Next);
   /* Searching */
   i = j = 0;
   while (j < n) {
      while (i > -1 && x[i] != y[j])
         i = Next[i];
      i++;
      j++;
      if (i >= m) {
         OUTPUT(j - i);
         i = Next[i];
      }
   }
}

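In the original program above:

MP compares the pattern with the text, and j is the current position in the text y.
preMp compares the pattern with itself, and i is the current position in the copy of x that plays the role of the text.

This inconsistency makes the code harder to remember and understand.

In the version below, j is the current position in the text and i is the current position in the pattern, in both preMp and MP: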

#define XSIZE 256

// preMp compares the pattern with itself, so one copy plays the role of the text
// MP compares the pattern with the text
// In both preMp and MP, j is the current position in the text
void preMp(char *x, int m, int Next[]) {
   int i, j;
   j = 0;
   i = Next[0] = -1;
   while (j < m) {
      while (i > -1 && x[i] != x[j])
         i = Next[i];
      i++;
      j++;
      // Uncommenting the three commented-out lines below turns this into KMP
      //if (x[i] == x[j])
      //   Next[j] = Next[i];
      //else
         Next[j] = i;
   }
}

void MP(char *x, int m, char *y, int n) {
   int i, j, Next[XSIZE];
   /* Preprocessing */
   preMp(x, m, Next);
   /* Searching */
   i = j = 0;
   while (j < n) {
      while (i > -1 && x[i] != y[j])
         i = Next[i];
      i++;
      j++;
      if (i >= m) {
         OUTPUT(j - i);
         //printf( "%d\n", j - i );
         i = Next[i];
      }
   }
}
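The comment in preMp says that enabling the three commented-out lines yields KMP. A minimal sketch of that variant follows, with the lines enabled and matches stored in a buffer. The names `pre_kmp` and `kmp_search` and the `out`/`cap` parameters are assumptions made for this illustration. Note that the last iteration of the preprocessing loop reads x[m], so the sketch assumes x is NUL-terminated.

```c
#include <assert.h>
#include <stddef.h>

/* KMP table: like the MP table, but when x[i] == x[j] the entry is replaced
 * by next[i], because falling back to i would repeat the same failed
 * comparison. Reads x[m] in the last iteration, so x must be NUL-terminated.
 * `next` must have room for m + 1 entries. `pre_kmp` is a hypothetical name. */
void pre_kmp(const char *x, int m, int next[]) {
    int i = -1, j = 0;
    assert(m > 0);
    next[0] = -1;
    while (j < m) {
        while (i > -1 && x[i] != x[j])
            i = next[i];
        ++i;
        ++j;
        if (x[i] == x[j])
            next[j] = next[i];
        else
            next[j] = i;
    }
}

/* Search with the same loop as MP; j never moves backwards in the text.
 * Stores up to `cap` match positions in `out` and returns the match count.
 * `kmp_search`, `out` and `cap` are hypothetical names. */
size_t kmp_search(const char *x, int m, const char *y, int n,
                  int next[], int *out, size_t cap) {
    size_t count = 0;
    int i = 0, j = 0;
    pre_kmp(x, m, next);
    while (j < n) {
        while (i > -1 && x[i] != y[j])
            i = next[i];
        ++i;
        ++j;
        if (i >= m) {
            if (count < cap)
                out[count] = j - i;
            ++count;
            i = next[i];
        }
    }
    return count;
}
```

For x = GCAGAGAG the resulting table is -1, 0, 0, -1, 1, -1, 1, -1, 1. It differs from the MP table at indices 3, 5 and 7, where x[i] equals the character the MP table would fall back to.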

Example:

The mpNext table, i.e. the Next table filled in by preMp.

First attempt
G	C	A	T	C	G	C	A	G	A	G	A	G	T	A	T	A	C	A	G	T	A	C	G
1	2	3	4
G	C	A	G	A	G	A	G

Shift by: 3 (i-mpNext[i]=3-0)

Second attempt
G	C	A	T	C	G	C	A	G	A	G	A	G	T	A	T	A	C	A	G	T	A	C	G
			1
			G	C	A	G	A	G	A	G