String Matching Algorithms: "Boyer Moore"

Latest recommended article published on 2021-02-23 01:56:01

weixin_34366546

Original link: https://my.oschina.net/amince/blog/180255

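The Boyer-Moore string search algorithm is a very efficient string search algorithm. It was designed by Bob Boyer and J Strother Moore in 1977. The initial definition dates from 1975; the construction algorithm and the proof of the algorithm followed later.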

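First, some definitions:

1. pattern is the pattern string, of length patLen;

2. Text is the string being searched, of length n;

3. j is the position in pattern of the character that currently mismatches (0 ≤ j ≤ patLen - 1);

4. m is the number of characters already matched (0 ≤ m < patLen), so j + m = patLen - 1;

5. Δ(*) is the position in pattern of the rightmost occurrence of the character *, where * can be any character;

Many references number array positions from 1 when explaining the principle. Here all positions start from 0, so that the explanation lines up with the code;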
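The definitions above can be sketched in a few lines of C++. The helper names `rightmost` and `matched_length` are hypothetical, and returning -1 for a character that does not occur is an assumption made for this sketch.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper for Δ(c): index of the rightmost occurrence of c in
// pattern, or -1 when c does not occur (0-based, as in the definitions).
int rightmost(const std::string& pattern, char c)
{
    for (int k = static_cast<int>(pattern.size()) - 1; k >= 0; --k)
        if (pattern[k] == c)
            return k;
    return -1;
}

// m follows from j: every character to the right of position j has matched.
int matched_length(int patLen, int j)
{
    assert(j >= 0 && j < patLen);
    return patLen - 1 - j;
}
```

For the pattern "AT-THAT", `rightmost` gives 2 for '-' and 6 for 'T', which are the Δ values used in the bad character examples.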

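Let us look at the bad character rule first:

I. Bad character rule: align the mismatching text character with the rightmost occurrence of that character in pattern; if pattern does not contain the character, skip past it entirely;

>Case 1: the mismatching character does not occur in pattern (the jump is shown in the figure):

Text pointer moves right by: patLen, which puts it at the right end of the shifted pattern;

Pattern moves right by: patLen - m;

>Case 2: the mismatching character does occur in pattern. There are two sub-cases:

a>. The rightmost occurrence of the character in pattern lies to the left of the mismatch position (the jump is shown in the figure). With patLen = 7 and m = 2:

Text pointer moves right by: j - Δ('-') + m = (j + m) - Δ('-') = (patLen - 1) - Δ('-') = (7 - 1) - 2 = 4

Pattern moves right by: pointer shift - m = 4 - 2 = 2;

b>. The rightmost occurrence of the character in pattern lies to the right of the mismatch position (the jump is shown in the figure):

Text pointer moves right by: (patLen - 1) - Δ('T') = (7 - 1) - 6 = 0

Pattern moves right by: pointer shift - m = 0 - 2 = -2

Here the pattern would move backward and repeat comparisons, which must never happen. In this case simply move the pattern right by 1.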
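As a minimal sketch, the three situations can be folded into one expression for the pattern shift: since j + m = patLen - 1, the shift is j - Δ, with Δ taken as -1 when the character is absent, and clamped to 1 when it would be negative. The function name `bad_char_pattern_shift` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical helper: how far the pattern moves right under the bad
// character rule when text character c mismatches at pattern position j.
int bad_char_pattern_shift(const std::string& pattern, int j, char c)
{
    const int patLen = static_cast<int>(pattern.size());
    assert(j >= 0 && j < patLen);

    // delta: rightmost index of c in pattern, or -1 if c is absent.
    int delta = -1;
    for (int k = patLen - 1; k >= 0; --k)
    {
        if (pattern[k] == c)
        {
            delta = k;
            break;
        }
    }

    // Absent:      j - (-1) = j + 1 = patLen - m
    // Left of j:   j - delta > 0
    // Right of j:  j - delta < 0, so fall back to a shift of 1
    return std::max(1, j - delta);
}
```

With the pattern "AT-THAT" and j = 4 (so m = 2), the shift is 2 for '-', 1 for 'T', and 5 = patLen - m for a character such as 'X' that is not in the pattern.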

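Summarizing the three cases, we define the bad character function delta1() as the offset of the text pointer:

delta1(*) = patLen; (the mismatching character does not occur in pattern)

= patLen - 1 - Δ(*); (the mismatching character occurs in pattern, and its rightmost occurrence lies to the left of the mismatch position)

= m + 1; (the mismatching character occurs in pattern, and its rightmost occurrence lies to the right of the mismatch position; the pattern moves right by 1, so the pointer moves by m + 1)

Note that the code below stores patLen - 1 - Δ(*) for every character of pattern and covers the third case by taking the larger of the delta1 and delta2 shifts.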

II. Good suffix rule: using the part that has already matched (subpat), look in pattern for a substring that matches subpat completely or partially and align it directly, avoiding useless shifts;

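Some conventions first:

1. Let $ be a character that does not occur in pattern, and let pat[i] = $ for i < 0;

2. Two sequences [c_0 … c_{n-1}] and [d_0 … d_{n-1}] unify if and only if, for every j with 0 ≤ j < n, c_j = d_j or c_j = $ or d_j = $;

3. rpr(j) (rightmost plausible reoccurrence) is the position of the rightmost plausible reoccurrence of subpat (p[j+1 ~ patLen-1]): the largest k such that [pat[j + 1] ... pat[patLen - 1]] and [pat[k] ... pat[k + patLen - j - 2]] unify and either k ≤ 0 or pat[k - 1] != pat[j].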

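The figure above lists the rpr() values for the pattern "ABXYCDEXY". Let us work through them:

a>. j = 8: the matched substring p[j+1 … patLen-1] is empty. By the definition of rpr(), the rightmost position at which the empty string can plausibly reoccur is k = 8, because pat[7] != pat[8]; hence rpr(8) = 8.

b>. j = 7: subpat is "Y". We have p[3 ~ 3] = subpat with k = 3 > 0, but pat[k-1] == pat[j] == 'X', so the condition fails. Continuing the search to the left, subpat can only sit at position -1, just before the head of pattern, i.e. rpr(7) = -1.

c>. j = 6: subpat is "XY". We have p[2 ~ 3] = subpat, and p[k-1] != pat[j] also holds ('B' versus 'E'), so rpr(6) = 2.

d>. j = 5: subpat is "EXY". No substring of pattern unifies with subpat, so it can only lie before the head of pattern, and rpr(5) = -3;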

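The remaining values follow in the same way; the cases above should cover every way of computing rpr(). One regularity follows from the analysis:

rpr[patLen-1] = patLen-1, provided the last two characters of pattern differ (as they do here).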
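The definition of rpr(j) can be checked with a brute-force sketch written directly from the three conventions. This is not the author's calc_delta2 routine; the names `unifies_at` and `rpr_bruteforce` are hypothetical, and the code favors clarity over speed.

```cpp
#include <cassert>
#include <string>

// pat[i] under the convention that any position left of the pattern holds
// '$', which unifies with every character.
static bool unifies_at(const std::string& pat, int i, char c)
{
    return i < 0 || pat[i] == c;
}

// Brute-force rpr(j), straight from the definition.
int rpr_bruteforce(const std::string& pat, int j)
{
    const int patLen = static_cast<int>(pat.size());
    assert(j >= 0 && j < patLen);
    const int len = patLen - 1 - j;      // length of the matched suffix

    for (int k = j; k > -patLen; --k)    // try the largest k first
    {
        bool ok = true;
        for (int t = 0; t < len && ok; ++t)
            ok = unifies_at(pat, k + t, pat[j + 1 + t]);
        // The character before the reoccurrence must differ from pat[j],
        // unless the reoccurrence starts at or before the pattern head.
        if (ok && (k <= 0 || pat[k - 1] != pat[j]))
            return k;
    }
    return -len;  // not reached: k = -len always unifies
}
```

For "ABXYCDEXY" this reproduces the values worked out above: rpr(8) = 8, rpr(7) = -1, rpr(6) = 2 and rpr(5) = -3.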

From this we get the shift for the good suffix rule, which aligns pat[k] with the position currently under pat[j+1]:

Pattern moves right by: j + 1 - rpr(j)

Text pointer moves right by: m + j + 1 - rpr(j) = (patLen - 1 - j) + j + 1 - rpr(j) = patLen - rpr(j)

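We now define the good suffix offset:

delta2(j) = patLen - rpr(j); (0 ≤ j < patLen)

*Readers who have seen other BM material may find delta2(j) = patLen + 1 - rpr(j) there. As noted at the start, array indices here begin at 0, so rpr(j) is 1 smaller than it is with 1-based indices;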
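The relation between the two shifts can be sketched as follows. The struct and function names are hypothetical; the rpr value is passed in rather than computed, so the arithmetic stays in focus.

```cpp
#include <cassert>

// Hypothetical helper type: the two shifts implied by a given rpr(j).
struct GoodSuffixShift
{
    int pattern_shift;  // how far the pattern moves right
    int pointer_shift;  // how far the text pointer moves right: delta2(j)
};

GoodSuffixShift good_suffix_shift(int patLen, int j, int rpr_j)
{
    assert(j >= 0 && j < patLen && rpr_j <= j);
    const int m = patLen - 1 - j;        // characters already matched
    GoodSuffixShift s;
    s.pattern_shift = j + 1 - rpr_j;
    s.pointer_shift = patLen - rpr_j;    // delta2(j)
    // The pointer also has to travel back over the m matched characters.
    assert(s.pointer_shift == s.pattern_shift + m);
    return s;
}
```

For "ABXYCDEXY" (patLen = 9) and j = 6 with rpr(6) = 2, the pattern moves by 5 and the pointer by 7; for j = 7 with rpr(7) = -1, delta2 is 10.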

The complete implementation follows:

#include <stdio.h>   // printf(), used unconditionally in main()
#include <string.h>  // strlen()
#include <stdlib.h>  // NULL

#define ALPHABET_SIZE (1 << (sizeof(char)*8))

// Enable any/all to trace intermediate results
//#define TRACE_DELTA1
//#define TRACE_DELTA2
//#define TRACE_BM

#if defined TRACE_DELTA1 || defined TRACE_DELTA2 || defined TRACE_BM
#include <stdio.h>
#include <ctype.h>
#endif

void calc_delta1(const char *pat, int patlen, int delta1[]) 
{
	int j = 0;
	for (j = 0; j < ALPHABET_SIZE; j++)
		delta1[j] = patlen;

	for (j = 0; j < patlen; j++)
	{
		// By scanning pat from left to right, the final 
		// value in delta1[char] is the *rightmost* occurrence of
		// char in pat
		// Index through unsigned char: plain char may be signed
		delta1[(unsigned char)pat[j]] = patlen - 1 - j;
	}

#ifdef TRACE_DELTA1
	printf("Starting dump delta1[]>>>>>>>>>>>>>>>>>>>>>>>>>\n");
	for (j = 0; j < ALPHABET_SIZE; j++)
	{
		if (delta1[j] != patlen)
		{
			printf("       %c:%d\n", (char)j, delta1[j]);
		}
	}
	printf("  others:%d\n", patlen);
#endif
}

void calc_delta2(const char *pat, int patlen, int * delta2)
{
	int i = 0, j = 0, s = 0, m = 0, n = 0;
	// rpr[j] : where we can find rightmost plausible recurrence of pat[j+1 .. patlen-1]
	int *rpr = new int[patlen];

	// Mark each uninitialized rpr value with a large negative index
	const int def = -2*patlen;
	for (i = 0; i != patlen; i++)
	{
		rpr[i] = def;
	}

	// r: number of uninitialized entries in rpr[]
	int r = patlen;

	// Scan pattern from right-to-left until all rpr[] are initialized.
	// s: scan position.
	// Examine all candidate substrings that end just before pat[s],
	// including the empty string
	for (s = patlen - 1; r > 0; s--)
	{
		// m: length of the candidate substring pat[s-m .. s-1]
		for (m = 0; m <= patlen - 1 && r > 0; m++)
		{
			// Introduce j and k (as used in the BM paper)
			// j: index of leftmost character of suffix
			int j = patlen - m - 1;
			// k: index of leftmost character of (possible) recurrence.
			int k = s - m;

		#ifdef TRACE_DELTA2
			const int indent = patlen;
			printf("\ns:%d m:%d j:%d k:%d\n", s, m, j, k);
			printf("p  :%*s%s\n", indent, "", pat);
			printf("j  :%*s%*.*s\n", indent+j, "", m+1, m+1, &pat[j] );
			printf("k-1:%*s", indent+k-1, "");
			for (n = 0; n <= m; n++)
			{
				printf("%c", (k-1+n < 0 ? pat[j+n] : pat[k-1+n]) );
			}
			printf("\n");
		#endif

			// We have a match of pat[j+1 .. patlen-1] with pat[k .. k+m-1]
			// Compare pat[j] to pat[k-1].
			// Match: extend the substring to the left by increasing m
			// Mismatch: terminate the substring and check if plausible RPR

			bool mismatch = false;
			if (k > 0)
			{
				if (pat[j] == pat[k-1]) // extend substring
					continue;
				mismatch = true;
			}
			// else preceding char, pat[k-1] lies to the left of pat[0]
			// which terminates the substring

			// We have a match of m (possibly zero) characters.
			// pat[j+1 .. patlen-1] matches pat[k .. k+m-1] and
			// either pat[j] != pat[k-1] or k <= 0.
			// So rpr[j] = k (unless rpr[j] is already > k)
			if (rpr[j] < k)
			{
			#ifdef TRACE_DELTA2
				printf("2  :%*s %c %*.*s %*s s:%d m:%d j:%d k:%d r:%d\n",
					indent+j, "",
					toupper(pat[j]),
					m, m, &pat[j+1],
					(patlen-j-1-m), "",
					s, m, j, k, r);
			#endif
				rpr[j] = k;
				r--;
			}
		#ifdef TRACE_DELTA2
			else
			{
				printf("rpr[%d]=%d already inited\n", j, rpr[j]);
			}
		#endif

			// Once we have a mismatch (pat[j] != pat[k-1]) it is fruitless
			// to examine longer substrings ending before pat[s]: none of them
			// can be the rightmost plausible recurrence of the terminal
			// substring pat[j+1 .. patlen-1]
			if (mismatch)
			{
				break;
			}
		}
	}

	for (j = 0; j != patlen; j++) 
	{
		delta2[j] = patlen - rpr[j];
	}

#ifdef TRACE_DELTA2
	printf("R:"); // trace rpr[] values
	for (j = 0; j != patlen; j++)
	{
		printf(" %3d", rpr[j] );
	}

	printf("\n");
	printf("D:"); // trace delta2[] values

	for (j = 0; j != patlen; j++)
	{
		printf(" %3d", delta2[j] );
	}
	printf("\n");
#endif

	delete [] rpr;
}

/*
* Boyer-Moore search algorithm
*/
const char *boyermoore_search(const char * string, const char *pat) 
{
	int i = 0, j = 0, stringlen = 0;
	const char *result = NULL;

	int patlen = strlen(pat);
	int *delta1 = NULL;
	int *delta2 = NULL;

	if (patlen == 0)
		goto out;

	stringlen = strlen(string);
	if (patlen > stringlen)
		goto out;

	delta1 = new int[ALPHABET_SIZE];
	delta2 = new int[patlen];

#ifdef TRACE_BM
	printf("pattern: %s\n", pat);
#endif
	calc_delta1(pat, patlen, delta1);
	calc_delta2(pat, patlen, delta2);

#ifdef TRACE_BM
	printf("\nCalculating boyermoore_search>>>>>>>>>>>>>>>>>>>>>>>>>\n");
#endif

	// i: index of current string character
	for (i = patlen-1;;) 
	{
		if (i >= stringlen) 
		{
			result = NULL;
			goto out;
		}

		// j: index of current pattern character
		j = patlen-1;
		for (;;)
		{
			// Compare first, so that pat[0] is checked before a match is reported
			if (string[i] != pat[j])
				break;

		#ifdef TRACE_BM
			printf("p:%*s%*.*s%c%*.*s\n",
				(i-j), "",
				j, j, pat,
				toupper(pat[j]), // mark matched char with upcase
				patlen-j-1, patlen-j-1, &pat[j+1]);
		#endif
			if (j == 0)
			{
				// Every character of pat matched; i is the start of the match
				result = &string[i];
				goto out;
			}
			j--;
			i--;
		}

	#ifdef TRACE_BM
		printf("p:%*s%*.*s%c%*.*s\n",
			(i-j), "",
			j, j, pat,
			'?', // mark mismatch char
			patlen-j-1, patlen-j-1, &pat[j+1]);
		printf("c:%s\n", string);
	#endif
		// bc: "bad character" shift amount
		int bc = delta1[(unsigned char)string[i]];

		// gs: "good suffix" shift amount
		int gs = delta2[j];

	#ifdef TRACE_BM
		printf("j:%d bc:%d gs:%d\n\n", j, bc, gs);
	#endif
		// Take the larger shift (__max() is MSVC-specific, so spell it out)
		i += (bc > gs) ? bc : gs;
	}

/* common exit: result is still NULL when the pattern was not found */
out:
	delete [] delta1;
	delete [] delta2;
	return result;
}

int main(void)
{
	char src_str[80] = "WHICH-FINALLY-HALTS.--AT-THAT-POINT";
	char pat_str[80] = "AT-THAT";
	const char* find_str = NULL;

	find_str = boyermoore_search((const char *)src_str, (const char *)pat_str);
	if(NULL != find_str)
	{
		printf("\n Pattern found at: %s\n", find_str);
	}
	else
	{
		printf("Pattern not found!\n");
	}
	return 0;
}

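The Boyer Moore algorithm is sublinear in the typical case: the search usually inspects fewer than n characters of Text, and the longer the pattern, the more efficient BM tends to be. The O(patLen + n) bound is the worst case for a search in which pattern does not occur in Text, assuming the tables are built in linear time;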
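To make the "fewer than n characters" behavior concrete, here is a compact, self-contained sketch that counts character comparisons. It is not the listing above: it compares before reporting a match, builds delta2 from a brute-force rpr() taken straight from the definition, and uses the hypothetical names `SearchResult`, `rpr_of` and `bm_search_counting`.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct SearchResult
{
    int position;     // index of the first match in text, or -1
    int comparisons;  // number of text characters compared
};

// Brute-force rpr(j); positions left of the pattern act as a wildcard.
// Chosen for brevity, not speed. Terminates at k = -(patLen-1-j) at the latest.
static int rpr_of(const std::string& pat, int j)
{
    const int patLen = static_cast<int>(pat.size());
    const int len = patLen - 1 - j;
    for (int k = j; ; --k)
    {
        bool ok = true;
        for (int t = 0; t < len && ok; ++t)
            ok = (k + t < 0) || pat[k + t] == pat[j + 1 + t];
        if (ok && (k <= 0 || pat[k - 1] != pat[j]))
            return k;
    }
}

SearchResult bm_search_counting(const std::string& text, const std::string& pat)
{
    const int n = static_cast<int>(text.size());
    const int patLen = static_cast<int>(pat.size());
    SearchResult res = { -1, 0 };
    if (patLen == 0 || patLen > n)
        return res;

    std::vector<int> delta1(256, patLen), delta2(patLen);
    for (int j = 0; j < patLen; ++j)
    {
        delta1[static_cast<unsigned char>(pat[j])] = patLen - 1 - j;
        delta2[j] = patLen - rpr_of(pat, j);
    }

    int i = patLen - 1;                  // text index under the pattern's end
    while (i < n)
    {
        int j = patLen - 1;
        while (j >= 0)
        {
            ++res.comparisons;
            if (text[i] != pat[j])
                break;
            --i;
            --j;
        }
        if (j < 0)                       // all of pat matched
        {
            res.position = i + 1;
            return res;
        }
        // delta2[j] is always at least m + 1, so the pattern moves forward.
        i += std::max(delta1[static_cast<unsigned char>(text[i])], delta2[j]);
    }
    return res;
}
```

For the strings used in main(), this sketch finds "AT-THAT" at index 22 of "WHICH-FINALLY-HALTS.--AT-THAT-POINT" after 14 character comparisons, fewer than the 35 characters of the text.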

References:

1. A Fast String Searching Algorithm

2. http://en.wikipedia.org/wiki/User:RMcPhillip/sandbox/boyer-moore
