String Matching Algorithms: the Boyer-Moore-Horspool Algorithm

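The Boyer-Moore-Horspool algorithm, also called the Horspool algorithm, was designed by Nigel Horspool in 1980 as a simplified variant of the Boyer-Moore (BM) algorithm. BM's good-suffix rule is hard to understand, and proofs of its efficiency and correctness were still unsettled at the time, so the Horspool algorithm keeps only a bad-character rule taken from BM.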


Borrowing the idiom "find a needle in a haystack", the task here is to find the needle string (the needle is the same thing as the pattern) inside the haystack string. Assume the haystack has length n and the needle has length m.


Basic principle:

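Like Boyer-Moore, the Horspool algorithm compares each window from right to left, but it changes the bad-character rule. When a mismatch character is encountered during the right-to-left comparison: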

BM shift rule: align the mismatched haystack character with the rightmost occurrence of that character in needle.

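Horspool shift rule: take the haystack character aligned with the last character of needle, wherever the mismatch occurred, and align it with the rightmost occurrence of that character in needle (not counting needle's final position).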
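To make the difference between the two rules concrete, here is a minimal sketch that computes the shift each rule would give for a single window. The function names `bm_bad_char_shift` and `horspool_shift` are made up for this illustration, and the BM version is a simplified bad-character rule only (no good-suffix rule), so treat it as an approximation of BM rather than the full algorithm.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified BM bad-character shift (illustrative only).
 * j is the needle index where the mismatch happened,
 * c is the haystack character found at that position. */
size_t bm_bad_char_shift(const char *needle, size_t j, char c)
{
    assert(needle != NULL && j < strlen(needle));
    size_t m = strlen(needle);
    size_t k = m;                       /* m means "c does not occur" */

    for (size_t i = m; i > 0; i--) {    /* rightmost occurrence of c */
        if (needle[i - 1] == c) {
            k = i - 1;
            break;
        }
    }
    if (k == m)
        return j + 1;                   /* c absent: move the window past c */
    return (k < j) ? j - k : 1;         /* never shift backwards */
}

/* Horspool shift: c is the haystack character aligned with the
 * LAST needle character, regardless of where the mismatch was. */
size_t horspool_shift(const char *needle, char c)
{
    assert(needle != NULL && needle[0] != '\0');
    size_t m = strlen(needle);

    /* Look for c in needle[0 .. m-2], scanning from the right. */
    for (size_t i = m - 1; i > 0; i--) {
        if (needle[i - 1] == c)
            return m - i;               /* distance from the last position */
    }
    return m;                           /* c absent: skip the whole needle */
}
```

For the needle "abcd" against the window "azcd", the mismatch is at index 1 on 'z': the simplified BM rule gives a shift of 2, while the Horspool rule looks at 'd' and gives 4.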


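The bad-character skip table is initialized much as in BM, except that the last character of needle is left out when the table is filled. Once the principle is clear, the code is easy to follow.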
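As a small illustration of the table initialization, the sketch below builds the skip table on its own. The helper name `build_skip_table` is an assumption for this example and does not appear in the implementation that follows, although it fills the table the same way.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Fills table[c] with the Horspool skip distance for every byte value c.
 * table must have room for UCHAR_MAX + 1 entries. */
void build_skip_table(const unsigned char *needle, size_t nlen,
                      size_t table[UCHAR_MAX + 1])
{
    assert(needle != NULL && table != NULL && nlen > 0);

    /* Bytes that never occur in needle allow a skip of the full length. */
    for (size_t c = 0; c <= UCHAR_MAX; c++)
        table[c] = nlen;

    /* Every byte except the last one: distance to the last position.
     * Later (more rightward) occurrences overwrite earlier ones. */
    for (size_t i = 0; i + 1 < nlen; i++)
        table[needle[i]] = nlen - 1 - i;
}
```

For the needle "AT-THAT" (length 7) this gives 'A' = 1, 'T' = 3, 'H' = 2, '-' = 4, and 7 for every other byte.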

The implementation is shown below:

#include <stdio.h>
#include <string.h>		//strlen
#include <limits.h>		//UCHAR_MAX
 
/* Returns a pointer to the first occurrence of "needle"
 * within "haystack", or NULL if not found. Works like
 * memmem() OR strstr().
 */
 
/* Note: In this example needle is a C string. The ending
 * 0x00 must not be counted in nlen, so you could call this example with
 * boyermoore_horspool_memmem(haystack, hlen,
 *     (const unsigned char *)"abc", sizeof("abc") - 1)
 */
const unsigned char *
boyermoore_horspool_memmem(const unsigned char* haystack, size_t hlen,
                           const unsigned char* needle,   size_t nlen)
{
    size_t scan = 0;
    size_t bad_char_skip[UCHAR_MAX + 1]; /* Officially called:
                                          * bad character shift */
 
    /* Sanity checks on the parameters */
    if (nlen == 0 || !haystack || !needle)
        return NULL;
 
    /* ---- Preprocess ---- */
    /* Initialize the table to default value */
    /* When a character is encountered that does not occur
     * in the needle, we can safely skip ahead for the whole
     * length of the needle.
     */
    for (scan = 0; scan <= UCHAR_MAX; scan = scan + 1)
        bad_char_skip[scan] = nlen;
 
    /* C arrays have the first byte at [0], therefore:
     * [nlen - 1] is the last byte of the array. */
    size_t last = nlen - 1;
 
    /* Then populate it with the analysis of the needle */
    for (scan = 0; scan < last; scan = scan + 1)
        bad_char_skip[needle[scan]] = last - scan;
 
    /* ---- Do the matching ---- */
 
    /* Search the haystack, while the needle can still be within it. */
    while (hlen >= nlen)
    {
        /* scan from the end of the needle */
        for (scan = last; haystack[scan] == needle[scan]; scan = scan - 1)
        {
            if (scan == 0) /* If the first byte matches, we've found it. */
                return haystack;
        }
 
        /* otherwise, we need to skip some bytes and start again.
           Note that the skip value is looked up from the haystack byte
           aligned with the last byte of needle, no matter where the
           mismatch happened. So if needle is "abcd" and that byte is 'd',
           the skip is 4, while for "abcdd" the skip on 'd' is only 1.
           The alternative of pretending that the mismatched character was
           the last character is slower in the normal case (e.g. finding
           "abcd" in "...azcd..." gives 4 by using 'd' but only
           4-2==2 using 'z'). */
        hlen     -= bad_char_skip[haystack[last]];
        haystack += bad_char_skip[haystack[last]];		/* main difference from BM's bad-character rule */
    }
 
    return NULL;
}

int main(void)
{
	char haystack[80] = "WHICH-FINALLY-HALTS.--AT-THAT-POINT";
	char needle[80] = "AT-THAT";
	const unsigned char* find_str = NULL;

	find_str = boyermoore_horspool_memmem((const unsigned char *)haystack, strlen(haystack), (const unsigned char *)needle, strlen(needle));
	if(NULL != find_str)
	{
		printf("Found string: %s\n", (const char *)find_str);
	}
	else
	{
		printf("Pattern string not found!\n");
	}
	return 0;
}
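To see how few windows the search actually inspects, the following sketch re-implements the same idea in compact form and counts the window alignments it tries. The function `horspool_trace` and its `alignments` out-parameter are hypothetical names introduced for this illustration. It uses `memcmp` for the window comparison instead of a right-to-left loop, which is a simplification: in Horspool the shift depends only on the haystack byte under the needle's last position, so the comparison order does not change which windows are visited.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Returns the offset of the first match, or (size_t)-1 if there is none.
 * If alignments is non-NULL it receives the number of windows tried. */
size_t horspool_trace(const char *hay, const char *needle, size_t *alignments)
{
    assert(hay != NULL && needle != NULL);
    size_t n = strlen(hay);
    size_t m = strlen(needle);
    size_t skip[UCHAR_MAX + 1];
    size_t tried = 0;
    size_t found = (size_t)-1;

    if (m > 0) {
        /* Same table as in the implementation above. */
        for (size_t c = 0; c <= UCHAR_MAX; c++)
            skip[c] = m;
        for (size_t i = 0; i + 1 < m; i++)
            skip[(unsigned char)needle[i]] = m - 1 - i;

        /* Slide the window; the shift comes from the haystack byte
         * aligned with the needle's last position. */
        size_t pos = 0;
        while (pos + m <= n) {
            tried++;
            if (memcmp(hay + pos, needle, m) == 0) {
                found = pos;
                break;
            }
            pos += skip[(unsigned char)hay[pos + m - 1]];
        }
    }

    if (alignments != NULL)
        *alignments = tried;
    return found;
}
```

Called with the haystack "WHICH-FINALLY-HALTS.--AT-THAT-POINT" and the needle "AT-THAT" from main above, this sketch finds the match at offset 22 after trying 6 windows (at offsets 0, 7, 11, 14, 18 and 22), compared with the 23 windows a naive left-to-right scan would try.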





Reposted from: https://my.oschina.net/amince/blog/180497
