A Deep Dive into the BM Algorithm

Author: 坏小哥

Category: Algorithms. Tags: php

Article link: https://blog.csdn.net/weixin_43885417/article/details/112589547

I previously shared the KMP algorithm with you. This time I want to share one that is usually faster in practice: the BM algorithm.

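I. Introduction:
The Boyer-Moore string search algorithm is a highly efficient string search algorithm. It was designed by Bob Boyer and J Strother Moore in 1977. The algorithm preprocesses only the search target (the pattern, or keyword), not the string being searched.
Below we use Professor Moore's own example to explain the algorithm: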

Main string: HERE IS A SIMPLE EXAMPLE
Pattern: EXAMPLE

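II. Basic principles

Main string: mainStr
Pattern: patternStr
Index in the main string where the pattern is currently aligned: matchMainIndex
Index in the pattern currently being compared: matchPatternIndex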

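1. Bad character:

[Figure: the pattern EXAMPLE aligned at index 0 of the main string, under "HERE IS"]

Initially matchMainIndex = 0. Within each alignment, matchPatternIndex starts at the last index of the pattern, because the comparison runs from the tail of the pattern toward its head.
Comparing from the tail, if the last characters do not match, a single comparison is enough to know that the first 7 characters of the main string cannot be the result we are looking for. Here "S" does not match "E", so "S" is called the "bad character", that is, the character of the main string at which the mismatch occurs. "S" also does not occur anywhere in the pattern "EXAMPLE". Think about it: however far we slide the pattern to the right (by less than the pattern length), the mismatched "S" can never equal any character of the pattern. So we can move the pattern past it in one step: matchMainIndex = matchMainIndex + len(patternStr).
This gives the next alignment:
[Figure: the pattern EXAMPLE aligned at index 7 of the main string, under " A SIMP"]
Again comparing from the tail, "P" does not match "E", so "P" is the bad character. This time, however, "P" does occur in the pattern "EXAMPLE". So the pattern moves right by two positions, which aligns the "P" in the main string with the "P" in the pattern.
From these two cases we can summarize:
The bmBc array stores, for each character of the pattern, the distance from its rightmost occurrence to the end of the pattern, which is its minimum distance to the end.
Bad-character shift = bmBc[c] - (m - 1 - i)
Here c is the bad character, m is the pattern length, and i is the index in the pattern at which the mismatch occurred. For "EXAMPLE", bmBc['P'] = 2, so the mismatch at i = 6 gives 2 - (7 - 1 - 6) = 2.
If the bad character does not occur in the pattern, the shift is i + 1.
What if the rightmost occurrence of the bad character in the pattern lies to the right of the mismatch position? Then bmBc[c] - (m - 1 - i) is negative, which would move the pattern backwards along the main string. The bad-character rule alone is therefore not enough, and this is where the good-suffix rule comes in.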
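The bad-character rule above can be sketched in a few lines. This is a minimal illustration rather than the listing from section III: it assumes single-byte (ASCII) strings, and the names buildBadCharTable and badCharShift are made up for this example.

```php
<?php
// Sketch only: ASCII strings are assumed and both helper names are hypothetical.

/**
 * Distance from the rightmost occurrence of each pattern character
 * to the end of the pattern.
 */
function buildBadCharTable(string $pattern): array
{
    $m = strlen($pattern);
    $table = [];
    for ($i = 0; $i < $m; $i++) {
        // A later occurrence overwrites an earlier one, so the smallest distance wins.
        $table[$pattern[$i]] = $m - 1 - $i;
    }
    return $table;
}

/**
 * Bad-character shift for a mismatch at pattern index $i (pattern length $m).
 */
function badCharShift(array $table, string $badChar, int $i, int $m): int
{
    if (!isset($table[$badChar])) {
        return $i + 1; // the bad character does not occur in the pattern
    }
    return $table[$badChar] - ($m - 1 - $i);
}

$table = buildBadCharTable('EXAMPLE');
echo badCharShift($table, 'S', 6, 7), "\n"; // 7: "S" is not in the pattern
echo badCharShift($table, 'P', 6, 7), "\n"; // 2: align the two "P" characters
echo badCharShift($table, 'E', 2, 7), "\n"; // -4: the rightmost "E" is right of the mismatch
```

The last call shows the negative shift discussed above, which is why the search takes the larger of the bad-character and good-suffix shifts.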

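2. Good-suffix rule

Continuing the match above, we get this alignment:
[Figure: the pattern EXAMPLE aligned at index 9 of the main string, under " SIMPLE"]
Here MPLE is the good suffix, that is, the run of characters at the tail that matched. Note that "MPLE", "PLE", "LE" and "E" are all good suffixes.
"I" does not match "A", so "I" is the bad character.
By the bad-character rule the pattern should move right by 2 + 1 = 3 positions.
The good-suffix rule distinguishes three cases:
(1) Some other substring of the pattern matches the good suffix. Move the pattern so that this substring lines up with the good suffix. If more than one substring matches, align the rightmost one.
For example, suppose the pattern is MPLEEXAMPLE: the MPLE at the head is moved under the matched MPLE.
(2) No other substring of the pattern matches the longest good suffix. Then a shorter good suffix can only be reused if its previous occurrence is at the head of the pattern. (This takes a moment of thought. Take the matched MPLE above, and suppose the pattern has no second MPLE but does contain PLE somewhere in the middle, not at the head. The character before that P cannot be M, because otherwise the full MPLE would occur, contradicting the assumption. After aligning that PLE, the character before P would be compared with the M in the main string and fail.) So we look for the longest prefix of the pattern that equals a suffix of the good suffix, and align that prefix.
For example, if the good suffix is MPLE and the pattern contains no other MPLE, but PLE appears at the head of the pattern, then PLE is that longest prefix.
(3) No substring of the pattern matches the good suffix, and no prefix of the pattern equals any suffix of the good suffix. In this case the shift is the pattern length.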

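Summary:
Shift = position of the good suffix - position of its previous occurrence in the pattern
For example, if the second "AB" of the string "ABCDAB" is the good suffix, its position is 5 (counting from 0 and taking the index of the final "B"), and its previous occurrence in the pattern is at 1 (the index of the first "B"). The shift is 5 - 1 = 4, which moves the first "AB" to where the second "AB" was.
The implementation uses a suffixes array that stores, for each index of the pattern, the length of the longest substring ending at that index that is also a suffix of the pattern.
The bmGs array stores, for each mismatch index, the shift given by the good-suffix rule.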
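To make the three cases concrete, here is a brute-force sketch that tries every shift in turn and returns the first one consistent with the matched suffix. It is not how the suffixes and bmGs tables are built in the listing. It assumes ASCII strings, goodSuffixShift is a hypothetical helper name, and the patterns PLEXAMPLE and ABCD are made-up inputs for cases 2 and 3.

```php
<?php
// Brute-force sketch: ASCII strings assumed, goodSuffixShift is a hypothetical helper.

/**
 * Good-suffix shift for a mismatch at pattern index $i,
 * i.e. when $pattern[$i + 1 .. m - 1] has already matched.
 */
function goodSuffixShift(string $pattern, int $i): int
{
    $m = strlen($pattern);
    for ($shift = 1; $shift < $m; $shift++) {
        $ok = true;
        // Each matched character must line up with an equal pattern character,
        // unless the shift pushes that position off the left edge of the pattern.
        for ($j = $i + 1; $j < $m && $ok; $j++) {
            if ($j - $shift >= 0 && $pattern[$j - $shift] !== $pattern[$j]) {
                $ok = false;
            }
        }
        // The character landing on the mismatch position must differ,
        // otherwise the same mismatch would simply repeat.
        if ($ok && $i - $shift >= 0 && $pattern[$i - $shift] === $pattern[$i]) {
            $ok = false;
        }
        if ($ok) {
            return $shift;
        }
    }
    return $m; // case 3: nothing can be reused, skip the whole pattern
}

echo goodSuffixShift('MPLEEXAMPLE', 6), "\n"; // 7: case 1, another MPLE exists
echo goodSuffixShift('PLEXAMPLE', 4), "\n";   // 6: case 2, prefix PLE is reused
echo goodSuffixShift('ABCD', 2), "\n";        // 4: case 3, shift by the pattern length
echo goodSuffixShift('ABCDAB', 3), "\n";      // 4: the 5 - 1 = 4 example above
```

The brute force costs roughly O(m^2) per query, which is why a real implementation precomputes the shifts once per pattern.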

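Continuing the match above:
The good suffix is MPLE, but only E appears at the head of the pattern, so the good-suffix shift is 6 - 0 = 6 (6 is the index in the pattern of the last character of the good suffix, and 0 is the index of the matching prefix character E at the head of the pattern). The bad-character rule gave only 3, so the larger value, 6, is used.
[Figure: the pattern EXAMPLE aligned at index 15 of the main string, under "E EXAMP"]
Comparing from the tail again, "P" does not match "E", so "P" is the bad character. By the bad-character rule, the pattern moves right by 6 - 4 = 2 positions.

The pattern is now aligned at index 17 of the main string, and comparing from the tail matches every character: the search succeeds.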
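The whole walkthrough can be replayed with a short trace. This is a sketch under stated assumptions, not the class from section III: strings are treated as ASCII, traceBoyerMoore is a hypothetical helper, and the good-suffix shift is brute-forced instead of being read from a precomputed bmGs table.

```php
<?php
// Trace sketch: ASCII strings assumed, traceBoyerMoore is a hypothetical helper.

function traceBoyerMoore(string $text, string $pattern): array
{
    $n = strlen($text);
    $m = strlen($pattern);
    $alignments = [];
    $matches = [];
    if ($m === 0) {
        return ['alignments' => $alignments, 'matches' => $matches];
    }

    // Bad-character table: distance from the rightmost occurrence to the pattern end.
    $bmBc = [];
    for ($i = 0; $i < $m; $i++) {
        $bmBc[$pattern[$i]] = $m - 1 - $i;
    }

    // Good-suffix shift for a mismatch at index $i, found by trying every shift.
    $goodShift = function (int $i) use ($pattern, $m): int {
        for ($s = 1; $s < $m; $s++) {
            $ok = !($i - $s >= 0 && $pattern[$i - $s] === $pattern[$i]);
            for ($j = $i + 1; $ok && $j < $m; $j++) {
                $ok = $j - $s < 0 || $pattern[$j - $s] === $pattern[$j];
            }
            if ($ok) {
                return $s;
            }
        }
        return $m;
    };

    $pos = 0;
    while ($pos <= $n - $m) {
        $alignments[] = $pos;
        // Compare from the tail of the pattern toward its head.
        $i = $m - 1;
        while ($i >= 0 && $pattern[$i] === $text[$pos + $i]) {
            $i--;
        }
        if ($i < 0) {
            $matches[] = $pos;
            $pos += $goodShift(0); // plays the role of bmGs[0] after a full match
        } else {
            $bad = $text[$pos + $i];
            $badShift = isset($bmBc[$bad]) ? $bmBc[$bad] - ($m - 1 - $i) : $i + 1;
            $pos += max($goodShift($i), $badShift);
        }
    }
    return ['alignments' => $alignments, 'matches' => $matches];
}

$trace = traceBoyerMoore('HERE IS A SIMPLE EXAMPLE', 'EXAMPLE');
echo implode(' -> ', $trace['alignments']), "\n"; // 0 -> 7 -> 9 -> 15 -> 17
echo implode(', ', $trace['matches']), "\n";      // 17
```

The printed alignments are the shifts 7, 2, 6 and 2 from the walkthrough, ending with the match at index 17.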

III. PHP implementation

<?php
/**
 * Class BM Algorithm
 */
class BM
{
    public function execute($mainStr, $patternStr)
    {
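        // Bad-character table: distance from each character's rightmost occurrence to the pattern end
        $bmBc = array();
        // Good-suffix table: shift to apply for a mismatch at each pattern index
        $bmGs = array();
        // Length of the main string in UTF-8 characters
        $mainStrLen = mb_strlen($mainStr, 'UTF-8');
        // Length of the pattern in UTF-8 characters
        $patternStrLen = mb_strlen($patternStr, 'UTF-8');
        // Index in the main string where the pattern is currently aligned
        $matchMainIndex = 0;
        // Index in the pattern currently being compared
        $matchPatternIndex = 0;
        // Whether at least one match was found
        $isMatch = false;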
        $mainStrChar = '';

        self::preBmBc($patternStr, $patternStrLen, $bmBc);

        self::preBmGs($patternStr, $patternStrLen, $bmGs);

        // Keep matching while enough of the main string remains; the length check also guards against an empty pattern
        while ($patternStrLen > 0 && $matchMainIndex <= $mainStrLen - $patternStrLen) {
            $matchPatternIndex = $patternStrLen - 1;
            while ($matchPatternIndex >= 0) {
                $patternStrChar = mb_substr($patternStr, $matchPatternIndex, 1, 'UTF-8');

                // Index in the main string that corresponds to the current pattern index
                $tempMatchPatternIndex = $matchPatternIndex + $matchMainIndex;
                // Character of the main string being compared
                $mainStrChar = mb_substr($mainStr, $tempMatchPatternIndex, 1, 'UTF-8');
                // Mismatch: stop comparing and work out how far to shift
                if ($patternStrChar !== $mainStrChar) {
                    break;
                }
                else {
                    $matchPatternIndex--;
                }
            }

            // The whole pattern matched
            if ($matchPatternIndex < 0) {
                echo  'Match found at main string index: ' . "$matchMainIndex\n";
                $isMatch = true;
                $matchMainIndex += $bmGs[0];
            }
            else {
                $tempBmBcIndex = isset($bmBc[$mainStrChar]) ? $bmBc[$mainStrChar] - $patternStrLen + 1 + $matchPatternIndex : $matchPatternIndex + 1;
                $matchMainIndex += max($bmGs[$matchPatternIndex], $tempBmBcIndex);
            }
        }
        if (!$isMatch) {
            echo "No match found\n";
        }
        return $isMatch;
    }

    /**
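     * @param $patternStr string the pattern
     * @param $patternStrLen int pattern length in UTF-8 characters
     * @param $bmBc array filled with each character's distance from its rightmost occurrence to the pattern end
     */
    public static function preBmBc($patternStr, $patternStrLen, &$bmBc)
    {
        for ($index = 0; $index < $patternStrLen; $index++)
        {
            // Current character of the pattern
            $patternStrChar = mb_substr($patternStr, $index, 1, 'UTF-8');
            // Record the distance to the end; a later occurrence overwrites an earlier one, so only the rightmost occurrence is kept
            $bmBc[$patternStrChar] = $patternStrLen - 1 - $index;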
        }
    }

    /**
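     * @param $patternStr string the pattern
     * @param $patternStrLen int pattern length
     * @param $suffixes array for each index, the length of the longest substring ending there that is also a suffix of the pattern
     */
    public static function suffixes($patternStr, $patternStrLen, &$suffixes)
    {
        // The whole pattern is a suffix of itself, so the entry for the last index is the pattern length
        $suffixes[$patternStrLen - 1] = $patternStrLen;
        // Largest valid index
        $indexLen = $patternStrLen - 1;
        for ($index = $patternStrLen - 2; $index >= 0; $index--) {
            // Start comparing at the current index
            $matchPreLen = $index;

            while ($matchPreLen >= 0) {
                // Character at the current comparison index
                $currentChar = mb_substr($patternStr, $matchPreLen, 1, 'UTF-8');

                // Index at the same offset, measured from the end of the pattern
                $patternStrIndex = $indexLen - $index + $matchPreLen;
                // Character from the tail of the pattern to compare against
                $patternStrChar = mb_substr($patternStr, $patternStrIndex, 1, 'UTF-8');

                // Stop at the first difference
                if ($currentChar != $patternStrChar) {
                    break;
                }
                // Move the comparison index one position to the left
                $matchPreLen--;
            }
            // Number of characters, counted leftwards from $index, that match the tail of the pattern
            $suffixes[$index] = $index - $matchPreLen;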

        }
    }

    /**
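     * @param $patternStr string the pattern
     * @param $patternStrLen int pattern length
     * @param $bmGs array filled with the good-suffix shift for a mismatch at each pattern index
     */
    public static function preBmGs($patternStr, $patternStrLen, &$bmGs)
    {
        // For each index, the length of the longest substring ending there that is also a suffix of the pattern
        $suffixes = array();
        self::suffixes($patternStr, $patternStrLen, $suffixes);

        // Case 3: default every shift to the full pattern length
        for ($index = 0; $index < $patternStrLen; $index++) {
            $bmGs[$index] = $patternStrLen;
        }
        // Next position whose case 2 shift has not been recorded yet
        $preCurrentGoodStrIndex = 0;
        // Case 2: iterate from back to front so that each position receives the smallest valid shift
        for ($index = $patternStrLen - 1; $index >= 0; $index--) {
            // The substring from the start of the pattern up to $index equals a suffix of the pattern, so it is a usable prefix
            if ($suffixes[$index] == $index + 1) {
                // Record the shift for every mismatch position to the left of where this prefix would land
                for (; $preCurrentGoodStrIndex < $patternStrLen - 1 - $index; $preCurrentGoodStrIndex++) {
                    // Only positions that still hold the default value are written
                    if ($bmGs[$preCurrentGoodStrIndex] == $patternStrLen) {
                        $bmGs[$preCurrentGoodStrIndex] = $patternStrLen - 1 - $index;
                    }
                }
            }
        }
        // Case 1: the good suffix occurs elsewhere in the pattern; a later index overwrites with a smaller shift
        for ($index = 0; $index < $patternStrLen - 1; $index++) {
            $bmGs[$patternStrLen - 1 - $suffixes[$index]] = $patternStrLen - 1 - $index;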
        }
    }


}
$startTime = microtime(true);
$objBM = new BM();
$objBM->execute('HERE IS A SIMPLE EXAMPLE', 'EXAMPLE');
$endTime = microtime(true);
echo 'Script execution time: ' . ($endTime - $startTime) . 's';

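Notes on the code that builds bmGs (line numbers count from the opening <?php line of the listing):
Lines 130-132, the loop that initializes every bmGs entry to the pattern length, correspond to case 3: no prefix of the pattern matches any suffix of the good suffix.
Lines 134-147, the back-to-front loop, correspond to case 2:
Q: Why does the for loop run from back to front, starting at index = patternStrLen - 1?
A: If positions index1 and index2 (index1 > index2) both satisfy case 2, then m-1-index1 < m-1-index2. The check bmGs[preCurrentGoodStrIndex] == patternStrLen on line 142 guarantees that each entry is modified at most once, so visiting the larger index first means each entry keeps the smaller shift m-1-index1. That is why the computation runs from back to front.
Q: What does line 138, the test suffixes[index] == index + 1, mean?
A: It means a suitable position has been found. By the definition of suffixes we know that x[index+1-suffixes[index]…index] == x[m-suffixes[index]…m-1]. When suffixes[index] == index+1, the left-hand side is x[0…index], that is, a prefix of the pattern (the reused part of the good suffix must occur at the head of the pattern), which is exactly case 2.
Q: What do lines 140-145 do?
A: They perform the assignment for case 2. Line 142 guarantees that each position is modified at most once.
Lines 149-151, the final loop, correspond to case 1:
Why does this loop run from front to back, with the index increasing? If suffixes[index1] == suffixes[index2] with index1 < index2, then m-1-index1 > m-1-index2, and bmGs[m - 1 - suffixes[index1]] should take the latter, smaller value (the rightmost occurrence). Iterating forward lets the later index overwrite the earlier one.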
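To check the notes against concrete numbers, the following sketch rebuilds both tables for the two patterns used in this article. It follows the same three loops as suffixes() and preBmGs(), but it assumes non-empty ASCII strings, and the function names buildSuffixes and buildGoodSuffixTable are made up for this example.

```php
<?php
// Sketch mirroring the three loops discussed above; non-empty ASCII strings assumed,
// and both function names are hypothetical.

function buildSuffixes(string $p): array
{
    $m = strlen($p);
    $suff = array_fill(0, $m, 0);
    $suff[$m - 1] = $m;
    for ($i = $m - 2; $i >= 0; $i--) {
        $q = $i;
        // Extend leftwards while p[q] equals the character at the same offset from the end.
        while ($q >= 0 && $p[$q] === $p[$m - 1 - $i + $q]) {
            $q--;
        }
        $suff[$i] = $i - $q;
    }
    return $suff;
}

function buildGoodSuffixTable(string $p): array
{
    $m = strlen($p);
    $suff = buildSuffixes($p);
    // Case 3: default shift is the whole pattern length.
    $bmGs = array_fill(0, $m, $m);
    // Case 2: a prefix of the pattern equals a suffix of the good suffix.
    $j = 0;
    for ($i = $m - 1; $i >= 0; $i--) {
        if ($suff[$i] === $i + 1) {
            for (; $j < $m - 1 - $i; $j++) {
                if ($bmGs[$j] === $m) {
                    $bmGs[$j] = $m - 1 - $i;
                }
            }
        }
    }
    // Case 1: the good suffix occurs again inside the pattern.
    for ($i = 0; $i <= $m - 2; $i++) {
        $bmGs[$m - 1 - $suff[$i]] = $m - 1 - $i;
    }
    return $bmGs;
}

echo implode(',', buildSuffixes('EXAMPLE')), "\n";        // 1,0,0,0,0,0,7
echo implode(',', buildGoodSuffixTable('EXAMPLE')), "\n"; // 6,6,6,6,6,6,1
echo implode(',', buildGoodSuffixTable('ABCDAB')), "\n";  // 4,4,4,4,6,1
```

For EXAMPLE, the entry at index 2 is 6, the shift used in the walkthrough when "I" met "A". For ABCDAB, the entry at index 3 is 4, matching the 5 - 1 = 4 example.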
