SWUST OJ 572: Boyer–Moore–Horspool algorithm

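The Boyer–Moore algorithm is an efficient string-search algorithm proposed by Bob Boyer and J Strother Moore in 1977. It gains its efficiency by preprocessing the target string (the pattern) rather than the text being searched, which lets the search skip over positions. The algorithm combines the bad-character rule and the good-suffix rule and can achieve linear time in the worst case. This post covers how the algorithm works, how its tables are computed, and its performance, and provides a C++ code example.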


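#include <iostream>
#include <cstring>
using namespace std;

// Bad-character shift table, indexed by unsigned char value.
int table[256];

// Returns the 0-based index of the first occurrence of str in text, or -1.
int horspool(const char str[], const char text[])
{
    int str_len = strlen(str);
    int text_len = strlen(text);
    // Characters that do not occur in the pattern shift by its full length.
    for (int c = 0; c < 256; c++)
    {
        table[c] = str_len;
    }
    // Every character except the last shifts by its distance to the pattern's end.
    for (int i = 0; i < str_len - 1; i++)
    {
        table[(unsigned char)str[i]] = str_len - 1 - i;
    }
    int p = str_len - 1; // text index aligned with the last pattern character
    while (p < text_len)
    {
        int k = 0;
        // Compare right to left; test the bound before touching the arrays.
        while (k < str_len && str[str_len - 1 - k] == text[p - k])
        {
            k++;
        }
        if (k == str_len)
        {
            return p - str_len + 1;
        }
        // Mismatch: shift by the entry for the text character under the pattern's end.
        p += table[(unsigned char)text[p]];
    }
    return -1;
}
int main()
{
    // static: the buffers are too large for the stack; +1 leaves room for '\0'.
    static char str[102000 + 1], text[700000 + 1];
    cin >> str >> text;
    cout << horspool(str, text) << endl;
    return 0;
}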

Problem description

Problem text taken from: https://en.wikipedia.org/w/index.php?title=Boyer%E2%80%93Moore_string-search_algorithm&oldid=280422137

The Boyer–Moore string search algorithm is a particularly efficient string searching algorithm. It was developed by Bob Boyer and J Strother Moore in 1977. The algorithm preprocesses the target string (key) that is being searched for, but not the string being searched (unlike some algorithms which preprocess the string to be searched, and can then amortize the expense of the preprocessing by searching repeatedly). The execution time of the Boyer-Moore algorithm can be sub-linear: it doesn't need to check every character of the string to be searched but rather skips over some of them. Generally the algorithm gets faster as the key being searched for becomes longer. Its efficiency derives from the fact that, with each unsuccessful attempt to find a match between the search string and the text, it uses the information gained from that attempt to rule out as many positions of the text as possible where the string could not match.

How the algorithm works

What people frequently find surprising about the Boyer-Moore algorithm when they first encounter it is that its verifications (its attempts to check whether a match exists at a particular position) work backwards. If it starts a search at the beginning of a text for the word "ANPANMAN", for instance, it checks the eighth position of the text to see if it contains an "N". If it finds the "N", it moves to the seventh position to see if that contains the last "A" of the word, and so on until it checks the first position of the text for an "A".

Why Boyer-Moore takes this backward approach is clearer when we consider what happens if the verification fails, for instance if instead of an "N" in the eighth position we find an "X". The "X" doesn't appear anywhere in "ANPANMAN", so there is no match for the search string at the very start of the text, or at the next seven positions following it, since those would all fall across the "X" as well. After checking just one character, we're able to skip ahead and start looking for a match starting at the ninth position of the text, just after the "X".

This explains why the best-case performance of the algorithm, for a text of length N and a fixed pattern of length M, is N/M comparisons: in the best case, only one in M characters needs to be checked. This also explains the somewhat counter-intuitive result that the longer the pattern we are looking for, the faster the algorithm will usually be able to find it.

The algorithm precomputes two tables to process the information it obtains in each failed verification: one table calculates how many positions ahead to start the next search based on the identity of the character that caused the match attempt to fail; the other makes a similar calculation based on how many characters were matched successfully before the match attempt failed.
(Because these two tables return results indicating how far ahead in the text to "jump", they are sometimes called "jump tables", which should not be confused with the more common meaning of jump tables in computer science.)
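
The right-to-left verification described above can be sketched as follows. This is only a minimal illustration of a single verification attempt, not the full algorithm; the names `Attempt` and `verify_backward` are hypothetical and do not come from the problem text.

```cpp
#include <cstddef>
#include <string>

// Outcome of one right-to-left verification attempt (hypothetical helper type).
struct Attempt {
    bool matched;          // true if the whole pattern matched at `start`
    std::size_t compared;  // number of character comparisons performed
};

// Checks whether `pattern` occurs in `text` at 0-based offset `start`,
// comparing from the last pattern character back to the first.
Attempt verify_backward(const std::string& text, const std::string& pattern,
                        std::size_t start) {
    Attempt result{false, 0};
    if (start + pattern.size() > text.size()) {
        return result;  // the pattern would run past the end of the text
    }
    for (std::size_t k = pattern.size(); k-- > 0;) {
        ++result.compared;
        if (text[start + k] != pattern[k]) {
            return result;  // stop at the first mismatch
        }
    }
    result.matched = true;
    return result;
}
```

With the text "ANPANMAXZZZZZZZZ" and the pattern "ANPANMAN" at offset 0, the attempt fails after a single comparison (the "X" in the eighth position), which is the situation in which the search can skip eight positions at once.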

The first table

The first table is easy to calculate: start at the last character of the sought string and move towards the first character. Each time you move left, if the character you are on is not in the table already, add it; its shift value is its distance from the rightmost character. All other characters receive a shift equal to the length of the search string.

Example: for the string ANPANMAN, the first table is as follows (for clarity, entries are shown in the order they would be added to the table):

Character              Shift
A                      1
M                      2
N                      3
P                      5
all other characters   8

(The final N would have a shift of zero, so the entry for N is based on the second N from the right: only the first m-1 characters of the string are considered.)

The amount of shift calculated by the first table is sometimes called the "bad character shift"[1].
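
A minimal sketch of how the first table might be computed, assuming single-byte characters and a 256-entry table. The function name `bad_character_table` is hypothetical. The loop runs left to right and lets later occurrences overwrite earlier ones, which gives the same result as walking right to left and adding only characters not yet in the table.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Builds the bad-character shift table for `pattern` (hypothetical helper).
// Assumption: characters are single bytes, so 256 entries are enough.
std::array<std::size_t, 256> bad_character_table(const std::string& pattern) {
    std::array<std::size_t, 256> shift;
    // Characters that never occur in the pattern shift by its full length.
    shift.fill(pattern.size());
    // Skip the last character: its distance from the right end would be zero.
    for (std::size_t i = 0; i + 1 < pattern.size(); ++i) {
        shift[static_cast<unsigned char>(pattern[i])] = pattern.size() - 1 - i;
    }
    return shift;
}
```

For "ANPANMAN" this gives 1 for A, 2 for M, 3 for N, 5 for P, and 8 for every other character.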

The second table

The second table is slightly more difficult to calculate: for each value of i less than the length of the search string, we must first calculate the pattern consisting of the last i characters of the search string, preceded by a mismatch for the character before it; then we initially line it up with the search pattern and determine the least number of characters the partial pattern must be shifted left before the two patterns match (positions that slide off the left end of the search pattern match anything). For instance, for the search string ANPANMAN, the table would be as follows (¬N signifies any character that is not N):

i   Pattern     Left shift
0   ¬N          1
1   ¬AN         8
2   ¬MAN        3
3   ¬NMAN       6
4   ¬ANMAN      6
5   ¬PANMAN     6
6   ¬NPANMAN    6
7   ¬ANPANMAN   6

The amount of shift calculated by the second table is sometimes called the "good suffix shift"[2] or "(strong) good suffix rule". The original published Boyer-Moore algorithm [1] uses a simpler, weaker version of the good suffix rule in which each entry in the above table did not require a mismatch for the left-most character. This is sometimes called the "weak good suffix rule" and is not sufficient for proving that Boyer-Moore runs in linear worst-case time.
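
The definition of the second table can be turned into code almost literally. The sketch below is a brute-force version that simply tries every shift, written for clarity rather than speed; practical implementations compute the same table in linear time. The name `good_suffix_table` is hypothetical, and the sketch assumes that positions shifted past the left end of the pattern are treated as matching anything.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Brute-force strong good-suffix table (hypothetical helper).
// shifts[i] is the smallest left shift that is consistent with "the last i
// characters matched and the character before them did not".
std::vector<std::size_t> good_suffix_table(const std::string& pattern) {
    const std::ptrdiff_t m = static_cast<std::ptrdiff_t>(pattern.size());
    std::vector<std::size_t> shifts(pattern.size());
    for (std::ptrdiff_t i = 0; i < m; ++i) {
        const std::ptrdiff_t mismatch = m - 1 - i;  // index of the mismatched character
        for (std::ptrdiff_t s = 1; s <= m; ++s) {
            bool ok = true;
            // Strong rule: the character landing on the mismatch position must
            // differ, unless that position has slid off the left end.
            if (mismatch - s >= 0 && pattern[mismatch - s] == pattern[mismatch]) {
                ok = false;
            }
            // Every matched suffix character must agree with what it lands on.
            for (std::ptrdiff_t j = mismatch + 1; ok && j < m; ++j) {
                if (j - s >= 0 && pattern[j - s] != pattern[j]) {
                    ok = false;
                }
            }
            if (ok) {
                shifts[i] = static_cast<std::size_t>(s);
                break;  // s == m always succeeds, so every entry gets set
            }
        }
    }
    return shifts;
}
```

For "ANPANMAN" the function returns the shifts 1, 8, 3, 6, 6, 6, 6, 6 for i = 0 through 7.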

Performance of the Boyer-Moore string search algorithm

Finding all occurrences in a text of length N needs approximately 3*N comparisons in the worst case, hence the complexity is O(N), regardless of whether the text contains a match. The proof is due to Richard Cole; see R. Cole, "Tight bounds on the complexity of the Boyer-Moore algorithm", Proceedings of the 2nd Annual ACM-SIAM Symposium on Discrete Algorithms (1991) for details. This bound took some years to establish: in 1977, the year the algorithm was devised, the maximum number of comparisons was shown to be no more than 6*N; in 1980 it was shown to be no more than 4*N; Cole's result followed in 1991.

References

  1. Hume and Sunday (1991). "Fast string searching". Software: Practice and Experience 21(11): 1221–1248.
  2. R. S. Boyer and J S. Moore (1977). "A fast string searching algorithm". Comm. ACM. 20: 762–772. doi:10.1145/359842.359859.

Input

Two lines, containing only the characters "ACGT". The first line is the string to search for (length <= 102000); the second line is the text (length <= 700000).

Output

The position of the string in the text, or -1 if it does not occur.

Sample input

GGCCTCATATCTCTCT
CCCATTGGCCTCATATCTCTCTCCCTCCCTCCCCTGCCCAGGCTGCTTGGCATGG

Sample output

6
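
The sample answer suggests that positions are 0-based. As a quick cross-check for a hand-written search, the expected value can be reproduced with `std::string::find` from the standard library. This is not the Horspool algorithm, only a reference to compare against; the wrapper name `reference_find` is hypothetical.

```cpp
#include <string>

// Returns the 0-based index of the first occurrence of `pattern` in `text`,
// or -1 if there is none. Uses the standard library search, not Horspool.
long reference_find(const std::string& text, const std::string& pattern) {
    const std::string::size_type pos = text.find(pattern);
    return pos == std::string::npos ? -1L : static_cast<long>(pos);
}
```

For the sample input above, `reference_find` returns 6, because the pattern starts right after the six-character prefix "CCCATT" of the text.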
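### Answer 1:
The Boyer-Moore-Horspool algorithm is a string-matching algorithm that quickly finds a pattern in a text. Its core idea is to use information from the pattern to skip as many unnecessary comparisons as possible. Concretely, the algorithm first preprocesses the pattern to build a shift table. It then aligns the pattern with the start of the text, compares from the pattern's last character backwards, and after each attempt shifts the pattern to the right by as many characters as the table allows, until it finds a match or runs past the end of the text. If a match is found, it returns the index of the match in the text; otherwise it returns -1. With n and m the lengths of the text and the pattern, the search is typically linear or better in practice, but takes O(nm) time in the worst case.

### Answer 2:
The Boyer-Moore-Horspool algorithm is a fast string-matching algorithm. The underlying Boyer-Moore algorithm was proposed by Robert S. Boyer and J Strother Moore in 1977 and was later simplified by Nigel Horspool. It is widely used in practice, for example in the find-and-replace functions of text editors.

Its advantage is that it uses information from the pattern to skip quickly over characters that cannot match. Each alignment is compared starting from the right end of the pattern; on a mismatch, the shift is determined by where the relevant text character occurs in the pattern. This lets each attempt skip several characters and improves matching efficiency.

Specifically, the algorithm first builds a bad-character table that records, for each character, its rightmost occurrence in the pattern (excluding the pattern's last position). On a mismatch, the shift is looked up in this table. If the character does not occur in the pattern, the pattern can be shifted by its full length, because no alignment overlapping that character can match.

In practice, Boyer-Moore-Horspool is often faster than other string-matching algorithms such as brute force and KMP. However, it does not handle patterns with wildcards or regular expressions, so it may not be suitable in some situations.

### Answer 3:
The Boyer-Moore-Horspool algorithm is a string-matching algorithm for finding the position of a substring in a main string. It is a simplified version of the Boyer-Moore algorithm, proposed by Nigel Horspool in 1980.

Each alignment is compared starting from the end of the pattern. Unlike full Boyer-Moore, which combines the bad-character rule and the good-suffix rule, Horspool's version uses only a bad-character table: on a mismatch, the shift is looked up from the text character aligned with the last character of the pattern, using the rightmost occurrence of that character in the rest of the pattern. The good-suffix rule, which shifts according to the part of the pattern that has already matched, is dropped.

The table is computed once in a preprocessing phase. The search is O(nm) in the worst case, where n is the length of the main string and m the length of the substring, but it is usually much faster in practice.

Boyer-Moore-Horspool performs well in practice. It tends to be fastest with long patterns and large alphabets, where shifts are long; with a small alphabet such as "ACGT", shifts are shorter. It is used in text editors, search tools, and similar applications.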