Boyer-Moore algorithm

Idea

http://www.iti.fh-flensburg.de/lang/algorithmen/pattern/bmen.htm

The algorithm of Boyer and Moore [BM 77] compares the pattern with the text from right to left. If the text symbol that is compared with the rightmost pattern symbol does not occur in the pattern at all, then the pattern can be shifted by m positions behind this text symbol. The following example illustrates this situation.

Example:  

0 1 2 3 4 5 6 7 8 9 ...
abbadabacba
babac      
     babac 

The first comparison d-c at position 4 produces a mismatch. The text symbol d does not occur in the pattern. Therefore, the pattern cannot match at any of the positions 0, ..., 4, since all corresponding windows contain a d. The pattern can be shifted to position 5.

The best case for the Boyer-Moore algorithm is attained if at each attempt the first compared text symbol does not occur in the pattern. Then the algorithm requires only O(n/m) comparisons.

 

Bad character heuristics

This method is called bad character heuristics. It can also be applied if the bad character, i.e. the text symbol that causes a mismatch, occurs somewhere else in the pattern. Then the pattern can be shifted so that it is aligned to this text symbol. The next example illustrates this situation.

Example:  

0 1 2 3 4 5 6 7 8 9 ...
abbababacba
babac      
  babac    

Comparison b-c causes a mismatch. Text symbol b occurs in the pattern at positions 0 and 2. The pattern can be shifted so that the rightmost b in the pattern is aligned to text symbol b.
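
The two examples so far can be checked with a few lines of Java. This is only a minimal sketch: the class name BadCharShiftDemo and the method badCharShift are made up for illustration, and the standard String.lastIndexOf call stands in for a precomputed table. It returns -1 for a symbol that does not occur, so a missing symbol yields a shift past the mismatch position.

```java
class BadCharShiftDemo {
    // Shift suggested by the bad character heuristics when pattern position j
    // mismatches text symbol c. lastIndexOf yields the rightmost position of c
    // in the pattern, or -1 if c does not occur at all.
    static int badCharShift(String pattern, int j, char c) {
        return j - pattern.lastIndexOf(c);
    }

    public static void main(String[] args) {
        String p = "babac";
        // first example: text symbol d does not occur in the pattern
        System.out.println(badCharShift(p, 4, 'd')); // prints 5
        // second example: rightmost b in the pattern is at position 2
        System.out.println(badCharShift(p, 4, 'b')); // prints 2
    }
}
```

Since the pattern starts at position 0 in both examples, the shifts 5 and 2 correspond to the new positions 5 and 2 shown above.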

 

Good suffix heuristics

Sometimes the bad character heuristics fails. In the following situation the comparison a-b causes a mismatch. Aligning the rightmost occurrence of the pattern symbol a with the text symbol a would produce a negative shift. A shift by 1 would be possible instead, but in this case it is better to derive the maximum possible shift distance from the structure of the pattern. This method is called good suffix heuristics.

Example:  

0 1 2 3 4 5 6 7 8 9 ...
abaababacba
cabab      
  cabab    

The suffix ab has matched. The pattern can be shifted until the next occurrence of ab in the pattern is aligned to the text symbols ab, i.e. to position 2.

In the following situation the suffix ab has matched. There is no other occurrence of ab in the pattern. Therefore, the pattern can be shifted behind ab, i.e. to position 5.

Example:  

0 1 2 3 4 5 6 7 8 9 ...
abcababacba
cbaab      
     cbaab 

In the following situation the suffix bab has matched. There is no other occurrence of bab in the pattern. In this case, however, the pattern cannot be shifted to position 5 as before, but only to position 3, since a prefix of the pattern (ab) matches the end of bab. We refer to this situation as case 2 of the good suffix heuristics.

Example:  

0 1 2 3 4 5 6 7 8 9 ...
aabababacba
abbab      
   abbab   

The pattern is shifted by the larger of the two distances that are given by the bad character and the good suffix heuristics.
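
The three good suffix examples above can be reproduced by a brute-force reading of the rule: try every shift k in increasing order and accept the first one for which the shifted pattern agrees with the matched suffix and does not put the same mismatching pattern symbol back in place. The sketch below is an illustration only, not the preprocessing used by the algorithm; the class and method names are hypothetical.

```java
class GoodSuffixBruteForce {
    // j is the pattern position of the mismatch; p[j+1..m-1] has matched.
    // Returns the smallest shift k >= 1 that is compatible with the match.
    static int shift(String p, int j) {
        int m = p.length();
        for (int k = 1; k < m; k++) {
            boolean ok = true;
            // every matched position must see the same symbol after the shift,
            // unless the shifted pattern no longer covers that position
            for (int q = j + 1; q < m && ok; q++) {
                if (q - k >= 0 && p.charAt(q - k) != p.charAt(q)) ok = false;
            }
            // the symbol moved to the mismatch position must differ from p[j]
            if (ok && j - k >= 0 && p.charAt(j - k) == p.charAt(j)) ok = false;
            if (ok) return k;
        }
        return m; // nothing can be reused: shift past the window
    }

    public static void main(String[] args) {
        System.out.println(shift("cabab", 2)); // prints 2 (ab occurs again)
        System.out.println(shift("cbaab", 2)); // prints 5 (no other ab)
        System.out.println(shift("abbab", 1)); // prints 3 (case 2: prefix ab)
    }
}
```

This takes cubic time in the pattern length in the worst case, which is why the actual algorithm derives the same information from the borders of the pattern suffixes in a preprocessing step.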

 

Preprocessing for the bad character heuristics

For the bad character heuristics a function occ is required which yields, for each symbol of the alphabet, the position of its rightmost occurrence in the pattern, or -1 if the symbol does not occur in the pattern.

Definition:  Let A be the underlying alphabet.

The occurrence function  occ : A* × A → ℤ  is defined as follows:

Let  p ∈ A*  with  p = p_0 ... p_{m-1}  be the pattern and  a ∈ A  an alphabet symbol. Then

occ(p, a)  =  max{ j  |  p_j = a }

Here max(∅) is set to -1.

Example:  

  • occ(text, x) = 2
  • occ(text, t) = 3

The rightmost occurrence of symbol 'x' in the string 'text' is at position 2. Symbol 't' occurs at positions 0 and 3; the rightmost occurrence is at position 3.

The occurrence function for a certain pattern p is stored in an array occ which is indexed by the alphabet symbols. For each symbol a ∈ A the corresponding value occ(p, a) is stored in occ[a].

The following function bmInitocc computes the occurrence function for a given pattern p.

Bad character preprocessing
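void bmInitocc()
{
    int a, j;

    // an int counter is used here so that the loop also terminates
    // when alphabetsize exceeds the range of char
    for (a=0; a<alphabetsize; a++)
        occ[a]=-1;

    // later positions overwrite earlier ones, so occ[a] ends up
    // holding the rightmost occurrence of a in the pattern
    for (j=0; j<m; j++)
    {
        a=p[j];
        occ[a]=j;
    }
}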
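
For experimenting outside the surrounding class, the same table can be built by a self-contained method. The following is a sketch under assumptions: the pattern is passed as a String, the alphabet size is a parameter, and every pattern symbol is assumed to be smaller than that size (256 in the usage below). The names OccTableDemo and buildOcc are hypothetical.

```java
import java.util.Arrays;

class OccTableDemo {
    // Builds the occurrence table for pattern p.
    // Assumption: every symbol of p has a code below alphabetSize.
    static int[] buildOcc(String p, int alphabetSize) {
        int[] occ = new int[alphabetSize];
        Arrays.fill(occ, -1);              // -1 marks "does not occur"
        for (int j = 0; j < p.length(); j++) {
            occ[p.charAt(j)] = j;          // rightmost occurrence wins
        }
        return occ;
    }

    public static void main(String[] args) {
        int[] occ = buildOcc("text", 256);
        System.out.println(occ['x']); // prints 2
        System.out.println(occ['t']); // prints 3
        System.out.println(occ['z']); // prints -1
    }
}
```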



Preprocessing for the good-suffix heuristics
 

For the good-suffix heuristics an array s is used. Each entry s[i] contains the shift distance of the pattern if a mismatch at position i – 1 occurs, i.e. if the suffix of the pattern starting at position i has matched. In order to determine the shift distance, two cases have to be considered.

Case 1:   The matching suffix occurs somewhere else in the pattern (Figure 1).

Case 2:   Only a part of the matching suffix occurs at the beginning of the pattern (Figure 2).

 
Figure 1:  The matching suffix (gray) occurs somewhere else in the pattern

Figure 2:  Only a part of the matching suffix occurs at the beginning of the pattern
 
Case 1:

The situation is similar to the Knuth-Morris-Pratt preprocessing. The matching suffix is a border of a suffix of the pattern. Thus, the borders of the suffixes of the pattern have to be determined. However, now the inverse mapping is needed between a given border and the shortest suffix of the pattern that has this border.

Moreover, it is necessary that the border cannot be extended to the left by the same symbol, since this would cause another mismatch after shifting the pattern.

In the following first part of the preprocessing algorithm an array f is computed. Each entry f[i] contains the starting position of the widest border of the suffix of the pattern beginning at position i. The suffix ε beginning at position m has no border, therefore f[m] is set to m+1.

Similar to the Knuth-Morris-Pratt preprocessing algorithm, each border is computed by checking if a shorter border that is already known can be extended to the left by the same symbol.

However, the case when a border cannot be extended to the left is also interesting, since it leads to a promising shift of the pattern if a mismatch occurs. Therefore, the corresponding shift distance is saved in an array s – provided that this entry is not already occupied. The latter is the case when a shorter suffix has the same border.

Good suffix preprocessing case 1


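void bmPreprocess1()
{
    // the empty suffix at position m has no border: f[m]=m+1
    int i=m, j=m+1;
    f[i]=j;
    while (i>0)
    {
        // the border starting at j cannot be extended to the left by p[i-1]
        while (j<=m && p[i-1]!=p[j-1])
        {
            // save the shift distance unless a shorter suffix has already set it
            if (s[j]==0) s[j]=j-i;
            // continue with the next shorter border
            j=f[j];
        }
        // the border has been extended to the left
        i--; j--;
        f[i]=j;
    }
}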


Example:  A visualization of the preprocessing algorithm is given in [3]. The following example shows the values in array f and in array s.

i:  0  1  2  3  4  5  6  7
p:  a  b  b  a  b  a  b
f:  5  6  4  5  6  7  7  8
s:  0  0  0  0  2  0  4  1

The widest border of suffix babab beginning at position 2 is bab, beginning at position 4. Therefore, f[2] = 4. The widest border of suffix ab beginning at position 5 is ε, beginning at position 7. Therefore, f[5] = 7.

The values of array s are determined by the borders that cannot be extended to the left.

The suffix babab beginning at position 2 has border bab, beginning at position 4. This border cannot be extended to the left since p[1] ≠ p[3]. The difference 4 – 2 = 2 is the shift distance if bab has matched and then a mismatch occurs. Therefore, s[4] = 2.

The suffix babab beginning at position 2 has border b, too, beginning at position 6. This border cannot be extended either. The difference 6 – 2 = 4 is the shift distance if b has matched and then a mismatch occurs. Therefore, s[6] = 4.

The suffix b beginning at position 6 has border ε, beginning at position 7. This border cannot be extended to the left. The difference 7 – 6 = 1 is the shift distance if nothing has matched, i.e. if a mismatch occurs in the first comparison. Therefore, s[7] = 1.
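
The values in this example can be reproduced with a self-contained version of the case 1 preprocessing. The sketch below mirrors bmPreprocess1 but takes the pattern as a String and returns both arrays; the class name and the return convention are assumptions made for this illustration.

```java
import java.util.Arrays;

class GoodSuffixCase1Demo {
    // Returns { f, s } as they stand after case 1 of the preprocessing.
    static int[][] preprocessCase1(String p) {
        int m = p.length();
        int[] f = new int[m + 1];
        int[] s = new int[m + 1];   // all zero: every entry is still free
        int i = m, j = m + 1;
        f[i] = j;                   // the empty suffix has no border
        while (i > 0) {
            while (j <= m && p.charAt(i - 1) != p.charAt(j - 1)) {
                if (s[j] == 0) s[j] = j - i;  // border not extendable: save shift
                j = f[j];                     // try the next shorter border
            }
            i--; j--;
            f[i] = j;
        }
        return new int[][] { f, s };
    }

    public static void main(String[] args) {
        int[][] fs = preprocessCase1("abbabab");
        System.out.println(Arrays.toString(fs[0])); // [5, 6, 4, 5, 6, 7, 7, 8]
        System.out.println(Arrays.toString(fs[1])); // [0, 0, 0, 0, 2, 0, 4, 1]
    }
}
```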

Case 2:

In this situation, a part of the matching suffix of the pattern occurs at the beginning of the pattern. This means that this part is a border of the pattern. The pattern can be shifted as far as its widest matching border allows (Figure 2).

In the preprocessing for case 2, for each suffix the widest border of the pattern that is contained in that suffix is determined.

The starting position of the widest border of the whole pattern is stored in f[0]. In the example above this is 5 since the border ab starts at position 5.

In the following preprocessing algorithm, this value f[0] is stored initially in all free entries of array s. But when the suffix of the pattern becomes shorter than f[0], the algorithm continues with the next-wider border of the pattern, i.e. with f[j].

Good suffix preprocessing case 2


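void bmPreprocess2()
{
    int i, j;
    // start with the widest border of the whole pattern
    j=f[0];
    for (i=0; i<=m; i++)
    {
        // fill only the entries that case 1 has left free
        if (s[i]==0) s[i]=j;
        // the suffix is now shorter than this border: continue with f[j]
        if (i==j) j=f[j];
    }
}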


Example:  A visualization of the execution of the algorithm is given in [3]. The following example shows the final values of array s.

i:  0  1  2  3  4  5  6  7
p:  a  b  b  a  b  a  b
f:  5  6  4  5  6  7  7  8
s:  5  5  5  5  2  5  4  1
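
The step from the case 1 values of s to these final values can be replayed in isolation. The sketch below hard-codes the arrays f and s from the case 1 example for the pattern abbabab and applies the same loop as bmPreprocess2; the class name and the choice to return a copy rather than modify s in place are assumptions of this illustration.

```java
import java.util.Arrays;

class GoodSuffixCase2Demo {
    // Fills the entries of s that case 1 left at 0, using the borders in f.
    static int[] preprocessCase2(int[] f, int[] s) {
        int[] result = s.clone();
        int m = s.length - 1;
        int j = f[0];                         // widest border of the pattern
        for (int i = 0; i <= m; i++) {
            if (result[i] == 0) result[i] = j;
            if (i == j) j = f[j];             // suffix too short: next border
        }
        return result;
    }

    public static void main(String[] args) {
        int[] f = {5, 6, 4, 5, 6, 7, 7, 8};   // from the case 1 example
        int[] s = {0, 0, 0, 0, 2, 0, 4, 1};   // from the case 1 example
        System.out.println(Arrays.toString(preprocessCase2(f, s)));
        // prints [5, 5, 5, 5, 2, 5, 4, 1]
    }
}
```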

The entire preprocessing algorithm of the Boyer-Moore algorithm consists of the bad character preprocessing and both parts of the good suffix preprocessing.

Boyer-Moore preprocessing


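void bmPreprocess()
{
    // f and s are shared with bmPreprocess1 and bmPreprocess2, so they are
    // assigned here rather than declared as locals; s must start out all
    // zero, which a newly allocated int array is
    f=new int[m+1];
    s=new int[m+1];
    bmInitocc();
    bmPreprocess1();
    bmPreprocess2();
}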

Searching algorithm 

The searching algorithm compares the symbols of the pattern from right to left with the text. After a complete match the pattern is shifted according to how much its widest border allows. After a mismatch the pattern is shifted by the maximum of the values given by the good-suffix and the bad-character heuristics.

 

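Boyer-Moore searching algorithm


void bmSearch()
{
    int i=0, j;
    while (i<=n-m)
    {
        // compare the pattern with the text window from right to left
        j=m-1;
        while (j>=0 && p[j]==t[i+j]) j--;
        if (j<0)
        {
            // complete match: report it and shift as far as the widest border allows
            report(i);
            i+=s[0];
        }
        else
            // mismatch at pattern position j: take the larger of the two shifts
            i+=Math.max(s[j+1], j-occ[t[i+j]]);
    }
}


Analysis 
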
If there are only a constant number of matches of the pattern in the text, the Boyer-Moore searching algorithm performs O(n) comparisons in the worst case. The proof of this is rather difficult.

In general Θ(n·m) comparisons are necessary, e.g. if the pattern is a^m and the text a^n. By a slight modification of the algorithm the number of comparisons can be bounded to O(n) even in the general case.

If the alphabet is large compared to the length of the pattern, the algorithm performs O(n/m) comparisons on the average. This is because often a shift by m is possible due to the bad character heuristics.
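
These bounds can be observed on small inputs by counting symbol comparisons. The following self-contained sketch combines the preprocessing and the search from the previous sections and adds a counter. The class name BoyerMooreCounter, the method signature, and the table of size Character.MAX_VALUE + 1 are choices made for this illustration and are not part of the original listing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BoyerMooreCounter {
    // Searches p in t, appends all match positions to matches and
    // returns the number of symbol comparisons performed.
    static long search(String p, String t, List<Integer> matches) {
        int m = p.length(), n = t.length();

        // bad character table, large enough for every Java char
        int[] occ = new int[Character.MAX_VALUE + 1];
        Arrays.fill(occ, -1);
        for (int k = 0; k < m; k++) occ[p.charAt(k)] = k;

        // good suffix table, case 1
        int[] f = new int[m + 1], s = new int[m + 1];
        int i = m, j = m + 1;
        f[i] = j;
        while (i > 0) {
            while (j <= m && p.charAt(i - 1) != p.charAt(j - 1)) {
                if (s[j] == 0) s[j] = j - i;
                j = f[j];
            }
            i--; j--;
            f[i] = j;
        }
        // good suffix table, case 2
        j = f[0];
        for (i = 0; i <= m; i++) {
            if (s[i] == 0) s[i] = j;
            if (i == j) j = f[j];
        }

        // search, counting every comparison of a pattern and a text symbol
        long comparisons = 0;
        i = 0;
        while (i <= n - m) {
            j = m - 1;
            while (j >= 0) {
                comparisons++;
                if (p.charAt(j) != t.charAt(i + j)) break;
                j--;
            }
            if (j < 0) {
                matches.add(i);
                i += s[0];
            } else {
                i += Math.max(s[j + 1], j - occ[t.charAt(i + j)]);
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        // pattern a^3 in text a^10: every window matches completely
        System.out.println(search("aaa", "aaaaaaaaaa", new ArrayList<>())); // 24
        // no text symbol occurs in the pattern: one comparison per window
        System.out.println(search("abc", "xxxxxxxxx", new ArrayList<>()));  // 3
    }
}
```

With m = 3 and n = 10 the first call inspects n - m + 1 = 8 windows with m comparisons each, which gives the 24 comparisons of the Θ(n·m) case. The second call shifts by m after every first comparison and needs only n/m = 3 comparisons.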

 

Conclusions

The Boyer-Moore algorithm uses two different heuristics for determining the maximum possible shift distance in case of a mismatch: the "bad character" and the "good suffix" heuristics. Both heuristics can lead to a shift distance of m. For the bad character heuristics this is the case if the first comparison causes a mismatch and the corresponding text symbol does not occur in the pattern at all. For the good suffix heuristics this is the case if only the first comparison was a match, but that symbol does not occur elsewhere in the pattern.

The preprocessing for the good suffix heuristics is rather difficult to understand and to implement. Therefore, versions of the Boyer-Moore algorithm are sometimes found in which the good suffix heuristics is left out. The argument is that the bad character heuristics would be sufficient and the good suffix heuristics would not save many comparisons. However, this is not true for small alphabets.

If for simplicity one wants to restrict oneself to the bad character heuristics, the Horspool algorithm [Hor 80] or the Sunday algorithm [Sun 90] are better suited.

 

References

   
[BM 77] R.S. Boyer, J.S. Moore: A Fast String Searching Algorithm. Communications of the ACM, 20, 10, 762-772 (1977)
[Hor 80] R.N. Horspool: Practical Fast Searching in Strings. Software - Practice and Experience 10, 501-506 (1980)
[Sun 90] D.M. Sunday: A Very Fast Substring Search Algorithm. Communications of the ACM, 33, 8, 132-142 (1990)
  
[1] http://www-igm.univ-mlv.fr/~lecroq/string/
[2] http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/stringmatchingclasses/BmStringMatcher.java
    Boyer-Moore algorithm as a Java class source file
[3] http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/bmPreprocess.xls
    Boyer-Moore good suffix preprocessing visualization in Excel

 

