String Search Algorithms: The Two-Way Algorithm

Article link: https://blog.csdn.net/anghlq/article/details/48573293

The strstr in glibc uses the Two-Way algorithm, which rests on the theory of critical factorization.

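To understand critical factorization, first understand the period of a string.
Let w be a non-empty string over an alphabet A, and let |w| be its length. A positive integer p
is called a period of w if, for all non-negative integers i, j (i, j < |w|) that are congruent modulo p,
w[i] = w[j]
holds. The smallest period of w is written p(w).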

If a positive integer wp satisfies
w[i] = w[wp+i]

for every i for which both sides are defined, then wp is called a weak period of w.
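
The two definitions describe the same property: chaining w[i] = w[i+p] step by step gives w[i] = w[j] for every pair of indices congruent modulo p. A brute-force sketch that checks this directly is shown below. The helper names is_period and smallest_period are hypothetical and do not come from glibc.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if p is a period of w[0..n): w[i] == w[i + p] whenever both
   indices are in range (the weak-period condition from the text).  */
static int
is_period (const char *w, size_t n, size_t p)
{
  assert (p > 0);
  for (size_t i = 0; i + p < n; i++)
    if (w[i] != w[i + p])
      return 0;
  return 1;
}

/* Smallest period p(w) by brute force, O(n^2) in the worst case.
   p == n is trivially a period, so it is the fallback.  */
static size_t
smallest_period (const char *w)
{
  size_t n = strlen (w);
  assert (n > 0);
  for (size_t p = 1; p < n; p++)
    if (is_period (w, n, p))
      return p;
  return n;
}
```

For example, smallest_period ("abcabcab") is 3, while a string with no internal repetition such as "abc" has its own length as its smallest period.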

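A positive integer lp is called a local period of a non-empty string w at position l if there are
strings u, v, x over A with w = uv, |u| = l+1, |x| = lp, and strings r, s satisfying one of the following conditions:
1. u = rx && v = xs
2. x = ru && v = xs (r is not empty)
3. u = rx && x = vs (s is not empty)
4. x = ru && x = vs (r and s are not empty)
The smallest local period of w at position l is written p(w, l). If p(w, l) = p(w), then the pair (u, v)
is called a critical factorization of w, and l is called a critical position.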
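
The four cases can be folded into one check: a word x of length lp must read the same on both sides of the cut, and any part of x that falls outside w is unconstrained. The following is a brute-force sketch under that reading. All function names are hypothetical, and position l follows the convention above (l is the index of the last character of u).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if lp is a local period of w[0..n) at position l, i.e. at the
   cut between w[l] and w[l + 1].  */
static int
is_local_period (const char *w, size_t n, size_t l, size_t lp)
{
  assert (lp > 0 && l + 1 < n);
  for (size_t k = 0; k < lp && k <= l; k++)
    {
      size_t left = l - k;        /* index inside u */
      size_t right = left + lp;   /* matching index inside v */
      if (right < n && w[left] != w[right])
        return 0;
    }
  return 1;
}

/* Smallest local period p(w, l); lp == n always qualifies.  */
static size_t
smallest_local_period (const char *w, size_t l)
{
  size_t n = strlen (w);
  for (size_t lp = 1; lp < n; lp++)
    if (is_local_period (w, n, l, lp))
      return lp;
  return n;
}

/* Smallest (global) period p(w), by brute force.  */
static size_t
smallest_period (const char *w)
{
  size_t n = strlen (w);
  for (size_t p = 1; p < n; p++)
    {
      size_t i = 0;
      while (i + p < n && w[i] == w[i + p])
        i++;
      if (i + p >= n)
        return p;
    }
  return n;
}

/* l is a critical position when p(w, l) == p(w).  */
static int
is_critical_position (const char *w, size_t l)
{
  return smallest_local_period (w, l) == smallest_period (w);
}
```

With w = "GCAGAGAG" (smallest period 7), the cut after "GC" (l = 1) has local period 7 and is therefore critical, while the cut after "GCAG" (l = 3) has local period 2 and is not.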

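The first step of the Two-Way algorithm is to find a critical factorization. It uses the maximal suffix method, based on the following theory.
A substring of w is written w[i..e) = w[i]w[i+1]...w[e-1]. The maximal suffix of w is w[i..|w|), where
i (i >= 0 && i < |w|) is the integer such that every valid integer j satisfies:
w[j..|w|) <= w[i..|w|) (4)
The reverse maximal suffix is defined in the same way, except that (4) is replaced by:
w[j..|w|) >= w[i..|w|) (5)
Here (5) is read with the order on the alphabet reversed, which is what the second loop of critical_factorization in the code below does by swapping the roles of a and b.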

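It can be shown that every string w (|w| >= 2) has at least one critical factorization with l < p. Moreover, let v be the
maximal suffix of w with w = uv, and let m be the tilde (reverse-order) maximal suffix of w with w = nm. If |v| <= |m|, then
(u, v) is a critical factorization of w; otherwise (n, m) is. In other words, the shorter of the two suffixes is chosen, which matches the end of critical_factorization in the code below: it returns the larger of the two start indices.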
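
The theorem can be tried out with a brute-force sketch that compares all suffixes directly. This is only an illustration: glibc computes the same cut in linear time, and the names suffix_greater, max_suffix_start and critical_cut are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if suffix w[a..n) is greater than suffix w[b..n).  With
   rev != 0 the alphabet order is reversed; in both orderings a proper
   prefix is smaller than the word it is a prefix of.  */
static int
suffix_greater (const unsigned char *w, size_t n, size_t a, size_t b, int rev)
{
  while (a < n && b < n && w[a] == w[b])
    a++, b++;
  if (a == n || b == n)
    return b == n && a < n;     /* w[b..n) ran out first: it is the prefix */
  return rev ? w[a] < w[b] : w[a] > w[b];
}

/* Start index of the maximal suffix of w[0..n), comparing all suffixes.  */
static size_t
max_suffix_start (const unsigned char *w, size_t n, int rev)
{
  size_t best = 0;
  for (size_t s = 1; s < n; s++)
    if (suffix_greater (w, n, s, best, rev))
      best = s;
  return best;
}

/* Start index of the right half of a critical factorization: the
   shorter of the two maximal suffixes, i.e. the larger start index.
   The critical position l of the text is this value minus one.  */
static size_t
critical_cut (const char *w)
{
  size_t n = strlen (w);
  assert (n > 0);
  size_t fwd = max_suffix_start ((const unsigned char *) w, n, 0);
  size_t rev = max_suffix_start ((const unsigned char *) w, n, 1);
  return fwd > rev ? fwd : rev;
}
```

For "GCAGAGAG" the maximal suffix is the whole string, the reverse-order maximal suffix is "AGAGAG", and critical_cut returns 2, giving the factorization ("GC", "AGAGAG").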

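The Boyer-Moore (BM) algorithm

After the pattern is aligned with the text, comparison starts from the end of the pattern. On a mismatch, the shift is computed as follows:

1. Bad-character rule: if the text character at the mismatch (the bad character) does not occur in the pattern, shift the pattern to just past the bad character; otherwise align its last occurrence in the pattern with the bad character.

2. Good-suffix rule: shift = position of the good suffix - position of its previous occurrence in the pattern; then resume comparing from the end.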

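For example, if the trailing "AB" of the string "ABCDAB" is the good suffix, its position is 5 (counting from 0 and taking the index of the final "B"), and its previous occurrence in the pattern is at position 1 (the index of the first "B"). The shift is therefore 5 - 1 = 4, which moves the earlier "AB" to where the later "AB" was.
As another example, if "EF" is the good suffix of the string "ABCDEF", its position is 5 and its previous occurrence is at -1 (it does not occur), so the shift is 5 - (-1) = 6, which moves the whole string to just past the "F".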
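
Both examples can be reproduced with a brute-force sketch of the rule: try shifts 1, 2, ... and take the first one under which the shifted pattern agrees with the good suffix wherever the two overlap. This is the plain (weak) form of the rule, not the linear-time table construction used by real implementations, and good_suffix_shift is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Shift for pattern PAT when its last K characters (the good suffix)
   have matched.  Positions that the shifted pattern no longer covers
   impose no constraint, which corresponds to the "-1" case in the text.  */
static size_t
good_suffix_shift (const char *pat, size_t k)
{
  size_t m = strlen (pat);
  assert (m > 0 && k <= m);
  for (size_t s = 1; s < m; s++)
    {
      int ok = 1;
      for (size_t i = m - k; i < m && ok; i++)
        if (i >= s && pat[i - s] != pat[i])
          ok = 0;
      if (ok)
        return s;
    }
  return m;     /* no earlier occurrence: move past the whole window */
}
```

Here good_suffix_shift ("ABCDAB", 2) yields 4 and good_suffix_shift ("ABCDEF", 2) yields 6, matching the two worked examples.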

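The two rules can give different shifts: the bad-character rule may allow a shift of only 3 where the good-suffix rule allows 6. The basic idea of Boyer-Moore is therefore to shift by the larger of the two values each time.
Better still, both shift amounts depend only on the pattern and not on the text being searched. A bad-character table and a good-suffix table can therefore be computed in advance, and the search only needs to look up and compare the two values.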
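
As a minimal sketch of the precomputation idea, the code below builds only the bad-character table and searches with that rule alone. A full Boyer-Moore would also consult a good-suffix table and take the larger shift. The names build_bad_char and bm_bad_char_search are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define ALPHABET (1u << CHAR_BIT)

/* last[c] = index of the last occurrence of byte c in pat, or -1.
   The table depends only on the pattern.  */
static void
build_bad_char (const unsigned char *pat, size_t m, long last[ALPHABET])
{
  for (size_t c = 0; c < ALPHABET; c++)
    last[c] = -1;
  for (size_t i = 0; i < m; i++)
    last[pat[i]] = (long) i;
}

/* Search using only the bad-character rule, comparing right to left.
   Returns the offset of the first match, or -1.  */
static long
bm_bad_char_search (const char *text, const char *pat)
{
  size_t n = strlen (text), m = strlen (pat);
  long last[ALPHABET];
  assert (m > 0);
  build_bad_char ((const unsigned char *) pat, m, last);
  for (size_t j = 0; j + m <= n; )
    {
      size_t i = m;
      while (i > 0 && pat[i - 1] == text[j + i - 1])
        i--;
      if (i == 0)
        return (long) j;
      /* Align the last occurrence of the bad character with the text;
         fall back to a shift of 1 if that would move backwards.  */
      long shift = (long) (i - 1) - last[(unsigned char) text[j + i - 1]];
      j += shift > 0 ? (size_t) shift : 1;
    }
  return -1;
}
```

The shift_table in two_way_long_needle in the glibc code later in this article is built in a similar spirit, although it is indexed by distance from the end of the needle rather than by position.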

The KMP algorithm (proposed by Knuth, Morris, and Pratt; comparatively the slowest of the algorithms discussed here)

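How it works:

Search for ababad in cababababadcaddecbd.

With the pattern aligned at text offset 1, the first five characters ababa match and the sixth (b in the text against d in the pattern) does not.

The matched part ababa ends with aba, which is also a prefix of the pattern, so the whole pattern can be shifted right by 2 and comparison continues from the current text position, avoiding 6 redundant operations.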
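
The shift of 2 comes from the failure function: for the matched prefix ababa (length 5) the longest proper prefix that is also a suffix is aba (length 3), so the pattern moves by 5 - 3 = 2. A minimal sketch of KMP follows; build_failure and kmp_search are hypothetical names, and returning -1 on allocation failure is a simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* fail[i] = length of the longest proper prefix of pat[0..i] that is
   also a suffix of it.  */
static void
build_failure (const char *pat, size_t m, size_t *fail)
{
  fail[0] = 0;
  for (size_t i = 1, k = 0; i < m; i++)
    {
      while (k > 0 && pat[i] != pat[k])
        k = fail[k - 1];
      if (pat[i] == pat[k])
        k++;
      fail[i] = k;
    }
}

/* Return the offset of the first match of pat in text, or -1 if there
   is none (also -1 if memory cannot be allocated).  */
static long
kmp_search (const char *text, const char *pat)
{
  size_t n = strlen (text), m = strlen (pat);
  assert (m > 0);
  size_t *fail = malloc (m * sizeof *fail);
  if (fail == NULL)
    return -1;
  build_failure (pat, m, fail);
  long found = -1;
  for (size_t i = 0, k = 0; i < n; i++)
    {
      /* The text index i never moves backwards; only k is reset.  */
      while (k > 0 && text[i] != pat[k])
        k = fail[k - 1];
      if (text[i] == pat[k])
        k++;
      if (k == m)
        {
          found = (long) (i + 1 - m);
          break;
        }
    }
  free (fail);
  return found;
}
```

On the example above, kmp_search ("cababababadcaddecbd", "ababad") finds the match at offset 5 after two shifts of 2.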

The relevant code from glibc 2.6 is attached below.

char *
STRSTR (const char *haystack_start, const char *needle_start)
{
  const char *haystack = haystack_start;
  const char *needle = needle_start;
  size_t needle_len; /* Length of NEEDLE.  */
  size_t haystack_len; /* Known minimum length of HAYSTACK.  */
  bool ok = true; /* True if NEEDLE is prefix of HAYSTACK.  */

  /* Determine length of NEEDLE, and in the process, make sure
     HAYSTACK is at least as long (no point processing all of a long
     NEEDLE if HAYSTACK is too short).  */
  while (*haystack && *needle)
    ok &= *haystack++ == *needle++;
  if (*needle)
    return NULL;
  if (ok)
    return (char *) haystack_start;

  /* Reduce the size of haystack using strchr, since it has a smaller
     linear coefficient than the Two-Way algorithm.  */
  needle_len = needle - needle_start;
  haystack = strchr (haystack_start + 1, *needle_start);
  if (!haystack || __builtin_expect (needle_len == 1, 0))
    return (char *) haystack;
  needle -= needle_len;
  haystack_len = (haystack > haystack_start + needle_len ? 1
                  : needle_len + haystack_start - haystack);

  /* Perform the search.  Abstract memory is considered to be an array
     of 'unsigned char' values, not an array of 'char' values.  See
     ISO C 99 section 6.2.6.1.  */
  if (needle_len < LONG_NEEDLE_THRESHOLD)
    return two_way_short_needle ((const unsigned char *) haystack,
                                 haystack_len,
                                 (const unsigned char *) needle, needle_len);
  return two_way_long_needle ((const unsigned char *) haystack, haystack_len,
                              (const unsigned char *) needle, needle_len);
}

static RETURN_TYPE
two_way_short_needle (const unsigned char *haystack, size_t haystack_len,
                      const unsigned char *needle, size_t needle_len)
{
  size_t i; /* Index into current byte of NEEDLE.  */
  size_t j; /* Index into current window of HAYSTACK.  */
  size_t period; /* The period of the right half of needle.  */
  size_t suffix; /* The index of the right half of needle.  */

  /* Factor the needle into two halves, such that the left half is
     smaller than the global period, and the right half is
     periodic (with a period as large as NEEDLE_LEN - suffix).  */
  suffix = critical_factorization (needle, needle_len, &period);

  /* Perform the search.  Each iteration compares the right half
     first.  */
  if (CMP_FUNC (needle, needle + period, suffix) == 0)
    {
      /* Entire needle is periodic; a mismatch can only advance by the
         period, so use memory to avoid rescanning known occurrences
         of the period.  */
      size_t memory = 0;
      j = 0;
      while (AVAILABLE (haystack, haystack_len, j, needle_len))
        {
          /* Scan for matches in right half.  */
          i = MAX (suffix, memory);
          while (i < needle_len && (CANON_ELEMENT (needle[i])
                                    == CANON_ELEMENT (haystack[i + j])))
            ++i;
          if (needle_len <= i)
            {
              /* Scan for matches in left half.  */
              i = suffix - 1;
              while (memory < i + 1 && (CANON_ELEMENT (needle[i])
                                        == CANON_ELEMENT (haystack[i + j])))
                --i;
              if (i + 1 < memory + 1)
                return (RETURN_TYPE) (haystack + j);
              /* No match, so remember how many repetitions of period
                 on the right half were scanned.  */
              j += period;
              memory = needle_len - period;
            }
          else
            {
              j += i - suffix + 1;
              memory = 0;
            }
        }
    }
  else
    {
      /* The two halves of needle are distinct; no extra memory is
         required, and any mismatch results in a maximal shift.  */
      period = MAX (suffix, needle_len - suffix) + 1;
      j = 0;
      while (AVAILABLE (haystack, haystack_len, j, needle_len))
        {
          /* Scan for matches in right half.  */
          i = suffix;
          while (i < needle_len && (CANON_ELEMENT (needle[i])
                                    == CANON_ELEMENT (haystack[i + j])))
            ++i;
          if (needle_len <= i)
            {
              /* Scan for matches in left half.  */
              i = suffix - 1;
              while (i != SIZE_MAX && (CANON_ELEMENT (needle[i])
                                       == CANON_ELEMENT (haystack[i + j])))
                --i;
              if (i == SIZE_MAX)
                return (RETURN_TYPE) (haystack + j);
              j += period;
            }
          else
            j += i - suffix + 1;
        }
    }
  return NULL;
}

static size_t
critical_factorization (const unsigned char *needle, size_t needle_len,
                        size_t *period)
{
  /* Index of last byte of left half, or SIZE_MAX.  */
  size_t max_suffix, max_suffix_rev;
  size_t j; /* Index into NEEDLE for current candidate suffix.  */
  size_t k; /* Offset into current period.  */
  size_t p; /* Intermediate period.  */
  unsigned char a, b; /* Current comparison bytes.  */


  /* Invariants:
     0 <= j < NEEDLE_LEN - 1
     -1 <= max_suffix{,_rev} < j (treating SIZE_MAX as if it were signed)
     min(max_suffix, max_suffix_rev) < global period of NEEDLE
     1 <= p <= global period of NEEDLE
     p == global period of the substring NEEDLE[max_suffix{,_rev}+1...j]
     1 <= k <= p
  */


  /* Perform lexicographic search.  */
  max_suffix = SIZE_MAX;
  j = 0;
  k = p = 1;
  while (j + k < needle_len)
    {
      a = CANON_ELEMENT (needle[j + k]);
      b = CANON_ELEMENT (needle[max_suffix + k]);
      if (a < b)
        {
          /* Suffix is smaller, period is entire prefix so far.  */
          j += k;
          k = 1;
          p = j - max_suffix;
        }
      else if (a == b)
        {
          /* Advance through repetition of the current period.  */
          if (k != p)
            ++k;
          else
            {
              j += p;
              k = 1;
            }
        }
      else /* b < a */
        {
          /* Suffix is larger, start over from current location.  */
          max_suffix = j++;
          k = p = 1;
        }
    }
  *period = p;


  /* Perform reverse lexicographic search.  */
  max_suffix_rev = SIZE_MAX;
  j = 0;
  k = p = 1;
  while (j + k < needle_len)
    {
      a = CANON_ELEMENT (needle[j + k]);
      b = CANON_ELEMENT (needle[max_suffix_rev + k]);
      if (b < a)
        {
          /* Suffix is smaller, period is entire prefix so far.  */
          j += k;
          k = 1;
          p = j - max_suffix_rev;
        }
      else if (a == b)
        {
          /* Advance through repetition of the current period.  */
          if (k != p)
            ++k;
          else
            {
              j += p;
              k = 1;
            }
        }
      else /* a < b */
        {
          /* Suffix is larger, start over from current location.  */
          max_suffix_rev = j++;
          k = p = 1;
        }
    }


  /* Choose the shorter suffix, i.e. the larger start index.  Return the
     first byte of the right half, rather than the last byte of the left
     half.  */
  if (max_suffix_rev + 1 < max_suffix + 1)
    return max_suffix + 1;
  *period = p;
  return max_suffix_rev + 1;
}
/* Return the first location of non-empty NEEDLE within HAYSTACK, or
   NULL.  HAYSTACK_LEN is the minimum known length of HAYSTACK.  This
   method is optimized for LONG_NEEDLE_THRESHOLD <= NEEDLE_LEN.
   Performance is guaranteed to be linear, with an initialization cost
   of 3 * NEEDLE_LEN + (1 << CHAR_BIT) operations.

   If AVAILABLE does not modify HAYSTACK_LEN (as in memmem), then at
   most 2 * HAYSTACK_LEN - NEEDLE_LEN comparisons occur in searching,
   and sublinear performance O(HAYSTACK_LEN / NEEDLE_LEN) is possible.
   If AVAILABLE modifies HAYSTACK_LEN (as in strstr), then at most 3 *
   HAYSTACK_LEN - NEEDLE_LEN comparisons occur in searching, and
   sublinear performance is not possible.  */
static RETURN_TYPE
two_way_long_needle (const unsigned char *haystack, size_t haystack_len,
                     const unsigned char *needle, size_t needle_len)
{
  size_t i; /* Index into current byte of NEEDLE.  */
  size_t j; /* Index into current window of HAYSTACK.  */
  size_t period; /* The period of the right half of needle.  */
  size_t suffix; /* The index of the right half of needle.  */
  size_t shift_table[1U << CHAR_BIT]; /* See below.  */

  /* Factor the needle into two halves, such that the left half is
     smaller than the global period, and the right half is
     periodic (with a period as large as NEEDLE_LEN - suffix).  */
  suffix = critical_factorization (needle, needle_len, &period);

  /* Populate shift_table.  For each possible byte value c,
     shift_table[c] is the distance from the last occurrence of c to
     the end of NEEDLE, or NEEDLE_LEN if c is absent from the NEEDLE.
     shift_table[NEEDLE[NEEDLE_LEN - 1]] contains the only 0.  */
  for (i = 0; i < 1U << CHAR_BIT; i++)
    shift_table[i] = needle_len;
  for (i = 0; i < needle_len; i++)
    shift_table[CANON_ELEMENT (needle[i])] = needle_len - i - 1;

  /* Perform the search.  Each iteration compares the right half
     first.  */
  if (CMP_FUNC (needle, needle + period, suffix) == 0)
    {
      /* Entire needle is periodic; a mismatch can only advance by the
         period, so use memory to avoid rescanning known occurrences
         of the period.  */
      size_t memory = 0;
      size_t shift;
      j = 0;
      while (AVAILABLE (haystack, haystack_len, j, needle_len))
        {
          /* Check the last byte first; if it does not match, then
             shift to the next possible match location.  */
          shift = shift_table[CANON_ELEMENT (haystack[j + needle_len - 1])];
          if (0 < shift)
            {
              if (memory && shift < period)
                {
                  /* Since needle is periodic, but the last period has
                     a byte out of place, there can be no match until
                     after the mismatch.  */
                  shift = needle_len - period;
                }
              memory = 0;
              j += shift;
              continue;
            }
          /* Scan for matches in right half.  The last byte has
             already been matched, by virtue of the shift table.  */
          i = MAX (suffix, memory);
          while (i < needle_len - 1 && (CANON_ELEMENT (needle[i])
                                        == CANON_ELEMENT (haystack[i + j])))
            ++i;
          if (needle_len - 1 <= i)
            {
              /* Scan for matches in left half.  */
              i = suffix - 1;
              while (memory < i + 1 && (CANON_ELEMENT (needle[i])
                                        == CANON_ELEMENT (haystack[i + j])))
                --i;
              if (i + 1 < memory + 1)
                return (RETURN_TYPE) (haystack + j);
              /* No match, so remember how many repetitions of period
                 on the right half were scanned.  */
              j += period;
              memory = needle_len - period;
            }
          else
            {
              j += i - suffix + 1;
              memory = 0;
            }
        }
    }
  else
    {
      /* The two halves of needle are distinct; no extra memory is
         required, and any mismatch results in a maximal shift.  */
      size_t shift;
      period = MAX (suffix, needle_len - suffix) + 1;
      j = 0;
      while (AVAILABLE (haystack, haystack_len, j, needle_len))
        {
          /* Check the last byte first; if it does not match, then
             shift to the next possible match location.  */
          shift = shift_table[CANON_ELEMENT (haystack[j + needle_len - 1])];
          if (0 < shift)
            {
              j += shift;
              continue;
            }
          /* Scan for matches in right half.  The last byte has
             already been matched, by virtue of the shift table.  */
          i = suffix;
          while (i < needle_len - 1 && (CANON_ELEMENT (needle[i])
                                        == CANON_ELEMENT (haystack[i + j])))
            ++i;
          if (needle_len - 1 <= i)
            {
              /* Scan for matches in left half.  */
              i = suffix - 1;
              while (i != SIZE_MAX && (CANON_ELEMENT (needle[i])
                                       == CANON_ELEMENT (haystack[i + j])))
                --i;
              if (i == SIZE_MAX)
                return (RETURN_TYPE) (haystack + j);
              j += period;
            }
          else
            j += i - suffix + 1;
        }
    }
  return NULL;
}
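
From the caller's side all of this sits behind the standard strstr declared in <string.h>. The small wrapper below is a sketch for trying out the edge cases handled at the top of STRSTR (an empty needle, and a needle longer than the haystack); find_offset is a hypothetical helper, not part of glibc.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the offset of the first occurrence of needle in haystack,
   or -1 if strstr reports no match.  */
static long
find_offset (const char *haystack, const char *needle)
{
  assert (haystack != NULL && needle != NULL);
  const char *hit = strstr (haystack, needle);
  return hit ? (long) (hit - haystack) : -1;
}
```

An empty needle matches at offset 0, and a needle longer than the haystack gives -1, which corresponds to the early returns in STRSTR before either two_way function is called.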

Some benchmark figures found via Google, comparing performance on random strings:

Algorithm                                                Time
Boyer-Moore                                              1734 ms
Boyer-Moore-Horspool (Joel’s original implementation)    1101 ms
Boyer-Moore-Horspool (our implementation)                1080 ms
Turbo Boyer-Moore                                        2683 ms
strstr                                                   2116 ms
memmem                                                   3047 ms