String Matching Algorithms: Study Notes (Introduction and Templates)

LPoems

Last modified: 2024-10-09 20:53:16

Tags: algorithm study notes, c++, data structures

First published: 2024-10-09 20:43:56

Article link: https://blog.csdn.net/LPoems/article/details/142795327

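In a string matching problem, the goal is to find where a pattern string (Pattern) occurs in a text string (Text). The problem has many practical applications. This article introduces several classic string matching algorithms and also touches on parallel acceleration. The algorithms covered are:

Brute-force (naive) algorithm
KMP algorithm
BM algorithm
Sunday algorithm
CUDA-accelerated brute-force algorithm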

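1. Brute-Force String Matching (Brute Force)

Principle:

The simplest approach is the brute-force algorithm: try every possible starting position in the text and compare the substring at that position with the pattern character by character.

Time complexity:

Worst case: O(n*m)
(where n is the length of the text and m is the length of the pattern)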
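To make the O(n*m) bound concrete, here is a minimal sketch that counts character comparisons instead of reporting matches. The helper name countComparisons is hypothetical and not part of the original code. A text made of one repeated character with a pattern that differs only in its last character forces m comparisons at every one of the n - m + 1 positions.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: counts the character comparisons the brute-force scan performs.
long long countComparisons(const std::string &text, const std::string &pattern) {
    assert(!pattern.empty());  // assumed precondition for this sketch
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    long long comparisons = 0;
    for (int i = 0; i + m <= n; i++) {          // every candidate start position
        for (int j = 0; j < m; j++) {
            comparisons++;
            if (text[i + j] != pattern[j]) break;  // stop at the first mismatch
        }
    }
    return comparisons;
}
```

For a text of ten 'A' characters and the pattern "AAAB", each of the 7 positions costs 4 comparisons, 28 in total. When the first character already differs, each position costs a single comparison.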

Implementation:

#include <iostream>
#include <string>
#include <vector>

using namespace std;

vector<int> bruteForce(const string &text, const string &pattern) {
    vector<int> result;
    int n = text.size();
    int m = pattern.size();
    for (int i = 0; i <= n - m; i++) {
        int j;
        for (j = 0; j < m; j++) {
            if (text[i + j] != pattern[j])
                break;
        }
        if (j == m)
            result.push_back(i);
    }
    return result;
}

int main() {
    string text = "ABABABABCABABABCABABABC";
    string pattern = "ABABABC";
    vector<int> result = bruteForce(text, pattern);
    for(auto & i:result)
        cout << "Pattern found at index " << i << endl;
    return 0;
}

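2. KMP Algorithm (Knuth-Morris-Pratt)

Principle:

KMP builds a next array so that, after a mismatch, characters that have already been compared are not compared again, which improves efficiency.

The next array is computed in a single pass over the pattern with two variables: the current position i, and len, the length of the longest prefix matched so far that is also a suffix. The steps are:

Initialize an array next with the same length as the pattern and set next[0] to 0, because a single character has no proper prefix or suffix to compare.
Let i scan the pattern starting from the second character. If pattern[i] equals pattern[len], the prefix and the suffix agree: increment len, set next[i] to len, and move on to the next character (i++).
If the characters differ and len is not 0, fall back by setting len to next[len-1], the longest prefix-suffix length of the shorter prefix, and compare again without advancing i.
If len is 0, the current character cannot extend any prefix: set next[i] to 0 and move on to the next character.

In this way the next array records, for every position of the pattern, the length of the longest prefix that is also a suffix, which lets KMP skip invalid alignments quickly while scanning the text.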
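As a worked example of the steps above, the sketch below builds the table for the pattern used throughout this article. The function name buildNext is chosen for illustration only. The assertion inside simply checks the property that a proper prefix-suffix length at position i can never exceed i.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative helper: next[i] is the length of the longest proper prefix of
// pattern[0..i] that is also a suffix of pattern[0..i].
std::vector<int> buildNext(const std::string &pattern) {
    const int m = static_cast<int>(pattern.size());
    std::vector<int> next(m, 0);
    int len = 0;                       // length of the currently matched prefix
    for (int i = 1; i < m;) {
        if (pattern[i] == pattern[len]) {
            next[i++] = ++len;         // extend the matched prefix
        } else if (len != 0) {
            len = next[len - 1];       // fall back to a shorter prefix, keep i
        } else {
            next[i++] = 0;             // nothing matches at this position
        }
    }
    for (int i = 0; i < m; i++) assert(next[i] <= i);
    return next;
}
```

For "ABABABC" this yields 0 0 1 2 3 4 0. The final 'C' falls back from len = 4 to 2 to 0 before giving up, which is the fallback step in action.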

Time complexity:

Worst case: O(n + m)

Implementation:

#include <iostream>
#include <string>
#include <vector>

using namespace std;

vector<int> computeLPSArray(const string &pattern) {
    int m = pattern.size();
    vector<int> next(m, 0);
    int len = 0;
    int i = 1;

    while (i < m) {
        if (pattern[i] == pattern[len]) {
            len++;
            next[i] = len;
            i++;
        } else {
            if (len != 0) {
                len = next[len - 1];  // fall back to the next shorter prefix; i stays put
            } else {
                next[i] = 0;
                i++;
            }
        }
    }
    return next;
}

void KMPSearch(const string &text, const string &pattern) {
    int n = text.size();
    int m = pattern.size();
    vector<int> lps = computeLPSArray(pattern);
    int i = 0, j = 0;

    while (i < n) {
        if (pattern[j] == text[i]) {
            i++;
            j++;
        }
        if (j == m) {
            cout << "Pattern found at index " << i - j << endl;
            j = lps[j - 1];
        } else if (i < n && pattern[j] != text[i]) {
            if (j != 0) {
                j = lps[j - 1];
            } else {
                i++;
            }
        }
    }
}

int main() {
    string text = "ABABABABCABABABCABABABC";
    string pattern = "ABABABC";
    KMPSearch(text, pattern);
    return 0;
}

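3. BM Algorithm (Boyer-Moore)

Principle:

BM preprocesses the pattern and uses two rules, the bad-character rule and the good-suffix rule, to skip unnecessary character comparisons. The pattern is compared against the current text window from right to left, which is what makes large skips possible.

The idea of the bad-character rule is that when a pattern character does not match the text character, the pattern can jump past alignments that cannot match instead of sliding one character at a time.

When a pattern character does not match the text character, that text character is recorded as the "bad character".
Look up the rightmost occurrence of the bad character in the pattern.
Slide the pattern to the right so that this occurrence lines up with the bad character (or past it entirely if the pattern does not contain it), always moving at least one position.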
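The bad-character steps can be sketched as two small helpers. The names buildLastOccurrence and badCharShift are made up for this illustration, and the table assumes single-byte characters (256 entries).

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <string>

// Illustrative helper: last[c] is the rightmost index of byte c in the pattern, or -1.
std::array<int, 256> buildLastOccurrence(const std::string &pattern) {
    std::array<int, 256> last;
    last.fill(-1);
    for (int i = 0; i < static_cast<int>(pattern.size()); i++)
        last[static_cast<unsigned char>(pattern[i])] = i;
    return last;
}

// Slide distance after a mismatch at pattern index j against text character bad.
int badCharShift(const std::array<int, 256> &last, int j, char bad) {
    assert(j >= 0);
    // Never move backwards: if the rightmost occurrence lies to the right of j, shift by 1.
    return std::max(1, j - last[static_cast<unsigned char>(bad)]);
}
```

For the pattern "ABABABC", a mismatch at the last index (j = 6) against the text character 'A' gives a slide of 6 - 4 = 2, while a character that never appears in the pattern gives 6 - (-1) = 7, the full pattern length.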

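The good-suffix rule uses the part of the pattern that has already matched to speed up the slide further.

When a mismatch occurs, check whether the already-matched part (the "good suffix") occurs again elsewhere in the pattern.
If it does, slide the pattern so that this other occurrence lines up with the good suffix in the text.
If it does not, slide the pattern so that its longest prefix that is also a suffix of the good suffix lines up with the text, or past the window entirely if no such prefix exists.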
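The good-suffix rule is harder to picture than the bad-character rule, so here is a rough sketch of a search driven by the good-suffix table alone. It follows the commonly taught two-pass preprocessing and is not taken from the original post. The names goodSuffixSearch, shift and border are assumptions made for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch: Boyer-Moore search using only the good-suffix rule.
// shift[j + 1] is the slide after a mismatch at pattern index j;
// shift[0] is the slide after a full match.
std::vector<int> goodSuffixSearch(const std::string &text, const std::string &pattern) {
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    std::vector<int> shift(m + 1, 0);
    std::vector<int> border(m + 1, 0);  // border[i]: start of the widest border of pattern[i..m)

    // Case 1: the good suffix occurs again, preceded by a different character.
    int i = m, j = m + 1;
    border[i] = j;
    while (i > 0) {
        while (j <= m && pattern[i - 1] != pattern[j - 1]) {
            if (shift[j] == 0) shift[j] = j - i;
            j = border[j];
        }
        --i;
        --j;
        border[i] = j;
    }

    // Case 2: only a prefix of the pattern matches a suffix of the good suffix.
    j = border[0];
    for (i = 0; i <= m; ++i) {
        if (shift[i] == 0) shift[i] = j;
        if (i == j) j = border[j];
    }
    for (int s : shift) assert(s >= 1);  // every slide moves forward

    std::vector<int> matches;
    int s = 0;
    while (s <= n - m) {
        int k = m - 1;
        while (k >= 0 && pattern[k] == text[s + k]) --k;  // compare right to left
        if (k < 0) {
            matches.push_back(s);
            s += shift[0];
        } else {
            s += shift[k + 1];
        }
    }
    return matches;
}
```

For "ABABABC" the table is 7 for every entry except the last, which is 1. The pattern ends in a 'C' that appears nowhere else, so any non-empty good suffix can never reappear and the pattern slides by its full length. A full Boyer-Moore search would normally take the larger of this slide and the bad-character slide.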

Time complexity:

Best case: O(n/m)
Worst case: O(n * m)

Implementation (bad-character rule only):

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

void badCharHeuristic(const string &pattern, int size, int badChar[256]) {
    for (int i = 0; i < 256; i++)
        badChar[i] = -1;

    for (int i = 0; i < size; i++)
        badChar[(unsigned char)pattern[i]] = i;  // unsigned index: char may be signed
}

void BMSearch(const string &text, const string &pattern) {
    int n = text.size();
    int m = pattern.size();

    int badChar[256];
    badCharHeuristic(pattern, m, badChar);

    int s = 0;
    while (s <= (n - m)) {
        int j = m - 1;

        while (j >= 0 && pattern[j] == text[s + j])
            j--;

        if (j < 0) {
            cout << "Pattern found at index " << s << endl;
            s += (s + m < n) ? m - badChar[(unsigned char)text[s + m]] : 1;
        } else {
            s += max(1, j - badChar[(unsigned char)text[s + j]]);
        }
    }
}

int main() {
    string text = "ABABABABCABABABCABABABC";
    string pattern = "ABABABC";
    BMSearch(text, pattern);
    return 0;
}

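4. Sunday Algorithm

Principle:

The Sunday algorithm is a variant of BM. It decides how far to slide by looking at the text character immediately after the current window.
When a mismatch is found, Sunday does not slide the pattern one character at a time. Instead, the shift is determined by the text character just past the end of the current window (this is where it differs from BM).
If that character occurs in the pattern, slide the pattern so that its last occurrence in the pattern lines up with that text character.
If that character does not occur in the pattern, slide the pattern forward by the full pattern length + 1.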
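The two shift cases above can be sketched as a table lookup. The helper names buildSundayShift and nextWindowStart are invented for this example, and the table again assumes single-byte characters.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative helper: shift[c] is how far to slide when byte c follows the window.
std::vector<int> buildSundayShift(const std::string &pattern) {
    const int m = static_cast<int>(pattern.size());
    std::vector<int> shift(256, m + 1);            // character absent from the pattern
    for (int i = 0; i < m; i++)
        shift[static_cast<unsigned char>(pattern[i])] = m - i;  // last occurrence wins
    return shift;
}

// Start of the next window, given the current window start i and pattern length m.
int nextWindowStart(const std::string &text, int i, int m, const std::vector<int> &shift) {
    assert(i + m < static_cast<int>(text.size()));  // a character must follow the window
    return i + shift[static_cast<unsigned char>(text[i + m])];
}
```

For "ABABABC" (m = 7) the table holds 3 for 'A', 2 for 'B', 1 for 'C' and 8 for everything else. With the example text, the first window starts at 0, the character after it is text[7] = 'B', and the next window therefore starts at 2, which is the first match.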

Time complexity:

Worst case: O(n * m)
Average case: O(n)

Implementation:

#include <iostream>
#include <string>
#include <vector>

using namespace std;

void SundaySearch(const string &text, const string &pattern) {
    int n = text.size();
    int m = pattern.size();
    vector<int> shift(256, m + 1);

    for (int i = 0; i < m; i++) {
        shift[(unsigned char)pattern[i]] = m - i;  // unsigned index: char may be signed
    }

    int i = 0;
    while (i <= n - m) {
        int j = 0;
        while (j < m && pattern[j] == text[i + j]) {
            j++;
        }
        if (j == m) {
            cout << "Pattern found at index " << i << endl;
        }
        if (i + m >= n)
            break;  // no text character follows the last window
        i += shift[(unsigned char)text[i + m]];
    }
}

int main() {
    string text = "ABABABABCABABABCABABABC";
    string pattern = "ABABABC";
    SundaySearch(text, pattern);
    return 0;
}

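5. CUDA-Accelerated String Matching

Principle:

CUDA uses the parallel computing power of the GPU to accelerate processing of large data. For string matching, each candidate starting position in the text can be assigned to its own thread, so all positions are checked in parallel.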
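With one thread per starting position, the host has to launch enough blocks to cover all n - m + 1 positions. A small host-side sketch of that arithmetic in plain C++ follows. The helper name blocksFor is hypothetical, and 256 threads per block is simply the value the program in this section uses.

```cpp
#include <cassert>

// Hypothetical helper: number of blocks needed so that
// blocks * threadsPerBlock >= positions (ceiling division).
int blocksFor(int positions, int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    if (positions <= 0) return 0;  // nothing to search; the caller should skip the launch
    return (positions + threadsPerBlock - 1) / threadsPerBlock;
}
```

For the example text and pattern there are 23 - 7 + 1 = 17 positions, so a single block of 256 threads is enough. The simpler formula positions / 256 + 1 also covers every position but launches one spare block whenever positions is an exact multiple of 256. Either way the kernel needs a bounds check such as idx <= n - m so the surplus threads do nothing.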

CUDA implementation of the brute-force algorithm:

#include <iostream>
#include <string>
#include <cuda_runtime.h>

using namespace std;

__global__ void bruteForceCUDA(char* d_text, char* d_pattern, int n, int m, int* d_result) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx <= n - m) {
        bool match = true;
        for (int j = 0; j < m; j++) {
            if (d_text[idx + j] != d_pattern[j]) {
                match = false;
                break;
            }
        }
        if (match) {
            d_result[idx] = 1;
        }
    }
}

int main() {
    string text = "ABABABABCABABABCABABABC";
    string pattern = "ABABABC";
    int n = text.size();
    int m = pattern.size();

    char* d_text;
    char* d_pattern;
    int* d_result;
    int* result = new int[n - m + 1]();

    cudaMalloc(&d_text, n * sizeof(char));
    cudaMalloc(&d_pattern, m * sizeof(char));
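    cudaMalloc(&d_result, (n - m + 1) * sizeof(int));
    // cudaMalloc does not zero memory; clear the flags so non-matching positions read as 0
    cudaMemset(d_result, 0, (n - m + 1) * sizeof(int));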

    cudaMemcpy(d_text, text.c_str(), n * sizeof(char), cudaMemcpyHostToDevice);
    cudaMemcpy(d_pattern, pattern.c_str(), m * sizeof(char), cudaMemcpyHostToDevice);

    bruteForceCUDA<<<(n - m + 1) / 256 + 1, 256>>>(d_text, d_pattern, n, m, d_result);

    cudaMemcpy(result, d_result, (n - m + 1) * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i <= n - m; i++) {
        if (result[i] == 1) {
            cout << "Pattern found at index " << i << endl;
        }
    }

    cudaFree(d_text);
    cudaFree(d_pattern);
    cudaFree(d_result);
    delete[] result;

    return 0;
}

A Simple Experimental Comparison

Data: a fragment of the E. coli genome sequence
Brute Force: 10 microseconds, Matches: 921
KMP: 15 microseconds, Matches: 921
BM: 4 microseconds, Matches: 921
Sunday: 4 microseconds, Matches: 921
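The post does not show how these timings were taken. One way such a comparison could be set up is sketched below, using std::chrono::steady_clock. The names timeMicros and findAll are assumptions for this sketch, and findAll is only a stand-in matcher built on std::string::find, so this is not necessarily the harness behind the numbers above.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in matcher for the sketch; any function with this signature can be timed.
std::vector<int> findAll(const std::string &text, const std::string &pattern) {
    assert(!pattern.empty());  // assumed precondition for this sketch
    std::vector<int> hits;
    for (std::size_t pos = text.find(pattern); pos != std::string::npos;
         pos = text.find(pattern, pos + 1)) {
        hits.push_back(static_cast<int>(pos));  // overlapping matches are included
    }
    return hits;
}

// Runs the matcher once, stores the match count in matches,
// and returns the elapsed time in microseconds.
template <typename Matcher>
long long timeMicros(Matcher matcher, const std::string &text,
                     const std::string &pattern, std::size_t &matches) {
    const auto start = std::chrono::steady_clock::now();
    const std::vector<int> hits = matcher(text, pattern);
    const auto stop = std::chrono::steady_clock::now();
    matches = hits.size();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

A single run in the microsecond range is noisy, so repeating each measurement and comparing the match counts across algorithms (as the table above does with 921) is a reasonable sanity check.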