String Pattern Matching Algorithm (KMP)


skyword_sun


Article link: https://blog.csdn.net/skyword_sun/article/details/86028117


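[Compiled in April 2017]
Problem description
Write a program that compares the number of character comparisons made by the brute-force matching algorithm and by the KMP algorithm when matching strings, storing the strings in a sequential structure backed by a dynamic array.
Algorithm idea
The brute-force algorithm (BruteForce) matches character by character: when a character of the main string equals the first character of the pattern string, it goes on to compare the following characters; when a mismatch occurs at some position, the pattern index returns to 0 and matching restarts from the position right after the previous starting point in the main string. Its worst-case complexity is $O(mn)$, where $m$ and $n$ are the lengths of the main string and the pattern string respectively.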

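The KMP algorithm defines a next array for the pattern string. Its purpose is that, when a mismatch occurs, the pattern index does not return to the beginning but to position $next[j]$, and matching continues from there while the main-string index never moves backward. This saves unnecessary comparisons and still guarantees that no candidate position is skipped. The complexity is $O(m + n)$.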

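Meaning of the next array: for index j, $next[j]$ is the length of the longest common prefix and suffix of pattern characters 0 to j-1 (the longest proper prefix that is also a suffix). After a mismatch at position j, matching therefore resumes directly at the character right after that longest prefix.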

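Computing the next array: by recurrence. Once $next[j] = k$ has been computed:

If $str[j] == str[k]$, extend k directly, that is, $next[j + 1] = k + 1$.
Otherwise, treat the situation as a match against the first k characters, set $k = next[k]$ (by the recurrence this value already exists), and compare again; if k is already 0, then $next[j + 1] = 0$.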
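
The recurrence above can be sketched on a plain std::string. This is only an illustration under assumptions: the name buildNext and the use of std::vector are not from the article's Dstring code, and the convention $next[0] = -1$ follows the listing in main.cpp.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of the article's Dstring code).
// Returns next[0..m], where next[j] is the length of the longest proper
// prefix of pattern[0..j-1] that is also its suffix; next[0] = -1 by convention.
std::vector<int> buildNext(const std::string& pattern) {
    const int m = static_cast<int>(pattern.size());
    std::vector<int> next(m + 1, 0);
    next[0] = -1;
    int j = 1, k = 0;  // invariant: k == next[j]
    while (j < m) {
        if (pattern[j] == pattern[k]) {
            next[j + 1] = k + 1;  // the current border grows by one character
            ++j;
            ++k;
        } else if (k == 0) {
            next[j + 1] = 0;      // no border ends at position j
            ++j;
        } else {
            k = next[k];          // fall back to the next shorter border
        }
    }
    return next;
}
```

For the pattern "abcabd" this yields -1, 0, 0, 0, 1, 2, 0: a mismatch at index 5 (the 'd') resumes at pattern index 2, because the prefix "ab" has already been matched.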

Code design
① Header file "Dstring.h": defines the dynamic string struct and the following functions:

void Initiate(Dstring *S, int mlen, char *str)//Initialize with capacity mlen, storing the string str
bool Insert(Dstring *S, int pos, Dstring T)//Insert T at position pos
bool Delete(Dstring *S, int pos, int len)//Delete len characters starting at position pos
bool Substring(Dstring *S, int pos, int len, Dstring *T)//Copy the substring of length len starting at pos into T
void Destroy(Dstring *S)//Destroy the string
void Dstring_print(Dstring *S)//Print the string

② Main file "main.cpp": uses the finished Dstring to implement Brute-Force and KMP and compares them

int Brute_Force_Match(Dstring S, Dstring T, int &cnt)//Brute-force matching
void getNext(Dstring T, int nxt[], int &cnt)//Build the next array for the pattern string
int KMP(Dstring S, Dstring T, int nxt[], int &cnt)//KMP matching

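③ Comparing the comparison counts of the two algorithms
Key points:

At the start of each match, a counter cnt is defined and set to zero; cnt is incremented every time the test if (S.str[i] == T.str[j]) is evaluated.
The counter is passed to the matching functions by reference instead of being their return value.
The comparisons KMP spends building the next array are counted as well (the textbook example does not count them).
One key difference from the textbook implementation: with main string S and pattern string T, when T occurs several times in S, the function returns at the first successful match instead of continuing to the end.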
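
A minimal sketch of the counting scheme described above, written against std::string rather than Dstring. The names countedEqual and bruteForceFirst are hypothetical. The point it illustrates is that the counter is incremented before the comparison: an expression of the form `cnt++ && a == b` short-circuits while cnt is still 0 and silently skips the first comparison.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts one character comparison, then performs it.
inline bool countedEqual(char a, char b, int& cnt) {
    ++cnt;
    return a == b;
}

// Brute-force search that returns the position of the first match (or -1)
// and reports the number of comparisons through the reference parameter cnt.
int bruteForceFirst(const std::string& s, const std::string& t, int& cnt) {
    cnt = 0;
    std::size_t i = 0, j = 0;
    while (i < s.size() && j < t.size()) {
        if (countedEqual(s[i], t[j], cnt)) {
            ++i;
            ++j;
        } else {
            i = i - j + 1;  // restart one position after the previous start
            j = 0;
        }
    }
    return j == t.size() ? static_cast<int>(i - t.size()) : -1;
}
```

Because the loop stops as soon as j reaches the pattern length, the first occurrence is returned, matching the design choice listed above.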

Program code
①Dstring.h

#ifndef DSTRING_H_INCLUDED
#define DSTRING_H_INCLUDED
/*
Index of string starts from 0.
*/
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <stdexcept>
#include <sstream>
using namespace std;
typedef struct
{
    char *str;
    int maxLength;//Maximum capacity.
    int size;//The Number of Characters Dstring has now.
}Dstring;
void Initiate(Dstring *S, int mlen, char *str)//Initiate Dstring with capacity mlen and contents str.
{
    int len = strlen(str);
    if(mlen < len) mlen = len;//Never allocate less than the string needs.
    S->str = (char *)malloc(sizeof(char)*mlen);//Allocate memory.
    S->maxLength = mlen;
    S->size = len;
    for(int i = 0; i < len; i++)
    {
        S->str[i] = str[i];
    }
}
bool Insert(Dstring *S, int pos, Dstring T)//Insert Dstring T at the pos-th position of S.
{
    if(pos < 0 || pos > S->size)//Illegal pos parameter.
    {
        ostringstream s;
        s<<"Illegal pos parameter."<<endl;
        throw invalid_argument(s.str());
    }
    if(S->size + T.size > S->maxLength)//Request more memory to store the new string.
    {
        char *p = (char *)realloc(S->str, (S->size + T.size)*sizeof(char));
        if(p == NULL)
        {
            ostringstream s;
            s<<"System has run out of RAM."<<endl;
            throw runtime_error(s.str());
        }
        S->str = p;//realloc may move the block, so keep the returned pointer.
        S->maxLength = S->size + T.size;
    }
    for(int i = S->size - 1; i >= pos; i--)//Move substring(pos, size-1) T.size units toward the end.
        S->str[i+T.size] = S->str[i];
    for(int i = 0; i < T.size; i++)//Insert characters 1 by 1.
        S->str[pos+i] = T.str[i];
    S->size += T.size;
    return true;
}
bool Delete(Dstring *S, int pos, int len)//Delete len units from pos to pos+len.
{
    if(S->size <= 0)
    {
        ostringstream s;
        s<<"This string has already been empty."<<endl;
        throw invalid_argument(s.str());
        return false;
    }
    if(pos < 0 || len < 0 || pos + len > S->size)
    {
        ostringstream s;
        s<<"Illegal parameter : pos or len."<<endl;
        throw invalid_argument(s.str());
        return false;
    }
    else
    {
        for(int i = pos + len; i <= S->size-1; i++)
            S->str[i-len] = S->str[i];
        S->size -= len;
        return true;
    }
}
bool Substring(Dstring *S, int pos, int len, Dstring *T)//Get substring in S from position pos with length len, let T store it.
{
    if(pos < 0 || len < 0 || pos + len > S->size)
    {
        ostringstream s;
        s<<"Illegal parameter : pos or len."<<endl;
        throw invalid_argument(s.str());
        return false;
    }
    else
    {
        for(int i = 0; i < len; i++)
            T->str[i] = S->str[pos+i];
        T->size = len;
        return true;
    }
}
void Destroy(Dstring *S)
{
    free(S->str);
    S->size = 0;
    S->maxLength = 0;
}
void Dstring_print(Dstring *S)//Output.
{
    int len = S->size;
    for(int i = 0; i < len; i++)
    {
        printf("%c",S->str[i]);
    }
    printf("\n");
}
#endif // DSTRING_H_INCLUDED

②main.cpp

#include <iostream>
#include <cstdlib>
#include <cstring>
#include <stdexcept>
#include <sstream>
#include <cstdio>
#include "Dstring.h"
using namespace std;

//Pattern-Match--Brute-Force Algorithm
int Brute_Force_Match(Dstring S, Dstring T, int &cnt)
{
    int i, j, pos;
    i = 0; j = 0;
    int lens = S.size;
    int lent = T.size;
    while(i < lens && j < lent)
    {
        cnt++;//Count first: writing cnt++ && ... would skip the comparison while cnt is still 0.
        if(S.str[i] == T.str[j])
        {
            i++;j++;
        }
        else
        {
            i = i - j + 1;//Restart one position after the previous start.
            j = 0;
        }
    }
    if(j == T.size) pos = i - T.size;
    else pos = -1;
    return pos;
}

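//Pattern-Matching--KMP Algorithm
void getNext(Dstring T, int nxt[], int &cnt)
{
    int j = 1, k = 0;//Invariant: k == nxt[j].
    nxt[0] = -1;
    nxt[1] = 0;
    while(j < T.size)
    {
        cnt++;//Count first: writing cnt++ && ... would skip the comparison while cnt is still 0.
        if(T.str[j] == T.str[k])
        {
            nxt[j+1] = k + 1;
            j++;k++;
        }
        else if(k == 0)
        {
            nxt[j+1] = 0;
            j++;
        }
        else k = nxt[k];//Fall back to the next shorter border.
    }
}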
int KMP(Dstring S, Dstring T, int nxt[], int &cnt)
{
    getNext(T, nxt, cnt);
    int i = 0, j = 0;
    while(i < S.size && j < T.size)//Stops at the first full match of the pattern.
    {
        cnt++;//Count first: writing cnt++ && ... would skip the comparison while cnt is still 0.
        if(S.str[i] == T.str[j])
        {
            i++;j++;
        }
        else if(j == 0)i++;
        else j = nxt[j];//Shift the pattern; i never moves backward.
    }
    int pos;
    if(j == T.size)pos = i - T.size;
    else pos = -1;
    return pos;
}
int nxt[200];//Large enough for patterns of up to 199 characters.
int main()
{
    freopen("in.txt","r",stdin);
    Dstring a, b;
    int n, len, cnt1, cnt2;
    char *c = (char *)malloc(sizeof(char)*200);
    scanf("%d",&n);
    getchar();
    for(int index = 1; index <= n; index++)
    {
        scanf("%199s",c);//Main string; the width keeps the read inside the buffer.
        len = strlen(c) + 1;
        Initiate(&a, len, c);
        cout<<endl;
        scanf("%199s",c);//Pattern string.
        len = strlen(c) + 1;
        Initiate(&b, len, c);
        cnt1 = cnt2 = 0;
        int pos1 = Brute_Force_Match(a, b, cnt1);
        int pos2 = KMP(a, b, nxt, cnt2);
        printf("Test case #%d:\n",index);
        //printf("Original String: ");
        //Dstring_print(&a);
        //printf("Patten String: ");
        //Dstring_print(&b);
        printf("Using BF Algorithm: %d , comparing %d times.\n", pos1, cnt1);
        printf("Using KMP Algorithm: %d , comparing %d times.\n", pos2, cnt2);
        Destroy(&a);//Release both strings before the next test case.
        Destroy(&b);
    }
    free(c);
    return 0;
}

### Test cases and results
Data: randomly generated small-scale data

13
abcdefg
hijk
abcdefg
abcdefg
abcdefg
efg
abcabc
abc
cdacacac
caca
cccc
c
ccdd
cd
fkajfjkkellfjkbnmffefilckajaafinncme
ellfjkbnmffefi
dhjelkd
dwjidowwkjdnkjbja
afewfefefecdfeffgthyttrwedcdfefsrfaffcdfefsggrdg
cd
aaaaaaaa
aaaab
cddcdc
abcde
fjaeislfhakjklfeaufjkejfujfehfjeukfheyfefgejhfefdhwdhwhdwhdadwhkjdhwjadhwkjadhwjkadhjkwahdjkwahdjaaaaaaaaaaaa
whkjdhwjadhwkj

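Results:
The outputs below come from runs in which the counter was written as cnt++ && ... inside the condition. Because cnt++ evaluates to 0 on the first call, the first comparison of a run is skipped and a match at position 0 is missed: test cases #2, #4 and #6 report -1, 3 and 1 instead of 0. In the second set of outputs KMP does find position 0 in #2 and #4, because its counter is already nonzero after the next array has been built.
Without counting the comparisons used to build the next array: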


Test case #1:
Using BF Algorithm: -1 , comparing 7 times.
Using KMP Algorithm: -1 , comparing 7 times.

Test case #2:
Using BF Algorithm: -1 , comparing 7 times.
Using KMP Algorithm: -1 , comparing 7 times.

Test case #3:
Using BF Algorithm: 4 , comparing 7 times.
Using KMP Algorithm: 4 , comparing 7 times.

Test case #4:
Using BF Algorithm: 3 , comparing 6 times.
Using KMP Algorithm: 3 , comparing 6 times.

Test case #5:
Using BF Algorithm: 3 , comparing 7 times.
Using KMP Algorithm: 3 , comparing 7 times.

Test case #6:
Using BF Algorithm: 1 , comparing 2 times.
Using KMP Algorithm: 1 , comparing 2 times.

Test case #7:
Using BF Algorithm: 1 , comparing 3 times.
Using KMP Algorithm: 1 , comparing 3 times.

Test case #8:
Using BF Algorithm: 8 , comparing 22 times.
Using KMP Algorithm: 8 , comparing 22 times.

Test case #9:
Using BF Algorithm: -1 , comparing 7 times.
Using KMP Algorithm: -1 , comparing 7 times.

Test case #10:
Using BF Algorithm: 10 , comparing 12 times.
Using KMP Algorithm: 10 , comparing 12 times.

Test case #11:
Using BF Algorithm: -1 , comparing 20 times.
Using KMP Algorithm: -1 , comparing 11 times.

Test case #12:
Using BF Algorithm: -1 , comparing 6 times.
Using KMP Algorithm: -1 , comparing 6 times.

Test case #13:
Using BF Algorithm: 61 , comparing 80 times.
Using KMP Algorithm: 61 , comparing 78 times.

Process returned 0 (0x0)   execution time : 0.054 s
Press any key to continue.

After also counting the comparisons used to build the next array:


Test case #1:
Using BF Algorithm: -1 , comparing 7 times.
Using KMP Algorithm: -1 , comparing 10 times.

Test case #2:
Using BF Algorithm: -1 , comparing 7 times.
Using KMP Algorithm: 0 , comparing 13 times.

Test case #3:
Using BF Algorithm: 4 , comparing 7 times.
Using KMP Algorithm: 4 , comparing 9 times.

Test case #4:
Using BF Algorithm: 3 , comparing 6 times.
Using KMP Algorithm: 0 , comparing 5 times.

Test case #5:
Using BF Algorithm: 3 , comparing 7 times.
Using KMP Algorithm: 3 , comparing 11 times.

Test case #6:
Using BF Algorithm: 1 , comparing 2 times.
Using KMP Algorithm: 1 , comparing 2 times.

Test case #7:
Using BF Algorithm: 1 , comparing 3 times.
Using KMP Algorithm: 1 , comparing 5 times.

Test case #8:
Using BF Algorithm: 8 , comparing 22 times.
Using KMP Algorithm: 8 , comparing 36 times.

Test case #9:
Using BF Algorithm: -1 , comparing 7 times.
Using KMP Algorithm: -1 , comparing 26 times.

Test case #10:
Using BF Algorithm: 10 , comparing 12 times.
Using KMP Algorithm: 10 , comparing 13 times.

Test case #11:
Using BF Algorithm: -1 , comparing 20 times.
Using KMP Algorithm: -1 , comparing 15 times.

Test case #12:
Using BF Algorithm: -1 , comparing 6 times.
Using KMP Algorithm: -1 , comparing 10 times.

Test case #13:
Using BF Algorithm: 61 , comparing 80 times.
Using KMP Algorithm: 61 , comparing 93 times.

Process returned 0 (0x0)   execution time : 0.043 s
Press any key to continue.

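On small-scale data the two algorithms differ very little.
Several groups of randomly generated large-scale data were then tested, where

in test14 the main string and pattern string are drawn from 26 lowercase letters
in test15 the main string and pattern string are drawn from 6 lowercase letters
in test16 the main string and pattern string are drawn from 2 lowercase letters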

Test case #14:
Original String Size: 100000
Patten String Size: 30720
Using BF Algorithm: 63136 , comparing 96413 times.
Using KMP Algorithm: 63136 , comparing 128248 times.

Test case #15:
Original String Size: 100000
Patten String Size: 54
Using BF Algorithm: 91754 , comparing 110191 times.
Using KMP Algorithm: 91754 , comparing 107137 times.

Test case #16:
Original String Size: 100000
Patten String Size: 2048
Using BF Algorithm: 48173 , comparing 97859 times.
Using KMP Algorithm: 48173 , comparing 67859 times.

Process returned 0 (0x0)   execution time : 0.320 s
Press any key to continue.

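The conclusion is that in small-scale tests the cost of building the next array is relatively large and cannot be ignored; once it is counted, KMP may make more comparisons than brute force. The advantage of KMP only shows clearly in larger tests (where the main string is much longer than the pattern string) and when the main string and pattern string are drawn from a small alphabet.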
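
To make the small-alphabet effect concrete, here is a hedged sketch that counts comparisons for both algorithms on std::string inputs. The function names are hypothetical and the code is a simplified stand-in for the Dstring version, with KMP's count including the comparisons spent building the next array.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Comparisons made by brute-force search (stops at the first match).
long long bruteForceComparisons(const std::string& s, const std::string& t) {
    long long cnt = 0;
    std::size_t i = 0, j = 0;
    while (i < s.size() && j < t.size()) {
        ++cnt;
        if (s[i] == t[j]) { ++i; ++j; }
        else { i = i - j + 1; j = 0; }
    }
    return cnt;
}

// Comparisons made by KMP, including those used to build the next array.
long long kmpComparisons(const std::string& s, const std::string& t) {
    long long cnt = 0;
    const int m = static_cast<int>(t.size());
    std::vector<int> next(m + 1, 0);
    next[0] = -1;
    int j = 1, k = 0;
    while (j < m) {                      // build the next array
        ++cnt;
        if (t[j] == t[k]) { next[j + 1] = k + 1; ++j; ++k; }
        else if (k == 0) { next[j + 1] = 0; ++j; }
        else { k = next[k]; }
    }
    std::size_t i = 0;
    int q = 0;
    while (i < s.size() && q < m) {      // search; i never moves backward
        ++cnt;
        if (s[i] == t[q]) { ++i; ++q; }
        else if (q == 0) { ++i; }
        else { q = next[q]; }
    }
    return cnt;
}
```

With a main string of 1000 'a' characters and a pattern of 49 'a' characters followed by 'b', brute force re-reads almost the whole pattern at every start position, so its count grows roughly with the product of the two lengths, while the KMP count stays within twice their sum.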

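### Differences between the dynamic-array design and a static one

Extensibility: the biggest advantage of building the string on a dynamic array is that its capacity can grow automatically, so insertion is far less constrained; with a static array, the string has a fixed maximum size that cannot be changed.
Code complexity: a dynamic-array design must handle memory allocation and timely release, whereas a static design is much simpler: declare a fixed size and nothing else is needed.
When the program is certain to run at a small, fairly fixed scale, a static array can be used; otherwise the dynamic-array design should be used.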
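
As an illustration of the automatic growth mentioned above, the following sketch doubles the capacity whenever more room is needed. GrowBuf, reserveAtLeast and appendChars are hypothetical names that only mirror the layout of Dstring; the doubling policy is an assumption and differs from Insert in the listing, which grows to exactly the required size.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical growable buffer mirroring the article's Dstring layout.
struct GrowBuf {
    char* data;
    int capacity;
    int size;
};

// Ensures room for at least `needed` characters. Capacity doubles, so a long
// run of appends triggers only a logarithmic number of reallocations.
// Returns false and leaves the buffer untouched if allocation fails.
bool reserveAtLeast(GrowBuf* b, int needed) {
    if (needed <= b->capacity) return true;
    int newCap = b->capacity > 0 ? b->capacity : 1;
    while (newCap < needed) newCap *= 2;
    char* p = static_cast<char*>(std::realloc(b->data, newCap));
    if (p == nullptr) return false;
    b->data = p;            // realloc may move the block
    b->capacity = newCap;
    return true;
}

// Appends len characters from src, growing the buffer when needed.
bool appendChars(GrowBuf* b, const char* src, int len) {
    if (!reserveAtLeast(b, b->size + len)) return false;
    std::memcpy(b->data + b->size, src, len);
    b->size += len;
    return true;
}
```

A buffer starts as {nullptr, 0, 0}, since realloc on a null pointer behaves like malloc, and the caller releases data with std::free when done. A static array needs none of this bookkeeping, which is the trade-off described above.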
