【整理于2017年4月】
问题描述
编写程序比较暴力匹配算法和KMP算法在匹配字符串的时候的比较次数,使用动态数组的顺序存储结构
算法思想
暴力匹配算法(BruteForce)的做法是逐个字符串匹配,当有主串某字符和模板串首字符相等是,向下比较下一字符;当匹配到某个位置出现不同时,回到原来的匹配位置的下一位重新匹配,理论复杂度
O
(
m
n
)
O(mn)
O(mn),其中
m
m
m和
n
n
n分别是主串和模板串的规模。
KMP算法对模板串定义了next数组,意义在于,当出现匹配失败的情况时,模板串的匹配下标不是回到初始位置,而是回到 n e x t [ j ] next[j] next[j]位置继续向下匹配。从而节省了不必要的比较,同时保证不会错过某些位置。理论复杂度 O ( m + n ) O(m+n) O(m+n)
next数组的含义:对于下标j, n e x t [ j ] next[j] next[j]的含义是给出了0~j-1中的最长公共前后缀,从而,下一次匹配时,我们直接回到最长前缀的下一位继续匹配即可
next数组的求法:递推方式。在求出 n e x t [ j ] next[j] next[j]之后:
- 若 s t r [ j ] = = s t r [ k ] str[j]==str[k] str[j]==str[k] ,直接更新k值,即 n e x t [ j + 1 ] = k + 1 next[j+1]=k+1 next[j+1]=k+1
- 若不然,将这时的情形看成对前k个字符的匹配,置 k = n e x t [ k ] k=next[k] k=next[k](由递推性质,该值一定存在),进行下一次比较
代码设计
①"Dstring.h"头文件: 定义动态串结构体,并定义了以下函数:
- void Initiate(Dstring *S, int mlen, char *str)//初始化,长度为mlen,存储字符串str
- bool Insert(Dstring *S, int pos, Dstring T)//在pos位置之后插入T
- bool Delete(Dstring *S, int pos, int len)//在pos位置之后删除长度为len的字符串
- bool Substring(Dstring *S, int pos, int len, Dstring T)//取出从pos开始,长度为len的字符串,存在T中
- void Destroy(Dstring *S)//销毁字符串
- void Dstring_print(Dstring *S)//输出字符串
②"main.cpp"主文件: 利用写好的Dstring,实现Brute-Force和KMP并比较
- int Brute_Force_Match(Dstring S, Dstring T, int &cnt)//暴力匹配部分
- void getNext(Dstring T, int nxt[], int &cnt)//对模板串求next数组
- int KMP(Dstring S, Dstring T, int nxt[], int &cnt)//kmp匹配部分
③两种算法比较次数的对比
要点如下:
- 每次匹配开始时,定义计数器cnt,置零,在每次 i f ( S − > s t r [ i ] = = T . s t r [ j ] ) if(S->str[i] == T.str[j]) if(S−>str[i]==T.str[j])判断语句中加入cnt++,修改计数器的值
- 计数器以引用形式传入匹配函数中,不再作为函数的返回值
- 将KMP算法生成next数组时用到的匹配也计入其中(课本示例中未计入)
- 与教材实现的一点关键不同:假设主串S,模板串T,对于T在S中出现多次的测试情形,当第一次匹配成功时就返回。而不是继续匹配到最后
程序代码
①Dstring.h
#ifndef DSTRING_H_INCLUDED
#define DSTRING_H_INCLUDED
/*
Index of string starts from 0.
*/
#include <cstdio>
#include <cstring>
#include <stdexcept>
#include <sstream>
using namespace std;
typedef struct
{
char *str;
int maxLength;//Maximum capacity.
int size;//The Number of Characters Dstring has now.
}Dstring;
void Initiate(Dstring *S, int mlen, char *str)//Initiate Dstring with size = len and str..
{
S->str = (char *)malloc(sizeof(char)*mlen);//Apply memory.
S->maxLength = mlen;
S->size = strlen(str);
int len = S->size;
for(int i = 0; i < len; i++)
{
S->str[i] = str[i];
}
}
bool Insert(Dstring *S, int pos, Dstring T)//Insert Dstring T at the pos-th position of S.
{
if(pos < 0 || pos > S->size)//Illegal pos parameter.
{
ostringstream s;
s<<"Illegal pos parameter."<<endl;
throw invalid_argument(s.str());
return false;
}
char *p;
if(S->size + T.size > S->maxLength)//Apply more memory to store new string.
{
p = (char *)realloc(S->str, (S->size + T.size)*sizeof(char));
if(p == NULL)
{
ostringstream s;
s<<"System has run out of RAM."<<endl;
throw invalid_argument(s.str());
return false;
}
}
for(int i = S->size - 1; i >= pos; i--)//move substring(pos, size-1) T.size units forward.
S->str[i+T.size] = S->str[i];
for(int i = 0; i < T.size; i++)//Insert characters 1 by 1.
S->str[pos+i] = T.str[i];
S->size += T.size;
return true;
}
bool Delete(Dstring *S, int pos, int len)//Delete len units from pos to pos+len.
{
if(S->size <= 0)
{
ostringstream s;
s<<"This string has already been empty."<<endl;
throw invalid_argument(s.str());
return false;
}
if(pos < 0 || len < 0 || pos + len > S->size)
{
ostringstream s;
s<<"Illegal parameter : pos or len."<<endl;
throw invalid_argument(s.str());
return false;
}
else
{
for(int i = pos + len; i <= S->size-1; i++)
S->str[i-len] = S->str[i];
S->size -= len;
return true;
}
}
bool Substring(Dstring *S, int pos, int len, Dstring *T)//Get substring in S from position pos with length len, let T store it.
{
if(pos < 0 || len < 0 || pos + len > S->size)
{
ostringstream s;
s<<"Illegal parameter : pos or len."<<endl;
throw invalid_argument(s.str());
return false;
}
else
{
for(int i = 0; i < len; i++)
T->str[i] = S->str[pos+i];
T->size = len;
return true;
}
}
void Destroy(Dstring *S)
{
free(S->str);
S->size = 0;
S->maxLength = 0;
}
void Dstring_print(Dstring *S)//Output.
{
int len = S->size;
for(int i = 0; i < len; i++)
{
printf("%c",S->str[i]);
}
printf("\n");
}
#endif // DSTRING_H_INCLUDED
②main.cpp
#include <iostream>
#include <malloc.h>
#include <stdexcept>
#include <sstream>
#include <cstdio>
#include "Dstring.h"
using namespace std;
//Pattern-Match--Brutal Algorithm
int Brute_Force_Match(Dstring S, Dstring T, int &cnt)
{
int i, j, pos;
i = 0; j = 0;
int lens = S.size;
int lent = T.size;
while(i < lens && j < lent)
{
if(cnt++ && S.str[i] == T.str[j])
{
i++;j++;
}
else
{
i = i - j + 1;
j = 0;
}
}
if(j == T.size) pos = i - T.size;
else pos = -1;
return pos;
}
//Pattern-Matching--KMP Algorithm
void getNext(Dstring T, int nxt[], int &cnt)
{
int j = 1, k = 0;
nxt[0] = -1;
nxt[1] = 0;
while(j < T.size)
{
if(cnt++ && T.str[j] == T.str[k])
{
nxt[j+1] = k + 1;
j++;k++;
}
else if(k == 0)
{
nxt[j+1] = 0;
j++;
}
else k = nxt[k];
}
}
int KMP(Dstring S, Dstring T, int nxt[], int &cnt)
{
getNext(T, nxt, cnt);
int i = 0, j = 0;
while(i < S.size && j < S.size)
{
if(cnt++ && S.str[i] == T.str[j])
{
i++;j++;
if(j == T.size)break;
//add this to guarantee that algorithm return the position where pattern first appear.
}
else if(j == 0)i++;
else j = nxt[j];
}
int pos;
if(j == T.size)pos = i - T.size;
else pos = -1;
return pos;
}
int nxt[200];
int main()
{
freopen("in.txt","r",stdin);
Dstring a, b;
int n, len, cnt1, cnt2;
char *c = (char *)malloc(sizeof(char)*200);
scanf("%d",&n);
getchar();
for(int index = 1; index <= n; index++)
{
cin>>c;
len = strlen(c) + 1;
//cout<<"**"<<len<<endl;
Initiate(&a, len, c);
cout<<endl;
scanf("%s",c);
len = strlen(c) + 1;
Initiate(&b, len, c);
cnt1 = cnt2 = 0;
int pos1 = Brute_Force_Match(a, b, cnt1);
int pos2 = KMP(a, b, nxt, cnt2);
printf("Test case #%d:\n",index);
//printf("Original String: ");
//Dstring_print(&a);
//printf("Patten String: ");
//Dstring_print(&b);
printf("Using BF Algorithm: %d , comparing %d times.\n", pos1, cnt1);
printf("Using KMP Algorithm: %d , comparing %d times.\n", pos2, cnt2);
}
}
###测试样例与测试结果
数据:随机生成的小规模数据
13
abcdefg
hijk
abcdefg
abcdefg
abcdefg
efg
abcabc
abc
cdacacac
caca
cccc
c
ccdd
cd
fkajfjkkellfjkbnmffefilckajaafinncme
ellfjkbnmffefi
dhjelkd
dwjidowwkjdnkjbja
afewfefefecdfeffgthyttrwedcdfefsrfaffcdfefsggrdg
cd
aaaaaaaa
aaaab
cddcdc
abcde
fjaeislfhakjklfeaufjkejfujfehfjeukfheyfefgejhfefdhwdhwhdwhdadwhkjdhwjadhwkjadhwjkadhjkwahdjkwahdjaaaaaaaaaaaa
whkjdhwjadhwkj
结果:
在不考虑next数组用到的匹配的时候
Test case #1:
Using BF Algorithm: -1 , comparing 7 times.
Using KMP Algorithm: -1 , comparing 7 times.
Test case #2:
Using BF Algorithm: -1 , comparing 7 times.
Using KMP Algorithm: -1 , comparing 7 times.
Test case #3:
Using BF Algorithm: 4 , comparing 7 times.
Using KMP Algorithm: 4 , comparing 7 times.
Test case #4:
Using BF Algorithm: 3 , comparing 6 times.
Using KMP Algorithm: 3 , comparing 6 times.
Test case #5:
Using BF Algorithm: 3 , comparing 7 times.
Using KMP Algorithm: 3 , comparing 7 times.
Test case #6:
Using BF Algorithm: 1 , comparing 2 times.
Using KMP Algorithm: 1 , comparing 2 times.
Test case #7:
Using BF Algorithm: 1 , comparing 3 times.
Using KMP Algorithm: 1 , comparing 3 times.
Test case #8:
Using BF Algorithm: 8 , comparing 22 times.
Using KMP Algorithm: 8 , comparing 22 times.
Test case #9:
Using BF Algorithm: -1 , comparing 7 times.
Using KMP Algorithm: -1 , comparing 7 times.
Test case #10:
Using BF Algorithm: 10 , comparing 12 times.
Using KMP Algorithm: 10 , comparing 12 times.
Test case #11:
Using BF Algorithm: -1 , comparing 20 times.
Using KMP Algorithm: -1 , comparing 11 times.
Test case #12:
Using BF Algorithm: -1 , comparing 6 times.
Using KMP Algorithm: -1 , comparing 6 times.
Test case #13:
Using BF Algorithm: 61 , comparing 80 times.
Using KMP Algorithm: 61 , comparing 78 times.
Process returned 0 (0x0) execution time : 0.054 s
Press any key to continue.
在考虑了next的匹配次数之后
Test case #1:
Using BF Algorithm: -1 , comparing 7 times.
Using KMP Algorithm: -1 , comparing 10 times.
Test case #2:
Using BF Algorithm: -1 , comparing 7 times.
Using KMP Algorithm: 0 , comparing 13 times.
Test case #3:
Using BF Algorithm: 4 , comparing 7 times.
Using KMP Algorithm: 4 , comparing 9 times.
Test case #4:
Using BF Algorithm: 3 , comparing 6 times.
Using KMP Algorithm: 0 , comparing 5 times.
Test case #5:
Using BF Algorithm: 3 , comparing 7 times.
Using KMP Algorithm: 3 , comparing 11 times.
Test case #6:
Using BF Algorithm: 1 , comparing 2 times.
Using KMP Algorithm: 1 , comparing 2 times.
Test case #7:
Using BF Algorithm: 1 , comparing 3 times.
Using KMP Algorithm: 1 , comparing 5 times.
Test case #8:
Using BF Algorithm: 8 , comparing 22 times.
Using KMP Algorithm: 8 , comparing 36 times.
Test case #9:
Using BF Algorithm: -1 , comparing 7 times.
Using KMP Algorithm: -1 , comparing 26 times.
Test case #10:
Using BF Algorithm: 10 , comparing 12 times.
Using KMP Algorithm: 10 , comparing 13 times.
Test case #11:
Using BF Algorithm: -1 , comparing 20 times.
Using KMP Algorithm: -1 , comparing 15 times.
Test case #12:
Using BF Algorithm: -1 , comparing 6 times.
Using KMP Algorithm: -1 , comparing 10 times.
Test case #13:
Using BF Algorithm: 61 , comparing 80 times.
Using KMP Algorithm: 61 , comparing 93 times.
Process returned 0 (0x0) execution time : 0.043 s
Press any key to continue.
在小规模数据下,两种算法差别并不大
随机生成了几组大规模数据进行测试,其中
- test14中的主串和模板串由26个小写字母组成
- test15中的主串和模板串由6个小写字母组成
- test16中的主串和模板串由2个小写字母组成
Test case #14:
Original String Size: 100000
Patten String Size: 30720
Using BF Algorithm: 63136 , comparing 96413 times.
Using KMP Algorithm: 63136 , comparing 128248 times.
Test case #15:
Original String Size: 100000
Patten String Size: 54
Using BF Algorithm: 91754 , comparing 110191 times.
Using KMP Algorithm: 91754 , comparing 107137 times.
Test case #16:
Original String Size: 100000
Patten String Size: 2048
Using BF Algorithm: 48173 , comparing 97859 times.
Using KMP Algorithm: 48173 , comparing 67859 times.
Process returned 0 (0x0) execution time : 0.320 s
Press any key to continue.
结论是在小规模测试中,求next数组带来的开销相对较大不可忽视,计入这一部分时,kmp算法的比较次数可能更多,在更大规模测试下(主串规模远远大于模式串),同时主串和模式串的元素种类较少时,KMP的效率才会比较明显的体现出来
###动态数组设计方式和静态的区别
- 可扩展性:使用动态数组设计串,最大的好处是可以自动扩充字符串的规模,字符串的插入有更高的自由度,相比之下,静态数组的设计下,字符串规模将有不可更改的最大限制
- 代码编写:动态数组设计时要考虑内存申请以及及时释放,相比之下静态的设计就简单得多,直接声明一个固定大小即可,无需其他工作
- 当程序肯定运行在一个较小且比较固定的规模下时,可以使用静态,否则应该使用动态数组设计方式