The KMP Algorithm and POJ 3461

Latest recommended article published on 2020-03-28 23:36:49

steer_z

Category: Postgraduate Recommendation Coding Test Prep. Tags: C++, algorithms, data structures, strings

Article link: https://blog.csdn.net/qq_40931241/article/details/105109470


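I have been studying compiler principles recently. When I got to lexical analysis, I found an exercise in the Dragon Book about a rather odd algorithm that defines an equally odd failure function. I was completely lost at first, so I looked up what a failure function is and found that it is the famous next array from the KMP algorithm. With the postgraduate-recommendation coding test coming up soon, I decided to study KMP in depth.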

1. Introduction to the KMP Algorithm

Suppose the string pat has length m and the string text has length n.

The KMP algorithm is a string-matching algorithm that finds where one string occurs in another (the position of pat in text) in O(m+n) time.

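The key to KMP is the next array, which is simply the failure function stored as an array. It is determined entirely by pat and, under the definition given here, has the same length as pat. For each index, take the prefix of pat that ends at that index; the element is the length of the longest string that is both a proper prefix and a proper suffix of it. For example, for "ababa" the proper prefixes are {a, ab, aba, abab} and the proper suffixes are {a, ba, aba, baba}; their intersection is {a, aba}, and "aba" has length 3, so next[4] = 3. The other elements are computed the same way. Note that the code in this article stores the array shifted by one position, with my_next[0] = -1, so this value of 3 ends up in my_next[5] and the array holds m + 1 entries.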
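To make the definition concrete, here is a minimal brute-force sketch that computes these lengths directly from the prefix/suffix description, without any KMP machinery. The function name borderLengths is hypothetical and not part of the original code; it is far slower than the real construction and is only meant for checking small examples by hand.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: for each index k, returns the length of the longest
// string that is both a proper prefix and a proper suffix of pat[0..k].
// Brute force, roughly O(m^3); for illustration only.
std::vector<int> borderLengths(const std::string& pat) {
    std::vector<int> result(pat.size(), 0);
    for (std::size_t k = 0; k < pat.size(); ++k) {
        // Try the longest candidate first. A proper prefix of pat[0..k]
        // has length at most k, so len starts at k and counts down to 1.
        for (std::size_t len = k; len >= 1; --len) {
            // Compare pat[0..len-1] with the last len characters of pat[0..k].
            if (pat.compare(0, len, pat, k + 1 - len, len) == 0) {
                result[k] = static_cast<int>(len);
                break;
            }
        }
        // A proper border is always strictly shorter than the prefix itself.
        assert(static_cast<std::size_t>(result[k]) <= k);
    }
    return result;
}
```

For "ababa" this yields 0, 0, 1, 2, 3, which agrees with the value 3 derived above for index 4.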

For the details of how KMP works, search for KMP on Zhihu; there are plenty of good resources, so I won't repeat them here.

The KMP code is as follows:

#include <iostream>
#include <stdio.h>
#include <cstring>
using namespace std;

char cpat[1000];
char ctext[10000];
int my_next[1000];
int cpat_length;
int ctext_length;

void cal_next() {
    my_next[0] = -1; // Note: my_next[0] must be initialized to -1.
    int i = 0,j = -1;
    // Only the pattern is scanned here, so i is bounded by cpat_length,
    // not by ctext_length.
    while (i < cpat_length) {
        if (j == -1 || cpat[i] == cpat[j]) {
            i++;
            j++;
            my_next[i] = j;
        }
        else {
            j = my_next[j];
        }
    }
}
int search() {
    int i = 0,j = 0;
    while (j < cpat_length && i < ctext_length) {
        if (j == -1 || cpat[j] == ctext[i]) {
            i++;
            j++;
        } 
        else {
            j = my_next[j];
        }
    }
    if (j == cpat_length) {
        return i - j;
    }
    return -1;
}

int main(int argc, char const *argv[])
{
    scanf("%s%s",cpat,ctext);
    cpat_length = strlen(cpat);
    ctext_length = strlen(ctext);
    cal_next();
    printf("%d\n",search());
    return 0;
}

Problems encountered:
At first I used the C++ string class for pat and text, and cal_next() looked like this:

void cal_next() {
	my_next[0] = -1;
	int i = 0,j = -1;
	while (i < pat.length() && j < pat.length()) {
		if (j == -1 || pat[i] == pat[j]) {
			i++;
			j++;
			my_next[i] = j;
		}
		else {
			j = my_next[j];
		}
	}
}

The return type of the C++ string length() method

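The code above looks logically sound, but when run, the body of the while loop never executes: my_next[0] is -1 and every other element stays 0.
The reason is that length() returns an unsigned integer type while j = -1. When a signed value is compared with an unsigned one, the signed value is converted to unsigned, so -1 becomes a huge number that is greater than the length. The loop condition is therefore false from the start and the body never runs.
The correct version is: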

void cal_next() {
	my_next[0] = -1;
	int i = 0,j = -1;
	int pat_length = pat.length(); // convert the unsigned length to a signed int first
	while (i < pat_length && j < pat_length) {
		if (j == -1 || pat[i] == pat[j]) {
			i++;
			j++;
			my_next[i] = j;
		}
		else {
			j = my_next[j];
		}
	}
}

The same applies to the size() function of vector.
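The pitfall can be reproduced in isolation. The sketch below uses two hypothetical helper functions (neither appears in the original code): the first spells out the conversion that the compiler applies implicitly on typical platforms when an int is compared with the result of length(), and the second converts the length to int before comparing.

```cpp
#include <cassert>
#include <string>

// Mimics what happens implicitly in `j < pat.length()`:
// j is converted to the unsigned size type before the comparison.
bool lessAfterUnsignedConversion(int j, const std::string& pat) {
    return static_cast<std::string::size_type>(j) < pat.length();
}

// Converts the length to int first, so that -1 stays -1.
bool lessAsSigned(int j, const std::string& pat) {
    const int len = static_cast<int>(pat.length());
    assert(len >= 0);  // holds as long as the length fits in an int
    return j < len;
}
```

With j = -1 and pat = "abc", the first function returns false and the second returns true, which is exactly the difference between the loop that never runs and the loop that works.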

2. Solution to POJ 3461

Original problem:
Description

The French author Georges Perec (1936–1982) once wrote a book, La disparition, without the letter ‘e’. He was a member of the Oulipo group. A quote from the book:

Tout avait l’air normal, mais tout s’affirmait faux. Tout avait l’air normal, d’abord, puis surgissait l’inhumain, l’affolant. Il aurait voulu savoir où s’articulait l’association qui l’unissait au roman : sur son tapis, assaillant à tout instant son imagination, l’intuition d’un tabou, la vision d’un mal obscur, d’un quoi vacant, d’un non-dit : la vision, l’avision d’un oubli commandant tout, où s’abolissait la raison : tout avait l’air normal mais…

Perec would probably have scored high (or rather, low) in the following contest. People are asked to write a perhaps even meaningful text on some subject with as few occurrences of a given “word” as possible. Our task is to provide the jury with a program that counts these occurrences, in order to obtain a ranking of the competitors. These competitors often write very long texts with nonsense meaning; a sequence of 500,000 consecutive ‘T’s is not unusual. And they never use spaces.

So we want to quickly find out how often a word, i.e., a given string, occurs in a text. More formally: given the alphabet {‘A’, ‘B’, ‘C’, …, ‘Z’} and two finite strings over that alphabet, a word W and a text T, count the number of occurrences of W in T. All the consecutive characters of W must exactly match consecutive characters of T. Occurrences may overlap.

Input

The first line of the input file contains a single number: the number of test cases to follow. Each test case has the following format:

One line with the word W, a string over {‘A’, ‘B’, ‘C’, …, ‘Z’}, with 1 ≤ |W| ≤ 10,000 (here |W| denotes the length of the string W).
One line with the text T, a string over {‘A’, ‘B’, ‘C’, …, ‘Z’}, with |W| ≤ |T| ≤ 1,000,000.

Output

For every test case in the input file, the output should contain a single number, on a single line: the number of occurrences of the word W in the text T.

Sample Input

3
BAPC
BAPC
AZA
AZAZAZA
VERDI
AVERDXIVYERDIAN

Sample Output

1
3
0

Problem summary:

Count how many times the word W, made up of uppercase English letters, occurs in the text T.

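So this is a straightforward application of KMP. The crucial point is what to do once a full match has been found: if j indexes into W, which element of W should j point to next? Getting this wrong leads to a Time Limit Exceeded verdict.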

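The answer is simple and follows the same principle as KMP itself. After a full match we already know something about the matched part of T, so we use the prefix/suffix information stored in the next array to keep the time complexity down. Concretely, i (the index into T) stays where it is and j = next[j], exactly as in the core of KMP. See the complete code below: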

#include <iostream>
#include<cstdio>
#include <cstring>
using namespace std;

char cpat[100020];
char ctext[1000020];
int my_next[100020];
int pat_length;
int text_length;

void cal_next() {
    int i = 0, j = -1;
    my_next[0] = -1;
    while (i < pat_length && j < pat_length) {
        if (j == -1 || cpat[i] == cpat[j]) {
            i++;
            j++;
            my_next[i] = j;
        }
        else {
            j = my_next[j];
        }
    }
}
int countTimes() {
    int i = 0,j = 0;
    int words = 0;
    while (i < text_length && j < pat_length) {
        if (j == -1 || cpat[j] == ctext[i]) {
            i++;
            j++;
            if (j >= pat_length) {
                words++;
                j = my_next[j]; // important: this is the key step
            }   
        }
        else {
            j = my_next[j];
        }
    }
    return words;
}
int main(int argc, char const *argv[])
{
    int n;
    cin >> n;
    for (int i = 0;i < n;i++) {
        scanf("%s%s", cpat, ctext);
        pat_length = strlen(cpat);
        text_length = strlen(ctext); 
        cal_next();
        int res = countTimes();
        cout << res << endl;
    }
    return 0;
}
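The same counting logic can also be packaged as a single function over std::string, which makes it easy to check against the sample data. This is only a sketch under a few assumptions: the name countOccurrences is hypothetical, the next array is a std::vector sized from the pattern rather than a fixed global array, and it has not been submitted to POJ.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: counts (possibly overlapping) occurrences of pat in text.
int countOccurrences(const std::string& pat, const std::string& text) {
    assert(!pat.empty());
    const int m = static_cast<int>(pat.size());
    const int n = static_cast<int>(text.size());

    // next has m + 1 entries because next[m] is read after every full match.
    std::vector<int> next(m + 1, 0);
    next[0] = -1;
    int i = 0, j = -1;
    while (i < m) {
        if (j == -1 || pat[i] == pat[j]) {
            ++i;
            ++j;
            next[i] = j;
        } else {
            j = next[j];
        }
    }

    int count = 0;
    i = 0;
    j = 0;
    while (i < n) {
        if (j == -1 || pat[j] == text[i]) {
            ++i;
            ++j;
            if (j == m) {
                ++count;
                j = next[m];  // keep i, fall back in the pattern
            }
        } else {
            j = next[j];
        }
    }
    return count;
}
```

On the three sample cases it returns 1, 3 and 0, matching the sample output.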

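Problems encountered:

Runtime Error: I first wrote the solution as a class. The samples passed, but it kept failing with a runtime error and I could not work out why, so I moved the data structures into global variables and implemented KMP as two functions. When declaring those global arrays, make sure they are large enough for the input limits.
Time Limit Exceeded: if you have already handled the key issue described above and still time out, the likely cause is reading with cin. Replacing cin with scanf("%s%s", cpat, ctext); fixes it. Remember to add #include <cstdio>.

As someone whose first programming language was C++, I am really not familiar with the C library, but C I/O is indeed somewhat faster, so there is no way around it.

When using printf and scanf, remember to add #include <cstdio>; when using strlen to get the length of a char array, remember to add #include <cstring>.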
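For readers who would rather stay with iostream, a commonly used alternative is to turn off synchronization with C stdio before reading. The sketch below is an assumption on my part and not what the solution above does: the helper names are hypothetical, and I have not verified that it is fast enough for the POJ time limit. Once synchronization is off, cin/cout should not be mixed with scanf/printf in the same program, so the cin >> n plus scanf combination used above would have to go.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Call once at the start of main, before any input is read.
// After this, do not mix cin/cout with scanf/printf.
void speedUpIostream() {
    std::ios::sync_with_stdio(false);
    std::cin.tie(0);
}

// Hypothetical helper: reads the POJ 3461 input format
// (a case count, then one W and one T per case) from any stream.
std::vector<std::pair<std::string, std::string> > readCases(std::istream& in) {
    int n = 0;
    in >> n;
    assert(n >= 0);
    std::vector<std::pair<std::string, std::string> > cases;
    for (int k = 0; k < n; ++k) {
        std::string w, t;
        in >> w >> t;
        cases.push_back(std::make_pair(w, t));
    }
    return cases;
}

// Convenience wrapper for trying the parser on a string instead of stdin.
std::vector<std::pair<std::string, std::string> > parseCases(const std::string& input) {
    std::istringstream in(input);
    return readCases(in);
}
```

In a real submission, main would call speedUpIostream() first and then readCases(std::cin).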
