Building a Trie, and the KMP Algorithm

lsd&xql

Article link: https://blog.csdn.net/lsdstone/article/details/96509686

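Trie

Concept:

A trie, also called a prefix tree or word-lookup tree, is a tree structure that is often described as a variant of the hash tree. It is typically used to count, sort, and store large numbers of strings (though it is not limited to strings), which is why search engines often use it for word-frequency statistics. Its advantage is that it uses the common prefixes of strings to cut lookup time and avoid redundant character comparisons: looking up a key takes time proportional to the key's length, regardless of how many strings are stored.

A trie can store a large number of strings and find a given word quickly. Strings that share a prefix also share the nodes for that prefix, so the prefix is stored only once, although each node in the implementation below carries an array of 26 child pointers.

The figure below shows a trie:

[Figure: an example trie]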
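
To make the shared-prefix idea concrete, here is a minimal sketch that builds a trie and reports how many nodes it needed. The function name `count_trie_nodes` and the index-based node layout are illustrative assumptions, not part of the original article, and the input is assumed to contain only lowercase letters.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: builds a trie over lowercase words and returns the
// number of nodes it created, not counting the root.
std::size_t count_trie_nodes(const std::vector<std::string>& words) {
    // Each node is an array of 26 child indices; -1 means "no child".
    std::vector<std::array<int, 26>> nodes(1);
    nodes[0].fill(-1);
    for (const std::string& w : words) {
        int cur = 0;  // start every word at the root
        for (char ch : w) {
            int c = ch - 'a';  // assumes 'a'..'z' only
            if (nodes[cur][c] == -1) {
                // Record the index first, then append the new node.
                nodes[cur][c] = static_cast<int>(nodes.size());
                std::array<int, 26> fresh;
                fresh.fill(-1);
                nodes.push_back(fresh);
            }
            cur = nodes[cur][c];
        }
    }
    return nodes.size() - 1;
}
```

For the words app, apple, apply, and apt, which contain 16 characters in total, this returns 7, because the shared prefixes a, ap, and app are stored once.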

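Building the trie:

1. First, each node stores a counter for the current letter and an array of pointers to the next level, one pointer per letter.

The code:

#include <iostream>
#include <cstdlib> // malloc, used when new nodes are created
using namespace std;
const int maxn = 26; // alphabet size: lowercase 'a'..'z'
struct treenode {
	int count;            // number of inserted words whose path passes through this node
	treenode *next[maxn]; // next[c] is the child for letter 'a' + c, or NULL
} head;                   // the root, which stands for the empty prefix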

2. Then initialize the trie:

void init()
{
	head.count = 0;
	for (int i = 0; i < maxn; i++)
	{
		head.next[i] = NULL; // the root has no children yet
	}
}
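
As a side note, the global `head` has static storage duration, which C++ zero-initializes before `main` runs, so the init function is a safeguard there rather than a necessity. For nodes created elsewhere, default member initializers give the same guarantee. The sketch below is an assumed variant of the article's node; `Node` and `is_empty_node` are illustrative names.

```cpp
// Variant of the article's node: default member initializers zero every
// field, so no separate init function is needed.
struct Node {
    int count = 0;
    Node* next[26] = {};  // value-initialized: all 26 pointers are null
};

// Returns true when a node has a zero counter and no children.
bool is_empty_node(const Node& n) {
    if (n.count != 0) return false;
    for (const Node* child : n.next) {
        if (child != nullptr) return false;
    }
    return true;
}
```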

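3. Insertion: to insert a string s into the existing trie, start at the root and walk through s, following next to the branch for each character and creating the branch if it does not exist. The counter of every node reached is incremented, so the node for the last character is marked as well.

treenode *createnode()
{
	// malloc does not zero memory, so every field is set explicitly
	treenode *newnode = (treenode *)malloc(sizeof(treenode));
	newnode->count = 0;
	for (int i = 0; i < maxn; i++)
	{
		newnode->next[i] = NULL;
	}
	return newnode;
}

void insert(const char *s)
{
	treenode *t = &head;
	int i = 0;
	while (s[i])
	{
		int temp = s[i] - 'a'; // child index; assumes lowercase letters only
		if (t->next[temp] == NULL) t->next[temp] = createnode();
		t = t->next[temp];
		t->count++; // count the node just reached, so the last node is marked too
		i++;
	}
}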
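
The walk-then-create behavior of insertion can be observed by counting how many nodes each call has to allocate. This is a minimal sketch under the same lowercase-only assumption; `Node` and `insert_counting_new` are illustrative names, and the nodes are never freed here to keep the example short.

```cpp
#include <string>

struct Node {
    int count = 0;        // words whose path passes through this node
    Node* next[26] = {};  // children, all null at first
};

// Inserts `word` below `root` and returns how many nodes had to be created.
int insert_counting_new(Node& root, const std::string& word) {
    Node* t = &root;
    int created = 0;
    for (char ch : word) {
        int c = ch - 'a';  // assumes 'a'..'z' only
        if (t->next[c] == nullptr) {
            t->next[c] = new Node();  // missing branch: create it
            ++created;
        }
        t = t->next[c];
        t->count++;  // one more word passes through this node
    }
    return created;
}
```

Inserting apple into an empty trie creates 5 nodes. Inserting apart afterwards creates only 3, because the nodes for a and ap already exist, and inserting app after that creates none.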

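Search: to check whether the trie contains a string s, start at the root and walk through s, following next to the matching branch. If a branch is missing, return false; if the walk reaches the end of s and the final node has a nonzero count, return true. Because count is incremented along the whole path of every inserted word (see the complete program below), this test also succeeds when s is a proper prefix of an inserted word; telling the two cases apart would need a separate end-of-word marker.

bool search(const char *s)
{
	treenode *t = &head;
	int i = 0;
	while (s[i])
	{
		int temp = s[i] - 'a';
		if (t->next[temp] == NULL) return false; // no branch for this character
		t = t->next[temp];
		i++;
	}
	// a nonzero count means at least one inserted word reaches this node
	return t->count != 0;
}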
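
A count-based check accepts any string whose final node has a nonzero count, so it cannot separate a stored word from a prefix of one. One common way to separate them is an end-of-word flag. The following is a sketch under that assumption; the field `is_end` and the functions `insert_word`, `walk`, `has_prefix`, and `contains_word` are illustrative and do not appear in the original article.

```cpp
#include <string>

struct Node {
    int pass = 0;         // words whose path passes through this node
    bool is_end = false;  // true if some inserted word ends exactly here
    Node* next[26] = {};
};

void insert_word(Node& root, const std::string& w) {
    Node* t = &root;
    for (char ch : w) {
        int c = ch - 'a';  // assumes 'a'..'z' only
        if (t->next[c] == nullptr) t->next[c] = new Node();
        t = t->next[c];
        t->pass++;
    }
    t->is_end = true;  // mark the node where the word ends
}

// Follows `s` from the root; returns the node reached, or nullptr if a
// branch is missing.
const Node* walk(const Node& root, const std::string& s) {
    const Node* t = &root;
    for (char ch : s) {
        t = t->next[ch - 'a'];
        if (t == nullptr) return nullptr;
    }
    return t;
}

bool has_prefix(const Node& root, const std::string& s) {
    return walk(root, s) != nullptr;
}

bool contains_word(const Node& root, const std::string& s) {
    const Node* t = walk(root, s);
    return t != nullptr && t->is_end;
}
```

After inserting only apple, `has_prefix(root, "app")` is true while `contains_word(root, "app")` is false.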

The complete program below demonstrates these operations:

#include <cstdio>
#include <cstdlib> // malloc
using namespace std;
const int maxn = 26; // alphabet size: lowercase 'a'..'z'
struct treenode {
	int count;            // number of inserted words whose path passes through this node
	treenode *next[maxn]; // next[c] is the child for letter 'a' + c, or NULL
} head;

void init_tree()
{
	head.count = 0;
	for (int i = 0; i < maxn; i++)
	{
		head.next[i] = NULL;
	}
}

treenode *createnode()
{
	// malloc does not zero memory, so every field is set explicitly
	treenode *newnode = (treenode *)malloc(sizeof(treenode));
	newnode->count = 0;
	for (int i = 0; i < maxn; i++)
	{
		newnode->next[i] = NULL;
	}
	return newnode;
}

void insert(const char *s)
{
	treenode *t = &head;
	int i = 0;
	while (s[i])
	{
		int temp = s[i] - 'a'; // assumes lowercase letters only
		if (t->next[temp] == NULL) t->next[temp] = createnode();
		t = t->next[temp];
		t->count++; // each node on the path is counted once per word
		i++;
	}
}

bool search(const char *s)
{
	treenode *t = &head;
	int i = 0;
	while (s[i])
	{
		int temp = s[i] - 'a';
		if (t->next[temp] == NULL) return false;
		t = t->next[temp];
		i++;
	}
	return t->count != 0;
}

int main()
{
	init_tree();
	char s[1000];
	if (scanf("%999s", s) != 1) return 0; // read one word and insert it into the trie
	insert(s);
	if (search(s)) printf("Match found\n"); // then look the same word up in the trie
	return 0;
}

[Figure: console output of the program]

As the figure above shows, the lookup succeeds.
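
The demo allocates nodes with malloc and never releases them, which is harmless in a program that exits right away but matters if the trie is rebuilt repeatedly. A minimal cleanup sketch follows, assuming the same `treenode` layout; the function `free_children` is a hypothetical addition. It frees everything below a node but not the node itself, since the article's root is a global object rather than a heap allocation.

```cpp
#include <cstdlib>

struct treenode {
    int count;
    treenode* next[26];
};

// Recursively frees every node below `node` and returns how many nodes
// were released. `node` itself is left alone.
int free_children(treenode* node) {
    int freed = 0;
    for (int i = 0; i < 26; i++) {
        if (node->next[i] != NULL) {
            freed += free_children(node->next[i]);  // free the grandchildren first
            std::free(node->next[i]);
            node->next[i] = NULL;  // leave no dangling pointer behind
            freed++;
        }
    }
    return freed;
}
```

The recursion depth equals the length of the longest stored word, so it stays shallow for dictionary-style input.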

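Now consider a template problem:

Little Hi and Little Ho are good friends. Born into the information age, they have taken a great interest in programming and have agreed to help each other and move forward together as they learn.

One day they came across a dictionary, and Little Hi posed the classic question to Little Ho: "Little Ho, for every string I give you, can you find all the words in this dictionary that start with that string?"

The battle-hardened Little Ho replied: "Of course I can! Every time you give me a string, I just go through all the words in the dictionary one by one and check whether your string is a prefix of each word."

Little Hi laughed: "You are still too young! Suppose this dictionary has 100,000 words and I ask you 10,000 times. How long would that take you?"

Little Ho did the math in his head, looked at the piles of zeros, and suddenly felt he would spend the rest of his life on it...

Seeing Little Ho's embarrassment, Little Hi kept smiling: "Let me raise your knowledge level a bit. Do you know the data structure called a tree?"

Little Ho thought for a moment and said: "Yes, it is a basic data structure, just as described here!"

Little Hi nodded with satisfaction and said: "Then do you know how I can use a tree to represent the whole dictionary?"

Little Ho shook his head to show that he did not.

Hint 1: building the trie

"Look, now we have this tree. So if I give you the string ap, how do you find all the words that start with ap?" Little Hi quizzed Little Ho again.

"Um... go through all the words one by one?" Little Ho had still not let go of his original algorithm.

"Silly! Then the tree would have been built for nothing!" After scolding Little Ho, Little Hi went on: "Watch closely!"

Hint 2: how to use the trie

Hint 3: gather the statistics while building the trie!

"Now go and implement it in code!" said Little Hi.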

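Hint 1: building the trie
Little Hi drew on a piece of paper for a while, handed it to Little Ho, and said: "Look, what is the relationship between this tree and this dictionary?"
[Figure: the trie built from the dictionary]
Little Ho stared at the paper for a while and said: "I get it! Take the path from the root to any black node; if you join the letters along that path, you get a word in the dictionary!"

Little Hi said: "Then do you know how to build such a tree from a dictionary?"

"No idea!"

"I figured you wouldn't, so let me tell you." Little Hi put on a teacherly air and said: "Think of it this way: suppose I already have a dictionary and its tree, and I want to add a new word, apart. What should I do?"

"Let me think..." Little Ho fell into thought again: "First I should see how far I can already walk, right? From node 1 I follow the edge 'a' to node 2, then from node 2 I follow the edge 'p' to node 3, and then... there is nowhere left to go! At that point I need to add an edge labeled 'a' leaving node 3, the next letter of apart, so that I can keep going... and in the end it looks like this! Then I just mark the last node I reach as black."
[Figure: the trie after adding apart]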

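Input

The first line contains a positive integer n, the size of the dictionary. Each of the next n lines contains one word (not necessarily an English word; it could be a Martian word). Each word consists of at most 10 lowercase English letters. Identical words may appear, and they should be treated as different words. The next line contains a positive integer m, the number of Little Hi's queries. Each of the next m lines contains one string of at most 10 lowercase English letters, representing one query.
In 20% of the test data, n, m <= 10 and the dictionary's alphabet size is <= 2.
In 60% of the test data, n, m <= 1000 and the dictionary's alphabet size is <= 5.
In 100% of the test data, n, m <= 100000 and the dictionary's alphabet size is <= 26.
This problem is ranked by the amount of test data passed.

Output

For each of Little Hi's queries, output an integer Ans: the number of words in the dictionary that have the given string as a prefix.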

Sample Input

5
babaab
babbbaaaa
abba
aaaaabaa
babaababb
5
babb
baabaaa
bab
bb
bbabbaab

Sample Output

1
0
3
0
0

Problem analysis:

First insert every word into the trie, incrementing count at each node along its path. To answer a query, follow the next pointers along the query string and return the count of the node reached, or 0 if a branch is missing. Only the search function needs to be rewritten.

The accepted (AC) code:

#include <cstdio>
#include <cstdlib> // malloc
using namespace std;
const int maxn = 26; // alphabet size: lowercase 'a'..'z'
struct treenode {
	int count;            // number of dictionary words that pass through this node
	treenode *next[maxn]; // next[c] is the child for letter 'a' + c, or NULL
} head;

void init_tree()
{
	head.count = 0;
	for (int i = 0; i < maxn; i++)
	{
		head.next[i] = NULL;
	}
}

treenode *createnode()
{
	// malloc does not zero memory, so every field is set explicitly
	treenode *newnode = (treenode *)malloc(sizeof(treenode));
	newnode->count = 0;
	for (int i = 0; i < maxn; i++)
	{
		newnode->next[i] = NULL;
	}
	return newnode;
}

void insert(const char *s)
{
	treenode *t = &head;
	int i = 0;
	while (s[i])
	{
		int temp = s[i] - 'a';
		if (t->next[temp] == NULL) t->next[temp] = createnode();
		t = t->next[temp];
		t->count++; // duplicates count separately, as the problem requires
		i++;
	}
}

// Returns how many dictionary words have s as a prefix.
int search(const char *s)
{
	treenode *t = &head;
	int temp, i = 0;
	while (s[i])
	{
		temp = s[i] - 'a';
		if (t->next[temp] == NULL) return 0; // no word starts with s
		t = t->next[temp];
		i++;
	}
	return t->count;
}

int main()
{
	init_tree();
	char word[15]; // words have at most 10 letters
	int n;
	// scanf replaces gets here: gets was removed from the language in C++14
	scanf("%d", &n);
	while (n--)
	{
		scanf("%10s", word);
		insert(word);
	}
	scanf("%d", &n);
	while (n--)
	{
		scanf("%10s", word);
		printf("%d\n", search(word));
	}
	return 0;
}
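
As an alternative to one malloc per node, the nodes can live in a growable array and refer to each other by index, which also makes cleanup automatic. This is a minimal sketch rather than the original solution, assuming lowercase input as the problem guarantees; `PoolTrie` and its member names are illustrative.

```cpp
#include <array>
#include <string>
#include <vector>

struct PoolTrie {
    std::vector<std::array<int, 26>> child;  // child[v][c]: node index or -1
    std::vector<int> count;                  // words passing through node v

    PoolTrie() { new_node(); }  // node 0 is the root

    int new_node() {
        std::array<int, 26> a;
        a.fill(-1);
        child.push_back(a);
        count.push_back(0);
        return static_cast<int>(child.size()) - 1;
    }

    void insert(const std::string& w) {
        int cur = 0;
        for (char ch : w) {
            int c = ch - 'a';
            if (child[cur][c] == -1) {
                int id = new_node();  // may reallocate, so index afterwards
                child[cur][c] = id;
            }
            cur = child[cur][c];
            ++count[cur];
        }
    }

    // Number of inserted words that start with `p`.
    int prefix_count(const std::string& p) const {
        int cur = 0;
        for (char ch : p) {
            cur = child[cur][ch - 'a'];
            if (cur == -1) return 0;
        }
        return count[cur];
    }
};
```

With the five sample words inserted, `prefix_count("bab")` returns 3 and `prefix_count("bb")` returns 0, matching the sample output.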

KMP

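Concept:

KMP is an improved string-matching algorithm proposed by D. E. Knuth, J. H. Morris, and V. R. Pratt, hence the name Knuth-Morris-Pratt algorithm (KMP for short). Its core idea is to use the information gained from a failed match to minimize the number of comparisons between the pattern and the text, which makes matching fast. In practice this is done with a next array that holds partial-match information about the pattern itself. For a text of length n and a pattern of length m, KMP runs in O(m+n) time.

Principle:

The core idea is to preprocess the pattern: for every prefix of the pattern, find the longest proper prefix that is also a suffix. These lengths form the next array, which tells the matcher where to jump on a mismatch. The text pointer never moves backward and the pattern pointer falls back only as far as necessary, which lowers the time complexity.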
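
To pin down what "longest proper prefix that is also a suffix" means, the sketch below computes it by direct comparison for every prefix of a pattern. It is deliberately brute force and is not how KMP builds its table; the function name `border_lengths_naive` is an assumption made for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// For every prefix p[0..i], returns the length of its longest proper prefix
// that is also a suffix, found by trying each candidate length directly.
std::vector<int> border_lengths_naive(const std::string& p) {
    std::vector<int> border(p.size(), 0);
    for (std::size_t i = 0; i < p.size(); ++i) {
        // The prefix p[0..i] has length i + 1, so a proper prefix has
        // length at most i. Try the longest candidates first.
        for (std::size_t k = i; k >= 1; --k) {
            // Compare the first k characters with the last k characters.
            if (p.compare(0, k, p, i + 1 - k, k) == 0) {
                border[i] = static_cast<int>(k);
                break;
            }
        }
    }
    return border;
}
```

For the pattern abcab this yields 0, 0, 0, 1, 2: the prefix abca starts and ends with a, and abcab starts and ends with ab.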

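Steps:

1. First build the next array from the longest equal prefix and suffix of each prefix of the pattern. Matching then proceeds with i as the index into the text s and j as the index into the pattern t:
If j = -1, or the current characters match (s[i] == t[j]), set i++ and j++ and continue with the next character.
If j != -1 and the current characters do not match (s[i] != t[j]), leave i unchanged and set j = next[j]. Here next[j] = k means that the first j characters of the pattern have a longest equal proper prefix and suffix of length k, so after the jump the first k pattern characters are already known to match. In other words, on a mismatch the pattern shifts right by j - next[j] positions: the position of the mismatched character minus its next value.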
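
The jump rule in the steps above can be checked on a small pattern. The sketch below builds a next table with the -1 convention used in this article and computes the shift for a mismatch at pattern index j. The names `build_next` and `shift_on_mismatch` are illustrative assumptions.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Builds the next table: nxt[0] = -1, and for j >= 1, nxt[j] is the length
// of the longest equal proper prefix and suffix of the first j characters.
std::vector<int> build_next(const std::string& p) {
    std::vector<int> nxt(p.size() + 1);
    nxt[0] = -1;
    std::size_t i = 0;
    int j = -1;
    while (i < p.size()) {
        if (j == -1 || p[i] == p[j]) {
            ++i;
            ++j;
            nxt[i] = j;  // the first i characters have a border of length j
        } else {
            j = nxt[j];  // fall back to the next shorter border
        }
    }
    return nxt;
}

// How far the pattern moves right when the mismatch happens at index j.
int shift_on_mismatch(const std::vector<int>& nxt, int j) {
    return j - nxt[j];
}
```

For abcab the table is -1, 0, 0, 0, 1, 2. A mismatch at j = 4 shifts the pattern by 4 - 1 = 3 positions, and a mismatch at j = 0 shifts it by 0 - (-1) = 1.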

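KMP template code:

#include <iostream>
#include <cstring>
using namespace std;
int next1[100]; // getnext fills len + 1 entries, so the pattern may have at most 99 characters
void getnext(const char *s)
{
	next1[0] = -1;
	int i = 0, j = -1;
	while (s[i])
	{
		if (j == -1 || s[i] == s[j]) next1[++i] = ++j; // extend the current border
		else j = next1[j];                              // fall back to a shorter border
	}
}
void kmp(const char *s, const char *t) // s is the text, t is the pattern
{
	getnext(t);
	int len1 = strlen(t);
	int i = 0, j = 0;
	while (s[i])
	{
		if (j == -1 || s[i] == t[j])
		{
			i++, j++;
			if (j == len1)
			{
				cout << i - len1 << endl; // start index of this occurrence
				j = next1[j];             // fall back so overlapping matches are found too
			}
		}
		else
			j = next1[j];
	}
}

int main()
{
	kmp("abcabcab", "abcab"); // prints 0 and 3
	return 0;
}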

The output is shown in the figure:

(matches at indices 0 and 3)
[Figure: program output]
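
When the match positions are needed by other code rather than printed, the same loop can collect them into a vector. This is a sketch of that variation; the name `kmp_find_all` and the choice to return nothing for an empty pattern are assumptions of this example, not part of the template.

```cpp
#include <string>
#include <vector>

// Returns the 0-based start index of every occurrence of `pat` in `text`,
// including overlapping ones. An empty pattern yields no matches here.
std::vector<int> kmp_find_all(const std::string& text, const std::string& pat) {
    std::vector<int> hits;
    if (pat.empty()) return hits;
    const int m = static_cast<int>(pat.size());
    const int n = static_cast<int>(text.size());

    // Build the next table with the -1 convention.
    std::vector<int> nxt(m + 1);
    nxt[0] = -1;
    for (int i = 0, j = -1; i < m;) {
        if (j == -1 || pat[i] == pat[j]) {
            ++i;
            ++j;
            nxt[i] = j;
        } else {
            j = nxt[j];
        }
    }

    // Scan the text; i never moves backward.
    for (int i = 0, j = 0; i < n;) {
        if (j == -1 || text[i] == pat[j]) {
            ++i;
            ++j;
            if (j == m) {
                hits.push_back(i - m);  // full match ending just before i
                j = nxt[j];             // keep going to find overlaps
            }
        } else {
            j = nxt[j];
        }
    }
    return hits;
}
```

`kmp_find_all("abcabcab", "abcab")` returns the positions 0 and 3, and `kmp_find_all("aaaa", "aa")` returns 0, 1, 2, which shows that overlapping occurrences are reported.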

Here is a KMP template problem:

Given two sequences of numbers : a[1], a[2], … , a[N], and b[1], b[2], … , b[M] (1 <= M <= 10000, 1 <= N <= 1000000). Your task is to find a number K which make a[K] = b[1], a[K + 1] = b[2], … , a[K + M - 1] = b[M]. If there are more than one K exist, output the smallest one.

Input

The first line of input is a number T which indicate the number of cases. Each case contains three lines. The first line is two numbers N and M (1 <= M <= 10000, 1 <= N <= 1000000). The second line contains N integers which indicate a[1], a[2], … , a[N]. The third line contains M integers which indicate b[1], b[2], … , b[M]. All integers are in the range of [-1000000, 1000000].

Output

For each test case, you should output one line which only contain K described above. If no such K exists, output -1 instead.

Sample Input

2
13 5
1 2 1 2 3 1 2 3 1 3 2 1 2
1 2 3 1 3
13 5
1 2 1 2 3 1 2 3 1 3 2 1 2
1 2 3 2 1

Sample Output

6
-1

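Approach: given a text sequence and a pattern sequence, find the first position at which the pattern matches. Output that position (1-based) if it exists; otherwise output -1.

The accepted (AC) code: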

#include <cstdio>
using namespace std;

int a[1000005]; // text sequence
int b[10005];   // pattern sequence
int nxt[10005]; // nxt[j]: length of the longest equal proper prefix and suffix of b[0..j-1]

int n, m;
void getnext()
{
	int len = m;
	int t1 = 0, t2;
	t2 = nxt[0] = -1;

	while (t1 < len)
	{
		if (t2 == -1 || b[t1] == b[t2])
		{
			t1++; t2++;
			nxt[t1] = t2;
		}
		else t2 = nxt[t2]; // fall back to a shorter border
	}
}

int kmp(int *a, int *b)
{
	int len1 = n, len2 = m;
	int t1 = 0, t2 = 0;
	// stop at the first full match, since the smallest K is wanted
	while (t1 < len1 && t2 < len2)
	{
		if (t2 == -1 || a[t1] == b[t2])
		{
			t1++;
			t2++;
		}
		else
			t2 = nxt[t2];
	}
	if (t2 == len2)
		return t1 - t2 + 1; // convert the 0-based start index to the 1-based K
	else
		return -1;
}

int main()
{
	int t;
	scanf("%d", &t);
	while (t--)
	{
		scanf("%d%d", &n, &m);
		for (int i = 0; i < n; i++)
			scanf("%d", &a[i]);

		for (int j = 0; j < m; j++)
			scanf("%d", &b[j]);

		getnext();
		int ans = kmp(a, b);
		printf("%d\n", ans);
	}

	return 0;
}
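
When testing a KMP solution like this locally, it can help to compare its answers against a straightforward search on small inputs. The sketch below uses the standard `std::search` for that purpose; it is a cross-check under the problem's guarantee that M >= 1, not a replacement for KMP, and the function name `first_match_naive` is hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Returns the smallest 1-based K such that b occurs in a starting at K,
// or -1 if there is no occurrence. Assumes b is non-empty.
int first_match_naive(const std::vector<int>& a, const std::vector<int>& b) {
    auto it = std::search(a.begin(), a.end(), b.begin(), b.end());
    if (it == a.end()) return -1;
    return static_cast<int>(it - a.begin()) + 1;  // 0-based offset to 1-based K
}
```

On the two sample cases it returns 6 and -1, the same as the expected output.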