HDU - 2222 Keywords Search (AC automaton: KMP + Trie)

Latest recommended article published on 2021-10-08 16:03:29

宇智波一打七~

Article link: https://blog.csdn.net/weixin_51979465/article/details/119941329

Description

In modern times, search engines such as Google and Baidu have come into everybody's life.
Wiskey also wants to bring this feature to his image retrieval system.
Every image has a long description. When users type some keywords to find an image, the system matches the keywords against the image descriptions and shows the image that matches the most keywords.
To simplify the problem, you are given the description of one image and some keywords, and you should tell me how many keywords are matched.

Input

The first line contains one integer, the number of test cases that follow.
Each case begins with an integer N, the number of keywords, followed by N keywords. (N <= 10000)
Each keyword contains only the characters 'a'-'z', and its length is at most 50.
The last line is the description, and its length is at most 1000000.

Output

Print how many keywords are contained in the description.

Sample Input

1
5
she
he
say
shr
her
yasherhs

Sample Output

3

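Problem summary

Given a set of pattern strings (the keywords) and one text string (the description), count how many of the patterns occur as substrings of the text.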
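To pin down exactly what is being counted, here is a brute-force reference, offered only as a sketch for checking small inputs. The function name `count_contained` is made up for illustration, and it assumes that a keyword given twice is counted twice, which is what `cnt[p] ++` in the solution further down implies. It is far too slow for the full limits.

```cpp
#include <string>
#include <vector>

// Hypothetical reference checker: counts how many patterns occur in text
// as substrings. Each entry of patterns is counted separately, so a
// keyword that appears twice in the list contributes twice.
int count_contained(const std::vector<std::string>& patterns,
                    const std::string& text) {
    int res = 0;
    for (const std::string& p : patterns) {
        // std::string::find returns npos when p does not occur in text.
        if (text.find(p) != std::string::npos) ++res;
    }
    return res;
}
```

On the sample, the keywords `she`, `he` and `her` occur in `yasherhs`, while `say` and `shr` do not, so the count is 3.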

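Approach

Brute force is out of the question. Checking every keyword at every position of the text costs on the order of N·|T|·|P| character comparisons, where |T| is the text length and |P| the keyword length (about 10000 × 1000000 × 50 in the worst case), which is a guaranteed TLE. Running KMP once per keyword does not work either: that is O(N·(|T| + |P|)), roughly 10^10 operations, and also a guaranteed TLE.

This is where a more advanced data structure, the AC automaton (Aho-Corasick automaton), comes in. The idea is to insert all the keywords into one trie and then match the text against that trie. Whenever the next character cannot be matched from the current node, we jump to the node for the longest proper suffix of the string matched so far that is also a prefix of some keyword. This plays the same role as the next array in KMP, only moved onto a tree: once the current branch cannot be extended, nothing further down it can match either, so we jump to another branch of the trie and continue from there. The figure below illustrates this.

[Figure: trie of the keywords, with the fail pointers drawn as blue arrows]

The blue arrows are the fail pointers of the AC automaton. The fail pointer of a node points to the trie node that spells the longest proper suffix of that node's string which is also a prefix of some keyword. This makes the whole approach very similar to KMP, and the matching part of the code is almost the same as KMP's. The only difference is that computing the fail pointers requires a BFS.

Why BFS? Recall how KMP computes its next array: it is essentially a DP in which each state is derived from earlier states. KMP is just the special case of the AC automaton in which the trie is a single chain. In a trie, a node's fail pointer always points to a strictly shallower node, so the nodes have to be processed level by level, and that is exactly what BFS does. Apart from the BFS, computing the fail pointers is really no different from KMP. Now let's look at the code for this problem.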
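To make the comparison with KMP concrete, here is a sketch of the KMP failure function for a single pattern, written in the same "fall back while the character cannot be extended" shape as the fail-pointer loop in `build()` below. The name `kmp_fail` is made up for illustration. If the pattern were inserted into a trie as a single chain, the node at depth `i + 1` would get fail pointer `ne[i]`, because on a chain the node id and the prefix length coincide.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: KMP failure function for one pattern.
// ne[i] = length of the longest proper prefix of s[0..i] that is also
// a suffix of s[0..i].
std::vector<int> kmp_fail(const std::string& s) {
    std::vector<int> ne(s.size(), 0);
    for (std::size_t i = 1; i < s.size(); ++i) {
        int j = ne[i - 1];                        // start from the previous state's fallback
        while (j && s[j] != s[i]) j = ne[j - 1];  // cannot extend: fall back further
        if (s[j] == s[i]) ++j;                    // can extend by one character
        ne[i] = j;
    }
    return ne;
}
```

For the pattern `abab` this gives `0, 0, 1, 2`: after reading `abab`, the longest proper prefix that is also a suffix is `ab`.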

#include<iostream>
#include<cstdio>
#include<cstring>
using namespace std;
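// N: max number of keywords, S: max keyword length, M: max text length (each with some slack)
const int N = 10010,S = 55,M = 1000100;
// tr: trie edges (0 = no edge, node 0 = root); cnt: number of keywords ending at a node;
// ne: fail pointers; idx: id of the last allocated node; q: BFS queue
int tr[N*S][26],cnt[N*S],ne[N*S],idx,q[N*S];
char str[M];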
void insert(char s[]){
    int n = strlen(s);
    int p = 0;
    for(int i=0;i<n;i++){
        int t = s[i] - 'a';
        if(!tr[p][t]) tr[p][t] = ++idx;
        p = tr[p][t];
    }
    cnt[p] ++;
}
void build(){
    int hh = 0,tt = -1,p=0;
    for(int i=0;i<26;i++){
        if(tr[p][i]) q[++tt] = tr[p][i];// enqueue the root's children; their fail pointers stay 0 (the root)
    }
    while(hh <= tt){
        int t = q[hh++];
        for(int i=0;i<26;i++){// for each letter: if t has a child on it, compute that child's fail pointer
            int c = tr[t][i];
            if(c == 0) continue;
            int j = ne[t];
            while(j && !tr[j][i]) j = ne[j];
            if(tr[j][i]) j = tr[j][i];
            ne[c] = j;
            q[++tt] = c;
        }
    }
}
int main(){
    int T;
    scanf("%d",&T);
    while(T--){
        int n;
        memset(tr,0,sizeof tr);
        memset(cnt,0,sizeof cnt);
        memset(ne,0,sizeof ne);
        idx=0;
        scanf("%d",&n);
        for(int i=1;i<=n;i++){
            scanf("%s",str);
            insert(str);
        }
        scanf("%s",str+1);
        build();
        int len = strlen(str+1),res = 0;
        for(int i=1,j=0;i<=len;i++){
            int t = str[i] - 'a';
            while(j && !tr[j][t]) j = ne[j];
            if(tr[j][t]) j = tr[j][t];
            int p = j;
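            while(p){// one matched position can end several keywords, e.g. abcd also ends bcd, cd and d,
                     // so walk the whole fail chain; cnt is zeroed so each keyword is counted only once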
                res += cnt[p];
                cnt[p] = 0;
                p = ne[p];
            }
        }
        printf("%d\n",res);
    }
    return 0;
}
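The same algorithm can also be packaged as a function, which makes it easy to check against the sample. The following is a minimal sketch and not the submission above: the names `AhoCorasick` and `count_keywords` are made up for illustration, and it uses `std::vector` and `std::queue` instead of the fixed global arrays, so there is nothing to `memset` between test cases.

```cpp
#include <array>
#include <queue>
#include <string>
#include <vector>

// Hypothetical wrapper around the same trie + fail-pointer idea.
struct AhoCorasick {
    std::vector<std::array<int, 26>> next;  // trie edges, 0 = no edge (node 0 is the root)
    std::vector<int> fail;                  // fail pointer of each node
    std::vector<int> cnt;                   // number of keywords ending at each node

    AhoCorasick() { new_node(); }           // allocate the root

    int new_node() {
        next.push_back(std::array<int, 26>{});  // value-initialized: all zeros
        fail.push_back(0);
        cnt.push_back(0);
        return static_cast<int>(next.size()) - 1;
    }

    void insert(const std::string& s) {
        int p = 0;
        for (char ch : s) {
            int c = ch - 'a';
            if (!next[p][c]) {
                int id = new_node();  // allocate first: push_back may reallocate next
                next[p][c] = id;
            }
            p = next[p][c];
        }
        ++cnt[p];
    }

    void build() {
        std::queue<int> q;
        for (int c = 0; c < 26; ++c)
            if (next[0][c]) q.push(next[0][c]);  // depth-1 nodes fail to the root
        while (!q.empty()) {
            int t = q.front();
            q.pop();
            for (int c = 0; c < 26; ++c) {
                int child = next[t][c];
                if (!child) continue;
                int j = fail[t];
                while (j && !next[j][c]) j = fail[j];  // fall back until c can be extended
                if (next[j][c]) j = next[j][c];
                fail[child] = j;
                q.push(child);
            }
        }
    }

    // Counts keywords occurring in text. It zeroes cnt as it goes,
    // so it is meant to be called once per built automaton.
    int count(const std::string& text) {
        int res = 0, j = 0;
        for (char ch : text) {
            int c = ch - 'a';
            while (j && !next[j][c]) j = fail[j];
            if (next[j][c]) j = next[j][c];
            for (int p = j; p; p = fail[p]) {  // every suffix that is a keyword ends here
                res += cnt[p];
                cnt[p] = 0;
            }
        }
        return res;
    }
};

int count_keywords(const std::vector<std::string>& patterns, const std::string& text) {
    AhoCorasick ac;
    for (const std::string& p : patterns) ac.insert(p);
    ac.build();
    return ac.count(text);
}
```

With the sample keywords `she`, `he`, `say`, `shr`, `her` and the text `yasherhs`, `count_keywords` returns 3, since `she`, `he` and `her` occur in the text.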