——by m4trix
AC algorithm: the Aho-Corasick algorithm (Alfred V. Aho & Margaret J. Corasick)
Alfred V. Aho is also a co-author of "Compilers: Principles, Techniques, and Tools" (the Dragon Book)!
Problem:
Given n words and an article of m characters, how many of the words occur in the article?
Refer:
Hangzhou Dianzi University (HDU) ACM problem: Keywords Search
http://acm.hdu.edu.cn/showproblem.php?pid=2222
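The AC algorithm is designed for exactly this kind of problem: locating every occurrence of each pattern p from a pattern set P in a target text T.
Going further:
Suppose that at the current text position i we have already found the longest string that is both a suffix of the text read so far, T = t1t2t3...ti, and a prefix of some pattern pk in the set P. The key question is how to update the length of this longest string each time a new text character is read. The AC algorithm answers that question.
To understand the AC algorithm you first need a thorough understanding of KMP. For that, see a short note I wrote earlier: "Reading notes on 'Flexible Pattern Matching in Strings' (1): the KMP algorithm (single-pattern, prefix-based matching)"
The AC algorithm extends KMP to the multi-pattern case (both grew out of studying the same problem: how the prefixes of a pattern recur inside the patterns themselves):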
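KMP algorithm: single-pattern, prefix-based matching
KMP has only one pattern. When part of that pattern has been matched, we look for the length of the longest string that is both a proper suffix of the matched part and a prefix of the pattern.
AC algorithm: multi-pattern, prefix-based matching
AC extends this to several patterns. When part of some pattern has been matched, we look for the length of the longest string that is both a proper suffix of the matched part and a prefix of some pattern in the set (possibly a different pattern).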
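The KMP side of this comparison can be sketched in C as below. This is only an illustration of the "longest suffix that is also a prefix" idea for a single pattern; the function name `kmp_fail` is made up for this sketch and does not come from snort.

```c
#include <assert.h>
#include <string.h>

/* Fills fail[0..m-1] for a pattern of length m.
 * fail[i] is the length of the longest proper prefix of pat[0..i]
 * that is also a suffix of pat[0..i]. */
void kmp_fail(const char *pat, int *fail)
{
    int m, i, k = 0;            /* k: length of the border matched so far */

    assert(pat != NULL && fail != NULL);
    m = (int)strlen(pat);
    if (m == 0)
        return;
    fail[0] = 0;
    for (i = 1; i < m; i++) {
        while (k > 0 && pat[i] != pat[k])
            k = fail[k - 1];    /* fall back to the next shorter border */
        if (pat[i] == pat[k])
            k++;
        fail[i] = k;
    }
}
```

For the pattern "ababcab" this fills the table with 0 0 1 2 0 1 2. The AC fail table generalises the same fallback from one pattern to a trie of patterns.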
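AC algorithm:
Step 1: Preprocess the pattern set into a finite state automaton, described by three tables: the goto table, the fail table and the output table.
Step 2: Feed the text T to the automaton as input; it reports which patterns occur and where they occur in the text.
- Preprocessing phase:
(1) goto table: the state-transition automaton (a trie) built from all the patterns in the set P.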
e.g.
Take the pattern set P = {p0, p1, p2, p3}:
p0 = "he"
p1 = "she"
p2 = "his"
p3 = "hers"
(Figure: the state-transition automaton built from P)
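Since the figure is not reproduced here, a minimal C sketch of how the goto table for this pattern set could be built may help. It is not the snort code: the names `goto_tbl`, `goto_init`, `goto_add` and `goto_next`, and the fixed `MAX_STATES` limit, are assumptions made for this example.

```c
#include <assert.h>

#define MAX_STATES 32          /* plenty for this small example */
#define ALPHABET   256
#define NO_EDGE    (-1)

static int goto_tbl[MAX_STATES][ALPHABET];
static int n_states;

/* Reset the trie so that only state 0 (the root) exists. */
void goto_init(void)
{
    int s, c;
    for (s = 0; s < MAX_STATES; s++)
        for (c = 0; c < ALPHABET; c++)
            goto_tbl[s][c] = NO_EDGE;
    n_states = 1;
}

/* Insert one pattern and return the state in which it ends. */
int goto_add(const char *pat)
{
    int state = 0;
    for (; *pat != '\0'; pat++) {
        unsigned char c = (unsigned char)*pat;
        if (goto_tbl[state][c] == NO_EDGE) {
            assert(n_states < MAX_STATES);
            goto_tbl[state][c] = n_states++;   /* allocate a new state */
        }
        state = goto_tbl[state][c];
    }
    return state;
}

/* Look up one transition; NO_EDGE means the goto table has no entry. */
int goto_next(int state, unsigned char c)
{
    return goto_tbl[state][c];
}
```

Adding "he", "she", "his" and "hers" in that order ends in states 2, 5, 7 and 9 respectively, which is the state numbering used in the rest of this post.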
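(2) fail table: suppose we are in some state of the automaton and the next input character (a character of the text T) has no transition from that state, i.e. the match fails. Which state should we jump to in order to keep matching? We do not always have to go back to state 0 and start over. Using the principle stated earlier (the longest string that is both a suffix of the matched part and a prefix of some pattern in the set), we can build the fail table: a mapping that tells the automaton where to jump when a match fails.
(Figure: the fail transitions of the automaton above, drawn as a graph and as a table)
From the figure we can see:
For the matched part "sh", its suffix "h" is a prefix of the patterns "he", "his" and "hers", and it is the longest such suffix, which gives the mapping f(4) = 1;
For the matched part "she", its suffix "he" is a prefix of the patterns "he" and "hers", and it is the longest such suffix, which gives the mapping f(5) = 2;
For the matched part "his", its suffix "s" is a prefix of the pattern "she", and it is the longest such suffix, which gives the mapping f(7) = 3;
For the matched part "hers", its suffix "s" is a prefix of the pattern "she", and it is the longest such suffix, which gives the mapping f(9) = 3.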
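The mappings above can be reproduced with a breadth-first pass over the trie. The sketch below is one textbook way to do it, not the snort implementation; `ac_init`, `ac_add`, `ac_build_fail` and `ac_fail` are names invented for this example, and the fixed-size tables are an assumption that only suits small inputs.

```c
#include <assert.h>

#define MAX_STATES 32
#define ALPHABET   256
#define NO_EDGE    (-1)

static int goto_tbl[MAX_STATES][ALPHABET];
static int fail_tbl[MAX_STATES];
static int n_states;

void ac_init(void)
{
    int s, c;
    for (s = 0; s < MAX_STATES; s++) {
        fail_tbl[s] = 0;
        for (c = 0; c < ALPHABET; c++)
            goto_tbl[s][c] = NO_EDGE;
    }
    n_states = 1;                       /* state 0 is the root */
}

void ac_add(const char *pat)
{
    int state = 0;
    for (; *pat != '\0'; pat++) {
        unsigned char c = (unsigned char)*pat;
        if (goto_tbl[state][c] == NO_EDGE) {
            assert(n_states < MAX_STATES);
            goto_tbl[state][c] = n_states++;
        }
        state = goto_tbl[state][c];
    }
}

/* Breadth-first pass: each state's fail link is derived from its parent's. */
void ac_build_fail(void)
{
    int queue[MAX_STATES];
    int head = 0, tail = 0, c;

    for (c = 0; c < ALPHABET; c++)      /* depth-1 states fail to the root */
        if (goto_tbl[0][c] != NO_EDGE)
            queue[tail++] = goto_tbl[0][c];

    while (head < tail) {
        int r = queue[head++];
        for (c = 0; c < ALPHABET; c++) {
            int s = goto_tbl[r][c];
            int f;
            if (s == NO_EDGE)
                continue;
            queue[tail++] = s;
            /* Follow fail links from the parent until a state has an
             * edge on c, or until the root is reached. */
            f = fail_tbl[r];
            while (f != 0 && goto_tbl[f][c] == NO_EDGE)
                f = fail_tbl[f];
            fail_tbl[s] = (goto_tbl[f][c] != NO_EDGE) ? goto_tbl[f][c] : 0;
        }
    }
}

int ac_fail(int state)
{
    return fail_tbl[state];
}
```

With the patterns "he", "she", "his", "hers" added in that order, `ac_fail` gives f(4) = 1, f(5) = 2, f(7) = 3 and f(9) = 3, and 0 for every other state.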
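(3) output table: a relation between states and patterns. When the automaton reaches a given state, some patterns in the set may have just been matched completely; the output table maps each state to the patterns that are complete at that state.
In the example above, the output table is: output(2) = {"he"}, output(5) = {"she", "he"}, output(7) = {"his"}, output(9) = {"hers"}; every other state has an empty output.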
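- Matching phase:
Feed the characters of the text T into the automaton one at a time:
if (the goto table has a transition for the character) {
    follow the goto table into the corresponding State i;
    look up output(i):
    if (output(i) is not empty) {
        report the match positions;
    }
} else { // fail
    follow the fail table into State fail(i);
    retry the current character from there;
}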
e.g.
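Take the text T = "ushers" as an example:
Start from State 0 of the automaton.
Read 'u': the goto table leads back to State 0; continue.
Read 's': goto State 3; output(3) is empty, so no pattern has matched yet; continue.
Read 'h': goto State 4; output(4) is empty, so no pattern has matched yet; continue.
Read 'e': goto State 5; output(5) is {"she", "he"}, so the patterns "she" and "he" have matched; report their positions in the text; continue.
Read 'r': no transition (fail); fail(5) == 2, so jump to State 2; continue.
Retry 'r': goto State 8; output(8) is empty, so no pattern has matched; continue.
Read 's': goto State 9; output(9) is {"hers"}, so the pattern "hers" has matched; report its position in the text; continue.
The input text ends here, and so does the matching process.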
Result:
The text T = "ushers" matches the patterns "she", "he" and "hers", at their respective positions (omitted here)...
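The walk-through above can be replayed with a small program. What follows is a sketch of the goto/fail/output scheme, not the snort code: the `ac_*` names are made up for this example, the output table is kept as a bitmask of hypothetical pattern ids (so at most 32 patterns), and reported positions are 0-based indices of the last character of each match.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_STATES 32
#define ALPHABET   256
#define NO_EDGE    (-1)

static int      goto_tbl[MAX_STATES][ALPHABET];
static int      fail_tbl[MAX_STATES];
static unsigned out_tbl[MAX_STATES];    /* bit k set: pattern k ends here */
static int      n_states;

void ac_init(void)
{
    int s, c;
    for (s = 0; s < MAX_STATES; s++) {
        fail_tbl[s] = 0;
        out_tbl[s] = 0;
        for (c = 0; c < ALPHABET; c++)
            goto_tbl[s][c] = NO_EDGE;
    }
    n_states = 1;
}

void ac_add(const char *pat, int id)
{
    int state = 0;
    assert(id >= 0 && id < 32);
    for (; *pat != '\0'; pat++) {
        unsigned char c = (unsigned char)*pat;
        if (goto_tbl[state][c] == NO_EDGE) {
            assert(n_states < MAX_STATES);
            goto_tbl[state][c] = n_states++;
        }
        state = goto_tbl[state][c];
    }
    out_tbl[state] |= 1u << id;
}

void ac_build(void)
{
    int queue[MAX_STATES], head = 0, tail = 0, c;
    for (c = 0; c < ALPHABET; c++)
        if (goto_tbl[0][c] != NO_EDGE)
            queue[tail++] = goto_tbl[0][c];      /* fail is already 0 */
    while (head < tail) {
        int r = queue[head++];
        for (c = 0; c < ALPHABET; c++) {
            int s = goto_tbl[r][c], f;
            if (s == NO_EDGE)
                continue;
            queue[tail++] = s;
            f = fail_tbl[r];
            while (f != 0 && goto_tbl[f][c] == NO_EDGE)
                f = fail_tbl[f];
            fail_tbl[s] = (goto_tbl[f][c] != NO_EDGE) ? goto_tbl[f][c] : 0;
            out_tbl[s] |= out_tbl[fail_tbl[s]];  /* inherit shorter matches */
        }
    }
}

/* Scan text once, print every match, and return the number of matches. */
int ac_search(const char *text, const char *const pats[], int n_pats)
{
    int state = 0, found = 0, i, k;
    for (i = 0; text[i] != '\0'; i++) {
        unsigned char c = (unsigned char)text[i];
        while (state != 0 && goto_tbl[state][c] == NO_EDGE)
            state = fail_tbl[state];             /* fail, then retry c */
        if (goto_tbl[state][c] != NO_EDGE)
            state = goto_tbl[state][c];
        for (k = 0; k < n_pats; k++) {
            if (out_tbl[state] & (1u << k)) {
                printf("\"%s\" ends at index %d (state %d)\n", pats[k], i, state);
                found++;
            }
        }
    }
    return found;
}
```

With the patterns "he", "she", "his", "hers" registered under ids 0 to 3, searching "ushers" reports "he" and "she" ending at index 3 in state 5, then "hers" ending at index 5 in state 9, three matches in total.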
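Advantages of the AC algorithm:
Advantage 1: the text is scanned without any backtracking;
Advantage 2: scanning takes O(n) time, where n is the length of the text, regardless of how many patterns there are or how long they are (reporting the matches themselves takes additional time proportional to their number). Every character of the text has to be fed to the automaton, so the best and worst cases are both O(n). Including preprocessing, the total is O(M+n), where M is the combined length of all patterns.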
AC reference code:
Taken from an open-source system: the acsmx.h & acsmx.c files of the snort intrusion detection system:
acsmx.h
/*
* Copyright (C) 2002 Martin Roesch <roesch@sourcefire.com>
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
*/
/*
* ACSMX.H
*/
#ifndef ACSMX_H
#define ACSMX_H
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/*
* Prototypes
*/
#define MAXLEN 256
#define ALPHABET_SIZE 256
#define ACSM_FAIL_STATE -1
//#define AC_DEBUG 1
// Represents one pattern string
typedef struct _acsm_pattern {
    struct _acsm_pattern *next;
    unsigned char *patrn;     // pattern converted to upper case, e.g. "ABABCAB"
    unsigned char *casepatrn; // pattern as given, case preserved, e.g. "aBAbCab"
    int n;                    // length of the pattern string
    int nocase;               // 0: case-sensitive match, 1: case-insensitive match
    int nmatch;               // number of times this pattern has matched
    unsigned int id;
    void *data;               // user data
} ACSM_PATTERN;
// Represents one state (one node of the trie)
typedef struct {
    int NextState[ALPHABET_SIZE]; // goto table: next state for each input byte
    int FailState;                // fail table: failure state, used while building the NFA & DFA
    ACSM_PATTERN *MatchList;      // output table: list of patterns that end here, if any
} ACSM_STATETABLE;
/*
 * State machine Struct
 */
typedef struct {
    int acsmMaxStates;               // upper bound: total length of all patterns + 1 (State 0)
    int acsmNumStates;               // index of the last state created in the trie
    ACSM_PATTERN *acsmPatterns;      // list of all patterns
    ACSM_STATETABLE *acsmStateTable; // array with one entry per state
} ACSM_STRUCT;
/*
* Prototypes
*/
int _acsmAddPattern(ACSM_STRUCT *p, unsigned char *pat, int n, int nocase, unsigned int id, void *data);
int _acsmSearch(ACSM_STRUCT *acsm, unsigned char *Tx, int n, int match_full, unsigned int (*PrintMatch)(ACSM_PATTERN *pattern, ACSM_PATTERN *mlist, int index, void *data), void *data);
void init_xlatcase(); // API
ACSM_STRUCT *acsmNew(); // API
int acsmCompile(ACSM_STRUCT *acsm); // API
void acsmFree(ACSM_STRUCT *acsm); // API
unsigned int PrintMatch(ACSM_PATTERN *pattern, ACSM_PATTERN *mlist, int index, void *data); // API
void PrintSummary(ACSM_PATTERN *pattern); // API
void PrintGotoTable(ACSM_STRUCT *acsm); // API
void PrintFailTable(ACSM_STRUCT *acsm); // API
void PrintOutputTable(ACSM_STRUCT *acsm); // API
#define acsmAddPattern(p, pat, n, nocase, id) _acsmAddPattern(p, pat, n, nocase, id, NULL) // API
#define acsmAddPattern2(p, pat, n, nocase, data, id) _acsmAddPattern(p, pat, n, nocase, id, data) // API
#define acsmSearch(acsm, Tx, n) _acsmSearch(acsm, Tx, n, 0, NULL, NULL) // API
#define acsmSearchCB(acsm, Tx, n, match_fn, data) _acsmSearch(acsm, Tx, n, 0, match_fn, data) // API
#define acsmFullMatch(acsm, Tx, n) _acsmSearch(acsm, Tx, n, 1, NULL, NULL) // API
#define acsmFullMatchCB(acsm, Tx, n, match_fn, data) _acsmSearch(acsm, Tx, n, 1, match_fn, data) // API
#endif
acsmx.c
/*
* Multi-Pattern Search Engine
*
* Aho-Corasick State Machine - uses a Deterministic Finite Automata - DFA
*
* Copyright (C) 2002 Sourcefire, Inc.
* Marc Norton
*
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
*
*
* Reference - Efficient String matching: An Aid to Bibliographic Search
* Alfred V Aho and Margaret J Corasick
* Bell Labratories
* Copyright(C) 1975 Association for Computing Machinery, Inc.
*
* Implemented from the 4 algorithms in the paper by Aho & Corasick
* and some implementation ideas from 'Practical Algorithms in C'
*
* Notes:
* 1) This version uses about 1024 bytes per pattern character - heavy on the memory.
* 2) This algorithm finds all occurrences of all patterns within a
* body of text.
* 3) Support is included to handle upper and lower case matching.
* 4) Some comopilers optimize the search routine well, others don't, this makes all the difference.
* 5) Aho inspects all bytes of the search text, but only once so it's very efficient,
* if the patterns are all large than the Modified Wu-Manbar method is often faster.
* 6) I don't subscribe to any one method is best for all searching needs,
* the data decides which method is best,
* and we don't know until after the search method has been tested on the specific data sets.
*
* May 2002: Marc Norton 1st Version
* June 2002: Modified interface for SNORT, added case support
* Aug 2002: Cleaned up comments, and removed dead code.
* Nov 2,2002: Fixed queue_init(), added count = 0
*
* Wangyao : wangyao@cs.hit.edu.cn
*
* Apr 24,2007: WangYao Combined Build_NFA() and Convert_NFA_To_DFA() into Build_DFA();
* And Delete Some redundancy Code
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include "acsmx.h"
//#include <ngx_core.h>
#define MEMASSERT(p,s) do { if (!(p)) { fprintf(stderr, "ACSM-No Memory: %s!\n", s); exit(1); } } while (0)
//-------------------------------------------------------------------------------
//static void *ngx_waf_ac_shm_pool = NULL;
/*
* Malloc the AC Memory From shm pool
*/
static void *
AC_MALLOC(int n)
{
void *p;
p = malloc(n);
//p = ngx_slab_alloc((ngx_slab_pool_t *)ngx_waf_ac_shm_pool, n);
return p;
}
/*
* Free the AC Memory to shm pool
*/
static void
AC_FREE(void *p)
{
if (p) {
free(p);
//ngx_slab_free((ngx_slab_pool_t *)ngx_waf_ac_shm_pool, p);
}
}
//-------------------------------------------------------------------------------
//-------------------------------------------------------------------------------
/*
* Single Linked List
*
* Empty Queue:
* S
* -------
* |head |--> NULL
* -------
* |tail |--> NULL
* -------
* |count| = 0
* -------
*
* Queue with nodes:
* ------------------------------------
* | |
* | S ----> Node Node ---> Node
* | ------- | ------- ------- -------
* | |head |--- |next |-->|next |------>|next |-->NULL
* | ------- ------- ------- -------
* ---|tail | |state| |state| |state|
* ------- ------- ------- -------
* |count| = 3
* -------
*
* Used for the breadth-first traversal in Build_DFA().
*/
/*
* Simple QUEUE NODE
*/
typedef struct _qnode
{
int state;
struct _qnode *next;
} QNODE;
/*
* Simple QUEUE Structure
*/
typedef struct _queue
{
QNODE *head, *tail;
int count;
} QUEUE;
/*
*Init the Queue
*/
static void
queue_init(QUEUE *s)
{
s->head = s->tail = 0;
s->count = 0;
}
/*
* Add Tail Item to queue
*/
static void
queue_add(QUEUE *s, int state)
{
QNODE *q;
if (!s->head) { // Queue is empty
q = s->tail = s->head = (QNODE *)AC_MALLOC(sizeof(QNODE));
MEMASSERT(q, "queue_add"); // if malloc failed, exit the program
q->state = state;
q->next = 0; // Set the New Node's Next Null
} else {
q = (QNODE *)AC_MALLOC(sizeof(QNODE));
MEMASSERT(q, "queue_add");
q->state = state;
q->next = 0;
s->tail->next = q; // Add the new Node into the queue
s->tail = q; // set the new node is the Queue's Tail
}
s->count++;
}
/*
* Remove Head Item from queue
*/
static int
queue_remove(QUEUE *s)
{
int state = 0;
QNODE *q;
// Remove A QueueNode From the head of the Queue
if (s->head) {
q = s->head;
state = q->state;
s->head = s->head->next;
s->count--;
// If Queue is Empty, After Remove A QueueNode
if (!s->head) {
s->tail = 0;
s->count = 0;
}
// Free the QueNode Memory
AC_FREE(q);
}
return state;
}
/*
* Return The count of the Node in the Queue
*/
static int
queue_count(QUEUE *s)
{
return s->count;
}
/*
* Free the Queue Memory
*/
static void
queue_free(QUEUE *s)
{
while (queue_count(s)) {
queue_remove(s);
}
}
//-------------------------------------------------------------------------------
/*
* Case Translation Table
*/
static unsigned char xlatcase[256];
/*
* Init the xlatcase Table,Trans alpha to UpperMode
* Just for the NoCase State
*/
void
init_xlatcase()
{
int i;
for (i = 0; i < 256; i++) {
xlatcase[i] = toupper (i);
}
}
/*
* Convert the pattern string into upper
*/
static void
ConvertCaseEx(unsigned char *d, unsigned char *s, int m)
{
int i;
for (i = 0; i < m; i++) {
d[i] = xlatcase[s[i]];
}
}
/*
* Add a pattern to the list of patterns terminated at this state.
* Insert at front of list.
*/
static void
AddMatchListEntry(ACSM_STRUCT *acsm, int state, ACSM_PATTERN *px)
{
ACSM_PATTERN *p;
p = (ACSM_PATTERN *)AC_MALLOC(sizeof(ACSM_PATTERN));
MEMASSERT(p, "AddMatchListEntry");
memcpy(p, px, sizeof(ACSM_PATTERN));
// Add the new pattern to the pattern list
p->next = acsm->acsmStateTable[state].MatchList;
acsm->acsmStateTable[state].MatchList = p;
}
/*
* Add Pattern States
*/
static void
AddPatternStates(ACSM_STRUCT *acsm, ACSM_PATTERN *p)
{
unsigned char *pattern;
int state = 0, next, n;
n = p->n; // The number of alpha in the pattern string
pattern = p->patrn;
// Match up pattern with existing states
for (; n > 0; pattern++, n--) {
next = acsm->acsmStateTable[state].NextState[*pattern];
if (next == ACSM_FAIL_STATE) {
break;
}
state = next;
}
// Add new states for the rest of the pattern bytes, 1 state per byte
for (; n > 0; pattern++, n--) {
acsm->acsmNumStates ++;
acsm->acsmStateTable[state].NextState[*pattern] = acsm->acsmNumStates;
state = acsm->acsmNumStates;
}
// This is an accept state: add the pattern to the state's MatchList
AddMatchListEntry(acsm, state, p);
}
/*
* Build Deterministic Finite Automata
*/
static void
Build_DFA(ACSM_STRUCT *acsm)
{
int r, s;
int i;
QUEUE q, *queue = &q;
// Init a Queue
queue_init(queue);
// Add the state 0 transitions 1st
// 1st depth Node's FailState is 0, fail(x)=0
for (i = 0; i < ALPHABET_SIZE; i++) {
s = acsm->acsmStateTable[0].NextState[i];
if (s) {
queue_add(queue, s);
acsm->acsmStateTable[s].FailState = 0;
}
}
// Build the fail state transitions for each valid state
while (queue_count(queue) > 0) {
r = queue_remove(queue);
// Find Final States for any Failure
for (i = 0; i < ALPHABET_SIZE; i++) {
int fs, next;
// Note NextState[i] is a const variable in this block
if ((s = acsm->acsmStateTable[r].NextState[i]) != ACSM_FAIL_STATE) {
queue_add(queue, s);
fs = acsm->acsmStateTable[r].FailState;
// Locate the next valid state for 'i' starting at s
// Note the variable "next"
// Note "NextState[i]" is a const variable in this block
while ((next=acsm->acsmStateTable[fs].NextState[i]) == ACSM_FAIL_STATE) {
fs = acsm->acsmStateTable[fs].FailState;
}
// Update 's' state failure state to point to the next valid state
acsm->acsmStateTable[s].FailState = next;
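// NOTE: patch contributed by a reader to handle overlapping patterns, where one pattern is a suffix of another (e.g. "copy" vs "opy"): inherit the match list of the fail state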
ACSM_PATTERN *pat = acsm->acsmStateTable[next].MatchList;
for (; pat != NULL; pat = pat->next) {
AddMatchListEntry(acsm, s, pat);
}
} else {
acsm->acsmStateTable[r].NextState[i] = acsm->acsmStateTable[acsm->acsmStateTable[r].FailState].NextState[i];
}
}
}
// Clean up the queue
queue_free(queue);
}
/*
* Init the acsm DataStruct
*/
ACSM_STRUCT *
acsmNew()
{
ACSM_STRUCT *p;
// For shm share, Plz init this array in global
//init_xlatcase();
p = (ACSM_STRUCT *)AC_MALLOC(sizeof(ACSM_STRUCT));
MEMASSERT(p, "acsmNew");
if (p) {
memset(p, 0, sizeof(ACSM_STRUCT));
}
return p;
}
/*
* Add a pattern to the list of patterns for this state machine
*/
int
_acsmAddPattern(ACSM_STRUCT *p, unsigned char *pat, int n, int nocase, unsigned int id, void *data)
{
ACSM_PATTERN *plist;
plist = (ACSM_PATTERN *)AC_MALLOC(sizeof (ACSM_PATTERN));
MEMASSERT(plist, "acsmAddPattern");
plist->patrn = (unsigned char *)AC_MALLOC(n+1);
MEMASSERT(plist->patrn, "acsmAddPattern");
memset(plist->patrn+n, 0, 1);
ConvertCaseEx(plist->patrn, pat, n);
plist->casepatrn = (unsigned char *)AC_MALLOC(n+1);
MEMASSERT(plist->casepatrn, "acsmAddPattern");
memset(plist->casepatrn + n, 0, 1);
memcpy(plist->casepatrn, pat, n);
plist->n = n;
plist->nocase = nocase;
plist->nmatch = 0;
plist->id = id;
plist->data = data;
// Add the pattern into the pattern list
plist->next = p->acsmPatterns;
p->acsmPatterns = plist;
return 0;
}
/*
* Compile State Machine
*/
int
acsmCompile(ACSM_STRUCT *acsm)
{
int i, k;
int size;
ACSM_PATTERN *plist;
// Count number of states, acsmMaxStates = total character of all patterns' sum + 1 (State 0)
acsm->acsmMaxStates = 1; // State 0
for (plist = acsm->acsmPatterns; plist != NULL; plist = plist->next) {
acsm->acsmMaxStates += plist->n;
}
size = sizeof(ACSM_STATETABLE) * acsm->acsmMaxStates;
acsm->acsmStateTable = (ACSM_STATETABLE *)AC_MALLOC(size);
if (acsm->acsmStateTable == NULL) {
return -1;
}
memset(acsm->acsmStateTable, 0, size);
// Initialize state zero as a branch
acsm->acsmNumStates = 0;
// Initialize all States NextStates to FAILED
for (k = 0; k < acsm->acsmMaxStates; k++) {
for (i = 0; i < ALPHABET_SIZE; i++) {
acsm->acsmStateTable[k].NextState[i] = ACSM_FAIL_STATE;
}
}
// Add each Pattern to the State Table
for (plist = acsm->acsmPatterns; plist != NULL; plist = plist->next) {
AddPatternStates(acsm, plist);
}
#ifdef AC_DEBUG
PrintGotoTable(acsm);
#endif
// Set all failed state transitions which from state 0 to return to the 0'th state
for (i = 0; i < ALPHABET_SIZE; i++) {
if (acsm->acsmStateTable[0].NextState[i] == ACSM_FAIL_STATE) {
acsm->acsmStateTable[0].NextState[i] = 0;
}
}
// Build the DFA (fail links plus completed transitions)
Build_DFA(acsm);
return 0;
}
/* 64KB Memory */
static unsigned char Tc[64*1024];
/*
* Search Text or Binary Data for Pattern matches
*/
int
_acsmSearch(ACSM_STRUCT *acsm, unsigned char *Tx, int n, int match_full,
unsigned int (*PrintMatch)(ACSM_PATTERN *pattern, ACSM_PATTERN *mlist, int index, void *data), void *data)
{
int state;
ACSM_PATTERN *mlist;
unsigned char *Tend;
ACSM_STATETABLE *StateTable = acsm->acsmStateTable;
int nfound = 0; // Number of the found(matched) patten string
unsigned char *T;
int index, i;
unsigned int id;
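// Case conversion; Tc is a fixed 64KB buffer, so reject input that does not fit
if (n < 0 || n > (int)sizeof(Tc)) {
return -1;
}
ConvertCaseEx(Tc, Tx, n);
T = Tc;
Tend = T + n;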
for (state = 0; T < Tend; T++) {
state = StateTable[state].NextState[*T];
// State is a accept state?
if (StateTable[state].MatchList != NULL) {
for (mlist = StateTable[state].MatchList; mlist != NULL; mlist = mlist->next) {
if (match_full && n != mlist->n) {
continue;
}
// Get the index of the Match Pattern String in the Text
index = T - mlist->n + 1 - Tc;
if (!mlist->nocase) {
for (i = 0; i < mlist->n; i++) {
if (Tx[index+i] != mlist->casepatrn[i]) {
goto CONTINUE;
}
}
}
mlist->nmatch++;
nfound++;
if (PrintMatch != NULL) {
id = PrintMatch(acsm->acsmPatterns, mlist, index, data);
printf("id: %u\n", id);
}
CONTINUE:
;
}
}
}
return nfound;
}
/*
* Free all memory
*/
void
acsmFree(ACSM_STRUCT *acsm)
{
int i;
ACSM_PATTERN *mlist, *ilist;
for (i = 0; i < acsm->acsmMaxStates; i++) {
if (acsm->acsmStateTable[i].MatchList != NULL) {
mlist = acsm->acsmStateTable[i].MatchList;
while (mlist) {
ilist = mlist;
mlist = mlist->next;
AC_FREE(ilist);
}
}
}
AC_FREE(acsm->acsmStateTable);
mlist = acsm->acsmPatterns;
while (mlist) {
ilist = mlist;
mlist = mlist->next;
AC_FREE(ilist->patrn);
AC_FREE(ilist->casepatrn);
AC_FREE(ilist);
}
AC_FREE(acsm);
}
/*
* Print A Match String's Information and return the pattern's id
*/
unsigned int
PrintMatch(ACSM_PATTERN *pattern, ACSM_PATTERN *mlist, int index, void *data)
{
// Count the Each Match Pattern
ACSM_PATTERN *temp = pattern;
for (; temp != NULL; temp = temp->next) {
if (!strcmp((const char *)temp->patrn, (const char *)mlist->patrn)) { //strcmp succeed return 0, So here use "!" operation
temp->nmatch++;
}
}
printf("Match caseKeyWord %s index: %d, id: %u, nmatch: %d\n", mlist->casepatrn, index, mlist->id, mlist->nmatch);
return mlist->id;
}
void
PrintGotoTable(ACSM_STRUCT *acsm)
{
int i, n, m;
ACSM_STATETABLE *p_state_table;
printf("\n-----------------------------------\n");
printf("GotoTables:\n");
printf("-----------------------------------\n");
for (i = 0; i <= acsm->acsmNumStates; i++) {
p_state_table = &acsm->acsmStateTable[i];
printf("State[%d]'s Goto Table:\n", i);
for (n = 0; n < 16; n ++) {
for (m = 0; m < 16; m ++) {
if (p_state_table->NextState[n * 16 + m] != 0 && p_state_table->NextState[n * 16 + m] != -1) {
printf("------ %c ------\n", (n * 16 + m));
printf("| %2d | ----> | %2d |\n", i, p_state_table->NextState[n * 16 + m]);
printf("------ ------\n");
}
}
}
printf("\n");
}
printf("-----------------------------------\n");
}
void
PrintFailTable(ACSM_STRUCT *acsm)
{
int i;
ACSM_STATETABLE *p_state_table;
printf("\n-----------------------------------\n");
printf("FailTable:\n");
printf("-----------------------------------\n");
for (i = 0; i <= acsm->acsmNumStates; i++) {
p_state_table = &acsm->acsmStateTable[i];
printf("State[%d] ----> State[%d]\n", i, p_state_table->FailState);
}
printf("-----------------------------------\n");
}
void
PrintOutputTable(ACSM_STRUCT *acsm)
{
int i;
ACSM_STATETABLE *p_state_table;
ACSM_PATTERN *p_pattern;
printf("\n-----------------------------------\n");
printf("OutputTable:\n");
printf("-----------------------------------\n");
for (i = 0; i <= acsm->acsmNumStates; i++) {
p_state_table = &acsm->acsmStateTable[i];
p_pattern = p_state_table->MatchList;
if (p_pattern != NULL) {
printf("State[%d]'s Output Table:\n{", i);
for (; p_pattern != NULL; p_pattern = p_pattern->next) {
printf(" %s ", p_pattern->casepatrn);
}
printf("}\n");
}
}
printf("-----------------------------------\n");
}
int main(int argc, char **argv)
{
    int i, nocase = 0, matchcount = 0;
    ACSM_STRUCT *acsm;
    unsigned char text[MAXLEN];

    if (argc < 3) {
        fprintf(stderr, "Usage: ./ac text word-1 word-2 ... word-n -nocase\n");
        // ./ac usher hers his she he // patterns are inserted at the head of the list
        // ./ac usher e hers his she he
        exit(1);
    }
    init_xlatcase();
    acsm = acsmNew();
    // Copy the text, truncating it so that it fits the MAXLEN buffer
    strncpy((char *)text, argv[1], MAXLEN - 1);
    text[MAXLEN - 1] = '\0';
    for (i = 1; i < argc; i++) {
        if (strcmp(argv[i], "-nocase") == 0) {
            nocase = 1;
        }
    }
    for (i = 2; i < argc; i++) {
        if (argv[i][0] == '-') {
            continue;
        }
        printf("AddPattern: %s\n", argv[i]);
        acsmAddPattern(acsm, (unsigned char *)argv[i], (int)strlen(argv[i]), nocase, i);
    }
    acsmCompile(acsm);
#ifdef AC_DEBUG
    //PrintGotoTable(acsm); // not here, the Goto Table has already been changed by Build_DFA()
    PrintFailTable(acsm);
    PrintOutputTable(acsm);
#endif
    matchcount = acsmSearchCB(acsm, text, (int)strlen((char *)text), PrintMatch, NULL);
    acsmFree(acsm);
    printf("%d matched.\n", matchcount);
    return (0);
}
Analysis of the AC code:
int
acsmCompile(ACSM_STRUCT *acsm)
{
    ......
    // Add each Pattern to the State Table
    for (plist = acsm->acsmPatterns; plist != NULL; plist = plist->next) {
        AddPatternStates(acsm, plist); {
            ......
            // Here,An accept state,just add into the MatchListof the state
            AddMatchListEntry(acsm, state, p);
            // this is where the output table is generated
        }
    }
    /* The code above generates the goto table and the output table:
     * it fills in NextState[ALPHABET_SIZE] (goto table) & MatchList (output table) for every State
     * typedef struct { // the structure that represents a State
     *     int NextState[ALPHABET_SIZE]; // Next state - based on input character
     *     int FailState; // Failure state - used while building NFA & DFA
     *     ACSM_PATTERN *MatchList; // List of patterns that end here, if any
     * } ACSM_STATETABLE;
     *
     * The automaton is held in the structure below, which wastes some space (because acsmNumStates < acsmMaxStates)
     * typedef struct {
     *     int acsmMaxStates; // acsmMaxStates = total character of all patterns' sum + 1 (State 0)
     *     int acsmNumStates; // States's number of the trie
     *     ACSM_PATTERN *acsmPatterns; // a list include all acsm patterns
     *     ACSM_STATETABLE *acsmStateTable; // a array include all statetable of all trie's nodes
     * } ACSM_STRUCT;
     */
    ......
}
static void
Build_DFA(ACSM_STRUCT *acsm)
{
    ......
    /* This is where the whole fail table is built; it is stored in the FailState field of each State structure
     * typedef struct { // the structure that represents a State
     *     int NextState[ALPHABET_SIZE]; // Next state - based on input character
     *     int FailState; // Failure state - used while building NFA & DFA
     *     ACSM_PATTERN *MatchList; // List of patterns that end here, if any
     * } ACSM_STATETABLE;
     */
    // Analyse the code against the figure:
    // (Figure: diagram for analysing the code)
}
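Besides filling in FailState, Build_DFA() in the listing above also overwrites every missing NextState entry with the corresponding entry of the fail state, which is why _acsmSearch() needs only a single table lookup per input byte. The reduced sketch below shows that idea on the example patterns. It is a simplification, not the snort code: the `dfa_*` names and the fixed-size tables are assumptions, and the output table is left out.

```c
#include <assert.h>

#define MAX_STATES 32
#define ALPHABET   256
#define NO_EDGE    (-1)

static int delta[MAX_STATES][ALPHABET];  /* goto table, later the full DFA */
static int fail_tbl[MAX_STATES];
static int n_states;

void dfa_init(void)
{
    int s, c;
    for (s = 0; s < MAX_STATES; s++) {
        fail_tbl[s] = 0;
        for (c = 0; c < ALPHABET; c++)
            delta[s][c] = NO_EDGE;
    }
    n_states = 1;
}

/* Must be called before dfa_build(); afterwards the trie edges are gone. */
void dfa_add(const char *pat)
{
    int state = 0;
    for (; *pat != '\0'; pat++) {
        unsigned char c = (unsigned char)*pat;
        if (delta[state][c] == NO_EDGE) {
            assert(n_states < MAX_STATES);
            delta[state][c] = n_states++;
        }
        state = delta[state][c];
    }
}

void dfa_build(void)
{
    int queue[MAX_STATES], head = 0, tail = 0, c;

    for (c = 0; c < ALPHABET; c++) {
        if (delta[0][c] == NO_EDGE)
            delta[0][c] = 0;               /* root loops on unknown bytes */
        else
            queue[tail++] = delta[0][c];   /* depth-1 state, fail = 0 */
    }
    while (head < tail) {
        int r = queue[head++];
        for (c = 0; c < ALPHABET; c++) {
            int s = delta[r][c];
            if (s != NO_EDGE) {
                /* The fail state's row is already complete because it
                 * is shallower than r and was dequeued earlier. */
                fail_tbl[s] = delta[fail_tbl[r]][c];
                queue[tail++] = s;
            } else {
                delta[r][c] = delta[fail_tbl[r]][c];  /* fill the gap */
            }
        }
    }
}

int dfa_step(int state, unsigned char c)
{
    return delta[state][c];
}

/* Run the whole text through the DFA and return the final state. */
int dfa_run(const char *text)
{
    int state = 0;
    for (; *text != '\0'; text++)
        state = dfa_step(state, (unsigned char)*text);
    return state;
}
```

After building the DFA for "he", "she", "his", "hers", the transition from State 5 on 'r' leads straight to State 8, with no explicit detour through fail(5) = 2, and "ushers" ends in State 9. This is also why the goto table can no longer be printed meaningfully once Build_DFA() has run.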
Notes on the AC code:
One debatable point is the way the state table is sized here, which wastes some space:
int
acsmCompile(ACSM_STRUCT *acsm)
{
int i, k;
int size;
ACSM_PATTERN *plist;
// Count number of states, acsmMaxStates = total character of all patterns' sum + 1 (State 0)
acsm->acsmMaxStates = 1; // State 0
for (plist = acsm->acsmPatterns; plist != NULL; plist = plist->next) {
acsm->acsmMaxStates += plist->n;
}
size = sizeof(ACSM_STATETABLE) * acsm->acsmMaxStates;
acsm->acsmStateTable = (ACSM_STATETABLE *)AC_MALLOC(size);
if (acsm->acsmStateTable == NULL) {
return -1;
}
memset(acsm->acsmStateTable, 0, size);
......
}
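To get a feel for how much space is involved, the sketch below compares the upper bound computed by acsmCompile() with the number of states a trie actually needs. The helper names `max_states`, `used_states` and `wasted_bytes` and the stand-in type `state_entry` are hypothetical; the struct only mirrors the shape of ACSM_STATETABLE, and its exact size depends on the platform.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ALPHABET_SIZE 256

/* Same shape as ACSM_STATETABLE, with the match list reduced to a pointer. */
typedef struct {
    int   NextState[ALPHABET_SIZE];
    int   FailState;
    void *MatchList;
} state_entry;

/* Upper bound used by acsmCompile(): one state per pattern byte, plus State 0. */
size_t max_states(const char *const pats[], size_t n)
{
    size_t total = 1, i;
    assert(pats != NULL);
    for (i = 0; i < n; i++)
        total += strlen(pats[i]);
    return total;
}

/* States a trie really needs: one per distinct non-empty prefix, plus State 0. */
size_t used_states(const char *const pats[], size_t n)
{
    size_t used = 1, i, j, len;
    assert(pats != NULL);
    for (i = 0; i < n; i++) {
        size_t plen = strlen(pats[i]);
        for (len = 1; len <= plen; len++) {
            int seen = 0;
            /* Has an earlier pattern already produced this prefix? */
            for (j = 0; j < i && !seen; j++)
                seen = (strncmp(pats[i], pats[j], len) == 0);
            if (!seen)
                used++;
        }
    }
    return used;
}

/* Bytes allocated for state entries that are never used. */
size_t wasted_bytes(const char *const pats[], size_t n)
{
    return (max_states(pats, n) - used_states(pats, n)) * sizeof(state_entry);
}
```

For the patterns "he", "she", "his", "hers" the bound is 13 states while the trie uses 10, so three entries (each at least 1 KB with a 4-byte int) are allocated but never touched. The gap grows with the amount of prefix sharing among the patterns.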
Refer:
"AC: the classic multi-pattern matching algorithm"
http://blog.csdn.net/ijuliet/article/details/4210858
"AC (Aho-Corasick) multi-pattern matching algorithm"
http://my.oschina.net/amince/blog/196426 (this blog is worth following)
"Learning the Aho-Corasick algorithm"
http://blog.csdn.net/sealyao/article/details/4560427