Reading Notes on "Flexible Pattern Matching in Strings" (2): The AC Algorithm (Multi-Pattern Matching, Prefix-Based)

                                                                                                                         by m4trix


AC algorithm: the Aho-Corasick algorithm (Alfred V. Aho & Margaret J. Corasick)
Alfred V. Aho is also a co-author of "Compilers: Principles, Techniques, and Tools" (the Dragon Book).

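Problem:
Given n words and an article of m characters, how many of the words occur in the article?
Refer:
        Hangzhou Dianzi University (HDU) ACM problem: Keywords Search
        http://acm.hdu.edu.cn/showproblem.php?pid=2222

The AC algorithm is designed for exactly this kind of problem: locating, in a target text T, every occurrence of every pattern p in a pattern set P.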
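
For contrast, here is a naive baseline in C. It is not the AC algorithm: it rescans the whole text once per word with the standard strstr, so its cost grows with the number of words. The function name count_words_present is a hypothetical helper introduced only for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Naive baseline, NOT Aho-Corasick: one full scan of the text per word.
 * count_words_present is a hypothetical helper used only for illustration. */
size_t count_words_present(const char *text, const char *const *words, size_t n)
{
    size_t present = 0;

    assert(text != NULL && words != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strstr(text, words[i]) != NULL) {
            present++;   /* words[i] occurs at least once in text */
        }
    }
    return present;
}
```

With the text "ushers" and the words "he", "she", "his", "hers", this returns 3. The AC algorithm gets the same answer in a single pass over the text.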

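Continuing:
Suppose that at the current text position i we have already found the longest string that is both a suffix of the text read so far, T = t1t2t3...ti, and a prefix of some pattern pk in the pattern set P. The key question is how to update the length of this longest string each time a new text character is read. The AC algorithm answers it.

To understand the AC algorithm you need a solid grasp of KMP first. For that, see my earlier note: "Reading Notes on Flexible Pattern Matching in Strings (1): The KMP Algorithm (Single-Pattern Matching, Prefix-Based)".

The AC algorithm extends KMP to the multi-pattern setting (both grew out of the study of the same problem: how the prefixes of a pattern overlap with the pattern itself).
KMP: single-pattern matching, prefix-based
KMP has a single pattern. When part of that pattern has been matched, it looks for the length of the longest string that is both a proper suffix of the matched part and a prefix of the pattern.
AC: multi-pattern matching, prefix-based
Because AC generalizes KMP to many patterns, when part of some pattern has been matched it looks for the length of the longest string that is both a proper suffix of the matched part and a prefix of any pattern in the pattern set.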
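
To make the KMP side of this comparison concrete, here is a minimal sketch of the single-pattern failure function in C. The name kmp_fail and its signature are assumptions made for this note, not something taken from the book or from snort.

```c
#include <assert.h>
#include <stddef.h>

/* kmp_fail is a hypothetical helper name. On return, fail[i] holds the length
 * of the longest proper prefix of pat[0..i] that is also a suffix of pat[0..i]. */
void kmp_fail(const char *pat, size_t m, size_t *fail)
{
    size_t k = 0;   /* length of the border currently being extended */

    assert(pat != NULL && fail != NULL);
    if (m == 0) {
        return;
    }
    fail[0] = 0;
    for (size_t i = 1; i < m; i++) {
        while (k > 0 && pat[i] != pat[k]) {
            k = fail[k - 1];   /* fall back to the next shorter border */
        }
        if (pat[i] == pat[k]) {
            k++;
        }
        fail[i] = k;
    }
}
```

For the pattern "ababcab" this fills fail with 0, 0, 1, 2, 0, 1, 2. The AC fail table plays the same role, except that the fallback target may lie on the path of a different pattern.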

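The AC algorithm:
Step 1: preprocess the pattern set into a finite state automaton (three tables are built on top of it: the goto table, the fail table, and the output table);
Step 2: feed the text T to the automaton as input; it reports which patterns occur and where they occur in the text.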

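  • Preprocessing phase:
The classic AC algorithm consists of three tables, all built during preprocessing:
(1) goto table: the state-transition automaton (a trie) built from all patterns in the pattern set P;
e.g.
Pattern set: P = {p0, p1, p2, p3}
p0 = "he"
p1 = "she"
p2 = "his"
p3 = "hers"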


        [Figure: state-transition automaton (goto graph) for P]
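
The goto table for this pattern set can be sketched as a trie in C. This is a minimal illustration, not the snort code: the names goto_tab, trie_init, and trie_add, and the fixed capacity MAX_STATES, are assumptions made for the sketch. States are numbered in order of creation, which reproduces the state numbers used in the rest of this note when the patterns are inserted as p0, p1, p2, p3.

```c
#include <assert.h>

#define MAX_STATES 64     /* assumed capacity, plenty for the example */
#define ALPHABET   256
#define NO_EDGE    (-1)

static int goto_tab[MAX_STATES][ALPHABET];   /* goto table: [state][byte] -> state */
static int n_states;                         /* number of states in use */

/* Reset the trie so that only the root (State 0) exists. */
void trie_init(void)
{
    for (int s = 0; s < MAX_STATES; s++) {
        for (int c = 0; c < ALPHABET; c++) {
            goto_tab[s][c] = NO_EDGE;
        }
    }
    n_states = 1;
}

/* Insert one pattern and return the state in which it ends. */
int trie_add(const char *pat)
{
    int s = 0;

    for (const unsigned char *p = (const unsigned char *)pat; *p != '\0'; p++) {
        if (goto_tab[s][*p] == NO_EDGE) {
            assert(n_states < MAX_STATES);   /* sketch only: no dynamic growth */
            goto_tab[s][*p] = n_states++;    /* allocate the next state number */
        }
        s = goto_tab[s][*p];
    }
    return s;
}
```

After trie_init(), adding "he", "she", "his", "hers" in that order returns the end states 2, 5, 7, and 9.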


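(2) fail table: suppose we are in some state of the automaton and the next input character (a character of the text T) has no outgoing transition from that state, i.e. the match fails. Where in the automaton should we jump to in order to keep matching? We do not always have to go back to State 0 and start over. Applying the principle above (take the longest string that is both a suffix of the matched part and a prefix of some pattern in the set) yields the fail table: a mapping that tells the automaton where to jump when a match fails.

The fail table of the automaton above, drawn as a graph:
       [Figure: fail-transition graph]

fail table:
From the figure above:
Matched part "sh": its suffix "h" is a prefix of the patterns "he", "his", and "hers", and it is the longest such suffix, so f(4) = 1;
Matched part "she": its suffix "he" is a prefix of the patterns "he" and "hers", and it is the longest such suffix, so f(5) = 2;
Matched part "his": its suffix "s" is a prefix of the pattern "she", and it is the longest such suffix, so f(7) = 3;
Matched part "hers": its suffix "s" is a prefix of the pattern "she", and it is the longest such suffix, so f(9) = 3.
Every other state fails to State 0.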
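
The fail values above can be reproduced with a breadth-first pass over the trie. The sketch below is a simplified illustration under assumed names (goto_tab, fail_tab, add_pattern, build_fail_table); unlike the snort Build_DFA listed later, it keeps the goto and fail tables separate and does not complete the missing transitions.

```c
#include <assert.h>

#define MAX_STATES 64
#define ALPHABET   256
#define NO_EDGE    (-1)

static int goto_tab[MAX_STATES][ALPHABET];
static int fail_tab[MAX_STATES];
static int n_states;

static void add_pattern(const char *pat)
{
    int s = 0;
    for (const unsigned char *p = (const unsigned char *)pat; *p != '\0'; p++) {
        if (goto_tab[s][*p] == NO_EDGE) {
            assert(n_states < MAX_STATES);
            goto_tab[s][*p] = n_states++;
        }
        s = goto_tab[s][*p];
    }
}

/* Build the trie for pats[0..npats-1], then fill fail_tab breadth-first.
 * Returns a pointer to the fail table, indexed by state number. */
const int *build_fail_table(const char *const *pats, int npats)
{
    int queue[MAX_STATES], head = 0, tail = 0;

    for (int s = 0; s < MAX_STATES; s++) {
        for (int c = 0; c < ALPHABET; c++) {
            goto_tab[s][c] = NO_EDGE;
        }
    }
    n_states = 1;
    for (int i = 0; i < npats; i++) {
        add_pattern(pats[i]);
    }

    fail_tab[0] = 0;
    for (int c = 0; c < ALPHABET; c++) {      /* depth-1 states fail to the root */
        int s = goto_tab[0][c];
        if (s != NO_EDGE) {
            fail_tab[s] = 0;
            queue[tail++] = s;
        }
    }
    while (head < tail) {
        int r = queue[head++];
        for (int c = 0; c < ALPHABET; c++) {
            int s = goto_tab[r][c];
            if (s == NO_EDGE) {
                continue;
            }
            queue[tail++] = s;
            int f = fail_tab[r];              /* start from the parent's fail state */
            while (f != 0 && goto_tab[f][c] == NO_EDGE) {
                f = fail_tab[f];              /* keep falling back until c is accepted */
            }
            fail_tab[s] = (goto_tab[f][c] != NO_EDGE) ? goto_tab[f][c] : 0;
        }
    }
    return fail_tab;
}
```

Breadth-first order matters: the fail state of s is always shallower than s, so it has already been computed by the time s is processed.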


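(3) output table: a mapping from states to patterns. When the automaton reaches a given state, some patterns in the set may have just been matched completely; the output table records, for each state, the patterns whose match ends there.

For the example above, the output table is:
output(2) = {"he"}
output(5) = {"she", "he"}
output(7) = {"his"}
output(9) = {"hers"}
(output is empty for every other state)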
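
One way to see why output(5) contains "he" as well as "she": a state's output is the set of patterns that end exactly there, plus everything in the output of its fail state. The sketch below hard-codes the example's fail table and a breadth-first ordering of the states; the bitmask encoding and the names P_HE, P_SHE, P_HIS, P_HERS, and build_output are assumptions made for illustration.

```c
#include <assert.h>

#define N_STATES 10

/* One bit per pattern of the example set (hypothetical names). */
enum { P_HE = 1, P_SHE = 2, P_HIS = 4, P_HERS = 8 };

/* Fail table of the example automaton, indexed by state number. */
static const int fail_tab[N_STATES] = {0, 0, 0, 0, 1, 2, 0, 3, 0, 3};

/* Non-root states in breadth-first order, so that the fail state of s
 * is always finished before s itself is processed. */
static const int bfs_order[N_STATES - 1] = {1, 3, 2, 6, 4, 8, 7, 5, 9};

void build_output(unsigned out[N_STATES])
{
    for (int s = 0; s < N_STATES; s++) {
        out[s] = 0;
    }
    /* Patterns that end exactly in a state. */
    out[2] = P_HE;
    out[5] = P_SHE;
    out[7] = P_HIS;
    out[9] = P_HERS;

    /* Inherit the outputs of the fail state. */
    for (int i = 0; i < N_STATES - 1; i++) {
        int s = bfs_order[i];
        assert(fail_tab[s] != s);
        out[s] |= out[fail_tab[s]];
    }
}
```

State 5 ("she") fails to State 2 ("he"), so it inherits "he". The snort listing below does the same thing inside Build_DFA by copying the fail state's MatchList entries.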


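  • Matching phase:
Feed the characters of the text T to the automaton one at a time:
    if (matching) {
            look up the goto table and enter the corresponding State i;
            check output(i):
             if (output(i) is not empty) {
                     report the match position;
             }
    } else {  // fail
             look up the fail table and enter State fail(i);
             retry the same character from there;
    }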

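e.g.
Take the text T = "ushers" as an example:
Start from State 0 of the automaton.
Read 'u': the goto table leads back to State 0; continue.
Read 's': goto State 3; output(3) is empty, so no pattern has matched; continue.
Read 'h': goto State 4; output(4) is empty, so no pattern has matched; continue.
Read 'e': goto State 5; output(5) is {"she", "he"}, so the patterns "she" and "he" have matched; report their positions in the text; continue.
Read 'r': fail; fail(5) == 2, so jump to State 2; continue.
Retry 'r': goto State 8; output(8) is empty, so no pattern has matched; continue.
Read 's': goto State 9; output(9) is {"hers"}, so the pattern "hers" has matched; report its position in the text; continue.
End of input; matching is complete.
Result:
The text T = "ushers" matches the patterns "she", "he", and "hers"; their positions are ... (omitted)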
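
The matching loop can be sketched in C with the example automaton hard-coded. This is an illustration only: the tables are transcribed from the example in this note, and the names goto_fn, Match, and ac_search are hypothetical. Offsets are 0-based.

```c
#include <assert.h>
#include <string.h>

#define N_STATES 10
#define N_PATS   4
#define FAIL     (-1)

static const char *const pats[N_PATS] = {"he", "she", "his", "hers"};
static const int fail_tab[N_STATES] = {0, 0, 0, 0, 1, 2, 0, 3, 0, 3};
/* Output table as bitmasks: bit k set means pats[k] ends in that state. */
static const unsigned out_tab[N_STATES] = {0, 0, 0x1, 0, 0, 0x3, 0, 0x4, 0, 0x8};

typedef struct {
    int pat;     /* index into pats[] */
    int start;   /* 0-based offset of the match in the text */
} Match;

/* Goto table of the example automaton; the root never fails. */
static int goto_fn(int s, char c)
{
    switch (s) {
    case 0: return c == 'h' ? 1 : c == 's' ? 3 : 0;
    case 1: return c == 'e' ? 2 : c == 'i' ? 6 : FAIL;
    case 2: return c == 'r' ? 8 : FAIL;
    case 3: return c == 'h' ? 4 : FAIL;
    case 4: return c == 'e' ? 5 : FAIL;
    case 6: return c == 's' ? 7 : FAIL;
    case 8: return c == 's' ? 9 : FAIL;
    default: return FAIL;   /* states 5, 7 and 9 have no outgoing edges */
    }
}

/* Scan text once; store up to cap matches in found[] and return the count. */
int ac_search(const char *text, Match *found, int cap)
{
    int s = 0, n = 0;

    assert(text != NULL && found != NULL);
    for (int i = 0; text[i] != '\0'; i++) {
        while (goto_fn(s, text[i]) == FAIL) {
            s = fail_tab[s];             /* follow fail links, same character */
        }
        s = goto_fn(s, text[i]);
        for (int k = 0; k < N_PATS; k++) {
            if (((out_tab[s] >> k) & 1u) && n < cap) {
                found[n].pat = k;
                found[n].start = i + 1 - (int)strlen(pats[k]);
                n++;
            }
        }
    }
    return n;
}
```

On "ushers" this reports three matches: "he" at offset 2 and "she" at offset 1 (both when State 5 is reached), then "hers" at offset 2 (when State 9 is reached).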


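Advantages of the AC algorithm:
Advantage 1: the text is scanned without any backtracking;
Advantage 2: scanning a text of n characters takes O(n) time, independent of the number and lengths of the patterns (reporting the matches themselves adds time proportional to the number of occurrences). Every character of the text must be fed to the automaton, so the best and worst cases are both O(n). Including preprocessing, the total is O(M + n), where M is the total length of all patterns and the alphabet size is treated as a constant.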


AC code reference:
Taken from the open-source snort intrusion detection system, files acsmx.h & acsmx.c:
acsmx.h
/*
 * Copyright (C) 2002 Martin Roesch <roesch@sourcefire.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 */


/*
 * ACSMX.H
 */
#ifndef ACSMX_H
#define ACSMX_H

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Prototypes
 */
#define MAXLEN           256
#define ALPHABET_SIZE    256
#define ACSM_FAIL_STATE  -1

//#define AC_DEBUG  1

// represents one pattern string
typedef struct _acsm_pattern {
    struct _acsm_pattern  *next;
    unsigned char         *patrn;      // pattern converted to upper case, e.g. "ABABCAB"
    unsigned char         *casepatrn;  // pattern as given, case preserved, e.g. "aBAbCab"
    int                   n;           // length of the pattern string
    int                   nocase;      // 1 = case-insensitive, 0 = case-sensitive
    int                   nmatch;      // number of times this pattern has matched
    unsigned int          id;          // caller-supplied pattern id
    void                  *data;       // user data
} ACSM_PATTERN;

// represents one state (one trie node)
typedef struct {
    int           NextState[ALPHABET_SIZE];  // Next state - based on input character          goto table
    int           FailState;                 // Failure state - used while building NFA & DFA  fail table
    ACSM_PATTERN  *MatchList;                // List of patterns that end here, if any         output table
} ACSM_STATETABLE;

/*
 * State machine Struct
 */
typedef struct {
    int              acsmMaxStates;    // upper bound: total length of all patterns + 1 (State 0)
    int              acsmNumStates;    // index of the last state in use (states 0..acsmNumStates)
    ACSM_PATTERN     *acsmPatterns;    // list of all patterns
    ACSM_STATETABLE  *acsmStateTable;  // array holding the state table of every trie node
} ACSM_STRUCT;

/*
 * Prototypes
 */
int _acsmAddPattern(ACSM_STRUCT *p, unsigned char *pat, int n, int nocase, unsigned int id, void *data);
int _acsmSearch(ACSM_STRUCT *acsm, unsigned char *Tx, int n, int match_full, unsigned int (*PrintMatch)(ACSM_PATTERN *pattern, ACSM_PATTERN *mlist, int index, void *data), void *data);

void init_xlatcase();                                                                               // API
ACSM_STRUCT *acsmNew();                                                                             // API
int acsmCompile(ACSM_STRUCT *acsm);                                                                 // API
void acsmFree(ACSM_STRUCT *acsm);                                                                   // API
unsigned int PrintMatch(ACSM_PATTERN *pattern, ACSM_PATTERN *mlist, int index, void *data);         // API
void PrintSummary(ACSM_PATTERN *pattern);                                                           // API
void PrintGotoTable(ACSM_STRUCT *acsm);                                                             // API
void PrintFailTable(ACSM_STRUCT *acsm);                                                             // API
void PrintOutputTable(ACSM_STRUCT *acsm);                                                           // API
#define acsmAddPattern(p, pat, n, nocase, id)         _acsmAddPattern(p, pat, n, nocase, id, NULL)  // API
#define acsmAddPattern2(p, pat, n, nocase, data, id)  _acsmAddPattern(p, pat, n, nocase, id, data)  // API
#define acsmSearch(acsm, Tx, n)                       _acsmSearch(acsm, Tx, n, 0, NULL, NULL)       // API
#define acsmSearchCB(acsm, Tx, n, match_fn, data)     _acsmSearch(acsm, Tx, n, 0, match_fn, data)   // API
#define acsmFullMatch(acsm, Tx, n)                    _acsmSearch(acsm, Tx, n, 1, NULL, NULL)       // API
#define acsmFullMatchCB(acsm, Tx, n, match_fn, data)  _acsmSearch(acsm, Tx, n, 1, match_fn, data)   // API

#endif

acsmx.c
/*
 * Multi-Pattern Search Engine
 *
 * Aho-Corasick State Machine - uses a Deterministic Finite Automata - DFA
 *
 * Copyright (C) 2002 Sourcefire, Inc.
 * Marc Norton
 *
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 *
 *
 * Reference - Efficient String matching: An Aid to Bibliographic Search
 *             Alfred V Aho and Margaret J Corasick
 *             Bell Labratories
 *             Copyright(C) 1975 Association for Computing Machinery, Inc.
 *
 * Implemented from the 4 algorithms in the paper by Aho & Corasick
 * and some implementation ideas from 'Practical Algorithms in C'
 *
 * Notes:
 *     1) This version uses about 1024 bytes per pattern character - heavy on the memory.
 *     2) This algorithm finds all occurrences of all patterns within a
 *        body of text.
 *     3) Support is included to handle upper and lower case matching.
 *     4) Some compilers optimize the search routine well, others don't, this makes all the difference.
 *     5) Aho inspects all bytes of the search text, but only once so it's very efficient,
 *        if the patterns are all large then the Modified Wu-Manber method is often faster.
 *     6) I don't subscribe to any one method is best for all searching needs,
 *        the data decides which method is best,
 *        and we don't know until after the search method has been tested on the specific data sets.
 *
 *  May   2002: Marc Norton 1st Version
 *  June  2002: Modified interface for SNORT, added case support
 *  Aug   2002: Cleaned up comments, and removed dead code.
 *  Nov 2,2002: Fixed queue_init(), added count = 0
 *
 *  Wangyao : wangyao@cs.hit.edu.cn
 *
 *  Apr 24,2007: WangYao Combined Build_NFA() and Convert_NFA_To_DFA() into Build_DFA();
 *                       And Delete Some redundancy Code
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include "acsmx.h"

//#include <ngx_core.h>

#define MEMASSERT(p,s)  do { if (!(p)) { fprintf(stderr, "ACSM-No Memory: %s!\n", s); exit(EXIT_FAILURE); } } while (0)


//-------------------------------------------------------------------------------
//static void  *ngx_waf_ac_shm_pool = NULL;

/*
 * Malloc the AC Memory From shm pool
 */
static void *
AC_MALLOC(int n)
{
    void *p;

    p = malloc(n);
    //p = ngx_slab_alloc((ngx_slab_pool_t *)ngx_waf_ac_shm_pool, n);

    return p;
}

/*
 * Free the AC Memory to shm pool
 */
static void
AC_FREE(void *p)
{
    if (p) {
        free(p);
        //ngx_slab_free((ngx_slab_pool_t *)ngx_waf_ac_shm_pool, p);
    }
}
//-------------------------------------------------------------------------------




//-------------------------------------------------------------------------------
/*
 * Single Linked List
 *
 * Empty Queue:
 *   S
 * -------
 * |head |--> NULL
 * -------
 * |tail |--> NULL
 * -------
 * |count| = 0
 * -------
 *
 * Queue with nodes:
 *  ------------------------------------
 *  |                                   |
 *  |    S      ----> Node      Node    --->  Node
 *  |  -------  |    -------   -------       -------
 *  |  |head |---    |next |-->|next |------>|next |-->NULL
 *  |  -------       -------   -------       -------
 *  ---|tail |       |state|   |state|       |state|
 *     -------       -------   -------       -------
 *     |count| = 3
 *     -------
 *
 *  Used for the breadth-first traversal of states in Build_DFA().
 */

/*
 * Simple QUEUE NODE
 */
typedef struct _qnode
{
    int            state;
    struct _qnode  *next;
} QNODE;

/*
 * Simple QUEUE Structure
 */
typedef struct _queue
{
    QNODE  *head, *tail;
    int    count;
} QUEUE;

/*
 *Init the Queue
 */
static void
queue_init(QUEUE *s)
{
    s->head = s->tail = 0;
    s->count = 0;
}

/*
 * Add Tail Item to queue
 */
static void
queue_add(QUEUE *s, int state)
{
    QNODE  *q;
    
    if (!s->head) {                 // Queue is empty
        q = s->tail = s->head = (QNODE *)AC_MALLOC(sizeof(QNODE));
        MEMASSERT(q, "queue_add");  // if malloc failed, exit the program
        q->state = state;
        q->next = 0;                // Set the New Node's Next Null
    } else {
        q = (QNODE *)AC_MALLOC(sizeof(QNODE));
        MEMASSERT(q, "queue_add");
        q->state = state;
        q->next = 0;
        s->tail->next = q;  // Add the new Node into the queue
        s->tail = q;        // set the new node is the Queue's Tail
    }
    s->count++;
}

/*
 * Remove Head Item from queue
 */
static int
queue_remove(QUEUE *s)
{
    int    state = 0;
    QNODE  *q;
    
    // Remove A QueueNode From the head of the Queue
    if (s->head) {
        q = s->head;
        state = q->state;
        s->head = s->head->next;
        s->count--;

        // If Queue is Empty, After Remove A QueueNode
        if (!s->head) {
            s->tail = 0;
            s->count = 0;
        }

        // Free the QueNode Memory
        AC_FREE(q);
    }
    return state;
}

/*
 * Return The count of the Node in the Queue
 */
static int
queue_count(QUEUE *s)
{
    return s->count;
}

/*
 * Free the Queue Memory
 */
static void
queue_free(QUEUE *s)
{
    while (queue_count(s)) {
        queue_remove(s);
    }
}
//-------------------------------------------------------------------------------




/*
 * Case Translation Table
 */
static unsigned char xlatcase[256];

/*
 * Init the xlatcase Table,Trans alpha to UpperMode
 * Just for the NoCase State
 */
void
init_xlatcase()
{
    int i;

    for (i = 0; i < 256; i++) {
        xlatcase[i] = toupper (i);
    }
}

/*
 * Convert the pattern string into upper
 */
static void
ConvertCaseEx(unsigned char *d, unsigned char *s, int m)
{
    int i;

    for (i = 0; i < m; i++) {
        d[i] = xlatcase[s[i]];
    }
}

/*
 * Add a pattern to the list of patterns terminated at this state.
 * Insert at front of list.
 */
static void
AddMatchListEntry(ACSM_STRUCT *acsm, int state, ACSM_PATTERN *px)
{
    ACSM_PATTERN  *p;

    p = (ACSM_PATTERN *)AC_MALLOC(sizeof(ACSM_PATTERN));
    MEMASSERT(p, "AddMatchListEntry");
    memcpy(p, px, sizeof(ACSM_PATTERN));

    // Add the new pattern to the pattern list
    p->next = acsm->acsmStateTable[state].MatchList;
    acsm->acsmStateTable[state].MatchList = p;
}

/*
 * Add Pattern States
 */
static void
AddPatternStates(ACSM_STRUCT *acsm, ACSM_PATTERN *p)
{
    unsigned char  *pattern;
    int            state = 0, next, n;
    
    n = p->n;            // The number of alpha in the pattern string
    pattern = p->patrn;

    // Match up pattern with existing states
    for (; n > 0; pattern++, n--) {
        next = acsm->acsmStateTable[state].NextState[*pattern];
        if (next == ACSM_FAIL_STATE) {
            break;
        }
        state = next;
    }

    // Add new states for the rest of the pattern bytes, 1 state per byte
    for (; n > 0; pattern++, n--) {
        acsm->acsmNumStates ++;
        acsm->acsmStateTable[state].NextState[*pattern] = acsm->acsmNumStates;
        state = acsm->acsmNumStates;
    }

    // This is an accepting state: add the pattern to the state's MatchList
    AddMatchListEntry(acsm, state, p);
}


/*
 * Build Deterministic Finite Automata
 */
static void
Build_DFA(ACSM_STRUCT *acsm)
{
    int    r, s;
    int    i;
    QUEUE  q, *queue = &q;

    // Init a Queue
    queue_init(queue);

    // Add the state 0 transitions 1st
    // 1st depth Node's FailState is 0, fail(x)=0
    for (i = 0; i < ALPHABET_SIZE; i++) {
        s = acsm->acsmStateTable[0].NextState[i];
        if (s) {
            queue_add(queue, s);
            acsm->acsmStateTable[s].FailState = 0;
        }
    }

    // Build the fail state transitions for each valid state
    while (queue_count(queue) > 0) {
        r = queue_remove(queue);

        // Find Final States for any Failure
        for (i = 0; i < ALPHABET_SIZE; i++) {
            int fs, next;
            // Note NextState[i] is a const variable in this block
            if ((s = acsm->acsmStateTable[r].NextState[i]) != ACSM_FAIL_STATE) {
                queue_add(queue, s);
                fs = acsm->acsmStateTable[r].FailState;

                // Locate the next valid state for 'i' starting at s
                // Note the  variable "next"
                // Note "NextState[i]" is a const variable in this block
                while ((next=acsm->acsmStateTable[fs].NextState[i]) == ACSM_FAIL_STATE) {
                    fs = acsm->acsmStateTable[fs].FailState;
                }

                // Update 's' state failure state to point to the next valid state
                acsm->acsmStateTable[s].FailState = next;

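                // NOTE: patch contributed by a reader to handle overlapping patterns (e.g. "copy" vs "opy"):
                // copy the fail state's outputs into s so that suffix patterns are reported too.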
                ACSM_PATTERN *pat = acsm->acsmStateTable[next].MatchList;
                for (; pat != NULL; pat = pat->next) {
                    AddMatchListEntry(acsm, s, pat);
                }
            } else {
                acsm->acsmStateTable[r].NextState[i] = acsm->acsmStateTable[acsm->acsmStateTable[r].FailState].NextState[i];
            }
        }
    }

    // Clean up the queue
    queue_free(queue);
}


/*
 * Init the acsm DataStruct
 */
ACSM_STRUCT *
acsmNew()
{
    ACSM_STRUCT *p;
    
    // For shm share, Plz init this array in global
    //init_xlatcase();

    p = (ACSM_STRUCT *)AC_MALLOC(sizeof(ACSM_STRUCT));
    MEMASSERT(p, "acsmNew");
    if (p) {
        memset(p, 0, sizeof(ACSM_STRUCT));
    }

    return p;
}


/*
 * Add a pattern to the list of patterns for this state machine
 */
int
_acsmAddPattern(ACSM_STRUCT *p, unsigned char *pat, int n, int nocase, unsigned int id, void *data)
{
    ACSM_PATTERN  *plist;

    plist = (ACSM_PATTERN *)AC_MALLOC(sizeof (ACSM_PATTERN));
    MEMASSERT(plist, "acsmAddPattern");

    plist->patrn = (unsigned char *)AC_MALLOC(n+1);
    MEMASSERT(plist->patrn, "acsmAddPattern");
    memset(plist->patrn+n, 0, 1);
    ConvertCaseEx(plist->patrn, pat, n);

    plist->casepatrn = (unsigned char *)AC_MALLOC(n+1);
    MEMASSERT(plist->casepatrn, "acsmAddPattern");
    memset(plist->casepatrn + n, 0, 1);
    memcpy(plist->casepatrn, pat, n);

    plist->n = n;
    plist->nocase = nocase;
    plist->nmatch = 0;
    plist->id = id;
    plist->data = data;

    // Add the pattern into the pattern list
    plist->next = p->acsmPatterns;
    p->acsmPatterns = plist;

    return 0;
}

/*
 * Compile State Machine
 */
int
acsmCompile(ACSM_STRUCT *acsm)
{
    int           i, k;
    int           size;
    ACSM_PATTERN  *plist;

    // Count number of states, acsmMaxStates = total character of all patterns' sum + 1 (State 0)
    acsm->acsmMaxStates = 1;  // State 0
    for (plist = acsm->acsmPatterns; plist != NULL; plist = plist->next) {
        acsm->acsmMaxStates += plist->n;
    }

    size = sizeof(ACSM_STATETABLE) * acsm->acsmMaxStates;
    acsm->acsmStateTable = (ACSM_STATETABLE *)AC_MALLOC(size);
    if (acsm->acsmStateTable == NULL) {
        return -1;
    }
    memset(acsm->acsmStateTable, 0, size);

    // Initialize state zero as a branch
    acsm->acsmNumStates = 0;

    // Initialize all States NextStates to FAILED
    for (k = 0; k < acsm->acsmMaxStates; k++) {
        for (i = 0; i < ALPHABET_SIZE; i++) {
            acsm->acsmStateTable[k].NextState[i] = ACSM_FAIL_STATE;
        }
    }

    // Add each Pattern to the State Table
    for (plist = acsm->acsmPatterns; plist != NULL; plist = plist->next) {
        AddPatternStates(acsm, plist);
    }
#ifdef AC_DEBUG
    PrintGotoTable(acsm);
#endif

    // Set all failed state transitions which from state 0 to return to the 0'th state
    for (i = 0; i < ALPHABET_SIZE; i++) {
        if (acsm->acsmStateTable[0].NextState[i] == ACSM_FAIL_STATE) {
            acsm->acsmStateTable[0].NextState[i] = 0;
        }
    }

    // Build the DFA (fail links plus completed goto transitions)
    Build_DFA(acsm);

    return 0;
}


/* 64KB Memory */
static unsigned char Tc[64*1024];

/*
 *   Search Text or Binary Data for Pattern matches
 */
int
_acsmSearch(ACSM_STRUCT *acsm, unsigned char *Tx, int n, int match_full,
        unsigned int (*PrintMatch)(ACSM_PATTERN *pattern, ACSM_PATTERN *mlist, int index, void *data), void *data)
{
    int state;
    ACSM_PATTERN *mlist;
    unsigned char *Tend;
    ACSM_STATETABLE *StateTable = acsm->acsmStateTable;
    int nfound = 0;  // Number of matched pattern occurrences
    unsigned char *T;
    int index, i;
    unsigned int id;

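    // Case conversion into the static buffer; reject input that would overflow it
    if (n < 0 || (size_t)n > sizeof(Tc)) {
        return -1;
    }
    ConvertCaseEx(Tc, Tx, n);
    T = Tc;
    Tend = T + n;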

    for (state = 0; T < Tend; T++) {
        state = StateTable[state].NextState[*T];

        // State is a accept state?
        if (StateTable[state].MatchList != NULL) {
            for (mlist = StateTable[state].MatchList; mlist != NULL; mlist = mlist->next) {
                if (match_full && n != mlist->n) {
                    continue;
                }
                // Get the index of the Match Pattern String in the Text
                index = T - mlist->n + 1 - Tc;

                if (!mlist->nocase) {
                    for (i = 0; i < mlist->n; i++) {
                        if (Tx[index+i] != mlist->casepatrn[i]) {
                            goto CONTINUE;
                        }
                    } 
                }

                mlist->nmatch++;
                nfound++;

                if (PrintMatch != NULL) {
                    id = PrintMatch(acsm->acsmPatterns, mlist, index, data);
                    printf("id: %u\n", id);
                }
CONTINUE:
                ;
            }
        }
    }

    return nfound;
}


/*
 * Free all memory
 */
void
acsmFree(ACSM_STRUCT *acsm)
{
    int i;
    ACSM_PATTERN *mlist, *ilist;

    for (i = 0; i < acsm->acsmMaxStates; i++) {
        if (acsm->acsmStateTable[i].MatchList != NULL) {
            mlist = acsm->acsmStateTable[i].MatchList;
            while (mlist) {
                ilist = mlist;
                mlist = mlist->next;
                AC_FREE(ilist);
            }
        }
    }

    AC_FREE(acsm->acsmStateTable);
    mlist = acsm->acsmPatterns;
    while (mlist) {
        ilist = mlist;
        mlist = mlist->next;
        AC_FREE(ilist->patrn);
        AC_FREE(ilist->casepatrn);
        AC_FREE(ilist);
    }
    AC_FREE(acsm);
}

/*
 * Print A Match String's Information and return the pattern's id
 */
unsigned int
PrintMatch(ACSM_PATTERN *pattern, ACSM_PATTERN *mlist, int index, void *data)
{
    // Count the Each Match Pattern
    ACSM_PATTERN  *temp = pattern;

    for (; temp != NULL; temp = temp->next) {
        if (!strcmp((const char *)temp->patrn, (const char *)mlist->patrn)) {  //strcmp succeed return 0, So here use "!" operation
            temp->nmatch++;
        }
    }

    printf("Match caseKeyWord %s index: %d, id: %u, nmatch: %d\n", mlist->casepatrn, index, mlist->id, mlist->nmatch);

    return mlist->id;
}

void
PrintGotoTable(ACSM_STRUCT *acsm)
{
    int              i, n, m;
    ACSM_STATETABLE  *p_state_table;

    printf("\n-----------------------------------\n");
    printf("GotoTables:\n");
    printf("-----------------------------------\n");
    for (i = 0; i <= acsm->acsmNumStates; i++) {
        p_state_table = &acsm->acsmStateTable[i];
        printf("State[%d]'s Goto Table:\n", i);
        for (n = 0; n < 16; n ++) {
            for (m = 0; m < 16; m ++) {
                if (p_state_table->NextState[n * 16 + m] != 0 && p_state_table->NextState[n * 16 + m] != -1) {
                    printf("------   %c   ------\n", (n * 16 + m));
                    printf("| %2d | ----> | %2d |\n", i, p_state_table->NextState[n * 16 + m]); 
                    printf("------       ------\n");
                }
            }
        }
        printf("\n");
    }
    printf("-----------------------------------\n");
}

void
PrintFailTable(ACSM_STRUCT *acsm)
{
    int              i;
    ACSM_STATETABLE  *p_state_table;

    printf("\n-----------------------------------\n");
    printf("FailTable:\n");
    printf("-----------------------------------\n");
    for (i = 0; i <= acsm->acsmNumStates; i++) {
        p_state_table = &acsm->acsmStateTable[i];
        printf("State[%d] ----> State[%d]\n", i, p_state_table->FailState);
    }
    printf("-----------------------------------\n");
}

void
PrintOutputTable(ACSM_STRUCT *acsm)
{
    int              i;
    ACSM_STATETABLE  *p_state_table;
    ACSM_PATTERN     *p_pattern;

    printf("\n-----------------------------------\n");
    printf("OutputTable:\n");
    printf("-----------------------------------\n");
    for (i = 0; i <= acsm->acsmNumStates; i++) {
        p_state_table = &acsm->acsmStateTable[i];
        p_pattern = p_state_table->MatchList;
        if (p_pattern != NULL) {
            printf("State[%d]'s Output Table:\n{", i);
            for (; p_pattern != NULL; p_pattern = p_pattern->next) {
                printf(" %s ", p_pattern->casepatrn);
            }
            printf("}\n");
        }
    }
    printf("-----------------------------------\n");
}


int main(int argc, char **argv)
{
    int            i, nocase = 0, f=0, matchcount = 0;
    ACSM_STRUCT    *acsm;
    unsigned char  text[MAXLEN];

    if(argc < 3)
    {
        fprintf(stderr, "Usage: ./ac text word-1 word-2 ... word-n  -nocase\n");
        // e.g. ./ac usher hers his she he    // each pattern is inserted at the head of the list
        //      ./ac usher e hers his she he
        exit(1);
    }

    init_xlatcase();

    acsm = acsmNew();
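    if (strlen(argv[1]) >= sizeof(text)) {  // text[] holds at most MAXLEN - 1 bytes plus the terminator
        fprintf(stderr, "text too long (max %d bytes)\n", MAXLEN - 1);
        exit(1);
    }
    strcpy((char *)text, argv[1]);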
    for (i = 1; i < argc; i++) {
        if (strcmp(argv[i], "-nocase") == 0) {
            nocase = 1;
        }
    }

    for (i = 2; i < argc; i++) {
        if (argv[i][0] == '-') {
            continue;
        }
        printf("AddPattern: %.*s\n", (int)strlen(argv[i]), argv[i]);
        acsmAddPattern(acsm, (unsigned char *)argv[i], (int)strlen(argv[i]), nocase, i);
    }
    acsmCompile(acsm);

#ifdef AC_DEBUG
    //PrintGotoTable(acsm);  // not here, the Goto Table has been changed
    PrintFailTable(acsm);
    PrintOutputTable(acsm);
#endif

    matchcount = acsmSearchCB(acsm, text, (int)strlen((const char *)text), PrintMatch, NULL);
    acsmFree(acsm);
    printf("%d matched.\n", matchcount);

    return (0);
}


AC code analysis:
int
acsmCompile(ACSM_STRUCT *acsm)
{
    ......

    // Add each Pattern to the State Table
    for (plist = acsm->acsmPatterns; plist != NULL; plist = plist->next) {
        AddPatternStates(acsm, plist); {
            ......

            // This is an accepting state: add the pattern to the state's MatchList
            AddMatchListEntry(acsm, state, p);
            // this is where the output table is generated
        }
    }

    /* The code above builds the goto table and the output table:
     * it fills in NextState[ALPHABET_SIZE] (goto table) and MatchList (output table) for every State.
     *  typedef struct {                             // represents one State
     *      int           NextState[ALPHABET_SIZE];  // Next state - based on input character
     *      int           FailState;                 // Failure state - used while building NFA & DFA
     *      ACSM_PATTERN  *MatchList;                // List of patterns that end here, if any
     *  } ACSM_STATETABLE;
     *
     *  The states are stored in the structure below, which wastes some space
     *  (acsmNumStates + 1 < acsmMaxStates whenever patterns share a prefix)
     *  typedef struct {
     *      int              acsmMaxStates;    // upper bound: total length of all patterns + 1 (State 0)
     *      int              acsmNumStates;    // index of the last state in use
     *      ACSM_PATTERN     *acsmPatterns;    // list of all patterns
     *      ACSM_STATETABLE  *acsmStateTable;  // array holding the state table of every trie node
     *  } ACSM_STRUCT;
     */

    ......
}

static void
Build_DFA(ACSM_STRUCT *acsm)
{
    ......
    /* This is where the whole fail table is built; it is stored in the FailState field of each State
     *  typedef struct {                             // represents one State
     *      int           NextState[ALPHABET_SIZE];  // Next state - based on input character
     *      int           FailState;                 // Failure state - used while building NFA & DFA
     *      ACSM_PATTERN  *MatchList;                // List of patterns that end here, if any
     *  } ACSM_STATETABLE;
     */
     // Read the code against the figure:
     // [Figure: fail-transition graph]
    
}


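Notes on the AC code:
One debatable point is how the states are stored. The table is sized for acsmMaxStates (total pattern length + 1) entries, even though shared prefixes mean fewer states are actually used, so some space is wasted: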
int
acsmCompile(ACSM_STRUCT *acsm)
{
    int           i, k;
    int           size;
    ACSM_PATTERN  *plist;

    // Count number of states, acsmMaxStates = total character of all patterns' sum + 1 (State 0)
    acsm->acsmMaxStates = 1;  // State 0
    for (plist = acsm->acsmPatterns; plist != NULL; plist = plist->next) {
        acsm->acsmMaxStates += plist->n;
    }

    size = sizeof(ACSM_STATETABLE) * acsm->acsmMaxStates;
    acsm->acsmStateTable = (ACSM_STATETABLE *)AC_MALLOC(size);
    if (acsm->acsmStateTable == NULL) {
        return -1;
    }
    memset(acsm->acsmStateTable, 0, size);

    ......
}
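
To put a number on the waste, the sketch below compares the upper bound that acsmCompile allocates with the number of states a trie actually needs. The helper names max_states and used_states are assumptions made for this note; used_states counts distinct prefixes with a deliberately simple quadratic scan rather than building a trie.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Upper bound used by acsmCompile: total pattern length + 1 (State 0). */
size_t max_states(const char *const *pats, size_t npats)
{
    size_t total = 1;

    for (size_t i = 0; i < npats; i++) {
        total += strlen(pats[i]);
    }
    return total;
}

/* States a trie actually needs: one per distinct non-empty prefix, plus the
 * root. Quadratic on purpose, to keep the sketch short. */
size_t used_states(const char *const *pats, size_t npats)
{
    size_t used = 1;   /* the root */

    assert(pats != NULL);
    for (size_t i = 0; i < npats; i++) {
        size_t len = strlen(pats[i]);
        for (size_t l = 1; l <= len; l++) {
            int seen = 0;   /* did an earlier pattern already create this prefix? */
            for (size_t j = 0; j < i && !seen; j++) {
                seen = (strncmp(pats[i], pats[j], l) == 0);
            }
            if (!seen) {
                used++;
            }
        }
    }
    return used;
}
```

For "he", "she", "his", "hers" the bound is 2 + 3 + 3 + 4 + 1 = 13 states while only 10 are used. Each ACSM_STATETABLE entry holds 256 ints plus two more fields, roughly 1 KB on typical platforms (the listing's own notes say about 1024 bytes per pattern character), so the three unused entries cost about 3 KB here, and more for pattern sets with many shared prefixes.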


Refer:
"AC: the classic multi-pattern matching algorithm"
http://blog.csdn.net/ijuliet/article/details/4210858
"AC (Aho-Corasick) multi-pattern matching algorithm"
http://my.oschina.net/amince/blog/196426      (this blog is worth following)