Reading Notes on "Flexible Pattern Matching in Strings" (2): The AC Algorithm (Multi-Pattern Matching, Prefix-Based Matching)

Article link: https://blog.csdn.net/hnudlz/article/details/46533335

by m4trix

The AC algorithm: the Aho-Corasick algorithm (Alfred V. Aho & Margaret J. Corasick)

Alfred V. Aho is also one of the authors of "Compilers: Principles, Techniques, and Tools" (the Dragon Book)!

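Problem:
Given n words and an article of m characters, how many of the words occur in the article?
Refer:

Hangzhou Dianzi University ACM problem: Keywords Search
http://acm.hdu.edu.cn/showproblem.php?pid=2222

The AC algorithm is designed for exactly this kind of problem: locating, in a target text T, every occurrence of each pattern p in a pattern set P.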

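Continuing:

Suppose that at the current text position we have already found the longest string that is both a suffix of the text read so far, T = t₁t₂t₃...t_i, and a prefix of some pattern p^k in the pattern set P. The key question is how to update this longest string each time a new text character is read. The AC algorithm gives the answer.

To understand the AC algorithm you first need a thorough understanding of the KMP algorithm. For that, see my earlier note: "Reading Notes on 'Flexible Pattern Matching in Strings' (1): The KMP Algorithm (Single-Pattern Matching, Prefix-Based Matching)"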

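The AC algorithm extends the KMP algorithm to the multi-pattern setting (both came out of studying the same question: how the prefixes of a pattern overlap with the pattern itself):
The KMP algorithm: single-pattern matching, prefix-based matching

KMP has only one pattern. When part of that pattern has been matched, KMP looks for the length of the longest string that is both a proper suffix of the matched part and a prefix of it.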
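The "longest suffix that is also a prefix" idea can be made concrete with a small sketch of the KMP failure function. This is only an illustration, not code from the book or from Snort; the name `kmp_failure` and the 0-based `fail` array are my own choices.

```c
#include <assert.h>
#include <string.h>

/* fail[i] = length of the longest proper prefix of pat[0..i]
 * that is also a suffix of pat[0..i]. */
static void kmp_failure(const char *pat, int *fail)
{
    int m = (int)strlen(pat);
    int k = 0;                      /* length of the current border */

    assert(m > 0);
    fail[0] = 0;
    for (int i = 1; i < m; i++) {
        while (k > 0 && pat[i] != pat[k]) {
            k = fail[k - 1];        /* fall back to the next shorter border */
        }
        if (pat[i] == pat[k]) {
            k++;                    /* the border grows by one character */
        }
        fail[i] = k;
    }
}
```

For the pattern "ababcab" this fills `fail` with 0 0 1 2 0 1 2. The AC fail table plays the same role, except that the prefix may belong to any pattern in the set.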

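The AC algorithm: multi-pattern matching, prefix-based matching

Because AC extends KMP to multiple patterns, when part of some pattern has been matched it looks for the length of the longest string that is both a proper suffix of the matched part and a prefix of some pattern in the pattern set.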

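The AC algorithm:

Step 1: Preprocess the pattern set into a finite state automaton (described by three tables: the goto table, the fail table, and the output table);

Step 2: Feed the text T to the automaton as input; it reports which patterns occur and where they occur in the text.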

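Preprocessing stage:

The classic AC algorithm is built from three tables, all of which are computed during preprocessing:
(1) The goto table: the state transition automaton (a trie) built from all the patterns in the pattern set P;

e.g.

Pattern set: P = {p₀, p₁, p₂, p₃}

p₀ = "he"

p₁ = "she"

p₂ = "his"

p₃ = "hers"

(Figure: state transition automaton)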
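Building the goto table amounts to inserting each pattern into a trie. Below is a minimal sketch under my own assumptions: patterns contain only lowercase letters, states are numbered in creation order, and the names `goto_tbl`, `trie_init`, and `trie_insert` are made up for illustration.

```c
#include <assert.h>

#define ALPHA      26    /* assumption: patterns use only 'a'..'z' */
#define MAX_STATES 32    /* assumption: enough for this small example */
#define NO_EDGE    (-1)

static int goto_tbl[MAX_STATES][ALPHA];
static int num_states;

static void trie_init(void)
{
    for (int s = 0; s < MAX_STATES; s++) {
        for (int c = 0; c < ALPHA; c++) {
            goto_tbl[s][c] = NO_EDGE;
        }
    }
    num_states = 1;                  /* state 0 is the root */
}

/* Insert one pattern and return the state in which it ends. */
static int trie_insert(const char *pat)
{
    int state = 0;

    for (; *pat != '\0'; pat++) {
        int c = *pat - 'a';
        assert(c >= 0 && c < ALPHA);
        if (goto_tbl[state][c] == NO_EDGE) {
            assert(num_states < MAX_STATES);
            goto_tbl[state][c] = num_states++;   /* create a new state */
        }
        state = goto_tbl[state][c];
    }
    return state;
}
```

Inserting "he", "she", "his", "hers" in that order creates states 1 to 9, with the four patterns ending in states 2, 5, 7, and 9, which is the numbering used in the rest of this note.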

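(2) The fail table: suppose we are in some state of the automaton and the next input character (a character of the text T) has no goto transition from that state, i.e. the match fails. Which state should we jump to in order to continue matching? We do not always have to go back to State 0 and start over. Using the principle stated earlier (take the longest string that is both a proper suffix of the matched part and a prefix of some pattern in the set), we can construct the fail table: a mapping that tells the automaton where to jump when a match fails.

The fail table for the automaton above, drawn as a graph:

(Figure: fail transitions between states)

The fail table:

state i:  1  2  3  4  5  6  7  8  9
f(i):     0  0  0  1  2  0  3  0  3

From the figure we can see:
For the partially matched string "sh", the suffix "h" is a prefix of the patterns "he", "his", and "hers", and it is the longest such suffix, which gives the mapping f(4) = 1;

For the partially matched string "she", the suffix "he" is a prefix of the patterns "he" and "hers", and it is the longest such suffix, which gives the mapping f(5) = 2;

For the partially matched string "his", the suffix "s" is a prefix of the pattern "she", and it is the longest such suffix, which gives the mapping f(7) = 3;

For the partially matched string "hers", the suffix "s" is a prefix of the pattern "she", and it is the longest such suffix, which gives the mapping f(9) = 3;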
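The four mappings above can be checked straight from the definition, without building the automaton. The sketch below is a deliberately brute-force illustration (the function name `fail_length` is my own); real implementations compute the fail table with a breadth-first traversal instead.

```c
#include <assert.h>
#include <string.h>

/* Length of the longest proper suffix of `matched` that is also a prefix
 * of at least one pattern; 0 means "fall back to State 0". */
static int fail_length(const char *matched, const char **pats, int npats)
{
    int m = (int)strlen(matched);

    assert(npats > 0);
    /* Try suffixes from the longest (m - 1 characters) to the shortest. */
    for (int len = m - 1; len >= 1; len--) {
        const char *suffix = matched + (m - len);
        for (int p = 0; p < npats; p++) {
            if (strncmp(suffix, pats[p], (size_t)len) == 0) {
                return len;          /* suffix is a prefix of pats[p] */
            }
        }
    }
    return 0;
}
```

With the patterns "he", "she", "his", "hers", the call for "she" returns 2 (the prefix "he" is State 2) and the call for "sh" returns 1 (the prefix "h" is State 1), matching f(5) = 2 and f(4) = 1.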

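(3) The output table: a relation between states and patterns. When the automaton reaches a given state, some patterns in the set may have just been matched completely; the output table maps each state to the patterns that are complete in that state.

In the example above, the output table is: output(2) = {"he"}, output(5) = {"she", "he"}, output(7) = {"his"}, output(9) = {"hers"}; every other state has an empty output.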
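One way to obtain the output table is to record each pattern at the state where it ends and then let every state inherit the output of its fail state during the breadth-first construction. The following is a sketch under my own assumptions (lowercase-only patterns, at most 16 patterns stored as bits in an `unsigned`, invented names such as `build_tables` and `out_tbl`); it is not the Snort code.

```c
#include <assert.h>

#define ALPHA      26    /* assumption: patterns use only 'a'..'z' */
#define MAX_STATES 32    /* assumption: enough for the demo */
#define NO_EDGE    (-1)

static int      goto_tbl[MAX_STATES][ALPHA];
static int      fail_tbl[MAX_STATES];
static unsigned out_tbl[MAX_STATES];  /* bit p set: pattern p is reported here */
static int      num_states;

static void build_tables(const char **pats, int npats)
{
    int queue[MAX_STATES], head = 0, tail = 0;

    assert(npats <= 16);
    for (int s = 0; s < MAX_STATES; s++) {
        out_tbl[s] = 0;
        for (int c = 0; c < ALPHA; c++) goto_tbl[s][c] = NO_EDGE;
    }
    num_states = 1;

    /* Goto table, plus the pattern that ends exactly at each state. */
    for (int p = 0; p < npats; p++) {
        int state = 0;
        for (const char *ch = pats[p]; *ch != '\0'; ch++) {
            int c = *ch - 'a';
            assert(c >= 0 && c < ALPHA);
            if (goto_tbl[state][c] == NO_EDGE) {
                assert(num_states < MAX_STATES);
                goto_tbl[state][c] = num_states++;
            }
            state = goto_tbl[state][c];
        }
        out_tbl[state] |= 1u << p;
    }

    /* Breadth-first pass: depth-1 states fail to the root. */
    fail_tbl[0] = 0;
    for (int c = 0; c < ALPHA; c++) {
        int s = goto_tbl[0][c];
        if (s != NO_EDGE) { fail_tbl[s] = 0; queue[tail++] = s; }
    }
    while (head < tail) {
        int r = queue[head++];
        for (int c = 0; c < ALPHA; c++) {
            int s = goto_tbl[r][c];
            if (s == NO_EDGE) continue;
            int f = fail_tbl[r];
            while (f != 0 && goto_tbl[f][c] == NO_EDGE) f = fail_tbl[f];
            fail_tbl[s] = (goto_tbl[f][c] != NO_EDGE) ? goto_tbl[f][c] : 0;
            out_tbl[s] |= out_tbl[fail_tbl[s]];   /* inherit shorter matches */
            queue[tail++] = s;
        }
    }
}
```

With the patterns passed in the order "he", "she", "his", "hers", State 5 ends up with the bits for both "she" and "he", because its fail state is State 2, where "he" ends.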

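Matching stage:

Feed the characters of the text T into the automaton one at a time:

if (matching) {
    look up the goto table and enter the corresponding state, State i;
    check output(i):
    if (output(i) is not empty) {
        report the match position;
    }
} else { // fail
    look up the fail table and enter State fail(i);
    retry the current character from that state;
}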

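e.g.
Take the text T = "ushers" as an example:

Start from State 0 of the automaton.

Read 'u': the goto table leads back to State 0; continue.

Read 's': goto State 3; output(3) is empty, so no pattern has matched yet; continue.

Read 'h': goto State 4; output(4) is empty, so no pattern has matched yet; continue.

Read 'e': goto State 5; output(5) is {"she", "he"}, so the patterns "she" and "he" have matched; report their positions in the text; continue.

Read 'r': fail; fail(5) == 2, so jump to State 2; continue.

Retry 'r': goto State 8; output(8) is empty, so no pattern has matched; continue.

Read 's': goto State 9; output(9) is {"hers"}, so the pattern "hers" has matched; report its position in the text; continue.

The input text ends and the whole matching process is finished.

Result:
The text T = "ushers" matches the patterns "she", "he", and "hers", each reported with its position in the text (omitted here).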
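The walk-through can be replayed with the three tables written out by hand. This is a sketch only: the tables are hard-coded for the "he"/"she"/"his"/"hers" example, positions are 0-based start indices, and the names `goto_state` and `ac_search` are my own.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NUM_STATES 10
#define NO_EDGE    (-1)

/* Goto edges of the example automaton: {from, character, to}. */
static const struct { int from; char ch; int to; } edges[] = {
    {0, 'h', 1}, {1, 'e', 2}, {0, 's', 3}, {3, 'h', 4}, {4, 'e', 5},
    {1, 'i', 6}, {6, 's', 7}, {2, 'r', 8}, {8, 's', 9},
};
static const int fail_tbl[NUM_STATES] = {0, 0, 0, 0, 1, 2, 0, 3, 0, 3};
/* Patterns reported in each state; unused slots are NULL. */
static const char *const output_tbl[NUM_STATES][3] = {
    [2] = {"he"}, [5] = {"she", "he"}, [7] = {"his"}, [9] = {"hers"},
};

static int goto_state(int state, char ch)
{
    assert(state >= 0 && state < NUM_STATES);
    for (size_t i = 0; i < sizeof edges / sizeof edges[0]; i++) {
        if (edges[i].from == state && edges[i].ch == ch) return edges[i].to;
    }
    return state == 0 ? 0 : NO_EDGE;   /* State 0 loops on unknown characters */
}

/* Scan text, print every match, and return the number of matches.
 * final_state (if not NULL) receives the state after the last character. */
static int ac_search(const char *text, int *final_state)
{
    int state = 0, found = 0;

    for (int i = 0; text[i] != '\0'; i++) {
        while (goto_state(state, text[i]) == NO_EDGE) {
            state = fail_tbl[state];           /* fail: retry the same character */
        }
        state = goto_state(state, text[i]);
        for (int k = 0; k < 3 && output_tbl[state][k] != NULL; k++) {
            const char *pat = output_tbl[state][k];
            int start = i - (int)strlen(pat) + 1;   /* 0-based start index */
            printf("\"%s\" at index %d\n", pat, start);
            found++;
        }
    }
    if (final_state != NULL) *final_state = state;
    return found;
}
```

Calling `ac_search("ushers", NULL)` prints "she" at index 1, "he" at index 2, and "hers" at index 2, and returns 3. The text pointer only moves forward; a failure changes the state, never the text position.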

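Advantages of the AC algorithm:

Advantage 1: the text is scanned without any backtracking;

Advantage 2: scanning a text of length n takes O(n) time (plus the time needed to report the matches), independent of the number and lengths of the patterns. Every character of the text has to be fed to the automaton, so the best and worst cases are both O(n). Including preprocessing, the total is O(M + n), where M is the total length of all patterns.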

AC code reference:

See the files acsmx.h & acsmx.c from the open-source Snort intrusion detection system:

acsmx.h

/*
 * Copyright (C) 2002 Martin Roesch <roesch@sourcefire.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 */


/*
 * ACSMX.H
 */
#ifndef ACSMX_H
#define ACSMX_H

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Prototypes
 */
#define MAXLEN           256
#define ALPHABET_SIZE    256
#define ACSM_FAIL_STATE  -1

//#define AC_DEBUG  1

// represents one pattern string
typedef struct _acsm_pattern {
    struct _acsm_pattern  *next;
    unsigned char         *patrn;      // pattern string converted to upper case, e.g. "ABABCAB"
    unsigned char         *casepatrn;  // pattern string in its original case,    e.g. "aBAbCab"
    int                   n;           // the length of the pattern string
    int                   nocase;      // 0 = case-sensitive, 1 = case-insensitive
    int                   nmatch;
    unsigned int          id;
    void                  *data;       //user self-data
} ACSM_PATTERN;

// represents one state (one node of the trie)
typedef struct {
    int           NextState[ALPHABET_SIZE];  // Next state - based on input character          goto table
    int           FailState;                 // Failure state - used while building NFA & DFA  fail table
    ACSM_PATTERN  *MatchList;                // List of patterns that end here, if any         output table
} ACSM_STATETABLE;

/*
 * State machine Struct
 */
typedef struct {
    int              acsmMaxStates;    // upper bound on states: total length of all patterns + 1 (State 0)
    int              acsmNumStates;    // number of states created so far (not counting State 0)
    ACSM_PATTERN     *acsmPatterns;    // list of all patterns
    ACSM_STATETABLE  *acsmStateTable;  // array with one entry per trie state
} ACSM_STRUCT;

/*
 * Prototypes
 */
int _acsmAddPattern(ACSM_STRUCT *p, unsigned char *pat, int n, int nocase, unsigned int id, void *data);
int _acsmSearch(ACSM_STRUCT *acsm, unsigned char *Tx, int n, int match_full, unsigned int (*PrintMatch)(ACSM_PATTERN *pattern, ACSM_PATTERN *mlist, int index, void *data), void *data);

void init_xlatcase();                                                                               // API
ACSM_STRUCT *acsmNew();                                                                             // API
int acsmCompile(ACSM_STRUCT *acsm);                                                                 // API
void acsmFree(ACSM_STRUCT *acsm);                                                                   // API
unsigned int PrintMatch(ACSM_PATTERN *pattern, ACSM_PATTERN *mlist, int index, void *data);         // API
void PrintSummary(ACSM_PATTERN *pattern);                                                           // API
void PrintGotoTable(ACSM_STRUCT *acsm);                                                             // API
void PrintFailTable(ACSM_STRUCT *acsm);                                                             // API
void PrintOutputTable(ACSM_STRUCT *acsm);                                                           // API
#define acsmAddPattern(p, pat, n, nocase, id)         _acsmAddPattern(p, pat, n, nocase, id, NULL)  // API
#define acsmAddPattern2(p, pat, n, nocase, data, id)  _acsmAddPattern(p, pat, n, nocase, id, data)  // API
#define acsmSearch(acsm, Tx, n)                       _acsmSearch(acsm, Tx, n, 0, NULL, NULL)       // API
#define acsmSearchCB(acsm, Tx, n, match_fn, data)     _acsmSearch(acsm, Tx, n, 0, match_fn, data)   // API
#define acsmFullMatch(acsm, Tx, n)                    _acsmSearch(acsm, Tx, n, 1, NULL, NULL)       // API
#define acsmFullMatchCB(acsm, Tx, n, match_fn, data)  _acsmSearch(acsm, Tx, n, 1, match_fn, data)   // API

#endif

acsmx.c

/*
 * Multi-Pattern Search Engine
 *
 * Aho-Corasick State Machine - uses a Deterministic Finite Automata - DFA
 *
 * Copyright (C) 2002 Sourcefire, Inc.
 * Marc Norton
 *
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 *
 *
 * Reference - Efficient String matching: An Aid to Bibliographic Search
 *             Alfred V Aho and Margaret J Corasick
 *             Bell Laboratories
 *             Copyright(C) 1975 Association for Computing Machinery, Inc.
 *
 * Implemented from the 4 algorithms in the paper by Aho & Corasick
 * and some implementation ideas from 'Practical Algorithms in C'
 *
 * Notes:
 *     1) This version uses about 1024 bytes per pattern character - heavy on the memory.
 *     2) This algorithm finds all occurrences of all patterns within a
 *        body of text.
 *     3) Support is included to handle upper and lower case matching.
 *     4) Some compilers optimize the search routine well, others don't, this makes all the difference.
 *     5) Aho inspects all bytes of the search text, but only once so it's very efficient,
 *        if the patterns are all large then the Modified Wu-Manber method is often faster.
 *     6) I don't subscribe to any one method is best for all searching needs,
 *        the data decides which method is best,
 *        and we don't know until after the search method has been tested on the specific data sets.
 *
 *  May   2002: Marc Norton 1st Version
 *  June  2002: Modified interface for SNORT, added case support
 *  Aug   2002: Cleaned up comments, and removed dead code.
 *  Nov 2,2002: Fixed queue_init(), added count = 0
 *
 *  Wangyao : wangyao@cs.hit.edu.cn
 *
 *  Apr 24,2007: WangYao Combined Build_NFA() and Convert_NFA_To_DFA() into Build_DFA();
 *                       And Delete Some redundancy Code
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include "acsmx.h"

//#include <ngx_core.h>

#define MEMASSERT(p,s)  do { if (!(p)) { fprintf(stderr, "ACSM-No Memory: %s!\n", s); exit(EXIT_FAILURE); } } while (0)


//-------------------------------------------------------------------------------
//static void  *ngx_waf_ac_shm_pool = NULL;

/*
 * Malloc the AC Memory From shm pool
 */
static void *
AC_MALLOC(int n)
{
    void *p;

    p = malloc(n);
    //p = ngx_slab_alloc((ngx_slab_pool_t *)ngx_waf_ac_shm_pool, n);

    return p;
}

/*
 * Free the AC Memory to shm pool
 */
static void
AC_FREE(void *p)
{
    if (p) {
        free(p);
        //ngx_slab_free((ngx_slab_pool_t *)ngx_waf_ac_shm_pool, p);
    }
}
//-------------------------------------------------------------------------------




//-------------------------------------------------------------------------------
/*
 * Single Linked List
 *
 * Empty Queue:
 *   S
 * -------
 * |head |--> NULL
 * -------
 * |tail |--> NULL
 * -------
 * |count| = 0
 * -------
 *
 * Queue with nodes:
 *  ------------------------------------
 *  |                                   |
 *  |    S      ----> Node      Node    --->  Node
 *  |  -------  |    -------   -------       -------
 *  |  |head |---    |next |-->|next |------>|next |-->NULL
 *  |  -------       -------   -------       -------
 *  ---|tail |       |state|   |state|       |state|
 *     -------       -------   -------       -------
 *     |count| = 3
 *     -------
 *
 *  Used for the breadth-first traversal of states in Build_DFA().
 */

/*
 * Simple QUEUE NODE
 */
typedef struct _qnode
{
    int            state;
    struct _qnode  *next;
} QNODE;

/*
 * Simple QUEUE Structure
 */
typedef struct _queue
{
    QNODE  *head, *tail;
    int    count;
} QUEUE;

/*
 *Init the Queue
 */
static void
queue_init(QUEUE *s)
{
    s->head = s->tail = 0;
    s->count = 0;
}

/*
 * Add Tail Item to queue
 */
static void
queue_add(QUEUE *s, int state)
{
    QNODE  *q;
    
    if (!s->head) {                 // Queue is empty
        q = s->tail = s->head = (QNODE *)AC_MALLOC(sizeof(QNODE));
        MEMASSERT(q, "queue_add");  // if malloc failed, exit the program
        q->state = state;
        q->next = 0;                // Set the New Node's Next Null
    } else {
        q = (QNODE *)AC_MALLOC(sizeof(QNODE));
        MEMASSERT(q, "queue_add");
        q->state = state;
        q->next = 0;
        s->tail->next = q;  // Add the new Node into the queue
        s->tail = q;        // set the new node is the Queue's Tail
    }
    s->count++;
}

/*
 * Remove Head Item from queue
 */
static int
queue_remove(QUEUE *s)
{
    int    state = 0;
    QNODE  *q;
    
    // Remove A QueueNode From the head of the Queue
    if (s->head) {
        q = s->head;
        state = q->state;
        s->head = s->head->next;
        s->count--;

        // If Queue is Empty, After Remove A QueueNode
        if (!s->head) {
            s->tail = 0;
            s->count = 0;
        }

        // Free the QueNode Memory
        AC_FREE(q);
    }
    return state;
}

/*
 * Return The count of the Node in the Queue
 */
static int
queue_count(QUEUE *s)
{
    return s->count;
}

/*
 * Free the Queue Memory
 */
static void
queue_free(QUEUE *s)
{
    while (queue_count(s)) {
        queue_remove(s);
    }
}
//-------------------------------------------------------------------------------




/*
 * Case Translation Table
 */
static unsigned char xlatcase[256];

/*
 * Init the xlatcase Table,Trans alpha to UpperMode
 * Just for the NoCase State
 */
void
init_xlatcase()
{
    int i;

    for (i = 0; i < 256; i++) {
        xlatcase[i] = toupper (i);
    }
}

/*
 * Convert the pattern string into upper
 */
static void
ConvertCaseEx(unsigned char *d, unsigned char *s, int m)
{
    int i;

    for (i = 0; i < m; i++) {
        d[i] = xlatcase[s[i]];
    }
}

/*
 * Add a pattern to the list of patterns terminated at this state.
 * Insert at front of list.
 */
static void
AddMatchListEntry(ACSM_STRUCT *acsm, int state, ACSM_PATTERN *px)
{
    ACSM_PATTERN  *p;

    p = (ACSM_PATTERN *)AC_MALLOC(sizeof(ACSM_PATTERN));
    MEMASSERT(p, "AddMatchListEntry");
    memcpy(p, px, sizeof(ACSM_PATTERN));

    // Add the new pattern to the pattern list
    p->next = acsm->acsmStateTable[state].MatchList;
    acsm->acsmStateTable[state].MatchList = p;
}

/*
 * Add Pattern States
 */
static void
AddPatternStates(ACSM_STRUCT *acsm, ACSM_PATTERN *p)
{
    unsigned char  *pattern;
    int            state = 0, next, n;
    
    n = p->n;            // The number of alpha in the pattern string
    pattern = p->patrn;

    // Match up pattern with existing states
    for (; n > 0; pattern++, n--) {
        next = acsm->acsmStateTable[state].NextState[*pattern];
        if (next == ACSM_FAIL_STATE) {
            break;
        }
        state = next;
    }

    // Add new states for the rest of the pattern bytes, 1 state per byte
    for (; n > 0; pattern++, n--) {
        acsm->acsmNumStates ++;
        acsm->acsmStateTable[state].NextState[*pattern] = acsm->acsmNumStates;
        state = acsm->acsmNumStates;
    }

    // Here is an accept state: add the pattern to the MatchList of this state
    AddMatchListEntry(acsm, state, p);
}


/*
 * Build Deterministic Finite Automata
 */
static void
Build_DFA(ACSM_STRUCT *acsm)
{
    int    r, s;
    int    i;
    QUEUE  q, *queue = &q;

    // Init a Queue
    queue_init(queue);

    // Add the state 0 transitions 1st
    // 1st depth Node's FailState is 0, fail(x)=0
    for (i = 0; i < ALPHABET_SIZE; i++) {
        s = acsm->acsmStateTable[0].NextState[i];
        if (s) {
            queue_add(queue, s);
            acsm->acsmStateTable[s].FailState = 0;
        }
    }

    // Build the fail state transitions for each valid state
    while (queue_count(queue) > 0) {
        r = queue_remove(queue);

        // Find Final States for any Failure
        for (i = 0; i < ALPHABET_SIZE; i++) {
            int fs, next;
            // Note NextState[i] is a const variable in this block
            if ((s = acsm->acsmStateTable[r].NextState[i]) != ACSM_FAIL_STATE) {
                queue_add(queue, s);
                fs = acsm->acsmStateTable[r].FailState;

                // Locate the next valid state for 'i' starting at s
                // Note the  variable "next"
                // Note "NextState[i]" is a const variable in this block
                while ((next=acsm->acsmStateTable[fs].NextState[i]) == ACSM_FAIL_STATE) {
                    fs = acsm->acsmStateTable[fs].FailState;
                }

                // Update 's' state failure state to point to the next valid state
                acsm->acsmStateTable[s].FailState = next;

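                // NOTE: thanks to a reader-contributed patch that fixes overlapping patterns (e.g. "copy" vs "opy"):
                // state s also inherits every pattern that matches at its fail state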
                ACSM_PATTERN *pat = acsm->acsmStateTable[next].MatchList;
                for (; pat != NULL; pat = pat->next) {
                    AddMatchListEntry(acsm, s, pat);
                }
            } else {
                acsm->acsmStateTable[r].NextState[i] = acsm->acsmStateTable[acsm->acsmStateTable[r].FailState].NextState[i];
            }
        }
    }

    // Clean up the queue
    queue_free(queue);
}


/*
 * Init the acsm DataStruct
 */
ACSM_STRUCT *
acsmNew()
{
    ACSM_STRUCT *p;
    
    // For shm share, Plz init this array in global
    //init_xlatcase();

    p = (ACSM_STRUCT *)AC_MALLOC(sizeof(ACSM_STRUCT));
    MEMASSERT(p, "acsmNew");
    if (p) {
        memset(p, 0, sizeof(ACSM_STRUCT));
    }

    return p;
}


/*
 * Add a pattern to the list of patterns for this state machine
 */
int
_acsmAddPattern(ACSM_STRUCT *p, unsigned char *pat, int n, int nocase, unsigned int id, void *data)
{
    ACSM_PATTERN  *plist;

    plist = (ACSM_PATTERN *)AC_MALLOC(sizeof (ACSM_PATTERN));
    MEMASSERT(plist, "acsmAddPattern");

    plist->patrn = (unsigned char *)AC_MALLOC(n+1);
    MEMASSERT(plist->patrn, "acsmAddPattern");
    memset(plist->patrn+n, 0, 1);
    ConvertCaseEx(plist->patrn, pat, n);

    plist->casepatrn = (unsigned char *)AC_MALLOC(n+1);
    MEMASSERT(plist->casepatrn, "acsmAddPattern");
    memset(plist->casepatrn + n, 0, 1);
    memcpy(plist->casepatrn, pat, n);

    plist->n = n;
    plist->nocase = nocase;
    plist->nmatch = 0;
    plist->id = id;
    plist->data = data;

    // Add the pattern into the pattern list
    plist->next = p->acsmPatterns;
    p->acsmPatterns = plist;

    return 0;
}

/*
 * Compile State Machine
 */
int
acsmCompile(ACSM_STRUCT *acsm)
{
    int           i, k;
    int           size;
    ACSM_PATTERN  *plist;

    // Count number of states, acsmMaxStates = total character of all patterns' sum + 1 (State 0)
    acsm->acsmMaxStates = 1;  // State 0
    for (plist = acsm->acsmPatterns; plist != NULL; plist = plist->next) {
        acsm->acsmMaxStates += plist->n;
    }

    size = sizeof(ACSM_STATETABLE) * acsm->acsmMaxStates;
    acsm->acsmStateTable = (ACSM_STATETABLE *)AC_MALLOC(size);
    if (acsm->acsmStateTable == NULL) {
        return -1;
    }
    memset(acsm->acsmStateTable, 0, size);

    // Initialize state zero as a branch
    acsm->acsmNumStates = 0;

    // Initialize all States NextStates to FAILED
    for (k = 0; k < acsm->acsmMaxStates; k++) {
        for (i = 0; i < ALPHABET_SIZE; i++) {
            acsm->acsmStateTable[k].NextState[i] = ACSM_FAIL_STATE;
        }
    }

    // Add each Pattern to the State Table
    for (plist = acsm->acsmPatterns; plist != NULL; plist = plist->next) {
        AddPatternStates(acsm, plist);
    }
#ifdef AC_DEBUG
    PrintGotoTable(acsm);
#endif

    // Set all failed state transitions which from state 0 to return to the 0'th state
    for (i = 0; i < ALPHABET_SIZE; i++) {
        if (acsm->acsmStateTable[0].NextState[i] == ACSM_FAIL_STATE) {
            acsm->acsmStateTable[0].NextState[i] = 0;
        }
    }

    // Build the DFA (fail table plus completed goto table)
    Build_DFA(acsm);

    return 0;
}


/* 64KB Memory */
static unsigned char Tc[64*1024];

/*
 *   Search Text or Binary Data for Pattern matches
 */
int
_acsmSearch(ACSM_STRUCT *acsm, unsigned char *Tx, int n, int match_full,
        unsigned int (*PrintMatch)(ACSM_PATTERN *pattern, ACSM_PATTERN *mlist, int index, void *data), void *data)
{
    int state;
    ACSM_PATTERN *mlist;
    unsigned char *Tend;
    ACSM_STATETABLE *StateTable = acsm->acsmStateTable;
    int nfound = 0;  // Number of pattern occurrences found (matched)
    unsigned char *T;
    int index, i;
    unsigned int id;

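    // Case conversion (Tc is a fixed 64KB buffer, so reject inputs that do not fit)
    if (n < 0 || n > (int)sizeof(Tc)) {
        return -1;
    }
    ConvertCaseEx(Tc, Tx, n);
    T = Tc;
    Tend = T + n;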

    for (state = 0; T < Tend; T++) {
        state = StateTable[state].NextState[*T];

        // State is a accept state?
        if (StateTable[state].MatchList != NULL) {
            for (mlist = StateTable[state].MatchList; mlist != NULL; mlist = mlist->next) {
                if (match_full && n != mlist->n) {
                    continue;
                }
                // Get the index of the Match Pattern String in the Text
                index = T - mlist->n + 1 - Tc;

                if (!mlist->nocase) {
                    for (i = 0; i < mlist->n; i++) {
                        if (Tx[index+i] != mlist->casepatrn[i]) {
                            goto CONTINUE;
                        }
                    } 
                }

                mlist->nmatch++;
                nfound++;

                if (PrintMatch != NULL) {
                    id = PrintMatch(acsm->acsmPatterns, mlist, index, data);
                    printf("id: %u\n", id);
                }
CONTINUE:
                ;
            }
        }
    }

    return nfound;
}


/*
 * Free all memory
 */
void
acsmFree(ACSM_STRUCT *acsm)
{
    int i;
    ACSM_PATTERN *mlist, *ilist;

    for (i = 0; i < acsm->acsmMaxStates; i++) {
        if (acsm->acsmStateTable[i].MatchList != NULL) {
            mlist = acsm->acsmStateTable[i].MatchList;
            while (mlist) {
                ilist = mlist;
                mlist = mlist->next;
                AC_FREE(ilist);
            }
        }
    }

    AC_FREE(acsm->acsmStateTable);
    mlist = acsm->acsmPatterns;
    while (mlist) {
        ilist = mlist;
        mlist = mlist->next;
        AC_FREE(ilist->patrn);
        AC_FREE(ilist->casepatrn);
        AC_FREE(ilist);
    }
    AC_FREE(acsm);
}

/*
 * Print A Match String's Information and return the pattern's id
 */
unsigned int
PrintMatch(ACSM_PATTERN *pattern, ACSM_PATTERN *mlist, int index, void *data)
{
    // Count the Each Match Pattern
    ACSM_PATTERN  *temp = pattern;

    for (; temp != NULL; temp = temp->next) {
        if (!strcmp((const char *)temp->patrn, (const char *)mlist->patrn)) {  //strcmp succeed return 0, So here use "!" operation
            temp->nmatch++;
        }
    }

    printf("Match caseKeyWord %s index: %d, id: %u, nmatch: %d\n", mlist->casepatrn, index, mlist->id, mlist->nmatch);

    return mlist->id;
}

void
PrintGotoTable(ACSM_STRUCT *acsm)
{
    int              i, n, m;
    ACSM_STATETABLE  *p_state_table;

    printf("\n-----------------------------------\n");
    printf("GotoTables:\n");
    printf("-----------------------------------\n");
    for (i = 0; i <= acsm->acsmNumStates; i++) {
        p_state_table = &acsm->acsmStateTable[i];
        printf("State[%d]'s Goto Table:\n", i);
        for (n = 0; n < 16; n ++) {
            for (m = 0; m < 16; m ++) {
                if (p_state_table->NextState[n * 16 + m] != 0 && p_state_table->NextState[n * 16 + m] != -1) {
                    printf("------   %c   ------\n", (n * 16 + m));
                    printf("| %2d | ----> | %2d |\n", i, p_state_table->NextState[n * 16 + m]); 
                    printf("------       ------\n");
                }
            }
        }
        printf("\n");
    }
    printf("-----------------------------------\n");
}

void
PrintFailTable(ACSM_STRUCT *acsm)
{
    int              i;
    ACSM_STATETABLE  *p_state_table;

    printf("\n-----------------------------------\n");
    printf("FailTable:\n");
    printf("-----------------------------------\n");
    for (i = 0; i <= acsm->acsmNumStates; i++) {
        p_state_table = &acsm->acsmStateTable[i];
        printf("State[%d] ----> State[%d]\n", i, p_state_table->FailState);
    }
    printf("-----------------------------------\n");
}

void
PrintOutputTable(ACSM_STRUCT *acsm)
{
    int              i;
    ACSM_STATETABLE  *p_state_table;
    ACSM_PATTERN     *p_pattern;

    printf("\n-----------------------------------\n");
    printf("OutputTable:\n");
    printf("-----------------------------------\n");
    for (i = 0; i <= acsm->acsmNumStates; i++) {
        p_state_table = &acsm->acsmStateTable[i];
        p_pattern = p_state_table->MatchList;
        if (p_pattern != NULL) {
            printf("State[%d]'s Output Table:\n{", i);
            for (; p_pattern != NULL; p_pattern = p_pattern->next) {
                printf(" %s ", p_pattern->casepatrn);
            }
            printf("}\n");
        }
    }
    printf("-----------------------------------\n");
}


int main(int argc, char **argv)
{
    int            i, nocase = 0, matchcount = 0;
    ACSM_STRUCT    *acsm;
    unsigned char  text[MAXLEN];

    if (argc < 3) {
        fprintf(stderr, "Usage: ./ac text word-1 word-2 ... word-n  -nocase\n");
        // e.g. ./ac usher hers his she he
        //      ./ac usher e hers his she he
        // (each pattern is inserted at the head of the list, so patterns are compiled in reverse order)
        exit(1);
    }

    init_xlatcase();

    acsm = acsmNew();

    // Bounded copy: a long argv[1] must not overflow text[]
    strncpy((char *)text, argv[1], MAXLEN - 1);
    text[MAXLEN - 1] = '\0';

    for (i = 1; i < argc; i++) {
        if (strcmp(argv[i], "-nocase") == 0) {
            nocase = 1;
        }
    }

    for (i = 2; i < argc; i++) {
        if (argv[i][0] == '-') {
            continue;
        }
        // "%.*s" expects an int precision, so cast the size_t returned by strlen()
        printf("AddPattern: %.*s\n", (int)strlen(argv[i]), argv[i]);
        acsmAddPattern(acsm, (unsigned char *)argv[i], (int)strlen(argv[i]), nocase, i);
    }
    acsmCompile(acsm);

#ifdef AC_DEBUG
    //PrintGotoTable(acsm);  // not here, the Goto Table has been changed
    PrintFailTable(acsm);
    PrintOutputTable(acsm);
#endif

    matchcount = acsmSearchCB(acsm, text, (int)strlen((char *)text), PrintMatch, NULL);
    acsmFree(acsm);
    printf("%d matched.\n", matchcount);

    return (0);
}

Analysis of the AC code:

int
acsmCompile(ACSM_STRUCT *acsm)
{
    ......

    // Add each Pattern to the State Table
    for (plist = acsm->acsmPatterns; plist != NULL; plist = plist->next) {
        AddPatternStates(acsm, plist); {
            ......
            // Here is an accept state: add the pattern to the MatchList of this state
            AddMatchListEntry(acsm, state, p);
            // this is where the output table is generated
            // (Build_DFA later appends the patterns inherited from the fail state)
        }
    }

    /* The code above generates the goto table and the output table:
     * it fills in NextState[ALPHABET_SIZE] (goto table) and MatchList (output table) for every state.
     *
     * typedef struct {                  // the structure that represents one state
     *     int NextState[ALPHABET_SIZE]; // Next state - based on input character
     *     int FailState;                // Failure state - used while building NFA & DFA
     *     ACSM_PATTERN *MatchList;      // List of patterns that end here, if any
     * } ACSM_STATETABLE;
     *
     * Each state is stored in this structure, which wastes some space
     * (because acsmNumStates < acsmMaxStates).
     *
     * typedef struct {
     *     int acsmMaxStates;               // total length of all patterns + 1 (State 0)
     *     int acsmNumStates;               // number of states in the trie
     *     ACSM_PATTERN *acsmPatterns;      // list of all patterns
     *     ACSM_STATETABLE *acsmStateTable; // array with one entry per trie state
     * } ACSM_STRUCT;
     */

    ......
}

static void
Build_DFA(ACSM_STRUCT *acsm)
{
    ......

    /* This is where the whole fail table is built; it is stored in the FailState field of each state.
     *
     * typedef struct {                  // the structure that represents one state
     *     int NextState[ALPHABET_SIZE]; // Next state - based on input character
     *     int FailState;                // Failure state - used while building NFA & DFA
     *     ACSM_PATTERN *MatchList;      // List of patterns that end here, if any
     * } ACSM_STATETABLE;
     */

    // Read this code alongside the fail-transition diagram (figure not preserved).
}
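The idea behind Build_DFA can be shown at a smaller scale: in the same breadth-first pass that computes the fail table, every missing goto entry of a state is filled with the entry of its fail state, so the search loop needs no fail lookups at all. The sketch below is my own simplified version (lowercase-only alphabet, invented names `delta`, `fail_of`, `build_dfa`, `run_dfa`), not the Snort code itself.

```c
#include <assert.h>

#define ALPHA      26    /* assumption: lowercase 'a'..'z' only */
#define MAX_STATES 32    /* assumption: enough for the demo */
#define NO_EDGE    (-1)

static int delta[MAX_STATES][ALPHA];  /* goto table, completed into a DFA */
static int fail_of[MAX_STATES];
static int num_states;

static void build_dfa(const char **pats, int npats)
{
    int queue[MAX_STATES], head = 0, tail = 0;

    for (int s = 0; s < MAX_STATES; s++)
        for (int c = 0; c < ALPHA; c++) delta[s][c] = NO_EDGE;
    num_states = 1;

    /* 1. Goto table: insert every pattern into the trie. */
    for (int p = 0; p < npats; p++) {
        int state = 0;
        for (const char *ch = pats[p]; *ch != '\0'; ch++) {
            int c = *ch - 'a';
            assert(c >= 0 && c < ALPHA);
            if (delta[state][c] == NO_EDGE) {
                assert(num_states < MAX_STATES);
                delta[state][c] = num_states++;
            }
            state = delta[state][c];
        }
    }

    /* 2. Root: missing edges loop back to State 0; depth-1 states fail to 0. */
    fail_of[0] = 0;
    for (int c = 0; c < ALPHA; c++) {
        int s = delta[0][c];
        if (s == NO_EDGE) {
            delta[0][c] = 0;
        } else {
            fail_of[s] = 0;
            queue[tail++] = s;
        }
    }

    /* 3. BFS: rows of shallower states are complete before deeper ones. */
    while (head < tail) {
        int r = queue[head++];
        for (int c = 0; c < ALPHA; c++) {
            int s = delta[r][c];
            if (s != NO_EDGE) {
                fail_of[s] = delta[fail_of[r]][c];   /* row is already full */
                queue[tail++] = s;
            } else {
                delta[r][c] = delta[fail_of[r]][c];  /* precomputed fail jump */
            }
        }
    }
}

/* Feed text through the DFA and return the final state. */
static int run_dfa(const char *text)
{
    int state = 0;

    for (; *text != '\0'; text++) {
        int c = *text - 'a';
        assert(c >= 0 && c < ALPHA);
        state = delta[state][c];     /* exactly one table lookup per character */
    }
    return state;
}
```

For the example patterns, `delta[5]['r' - 'a']` becomes 8: the "fail to State 2, then goto State 8" step from the "ushers" walk-through is folded into a single transition. The price is a full row of transitions for every state.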

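Notes on the AC code:
One debatable point is the storage strategy used here, which wastes some space: the state table is sized for the worst case (total length of all patterns + 1), although patterns that share a prefix also share states: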
int
acsmCompile(ACSM_STRUCT *acsm)
{
    int           i, k;
    int           size;
    ACSM_PATTERN *plist;

    // Count number of states, acsmMaxStates = total character of all patterns' sum + 1 (State 0)
    acsm->acsmMaxStates = 1; // State 0
    for (plist = acsm->acsmPatterns; plist != NULL; plist = plist->next) {
        acsm->acsmMaxStates += plist->n;
    }

    size = sizeof(ACSM_STATETABLE) * acsm->acsmMaxStates;
    acsm->acsmStateTable = (ACSM_STATETABLE *)AC_MALLOC(size);
    if (acsm->acsmStateTable == NULL) {
        return -1;
    }
    memset(acsm->acsmStateTable, 0, size);

......

}
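To get a feel for the size of the gap, the sketch below counts how many trie states a pattern set really needs and compares that with the worst-case bound that acsmCompile allocates. The helper name `count_states` and the fixed `MAX_STATES` limit are my own assumptions for the demo.

```c
#include <assert.h>
#include <string.h>

#define ALPHABET_SIZE 256
#define MAX_STATES    64   /* assumption: large enough for the demo */

static int next_state[MAX_STATES][ALPHABET_SIZE];

/* Returns the number of trie states actually needed (State 0 included) and
 * stores the worst-case bound (total pattern length + 1) in *upper_bound. */
static int count_states(const char **pats, int npats, int *upper_bound)
{
    int used = 1;                                 /* State 0 */

    *upper_bound = 1;
    /* 0 means "no edge": State 0 is never the target of a trie edge. */
    memset(next_state, 0, sizeof next_state);
    for (int p = 0; p < npats; p++) {
        int state = 0;
        *upper_bound += (int)strlen(pats[p]);
        for (const unsigned char *c = (const unsigned char *)pats[p]; *c != '\0'; c++) {
            if (next_state[state][*c] == 0) {
                assert(used < MAX_STATES);
                next_state[state][*c] = used++;   /* a new state is needed */
            }
            state = next_state[state][*c];
        }
    }
    return used;
}
```

For "he", "she", "his", "hers" the bound is 13 while only 10 states are used, because "he", "his", and "hers" share the prefix "h" and "hers" reuses all of "he". Each unused slot still costs a full NextState array, about 1024 bytes when int is 4 bytes, which fits the "about 1024 bytes per pattern character" remark in the acsmx.c header.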

