AC: the classic multi-pattern matching algorithm

WINCOL

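Today let's talk about the multi-pattern matching algorithm AC (Aho and Corasick). Thanks to 追风侠 for helping put the material together, while(1) {Juliet.say("3Q");}. Earlier we studied the BM and Wu-Manber algorithms. WM is derived from BM, but AC has nothing to do with either of them and follows a different matching idea.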

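1. A first look at the AC algorithm

Step1: Build a finite state automaton from the set of patterns (we want to match several patterns at once, after all).

Step2: Feed the text to be searched into the automaton as input, and output which patterns occur and where they occur in the text.

The behaviour of the automaton is defined by three parts:

(1) a goto function

(2) a failure function

(3) an output function

Let's first get a feel for these three parts, and for how the automaton runs, through a concrete example. A rough impression is enough for now; the details follow later. The pattern set is {he, she, his, hers}, and we want to search for its members in "ushers".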

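(1) goto function

[Figure: goto graph for {he, she, his, hers}. The transitions used in the rest of this post are: 0 -h-> 1 -e-> 2 -r-> 8 -s-> 9; 1 -i-> 6 -s-> 7; 0 -s-> 3 -h-> 4 -e-> 5; every other character read in state 0 leads back to state 0.]

(2) failure function

i     1  2  3  4  5  6  7  8  9

f(i)  0  0  0  1  2  0  3  0  3

(Notice anything? The string that leads to f(i) is a suffix of the string that leads to i. For example, state 5 spells she and f(5)=2 spells he ^_^)

(3) output function

i   output(i)

2   {he}

5   {she, he}

7   {his}

9   {hers}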

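We start in state 0 and read the first character of the text, u. The goto function sends us back to state 0. The second character, s, takes us to state 3; output(3) is empty, so no pattern has been matched yet. Next comes h, which leads to state 4, and output(4) is empty as well. Then e moves us to state 5, where output(5) contains two patterns, she and he, so we report both together with their position in the text. The next character is r, but g(5,r)=fail, so we consult the failure function: f(5)=2, so we move to state 2 and try r again, which takes us to state 8. Finally s takes us to state 9, where we report output(9)={hers} and record the position. That completes the scan of ushers.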

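The matching algorithm itself is as follows:

Algorithm 1. Pattern matching machine

Input: text and M. text is x=a1a2...an, and M is the pattern matching automaton (made up of the goto function g(), the failure function f() and the output function output()).

Output: the patterns that occur in text, together with their positions.

    state ← 0
    for i ← 1 until n do                 // consume character ai of text
        while g(state, ai) = fail do
            state ← f(state)             // back off until we can move on; state 0 can always move on
        state ← g(state, ai)             // take the next state
        if output(state) ≠ empty then    // report whatever can be reported
            print i
            print output(state)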
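To make Algorithm 1 concrete, here is a minimal Python sketch that runs it on the example. The three tables are typed in by hand from the {he, she, his, hers} example rather than built automatically, and the names GOTO, FAILURE, OUTPUT, g and search are made up for this illustration.

```python
# Tables for the patterns {he, she, his, hers}, copied by hand from the example.
GOTO = {
    (0, "h"): 1, (0, "s"): 3,
    (1, "e"): 2, (1, "i"): 6,
    (2, "r"): 8,
    (3, "h"): 4,
    (4, "e"): 5,
    (6, "s"): 7,
    (8, "s"): 9,
}
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
OUTPUT = {2: ["he"], 5: ["she", "he"], 7: ["his"], 9: ["hers"]}


def g(state, ch):
    """goto function: None stands for fail; state 0 never fails."""
    nxt = GOTO.get((state, ch))
    if nxt is None and state == 0:
        return 0
    return nxt


def search(text):
    """Return (end_position, pattern) pairs; positions are 1-based as in Algorithm 1."""
    matches = []
    state = 0
    for i, ch in enumerate(text, start=1):
        while g(state, ch) is None:      # back off along the failure function
            state = FAILURE[state]
        state = g(state, ch)             # take the goto transition
        for pat in OUTPUT.get(state, []):
            matches.append((i, pat))
    return matches


print(search("ushers"))  # [(4, 'she'), (4, 'he'), (6, 'hers')]
```

The reported position is the index of the last character of each match, which is what "print i" in Algorithm 1 yields.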

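The matching phase of AC takes O(n) state transitions, regardless of how many patterns there are or how long they are. Every character of the text has to be fed into the automaton, so the best case and the worst case are both O(n). The failure transitions do not break this bound: each goto transition raises the depth of the current state by at most 1 and each failure transition lowers it by at least 1, so there cannot be more failure transitions than goto transitions, which gives fewer than 2n transitions in total. (Printing the matches costs additional time proportional to the number of occurrences reported.) Adding the preprocessing time gives O(M+n), where M is the total length of the patterns.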

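2. Building the three tables

OK, now let's see how to build the three functions above from the pattern set.

The construction has two stages:

(1) we determine the states and build the goto function

(2) we compute the failure function

The output function is built up over the course of both stages.

2.1 Filling in goto and output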

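Again we build things step by step on the example pattern set {he, she, his, hers}.

First we insert the pattern he:

[Figure: after inserting he. 0 -h-> 1 -e-> 2, output(2)={he}]

Then we insert she:

[Figure: after inserting she. 0 -s-> 3 -h-> 4 -e-> 5, output(5)={she}]

Next comes his. State 0 already reaches state 1 on h, so no new state is needed for that character. This is much like building a trie: patterns with a common prefix share the same path.

[Figure: after inserting his. 1 -i-> 6 -s-> 7, output(7)={his}]

Finally we insert hers:

[Figure: after inserting hers. 2 -r-> 8 -s-> 9, output(9)={hers}]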

The algorithm for building the goto function is:

Algorithm 2. Construction of the goto function

Input: a set of patterns K={y1,y2,...,yk}

Output: the goto function g and a partially computed output function output

We assume output(s) is empty when state s is first created, and g(s,a)=fail if a is undefined or if g(s,a) has not yet been defined. The procedure enter(y) inserts into the goto graph a path that spells out y.

    newstate ← 0
    for i ← 1 until k do                  // run enter(yi) for every pattern to insert it into the automaton
        enter(yi)
    for all a such that g(0,a) = fail do
        g(0,a) ← 0

    procedure enter(a1a2...am)
    {
        state ← 0; j ← 1
        while g(state,aj) ≠ fail do       // reuse the existing path as far as it goes; after that the for loop below extends it
            state ← g(state,aj)
            j ← j+1
        for p ← j until m do              // extend with new states
            newstate ← newstate+1
            g(state,ap) ← newstate
            state ← newstate
        output(state) ← {a1a2...am}       // state is the state reached once the whole pattern has been inserted
    }
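As a rough illustration of Algorithm 2, the sketch below builds the goto table and the first half of the output table in Python. The dictionary layout and the name build_goto are choices made for this example, not part of the original algorithm. The final loop that sets g(0,a)=0 is left out here and is assumed to be handled at lookup time, by treating a missing entry at state 0 as a transition back to state 0. The sketch also adds a bounds check on j that the pseudocode leaves implicit.

```python
def build_goto(patterns):
    """Insert every pattern into a trie-shaped goto table.

    Returns (goto, output, number_of_states), where goto maps
    (state, char) -> state and a missing key means fail.
    """
    goto = {}
    output = {}      # state -> list of patterns that end exactly at that state
    newstate = 0

    for pat in patterns:
        state, j = 0, 0
        # Reuse the existing path as long as there is one.
        while j < len(pat) and (state, pat[j]) in goto:
            state = goto[(state, pat[j])]
            j += 1
        # Extend the path with new states for the remaining characters.
        for ch in pat[j:]:
            newstate += 1
            goto[(state, ch)] = newstate
            state = newstate
        output.setdefault(state, []).append(pat)

    return goto, output, newstate + 1


goto, output, n_states = build_goto(["he", "she", "his", "hers"])
print(n_states)  # 10
print(output)    # {2: ['he'], 5: ['she'], 7: ['his'], 9: ['hers']}
```

Inserting the patterns in the order he, she, his, hers reproduces the state numbers used throughout the example; a different insertion order would give a different, equally valid numbering.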

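2.2 Filling in failure and output

Building the failure function (this part is a bit abstract):

Note that state 0 has no failure value. Define the depth of a state as the length of the shortest path from state 0 to it in the goto graph. We start by setting f(s)=0 for every state s of depth 1, and then proceed by induction: the failure values of all states of depth d are derived from those of the states of depth less than d.

More precisely, when computing the states of depth d we look at every state r of depth d-1:

1. If g(r,a)=fail for every character a, do nothing. As I understand it, r is then a leaf of the trie.

2. Otherwise, for each a with g(r,a)=s, perform the following three steps:

(a) Set state=f(r) // state is the fallback state of r, i.e. it spells the longest proper suffix of r's string that is also a prefix of some pattern

(b) Execute state=f(state) zero or more times, until g(state,a)!=fail (such a state always exists, because g(0,a)!=fail) // we must find a way out, a state from which we can move on

(c) Set f(s)=g(state,a). In other words, f(s) is itself a state that is reached from some state by reading the character a.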

Worked example:

First we set f(1)=f(3)=0 for the states of depth 1, then we look at the states of depth 2: 2, 6 and 4.

To compute f(2), we set state=f(1)=0; since g(0,e)=0, f(2)=0.

To compute f(6), we set state=f(1)=0; since g(0,i)=0, f(6)=0.

To compute f(4), we set state=f(3)=0; since g(0,h)=1, f(4)=1.

Then we look at the states of depth 3: 8, 7 and 5.

To compute f(8), we set state=f(2)=0; since g(0,r)=0, f(8)=0.

To compute f(7), we set state=f(6)=0; since g(0,s)=3, f(7)=3.

To compute f(5), we set state=f(4)=1; since g(1,e)=2, f(5)=2.

Finally we look at state 9, the only state of depth 4.

To compute f(9), we set state=f(8)=0; since g(0,s)=3, f(9)=3.

The algorithm:

Algorithm 3. Construction of the failure function

Input: goto function g and output function output from Algorithm 2

Output: failure function f and output function output

    queue ← empty
    for each a such that g(0,a) = s ≠ 0 do        // this is really a breadth-first search (BFS)
        queue ← queue ∪ {s}
        f(s) ← 0
    while queue ≠ empty do
        r ← pop(queue)                             // r is the state at the head of the queue
        for each a such that g(r,a) = s ≠ fail do  // r can move on a
            queue ← queue ∪ {s}                    // so follow it
            state ← f(r)                           // the fallback state of r
            while g(state,a) = fail do             // a non-fail state is always found, since at least g(0,a) does not fail
                state ← f(state)
            f(s) ← g(state,a)                      // OK, this is the fallback state of s, i.e. f(s)
            output(s) ← output(s) ∪ output(f(s))   // try it on g(4,e)=5 from the example: output(5) ∪ output(2) = {she, he}
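The breadth-first construction of Algorithm 3 can be sketched in Python as follows. To keep the example self-contained, the goto table and the intermediate output table for {he, she, his, hers} are written out by hand; the names GOTO, OUTPUT, g and build_failure are assumptions of this sketch.

```python
from collections import deque

# goto table and intermediate output table for {he, she, his, hers},
# as Algorithm 2 would produce them.
GOTO = {
    (0, "h"): 1, (0, "s"): 3, (1, "e"): 2, (1, "i"): 6, (2, "r"): 8,
    (3, "h"): 4, (4, "e"): 5, (6, "s"): 7, (8, "s"): 9,
}
OUTPUT = {2: ["he"], 5: ["she"], 7: ["his"], 9: ["hers"]}


def g(goto, state, ch):
    """goto lookup: None stands for fail; state 0 never fails."""
    nxt = goto.get((state, ch))
    if nxt is None and state == 0:
        return 0
    return nxt


def build_failure(goto, output):
    """Return (failure, completed_output) following Algorithm 3."""
    children = {}
    for (state, ch), nxt in goto.items():
        children.setdefault(state, []).append((ch, nxt))

    failure = {}
    out = {s: list(pats) for s, pats in output.items()}
    queue = deque()

    for ch, s in children.get(0, []):          # states of depth 1
        failure[s] = 0
        queue.append(s)

    while queue:
        r = queue.popleft()
        for ch, s in children.get(r, []):
            queue.append(s)
            state = failure[r]
            while g(goto, state, ch) is None:  # always stops: state 0 never fails
                state = failure[state]
            failure[s] = g(goto, state, ch)
            inherited = out.get(failure[s], [])
            if inherited:                      # merge the outputs of the fallback state
                out.setdefault(s, []).extend(inherited)
    return failure, out


failure, out = build_failure(GOTO, OUTPUT)
print(failure[5], out[5])  # 2 ['she', 'he']
```

Because states are processed in order of depth, output(f(s)) is already complete by the time it is merged into output(s).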

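2.3 output

The construction of the output function is covered by Algorithms 2 and 3.

3. Optimizing the algorithm

Improvement 1: the failure function produced by Algorithm 3 is not yet optimal.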

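We can see that g(4,e)=5. Suppose the machine is in state 4 and the current character is t, with t!=e. Since g(4,t)=fail, it follows f(4)=1 and moves to state 1. But we already know that t!=e, so there is no point in testing state 1's transition on e again. If e were the only transition out of state 1 (that is, if the pattern his were not in the set), the machine could skip state 1 altogether and go straight from state 4 to state f(1)=0. With his present, state 1 also has a transition on i, so the stop at state 1 is still needed in this example.

This avoids unnecessary state transitions, much in the spirit of the KMP algorithm. To that end a new failure function, f1, is defined:

f1(1)=0,

for i>1, if every character that causes a goto transition out of state f(i) also causes one out of state i, then f1(i)=f1(f(i)),

otherwise f1(i)=f(i).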
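The definition of f1 can be tried out with a short Python sketch. The function name improved_failure and the table names are invented for this illustration. Two hand-written table sets are used: set A is the running example {he, she, his, hers}, and set B is a hypothetical variant without his, {he, she, hers}, numbered the way Algorithm 2 would number it. Since state 0 moves on every character, the sketch assumes the alphabet is larger than the patterns' character set and never skips state 0.

```python
def improved_failure(goto, failure):
    """Compute f1 from goto ((state, char) -> state) and failure (state -> state)."""
    moves = {}                       # state -> characters with a goto transition
    for (state, ch) in goto:
        moves.setdefault(state, set()).add(ch)

    f1 = {}

    def resolve(i):
        if i not in f1:
            target = failure[i]
            # Skip target only if every character that moves target also moves i.
            if target != 0 and moves.get(target, set()) <= moves.get(i, set()):
                f1[i] = resolve(target)
            else:
                f1[i] = target
        return f1[i]

    for i in failure:
        resolve(i)
    return f1


# Set A: {he, she, his, hers}
GOTO_A = {(0, "h"): 1, (0, "s"): 3, (1, "e"): 2, (1, "i"): 6, (2, "r"): 8,
          (3, "h"): 4, (4, "e"): 5, (6, "s"): 7, (8, "s"): 9}
FAIL_A = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}

# Set B (hypothetical): {he, she, hers}, so state 1 only moves on e
GOTO_B = {(0, "h"): 1, (0, "s"): 3, (1, "e"): 2, (2, "r"): 6,
          (3, "h"): 4, (4, "e"): 5, (6, "s"): 7}
FAIL_B = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3}

F1_A = improved_failure(GOTO_A, FAIL_A)
F1_B = improved_failure(GOTO_B, FAIL_B)
print(F1_A[4], F1_B[4])  # 1 0
```

For set A nothing changes, because state 1 still moves on i. For set B the stop at state 1 is skipped and f1(4)=0.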

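Improvement 2: in the goto function, not every state has a transition for every character. Whenever a transition is fail, we have to look up the failure function and try again from another state. Instead, we can use the goto function and the failure function to build a deterministic finite automaton, the next move function, in which every state has a transition for every character. That makes the failure function unnecessary at matching time.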
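One way to build such a next move function is the breadth-first sketch below. It is an illustration under assumptions, not a reference implementation: the tables for {he, she, his, hers} are typed in by hand, the names DELTA, build_next_move and search_dfa are invented, and the table only covers the characters that occur in the patterns. Any other character is assumed to lead to state 0, which is where the failure chain would end up anyway.

```python
from collections import deque

GOTO = {(0, "h"): 1, (0, "s"): 3, (1, "e"): 2, (1, "i"): 6, (2, "r"): 8,
        (3, "h"): 4, (4, "e"): 5, (6, "s"): 7, (8, "s"): 9}
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
OUTPUT = {2: ["he"], 5: ["she", "he"], 7: ["his"], 9: ["hers"]}


def build_next_move(goto, failure):
    """Return delta: (state, char) -> state for every state and pattern character."""
    alphabet = sorted({ch for (_, ch) in goto})
    delta = {}
    queue = deque()
    for ch in alphabet:
        s = goto.get((0, ch), 0)
        delta[(0, ch)] = s
        if s != 0:
            queue.append(s)
    while queue:
        r = queue.popleft()
        for ch in alphabet:
            s = goto.get((r, ch))
            if s is not None:
                queue.append(s)
                delta[(r, ch)] = s
            else:
                # failure[r] is shallower than r, so its row is already filled in.
                delta[(r, ch)] = delta[(failure[r], ch)]
    return delta


DELTA = build_next_move(GOTO, FAILURE)


def search_dfa(text):
    """Scan text with exactly one table lookup per character."""
    matches, state = [], 0
    for i, ch in enumerate(text, start=1):
        state = DELTA.get((state, ch), 0)   # characters outside the patterns go to state 0
        for pat in OUTPUT.get(state, []):
            matches.append((i, pat))
    return matches


print(DELTA[(5, "r")])       # 8
print(search_dfa("ushers"))  # [(4, 'she'), (4, 'he'), (6, 'hers')]
```

The trade-off is memory: the table holds one entry per state and character, in exchange for a scan with no inner while loop.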