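The AC algorithm is a classic multi-pattern matching algorithm published in 1975 by Alfred V. Aho (co-author of the "Dragon Book", Compilers: Principles, Techniques, and Tools) and Margaret J. Corasick, in the same period in which the KMP algorithm appeared. Given a text of length n and a pattern set P = {p1, p2, ..., pm}, once the automaton has been built it finds every occurrence of every pattern in a single O(n) scan of the text (plus the time needed to report the matches), independent of the number of patterns m.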
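To some extent, the AC algorithm can be seen as an extension of the KMP algorithm to the multi-pattern setting.
A brief overview of the KMP algorithm
A prefix of the pattern string may also occur elsewhere in the pattern as a non-prefix substring. KMP looks for the longest such prefix, and a non-prefix substring may itself contain several shorter prefixes.
KMP keeps an array called the prefix array, also known as the next array. When a mismatch occurs, it tells where in the pattern string the comparison against the target string should resume, that is, how far the pattern can be shifted forward. In the tables below, next[i] is the length of the longest proper prefix of pattern[0..i] that is also a suffix of pattern[0..i]. The value therefore measures how "symmetric" the substring is: the more symmetric it is, the larger the value, and the more of the previous partial match can be reused.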
Example 1
index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
pattern | a | b | c | a | b | c | a | c | a | b |
next | 0 | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 | 2 |
Example 2
index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
pattern | a | g | c | t | a | g | c | a | g | c | t | a | g | c | t | g |
next | 0 | 0 | 0 | 0 | 1 | 2 | 3 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 4 | 0 |
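In Example 2, the matched prefix a g c t a g c (length 7) itself contains a shorter prefix that is also its suffix, a g c (length 3). When the t at index 14 fails to extend the length-7 match, the computation falls back to that shorter prefix and extends it instead, so the next value of t (4) is smaller than that of the c before it (7).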
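The next array and its use on a mismatch can be sketched in Python as follows. This is a minimal sketch, not the author's code: the function names `prefix_function` and `kmp_search` are chosen here for illustration, and next is taken to be the "longest proper prefix that is also a suffix" length, which is what both tables above show.

```python
def prefix_function(pattern):
    """Return nxt, where nxt[i] is the length of the longest proper prefix
    of pattern[:i + 1] that is also a suffix of pattern[:i + 1]."""
    nxt = [0] * len(pattern)
    k = 0  # length of the prefix currently matched
    for i in range(1, len(pattern)):
        # On a mismatch, fall back to the next shorter candidate prefix.
        while k > 0 and pattern[i] != pattern[k]:
            k = nxt[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        nxt[i] = k
    return nxt


def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    nxt = prefix_function(pattern)
    hits = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = nxt[k - 1]  # reuse the part of the match that still fits
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = nxt[k - 1]  # keep going to find overlapping occurrences
    return hits


print(prefix_function("abcabcacab"))               # [0, 0, 0, 1, 2, 3, 4, 0, 1, 2]
print(kmp_search("abcabcabcacab", "abcabcacab"))   # [3]
```

The first printed list reproduces the next row of Example 1. The text pointer `i` never moves backwards; only `k` falls back, which is what keeps the scan linear.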
The AC automaton algorithm
An AC automaton is determined by three functions: the goto function, the failure function, and the output function.
Keyword Tree
A keyword tree (or trie) for a set of patterns P is a rooted tree K such that
- each edge of K is labeled by a character
- any two edges out of a node have different labels

Define the label of a node v as the concatenation of the edge labels on the path from the root to v, and denote it by L(v). Then
- for each p ∈ P there is a node v with L(v) = p, and
- the label L(v) of any leaf v equals some p ∈ P

The running example is the keyword tree for P = {he, she, his, hers}.
goto function
States: the nodes of the keyword tree
Initial state: 0 = the root
The goto function g(q, a) gives the state entered from state q on character a:
- if edge (q, v) is labeled by a, then g(q, a) = v;
- g(0, a) = 0 for each a that does not label an edge out of the root, so the automaton stays in the initial state while scanning non-matching characters;
- otherwise g(q, a) = ∅.
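The three cases above can be sketched in Python as follows. This is an illustrative sketch under a few assumptions: states are numbered in order of creation, the goto function is stored as one dictionary per state, ∅ is represented by `None`, and the names `build_goto` and `g` are hypothetical.

```python
def build_goto(patterns):
    """Build the keyword tree. goto[q] maps a character to the child state."""
    goto = [{}]      # state 0 is the root
    terminal = {}    # state -> pattern whose path ends at that state
    for p in patterns:
        q = 0
        for ch in p:
            if ch not in goto[q]:
                goto.append({})              # create a new state
                goto[q][ch] = len(goto) - 1
            q = goto[q][ch]
        terminal[q] = p
    return goto, terminal


def g(goto, q, a):
    """Goto function: the next state, or None where g(q, a) is the empty set."""
    if a in goto[q]:
        return goto[q][a]        # an edge (q, v) labeled a exists
    return 0 if q == 0 else None  # the root loops on itself, other states fail


goto, terminal = build_goto(["he", "she", "his", "hers"])
print(len(goto))        # 10 states, numbered 0..9
print(g(goto, 0, "x"))  # 0
print(g(goto, 1, "x"))  # None
print(terminal)         # {2: 'he', 5: 'she', 7: 'his', 9: 'hers'}
```

Inserting the patterns in the order he, she, his, hers gives states 1 = h, 2 = he, 3 = s, 4 = sh, 5 = she, 6 = hi, 7 = his, 8 = her, 9 = hers.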
failure function
The failure function f(q), for q ≠ 0, gives the state entered at a mismatch.
f(q) is the node labeled by the longest proper suffix w of L(q) such that w is a prefix of some pattern.
- A fail transition does not miss any potential occurrences.
- f(q) is always defined, since L(0) = ε is a prefix of any pattern.
Dashed arrows are fail transitions
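The text defines f(q) but does not say how to compute it. One common approach, shown here only as a sketch, is a breadth-first traversal of the keyword tree: the failure link of a child is found by following the failure links of its parent until some state has an edge with the same character. The names `build_goto` and `build_failure` are hypothetical, and states are assumed to be numbered in order of creation.

```python
from collections import deque


def build_goto(patterns):
    """Keyword tree as a list of dicts: goto[q][a] is the child of q on a."""
    goto = [{}]
    for p in patterns:
        q = 0
        for ch in p:
            if ch not in goto[q]:
                goto.append({})
                goto[q][ch] = len(goto) - 1
            q = goto[q][ch]
    return goto


def build_failure(goto):
    """Compute f(q) for every state by breadth-first traversal."""
    fail = [0] * len(goto)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        r = queue.popleft()
        for a, s in goto[r].items():
            queue.append(s)
            state = fail[r]
            # Follow failure links until a state with an a-edge, or the root.
            while state != 0 and a not in goto[state]:
                state = fail[state]
            fail[s] = goto[state].get(a, 0)
    return fail


fail = build_failure(build_goto(["he", "she", "his", "hers"]))
print(fail[1:])  # [0, 0, 0, 1, 2, 0, 3, 0, 3]
```

Because states are visited in order of depth, the failure link of a parent is always known before its children are processed.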
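q | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
incoming edge label | h | e | s | h | e | i | s | r | s |
f(q) | 0 | 0 | 0 | 1 | 2 | 0 | 3 | 0 | 3 |

output function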
the output function out(q) gives the set of patterns recognized when entering state q
q | out(q) |
---|---|
2 | {he} |
5 | {she, he} |
7 | {his} |
9 | {hers} |
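Putting the goto, failure, and output functions together gives a complete matcher. The following Python sketch is one possible arrangement, not a definitive implementation: the names `build_automaton` and `search` are hypothetical, out(q) is stored as a set per state, and each state inherits the outputs of its failure state, which is how state 5 comes to contain both she and he.

```python
from collections import deque


def build_automaton(patterns):
    """Return (goto, fail, out) for the given patterns."""
    goto, fail, out = [{}], [0], [set()]
    # Phase 1: keyword tree and the initial output sets.
    for p in patterns:
        q = 0
        for ch in p:
            if ch not in goto[q]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[q][ch] = len(goto) - 1
            q = goto[q][ch]
        out[q].add(p)
    # Phase 2: failure links, breadth-first, merging outputs along the way.
    queue = deque(goto[0].values())
    while queue:
        r = queue.popleft()
        for a, s in goto[r].items():
            queue.append(s)
            state = fail[r]
            while state != 0 and a not in goto[state]:
                state = fail[state]
            fail[s] = goto[state].get(a, 0)
            out[s] |= out[fail[s]]  # patterns that end as suffixes of L(s)
    return goto, fail, out


def search(text, patterns):
    """Return (start_index, pattern) for every occurrence found in text."""
    goto, fail, out = build_automaton(patterns)
    q = 0
    hits = []
    for i, ch in enumerate(text):
        while q != 0 and ch not in goto[q]:
            q = fail[q]            # fail transition on a mismatch
        q = goto[q].get(ch, 0)     # goto transition; the root loops on itself
        for p in sorted(out[q]):
            hits.append((i - len(p) + 1, p))
    return hits


print(search("ushers", ["he", "she", "his", "hers"]))
# [(2, 'he'), (1, 'she'), (2, 'hers')]
```

The text is read once from left to right and never re-scanned; all the per-pattern work happens while the automaton is built.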