自动机的思维导图

自动机有很多用途,不光是编译器会用到,众所周知的还有正则表达式。其他,简单的如字符串搜索,存储等,复杂点的如模型验证,定理证明……

最简单的自动机应该算是DFA了,再往简单了说,一个DFA就是一个 map<int,map<char,int>>, 其中 int 是 state id,char是转移标号;如果有人还嫌不够简单,最最简单的那就是 int [][256] 了,就是一个256列的矩阵。就这么简单的一个东西,却蕴含着非常深刻的思想,有着非常广泛的用途。

几个月前因为项目需要,再加上自己的兴趣,实现了一系列自动机相关的算法和数据结构(用C++11写的),其结果和成就感远超我当初一开始的想象和预期,因为有好几个算法我的实现远远胜于任何已有的实现,不光运行速度快得多,内存占用也小得多。今天有点闲暇,整理一个思维导图,以后有时间再慢慢逐个描述。

mind map 最好打开新窗口,再缩放着看,在这里只能看到很小的一部分。


DFA
1 DFA Minimization
1.1 General DFA Minimization
1.1.1 Hopcroft Algorithm
1.1.1.1 Partition Refinement
1.1.1.2 A mutation to the splitter generator, for performance
1.1.1.3 Waiting set strategy
1.1.1.3.1 Stack is faster
1.1.1.3.2 Queue is slower
1.1.1.4 Intersections of Partitions are all empty
1.1.1.5 * Efficient implementation needs double linked list * Use array index as the link
1.2 Path compression This is not DFA minimization, it's just an optimization. A sequence of states which all have just 1 transition. First state of the sequence couldn't be a final state. Last state of the sequence couldn't be a confluence state.
2 ADFA Acyclic DFA
2.1 Acyclic DFA Minimization General Minimization Algorithm works for this case, but there are dedicated algorithm, which needs less memory, sometimes faster
2.1.1 Sorted input
2.1.2 Random input
2.1.3 MinADFA or DAWG
2.2 Don't support the mapping Use it as a SET This is a dynamic SET
2.3 Identify Each Word Could be used as a MAP
2.3.1 The word id is the Dictionary ordinal of the word in the DAWG The word<=>id mapping is a bidirectional mapping
2.3.2 CLTK Implementaion
2.3.2.1 Save a number in the State.
2.3.2.2 The number is the total pathes from the state to all reachable final state.
2.3.2.3 Pros: Number update for random inserting word could be efficient enough
2.3.2.4 Cons: Mapping a word to its (dictinary) ordinal is not fast enough: O(|Σ|*strlen(word))
2.3.3 Febird Implementation
2.3.3.1 Save a number in the Transition
2.3.3.2 Let s be the source of the transition. Let c be the label char of the transition. The number saved in the transition is the total paths from s which label char is less than c.
2.3.3.2.1 The unique single transition of a state was saved in the state's self, and the number is always 0, thus it could be omited
2.3.3.2.2 In applications, there are about 50% states just have a single transition
2.3.3.2.3 For states which has multiple transitions, the number of its first transtion is always 0, this 0 is not omited, to omit this 0, implementation code would be much complex and nasty
2.3.3.3 Pros: Mapping a word to its dictionary ordinal is fast: O(strlen(word))
2.3.3.4 Cons: Number update for dynamic inserting word couldn't be efficient
3 Aho-Corasick Multiple Pattern Matching
3.1 Combine With Numbered ADFA
3.2 Word set storage in Adjacent Difference Use less memory
4 Febird Implementation: * Many states have 1 single transition * Few states have a few transitions * Very few states have many transitions * Almost all states have a small range of label char
4.1 struct State32 { // There are other impl uint32_t spos; char_t minch; char_t maxch; // inclusive unsigned term_bit : 1; unsigned pzip_bit : 1; }; // if minch==maxch, spos is the target state_id // else spos is the offset to targets map in mempool // the targets map consist a bitmap and targets id
4.2 All states are saved in a vector * State could be deleted * Deleted states are put into freelist * When state vector is full, double its capacity
4.3 Written in C++11 * Templatized, for performance & memory utilization * For concise, use lambda extensively * Use class enum, just for clear
5 Abastrac Concept: map<state,map<char,state>>


  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值