pyahocorasick是个python模块,由两种数据结构实现:trie和Aho-Corasick自动机。
Trie是一个字符串索引的词典,检索相关项时时间和字符串长度成正比。
AC自动机能够在一次运行中找到给定集合所有字符串。AC自动机其实就是在Trie树上实现KMP,可以完成多模式串的匹配。
(推荐学习资料:http://blog.csdn.net/niushuai666/article/details/7002823;http://www.cnblogs.com/kuangbin/p/3164106.html)
作者
Wojciech Muła, wojciech_mula@poczta.onet.pl
官方地址:
https://pypi.python.org/pypi/pyahocorasick/
安装
python setup.py install
API
模块ahocorasick包含几个常量和Automaton类
Unicode和bytes
Automaton接受和返回的字符串既可以是unicode,也可以是bytes,取决于编译阶段设置。用模块成员unicode表示选择的类型。
需要注意的是:如果选择使用unicode,trie存储每个字母时用2或者4个bytes;如果选择bytes,每个字母只需要一个byte。
constants
- unicode:
- STORE_ANY,STORE_INTS,STORE_LENGTH
- EMPTY,TRIE,AHOCORASICK
- MATCH_EXACT_LENGTH,MATCH_AT_MOST_PREFIX,MATCH_AT_LEAST_PREFIX
Antomaton类
略。有兴趣了解请看官网介绍。
AutomatonSearchIter类
略。有兴趣了解请看官网介绍。
示例
import ahocorasick
A = ahocorasick.Automaton()
# 向trie树中添加单词
for index,word in enumerate("he her hers she".split()):
A.add_word(word, (index, word))
# 用法分析add_word(word,[value]) => bool
# 根据Automaton构造函数的参数store设置,value这样考虑:
# 1. 如果store设置为STORE_LENGTH,不能传递value,默认保存len(word)
# 2. 如果store设置为STORE_INTS,value可选,但必须是int类型,默认是len(automaton)
# 3. 如果store设置为STORE_ANY,value必须写,可以是任意类型
# 测试单词是否在树中
if "he" in A:
print True
else:
print False
A.get("he")
# (0,'he')
A.get("cat","<not exists>")
# '<not exists>'
A.get("dog")
# KeyError
# 将trie树转化为Aho-Corasick自动机
A.make_automaton()
# 找到所有匹配字符串
for item in A.iter("_hershe_"):
print item
#(2,(0,'he'))
#(3,(1,'her'))
#(4, (2, 'hers'))
#(6, (3, 'she'))
#(6, (0, 'he'))
示例2
import ahocorasick
A = ahocorasick.Automaton()
# 添加单词
for index,word in enumerate("cat catastropha rat rate bat".split()):
A.add_word(word, (index, word))
# prefix
list(A.keys("cat"))
## ["cat","catastropha"]
list(A.keys("?at","?",ahocprasick.MATCH_EXACT_LENGTH))
## ['bat','cat','rat']
list(A.keys("?at?", "?", ahocorasick.MATCH_AT_MOST_PREFIX))
## ["bat", "cat", "rat", "rate"]
list(A.keys("?at?", "?", ahocorasick.MATCH_AT_LEAST_PREFIX))
## ['rate']
## keys用法分析
## keys([prefix, [wildcard, [how]]]) => yield strings
## If prefix (a string) is given, then only words sharing this prefix are yielded.
## If wildcard (single character) is given, then prefix is treated as a simple pattern with selected wildcard. Optional parameter how controls which strings are matched:
## MATCH_EXACT_LENGTH [default]:Only strings with the same length as a pattern’s length are yielded. In other words, literally match a pattern.
## MATCH_AT_LEAST_PREFIX:Strings that have length greater or equal to a pattern’s length are yielded.
## MATCH_AT_MOST_PREFIX:Strings that have length less or equal to a pattern’s length are yielded.