Introduction
pyahocorasick is a Python module that implements two data structures: a trie and an Aho-Corasick automaton.
Basic usage
import ahocorasick

A = ahocorasick.Automaton()
# Store each keyword together with an (index, keyword) value.
for idx, key in enumerate('he her hers she'.split()):
    A.add_word(key, (idx, key))
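Use the get() method to look up a keyword. It returns the stored value, returns the default if one is given and the keyword is missing, and raises KeyError otherwise: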
>>> A.get('he')
(0, 'he')
>>> A.get('she')
(3, 'she')
>>> A.get('cat', 'not exists')
'not exists'
>>> A.get('dog')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError
Call A.make_automaton() to turn the trie into an Aho-Corasick automaton, which enables Aho-Corasick search over a text:
test = ['abcdefg', 'abcdef', 'abcde', 'abcd', 'abc', 'ab', 'a', 'abdcef', 'cde']

def build_actree(wordlist):
    # Build the trie, then convert it into an Aho-Corasick automaton.
    actree = ahocorasick.Automaton()
    for index, word in enumerate(wordlist):
        actree.add_word(word, (index, word))
    actree.make_automaton()
    return actree

actree_test = build_actree(test)
# iter() yields (end_index, value) for every keyword found in the text.
for i in actree_test.iter('abcdefg'):
    print(i)
Output:
(0, (6, 'a'))
(1, (5, 'ab'))
(2, (4, 'abc'))
(3, (3, 'abcd'))
(4, (2, 'abcde'))
(4, (8, 'cde'))
(5, (1, 'abcdef'))
(6, (0, 'abcdefg'))
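Each tuple in this output has the form (end_index, value): end_index is the position of the last character of the match, and value is whatever was stored with add_word(). The start position can be recovered by subtracting the keyword length. Below is a minimal sketch of that conversion. The helper name to_spans is hypothetical (it is not part of pyahocorasick), and it assumes the stored value is an (index, word) pair as in the example above. It runs on two tuples copied from the output, so it does not need the library installed.

```python
def to_spans(matches):
    """Convert (end_index, (idx, word)) tuples to (start, end, word).

    Both start and end are inclusive positions in the searched text.
    Hypothetical helper; assumes values were stored as (index, word).
    """
    spans = []
    for end_index, (_idx, word) in matches:
        start_index = end_index - len(word) + 1
        spans.append((start_index, end_index, word))
    return spans


# Two of the matches printed above, copied verbatim.
sample = [(3, (3, 'abcd')), (4, (8, 'cde'))]
spans = to_spans(sample)
print(spans)  # [(0, 3, 'abcd'), (2, 4, 'cde')]
```

With a real automaton, the same helper could be applied directly to the iterator, for example to_spans(actree_test.iter('abcdefg')).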
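As the output shows, the automaton reports every keyword that occurs anywhere in the input, in order of the position where each match ends. Matches are not limited to prefixes or suffixes of the input and may overlap, as abcd and cde do here.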
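To make the difference between a plain trie lookup and an Aho-Corasick search more concrete, here is a rough pure-Python sketch of the idea. It is a toy written for illustration and is not how pyahocorasick is implemented (the real module is a C extension); the class name TinyAutomaton and its internal layout are assumptions made here, and only the method names mirror the calls used above. The key step is make_automaton(), which adds a failure link to every trie node so that the search can continue from the longest matching suffix instead of restarting.

```python
from collections import deque


class TinyAutomaton:
    """Toy Aho-Corasick automaton; NOT pyahocorasick's implementation."""

    def __init__(self):
        # Each node: child transitions, failure link, values ending here.
        self.nodes = [{"next": {}, "fail": 0, "out": []}]

    def add_word(self, word, value):
        # Plain trie insertion.
        state = 0
        for ch in word:
            nxt = self.nodes[state]["next"].get(ch)
            if nxt is None:
                nxt = len(self.nodes)
                self.nodes.append({"next": {}, "fail": 0, "out": []})
                self.nodes[state]["next"][ch] = nxt
            state = nxt
        self.nodes[state]["out"].append(value)

    def make_automaton(self):
        # Breadth-first pass. A node's failure link points to the longest
        # proper suffix of its path that is also a path in the trie.
        # Nodes directly under the root keep the default link (root).
        queue = deque(self.nodes[0]["next"].values())
        while queue:
            state = queue.popleft()
            for ch, child in self.nodes[state]["next"].items():
                fail = self.nodes[state]["fail"]
                while fail and ch not in self.nodes[fail]["next"]:
                    fail = self.nodes[fail]["fail"]
                target = self.nodes[fail]["next"].get(ch, 0)
                self.nodes[child]["fail"] = target
                # Inherit keywords that end at the failure target.
                self.nodes[child]["out"] += self.nodes[target]["out"]
                queue.append(child)

    def iter(self, text):
        # Scan the text once, following failure links on mismatches.
        state = 0
        for end_index, ch in enumerate(text):
            while state and ch not in self.nodes[state]["next"]:
                state = self.nodes[state]["fail"]
            state = self.nodes[state]["next"].get(ch, 0)
            for value in self.nodes[state]["out"]:
                yield end_index, value


test = ['abcdefg', 'abcdef', 'abcde', 'abcd', 'abc', 'ab', 'a', 'abdcef', 'cde']
toy = TinyAutomaton()
for index, word in enumerate(test):
    toy.add_word(word, (index, word))
toy.make_automaton()
matches = list(toy.iter('abcdefg'))
for match in matches:
    print(match)
```

On this input the toy yields the same tuples as the output listed above. The cde match at position 4 comes from the failure link: the node for abcde links to the node for cde and inherits its value, which is why both are reported at the same end position.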