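The AC automaton, short for Aho-Corasick automaton, originated at Bell Labs in 1975 and is a well-known multi-pattern matching algorithm. It extends the trie (dictionary tree) and is very widely used in string processing.
Compared with a trie, an AC automaton maintains one extra array: the fail array. For each node, fail points to the node for the longest suffix of the current node's string that can still be matched against the patterns, that is, the longest proper suffix that is also a prefix of some pattern. Doesn't that look a bit like the next array in KMP?
KMP's next array matches a pattern against itself, whereas the AC automaton's fail array matches a pattern against all of the patterns (including itself, of course).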
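To make the comparison concrete, here is a minimal sketch of KMP's next array under one common convention (the prefix function). The name kmp_next is made up for illustration, and other texts shift the values by one position.

```python
def kmp_next(pattern):
    """nxt[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it (the pattern is matched against itself)."""
    nxt = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        # On a mismatch, fall back to the next shorter candidate prefix.
        while k and pattern[i] != pattern[k]:
            k = nxt[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        nxt[i] = k
    return nxt


print(kmp_next("abcabd"))  # [0, 0, 0, 1, 2, 0]
```

The fallback step `k = nxt[k - 1]` plays the same role as following a fail link in the AC automaton, except that the fail link may land on a branch belonging to a different pattern.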
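For example, given the dictionary {acd, aceb, bef, cef}, we can build an AC automaton with one branch per word (acd and aceb share the prefix ac).
When searching acefcab, matching first follows the aceb branch down to e. At e there is no child f, so the search jumps to e's fail node (the e node on the cef branch) and finds f there. This completes the first match, cef.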
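The walk-through above can be reproduced with a small sketch. This is not taken from any of the libraries discussed here; the names build_automaton, search, goto, fail and out are chosen purely for illustration, and the trie is stored as a list of dicts for brevity.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie plus fail links. Node 0 is the root."""
    goto = [{}]  # goto[node][char] -> child node
    fail = [0]   # fail[node] -> node of the longest proper suffix present in the trie
    out = [[]]   # out[node] -> patterns that end at this node (directly or via fail)
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pat)
    # Breadth-first traversal: depth-1 nodes keep fail = root.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit the patterns recognised at the fail target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def search(text, automaton):
    """Return (start_index, pattern) for every match in text."""
    goto, fail, out = automaton
    node = 0
    matches = []
    for i, ch in enumerate(text):
        # Follow fail links until some node has a transition on ch.
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            matches.append((i - len(pat) + 1, pat))
    return matches


automaton = build_automaton(["acd", "aceb", "bef", "cef"])
print(search("acefcab", automaton))  # [(1, 'cef')]
```

In this sketch the fail link of the e node on the aceb branch points at the e node on the cef branch, which is exactly the jump described above.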
Implementation methods
1. ahocorasick
Installation: pip install pyahocorasick
Usage
import ahocorasick
import time


class AhocorasickNer:
    def __init__(self):
        # self.user_dict_path = user_dict_path
        self.actree = ahocorasick.Automaton()
        self.flage = 0  # running index assigned to each added keyword

    def add_keywords(self, key):
        # The value stored for each key is (insertion index, keyword).
        self.actree.add_word(key, (self.flage, key))
        self.flage += 1

    def make_automaton(self):
        # Call once after all keywords are added and before searching.
        self.actree.make_automaton()

    def get_ner_results(self, sentence):
        # print(list(self.actree.keys()))
        # for key in list(self.actree.keys()):
        #     print(key, list(self.actree.values(key)))
        # print(self.actree.get(sentence, ''))
        ner_results = []
        # Each item i has the form (index1, (index2, word)):
        #   index1: end index of the extracted match in sentence
        #   index2: index of the extracted match in self.actree
        # print(self.actree.get(word))
        # print(self.actree.find_all(word))
        # for i in self.actree.items(word):
        #     print(i)
        # iter_long reports only the longest match at each position.
        for i in self.actree.iter_long(sentence):
            print("iter_long", i)
        # iter reports every match, including overlapping ones.
        for i in self.actree.iter(sentence):
            print("iter", i)
            # Record ((index2, word), start, end) with an exclusive end offset.
            ner_results.append((i[1], i[0] + 1 - len(i[1][1]), i[0] + 1))
        return ner_results


if __name__ == "__main__":
    def sub_seq(string):
        # Return every prefix of string, e.g. "abc" -> ["a", "ab", "abc"].
        sub_string = []
        sub = ""
        for s in string:
            sub += s
            sub_string.append(sub)
        return sub_string

    ahocorasick_ner = AhocorasickNer()
    # words = ['abcdefg', 'abcdef', 'abcde', 'abcd', 'abc', 'ab', 'a', 'abdcef', 'cde']
    words = ["外卖公司", "百度", "百度集团", "百度外卖", "阿里巴巴", "腾讯"]
    # words = ["百度", "百度集团", "百度外卖"]
    for word in words:
        print(sub_seq(word))
        # Every prefix of every word is added as its own keyword.
        for key in sub_seq(word):
            ahocorasick_ner.add_keywords(key)
    ahocorasick_ner.make_automaton()
    while True:
        sentence = input("\nINPUT : ")
        ss = time.time()
        res = ahocorasick_ner.get_ner_results(sentence)
        print("TIME : {0}ms!".format(round(1000 * (time.time() - ss), 3)))
        print("OUTPUT:{0}".format(res))

2. esmre
Installation: pip install esmre
Usage: http://xiaorui.cc/archives/1649
import esmre

index = esmre.Index()
words = ["百度", "百度集团", "百度外卖", "阿里巴巴", "腾讯"]
for word in words:
    # Each word serves both as the pattern and as the object returned on a hit.
    index.enter(word, word)
result = index.query("百度大战阿里巴巴")
print(result)

3. flashtext
Installation: pip install flashtext
Usage
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()


def sub_seq(string):
    # Return every prefix of string, e.g. "abc" -> ["a", "ab", "abc"].
    sub_string = []
    sub = ""
    for s in string:
        sub += s
        sub_string.append(sub)
    return sub_string


words = ["百度", "百度集团", "百度外卖"]
for word in words:
    print(sub_seq(word))
    # Map every prefix of the word to the full word as its clean name.
    for sub_word in sub_seq(word):
        keyword_processor.add_keyword(sub_word, word)
print(keyword_processor.get_all_keywords())

4. Pure Python implementation