AC Automaton (Aho-Corasick)

王哈哈嘎哈呢

Published 2020-10-30 23:27:37


Category: Data Structures

Article link: https://blog.csdn.net/weixin_45087321/article/details/109395567


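The AC automaton is a multi-pattern matching algorithm, meaning that it searches a text for several pattern strings at once. It has three main steps, numbered 1 to 3 further down.
(The two borrowed sources are credited in no particular order.)
<-- Borrowed material 1 must come first -->
<-- Borrowed material 2 must come first -->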

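Summary:
The AC automaton is a tree-structured mechanism for fast multi-pattern matching. Take the figure below, which matches ['he', 'hers', 'his', 'she'] at the same time. The word list is hung on a tree, and the children of root are the first characters of the words; these are compared with the text. When h appears in the text, its index in the text is recorded as the start position and matching continues downward. Whenever a double-circled node (the end of a word) is reached, a result is returned, and matching continues downward until no further match is possible. If, after reaching e, the next downward match fails, the pointer returns to the marked start node h and matching continues with the following characters (looking for h or s). Corrections are welcome, thank you. The code has a small bug: when both hera and era are patterns, the positions returned are (0,4) and (0,4), so the start and end positions are wrong. I will fix it when I have time.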
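
The word tree described in the summary can be sketched as follows. This is a minimal illustration rather than the code used later in the post: TrieNode, insert, contains and the is_end flag are names assumed here, with is_end standing in for the double-circled nodes of the figure.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_end = False   # True for a "double-circled" node: a word ends here


def insert(root, word):
    """Add one word, creating nodes only for characters not yet on the path."""
    node = root
    for ch in word:
        if ch not in node.children:
            node.children[ch] = TrieNode()
        node = node.children[ch]
    node.is_end = True


def contains(root, word):
    """Return True only if the whole word was inserted, not just a prefix of it."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_end


root = TrieNode()
for w in ["he", "hers", "his", "she"]:
    insert(root, w)

print(sorted(root.children))   # ['h', 's']
print(contains(root, "he"))    # True
print(contains(root, "her"))   # False: "her" is only a prefix of "hers"
```

The four words share nodes wherever they share a prefix, which is why root ends up with only two children.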

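1. Build a trie from the pattern strings
A trie (dictionary tree) is a variant of the hash tree; it is convenient for storing strings and lookups are efficient. Each stored string is spelled out by the node values on a path from the root down to a node marked as the end of a word. To insert a new string, walk the string and the tree together, starting at the root: if the current character is already a child of the current node, move to that child and take the next character; otherwise add the character as a new child of the current node. Python has no C/C++ struct, so the node is built as a class, and the children of each node are kept in a dict keyed by character.
2. KMP-style processing
Any discussion of pattern matching has to mention KMP. The core of KMP's optimization is the NEXT array: on every mismatch, the NEXT array picks the position from which to resume matching instead of starting over. The automaton uses the same strategy: each node has a fail pointer to another node, and when matching fails at a node, the search moves to that node's fail node and retries from there.
3. Pattern matching
Once the AC automaton is built, strings can be matched on it. Python 3 strings are Unicode and source files default to UTF-8 (which is ASCII-compatible), which removes the encoding problems of Python 2, so Chinese text can be matched as well.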
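
To make the comparison in step 2 concrete, here is a minimal single-pattern KMP sketch. It assumes one common convention for the NEXT array (the prefix function, where nxt[i] is the length of the longest proper prefix of pattern[:i+1] that is also its suffix); other texts shift the indices by one. The names build_next and kmp_find are chosen here for illustration.

```python
def build_next(pattern):
    """nxt[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    nxt = [0] * len(pattern)
    k = 0  # length of the prefix currently matched
    for i in range(1, len(pattern)):
        # On a mismatch, fall back to the next shorter candidate prefix
        while k > 0 and pattern[i] != pattern[k]:
            k = nxt[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        nxt[i] = k
    return nxt


def kmp_find(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    nxt = build_next(pattern)
    hits = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        # Resume from the position given by nxt instead of starting over
        while k > 0 and ch != pattern[k]:
            k = nxt[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = nxt[k - 1]  # keep going to find overlapping occurrences
    return hits


print(build_next("abab"))          # [0, 0, 1, 2]
print(kmp_find("ababab", "abab"))  # [0, 2]
```

The fail pointer of the automaton plays the role of nxt[k - 1] here: both say where to resume after a mismatch without re-reading the text.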

How it works

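Compared with a plain trie, the AC automaton only adds a fail pointer to each node. It points to the node for the longest proper suffix of the current match that is also a prefix of some pattern. Matching then proceeds as in KMP.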
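
The fail pointer can be illustrated directly from that definition. The sketch below is a deliberately naive, assumed formulation: it identifies each trie node with the prefix string that leads to it and looks up the longest proper suffix that is also a node. A real implementation computes the same links with a breadth-first pass, as the listing later in this post does; trie_prefixes and fail_target are hypothetical helper names.

```python
def trie_prefixes(words):
    """Every prefix of every pattern is one trie node; '' stands for the root."""
    return {word[:i] for word in words for i in range(len(word) + 1)}


def fail_target(prefix, prefixes):
    """Longest proper suffix of prefix that is itself a trie node."""
    for start in range(1, len(prefix) + 1):
        if prefix[start:] in prefixes:
            return prefix[start:]
    return ""  # only reached when prefix is the root itself


words = ["he", "hers", "his", "she"]
nodes = trie_prefixes(words)
fails = {p: fail_target(p, nodes) for p in nodes if p}

print(repr(fails["h"]))  # '' (the root)
print(fails["sh"])       # h
print(fails["she"])      # he
print(fails["hers"])     # s
```

The entry for sh is the case usually drawn in diagrams of this example: once sh has been read, the h is already a valid start of he, hers or his, so the search can continue from the node for h.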

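Using the figure from Baidu Baike: after sh has been matched, matching e fails. At that point h is in fact already matched, so matching of the next character can continue from node 74, the node where h is matched.
[Figure: principle diagram]
The code below is copied from someone else, and its start and end positions are problematic. My own code, which solves the position problem, follows it.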

# -*- coding:utf-8 -*-
"""
Description: AC automaton

@author: WangLeAi
@date: 2018/8/19
"""
from collections import defaultdict


class TrieNode(object):
    def __init__(self, value=None):
        # Character stored at this node
        self.value = value
        # Fail pointer
        self.fail = None
        # Tail flag: i means the i-th pattern (1-based) ends here; 0 means none
        self.tail = 0
        # Children, {value: TrieNode}
        self.children = {}


class Trie(object):
    def __init__(self, words):
        print("Initializing")
        # Root node
        self.root = TrieNode()
        # Number of patterns inserted so far
        self.count = 0
        self.words = words
        for word in words:
            self.insert(word)
        self.ac_automation()
        print("Initialization complete")

    def insert(self, sequence):
        """
        Basic operation: insert one string.
        :param sequence: the string to insert
        :return:
        """
        self.count += 1
        cur_node = self.root
        for item in sequence:
            if item not in cur_node.children:
                # Insert a new node
                child = TrieNode(value=item)
                cur_node.children[item] = child
                cur_node = child
            else:
                cur_node = cur_node.children[item]
        cur_node.tail = self.count

    def ac_automation(self):
        """
        Build the failure links.
        :return:
        """
        queue = [self.root]
        # Traverse the trie breadth-first
        while len(queue):
            # Take the node at the head of the queue
            temp_node = queue.pop(0)
            for value in temp_node.children.values():
                # Children of the root fail back to the root
                if temp_node is self.root:
                    value.fail = self.root
                else:
                    # Start from the parent's fail pointer
                    p = temp_node.fail
                    while p:
                        # If that node has a child with the same character, fail points to that child
                        if value.value in p.children:
                            value.fail = p.children[value.value]
                            break
                        # Otherwise keep following fail pointers
                        p = p.fail
                    # None means no shorter match exists, so fail points to the root
                    if not p:
                        value.fail = self.root
                # Enqueue this child so that its own children are processed later
                queue.append(value)

    def search(self, text):
        """
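        Pattern matching.
        :param self:
        :param text: the long text to search
        :return:
        """
        p = self.root
        # Index where the current match started
        start_index = 0
        # Successful matches, keyed by pattern
        rst = defaultdict(list)
        for i in range(len(text)):
            single_char = text[i]
            while single_char not in p.children and p is not self.root:
                p = p.fail
            # Known flaw: when part of the text is covered by two matching words, the start index of the later word is not updated
            # This comes from the KMP-style jump and conflicts with the loop below that collects every match
            # The impact is limited, since every reported position lies on a matched character
            if single_char in p.children and p is self.root:
                start_index = i
            # Move to the matching child if there is one, otherwise go back to the root
            if single_char in p.children:
                p = p.children[single_char]
            else:
                start_index = i
                p = self.root
            temp = p
            while temp is not self.root:
                # A tail of 0 is skipped; otherwise subtract 1 to get the index into self.words
                # The loop is needed because a word may be a suffix of another word and must be reported too
                if temp.tail:
                    rst[self.words[temp.tail - 1]].append((start_index, i))
                temp = temp.fail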
        return rst


if __name__ == "__main__":
    test_words = ["不知", "不觉", "忘了爱"]
    test_text = """不知、不觉·间我~|~已经忘了爱❤。"""
    model = Trie(test_words)
    # defaultdict(<class 'list'>, {'不知': [(0, 1)], '不觉': [(3, 4)], '忘了爱': [(13, 15)]})
    print(str(model.search(test_text)))
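
One possible way to remove the position flaw noted in the comments of the listing above is to derive each start index from the length of the matched pattern instead of tracking a start index during the scan. The following is a minimal sketch under that assumption, not the original author's fix; Node, build_automaton and find_all are names introduced here, and positions are inclusive (start, end) pairs as in the listing above.

```python
from collections import deque


class Node:
    def __init__(self):
        self.children = {}  # character -> Node
        self.fail = None    # fail pointer; stays None for the root
        self.word = None    # the pattern that ends at this node, if any


def build_automaton(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.word = word
    # Breadth-first pass: depth-1 nodes fail to the root
    queue = deque()
    for child in root.children.values():
        child.fail = root
        queue.append(child)
    while queue:
        node = queue.popleft()
        for ch, child in node.children.items():
            # Follow the parent's fail chain until some node has a child for ch
            p = node.fail
            while p is not None and ch not in p.children:
                p = p.fail
            child.fail = p.children[ch] if p is not None else root
            queue.append(child)
    return root


def find_all(root, text):
    hits = []
    node = root
    for i, ch in enumerate(text):
        while node is not root and ch not in node.children:
            node = node.fail
        node = node.children.get(ch, root)
        # Report every pattern that ends at position i, including suffix words
        temp = node
        while temp is not root:
            if temp.word is not None:
                # The start follows from the pattern length, so it cannot go stale
                hits.append((temp.word, i - len(temp.word) + 1, i))
            temp = temp.fail
    return hits


root = build_automaton(["hera", "era"])
print(find_all(root, "hera"))  # [('hera', 0, 3), ('era', 1, 3)]
```

With the patterns hera and era, the two hits now get different start positions, which is the case the summary at the top of the post reports as wrong.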

My own code, which also carries a part-of-speech tag for each word

class TrieNode(object):
    def __init__(self, value=None):
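        # Character stored at this node
        self.value = value
        # Terminator: at the end of a word this holds the word's length, otherwise 0
        self.tail = 0
        # Part-of-speech tag (None by default)
        self.types = None
        # Child dictionary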
        self.children = {}


class Trie(object):
    def __init__(self, words):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, sequences):
        cur_node = self.root
        # Each entry is "word tag", separated by whitespace
        sequence, n = sequences.split()
        for item in sequence:
            if item not in cur_node.children:
                # Insert a new node
                child = TrieNode(value=item)
                cur_node.children[item] = child
                cur_node = child
            else:
                cur_node = cur_node.children[item]
        # Length from the first to the last character of the word
        cur_node.tail = len(sequence)
        # Part-of-speech tag
        cur_node.types = n

    def t(self, text):
        '''
        Scan the text and yield every dictionary word found in it.
        :param text: text block
        :return: generator of [start, end, tag], where text[start:end] is the word
        '''
        # Try each position of the text as the start of a word
        for start in range(len(text)):
            cur_node = self.root
            # Walk down the trie for as long as the text keeps matching
            for count in range(start, len(text)):
                if text[count] not in cur_node.children:
                    break
                cur_node = cur_node.children[text[count]]
                # A non-zero tail marks the end of a word (e.g. 骨质 inside 骨质增生)
                if cur_node.tail:
                    yield [start, start + cur_node.tail, cur_node.types]

t = Trie(['我 a', '门 d', '天安门 b', '安门 c', '天安 e'])
text = '我爱天安门'
a = t.t(text)
for i in a:
    print(text[i[0]:i[1]], i[2])

Sample output

我 a
天安 e
天安门 b
安门 c
门 d

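~~Currently only a single trailing character goes unrecognized: '门' is not extracted. To be improved.~~
The improvement is done and the earlier problem is fixed.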
