The AC Automaton Explained: Algorithm Principles and a Sensitive-Word Filter (A Step-by-Step Java Guide)

Rainyocode

Last modified 2024-07-28 06:00:50

First published 2024-07-28 05:36:00

Article link: https://blog.csdn.net/qq_74851649/article/details/140745300

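The AC automaton (Aho-Corasick automaton) is an efficient multi-pattern string matching algorithm, proposed by Alfred V. Aho and Margaret J. Corasick in 1975. Once the patterns have been preprocessed, it scans the text in a single pass and finds every occurrence of every pattern, in time linear in the text length plus the number of matches reported, which makes it a natural fit for multi-pattern matching. I will start from the building blocks (the trie, breadth-first search, and the KMP algorithm), then walk through how the AC automaton works and give a simple implementation.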

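1. Trie

The AC automaton stores its sensitive words in a trie, so let's start with what a trie is.

A trie, also called a prefix tree, is a special kind of tree. Picture it as a huge dictionary, but one that is organized in an unusual way.

Structure of a trie

Shape: a trie looks like an upside-down tree, with the root at the top and the branches growing downward.
What a node means: every node stands for one letter. Starting from the root (which stands for the empty string), the letters you pass on the way down spell out a word.
How words are stored: a word is not stored whole in any single node; it is spread along a path from the root down to the node that marks its end.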

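An example

Suppose we want to store the three words "cat", "car" and "dog":

To find "cat", start at the root, follow "c", then "a", and finally "t".

To find "car", the first two letters "ca" are shared with "cat"; only the last letter differs.

In a paper dictionary you may have to flip through many pages to find a word. In a trie you just walk down the tree, and it takes at most as many steps as the word has letters.

When many words start the same way (like "car" and "cat"), the trie stores the shared part only once, which saves a good deal of space.

Want every word that starts with "ca"? Find the node for "ca" in the trie and look at the branches below it.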
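
The prefix lookup described above can be sketched in a few lines. This is only an illustration under my own assumptions: the class name TriePrefixDemo and the method wordsWithPrefix are hypothetical and not part of the article's code, and the children are kept in a TreeMap purely so that the output order is stable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class; the names here are illustrative assumptions.
class TriePrefixDemo {
    // One trie node: children keyed by character, plus an end-of-word flag.
    static final class Node {
        final Map<Character, Node> children = new TreeMap<>(); // sorted for stable output
        boolean isEndOfWord;
    }

    final Node root = new Node();

    // Walk down from the root, creating missing nodes, then flag the last node.
    void insert(String word) {
        Node current = root;
        for (char ch : word.toCharArray()) {
            current = current.children.computeIfAbsent(ch, k -> new Node());
        }
        current.isEndOfWord = true;
    }

    // Returns every stored word that starts with the given prefix.
    List<String> wordsWithPrefix(String prefix) {
        List<String> result = new ArrayList<>();
        Node current = root;
        for (char ch : prefix.toCharArray()) {
            current = current.children.get(ch);
            if (current == null) {
                return result; // no stored word has this prefix
            }
        }
        collect(current, new StringBuilder(prefix), result);
        return result;
    }

    // Depth-first walk below a node, gathering complete words.
    private void collect(Node node, StringBuilder path, List<String> out) {
        if (node.isEndOfWord) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey());
            collect(e.getValue(), path, out);
            path.deleteCharAt(path.length() - 1);
        }
    }

    public static void main(String[] args) {
        TriePrefixDemo trie = new TriePrefixDemo();
        for (String w : new String[] {"cat", "car", "dog"}) {
            trie.insert(w);
        }
        System.out.println(trie.wordsWithPrefix("ca")); // prints [car, cat]
    }
}
```

Note that "cat" and "car" share the nodes for "c" and "a"; only the last letter gets its own branch.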

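How to build a trie

A trie is built as follows:

Create the root node.
For each string to insert:
- Start at the root node.
- For each character in the string:
  - If the current node has no child for that character, create a new child node.
  - Move to the child for that character.
- Mark the node reached by the last character as the end of a string.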


class TrieNode {
    // Package-private rather than private: Trie is a separate top-level class,
    // so it could not read private fields of TrieNode.
    TrieNode[] children;
    boolean isEndOfWord;

    public TrieNode() {
        children = new TrieNode[26]; // assumes lowercase letters 'a'..'z' only
        isEndOfWord = false;
    }
}

public class Trie {
    private TrieNode root;

    public Trie() {
        root = new TrieNode();
    }

    // Insert a word
    public void insert(String word) {
        TrieNode current = root;
        for (char ch : word.toCharArray()) {
            int index = ch - 'a';
            if (current.children[index] == null) {
                current.children[index] = new TrieNode();
            }
            current = current.children[index];
        }
        current.isEndOfWord = true;
    }

    // Search for a complete word
    public boolean search(String word) {
        TrieNode node = searchNode(word);
        return node != null && node.isEndOfWord;
    }

    // Check whether any stored word starts with the given prefix
    public boolean startsWith(String prefix) {
        return searchNode(prefix) != null;
    }

    // Walk down the trie along word; returns the final node, or null if the path breaks off
    private TrieNode searchNode(String word) {
        TrieNode current = root;
        for (char ch : word.toCharArray()) {
            int index = ch - 'a';
            if (current.children[index] == null) {
                return null;
            }
            current = current.children[index];
        }
        return current;
    }
}

Let's go through a few parts of this code.


public void insert(String word) {
    TrieNode current = root;
    for (char ch : word.toCharArray()) {
        int index = ch - 'a';
        if (current.children[index] == null) {
            current.children[index] = new TrieNode();
        }
        current = current.children[index];
    }
    current.isEndOfWord = true;
}

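This is the trie's insert method. Imagine you are playing a word game and have to put a word into a special box. The box is our trie, and this method tells you how the word goes in.

Where you start
TrieNode current = root;
You start at the entrance of the box (the root node), like standing at the entrance of a maze.
Handle the letters one by one
for (char ch : word.toCharArray())
You process the word one letter at a time, like taking one step at a time through the maze.
Find the letter's slot
int index = ch - 'a';
Every letter has its own slot. This turns the letter into a number: 'a' is 0, 'b' is 1, and so on.
Check whether a path exists
if (current.children[index] == null)
You check whether there is already a path from the current position to the next letter.
Create a new path
current.children[index] = new TrieNode();
If there is none, you create one, like digging a new corridor in the maze.
Move to the next position
current = current.children[index];
Then you follow that path (newly built or already there) to the next position.
Mark the end of the word
current.isEndOfWord = true;
Once every letter has been processed, you plant a small flag at the final position to show that a complete word ends here.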

private TrieNode searchNode(String word) {
    TrieNode current = root;
    for (char ch : word.toCharArray()) {
        int index = ch - 'a';
        if (current.children[index] == null) {
            return null;
        }
        current = current.children[index];
    }
    return current;
}

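The search method:

Stand at the start
TrieNode current = root;
You stand at the start of the maze, which is the entrance of our trie.
Read the map step by step
for (char ch : word.toCharArray())
Each letter is like a signpost on the map.
Work out the next direction
int index = ch - 'a';
You turn the letter into a number, like turning a signpost into a concrete direction.
Check whether the path exists
if (current.children[index] == null)
You check whether the next step on the map really has a path. If it does not...
Dead end
return null;
...it means you took a wrong turn, or the map is a fake: the word is not in the trie.
Keep going
current = current.children[index];
If there is a path, you follow it and get ready for the next instruction on the map.
Through the maze
return current;
If you get through every letter, you return the node you ended on. The caller then checks isEndOfWord to tell a complete word from a mere prefix.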

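2. Breadth-First Search (BFS)

In the AC automaton, BFS is used to build the failure pointers. BFS processes the trie level by level, so by the time a node's failure pointer is computed, the failure pointers of all shallower nodes, which it depends on, are already in place.

Breadth-first search (BFS) is a graph traversal algorithm. Starting from a chosen node, it first visits all of that node's neighbors, then the neighbors of those neighbors, and so on.

Here is an implementation of BFS: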

import java.util.*;

public class BFS {
    private final int V; // number of vertices in the graph
    private final List<List<Integer>> adj; // adjacency lists

    BFS(int v) {
        V = v;
        adj = new ArrayList<>(v);
        for (int i = 0; i < v; ++i)
            adj.add(new LinkedList<>());
    }

    // Add a directed edge v -> w
    void addEdge(int v, int w) {
        adj.get(v).add(w);
    }

    // BFS traversal from vertex s; prints the vertices in visiting order
    void bfs(int s) {
        boolean[] visited = new boolean[V];
        Queue<Integer> queue = new LinkedList<>();

        visited[s] = true;
        queue.add(s);

        while (!queue.isEmpty()) {
            s = queue.poll();
            System.out.print(s + " ");

            // Enqueue every neighbor that has not been seen yet
            for (int n : adj.get(s)) {
                if (!visited[n]) {
                    visited[n] = true;
                    queue.add(n);
                }
            }
        }
    }
}
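
To tie BFS back to the trie, here is a small sketch that visits the nodes of a trie level by level and records the prefix each node spells. It is an illustration only: the class TrieLevelOrderDemo and its prefix field are my own assumptions, not part of the article's code. The point is the order: every shorter prefix is visited before any longer one, which is the property the failure-pointer construction relies on.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.TreeMap;

// Hypothetical demo class; names are illustrative assumptions.
class TrieLevelOrderDemo {
    static final class Node {
        final Map<Character, Node> children = new TreeMap<>(); // sorted for stable output
        String prefix = ""; // the string spelled from the root to this node
    }

    // Builds a trie from the given words, storing each node's prefix as it is created.
    static Node build(String... words) {
        Node root = new Node();
        for (String w : words) {
            Node cur = root;
            for (char ch : w.toCharArray()) {
                Node parent = cur; // effectively final copy for the lambda
                cur = cur.children.computeIfAbsent(ch, k -> {
                    Node n = new Node();
                    n.prefix = parent.prefix + k;
                    return n;
                });
            }
        }
        return root;
    }

    // Visits nodes level by level and returns their prefixes in visiting order.
    // A trie is a tree, so no visited[] array is needed.
    static List<String> levelOrder(Node root) {
        List<String> order = new ArrayList<>();
        Queue<Node> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            Node cur = queue.poll();
            if (cur != root) {
                order.add(cur.prefix);
            }
            queue.addAll(cur.children.values());
        }
        return order;
    }

    public static void main(String[] args) {
        // prints [c, d, ca, do, car, cat, dog]
        System.out.println(levelOrder(build("cat", "car", "dog")));
    }
}
```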

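3. The KMP Algorithm

KMP is the key idea behind the AC automaton: when a match fails while we are looking for sensitive words in the trie, the automaton falls back in the same way KMP does, instead of starting over.

Suppose the main string is ABABABCDA and the substring (the sensitive word) is ABABC.

After ABAB has matched, the fifth character of the main string (A) fails to match the fifth character of the substring (C). A brute-force search would move the main-string pointer back and restart from the second letter (B). KMP does not: the main-string pointer stays on the fifth letter (A), and the substring pointer moves to the third letter (A). This works because the matched part ABAB starts with AB and also ends with AB, so those two characters are already known to match. The lps array defined by KMP records, for each position of the substring, the length of the longest proper prefix that is also a suffix; that is exactly how many characters need not be compared again, so the algorithm reuses what it has already matched and avoids repeated comparisons.

Here is an implementation of KMP: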


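public class KMP {

    // KMP search: prints the start index of every occurrence of pat in txt
    public static void KMPSearch(String pat, String txt) {
        int M = pat.length();
        int N = txt.length();
        if (M == 0) {
            return; // an empty pattern has nothing to match; also avoids pat.charAt(0) failing
        }

        // lps[k] = length of the longest proper prefix of pat[0..k] that is also its suffix
        int[] lps = new int[M];

        // Preprocess the pattern to fill lps[]
        next(pat, M, lps);

        int i = 0; // index into txt
        int j = 0; // index into pat

        while (i < N) {
            // Characters match: advance both pointers
            if (pat.charAt(j) == txt.charAt(i)) {
                j++;
                i++;
            }

            if (j == M) {
                // Found a complete match
                System.out.println("Pattern found at index " + (i - j));
                // Fall back to the last lps value so that overlapping matches are found too
                j = lps[j - 1];
            }

            // Characters do not match
            else if (i < N && pat.charAt(j) != txt.charAt(i)) {
                if (j != 0)
                    // Partial fallback: reuse what has already been matched
                    j = lps[j - 1];
                else
                    // Nothing matched at all: advance the main-string pointer
                    i = i + 1;
            }
        }
    }

    // Compute the lps[] array
    private static void next(String pat, int M, int[] lps) {
        int len = 0; // length of the previous longest equal prefix and suffix
        int i = 1;
        lps[0] = 0; // lps[0] is always 0

        while (i < M) {
            if (pat.charAt(i) == pat.charAt(len)) {
                len++;
                lps[i] = len;
                i++;
            } else {
                if (len != 0) {
                    // Fall back to the previous longest equal prefix and suffix
                    len = lps[len - 1];
                } else {
                    // Nothing left to fall back to, so the value is 0
                    lps[i] = 0;
                    i++;
                }
            }
        }
    }
}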

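The purpose of the lps array is to record, for every position of the substring (the sensitive word), the length of the longest prefix that is also a suffix of the part ending at that position. Let's go through its construction step by step: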

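Initialization:
int len = 0; // length of the current longest equal prefix and suffix
int i = 1; // start processing from the second character
lps[0] = 0; // a single character has no proper prefix
Process the characters one by one to fill the lps array (the "next" table):
while (i < M) {
    if (pat.charAt(i) == pat.charAt(len)) {
        len++;
        lps[i] = len;
        i++; // equal: extend the match and move forward
    } else {
        if (len != 0) {
            // fall back to the previous longest equal prefix and suffix
            len = lps[len - 1];
        } else {
            // nothing left to fall back to, so the value is 0
            lps[i] = 0;
            i++;
        }
    }
}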
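
To make the table concrete, the sketch below runs the same construction on the example pattern ABABC. The class LpsDemo and the method computeLps are hypothetical names of mine; the loop follows the steps listed above, only written more compactly.

```java
import java.util.Arrays;

// Hypothetical demo class; names are illustrative assumptions.
class LpsDemo {
    // Returns lps[], where lps[k] is the length of the longest proper prefix
    // of pat[0..k] that is also a suffix of pat[0..k].
    static int[] computeLps(String pat) {
        int[] lps = new int[pat.length()];
        int len = 0; // length of the current longest equal prefix and suffix
        int i = 1;   // lps[0] stays 0, so start from the second character
        while (i < pat.length()) {
            if (pat.charAt(i) == pat.charAt(len)) {
                lps[i++] = ++len;      // extend the current match
            } else if (len != 0) {
                len = lps[len - 1];    // fall back and retry the same i
            } else {
                lps[i++] = 0;          // nothing to fall back to
            }
        }
        return lps;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(computeLps("ABABC"))); // prints [0, 0, 1, 2, 0]
    }
}
```

The value lps[3] = 2 is what sends the substring pointer to the third letter after ABAB has matched and the next character fails.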

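KMP search:

Initialization:
M: length of the substring
N: length of the main string
lps[]: for each position of the substring, the length of the longest equal prefix and suffix
i: current position in the main string, initially 0
j: current position in the substring, initially 0
Build the lps array:
Call next to preprocess the pattern and compute the lps array. This is the key step of KMP: it tells us where to jump when a match fails.
Main loop:
while (i < N) makes sure we scan the whole main string.
When the characters match:
if (pat.charAt(j) == txt.charAt(i)) {
    j++;
    i++;
}
If the characters match, both pointers move forward.
When the whole pattern has matched:
if (j == M) {
    System.out.println("Pattern found at index " + (i - j));
    j = lps[j - 1];
}
If j equals M, a complete match has been found. We print where it starts, then set j to lps[j-1] so the search can continue with the next possible match, including one that overlaps this one.
When the characters do not match:
else if (i < N && pat.charAt(j) != txt.charAt(i)) {
    if (j != 0) j = lps[j - 1];
    else i = i + 1;
}
If j is not 0, we use the lps array to jump to an earlier position in the pattern and continue from there; i does not move.
If j is 0, not even the first character matches here, so all we can do is move the main-string pointer i forward.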
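
As a quick check of the walkthrough, here is a compact sketch that runs the same search on the example from this section. It is not the article's class: KmpTraceDemo and findAll are hypothetical names, and the method returns the match positions in a list instead of printing them so the result is easy to inspect.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; names are illustrative assumptions.
class KmpTraceDemo {
    // Returns the start index of every occurrence of pat in txt.
    static List<Integer> findAll(String pat, String txt) {
        List<Integer> hits = new ArrayList<>();
        int m = pat.length();
        int n = txt.length();
        if (m == 0) {
            return hits; // treat an empty pattern as "no matches"
        }

        // Build the lps table for the pattern.
        int[] lps = new int[m];
        for (int i = 1, len = 0; i < m; ) {
            if (pat.charAt(i) == pat.charAt(len)) {
                lps[i++] = ++len;
            } else if (len != 0) {
                len = lps[len - 1];
            } else {
                lps[i++] = 0;
            }
        }

        // Scan the text; i never moves backwards.
        int i = 0;
        int j = 0;
        while (i < n) {
            if (pat.charAt(j) == txt.charAt(i)) {
                i++;
                j++;
                if (j == m) {          // whole pattern matched
                    hits.add(i - j);
                    j = lps[j - 1];    // allow overlapping matches
                }
            } else if (j != 0) {
                j = lps[j - 1];        // partial fallback, i stays put
            } else {
                i++;                   // nothing matched, move on
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll("ABABC", "ABABABCDA")); // prints [2]
        System.out.println(findAll("AA", "AAAA"));         // prints [0, 1, 2]
    }
}
```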

Complete implementation of the AC automaton

import java.util.*;

public class ACAutomaton {

    private class Node {
        Map<Character, Node> children;
        boolean isEndOfWord;
        Node fail;
        List<String> output;

        Node() {
            children = new HashMap<>();
            isEndOfWord = false;
            fail = null;
            output = new ArrayList<>();
        }
    }

    private Node root;

    public ACAutomaton() {
        root = new Node();
    }

    // Build the trie: insert one keyword
    public void addKeyword(String keyword) {
        Node current = root;
        for (char c : keyword.toCharArray()) {
            current.children.putIfAbsent(c, new Node());
            current = current.children.get(c);
        }
        current.isEndOfWord = true;
        current.output.add(keyword);
    }

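    // Build the failure pointers (call once, after all keywords have been added)
    public void buildFailurePointers() {
        Queue<Node> queue = new LinkedList<>();

        // Depth-1 nodes fall back to the root
        for (Node child : root.children.values()) {
            child.fail = root;
            queue.offer(child);
        }

        // BFS: every node is processed before its children
        while (!queue.isEmpty()) {
            Node current = queue.poll();

            for (Map.Entry<Character, Node> entry : current.children.entrySet()) {
                char c = entry.getKey();
                Node child = entry.getValue();
                Node failNode = current.fail;

                // Follow the parent's failure chain until some node has a child for c
                while (failNode != null && !failNode.children.containsKey(c)) {
                    failNode = failNode.fail;
                }

                if (failNode == null) {
                    // Ran past the root: no non-empty suffix of this path is a prefix in the trie
                    child.fail = root;
                } else {
                    child.fail = failNode.children.get(c);
                    // Inherit the keywords that end at the fallback node
                    child.output.addAll(child.fail.output);
                }

                queue.offer(child);
            }
        }
    }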

    // Sensitive-word filter: returns the text with every matched keyword replaced by '*'
    public String filter(String text) {
        StringBuilder result = new StringBuilder(text);
        Node current = root;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);

            // On a mismatch, follow failure pointers until c can be consumed or we are back at the root
            while (current != root && !current.children.containsKey(c)) {
                current = current.fail;
            }

            if (current.children.containsKey(c)) {
                current = current.children.get(c);
            } else {
                continue; // still at the root, and no keyword starts with c
            }

            // Mask every keyword that ends at position i
            if (!current.output.isEmpty()) {
                for (String keyword : current.output) {
                    int start = i - keyword.length() + 1;
                    for (int j = start; j <= i; j++) {
                        result.setCharAt(j, '*');
                    }
                }
            }
        }

        return result.toString();
    }
}
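
To see the failure pointers at work, here is a compact sketch that restates the same structure so it can run on its own and reports where each keyword is found instead of masking it. Everything in it is an assumption for illustration: the class AcMatchDemo, the method findAll, the "keyword@start" result format, and the keywords he, she and hers are mine, not the article's.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical demo class; a compact restatement, not the article's ACAutomaton.
class AcMatchDemo {
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        final List<String> output = new ArrayList<>(); // keywords ending at this node
        Node fail;
    }

    final Node root = new Node();

    void addKeyword(String keyword) {
        Node cur = root;
        for (char c : keyword.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        }
        cur.output.add(keyword);
    }

    // BFS over the trie; call once after all keywords have been added.
    void buildFailurePointers() {
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.children.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node cur = queue.poll();
            for (Map.Entry<Character, Node> e : cur.children.entrySet()) {
                Node f = cur.fail;
                while (f != null && !f.children.containsKey(e.getKey())) {
                    f = f.fail;
                }
                Node child = e.getValue();
                child.fail = (f == null) ? root : f.children.get(e.getKey());
                child.output.addAll(child.fail.output); // inherit shorter keywords
                queue.add(child);
            }
        }
    }

    // Returns "keyword@startIndex" for every occurrence, in scan order.
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Node cur = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (cur != root && !cur.children.containsKey(c)) {
                cur = cur.fail;
            }
            cur = cur.children.getOrDefault(c, root);
            for (String kw : cur.output) {
                hits.add(kw + "@" + (i - kw.length() + 1));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        AcMatchDemo ac = new AcMatchDemo();
        ac.addKeyword("he");
        ac.addKeyword("she");
        ac.addKeyword("hers");
        ac.buildFailurePointers();
        System.out.println(ac.findAll("ushers")); // prints [she@1, he@2, hers@2]
    }
}
```

After "she" has matched, the next character r is not a child of that node, so the search follows the failure pointer to the node for "he" and continues from there to "hers" without re-reading any text. The "he" hit at index 2 comes from the output list that the "she" node inherited from its fallback node.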