字典树的实际应用场景之敏感词过滤（PHP 和 Java）

最新推荐文章于 2024-04-10 15:17:12 发布

Hopeful Qiang

最新推荐文章于 2024-04-10 15:17:12 发布

阅读量563

点赞数 1

分类专栏： php Java 数据结构文章标签： php java 算法数据结构字典

本文链接：https://blog.csdn.net/fennud_xiaoqiang/article/details/122455793

版权

php 同时被 3 个专栏收录

3 篇文章 0 订阅

订阅专栏

Java

3 篇文章 0 订阅

订阅专栏

数据结构

1 篇文章 0 订阅

订阅专栏

字典树的实际应用场景之敏感词过滤（PHP 和 Java）

Trie（字典树，前缀树）的介绍及特性

介绍

Trie又称单词查找树，是一种树形结构，是哈希树的变种。典型应用是用于统计，排序和保存大量的字符串（但不仅限于字符串），所以经常被搜索引擎系统用于文本词频统计。
优点：非常适合操作字符串，利用字符串的公共前缀来减少查询时间，最大限度地减少无谓的字符串比较，查询效率比哈希树高。
缺点：虽然不同单词共享前缀，但其实trie是一个以空间换时间的算法，每个结点只存储一个字符浪费了

Trie树的一些特性：

1.根节点不包含字符，除根节点外每一个节点都只包含一个字符，每个结点拥有26个子结点j
2.从根节点到某一节点，路径上经过的字符连接起来，为该节点对应的字符串
3.每个节点的所有子节点包含的字符都不相同
4.如果字符的种数为n，则每个结点的出度为n，这也是空间换时间的体现，浪费了很多的空间
5.插入查找的复杂度为O(n)，n为字符串长度。

## 敏感词过滤

在很多场景都会用到敏感词过滤，比如在网站提交的内容，游戏中的聊天等等…那么这些敏感词是如何被过滤掉的呢？其实这就是一个字符串的匹配过程，我们很容易就可以想到的就是准备一个敏感词库，然后用每一个敏感词去要过滤的文本中匹配，匹配成功则用 *** 来代替敏感词，不同的匹配算法的时间复杂度是不一样的，比如敏感词的长度为m，待过滤文本的长度为 n，有如下几种匹配算法：

暴力破解法，利用两层循环来实现，时间复杂度为O(m * n);
KMP算法：时间复杂度为 O(n + m)；
字典树：时间复杂度为O(n)；

Trie 实现敏感词的过滤

首先，在进行敏感词过滤之前我们需要有一个敏感词库，用这个敏感词库来建立字典树，比如我们现在有两个敏感词：“de”, “bca”，建立字典树，根节点不存放任何东西，其他每个子节点存放一个字符，深红色代表单词结尾，如下：
在这里插入图片描述

建立完字典树之后，我们就要利用这棵由敏感词组成的字典树匹配字符串了。

现在，我们有一段文本 “abcadef" ，目的是将其中的 “de”, “bca”,过滤掉，具体算法如下：

为了遍历字符串和字典树，我们假设有三个指针，p1,p2,p3，其中p1指向字典树根节点，p2 和 p3 指向字符串的第一个字符，如下：
然后从字符串的 a 开始，检测有没有以 a 作为前缀的敏感词，直接判断 p1 的孩子节点中是否有 a 这个节点就可以了，显然这里没有。接着把指针 p2 和 p3 向右移动一格。
然后从字符串 b 开始查找，看看是否有以 b 作为前缀的字符串，p1 的孩子节点中有 b，这时，我们把 p1 指向节点 b，由于此时 b 不是单词的结尾，所以p3 向右移动一格，不过，p2 不动。
判断 p1 的孩子节点中是否存在 p3 指向的字符 c，显然有。我们把 p1 指向节点 c，p3 向右移动一格，p2 不动。
判断 p1 的孩子节点中是否存在 p3 指向的字符 a，显然有，且 a 是字符串 “bca” 的结尾。这意味着，p2 到 p3之间为敏感词 “bca”，把 p2 和 p3 指向的区间那些字符替换成 *。这时我们把 p2 和 p3 都移向字符 d，p1 还是还原到最开始指向 root。
和前面的步骤一样，判断有没以 d 作为前缀的字符串，显然这里有 “de”，所以把 p2 和 p3 移到字符 f。
因为根节点没有子节点 f，所以匹配结束。

在 Java 中可以利用 HashMap 来存放一层树结点，则每个敏感词的查找时间复杂度是 O (1)，字符串的长度为 n，我们需要遍历 1 遍，所以敏感词查找这个过程的时间复杂度是 O (n * 1)。如果每个敏感词的平均长度为 m，有 t 个敏感词的话，构建 trie 树的时间复杂度是 O (t * m)。

但在实际的应用中，构建 trie 树的时间复杂度可以忽略，因为 trie 树我们可以在一开始就构建了，以后可以无数次重复利用的了。

Java示例

import java.util.HashMap;
import java.util.Map;

import org.apache.commons.lang.CharUtils;

public class MyTrie {
	private class TrieNode {
        /**
         * 是否敏感词的结尾
         */
        private boolean isEnd = false;

        /**
         * 下一层节点
         */
        Map<Character, TrieNode> subNodes = new HashMap<>();

        /**
         * 添加下一个敏感词节点
         *
         * @param c
         * @param node
         */
        public void addSubNode(Character c, TrieNode node) {
            subNodes.put(c, node);
        }

        /**
         * 获取下一个敏感词节点
         *
         * @param c
         * @return
         */
        public TrieNode getSubNode(Character c) {
            return subNodes.get(c);
        }

        public void setEnd(boolean end) {
            this.isEnd = end;
        }

        public boolean getIsEnd() {
            return this.isEnd;
        }


    }

    private static final String DEFAULT_REPLACEMENT = "***";

    /**
     * 根节点
     */
    private TrieNode rootNode;


    public MyTrie() {
        rootNode = new TrieNode();
    }

    /**
     * 识别特殊符号
     */
    private boolean isSymbol(Character c) {
        int ic = (int) c;
        // 0x2E80-0x9FFF 东亚文字范围
        return !CharUtils.isAsciiAlphanumeric(c) && (ic < 0x2E80 || ic > 0x9FFF);
    }


    /**
     * 将敏感词添加到字典树中
     * @param text
     */
    public void addWord(String text) {
        if (text == null || text.trim().equals(""))
            return;

        TrieNode curNode = rootNode;
        int length = text.length();
        //遍历每个字
        for (int index = 0; index < length; ++index) {
            Character c = text.charAt(index);
            /**
             * 过滤特殊字符
             */
            if (isSymbol(c))
                continue;

            TrieNode nextNode = curNode.getSubNode(c);
            // 第一次添加的节点
            if (nextNode == null) {
                nextNode = new TrieNode();
                curNode.addSubNode(c, nextNode);
            }
            curNode = nextNode;

            // 设置敏感词标识
            if (index == length - 1) {
                curNode.setEnd(true);
            }
        }
    }

    /**
     * 过滤敏感词
     * @param text
     * @return
     */
    public String filter(String text) {
        if (text == null || text.trim().equals(""))
            return text;

        String replacement = DEFAULT_REPLACEMENT;
        StringBuilder result = new StringBuilder();

        TrieNode curNode = rootNode;
        int begin = 0; // 回滚位置
        int position = 0;// 当前位置

        while (position < text.length()) {
            Character c = text.charAt(position);

            // 过滤空格等
            if (isSymbol(c)) {

                if (curNode == rootNode) {
                    result.append(c);
                    ++begin;
                }

                ++position;
                continue;
            }

            curNode = curNode.getSubNode(c);

            // 当前位置的匹配结束
            if (curNode == null) {
                // 以begin开始的字符串不存在敏感词
                result.append(text.charAt(begin));
                // 跳到下一个字符开始测试
                position = begin + 1;
                begin = position;
                // 回到树初始节点,重新匹配
                curNode = rootNode;

            } else if (curNode.getIsEnd()) {
                // 发现敏感词，从begin到position的位置用replacement替换掉
                result.append(replacement);
                position = position + 1;
                begin = position;
                // 回到树初始节点,重新匹配
                curNode = rootNode;
            } else {
                ++position;
            }

        }

        result.append(text.substring(begin));

        return result.toString();
    }

    /**
     * 测试
     * @param args
     */
    public static void main(String[] args) {

       MyTrie tree = new MyTrie();
        //添加敏感词
        tree.addWord("de");
        tree.addWord("bca");

        //过滤敏感词
        String res = tree.filter("ab cad ef");
        System.out.println("敏感词：de bca \n输入数据：ab cad ef \n过滤结果："+res);
    }
}

Java示例运行结果
经过测试本示例代码对汉字过滤也同样有效（注：只做了部分测试，请勿直接使用在项目中）
在这里插入图片描述

PHP示例


/**
 * 字典树-敏感词过滤
 */
class TreeMap
{
    public $data;  // 节点字符
    public $children = [];  // 存放子节点引用（因为有任意个子节点，所以靠数组来存储）
    public $isEndingChar = false;  // 是否是字符串结束字符
 
    public function __construct($data)
    {
        $this->data = $data;
    }
}
 
class TrieTree
{
    /**
     * 敏感词数组
     * 
     * @var array
     * @author qpf
     */
    public $trieTreeMap = array();
 
     public function __construct()
    {
        $this->trieTreeMap = new TreeMap('/');
    }
 
    /**
     * 获取敏感词Map
     * 
     * @return array
     * @author qpf
     */
    public function getTreeMap()
    {
        return $this->trieTreeMap;
    }
 
    /**
     * 添加敏感词
     * 
     * @param array $txtWords
     * @author qpf
     */
    public function addWords(array $wordsList)
    {
        foreach ($wordsList as $words) {
            $trieTreeMap = $this->trieTreeMap;
            $len = mb_strlen($words);
            for ($i = 0; $i < $len; $i++) {
                $word = mb_substr($words, $i, 1);
                if(!isset($trieTreeMap->children[$word])){
                    $newNode = new TreeMap($word);
                    $trieTreeMap->children[$word] = $newNode;
                }
                $trieTreeMap = $trieTreeMap->children[$word];
            }
            $trieTreeMap->isEndingChar = true;
        }
    }
 

    /**
     * 查找对应敏感词
     * 
     * @param string $txt
     * @return array
     * @author qpf
     */
    public function search($txt)
    {
    	if(empty($txt)) return [];
    	// $txt = str_replace(' ', '', $txt);
        $wordsList = array();
        $txtLength = mb_strlen($txt);
        for ($i = 0; $i < $txtLength; $i++) {
            $wordLength = $this->checkWord($txt, $i, $txtLength);
            if($wordLength > 0) {
                // echo $wordLength;
                $words = mb_substr($txt, $i, $wordLength);
                $wordsList[] = $words;
                $i += $wordLength - 1;
            }
        }
        return $wordsList;
    }
 
    /**
     * 敏感词检测
     * 
     * @param $txt
     * @param $beginIndex
     * @param $length
     * @return int
     */
    private function checkWord($txt, $beginIndex, $length)
    {
        $flag = false;
        $wordLength = 0;
        $trieTree = $this->trieTreeMap; //获取敏感词树
        for ($i = $beginIndex; $i < $length; $i++) {
            $word = mb_substr($txt, $i, 1); //检验单个字
            if (ctype_space($word)) {//检查是否有空格
            	$wordLength++; 
            	continue;
            }
            if (!isset($trieTree->children[$word])) { //如果树中不存在，结束        
                break;
            }
            //如果存在
            $wordLength++; 
            $trieTree = $trieTree->children[$word];
            if ($trieTree->isEndingChar === true) {  
                $flag = true;
                break;
            }
        }
        if($beginIndex > 0) {
            $flag || $wordLength = 0; //如果$flag == false  赋值$wordLenth为0
        }
        return $wordLength;
    }
    
}
 
$data = ['de', 'bca'];
$wordObj = new TrieTree();
$wordObj->addWords($data);
 
$txt = "ab cad ef";
$words = $wordObj->search($txt);

echo "<pre>-------getTreeMap----------";
print_r($wordObj->getTreeMap());
echo "<br>";

echo "<pre>----------words-----------";
print_r($words);