Unity | Filtering Sensitive Words with the DFA Algorithm

小小小小羽丶

Last modified: 2022-02-09 15:01:25

First published: 2021-11-12 17:51:21

Article link: https://blog.csdn.net/qq_36374904/article/details/121292995

Preface

I read an article by a more experienced developer on filtering sensitive words with the DFA algorithm, and I am writing down my own understanding of it here. I re-implemented the code from that understanding, so it may differ slightly from the original, but the result is the same. I also handle the case where meaningless characters are inserted in the middle of a sensitive word, for example: 恶&%心. I recommend reading the original article first; it makes this one easier to follow.

The original article: 游戏中敏感词的过滤之DFA算法 (白菊花瓣的博客, CSDN), https://blog.csdn.net/qq_44857648/article/details/108394857

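Main Content

The DFA Algorithm

DFA stands for Deterministic Finite Automaton: a machine that moves from one state to another through a series of events, i.e. state -> event -> state, which can also be read as state + event = nextState.

Deterministic: the states, and the events that trigger transitions between them, are all fully determined; there are no "surprises".
Finite: the states and the events can both be enumerated.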
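
To make state + event = nextState concrete, here is a minimal sketch of a DFA written as an explicit transition table. It is only an illustration: the class name TinyDfa, the state numbers, and the accepted string "ab" are all made up for this example and are not part of the filter code in this article.

```csharp
using System.Collections.Generic;

// Illustrative only: a tiny DFA that accepts exactly the string "ab".
public static class TinyDfa
{
    // (state, event) -> next state. Both states and events are finite.
    private static readonly Dictionary<(int, char), int> Transitions =
        new Dictionary<(int, char), int>
        {
            { (0, 'a'), 1 },
            { (1, 'b'), 2 },
        };

    private const int AcceptState = 2;

    public static bool Accepts(string input)
    {
        int state = 0;
        foreach (char ev in input)
        {
            // Deterministic: each (state, event) pair has at most one next state.
            // A missing entry means the input is rejected.
            if (!Transitions.TryGetValue((state, ev), out state)) return false;
        }
        return state == AcceptState;
    }
}
```

TinyDfa.Accepts("ab") returns true, while "a", "abc", and the empty string are rejected. The sensitive-word tree described next works the same way: each node is a state and each input character is an event.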

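The Key to the Algorithm

The key to this algorithm is building a tree, or rather a forest, from the word list, and then walking that tree with the user's input to find any sensitive words it contains.

Why Use It?

Checking the input against every entry in the word list is slow, especially when the list holds tens of thousands of words. With the tree structure, the work done at each position of the input depends on the length of the longest matching word, not on the size of the word list, so lookup is very fast.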
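
For contrast, the word-by-word approach that the tree replaces might look like the sketch below. The class name NaiveFilter and its Mask method are hypothetical names for this illustration. Every word in the list triggers its own scan of the whole input, so the cost grows with the number of words.

```csharp
using System.Collections.Generic;

// Illustrative only: the naive approach, one full scan of the text per word.
public static class NaiveFilter
{
    public static string Mask(string text, IEnumerable<string> words)
    {
        foreach (string word in words)
        {
            // string.Replace throws on an empty search string, so skip empty entries.
            if (string.IsNullOrEmpty(word)) continue;
            text = text.Replace(word, new string('*', word.Length));
        }
        return text;
    }
}
```

With a list of ten thousand words this performs ten thousand scans per message, which is the cost the tree lookup avoids.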

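Approach

First, build the filter words into a forest.

Traversal starts from "你" and uses IsEnd to check whether the current node ends a word. If it does not, keep descending until you reach a node whose IsEnd is true. Here I use 0 for false and 1 for true.

The forest is built with nested Hashtable objects (reading this alongside the code, or stepping through it with breakpoints, helps). A Dictionary would also work, but I find it less convenient than a Hashtable.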
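
Since a Dictionary is mentioned as an alternative, here is a rough sketch of what the same forest could look like with typed nodes. It is not the code used in this article: TrieNode, TrieBuilder, Build, and MatchLength are names invented for this example, and IsEnd is a bool here instead of the 0/1 value stored in the Hashtable version.

```csharp
using System.Collections.Generic;

// Illustrative only: one node of the forest, with typed children.
public sealed class TrieNode
{
    public readonly Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsEnd;
}

public static class TrieBuilder
{
    // Builds the forest: one path per word, with shared prefixes sharing nodes.
    public static TrieNode Build(IEnumerable<string> words)
    {
        var root = new TrieNode();
        foreach (string word in words)
        {
            if (string.IsNullOrEmpty(word)) continue;
            TrieNode node = root;
            foreach (char ch in word)
            {
                if (!node.Children.TryGetValue(ch, out TrieNode child))
                {
                    child = new TrieNode();
                    node.Children.Add(ch, child);
                }
                node = child;
            }
            // The node reached by the last character ends a word.
            node.IsEnd = true;
        }
        return root;
    }

    // Length of the longest word that starts at 'start', or 0 if there is none.
    public static int MatchLength(TrieNode root, string text, int start)
    {
        TrieNode node = root;
        int len = 0;
        for (int i = start; i < text.Length; i++)
        {
            // No child for this character: the walk ends here.
            if (!node.Children.TryGetValue(text[i], out node)) break;
            if (node.IsEnd) len = i + 1 - start;
        }
        return len;
    }
}
```

With the made-up word list { "你好", "你们好" }, both words share the root child "你". MatchLength(root, "你们好呀", 0) returns 3, and MatchLength(root, "你们", 0) returns 0 because no node on that path has IsEnd set.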

Code (without handling of meaningless characters inserted mid-word)

using System.Collections;
using System.Collections.Generic;
using System.Text;
using UnityEngine;

/// <summary>
/// 
/// * Writer: June
/// 
/// * Date: 2021.11.10
/// 
/// * Function: DFA algorithm
/// 
/// * Remarks: used to filter sensitive words
/// 
/// </summary>


public class DFAAlgorithm : MonoBehaviour
{
    private Hashtable hashtable;
    public List<string> filterList = new List<string>();
    [TextArea(1, 3)] public string speakStr;

    private void Start()
    {
        InitFilter(filterList);
    }


    private void Update()
    {
        if (Input.GetKeyDown(KeyCode.S))
        {
            string resStr = StringCheckAndReplace(speakStr); 
            Debug.Log($"Result: {resStr}");
        }
    }
    

    /// <summary>
    /// Initializes the filter.
    /// </summary>
    /// <param name="wordList">The list of words to filter.</param>
    private void InitFilter(List<string> wordList)
    {
        // Root of the forest: maps each first character to its child node.
        hashtable = new Hashtable(wordList.Count);
        // One pass of the outer loop per filter word.
        for (int i = 0; i < wordList.Count; i++)
        {
            // Cursor that starts at the root and moves down one level per character.
            Hashtable tmpHs = hashtable;
            for (int j = 0; j < wordList[i].Length; j++)
            {
                // Take the word one character at a time.
                char ch = wordList[i][j];
                // Does the current node already have a child keyed by this character?
                if (tmpHs.ContainsKey(ch))
                {
                    tmpHs = (Hashtable)tmpHs[ch];
                }
                else
                {
                    Hashtable newHs = new Hashtable();
                    newHs.Add("IsEnd", 0);  // Default 0: this character does not end a word.
                    tmpHs.Add(ch, newHs);   // Nest the new table under the current character's key.
                    // Descend into the new node; its IsEnd is updated below if needed.
                    tmpHs = newHs;
                }
                if (j == (wordList[i].Length - 1))
                {
                    // Last character of the word: mark this node as a word end.
                    tmpHs["IsEnd"] = 1;
                }
            }
        }
    }


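    /// <summary>
    /// Checks a string and replaces every sensitive word with asterisks.
    /// </summary>
    /// <param name="targetStr">The string to check.</param>
    /// <returns>The string with sensitive words masked.</returns>
    private string StringCheckAndReplace(string targetStr)
    {
        // Nothing to check in a null or empty string.
        if (string.IsNullOrEmpty(targetStr)) return targetStr;
        StringBuilder stringBuilder = new StringBuilder(targetStr);
        int len = 0;
        for (int i = 0; i < targetStr.Length; )
        {
            len = SensitiveWordsLength(targetStr, i);
            // No sensitive word starts here: move on to the next character.
            if (len == 0)
            {
                i++;
                continue;
            }
            // Mask the matched span.
            for (int j = 0; j < len; j++)
            {
                stringBuilder[i + j] = '*';
            }
            // Resume scanning right after the masked span.
            i += len;
        }
        return stringBuilder.ToString();
    }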


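    /// <summary>
    /// Returns the length of the longest sensitive word starting at beginIndex.
    /// </summary>
    /// <param name="targetStr">The string to check.</param>
    /// <param name="beginIndex">The index at which matching starts.</param>
    /// <returns>The match length, or 0 if no sensitive word starts there.</returns>
    private int SensitiveWordsLength(string targetStr, int beginIndex)
    {
        // Current node (hashtable); start at the root.
        Hashtable curHs = hashtable;
        // Length of the longest match found so far.
        int len = 0;
        // Walk the string from the given index.
        for (int i = beginIndex; i < targetStr.Length; i++)
        {
            char ch = targetStr[i];
            // Child node for this character, or null if there is none.
            Hashtable newtmpHs = (Hashtable)curHs[ch];
            if (newtmpHs == null) break;
            // Always descend, even at a word end. Otherwise the next character
            // would be matched against the parent node again, so with the word
            // "ab" the input "abb" would be masked as three characters.
            curHs = newtmpHs;
            // Record the match length whenever this node ends a word.
            if ((int)curHs["IsEnd"] == 1) len = i + 1 - beginIndex;
        }
        return len;
    }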
}

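To handle meaningless characters inserted in the middle of a word, the first idea that comes to mind is to strip every special character from the user's input before checking it. For ordinary chat with very short messages, that is good enough. For longer text, it is better to change the character check itself and add one more test: is the current character meaningful?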
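
The strip-first idea could be sketched roughly as follows. FillerStripper, IsFiller, and Strip are hypothetical names, and the definition of "meaningless" used here (printable ASCII from '!' to '~') is an assumption made for this example.

```csharp
using System.Text;

// Illustrative only: remove filler characters before running the filter.
public static class FillerStripper
{
    // Assumption: printable ASCII from '!' (33) to '~' (126) counts as filler.
    public static bool IsFiller(char ch) => ch > 32 && ch <= 126;

    public static string Strip(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char ch in input)
        {
            if (!IsFiller(ch)) sb.Append(ch);
        }
        return sb.ToString();
    }
}
```

FillerStripper.Strip("恶&%心") returns "恶心", which the filter can then match directly. The drawback is that every ASCII character is lost from the message, including harmless ones, which is one more reason to move the check into the matching loop instead.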

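Complete code (handles meaningless characters inserted mid-word)

The check is based on ASCII codes: printable ASCII characters such as digits, letters, and symbols are treated as meaningless filler inside a word.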

using System.Collections;
using System.Collections.Generic;
using System.Text;
using UnityEngine;

/// <summary>
/// 
/// * Writer: June
/// 
/// * Date: 2021.11.10
/// 
/// * Function: DFA algorithm
/// 
/// * Remarks: used to filter sensitive words
/// 
/// </summary>


public class DFAAlgorithm : MonoBehaviour
{
    private Hashtable hashtable;
    public List<string> filterList = new List<string>();
    [TextArea(1, 3)] public string speakStr;

    private void Start() => InitFilter(filterList);


    private void Update()
    {
        if (Input.GetKeyDown(KeyCode.S))
        {
            string resStr = StringCheckAndReplace(speakStr); 
            Debug.Log($"Result: {resStr}");
        }
    }
    

    /// <summary>
    /// Initializes the filter.
    /// </summary>
    /// <param name="wordList">The list of words to filter.</param>
    private void InitFilter(List<string> wordList)
    {
        // Root of the forest: maps each first character to its child node.
        hashtable = new Hashtable(wordList.Count);
        // One pass of the outer loop per filter word.
        for (int i = 0; i < wordList.Count; i++)
        {
            // Cursor that starts at the root and moves down one level per character.
            Hashtable tmpHs = hashtable;
            for (int j = 0; j < wordList[i].Length; j++)
            {
                // Take the word one character at a time.
                char ch = wordList[i][j];
                // Does the current node already have a child keyed by this character?
                if (tmpHs.ContainsKey(ch))
                {
                    tmpHs = (Hashtable)tmpHs[ch];
                }
                else
                {
                    Hashtable newHs = new Hashtable();
                    newHs.Add("IsEnd", 0);  // Default 0: this character does not end a word.
                    tmpHs.Add(ch, newHs);   // Nest the new table under the current character's key.
                    // Descend into the new node; its IsEnd is updated below if needed.
                    tmpHs = newHs;
                }
                if (j == (wordList[i].Length - 1))
                {
                    // Last character of the word: mark this node as a word end.
                    tmpHs["IsEnd"] = 1;
                }
            }
        }
    }


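    /// <summary>
    /// Checks a string and replaces every sensitive word with asterisks.
    /// </summary>
    /// <param name="targetStr">The string to check.</param>
    /// <returns>The string with sensitive words masked.</returns>
    private string StringCheckAndReplace(string targetStr)
    {
        // Nothing to check in a null or empty string.
        if (string.IsNullOrEmpty(targetStr)) return targetStr;
        StringBuilder stringBuilder = new StringBuilder(targetStr);
        int len = 0;
        for (int i = 0; i < targetStr.Length; )
        {
            len = SensitiveWordsLength(targetStr, i);
            // No sensitive word starts here: move on to the next character.
            if (len == 0)
            {
                i++;
                continue;
            }
            // Mask the matched span, including any filler inside it.
            for (int j = 0; j < len; j++)
            {
                stringBuilder[i + j] = '*';
            }
            // Resume scanning right after the masked span.
            i += len;
        }
        return stringBuilder.ToString();
    }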


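    /// <summary>
    /// Returns the length of the longest sensitive word starting at beginIndex,
    /// counting any filler characters inside it.
    /// </summary>
    /// <param name="targetStr">The string to check.</param>
    /// <param name="beginIndex">The index at which matching starts.</param>
    /// <returns>The match length, or 0 if no sensitive word starts there.</returns>
    private int SensitiveWordsLength(string targetStr, int beginIndex)
    {
        // Current node (hashtable); start at the root.
        Hashtable curHs = hashtable;
        // Length of the longest match found so far.
        int len = 0;
        // Walk the string from the given index.
        for (int i = beginIndex; i < targetStr.Length; i++)
        {
            char ch = targetStr[i];
            // Child node for this character, or null if there is none.
            Hashtable newtmpHs = (Hashtable)curHs[ch];
            if (newtmpHs == null)
            {
                // Inside a partial match, skip printable ASCII ('!' to '~') as filler.
                // Filler before the first matched character is not skipped, so
                // harmless text in front of a word is never masked, and ASCII
                // characters that belong to a filter word still match above.
                if (curHs != hashtable && ch > 32 && ch <= 126) continue;
                break;
            }
            // Always descend, even at a word end, so the next character is
            // matched against this node's children rather than its parent's.
            curHs = newtmpHs;
            // Record the match length whenever this node ends a word.
            if ((int)curHs["IsEnd"] == 1) len = i + 1 - beginIndex;
        }
        return len;
    }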
}