Building a Sensitive-Word Filter with the DFA Algorithm Model

Latest recommended article published on 2024-05-24 19:05:33

智_永无止境

Category: Algorithms. Tags: DFA, sensitive-word filtering algorithm

Article link: https://blog.csdn.net/static_coder/article/details/103054633

1. Introduction

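When text is uploaded through a text editor during development, we sometimes need to check it for sensitive words. For small amounts of text, I simply used indexOf or a regular expression to test whether a sensitive word was present, but performance degrades badly as the text grows. Here I take a brief look at the DFA algorithm model, which builds an index by turning the sensitive words into a tree structure, making lookups easier and faster.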
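For comparison, the naive approach described above might look roughly like the sketch below. The names `NaiveScan`, `countOccurrences` and `scan` are hypothetical and exist only for illustration. The point is that every keyword triggers its own pass over the text, so the work grows with the number of keywords multiplied by the text length.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical baseline for illustration: one indexOf scan per keyword.
class NaiveScan {

    // Count occurrences of word in text with indexOf (overlapping ones included).
    static int countOccurrences(String text, String word) {
        if (text == null || word == null || word.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = text.indexOf(word);
        while (from >= 0) {
            count++;
            // Advance by one character so overlapping occurrences are counted too
            from = text.indexOf(word, from + 1);
        }
        return count;
    }

    // Scan the whole text once per keyword and keep the keywords that occur.
    static Map<String, Integer> scan(String text, List<String> words) {
        Map<String, Integer> hits = new LinkedHashMap<>();
        for (String word : words) {
            int n = countOccurrences(text, word);
            if (n > 0) {
                hits.put(word, n);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("那些", "大漠", "无耻");
        System.out.println(scan("大漠孤烟直，那些，那些", words));
    }
}
```

With a handful of keywords this is perfectly adequate; the tree-based index only pays off once the keyword list or the text becomes large.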

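2. Data structure of the DFA algorithm

The goal of the DFA algorithm is to split each word into its individual characters and arrange them into a tree. Every keyword is a path that starts at the root and ends at a node carrying the end flag (isEnd = 1). Each keyword has exactly one such end node, and that node need not be a leaf when a longer keyword shares the same prefix. For details, see the [link].

The data structure looks like this: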

     *   中 = {
     *      isEnd = 0
     *      国 = {
     *           isEnd = 1
     *           人 = {isEnd = 0
     *                民 = {isEnd = 1}
     *                }
     *           男  = {
     *                  isEnd = 0
     *                   人 = {
     *                        isEnd = 1
     *                       }
     *               }
     *           }
     *      }
     *  五 = {
     *      isEnd = 0
     *      星 = {
     *          isEnd = 0
     *          红 = {
     *              isEnd = 0
     *              旗 = {
     *                   isEnd = 1
     *                  }
     *              }
     *          }
     *      }
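The structure drawn above can be reproduced with plain nested maps. The following is a minimal sketch using only the JDK; `NestedMapDemo`, `addWord` and `endFlagOf` are hypothetical names chosen for this illustration, and the four words are the ones shown in the diagram.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: builds the nested-map structure drawn above using only the JDK.
class NestedMapDemo {
    static final String END_FLAG = "isEnd";

    // Insert one word: one nested map per character, isEnd = 1 on the last character.
    @SuppressWarnings("unchecked")
    static void addWord(Map<String, Object> root, String word) {
        Map<String, Object> current = root;
        for (int i = 0; i < word.length(); i++) {
            String key = String.valueOf(word.charAt(i));
            Map<String, Object> child = (Map<String, Object>) current.get(key);
            if (child == null) {
                // First time this character appears at this depth: create its node
                child = new HashMap<>();
                child.put(END_FLAG, 0);
                current.put(key, child);
            }
            current = child;
        }
        if (current != root) {
            // Mark the node of the last character as the end of a word
            current.put(END_FLAG, 1);
        }
    }

    // Follow the path spelled by word; return its isEnd flag, or null if the path does not exist.
    @SuppressWarnings("unchecked")
    static Integer endFlagOf(Map<String, Object> root, String word) {
        Map<String, Object> current = root;
        for (int i = 0; i < word.length(); i++) {
            Object next = current.get(String.valueOf(word.charAt(i)));
            if (next == null) {
                return null;
            }
            current = (Map<String, Object>) next;
        }
        return (Integer) current.get(END_FLAG);
    }

    public static void main(String[] args) {
        Map<String, Object> root = new HashMap<>();
        for (String w : new String[] {"中国", "中国人民", "中国男人", "五星红旗"}) {
            addWord(root, w);
        }
        System.out.println(endFlagOf(root, "中国"));   // a complete word
        System.out.println(endFlagOf(root, "中国人")); // only a prefix of a word
        System.out.println(endFlagOf(root, "中华"));   // no such path
    }
}
```

Because every key other than `isEnd` is a single character, the flag can share a map with the child nodes without any risk of collision.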

3. A simple implementation of sensitive-word indexing and filtering

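import java.util.List;
import java.util.Map;
import java.util.Set;

// Assumed dependencies (the original post does not show its imports):
// Apache Commons Collections 4, Apache Commons Lang 3 and Guava
import org.apache.commons.collections4.CollectionUtils;
import org.apache.commons.lang3.StringUtils;

import com.google.common.base.Objects;
import com.google.common.collect.HashMultiset;
import com.google.common.collect.Lists;
import com.google.common.collect.Maps;
import com.google.common.collect.Multiset;
import com.google.common.collect.Sets;

/**
 * Sensitive-word filtering utility
 * 
 * @author shuai.wang
 */
public class SensitiveWordTool {
	// Key of the end-of-word flag
	private static final String END_FLAG = "isEnd";
	// Flag value: a word ends at this node
	private static final Integer END_VAL = 1;
	// Flag value: no word ends at this node
	private static final Integer NOT_END_VAL = 0;
	// Sensitive-word index (nested maps)
	private Map<String, Object> sensitiveWordResult;
	// Sensitive words matched by the last scan [duplicates kept]; empty until a scan has run
	private List<String> sensitiveWordList = Lists.newArrayList();
	
	public SensitiveWordTool() {}
	
	/**
	 * Build the index for the given sensitive words
	 * 
	 * @param elements the sensitive words
	 */
	public SensitiveWordTool(String... elements) {
		initWordsIndex(Sets.newHashSet(elements));
	}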

	/**
	 * Index the sensitive words: read the word list, put the words into a HashSet, and build a DFA model:<br>
     * <pre>
     * 中 = {
     *      isEnd = 0
     *      国 = {
     *           isEnd = 1
     *           人 = {isEnd = 0
     *                民 = {isEnd = 1}
     *                }
     *           男  = {
     *                  isEnd = 0
     *                   人 = {
     *                        isEnd = 1
     *                       }
     *               }
     *           }
     *      }
     *  五 = {
     *      isEnd = 0
     *      星 = {
     *          isEnd = 0
     *          红 = {
     *              isEnd = 0
     *              旗 = {
     *                   isEnd = 1
     *                  }
     *              }
     *          }
     *      }
     * </pre>
	 * 
	 * @param sensitiveWords the set of sensitive words to index
	 */
	@SuppressWarnings("unchecked")
	public void initWordsIndex(Set<String> sensitiveWords) {
		// Create the index root; guard against a null set before asking for its size
		int expectedSize = (sensitiveWords == null) ? 0 : sensitiveWords.size();
		sensitiveWordResult = Maps.newHashMapWithExpectedSize(expectedSize);
		// The map (tree node) currently being operated on
		Map<String, Object> currentMap = null;
		// A newly created node
		Map<String, Object> newMap = null;
		// Build the index
		if (CollectionUtils.isNotEmpty(sensitiveWords)) {
			for (String sensitiveWord : sensitiveWords) {
				// Every word starts from the root
				currentMap = sensitiveWordResult;
				int len = sensitiveWord.length();
				for (int i = 0; i < len; i++) {
					String key = String.valueOf(sensitiveWord.charAt(i));
					Object result = currentMap.get(key);
					if (result != null) {
						// This character is already indexed: descend into its existing node
						currentMap = (Map<String, Object>)result;
					}else {
						// Not indexed yet: create a node for this character
						newMap = Maps.newHashMapWithExpectedSize(2);
						newMap.put(END_FLAG, NOT_END_VAL);
						currentMap.put(key, newMap);
						// Continue from the newly created node
						currentMap = newMap;
					}
					// Mark the node of the last character as the end of a sensitive word
					if (i == len-1) {
						currentMap.put(END_FLAG, END_VAL);
					}
				}
			}
		}
	}
	
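	/**
	 * Scan the text for sensitive words and record every match (duplicates kept)
	 * 
	 * @param text the text to scan
	 */
	@SuppressWarnings("unchecked")
	public void filterSensitiveWordList(String text){
		List<String> appearList = Lists.newArrayList(); 
		Map<String, Object> currentResult = sensitiveWordResult;
		// Number of characters matched so far in the current attempt
		int matchIndex = 0;
		if (currentResult != null && !currentResult.isEmpty()) {
			String key = null;
			if (StringUtils.isNotBlank(text)) {
				int len = text.length();
				StringBuilder sb = new StringBuilder();
				for (int i = 0; i < len; i++) {
					key = String.valueOf(text.charAt(i));
					Object object = currentResult.get(key);
					if (object != null) {
						// The character continues a match: record it and descend one level
						matchIndex++;
						sb.append(key);
						currentResult =(Map<String, Object>)object;
						if (Objects.equal(currentResult.get(END_FLAG), END_VAL)) {
							appearList.add(sb.toString());
							// No reset here: keep descending so that longer words sharing
							// this prefix are found too; the reset happens on the first mismatch
						}
						// At the end of the text, restart one character after this attempt began,
						// so a shorter word hidden inside an unfinished match is not missed
						if (i == len - 1) {
							sb = new StringBuilder();
							currentResult = sensitiveWordResult;
							i = i - matchIndex + 1;
							matchIndex = 0;
						}
					}else {
						// Mismatch in the middle of an attempt (checked via matchIndex, so
						// whitespace characters inside a word are handled as well)
						if (matchIndex > 0) {
							sb = new StringBuilder();
							currentResult = sensitiveWordResult;
							// Rewind to where this attempt began; the loop's i++ then restarts one character later
							i = i - matchIndex;
							matchIndex = 0;
						}
					}
				}
			}
		}
		
		sensitiveWordList = appearList;
	}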
	
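	/**
	 * Return the matched sensitive words with duplicates removed
	 * 
	 * @return the distinct sensitive words found by the last scan
	 */
	public Set<String> getSensitiveWordSet(){
		return Sets.newHashSet(sensitiveWordList);
	}
	
	/**
	 * Count how many times a keyword was matched
	 * 
	 * @param key the keyword to count
	 * @return the number of matches of the keyword in the last scan
	 */
	public int getCount(String key){
		Multiset<String> countGroup = HashMultiset.create();
		countGroup.addAll(sensitiveWordList);
		return countGroup.count(key);
	}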
}

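The code uses Guava utility classes (Sets, Maps, Lists, Multiset, Objects); StringUtils and CollectionUtils appear to come from Apache Commons. With those libraries on the classpath, the class can be used as-is.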
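If pulling in those libraries is not an option, the same idea can be sketched with the JDK alone. The class below is not the author's implementation: `SimpleWordFilter`, `findAll` and `mask` are hypothetical names, the typed `Node` replaces the nested `Map<String, Object>`, and the `mask` method (replacing matched characters with `*`) is an assumed extension that the original class does not provide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical JDK-only variant of the idea above (not the author's class).
class SimpleWordFilter {

    // One tree node: children keyed by character, plus an end-of-word flag.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isEnd;
    }

    private final Node root = new Node();

    SimpleWordFilter(String... words) {
        for (String word : words) {
            if (word == null || word.isEmpty()) {
                continue;
            }
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.isEnd = true;
        }
    }

    // Return every occurrence of every indexed word, duplicates included.
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        if (text == null) {
            return hits;
        }
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            // Walk down the tree from this start position until the path breaks
            for (int i = start; i < text.length(); i++) {
                node = node.children.get(text.charAt(i));
                if (node == null) {
                    break;
                }
                if (node.isEnd) {
                    hits.add(text.substring(start, i + 1));
                }
            }
        }
        return hits;
    }

    // Assumed extension: replace every character covered by a match with '*'.
    String mask(String text) {
        if (text == null) {
            return null;
        }
        char[] out = text.toCharArray();
        for (int start = 0; start < out.length; start++) {
            Node node = root;
            for (int i = start; i < out.length; i++) {
                // Match against the original text, not the partially masked copy
                node = node.children.get(text.charAt(i));
                if (node == null) {
                    break;
                }
                if (node.isEnd) {
                    for (int j = start; j <= i; j++) {
                        out[j] = '*';
                    }
                }
            }
        }
        return new String(out);
    }

    public static void main(String[] args) {
        SimpleWordFilter filter = new SimpleWordFilter("那些", "大漠", "无耻");
        String text = "大漠孤烟直，那些，那些所谓的伤悲";
        System.out.println(filter.findAll(text));
        System.out.println(filter.mask(text));
    }
}
```

Restarting the walk from each start position keeps the scanning logic short and avoids index rewinding. The trade-off is the same as in the class above: the worst case is proportional to the text length times the longest keyword.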

4. Test class

@Test
public void test04() {
    String[] keywords = {"那些", "大漠", "无耻"};
    SensitiveWordTool tool = new SensitiveWordTool(keywords);
    String text = "大漠漫漫长河，没有江南风荷的淡然；只希望遇一人牧马塞外，看大雪纷飞后红梅缀雪的静染！"
		+"那时，烧一只陶埙，与那大漠、胡杨共同吹起风沙雪落就已经是最安然的生活……有事感叹：人生像秋风扫落叶般，有时是那么的无情，有时反而给人一种缠绵的感觉！ "
	    +"看到秋菊凌寒而来，夕阳吐霞而归，那些，那些所谓的伤悲是不是可以放下？塞外牧马，食毡饮雪，虽然所有的一切看似不堪，可对于我来说，或许，那就是最好的去处！"
		+"不入红尘土，何染尘世泥？大漠孤烟直，长河落日圆，这又是一种怎样的心情呢？";
    tool.filterSensitiveWordList(text);
    System.out.println("Keywords found: "+ tool.getSensitiveWordSet());
    System.out.println("Occurrences of keyword [大漠]: "+ tool.getCount("大漠"));
    System.out.println("Occurrences of keyword [那些]: "+ tool.getCount("那些"));
}

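5. Result

Counting by hand in the test text, 大漠 occurs 3 times and 那些 occurs twice, while 无耻 does not occur. The run should therefore report a keyword set containing 那些 and 大漠 (HashSet order is not guaranteed) and the counts 3 and 2.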

6. References:

Technical blog: https://www.cnblogs.com/chenssy/p/3751221.html
