If you excerpt or repost this article, please credit the source: http://blog.csdn.net/shadowsick/article/details/8891939
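Application projects often need to filter sensitive words. A hand-written loop over the word list is passable for small data sets, but it struggles once the data grows. For that case I recommend the AC (Aho-Corasick) multi-pattern matching algorithm.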
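For comparison, the hand-written loop mentioned above might look like the sketch below. `NaiveFilter` and `findHits` are names made up for this illustration, not part of any library. Each filter word triggers its own scan of the text, so the work grows roughly with the number of words times the text length.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NaiveFilter {
    // Hypothetical helper: checks each filter word against the text,
    // one contains() scan per word.
    public static List<String> findHits(List<String> filters, String text) {
        List<String> hits = new ArrayList<>();
        for (String filter : filters) {
            if (!filter.isEmpty() && text.contains(filter)) {
                hits.add(filter);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> filters = Arrays.asList("or", "world", "33", "dd", "test");
        // Prints [or, world]
        System.out.println(findHits(filters, "hello world, how are you!"));
    }
}
```

With a handful of words this is fine. With tens of thousands of words and long texts, the repeated scans are what make a single-pass multi-pattern matcher attractive.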
First, let's write a utility class:
import java.util.Iterator;
import java.util.List;

// Assumption: the article does not name the packages. These imports assume
// Apache Commons Lang for StringUtils and the org.arabidopsis.ahocorasick
// library for AhoCorasick and SearchResult; adjust them to your project.
import org.apache.commons.lang.StringUtils;
import org.arabidopsis.ahocorasick.AhoCorasick;
import org.arabidopsis.ahocorasick.SearchResult;

/**
 * Implementation of the AC multi-pattern sensitive-word matching utility.
 *
 * @author shadow
 * @email 124010356@qq.com
 * @create 2012.04.28
 */
public class AcUtilImpl implements AcUtil {

    /** Matches word against a delimited string of filter words. */
    public String contrast(String filters, String word, String regex) {
        if (null == filters || "".equals(filters) || null == word
                || "".equals(word))
            return "";
        AhoCorasick ac = new AhoCorasick();
        // Despite its name, StringUtils.split treats "regex" as a set of
        // separator characters, not as a regular expression.
        String[] strings = StringUtils.split(filters, regex);
        for (String string : strings)
            ac.add(string.getBytes(), string);
        ac.prepare(); // must be called once after all keywords are added
        return matching(ac, word);
    }

    /** Matches word against an array of filter words. */
    public String contrast(String[] filters, String word) {
        if (null == filters || filters.length == 0 || null == word
                || "".equals(word))
            return "";
        AhoCorasick ac = new AhoCorasick();
        for (String filter : filters)
            ac.add(filter.getBytes(), filter);
        ac.prepare();
        return matching(ac, word);
    }

    /** Matches word against a list of filter words. */
    public String contrast(List<String> filters, String word) {
        if (null == filters || filters.isEmpty() || null == word
                || "".equals(word))
            return "";
        AhoCorasick ac = new AhoCorasick();
        for (String filter : filters)
            ac.add(filter.getBytes(), filter);
        ac.prepare();
        return matching(ac, word);
    }

    /** Runs the search and joins the outputs of every hit with commas. */
    private String matching(AhoCorasick ac, String word) {
        StringBuilder buffer = new StringBuilder();
        Iterator<?> iterator = ac.search(word.getBytes());
        while (iterator.hasNext()) {
            SearchResult result = (SearchResult) iterator.next();
            // getOutputs() is a collection, so each hit is rendered like "[or]"
            buffer.append(result.getOutputs()).append(",");
        }
        // Drop the trailing comma, if any
        return buffer.length() > 0 ? buffer.substring(0, buffer.length() - 1)
                : "";
    }

    public static void main(String[] args) {
        String filters = "or,world,33,dd,test";
        String word = "hello world, how are you!";
        String regex = ",";
        String result = new AcUtilImpl().contrast(filters, word, regex);
        System.out.println(result);
    }
}
Then run the main function to test it. The result is:
[or],[world]
The performance and matching accuracy of this library are both quite good. Just download the AhoCorasick class and drop it into your project.
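If you are curious what a class like AhoCorasick does internally, here is a minimal sketch of the algorithm. It is not the library's code: `SimpleAhoCorasick` and its methods are names invented for this illustration, and it works on `char` values rather than bytes. The idea is a trie of keywords plus "failure links" that tell the matcher where to resume after a mismatch, so the text is scanned only once no matter how many keywords there are.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class SimpleAhoCorasick {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> outputs = new ArrayList<>();
        Node fail;
    }

    private final Node root = new Node();

    // Insert one keyword into the trie.
    public void add(String keyword) {
        Node node = root;
        for (char c : keyword.toCharArray()) {
            node = node.next.computeIfAbsent(c, k -> new Node());
        }
        node.outputs.add(keyword);
    }

    // Build failure links breadth-first; call once after all add() calls.
    public void prepare() {
        Queue<Node> queue = new ArrayDeque<>();
        root.fail = root;
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                // Follow the parent's failure chain until some node can take c.
                Node f = node.fail;
                while (f != root && !f.next.containsKey(c)) {
                    f = f.fail;
                }
                Node target = f.next.get(c);
                child.fail = (target != null) ? target : root;
                // Inherit keywords that end at the failure target (suffix matches).
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    // Scan the text once and collect every keyword occurrence in order.
    public List<String> search(String text) {
        List<String> hits = new ArrayList<>();
        Node node = root;
        for (char c : text.toCharArray()) {
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            Node step = node.next.get(c);
            node = (step != null) ? step : root;
            hits.addAll(node.outputs);
        }
        return hits;
    }

    public static void main(String[] args) {
        SimpleAhoCorasick ac = new SimpleAhoCorasick();
        for (String w : "or,world,33,dd,test".split(",")) {
            ac.add(w);
        }
        ac.prepare();
        // Prints [or, world]
        System.out.println(ac.search("hello world, how are you!"));
    }
}
```

On the sample input from the article this finds the same two words, "or" (inside "world") and "world". A production library will typically use a more compact state representation than a `HashMap` per node, which is one reason to prefer the downloaded class over a sketch like this.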