The core of the approach is Java's regex API:
Matcher m = pattern.matcher(str);
str = m.replaceAll("");
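The sensitive-word list file is read in, and every entry in it is matched against the input so that the sensitive words are filtered out.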
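As a minimal sketch of that idea (not the exact code of this post), the keywords can be joined into a single alternation pattern and removed in one pass. The class name `AlternationSketch`, the helpers `buildPattern` and `filter`, and the sample words are hypothetical; `Pattern.quote` is added on the assumption that keywords should match literally rather than as regex syntax.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class AlternationSketch {
    // Hypothetical helper: join the keywords into one "w1|w2|..." pattern.
    // Each keyword is quoted so that characters like '.' match literally.
    static Pattern buildPattern(List<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile(alternation);
    }

    // Remove every occurrence of any keyword from the text.
    static String filter(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        Pattern p = buildPattern(Arrays.asList("foo", "b.r"));
        // "bar" survives because "b.r" is matched literally, not as a regex.
        System.out.println(filter(p, "a foo and a b.r and a bar"));
    }
}
```

Compiling the pattern once and reusing it keeps the per-string cost to a single `replaceAll` call.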
Here is the full code:
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordFilter {
    private static Pattern pattern = null;

    // Build one alternation pattern ("w1|w2|...") from the keys of words.properties.
    public static void initPattern() {
        StringBuilder patternBuf = new StringBuilder();
        try (InputStream in = KeywordFilter.class.getClassLoader().getResourceAsStream("words.properties")) {
            if (in == null) {
                System.err.println("words.properties not found on the classpath");
                return;
            }
            Properties pro = new Properties();
            pro.load(in);
            Enumeration<?> enu = pro.propertyNames();
            while (enu.hasMoreElements()) {
                patternBuf.append((String) enu.nextElement()).append('|');
            }
            if (patternBuf.length() == 0) {
                return; // empty word list: nothing to filter
            }
            patternBuf.deleteCharAt(patternBuf.length() - 1); // drop the trailing '|'
            // Properties.load(InputStream) decodes bytes as ISO-8859-1,
            // so recover the raw bytes and decode them as UTF-8.
            pattern = Pattern.compile(new String(patternBuf.toString().getBytes("ISO-8859-1"), "UTF-8"));
            //System.out.println(new String(patternBuf.toString().getBytes("ISO-8859-1"), "gb2312"));
            //pattern = Pattern.compile(new String(patternBuf.toString().getBytes("ISO-8859-1"), "gb2312"));
        } catch (IOException ioEx) {
            ioEx.printStackTrace();
        }
    }

    // Remove every sensitive word from str; return str unchanged if no pattern is loaded.
    public static String doFilter(String str) {
        System.out.println("str:" + str);
        if (pattern == null) {
            return str;
        }
        Matcher m = pattern.matcher(str);
        return m.replaceAll("");
    }

    public static void main(String[] args) {
        String str = "心在跳情在烧共产党";
        System.out.println("str:" + str);
        initPattern();
        //Date d1 = new Date();
        //SimpleDateFormat formatter = new SimpleDateFormat("EEE, d MMM yyyy HH:mm:ss:SSS Z");
        //System.out.println("start:" + formatter.format(d1));
        System.out.println(KeywordFilter.doFilter(str));
        //Date d2 = new Date();
        //System.out.println("end:" + formatter.format(d2));
    }
}
words.properties is the sensitive-word file.
I tested it with both Chinese text and UTF-8-encoded Unicode text, and the filtering worked in both cases.
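On Java 6 and later, `Properties.load(Reader)` accepts a character stream, so a UTF-8 word file can probably be read directly instead of re-decoding ISO-8859-1 bytes. The following is only a sketch under that assumption: the class name `Utf8PropertiesSketch` and the helper `loadKeywords` are hypothetical, and the word file is simulated with in-memory bytes so the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import java.util.Set;

class Utf8PropertiesSketch {
    // Hypothetical helper: read the property keys from a UTF-8 encoded stream.
    static Set<String> loadKeywords(InputStream in) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            props.load(reader); // decodes characters as UTF-8, no byte round trip
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.stringPropertyNames();
    }

    public static void main(String[] args) {
        // Simulated file content: one non-ASCII key (written with Unicode
        // escapes) and one ASCII key, each with an empty value.
        byte[] file = "\u4e2d\u6587=\nfoo=\n".getBytes(StandardCharsets.UTF_8);
        Set<String> words = loadKeywords(new ByteArrayInputStream(file));
        System.out.println(words.contains("\u4e2d\u6587") && words.contains("foo"));
    }
}
```

With a real file, the `InputStream` would come from `getResourceAsStream("words.properties")` as in the code above.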