Counting High-Frequency Chinese Words with the IKAnalyzer Tokenizer
I. Add the dependency:
```xml
<dependency>
    <groupId>com.janeluo</groupId>
    <artifactId>ikanalyzer</artifactId>
    <version>2012_u6</version>
</dependency>
```
II. Place IKAnalyzer.cfg.xml under the src directory (here you can also register your own extension dictionary and stop-word dictionary). Its contents:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- user-defined extension dictionary -->
    <entry key="ext_dict">ext.dic;</entry>
    <!-- user-defined stop-word dictionary -->
    <entry key="ext_stopwords">stopword.dic;</entry>
</properties>
```
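The dictionary files referenced above (ext.dic and stopword.dic) are plain UTF-8 text files placed on the classpath next to IKAnalyzer.cfg.xml, one entry per line. The entries below are illustrative examples, not required contents. For instance, ext.dic might contain custom terms IK should keep whole:

```
高频词汇
分词器
```

and stopword.dic might list function words to drop from the counts:

```
的
了
是
```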
III. Tokenize and count word frequencies:
1. Tokenize a string and count each word's frequency. The code:
```java
/**
 * @Description get the word-frequency map for a piece of text
 * @Date 2018/3/9 11:28
 * @Param [text the text to tokenize]
 * @Return java.util.Map<java.lang.String, java.lang.Integer>
 * @Throws IOException
 */
public static Map<String, Integer> getHighFrequencyVocabularyMap(String text) throws IOException {
    Analyzer anal = new IKAnalyzer(true);
    StringReader reader = new StringReader(text);
    TokenStream ts = anal.tokenStream("", reader);
    ts.reset();
    CharTermAttribute term = ts.getAttribute(CharTermAttribute.class);
    Map<String, Integer> map = new HashMap<>();
    while (ts.incrementToken()) {
        // increment the count for this token (starting at 1 if unseen)
        map.merge(term.toString(), 1, Integer::sum);
    }
    // release the TokenStream before returning
    ts.end();
    ts.close();
    reader.close();
    return map;
}
```
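The counting pattern inside that loop can be verified independently of IKAnalyzer. Below is a minimal sketch using only the JDK, with a hand-tokenized list standing in for the TokenStream (the class name `TokenCounter` and the sample tokens are illustrative, not part of the original code):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TokenCounter {
    // Count occurrences of each token, mirroring the loop body above.
    public static Map<String, Integer> countTokens(List<String> tokens) {
        Map<String, Integer> map = new HashMap<>();
        for (String token : tokens) {
            // merge(key, 1, Integer::sum) does in one call what the
            // containsKey/get/put sequence does by hand
            map.merge(token, 1, Integer::sum);
        }
        return map;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("中文", "分词", "中文", "统计");
        Map<String, Integer> counts = countTokens(tokens);
        System.out.println(counts.get("中文")); // prints 2
    }
}
```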
2. Tokenize a file and count each word's frequency. The code:
```java
/**
 * @Description get the word-frequency map for a file
 * @Date 2018/3/9 11:30
 * @Param [fileName the file to tokenize]
 * @Return java.util.Map<java.lang.String, java.lang.Integer>
 * @Throws Exception
 */
public static Map<String, Integer> getHighFrequencyVocabulary(String fileName) throws Exception {
    URL url = Objects.requireNonNull(IkUtils.class.getClassLoader().getResource(fileName));
    StringBuilder sb = new StringBuilder();
    BufferedReader in = new BufferedReader(
            new InputStreamReader(new FileInputStream(new File(url.getPath())), "UTF-8"));
    String str;
    while ((str = in.readLine()) != null) {
        sb.append(str);
    }
    in.close();
    Map<String, Integer> map = new HashMap<>();
    StringReader re = new StringReader(sb.toString());
    // IKSegmenter is the core tokenizer class; the second constructor argument
    // true enables smart segmentation, false enables finest-grained segmentation
    IKSegmenter ik = new IKSegmenter(re, true);
    Lexeme lex;
    while ((lex = ik.next()) != null) {
        // increment the count for this lexeme (starting at 1 if unseen)
        map.merge(lex.getLexemeText(), 1, Integer::sum);
    }
    re.close();
    return map;
}
```
3. Sort the entries by frequency. The code:
```java
public static List<Map.Entry<String, Integer>> mapSort(Map<String, Integer> map) {
    List<Map.Entry<String, Integer>> list = new ArrayList<>(map.entrySet());
    // sort with a comparator
    // ascending order:
    // list.sort(Map.Entry.comparingByValue());
    // descending order (prefer a comparator over e2.getValue() - e1.getValue(),
    // which can overflow for large counts):
    list.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()));
    return list;
}
```
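Once the entries are sorted, taking the top N words is just a sublist of the result. Here is a self-contained sketch of sorting and trimming a frequency map using only the JDK (the class name `TopWords` and the sample words and counts are made up for illustration):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopWords {
    // Sort a frequency map in descending order and keep the top n entries.
    public static List<Map.Entry<String, Integer>> topN(Map<String, Integer> map, int n) {
        List<Map.Entry<String, Integer>> list = new ArrayList<>(map.entrySet());
        list.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()));
        // guard against n exceeding the number of distinct words
        return list.subList(0, Math.min(n, list.size()));
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = new HashMap<>();
        freq.put("中文", 5);
        freq.put("分词", 3);
        freq.put("统计", 8);
        for (Map.Entry<String, Integer> e : topN(freq, 2)) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```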
Finally, print the sorted list to see the word-frequency results.