File f=new File(path);
Mapmap=new HashMap<>();
Version matchVersion = Version.LUCENE_31;
Analyzer analyzer = new StopAnalyzer(matchVersion);
BufferedReader br = new BufferedReader(new FileReader(f));//读取文件
TokenStream ts = analyzer.tokenStream(null, br);//用analyzer分词,得到token流
ts = new PorterStemFilter(ts);//过滤器提取词干
CharTermAttribute ca = ts.addAttribute(CharTermAttribute.class);//ca存储了ts的文本信息
ts.reset();//必须的
while(ts.incrementToken()){
String term = ca.toString();
if(!map.keySet().contains(term)){
map.put(term, 1);
}else
{
map.put(term, map.get(term)+1);
}
}
ts.end();
ts.close();
analyzer.close();
br.close();
StringBuilder sb=new StringBuilder();
File gh=new File(path);
for(String key:map.keySet()){
sb.append(key+" "+map.get(key)+"\r\n");
}
BufferedWriter bw=new BufferedWriter(new FileWriter(gh));
bw.write(sb.toString());
bw.flush();
bw.close();
该博客展示了使用Java进行文本词频统计的代码。通过读取文件,利用分词器分词、过滤器提取词干,将每个词及其出现次数存储在Map中,最后将结果写入文件。
4963

被折叠的 条评论
为什么被折叠?



