Using Elasticsearch's analysis and aggregation features to compute word-cloud keyword statistics
This article segments news text collected from Weibo, counts the word frequencies, and finally generates a word cloud. The code is as follows:
public List<String> wordCloudCount(Class<?> clazz, String keywords) {
    // Full-text query on the given keywords
    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
    boolQuery.must(QueryBuilders.queryStringQuery(keywords));
    // Terms aggregation over the analyzed "content" field, keeping the top 30 terms
    TermsAggregationBuilder builder = AggregationBuilders.terms("word_count").field("content").size(30);
    // Read the index name and type from the entity's @Document annotation
    Document document = (Document) clazz.getAnnotation(Document.class);
    SearchQuery searchQuery = new NativeSearchQueryBuilder()
            .withIndices(document.indexName())
            .withTypes(document.type())
            .withQuery(boolQuery)
            .addAggregation(builder)
            .build();
    // Extract only the aggregations from the search response
    Aggregations aggregations = elasticsearchTemplate.query(searchQuery, new ResultsExtractor<Aggregations>() {
        @Override
        public Aggregations extract(SearchResponse searchResponse) {
            return searchResponse.getAggregations();
        }
    });
    StringTerms typeTerm = (StringTerms) aggregations.asMap().get("word_count");
    List<StringTerms.Bucket> bucketList = typeTerm.getBuckets();
    LinkedList<String> wordList = new LinkedList<>();
    for (StringTerms.Bucket bucket : bucketList) {
        wordList.add(bucket.getKeyAsString());
    }
    // Filter out the stop words listed one per line in stopwords.txt;
    // try-with-resources closes the reader automatically
    try (BufferedReader bufferedReader = new BufferedReader(new FileReader("stopwords.txt"))) {
        List<String> stopWords = new ArrayList<>();
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            stopWords.add(line);
        }
        wordList.removeIf(stopWords::contains);
    } catch (IOException e) {
        log.info("Failed to read stop words");
    }
    return wordList;
}
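The stop-word loading inside the try block can equivalently be written with java.nio, which reads the whole file in one call. This is a minimal standalone sketch (the class name StopWords is just for illustration); like the method above, it degrades gracefully when the file cannot be read:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class StopWords {
    // Load one stop word per line; return an empty set if the file is
    // missing or unreadable, mirroring the log-and-continue behavior above.
    static Set<String> load(Path file) {
        try {
            return new HashSet<>(Files.readAllLines(file));
        } catch (IOException e) {
            return Collections.emptySet();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("stopwords", ".txt");
        Files.write(tmp, Arrays.asList("的", "了", "the"));
        Set<String> stop = load(tmp);
        System.out.println(stop.contains("the")); // prints true
        Files.deleteIfExists(tmp);
    }
}
```

A Set gives O(1) membership tests, so `wordList.removeIf(stop::contains)` stays linear in the number of words.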
The segmentation and aggregation are performed through Spring Data's ElasticsearchTemplate. Two points are worth emphasizing: a terms aggregation over an analyzed text field such as content requires fielddata to be enabled on that field's mapping, and after segmentation the stop words in stopwords.txt must be filtered out. The method finally returns the list of keywords; each aggregation bucket also carries a document count, which the word cloud can use as the term's frequency.
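The post-processing logic (dropping stop words and ranking terms by document count) does not depend on Elasticsearch at all and can be sketched with plain collections. The term counts below are made-up placeholders standing in for the aggregation buckets returned by the query:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class WordCloudFilter {
    // Given raw term counts and a stop-word set, return the surviving
    // terms ordered by descending frequency (the order a word cloud renders).
    static List<String> topTerms(Map<String, Long> termCounts, Set<String> stopWords) {
        return termCounts.entrySet().stream()
                .filter(e -> !stopWords.contains(e.getKey()))
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        counts.put("news", 12L);   // placeholder bucket counts
        counts.put("the", 40L);
        counts.put("weibo", 7L);
        Set<String> stop = new HashSet<>(Collections.singleton("the"));
        System.out.println(topTerms(counts, stop)); // prints [news, weibo]
    }
}
```

Keeping the counts alongside the terms (rather than discarding them as the method above does) is what lets the front end size each word proportionally to its frequency.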