Just as every programming-language tutorial opens with a little program that prints "Hello, World!", almost every stream-processing framework has a WordCount starter example. MapReduce is the classic model here; both Spark and Flink provide map and reduce operators that make word counting trivial. Storm does not seem to offer such operators out of the box, but it is still easy to implement.
Building on the program from the previous post (https://blog.csdn.net/xxkalychen/article/details/117058030?spm=1001.2014.3001.5501), in which the data obtained from Kafka is cleaned and written into ElasticSearch, we now make a few changes on top of it to implement WordCount.
1. Create a WordCount wrapper class. It carries a word and its running total into the ElasticSearch database.
package com.chris.storm.model;
import java.io.Serializable;
/**
* @author Chris Chan
* Create on 2021/5/21 12:56
* Use for:
* Explain:
*/
public class WordCount implements Serializable {
private String word;
private long count;
public WordCount() {
}
public WordCount(String word, long count) {
this.word = word;
this.count = count;
}
public String getWord() {
return word;
}
public void setWord(String word) {
this.word = word;
}
public long getCount() {
return count;
}
public void setCount(long count) {
this.count = count;
}
}
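Serialized with Gson, each WordCount written to ES becomes a small JSON document such as {"word": "hello", "count": 3} (an illustrative sample, not output from the original run).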
2. Modify the ElasticSearchUtil utility class. We need to add a method that writes WordCount documents and a method that queries the existing data. Because we are counting word frequencies, each word must stay unique in the database, so it needs special handling: every document gets a recognizable unique ID. For convenience we simply use the word itself as the ID, though a Base64 encoding or some other hash of the word would work just as well (a small sketch of that variant follows the class below).
package com.chris.storm.utils;
import com.chris.storm.model.WordCount;
import com.google.gson.Gson;
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.client.indices.GetIndexRequest;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import java.io.IOException;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
/**
* @author Chris Chan
* Create on 2021/5/19 7:37
* Use for:
* Explain:
*/
public class ElasticSearchUtil {
private static RestHighLevelClient client = null;
private static Gson gson = new Gson().newBuilder().create();
static {
ElasticSearchUtil.client = new RestHighLevelClient(RestClient.builder(new HttpHost("192.168.0.52", 9200, "http")));
}
public static RestHighLevelClient getClient() {
return client;
}
public static void close() {
try {
ElasticSearchUtil.client.close();
} catch (IOException e) {
e.printStackTrace();
}
}
public static boolean isIndexExists(String indexName) {
try {
return ElasticSearchUtil.client.indices().exists(new GetIndexRequest(indexName), RequestOptions.DEFAULT);
} catch (IOException e) {
e.printStackTrace();
}
return false;
}
public static void createIndex(String indexName) {
if (isIndexExists(indexName)) {
return;
}
try {
ElasticSearchUtil.client.indices().create(new CreateIndexRequest(indexName), RequestOptions.DEFAULT);
} catch (IOException e) {
e.printStackTrace();
}
}
public static <T> IndexResponse add(T obj, String indexName) {
IndexRequest indexRequest = new IndexRequest(indexName).id(UUID.randomUUID().toString());
indexRequest.source(new Gson().toJson(obj), XContentType.JSON);
try {
return ElasticSearchUtil.client.index(indexRequest, RequestOptions.DEFAULT);
} catch (IOException e) {
e.printStackTrace();
}
return null;
}
/**
* Initialize the local word-count collector.
* When the program starts, the existing data is read from ES so that word totals keep accumulating.
*
* @param indexName
* @param wordCountMap
*/
public static void initWordCountMap(String indexName, ConcurrentHashMap<String, Long> wordCountMap) {
SearchRequest request = new SearchRequest(indexName);
SearchSourceBuilder builder = new SearchSourceBuilder();
builder.query(QueryBuilders.matchAllQuery());
// The default search only returns 10 hits; raise the size so that up to 10000 existing word documents are loaded
builder.size(10000);
request.source(builder);
try {
SearchResponse response = ElasticSearchUtil.client.search(request, RequestOptions.DEFAULT);
SearchHit[] hits = response.getHits().getHits();
for (SearchHit hit : hits) {
WordCount wordCount = gson.fromJson(hit.getSourceAsString(), WordCount.class);
wordCountMap.put(wordCount.getWord(), wordCount.getCount());
}
} catch (IOException e) {
e.printStackTrace();
}
}
/**
* Write a word-count result to ES, using the word itself as the document ID.
*
* @param wordCount
* @param indexName
*/
public static IndexResponse addWordCount(WordCount wordCount, String indexName) {
IndexRequest indexRequest = new IndexRequest(indexName).id(wordCount.getWord());
indexRequest.source(new Gson().toJson(wordCount), XContentType.JSON);
try {
return ElasticSearchUtil.client.index(indexRequest, RequestOptions.DEFAULT);
} catch (IOException e) {
e.printStackTrace();
}
return null;
}
}
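As noted above, addWordCount uses the word itself as the document ID. If you prefer an encoded ID instead, a minimal variation of the ID line (using java.util.Base64 and java.nio.charset.StandardCharsets, which are my additions and not part of the original code) could look like this:
// Hypothetical variant: derive the document ID from a URL-safe Base64 encoding of the word
// (requires importing java.util.Base64 and java.nio.charset.StandardCharsets).
String docId = Base64.getUrlEncoder().withoutPadding()
        .encodeToString(wordCount.getWord().getBytes(StandardCharsets.UTF_8));
IndexRequest indexRequest = new IndexRequest(indexName).id(docId);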
3. Modify CountBolt. I named this class with WordCount in mind from the very beginning.
package com.chris.storm.bolt;
import com.chris.storm.model.WordCount;
import com.chris.storm.utils.ElasticSearchUtil;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
/**
* @author Chris Chan
* Create on 2021/5/19 9:44
* Use for:
* Explain:
*/
public class CountBolt extends BaseRichBolt {
//name of the index created in ElasticSearch
public static final String INDEX_NAME = "storm_word_count";
//local word counter
private static ConcurrentHashMap<String, Long> wordCountMap;
static {
ElasticSearchUtil.createIndex(INDEX_NAME);
}
@Override
public void prepare(Map<String, Object> map, TopologyContext topologyContext, OutputCollector outputCollector) {
wordCountMap = new ConcurrentHashMap<>(16);
ElasticSearchUtil.createIndex(INDEX_NAME);
ElasticSearchUtil.initWordCountMap(INDEX_NAME, wordCountMap);
}
@Override
public void execute(Tuple tuple) {
String word = tuple.getStringByField("word");
Long count = wordCountMap.get(word);
if (null == count) {
count = 1L;
} else {
count++;
}
wordCountMap.put(word, count);
System.out.printf("%s: %d\n", word, count);
//write the result to ElasticSearch
ElasticSearchUtil.addWordCount(new WordCount(word, count), INDEX_NAME);
}
/**
* This is the end of the pipeline; nothing is emitted downstream, so there is no need to declare output tuple fields.
*
* @param outputFieldsDeclarer
*/
@Override
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
}
}
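One detail worth pointing out (not covered in the original code): if CountBolt runs with a parallelism greater than 1, the upstream bolt should be connected with fieldsGrouping on the "word" field so that the same word always reaches the same CountBolt task; with shuffleGrouping, different tasks would keep separate counts and overwrite each other's documents in ES. A rough sketch of such wiring, with hypothetical component names standing in for those of the previous post:
TopologyBuilder builder = new TopologyBuilder();
// "kafka_spout" and "split_bolt" are hypothetical names for the spout and splitting bolt from the previous post
builder.setSpout("kafka_spout", kafkaSpout);
builder.setBolt("split_bolt", new SplitBolt(), 2).shuffleGrouping("kafka_spout");
// fieldsGrouping on "word" keeps every occurrence of the same word on the same CountBolt task
builder.setBolt("count_bolt", new CountBolt(), 2).fieldsGrouping("split_bolt", new Fields("word"));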
This example uses a local counter: at initialization it reads the old data from ES (a full read, note) and then accumulates counts as the stream flows in, which guarantees that the statistics are not overwritten after the Storm job dies and restarts. The drawback is that the memory footprint can become large. Without the local counter, however, every incoming word would need a round trip to the ES database, which also hurts performance. We could of course load the ES data into Redis as a cache instead; I/O against Redis is always faster than I/O against ES.
There is another issue: for every incoming word we not only update the local cache but also write to ES, which is arguably far too frequent. Alternatively, we could skip reading ES at initialization, accumulate the new counts directly in Redis, and then asynchronously flush the Redis data into ES, adding it onto the existing totals and re-initializing Redis. As long as the timing is handled carefully and the Redis cluster is highly available, this works fine.
I won't implement these details here; the point is only to outline the idea (a minimal sketch follows below).
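A minimal sketch of the Redis-side counter, assuming the Jedis client and a Redis instance at 192.168.0.52:6379 (both are my assumptions; neither appears in the original code). The bolt would call increment() for each word, and a separate job would periodically merge the hash into ES and reset it, as described above.
import redis.clients.jedis.Jedis;

public class RedisWordCounter {
    private final Jedis jedis = new Jedis("192.168.0.52", 6379);

    /**
     * Atomically increment the counter for a word in the Redis hash and return the new total.
     */
    public long increment(String word) {
        return jedis.hincrBy("storm_word_count", word, 1L);
    }
}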
Nothing else needs to change; we can test directly. Remember to start all of the servers, and have the Kafka console producer waiting for input.
Type a few sentences into Kafka.
Check the data in ElasticSearch.
Two hello entries showed up? Ah, the case was inconsistent. Ha. In any case, this is the effect we were after.
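If "hello" and "Hello" should be counted as one word, a simple option (not in the original code) is to normalize the case in CountBolt.execute() before counting:
// normalize case so that "Hello" and "hello" are counted as the same word
String word = tuple.getStringByField("word").toLowerCase();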