Just as every programming-language tutorial opens with a little program that prints "Hello, World!", almost every stream-processing framework has a WordCount starter example. MapReduce is the classic model here; both Spark and Flink provide map and reduce operators that make word counting trivial. Storm does not seem to offer such operators out of the box, but it is still easy to implement.
Building on the program from the previous post (https://blog.csdn.net/xxkalychen/article/details/117058030?spm=1001.2014.3001.5501), in which the data obtained from Kafka is cleaned and written into ElasticSearch, we now make a few changes on top of it to implement WordCount.
1. Create a WordCount wrapper class. It carries a word and its running total into the ElasticSearch database.
package com.chris.storm.model;
import java.io.Serializable;
/**
* @author Chris Chan
* Create on 2021/5/21 12:56
* Use for:
* Explain:
*/
public class WordCount implements Serializable {
private String word;
private long count;
public WordCount() {
}
public WordCount(String word, long count) {
this.word = word;
this.count = count;
}
public String getWord() {
return word;
}
public void setWord(String word) {
this.word = word;
}
public long getCount() {
return count;
}
public void setCount(long count) {
this.count = count;
}
}
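Serialized with Gson, each WordCount written to ES becomes a small JSON document such as {"word": "hello", "count": 3} (an illustrative sample, not output from the original run).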
2. Modify the ElasticSearchUtil utility class. We need to add a method that writes WordCount documents and a method that queries the existing data. Because we are counting word frequencies, each word must stay unique in the database, so it needs special handling: every document gets a recognizable unique ID. For convenience we simply use the word itself as the ID, though a Base64 encoding or some other hash of the word would work just as well (a small sketch of that variant follows the class below).
package com.chris.storm.utils;
import com.chris.storm.model.WordCount;
import com.google.gson.Gson;
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.client.indices.GetIndexRequest;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import java.io.IOException;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
/**
* @author Chris Chan
* Create on 2021/5/19 7:37
* Use for:
* Explain:
*/
public class ElasticSearchUtil {
private static RestHighLevelClient client = null;
private static Gson gson = new Gson().newBuilder().create();
static {
ElasticSearchUtil.client = new RestHighLevelClient(RestClient.builder(new HttpHost("192.168.0.52", 9200, "http")));
}
public static RestHighLevelClient getClient() {
return client;
}
public static void close() {
try {
ElasticSearchUtil.client.close();
} catch (IOException e) {
e.printStackTrace();
}
}
public static boolean isIndexExists(String indexName) {
try {
return ElasticSearchUtil.client.indices().exists(new GetIndexRequest(indexName), RequestOptions.DEFAULT);
} catch (IOException e) {
e.printStackTrace();
}
return false;
}
public static void createIndex(String indexName) {
if (isIndexExists(indexName)) {
return;
}
try {
ElasticSearchUtil.client.indices().create(new CreateIndexRequest(indexName), RequestOptions.DEFAULT);
} catch (IOException e) {
e.printStackTrace();
}
}
public static <T> IndexResponse add(T obj, String indexName) {
IndexRequest indexRequest = new IndexRequest(indexName).id(UUID.randomUUID().toString());
indexRequest.source(new Gson().toJson(obj), XContentType.JSON);
try {
return ElasticSearchUtil.client.index(indexRequest, RequestOptions.DEFAULT);
} catch (IOException e) {
e.printStackTrace();
}
return null;
}
/**
* Initialize the local word-count collector.
* When the program starts, the existing data is read from ES so that word totals keep accumulating.
*
* @param indexName
* @param wordCountMap
*/
public static void initWordCountMap(String indexName, ConcurrentHashMap<String, Long> wordCountMap) {
SearchRequest request = new SearchRequest(indexName);
SearchSourceBuilder builder = new SearchSourceBuilder();
builder.query(QueryBuilders.matchAllQuery());
// The default search only returns 10 hits; raise the size so that up to 10000 existing word documents are loaded
builder.size(10000);
request.source(builder);
try {
SearchResponse response = ElasticSearchUtil.client.search(request, RequestOptions.DEFAULT);
SearchHit[] hits = response.getHits().getHits();
for (SearchHit hit : hits) {
WordCount wordCount = gson.fromJson(hit.getSourceAsString(), WordCount.class);
wordCountMap.put(wordCount.getWord(), wordCount.getCount());
}
} catch (IOException e) {
e.printStackTrace();
}
}
/**
* Write a word-count result to ES, using the word itself as the document ID.
*
* @param wordCount
* @param indexName
*/
public static IndexResponse addWordCount(WordCount wordCount, String indexName) {
IndexRequest indexRequest = new IndexRequest(indexName).id(wordCount.getWord());
indexRequest.source(new Gson().toJson(wordCount), XContentType.JSON);
try {
return ElasticSearchUtil.client.index(indexRequest, RequestOptions.DEFAULT);
} catch (IOException e) {
e.printStackTrace();
}
return null;
}
}
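As noted above, addWordCount uses the word itself as the document ID. If you prefer an encoded ID instead, a minimal variation of the ID line (using java.util.Base64 and java.nio.charset.StandardCharsets, which are my additions and not part of the original code) could look like this:
// Hypothetical variant: derive the document ID from a URL-safe Base64 encoding of the word
// (requires importing java.util.Base64 and java.nio.charset.StandardCharsets).
String docId = Base64.getUrlEncoder().withoutPadding()
        .encodeToString(wordCount.getWord().getBytes(StandardCharsets.UTF_8));
IndexRequest indexRequest = new IndexRequest(indexName).id(docId);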
3. Modify CountBolt. I named this class with WordCount in mind from the very beginning.
package com.chris.storm.bolt;
import com.chris.storm.model.WordCount;
import com.chris.storm.utils.ElasticSearchUtil;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
/**
* @author Chris Chan
* Create on 2021/5/19 9:44
* Use for:
* Explain:
*/
public class CountBolt extends BaseRichBolt {
//name of the index created in ElasticSearch
public static final String INDEX_NAME = "storm_word_count";
//local word counter
private static ConcurrentHashMap<String, Long> wordCountMap;
static {
ElasticSearchUtil.createIndex(INDEX_NAME);
}
@Override
public void prepare(Map<String, Object> map, TopologyContext topologyContext, OutputCollector outputCollector) {
wordCountMap = new ConcurrentHashMap<>(16);
ElasticSearchUtil.createIndex(INDEX_NAME);
ElasticSearchUtil.initWordCountMap(INDEX_NAME, wordCountMap);
}
@Override
public void execute(Tuple tuple) {
String word = tuple.getStringByField("word");
Long count = wordCountMap.get(word);
if (null == count) {
count = 1L;
} else {
count++;
}
wordCountMap.put(word, count);
System.out.printf("%s: %d\n", word, count);
//write the result to ElasticSearch
ElasticSearchUtil.addWordCount(new WordCount(word, count), INDEX_NAME);
}
/**
* This is the end of the pipeline; nothing is emitted downstream, so there is no need to declare output tuple fields.
*
* @param outputFieldsDeclarer
*/
@Override
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
}
}
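One detail worth pointing out (not covered in the original code): if CountBolt runs with a parallelism greater than 1, the upstream bolt should be connected with fieldsGrouping on the "word" field so that the same word always reaches the same CountBolt task; with shuffleGrouping, different tasks would keep separate counts and overwrite each other's documents in ES. A rough sketch of such wiring, with hypothetical component names standing in for those of the previous post:
TopologyBuilder builder = new TopologyBuilder();
// "kafka_spout" and "split_bolt" are hypothetical names for the spout and splitting bolt from the previous post
builder.setSpout("kafka_spout", kafkaSpout);
builder.setBolt("split_bolt", new SplitBolt(), 2).shuffleGrouping("kafka_spout");
// fieldsGrouping on "word" keeps every occurrence of the same word on the same CountBolt task
builder.setBolt("count_bolt", new CountBolt(), 2).fieldsGrouping("split_bolt", new Fields("word"));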
This example uses a local counter: at initialization it reads the old data from ES (a full read, note) and then accumulates counts as the stream flows in, which guarantees that the statistics are not overwritten after the Storm job dies and restarts. The drawback is that the memory footprint can become large. Without the local counter, however, every incoming word would need a round trip to the ES database, which also hurts performance. We could of course load the ES data into Redis as a cache instead; I/O against Redis is always faster than I/O against ES.
There is another issue: for every incoming word we not only update the local cache but also write to ES, which is arguably far too frequent. Alternatively, we could skip reading ES at initialization, accumulate the new counts directly in Redis, and then asynchronously flush the Redis data into ES, adding it onto the existing totals and re-initializing Redis. As long as the timing is handled carefully and the Redis cluster is highly available, this works fine.
I won't implement these details here; the point is only to outline the idea (a minimal sketch follows below).
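A minimal sketch of the Redis-side counter, assuming the Jedis client and a Redis instance at 192.168.0.52:6379 (both are my assumptions; neither appears in the original code). The bolt would call increment() for each word, and a separate job would periodically merge the hash into ES and reset it, as described above.
import redis.clients.jedis.Jedis;

public class RedisWordCounter {
    private final Jedis jedis = new Jedis("192.168.0.52", 6379);

    /**
     * Atomically increment the counter for a word in the Redis hash and return the new total.
     */
    public long increment(String word) {
        return jedis.hincrBy("storm_word_count", word, 1L);
    }
}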
Nothing else needs to change; we can test directly. Remember to start all of the servers, and have the Kafka console producer waiting for input.
Type a few sentences into Kafka.
Check the data in ElasticSearch.
Two hello entries showed up? Ah, the case was inconsistent. Ha. In any case, this is the effect we were after.
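If "hello" and "Hello" should be counted as one word, a simple option (not in the original code) is to normalize the case in CountBolt.execute() before counting:
// normalize case so that "Hello" and "hello" are counted as the same word
String word = tuple.getStringByField("word").toLowerCase();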