storm-[2]-Storm basic module programming

hjw199089

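Study notes based on the book 《Storm分布式实时计算模式》 (Storm Blueprints: Patterns for Distributed Real-time Computation).

For the basic concepts of Storm, see the earlier post: Storm basic concepts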

I. Basic framework

Word count

SentenceSpout: simulates a message producer

SplitSentenceBolt: splits sentences into words

WordCountBolt: counts the words

ReportBolt: prints the word counts
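
Before looking at the Storm classes, the data flow through the four components can be tried in plain Java. The following is only a sketch: the class WordCountPipelineSketch and its methods are hypothetical names, no Storm API is involved, and everything runs in a single thread instead of as a distributed stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the four components; no Storm classes are used.
class WordCountPipelineSketch {

    // The same sample sentences that SentenceSpout emits.
    static final String[] SENTENCES = {
        "Work all done, care laid by",
        "Never fear no more",
        "Shadows gone, break of day",
        "Real life just begun"
    };

    // SplitSentenceBolt stage: one sentence in, one entry per word out.
    static List<String> split(String sentence) {
        List<String> words = new ArrayList<>();
        for (String word : sentence.split(" ")) {
            words.add(word);
        }
        return words;
    }

    // WordCountBolt and ReportBolt stages: keep a running count per word.
    // "passes" is how many times the spout cycles through its sentences.
    static Map<String, Long> count(int passes) {
        Map<String, Long> counts = new TreeMap<>();
        for (int pass = 0; pass < passes; pass++) {
            for (String sentence : SENTENCES) {
                for (String word : split(sentence)) {
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        count(3).forEach((word, n) -> System.out.println(word + " : " + n));
    }
}
```

Because the split is on single spaces only, punctuation stays attached to the word ("done," and "gone," are counted with their commas), which is also how the bolts in this post behave.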

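SentenceSpout: simulates a message producer

BaseRichSpout is a simple implementation of ISpout and IComponent.
declareOutputFields: defined in IComponent; every spout and bolt must implement it. It declares the field names (keys) of the tuples in the output stream.
open(): defined in ISpout and called when the spout is initialized. The Map holds the Storm configuration, TopologyContext carries information about the topology, and SpoutOutputCollector provides the emit method.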

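import java.util.Map;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

// Package names are for Storm 1.0+; older releases use backtype.storm.* instead.
public class SentenceSpout extends BaseRichSpout {

    private SpoutOutputCollector collector;
    private String[] sentences = {
        "Work all done, care laid by",
        "Never fear no more",
        "Shadows gone, break of day",
        "Real life just begun"
    };
    private int index = 0;

    // Every emitted tuple has a single field named "sentence".
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }

    // Called once when the spout is initialized; keep the collector for nextTuple().
    public void open(Map config, TopologyContext context,
            SpoutOutputCollector collector) {
        this.collector = collector;
    }

    // Emit the next sentence, cycling through the array forever.
    public void nextTuple() {
        this.collector.emit(new Values(sentences[index]));
        index++;
        if (index >= sentences.length) {
            index = 0;
        }
        Utils.sleep(1); // short pause; stands in for the book's Utils.waitForMillis helper
    }
}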

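SplitSentenceBolt: splits sentences into words

BaseRichBolt is an implementation of IComponent and IBolt.
prepare: the initialization hook. Objects that cannot be serialized are usually created here (primitive fields and serializable objects can be assigned and instantiated in the constructor).
execute: called each time a tuple arrives from a subscribed stream.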

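/*
 * BaseRichBolt is an implementation of IComponent and IBolt.
 * prepare: initialization hook; create non-serializable objects here
 *          (primitives and serializable objects can be set in the constructor).
 * execute: called each time a tuple arrives from a subscribed stream.
 */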
public class SplitSentenceBolt extends BaseRichBolt{
    private OutputCollector collector;
 
    public void prepare(Map config, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }
 
    public void execute(Tuple tuple) {
        String sentence = tuple.getStringByField("sentence");
        String[] words = sentence.split(" ");
        for(String word : words){
            this.collector.emit(new Values(word));
        }
    }
 
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
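
The reason prepare exists separately from the constructor is that a component instance is serialized when the topology is submitted and deserialized on the worker. The plain-Java sketch below imitates that round trip with standard Java serialization. SplitterLike, roundTrip and the other names are hypothetical and are not Storm types; the transient Object field merely stands in for a non-serializable resource such as a connection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class PrepareSketch {

    // Hypothetical bolt-like class for illustration; it is not a Storm type.
    static class SplitterLike implements Serializable {
        private static final long serialVersionUID = 1L;

        // Serializable state: safe to assign in the constructor.
        private final String delimiter;

        // Stand-in for a non-serializable resource (for example a connection).
        // It is transient, so serialization skips it.
        private transient Object resource;

        SplitterLike(String delimiter) {
            this.delimiter = delimiter;
        }

        // Plays the role of prepare(): runs after deserialization on the worker.
        void prepare() {
            this.resource = new Object();
        }

        boolean isPrepared() {
            return resource != null;
        }

        String delimiter() {
            return delimiter;
        }
    }

    // Serialize and deserialize the object, roughly what happens when a
    // component instance is shipped from the submitting JVM to a worker.
    static SplitterLike roundTrip(SplitterLike original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SplitterLike) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        SplitterLike copy = roundTrip(new SplitterLike(" "));
        System.out.println("prepared after deserialization: " + copy.isPrepared());
        copy.prepare();
        System.out.println("prepared after prepare(): " + copy.isPrepared());
    }
}
```

The constructor-set delimiter survives the round trip, while the transient resource only exists after prepare() has run, which mirrors why the collector and the HashMap in the bolts of this post are assigned in prepare.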

WordCountBolt: counts the words

public class WordCountBolt extends BaseRichBolt{
    private OutputCollector collector;
    private HashMap<String, Long> counts = null;
 
    public void prepare(Map config, TopologyContext context,
            OutputCollector collector) {
        this.collector = collector;
        this.counts = new HashMap<String, Long>();
    }
 
    public void execute(Tuple tuple) {
        String word = tuple.getStringByField("word");
        Long count = this.counts.get(word);
        if(count == null){
            count = 0L;
        }
        count++;
        this.counts.put(word, count);
        this.collector.emit(new Values(word, count));
    }
 
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}

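ReportBolt: prints the results

cleanup is called once before the topology exits. It is not guaranteed to run on a cluster, but in local mode it can be used to print the final counts.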

public class ReportBolt extends BaseRichBolt {
 
    private HashMap<String, Long> counts = null;
 
    public void prepare(Map config, TopologyContext context, OutputCollector collector) {
        this.counts = new HashMap<String, Long>();
    }
 
    public void execute(Tuple tuple) {
        String word = tuple.getStringByField("word");
        Long count = tuple.getLongByField("count");
        //System.out.println("word: " + word + "\tcount: " + count);
        this.counts.put(word, count);
    }
 
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // this bolt does not emit anything
    }
 
    @Override
    public void cleanup() {
        System.out.println("--- FINAL COUNTS ---");
        List<String> keys = new ArrayList<String>();
        keys.addAll(this.counts.keySet());
        Collections.sort(keys);
        for (String key : keys) {
            System.out.println(key + " : " + this.counts.get(key));
        }
        System.out.println("--------------");
    }
}

WordCountTopology

Defines the structure of the topology.

Stream groupings

A stream grouping defines how a stream should be partitioned among the bolt's tasks.

There are eight built-in stream groupings in Storm, and you can implement a custom stream grouping by implementing the CustomStreamGrouping interface:

Shuffle grouping: Tuples are randomly distributed across the bolt's tasks in a way such that each bolt is guaranteed to get an equal number of tuples.
Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task, but tuples with different "user-id"'s may go to different tasks.
Partial Key grouping: The stream is partitioned by the fields specified in the grouping, like the Fields grouping, but is load balanced between two downstream bolts, which provides better utilization of resources when the incoming data is skewed. The paper linked from the Storm documentation explains how it works and the advantages it provides.
All grouping: The stream is replicated across all the bolt's tasks. Use this grouping with care.
Global grouping: The entire stream goes to a single one of the bolt's tasks. Specifically, it goes to the task with the lowest id.
None grouping: Currently, none groupings are equivalent to shuffle groupings.
Direct grouping: This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the emitDirect methods of OutputCollector. A bolt can get the task ids of its consumers by either using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).
Local or shuffle grouping: If the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks. Otherwise, this acts like a normal shuffle grouping.
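
To make the difference between some of these groupings concrete, here is a minimal plain-Java sketch of how a target task could be chosen. It is an illustration under simplifying assumptions, not Storm's implementation: GroupingSketch and its methods are hypothetical, the fields grouping uses the list's hashCode rather than Storm's internal hashing, and the shuffle grouping is a plain random pick, which only approximates the even spread described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

class GroupingSketch {

    // Fields grouping: equal field values always map to the same task index.
    static int fieldsGrouping(List<?> groupingFieldValues, int numTasks) {
        return Math.floorMod(groupingFieldValues.hashCode(), numTasks);
    }

    // Shuffle grouping (simplified): pick any task index at random.
    static int shuffleGrouping(Random random, int numTasks) {
        return random.nextInt(numTasks);
    }

    // Global grouping: the whole stream goes to the task with the lowest id.
    static int globalGrouping(int[] taskIds) {
        return Arrays.stream(taskIds).min().getAsInt();
    }

    // All grouping: every task receives its own copy of the tuple.
    static List<Integer> allGrouping(int[] taskIds) {
        List<Integer> targets = new ArrayList<>();
        for (int id : taskIds) {
            targets.add(id);
        }
        return targets;
    }

    public static void main(String[] args) {
        int numTasks = 4;
        for (String word : new String[] {"storm", "bolt", "storm"}) {
            System.out.println(word + " -> task index "
                + fieldsGrouping(Arrays.asList(word), numTasks));
        }
        System.out.println("global -> task id " + globalGrouping(new int[] {7, 3, 5}));
    }
}
```

The property that matters for the word count is the first one: with a fields grouping on "word", every occurrence of the same word reaches the same WordCountBolt task, so each task can keep its counts in a local map.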

Resources:

TopologyBuilder: use this class to define topologies
InputDeclarer: this object is returned whenever setBolt is called on TopologyBuilder and is used for declaring a bolt's input streams and how those streams should be grouped

public class WordCountTopology {
 
    private static final String SENTENCE_SPOUT_ID = "sentence-spout";
    private static final String SPLIT_BOLT_ID = "split-bolt";
    private static final String COUNT_BOLT_ID = "count-bolt";
    private static final String REPORT_BOLT_ID = "report-bolt";
    private static final String TOPOLOGY_NAME = "word-count-topology";
 
    public static void main(String[] args) throws Exception {
 
        SentenceSpout spout = new SentenceSpout();
        SplitSentenceBolt splitBolt = new SplitSentenceBolt();
        WordCountBolt countBolt = new WordCountBolt();
        ReportBolt reportBolt = new ReportBolt();
 
 
        TopologyBuilder builder = new TopologyBuilder();
 
        builder.setSpout(SENTENCE_SPOUT_ID, spout);
        // SentenceSpout --> SplitSentenceBolt
        builder.setBolt(SPLIT_BOLT_ID, splitBolt)
                .shuffleGrouping(SENTENCE_SPOUT_ID);
        // SplitSentenceBolt --> WordCountBolt
        builder.setBolt(COUNT_BOLT_ID, countBolt)
                .fieldsGrouping(SPLIT_BOLT_ID, new Fields("word"));
        // WordCountBolt --> ReportBolt
        builder.setBolt(REPORT_BOLT_ID, reportBolt)
                .globalGrouping(COUNT_BOLT_ID);
 
        Config config = new Config();
 
        LocalCluster cluster = new LocalCluster();
 
        cluster.submitTopology(TOPOLOGY_NAME, config, builder.createTopology());
        Thread.sleep(10_000); // let the topology run for 10 seconds (the book uses a waitForSeconds helper here)
        cluster.killTopology(TOPOLOGY_NAME);
        cluster.shutdown();
    }
}

II. Parallelism

Understanding the Parallelism of a Storm Topology

1-The main components involved in running a topology

Worker processes
Executors (threads)
Tasks

[Figure: The relationships of worker processes, executors (threads) and tasks in Storm]

2-Configuring the parallelism of a topology

Number of worker processes

Description: How many worker processes to create for the topology across machines in the cluster.
Configuration option: TOPOLOGY_WORKERS
How to set in your code (examples):
- Config#setNumWorkers

Number of executors (threads)

Description: How many executors to spawn per component.
Configuration option: None (pass parallelism_hint parameter to setSpout or setBolt)
How to set in your code (examples):
- TopologyBuilder#setSpout()
- TopologyBuilder#setBolt()
- Note that as of Storm 0.8 the parallelism_hint parameter now specifies the initial number of executors (not tasks!) for that bolt.

Number of tasks

Description: How many tasks to create per component.
Configuration option: TOPOLOGY_TASKS
How to set in your code (examples):
- ComponentConfigurationDeclarer#setNumTasks()

Here is an example code snippet to show these settings in practice:

topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
               .setNumTasks(4)
               .shuffleGrouping("blue-spout");

This runs the bolt GreenBolt with an initial 2 executors and 4 associated tasks, so each executor runs two tasks.

An example of topology parallelism

The following illustration shows what a simple topology looks like in operation. The topology consists of three components: one spout called BlueSpout and two bolts called GreenBolt and YellowBolt. The components are linked such that BlueSpout sends its output to GreenBolt, which in turn sends its own output to YellowBolt.

[Figure: Example of a running topology in Storm]

The GreenBolt was configured as per the code snippet above whereas BlueSpout and YellowBolt only set the parallelism hint (number of executors). Here is the relevant code:

Config conf = new Config();
conf.setNumWorkers(2); // use two worker processes

topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // set parallelism hint to 2

topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
               .setNumTasks(4)
               .shuffleGrouping("blue-spout");

topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6)
               .shuffleGrouping("green-bolt");

StormSubmitter.submitTopology(
        "mytopology",
        conf,
        topologyBuilder.createTopology()
    );
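
The arithmetic behind this configuration can be checked with a few lines of plain Java. The sketch below assumes that a component without setNumTasks gets one task per executor and that executors are spread evenly over the workers; ParallelismSketch and its fields are hypothetical names used only for this illustration.

```java
class ParallelismSketch {

    static final String[] NAMES = {"blue-spout", "green-bolt", "yellow-bolt"};

    // Parallelism hints passed to setSpout / setBolt in the snippet above.
    static final int[] EXECUTORS = {2, 2, 6};

    // Tasks per component: one per executor by default, 4 for green-bolt
    // because of setNumTasks(4).
    static final int[] TASKS = {2, 4, 6};

    // Value passed to conf.setNumWorkers.
    static final int WORKERS = 2;

    static int sum(int[] values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    static int totalExecutors() {
        return sum(EXECUTORS);
    }

    static int totalTasks() {
        return sum(TASKS);
    }

    // Assumes the executors are distributed evenly over the workers.
    static int executorsPerWorker() {
        return totalExecutors() / WORKERS;
    }

    static int tasksPerExecutor(int component) {
        return TASKS[component] / EXECUTORS[component];
    }

    public static void main(String[] args) {
        for (int i = 0; i < NAMES.length; i++) {
            System.out.println(NAMES[i] + ": " + EXECUTORS[i] + " executors, "
                + TASKS[i] + " tasks, " + tasksPerExecutor(i) + " task(s) per executor");
        }
        System.out.println("executors in total: " + totalExecutors());
        System.out.println("executors per worker: " + executorsPerWorker());
    }
}
```

Under these assumptions the topology runs 2 + 2 + 6 = 10 executors, 5 in each of the two worker processes, and only green-bolt runs more than one task per executor.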

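Parallelism can also be controlled through a configuration option:

TOPOLOGY_MAX_TASK_PARALLELISM: the maximum parallelism allowed for a single component. It is typically used to limit the number of threads in local mode, and can be set with e.g. Config#setMaxTaskParallelism().

Changing the parallelism of a running topology

Rebalancing: adjusts the parallelism without restarting the cluster or the topology.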

You have two options to rebalance a topology:

Use the Storm web UI to rebalance the topology.
Use the CLI tool storm rebalance as described below.

Here is an example of using the CLI tool:

## Reconfigure the topology "mytopology" to use 5 worker processes,
## the spout "blue-spout" to use 3 executors and
## the bolt "yellow-bolt" to use 10 executors.
$ storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10

Adjusting the parallelism of the word-count example

public class WordCountTopology {
 
    private static final String SENTENCE_SPOUT_ID = "sentence-spout";
    private static final String SPLIT_BOLT_ID = "split-bolt";
    private static final String COUNT_BOLT_ID = "count-bolt";
    private static final String REPORT_BOLT_ID = "report-bolt";
    private static final String TOPOLOGY_NAME = "word-count-topology";
 
    public static void main(String[] args) throws Exception {
 
        SentenceSpout spout = new SentenceSpout();
        SplitSentenceBolt splitBolt = new SplitSentenceBolt();
        WordCountBolt countBolt = new WordCountBolt();
        ReportBolt reportBolt = new ReportBolt();
 
 
        TopologyBuilder builder = new TopologyBuilder();
 
        builder.setSpout(SENTENCE_SPOUT_ID, spout, 2);
        // SentenceSpout --> SplitSentenceBolt
        builder.setBolt(SPLIT_BOLT_ID, splitBolt, 2)
                .setNumTasks(4)
                .shuffleGrouping(SENTENCE_SPOUT_ID);
        // SplitSentenceBolt --> WordCountBolt
        builder.setBolt(COUNT_BOLT_ID, countBolt, 4)
                .fieldsGrouping(SPLIT_BOLT_ID, new Fields("word"));
        // WordCountBolt --> ReportBolt
        builder.setBolt(REPORT_BOLT_ID, reportBolt)
                .globalGrouping(COUNT_BOLT_ID);
 
        Config config = new Config();
        config.setNumWorkers(2);
 
        LocalCluster cluster = new LocalCluster();
 
        cluster.submitTopology(TOPOLOGY_NAME, config, builder.createTopology());
        Thread.sleep(10_000); // let the topology run for 10 seconds (the book uses a waitForSeconds helper here)
        cluster.killTopology(TOPOLOGY_NAME);
        cluster.shutdown();
    }
}

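III. Reliable stream processing: ack and fail

Guaranteeing Message Processing

How a spout consumes reliably

Take KestrelSpout reading from a Kestrel message queue as an example:

When KestrelSpout takes a message off the Kestrel queue, it "opens" the message. The message is not actually removed from the queue; it is placed in a "pending" state, waiting for an acknowledgement, and a pending message is not delivered to other consumers. When a message is opened, Kestrel returns the message data together with a unique id, which KestrelSpout uses as the message id of the tuple it emits. Later, Storm calls ack or fail on KestrelSpout, depending on whether the tuple was fully processed or failed (or timed out). KestrelSpout then sends the matching ack or fail for that id to Kestrel, which either removes the message from the queue for good or puts it back (clearing the pending state so that other consumers can receive it).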

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("sentences", new KestrelSpout("kestrel.backtype.com",
                                               22133,
                                               "sentence_queue",
                                               new StringScheme()));
builder.setBolt("split", new SplitSentence(), 10)
        .shuffleGrouping("sentences");
builder.setBolt("count", new WordCount(), 20)
        .fieldsGrouping("split", new Fields("word"));

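How to use the API for reliable processing

Two things are required:

Anchor: tell Storm whenever a new link is created in the tuple tree. Anchoring is done by passing the input tuple as the first argument to emit, e.g. _collector.emit(tuple, new Values(word)), which anchors the emitted tuple to the input tuple.
Ack or fail: tell Storm when an individual tuple has been processed.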
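
Why both steps are needed can be seen with a toy model of a tuple tree. The sketch below is deliberately simplified and is an assumption made for readability: it tracks pending tuple ids in a set, which is not how Storm's acker is implemented, and TupleTreeSketch with all of its methods is a hypothetical name. A spout tuple counts as fully processed only when every tuple anchored into its tree has been acked.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of one tuple tree; ids are plain strings chosen by the caller.
class TupleTreeSketch {

    private final Set<String> pending = new HashSet<>();
    private final Map<String, String> parentOf = new HashMap<>();
    private boolean failed = false;

    // The spout emits the root tuple of the tree.
    void emitRoot(String id) {
        pending.add(id);
    }

    // Anchored emit: the child becomes part of the parent's tree.
    void emitAnchored(String parentId, String childId) {
        parentOf.put(childId, parentId);
        pending.add(childId);
    }

    // A bolt reports that it has finished processing this tuple.
    void ack(String id) {
        pending.remove(id);
    }

    // A bolt reports a failure; the whole tree counts as failed.
    void fail(String id) {
        failed = true;
    }

    String parentOf(String childId) {
        return parentOf.get(childId);
    }

    boolean isFailed() {
        return failed;
    }

    // True once every tuple in the tree has been acked and none has failed.
    boolean isFullyProcessed() {
        return !failed && pending.isEmpty();
    }

    public static void main(String[] args) {
        TupleTreeSketch tree = new TupleTreeSketch();
        tree.emitRoot("sentence");
        tree.emitAnchored("sentence", "word-1");
        tree.emitAnchored("sentence", "word-2");
        tree.ack("sentence");
        System.out.println("after acking the sentence: " + tree.isFullyProcessed());
        tree.ack("word-1");
        tree.ack("word-2");
        System.out.println("after acking both words: " + tree.isFullyProcessed());
    }
}
```

In this model a tuple emitted without anchoring would simply never enter the tree, so its failure could not cause a replay, and a tuple that is never acked keeps the tree pending until the time-out.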

1-Anchoring: telling Storm about a new link in the tuple tree

Specifying a link in the tuple tree is called anchoring. Anchoring is done at the same time you emit a new tuple. Let's use the following bolt as an example. This bolt splits a tuple containing a sentence into a tuple for each word:

public class SplitSentence extends BaseRichBolt {
        OutputCollector _collector;

        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            _collector = collector;
        }

        public void execute(Tuple tuple) {
            String sentence = tuple.getString(0);
            for(String word: sentence.split(" ")) {
                _collector.emit(tuple, new Values(word));
            }
            _collector.ack(tuple);
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }        
    }

An output tuple can be anchored to more than one input tuple, which is useful for streaming joins or aggregations. When a multi-anchored tuple fails, multiple tuples are replayed from the spouts.

List<Tuple> anchors = new ArrayList<Tuple>();
anchors.add(tuple1);
anchors.add(tuple2);
_collector.emit(anchors, new Values(1, 2, 3));

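2-Telling Storm when a tuple has been processed

This is done with the ack and fail methods of OutputCollector. In the SplitSentence example above, the input tuple is acked once, with _collector.ack(tuple), after all the word tuples have been emitted.

fail marks the spout tuple at the root of the tuple tree as failed right away. For example, a bolt can catch an exception and explicitly fail the input tuple, so that the spout tuple is replayed without waiting for the time-out.

Storm tracks every tuple in memory, so every tuple you process must be acked or failed; otherwise the task will eventually run out of memory.

Most bolts read one input tuple, emit tuples based on it, and ack the input tuple at the end of execute. The IBasicBolt interface has this pattern built in (it does not cover multi-anchoring). SplitSentence written against it, by extending BaseBasicBolt, looks like this: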

public class SplitSentence extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String sentence = tuple.getString(0);
            for(String word: sentence.split(" ")) {
                collector.emit(new Values(word));
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }        
    }

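Making the word count reliable with acks

The spout generates a UUID for every tuple it emits and keeps the tuple in a ConcurrentHashMap keyed by that UUID. When the downstream tuples are acked correctly, the entry is removed; on failure the tuple is emitted again.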

public class SentenceSpout extends BaseRichSpout {
    private ConcurrentHashMap<UUID, Values> pending;
    private SpoutOutputCollector collector;
    private String[] sentences = {
            "Work all done, care laid by",
            "Never fear no more",
            "Shadows gone, break of day",
            "Real life just begun"
    };
    private int index = 0;
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
    public void open(Map config, TopologyContext context,
            SpoutOutputCollector collector) {
        this.collector = collector;
        this.pending = new ConcurrentHashMap<UUID, Values>();
    }
 
    public void nextTuple() {
        Values values = new Values(sentences[index]);
        UUID msgId = UUID.randomUUID();
        this.pending.put(msgId, values);
        this.collector.emit(values, msgId);
        index++;
        if (index >= sentences.length) {
            index = 0;
        }
        Utils.sleep(1); // org.apache.storm.utils.Utils; stands in for the book's waitForMillis helper
    }
    public void ack(Object msgId) {
        this.pending.remove(msgId);
    }
    public void fail(Object msgId) {
        this.collector.emit(this.pending.get(msgId), msgId);
    }   
}
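
The pending-map bookkeeping used by this spout can be exercised without a cluster. The following plain-Java sketch mirrors the emit, ack and fail logic under stated assumptions: PendingReplaySketch is a hypothetical class, the "emitted" list stands in for the SpoutOutputCollector, and unlike the spout above it ignores a fail for an id that is no longer pending.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

class PendingReplaySketch {

    // Tuples that were emitted but not yet acked, keyed by message id.
    private final Map<UUID, String> pending = new ConcurrentHashMap<>();

    // Stand-in for the collector: records everything that was emitted.
    private final List<String> emitted = new ArrayList<>();

    // Emit a sentence with a fresh message id and remember it as pending.
    UUID emit(String sentence) {
        UUID msgId = UUID.randomUUID();
        pending.put(msgId, sentence);
        emitted.add(sentence);
        return msgId;
    }

    // The tuple tree completed: forget the tuple.
    void ack(UUID msgId) {
        pending.remove(msgId);
    }

    // The tuple tree failed or timed out: emit the same tuple again
    // under the same message id, and keep it pending.
    void fail(UUID msgId) {
        String sentence = pending.get(msgId);
        if (sentence != null) {
            emitted.add(sentence);
        }
    }

    int pendingCount() {
        return pending.size();
    }

    List<String> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        PendingReplaySketch spout = new PendingReplaySketch();
        UUID first = spout.emit("Never fear no more");
        UUID second = spout.emit("Real life just begun");
        spout.ack(first);
        spout.fail(second);
        System.out.println("still pending: " + spout.pendingCount());
        System.out.println("emitted so far: " + spout.emitted());
    }
}
```

An acked tuple leaves the map for good, while a failed one stays pending and is emitted a second time, which is the at-least-once behaviour the modified SentenceSpout aims for.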
  
  
=========Changes in SplitSentenceBolt=========
public void execute(Tuple tuple) {
    String sentence = tuple.getStringByField("sentence");
    String[] words = sentence.split(" ");
    for(String word : words){
        this.collector.emit(tuple, new Values(word)); // the first argument is the input tuple to anchor to
    }
    this.collector.ack(tuple); // ack the upstream input tuple once all words are emitted
}
=========Changes in WordCountBolt=========
public void execute(Tuple tuple) {
    String word = tuple.getStringByField("word");
    Long count = this.counts.get(word);
    if(count == null){
        count = 0L;
    }
    count++;
    this.counts.put(word, count);
    // Emit the anchored tuple first, then ack the input tuple.
    this.collector.emit(tuple, new Values(word, count));
    this.collector.ack(tuple);
}
 
 
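=========Changes in ReportBolt=========
// Requires a collector field that prepare() assigns (this.collector = collector);
// the earlier ReportBolt did not keep the collector.
public void execute(Tuple tuple) {
    String word = tuple.getStringByField("word");
    Long count = tuple.getLongByField("count");
    this.counts.put(word, count);
    this.collector.ack(tuple);
}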