Storm [2]: Basic Module Programming in Storm

Study reference: Storm Distributed Real-Time Computation Patterns (《Storm分布式实时计算模式》)
For Storm basics, see: Storm Basic Concepts

Part 1: Basic Framework

Word count example. The topology has four components:

  • SentenceSpout: simulates a message producer, emitting sentences
  • SplitSentenceBolt: splits each sentence into words
  • WordCountBolt: counts the words
  • ReportBolt: prints the final counts

SentenceSpout: simulating a message producer

  • BaseRichSpout is a convenience implementation of ISpout and IComponent
  • declareOutputFields: defined in IComponent; every spout and bolt must implement it. It declares the fields (keys) of the tuples in the stream
  • open(): defined in ISpout; called when the spout is initialized. The Map carries the Storm configuration, TopologyContext describes the topology, and SpoutOutputCollector provides the emit methods
public class SentenceSpout extends BaseRichSpout {
 
    private SpoutOutputCollector collector;
    private String[] sentences = {
        "Work all done, care laid by",
        "Never fear no more",
        "Shadows gone, break of day",
        "Real life just begun"
    };
    private int index = 0;
 
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
 
    public void open(Map config, TopologyContext context,
            SpoutOutputCollector collector) {
        this.collector = collector;
    }
 
    public void nextTuple() {
        this.collector.emit(new Values(sentences[index]));
        index++;
        if (index >= sentences.length) {
            index = 0;
        }
        Utils.waitForMillis(1);
    }
}
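
Utils.waitForMillis (and waitForSeconds, used later when submitting the topology) are small helpers from the book's sample code, in their own package distinct from backtype.storm.utils.Utils; they are not part of the Storm API. A minimal stand-in, assuming the book's versions simply wrap Thread.sleep:

public class Utils {
    // Sleep helper used to throttle nextTuple(); swallows the interrupt.
    public static void waitForMillis(long milliseconds) {
        try {
            Thread.sleep(milliseconds);
        } catch (InterruptedException e) {
            // ignore
        }
    }

    // Convenience wrapper used when waiting for a local topology to run.
    public static void waitForSeconds(int seconds) {
        waitForMillis(seconds * 1000L);
    }
}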

SplitSentenceBolt: splitting sentences into words

  • BaseRichBolt is a convenience implementation of IComponent and IBolt
  • prepare: the initialization hook; typically used to initialize non-serializable objects (primitive fields and serializable objects are assigned and instantiated in the constructor)
  • execute: called for every tuple received from the subscribed streams
public class SplitSentenceBolt extends BaseRichBolt{
    private OutputCollector collector;
 
    public void prepare(Map config, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }
 
    public void execute(Tuple tuple) {
        String sentence = tuple.getStringByField("sentence");
        String[] words = sentence.split(" ");
        for(String word : words){
            this.collector.emit(new Values(word));
        }
    }
 
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

WordCountBolt: counting words

public class WordCountBolt extends BaseRichBolt{
    private OutputCollector collector;
    private HashMap<String, Long> counts = null;
 
    public void prepare(Map config, TopologyContext context,
            OutputCollector collector) {
        this.collector = collector;
        this.counts = new HashMap<String, Long>();
    }
 
    public void execute(Tuple tuple) {
        String word = tuple.getStringByField("word");
        Long count = this.counts.get(word);
        if(count == null){
            count = 0L;
        }
        count++;
        this.counts.put(word, count);
        this.collector.emit(new Values(word, count));
    }
 
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}

ReportBolt: printing the report

cleanup is called once, just before the topology shuts down. It is not guaranteed to run in a cluster environment, but in local mode it is handy for printing the final counts.

public class ReportBolt extends BaseRichBolt {
 
    private HashMap<String, Long> counts = null;
 
    public void prepare(Map config, TopologyContext context, OutputCollector collector) {
        this.counts = new HashMap<String, Long>();
    }
 
    public void execute(Tuple tuple) {
        String word = tuple.getStringByField("word");
        Long count = tuple.getLongByField("count");
        //System.out.println("word: " + word + "\tcount: " + count);
        this.counts.put(word, count);
    }
 
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // this bolt does not emit anything
    }
 
    @Override
    public void cleanup() {
        System.out.println("--- FINAL COUNTS ---");
        List<String> keys = new ArrayList<String>();
        keys.addAll(this.counts.keySet());
        Collections.sort(keys);
        for (String key : keys) {
            System.out.println(key + " : " + this.counts.get(key));
        }
        System.out.println("--------------");
    }
}

WordCountTopology

Defines the structure of the topology.

Stream groupings

A stream grouping defines how that stream should be partitioned among the bolt's tasks.

There are eight built-in stream groupings in Storm, and you can implement a custom stream grouping by implementing the CustomStreamGrouping interface (see the sketch after the list below):

  1. Shuffle grouping: Tuples are randomly distributed across the bolt's tasks in a way such that each bolt is guaranteed to get an equal number of tuples.
  2. Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task, but tuples with different "user-id"'s may go to different tasks.
  3. Partial Key grouping: The stream is partitioned by the fields specified in the grouping, like the Fields grouping, but are load balanced between two downstream bolts, which provides better utilization of resources when the incoming data is skewed. This paper provides a good explanation of how it works and the advantages it provides.
  4. All grouping: The stream is replicated across all the bolt's tasks. Use this grouping with care.
  5. Global grouping: The entire stream goes to a single one of the bolt's tasks. Specifically, it goes to the task with the lowest id.
  6. None grouping: Currently, none groupings are equivalent to shuffle groupings. 
  7. Direct grouping: This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the emitDirect methods on OutputCollector. A bolt can get the task ids of its consumers by either using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).
  8. Local or shuffle grouping: If the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks. Otherwise, this acts like a normal shuffle grouping.
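
For the custom case, here is a minimal sketch of a CustomStreamGrouping that routes each tuple by the hash of its first field, i.e. a hand-rolled fields grouping. The class name ModHashGrouping is illustrative; the interface methods are Storm's, and the imports assume the pre-1.0 backtype.storm namespace used by the book (org.apache.storm since Storm 1.0):

import java.util.Collections;
import java.util.List;

import backtype.storm.generated.GlobalStreamId;
import backtype.storm.grouping.CustomStreamGrouping;
import backtype.storm.task.WorkerTopologyContext;

public class ModHashGrouping implements CustomStreamGrouping {
    private List<Integer> targetTasks;

    public void prepare(WorkerTopologyContext context, GlobalStreamId stream,
            List<Integer> targetTasks) {
        this.targetTasks = targetTasks; // task ids of the consuming bolt
    }

    public List<Integer> chooseTasks(int taskId, List<Object> values) {
        // Equal first values always land on the same task.
        int idx = Math.abs(values.get(0).hashCode()) % targetTasks.size();
        return Collections.singletonList(targetTasks.get(idx));
    }
}

It would be wired in with InputDeclarer#customGrouping, e.g. builder.setBolt(COUNT_BOLT_ID, countBolt).customGrouping(SPLIT_BOLT_ID, new ModHashGrouping());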

Resources:

  • TopologyBuilder: use this class to define topologies
  • InputDeclarer: this object is returned whenever setBolt is called on TopologyBuilder and is used for declaring a bolt's input streams and how those streams should be grouped
public class WordCountTopology {
 
    private static final String SENTENCE_SPOUT_ID = "sentence-spout";
    private static final String SPLIT_BOLT_ID = "split-bolt";
    private static final String COUNT_BOLT_ID = "count-bolt";
    private static final String REPORT_BOLT_ID = "report-bolt";
    private static final String TOPOLOGY_NAME = "word-count-topology";
 
    public static void main(String[] args) throws Exception {
 
        SentenceSpout spout = new SentenceSpout();
        SplitSentenceBolt splitBolt = new SplitSentenceBolt();
        WordCountBolt countBolt = new WordCountBolt();
        ReportBolt reportBolt = new ReportBolt();
 
 
        TopologyBuilder builder = new TopologyBuilder();
 
        builder.setSpout(SENTENCE_SPOUT_ID, spout);
        // SentenceSpout --> SplitSentenceBolt
        builder.setBolt(SPLIT_BOLT_ID, splitBolt)
                .shuffleGrouping(SENTENCE_SPOUT_ID);
        // SplitSentenceBolt --> WordCountBolt
        builder.setBolt(COUNT_BOLT_ID, countBolt)
                .fieldsGrouping(SPLIT_BOLT_ID, new Fields("word"));
        // WordCountBolt --> ReportBolt
        builder.setBolt(REPORT_BOLT_ID, reportBolt)
                .globalGrouping(COUNT_BOLT_ID);
 
        Config config = new Config();
 
        LocalCluster cluster = new LocalCluster();
 
        cluster.submitTopology(TOPOLOGY_NAME, config, builder.createTopology());
        waitForSeconds(10);
        cluster.killTopology(TOPOLOGY_NAME);
        cluster.shutdown();
    }
}

Part 2: Parallelism

Understanding the Parallelism of a Storm Topology

  1. Worker processes
  2. Executors (threads)
  3. Tasks

Here is a simple illustration of their relationships:

[Figure: The relationships of worker processes, executors (threads), and tasks in Storm]

Configuring the parallelism of a topology

Number of worker processes
  • Description: How many worker processes to create for the topology across machines in the cluster.
  • Configuration option: TOPOLOGY_WORKERS
  • How to set in your code (examples): Config#setNumWorkers
Number of executors (threads)
  • Description: How many executors to spawn per component.
  • Configuration option: None (pass the parallelism_hint parameter to setSpout or setBolt)
  • How to set in your code (examples): TopologyBuilder#setSpout() and TopologyBuilder#setBolt()
Number of tasks
  • Description: How many tasks to create per component.
  • Configuration option: TOPOLOGY_TASKS
  • How to set in your code (examples): ComponentConfigurationDeclarer#setNumTasks()

Here is an example code snippet to show these settings in practice:

topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
               .setNumTasks(4)
               .shuffleGrouping("blue-spout");

This runs GreenBolt with 2 executors and 4 associated tasks, so each executor runs two tasks.

An example of topology parallelism

The following illustration shows what a simple topology looks like in operation. The topology consists of three components: one spout called BlueSpout and two bolts called GreenBolt and YellowBolt. The components are linked such that BlueSpout sends its output to GreenBolt, which in turn sends its own output to YellowBolt.

[Figure: Example of a running topology in Storm]

The GreenBolt was configured as per the code snippet above whereas BlueSpout and YellowBolt only set the parallelism hint (number of executors). Here is the relevant code:

Config conf = new Config();
conf.setNumWorkers(2); // use two worker processes

topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // set parallelism hint to 2

topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
               .setNumTasks(4)
               .shuffleGrouping("blue-spout");

topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6)
               .shuffleGrouping("green-bolt");

StormSubmitter.submitTopology(
        "mytopology",
        conf,
        topologyBuilder.createTopology()
    );

Parallelism can also be capped via configuration: TOPOLOGY_MAX_TASK_PARALLELISM puts a ceiling on the number of executors per component, typically to limit the threads spawned when testing in local mode.
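
A minimal sketch of the config-based cap (values illustrative; builder is the TopologyBuilder from the example above):

Config conf = new Config();
// Cap every component at no more than 2 executors; useful in local-mode tests.
conf.setMaxTaskParallelism(2);
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("word-count-topology", conf, builder.createTopology());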

Changing the parallelism of a running topology

Rebalancing lets you change the parallelism without restarting the cluster or the topology.

You have two options to rebalance a topology:

  1. Use the Storm web UI to rebalance the topology.
  2. Use the CLI tool storm rebalance as described below.

Here is an example of using the CLI tool:

## Reconfigure the topology "mytopology" to use 5 worker processes,
## the spout "blue-spout" to use 3 executors and
## the bolt "yellow-bolt" to use 10 executors.
$ storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10

Setting the parallelism of the word-count topology

public class WordCountTopology {
 
    private static final String SENTENCE_SPOUT_ID = "sentence-spout";
    private static final String SPLIT_BOLT_ID = "split-bolt";
    private static final String COUNT_BOLT_ID = "count-bolt";
    private static final String REPORT_BOLT_ID = "report-bolt";
    private static final String TOPOLOGY_NAME = "word-count-topology";
 
    public static void main(String[] args) throws Exception {
 
        SentenceSpout spout = new SentenceSpout();
        SplitSentenceBolt splitBolt = new SplitSentenceBolt();
        WordCountBolt countBolt = new WordCountBolt();
        ReportBolt reportBolt = new ReportBolt();
 
 
        TopologyBuilder builder = new TopologyBuilder();
 
        builder.setSpout(SENTENCE_SPOUT_ID, spout, 2);
        // SentenceSpout --> SplitSentenceBolt
        builder.setBolt(SPLIT_BOLT_ID, splitBolt, 2)
                .setNumTasks(4)
                .shuffleGrouping(SENTENCE_SPOUT_ID);
        // SplitSentenceBolt --> WordCountBolt
        builder.setBolt(COUNT_BOLT_ID, countBolt, 4)
                .fieldsGrouping(SPLIT_BOLT_ID, new Fields("word"));
        // WordCountBolt --> ReportBolt
        builder.setBolt(REPORT_BOLT_ID, reportBolt)
                .globalGrouping(COUNT_BOLT_ID);
 
        Config config = new Config();
        config.setNumWorkers(2);
 
        LocalCluster cluster = new LocalCluster();
 
        cluster.submitTopology(TOPOLOGY_NAME, config, builder.createTopology());
        waitForSeconds(10);
        cluster.killTopology(TOPOLOGY_NAME);
        cluster.shutdown();
    }
}

Part 3: Reliable Stream Processing with ack and fail

Guaranteeing Message Processing

Implementing reliable consumption in a spout

Take KestrelSpout, which consumes from a Kestrel message queue, as an example:

When KestrelSpout opens a message from the Kestrel queue, the message is not actually taken off the queue; it is marked "pending" (awaiting an ack), and a pending message will not be handed to other consumers. Along with the message data, Kestrel returns a unique message id, which KestrelSpout uses as the msgId of the emitted tuple. Later, when Storm calls ack or fail on KestrelSpout (because the tuple was fully processed, or because it failed or timed out), the spout sends an ack or fail message with that id back to Kestrel, which either takes the message off the queue for good or puts it back, clearing the pending state so other consumers can take it.
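
A minimal sketch of what those two callbacks look like in a Kestrel-style spout; kestrelClient and its ack/fail calls are hypothetical stand-ins for the real client API:

// Storm calls these on the spout once the tuple tree completes or fails.
public void ack(Object msgId) {
    // Fully processed: remove the message from the Kestrel queue for good.
    kestrelClient.ack((Long) msgId);   // hypothetical client call
}

public void fail(Object msgId) {
    // Failed or timed out: clear the pending state so the message
    // goes back on the queue for other consumers.
    kestrelClient.fail((Long) msgId);  // hypothetical client call
}

The topology that consumes from the queue is wired up as follows: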
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("sentences", new KestrelSpout("kestrel.backtype.com",
                                               22133,
                                               "sentence_queue",
                                               new StringScheme()));
builder.setBolt("split", new SplitSentence(), 10)
        .shuffleGrouping("sentences");
builder.setBolt("count", new WordCount(), 20)
        .fieldsGrouping("split", new Fields("word"));

Using the API for reliable processing

Two things are required:
  • Anchoring: tell Storm whenever a new link is added to the tuple tree. Anchoring happens at emit time by passing the input tuple as the first argument, e.g. _collector.emit(tuple, new Values(word)); this anchors the emitted tuple to the input tuple.
  • Tell Storm when you have finished processing an individual tuple.

1. Anchoring: telling Storm about a new link in the tuple tree

Specifying a link in the tuple tree is called anchoring. Anchoring is done at the same time you emit a new tuple. Let's use the following bolt as an example. This bolt splits a tuple containing a sentence into a tuple for each word:

public class SplitSentence extends BaseRichBolt {
        OutputCollector _collector;

        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            _collector = collector;
        }

        public void execute(Tuple tuple) {
            String sentence = tuple.getString(0);
            for(String word: sentence.split(" ")) {
                _collector.emit(tuple, new Values(word));
            }
            _collector.ack(tuple);
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }        
    }
An output tuple can be anchored to more than one input tuple (useful for streaming joins or aggregations). When a multi-anchored tuple fails, multiple spout tuples will be replayed:
List<Tuple> anchors = new ArrayList<Tuple>();
anchors.add(tuple1);
anchors.add(tuple2);
_collector.emit(anchors, new Values(1, 2, 3));

2. Telling Storm when you have finished processing a tuple

You tell Storm via the ack and fail methods on OutputCollector. As in the SplitSentence example, ack is called once, after all the word tuples have been emitted: _collector.ack(tuple);

fail reports a downstream failure back to the spout tuple. You can catch an exception and fail the tuple explicitly, so the spout learns of the failure immediately instead of waiting for the tuple to time out.
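
A sketch of that fail-fast pattern, reusing the collector field from the bolts above; process() is a hypothetical stand-in for the bolt's real work:

public void execute(Tuple tuple) {
    try {
        process(tuple);              // hypothetical business logic
        this.collector.ack(tuple);   // success: mark the tuple done
    } catch (Exception e) {
        // Report the failure immediately instead of waiting for the
        // tuple timeout; the spout's fail() will be called upstream.
        this.collector.fail(tuple);
    }
}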

Storm tracks every tuple in memory, so every tuple being processed must be acked or failed; otherwise the task will eventually run out of memory.

Most bolts follow a common pattern: read one input tuple, emit tuples based on it, and ack once at the end of execute. The IBasicBolt interface encapsulates this pattern automatically (it does not support multi-anchoring). Here is SplitSentence written as a BasicBolt:

public class SplitSentence extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String sentence = tuple.getString(0);
            for(String word: sentence.split(" ")) {
                collector.emit(new Values(word));
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }        
    }

Modifying the word-count example for reliable acking

Generate a UUID for every emitted sentence and keep it in a ConcurrentHashMap of pending tuples. When the tuple tree is fully acked, remove the entry; on failure, re-emit the tuple.

public class SentenceSpout extends BaseRichSpout {
    private ConcurrentHashMap<UUID, Values> pending;
    private SpoutOutputCollector collector;
    private String[] sentences = {
            "Work all done, care laid by",
            "Never fear no more",
            "Shadows gone, break of day",
            "Real life just begun"
    };
    private int index = 0;
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
    public void open(Map config, TopologyContext context,
            SpoutOutputCollector collector) {
        this.collector = collector;
        this.pending = new ConcurrentHashMap<UUID, Values>();
    }
 
    public void nextTuple() {
        Values values = new Values(sentences[index]);
        UUID msgId = UUID.randomUUID();
        this.pending.put(msgId, values);
        this.collector.emit(values, msgId);
        index++;
        if (index >= sentences.length) {
            index = 0;
        }
        Utils.waitForMillis(1);
    }
    public void ack(Object msgId) {
        this.pending.remove(msgId);
    }
    public void fail(Object msgId) {
        this.collector.emit(this.pending.get(msgId), msgId);
    }   
}
  
  
========= Changes in SplitSentenceBolt =========
public void execute(Tuple tuple) {
    String sentence = tuple.getStringByField("sentence");
    String[] words = sentence.split(" ");
    for(String word : words){
        this.collector.emit(tuple, new Values(word)); // first argument anchors to the input tuple
    }
    this.collector.ack(tuple); // ack the upstream input tuple when done
}
========= Changes in WordCountBolt =========
public void execute(Tuple tuple) {
    String word = tuple.getStringByField("word");
    Long count = this.counts.get(word);
    if(count == null){
        count = 0L;
    }
    count++;
    this.counts.put(word, count);
    this.collector.ack(tuple);
    this.collector.emit(tuple, new Values(word, count));
}
 
 
========= Changes in ReportBolt =========
// Note: ReportBolt must now also save the OutputCollector in prepare
// (this.collector = collector;), which the original version discarded.
public void execute(Tuple tuple) {
    String word = tuple.getStringByField("word");
    Long count = tuple.getLongByField("count");
    this.counts.put(word, count);
    this.collector.ack(tuple);
}




