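I. Basic Framework
Word count
SentenceSpout: simulates a message producer
SplitSentenceBolt: splits sentences into words
SentenceSpout: simulates a message producer
- BaseRichSpout is a convenience implementation of ISpout and IComponent.
- declareOutputFields: defined in IComponent, so every spout and bolt must implement it. It declares the field names (keys) of the tuples in the output stream.
- open(): defined in ISpout and called when the spout is initialized. The Map holds the Storm configuration, TopologyContext describes the topology, and SpoutOutputCollector provides the emit method.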
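import java.util.Map;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private String[] sentences = {
        "Work all done, care laid by",
        "Never fear no more",
        "Shadows gone, break of day",
        "Real life just begun"
    };
    private int index = 0;

    // Every tuple emitted by this spout has a single field named "sentence".
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }

    // Called once when the spout is initialized; keep the collector for later emits.
    public void open(Map config, TopologyContext context,
            SpoutOutputCollector collector) {
        this.collector = collector;
    }

    // Emits the next sentence, cycling through the array indefinitely.
    public void nextTuple() {
        this.collector.emit(new Values(sentences[index]));
        index++;
        if (index >= sentences.length) {
            index = 0;
        }
        // Brief pause between emits (org.apache.storm.utils.Utils.sleep).
        Utils.sleep(1);
    }
}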
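SplitSentenceBolt: splits sentences into words
- BaseRichBolt is a convenience implementation of IComponent and IBolt.
- prepare performs initialization. It is the usual place to create objects that are not serializable; primitive values and serializable objects can be assigned and instantiated in the constructor.
- execute is called every time the bolt receives a tuple from a stream it subscribes to.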
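import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class SplitSentenceBolt extends BaseRichBolt {
    private OutputCollector collector;

    public void prepare(Map config, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    // Splits the incoming sentence on spaces and emits one tuple per word.
    public void execute(Tuple tuple) {
        String sentence = tuple.getStringByField("sentence");
        String[] words = sentence.split(" ");
        for (String word : words) {
            this.collector.emit(new Values(word));
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}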
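WordCountBolt: counts words
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class WordCountBolt extends BaseRichBolt {
    private OutputCollector collector;
    private HashMap<String, Long> counts = null;

    // The map is created here rather than in the constructor, following the prepare() convention.
    public void prepare(Map config, TopologyContext context,
            OutputCollector collector) {
        this.collector = collector;
        this.counts = new HashMap<String, Long>();
    }

    // Increments the running count for the word and emits the updated (word, count) pair.
    public void execute(Tuple tuple) {
        String word = tuple.getStringByField("word");
        Long count = this.counts.get(word);
        if (count == null) {
            count = 0L;
        }
        count++;
        this.counts.put(word, count);
        this.collector.emit(new Values(word, count));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}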
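The get / null-check / put sequence in execute can be written more compactly with Map.merge. The sketch below is plain Java with no Storm dependency; the class name WordCountSketch and its count method are made up for illustration and are not part of the topology.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the counting step of WordCountBolt.
final class WordCountSketch {
    // Counts how often each word occurs, using Map.merge instead of get/null-check/put.
    static Map<String, Long> count(String[] words) {
        Map<String, Long> counts = new HashMap<>();
        for (String word : words) {
            // Stores 1 for a new word, otherwise adds 1 to the existing count.
            counts.merge(word, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] words = "Never fear no more".split(" ");
        System.out.println(count(words));
    }
}
```

Inside the bolt, the same call would replace the four lines between reading the word and emitting the result, with the return value of merge serving as the new count.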
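ReportBolt: prints the results
cleanup runs once, just before the topology shuts down. It is not guaranteed to be called in a cluster environment, but in local mode it is a convenient place to print the final counts.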
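import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class ReportBolt extends BaseRichBolt {
    private HashMap<String, Long> counts = null;

    public void prepare(Map config, TopologyContext context, OutputCollector collector) {
        this.counts = new HashMap<String, Long>();
    }

    // Keeps only the latest count seen for each word.
    public void execute(Tuple tuple) {
        String word = tuple.getStringByField("word");
        Long count = tuple.getLongByField("count");
        //System.out.println("word: " + word + "\tcount: " + count);
        this.counts.put(word, count);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // this bolt does not emit anything
    }

    // Prints the final counts in alphabetical order when the topology shuts down.
    @Override
    public void cleanup() {
        System.out.println("--- FINAL COUNTS ---");
        List<String> keys = new ArrayList<String>();
        keys.addAll(this.counts.keySet());
        Collections.sort(keys);
        for (String key : keys) {
            System.out.println(key + " : " + this.counts.get(key));
        }
        System.out.println("--------------");
    }
}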
WordCountTopology
Defines the structure of the topology.
Stream groupings
A stream grouping defines how a stream should be partitioned among the bolt's tasks.
Storm provides the built-in groupings listed below. You can also implement a custom stream grouping by implementing the CustomStreamGrouping interface:
- Shuffle grouping: Tuples are randomly distributed across the bolt's tasks in a way such that each bolt is guaranteed to get an equal number of tuples.
- Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task, but tuples with different "user-id"'s may go to different tasks.
- Partial Key grouping: The stream is partitioned by the fields specified in the grouping, like the Fields grouping, but are load balanced between two downstream bolts, which provides better utilization of resources when the incoming data is skewed. This paper provides a good explanation of how it works and the advantages it provides.
- All grouping: The stream is replicated across all the bolt's tasks. Use this grouping with care.
- Global grouping: The entire stream goes to a single one of the bolt's tasks. Specifically, it goes to the task with the lowest id.
- None grouping: Currently, none groupings are equivalent to shuffle groupings.
- Direct grouping: This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the emitDirect methods of OutputCollector (javadocs/org/apache/storm/task/OutputCollector.html). A bolt can get the task ids of its consumers by either using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).
- Local or shuffle grouping: If the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks. Otherwise, this acts like a normal shuffle grouping.
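To make the difference between fields grouping and global grouping concrete, here is a minimal plain-Java sketch. It assumes that fields grouping can be modeled as hashing the grouping value and taking it modulo the number of tasks; Storm's real implementation differs in detail, and the class and method names (GroupingSketch, fieldsGroupingTask, globalGroupingTask) are hypothetical.

```java
import java.util.Arrays;

// Simplified model of two stream groupings; not Storm's actual routing code.
final class GroupingSketch {
    // Fields grouping (simplified): the same key always maps to the same task index.
    static int fieldsGroupingTask(String key, int numTasks) {
        return Math.floorMod(key.hashCode(), numTasks);
    }

    // Global grouping: the whole stream goes to the task with the lowest id.
    static int globalGroupingTask(int[] taskIds) {
        return Arrays.stream(taskIds).min().getAsInt();
    }

    public static void main(String[] args) {
        for (String word : new String[] {"storm", "bolt", "storm"}) {
            System.out.println(word + " -> task " + fieldsGroupingTask(word, 4));
        }
        System.out.println("global -> task " + globalGroupingTask(new int[] {5, 3, 9}));
    }
}
```

This is why the word count topology uses fieldsGrouping on "word" in front of the counting bolt: every occurrence of a given word must reach the task that holds its counter.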
Resources:
- TopologyBuilder: use this class to define topologies
- InputDeclarer: this object is returned whenever setBolt is called on TopologyBuilder and is used for declaring a bolt's input streams and how those streams should be grouped
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;
import org.apache.storm.utils.Utils;

public class WordCountTopology {
    private static final String SENTENCE_SPOUT_ID = "sentence-spout";
    private static final String SPLIT_BOLT_ID = "split-bolt";
    private static final String COUNT_BOLT_ID = "count-bolt";
    private static final String REPORT_BOLT_ID = "report-bolt";
    private static final String TOPOLOGY_NAME = "word-count-topology";

    public static void main(String[] args) throws Exception {
        SentenceSpout spout = new SentenceSpout();
        SplitSentenceBolt splitBolt = new SplitSentenceBolt();
        WordCountBolt countBolt = new WordCountBolt();
        ReportBolt reportBolt = new ReportBolt();

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout(SENTENCE_SPOUT_ID, spout);
        // SentenceSpout --> SplitSentenceBolt
        builder.setBolt(SPLIT_BOLT_ID, splitBolt)
                .shuffleGrouping(SENTENCE_SPOUT_ID);
        // SplitSentenceBolt --> WordCountBolt
        builder.setBolt(COUNT_BOLT_ID, countBolt)
                .fieldsGrouping(SPLIT_BOLT_ID, new Fields("word"));
        // WordCountBolt --> ReportBolt
        builder.setBolt(REPORT_BOLT_ID, reportBolt)
                .globalGrouping(COUNT_BOLT_ID);

        Config config = new Config();
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology(TOPOLOGY_NAME, config, builder.createTopology());
        // Let the topology run for 10 seconds in local mode, then stop it.
        Utils.sleep(10 * 1000);
        cluster.killTopology(TOPOLOGY_NAME);
        cluster.shutdown();
    }
}
Understanding the Parallelism of a Storm Topology
1-The main parts involved in running a topology
- Worker processes
- Executors (threads)
- Tasks
[Figure: relationships between worker processes, executors, and tasks]
2-Configuring the parallelism of a topology
Number of worker processes
- Description: How many worker processes to create for the topology across machines in the cluster.
- Configuration option: TOPOLOGY_WORKERS
- How to set in your code (examples): Config#setNumWorkers()
Number of executors (threads)
- Description: How many executors to spawn per component.
- Configuration option: None (pass the parallelism_hint parameter to setSpout or setBolt)
- How to set in your code (examples):
- TopologyBuilder#setSpout()
- TopologyBuilder#setBolt()
- Note that as of Storm 0.8 the parallelism_hint parameter now specifies the initial number of executors (not tasks!) for that bolt.
Number of tasks
- Description: How many tasks to create per component.
- Configuration option: TOPOLOGY_TASKS
- How to set in your code (examples): setNumTasks() on the declarer returned by setSpout or setBolt
Here is an example code snippet to show these settings in practice:
topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
.setNumTasks(4)
.shuffleGrouping("blue-spout");
This runs the bolt GreenBolt with 2 executors and 4 associated tasks, so each executor runs two tasks.
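The arithmetic behind "each executor runs two tasks" can be sketched in plain Java. The example assumes tasks are spread as evenly as possible over the executors in contiguous chunks; which task ids actually land on which executor is decided by Storm, so only the chunk sizes should be read as meaningful. TaskAssignmentSketch and assign are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: spreads task ids over executors as evenly as possible.
final class TaskAssignmentSketch {
    // Splits task ids 0..numTasks-1 into one contiguous chunk per executor.
    static List<List<Integer>> assign(int numTasks, int numExecutors) {
        List<List<Integer>> executors = new ArrayList<>();
        int base = numTasks / numExecutors;   // tasks every executor gets
        int extra = numTasks % numExecutors;  // leftover tasks, one each for the first executors
        int next = 0;
        for (int e = 0; e < numExecutors; e++) {
            int size = base + (e < extra ? 1 : 0);
            List<Integer> tasks = new ArrayList<>();
            for (int i = 0; i < size; i++) {
                tasks.add(next++);
            }
            executors.add(tasks);
        }
        return executors;
    }

    public static void main(String[] args) {
        // GreenBolt: 4 tasks on 2 executors.
        System.out.println(assign(4, 2)); // prints [[0, 1], [2, 3]]
    }
}
```

Setting more tasks than executors leaves headroom: the number of tasks is fixed for the lifetime of the topology, while the number of executors can later be raised up to that task count.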
An example of topology parallelism
Consider how a simple topology looks in operation. The topology consists of three components: one spout called BlueSpout and two bolts called GreenBolt and YellowBolt. The components are linked such that BlueSpout sends its output to GreenBolt, which in turn sends its own output to YellowBolt.
The GreenBolt was configured as per the code snippet above, whereas BlueSpout and YellowBolt only set the parallelism hint (number of executors). Here is the relevant code:
Config conf = new Config();
conf.setNumWorkers(2); // use two worker processes
topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // set parallelism hint to 2
topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
        .setNumTasks(4)
        .shuffleGrouping("blue-spout");
topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6)
        .shuffleGrouping("green-bolt");
StormSubmitter.submitTopology(
        "mytopology",
        conf,
        topologyBuilder.createTopology()
);
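With these settings the topology has 2 + 2 + 6 = 10 executors in total, spread over 2 worker processes. The plain-Java sketch below just reproduces that arithmetic, assuming the executors divide evenly among the workers; ParallelismSketch and its methods are hypothetical names, not Storm API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Back-of-the-envelope parallelism arithmetic for the example topology.
final class ParallelismSketch {
    // Sums the parallelism hints (executors) of all components.
    static int totalExecutors(Map<String, Integer> hints) {
        int total = 0;
        for (int hint : hints.values()) {
            total += hint;
        }
        return total;
    }

    // Executors per worker, assuming they divide evenly.
    static int executorsPerWorker(int totalExecutors, int numWorkers) {
        return totalExecutors / numWorkers;
    }

    public static void main(String[] args) {
        Map<String, Integer> hints = new LinkedHashMap<>();
        hints.put("blue-spout", 2);
        hints.put("green-bolt", 2);
        hints.put("yellow-bolt", 6);
        int total = totalExecutors(hints);
        System.out.println("executors: " + total);                           // executors: 10
        System.out.println("per worker: " + executorsPerWorker(total, 2));   // per worker: 5
    }
}
```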
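Parallelism can also be controlled through a configuration option:
- TOPOLOGY_MAX_TASK_PARALLELISM: the maximum parallelism allowed for a single component. It is typically used to limit the number of threads in local mode and can be set with, e.g., Config#setMaxTaskParallelism().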
Changing the parallelism of a running topology
Rebalancing: changes the parallelism without restarting the cluster or the topology.
You have two options to rebalance a topology:
- Use the Storm web UI to rebalance the topology.
- Use the CLI tool storm rebalance as described below.
Here is an example of using the CLI tool:
## Reconfigure the topology "mytopology" to use 5 worker processes,
## the spout "blue-spout" to use 3 executors and
## the bolt "yellow-bolt" to use 10 executors.
$ storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10
Setting the parallelism of the word count topology
public class WordCountTopology {
    private static final String SENTENCE_SPOUT_ID = "sentence-spout";
    private static final String SPLIT_BOLT_ID = "split-bolt";
    private static final String COUNT_BOLT_ID = "count-bolt";
    private static final String REPORT_BOLT_ID = "report-bolt";
    private static final String TOPOLOGY_NAME = "word-count-topology";

    public static void main(String[] args) throws Exception {
        SentenceSpout spout = new SentenceSpout();
        SplitSentenceBolt splitBolt = new SplitSentenceBolt();
        WordCountBolt countBolt = new WordCountBolt();
        ReportBolt reportBolt = new ReportBolt();

        TopologyBuilder builder = new TopologyBuilder();
        // 2 executors for the spout
        builder.setSpout(SENTENCE_SPOUT_ID, spout, 2);
        // SentenceSpout --> SplitSentenceBolt (2 executors, 4 tasks)
        builder.setBolt(SPLIT_BOLT_ID, splitBolt, 2)
                .setNumTasks(4)
                .shuffleGrouping(SENTENCE_SPOUT_ID);
        // SplitSentenceBolt --> WordCountBolt (4 executors)
        builder.setBolt(COUNT_BOLT_ID, countBolt, 4)
                .fieldsGrouping(SPLIT_BOLT_ID, new Fields("word"));
        // WordCountBolt --> ReportBolt
        builder.setBolt(REPORT_BOLT_ID, reportBolt)
                .globalGrouping(COUNT_BOLT_ID);

        Config config = new Config();
        config.setNumWorkers(2);
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology(TOPOLOGY_NAME, config, builder.createTopology());
        // Let the topology run for 10 seconds (org.apache.storm.utils.Utils.sleep), then stop it.
        Utils.sleep(10 * 1000);
        cluster.killTopology(TOPOLOGY_NAME);
        cluster.shutdown();
    }
}
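III. Reliable stream processing: ack and fail
Guaranteeing Message Processing
Making a spout consume reliably
Take KestrelSpout, which consumes from a Kestrel message queue, as an example:
Depending on whether a message was fully processed or timed out before processing completed, KestrelSpout receives an ack or a fail. On ack, the message is actually removed from the Kestrel queue. On fail, the message is put back on the Kestrel queue (its pending state is cleared so that other consumers can take it).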
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("sentences", new KestrelSpout("kestrel.backtype.com",
                                               22133,
                                               "sentence_queue",
                                               new StringScheme()));
builder.setBolt("split", new SplitSentence(), 10)
        .shuffleGrouping("sentences");
builder.setBolt("count", new WordCount(), 20)
        .fieldsGrouping("split", new Fields("word"));
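Using the API for reliable processing
- Anchoring tells Storm that a new link has been added to the tuple tree. To anchor, pass the input tuple as the first argument to emit: _collector.emit(tuple, new Values(word)). This ties the tuple being emitted to the input tuple.
- Tell Storm when you have finished processing a tuple.
1-Anchoring: telling Storm about a new link in the tuple tree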
Specifying a link in the tuple tree is called anchoring. Anchoring is done at the same time you emit a new tuple. Let's use the following bolt as an example. This bolt splits a tuple containing a sentence into a tuple for each word:
public class SplitSentence extends BaseRichBolt {
    OutputCollector _collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    public void execute(Tuple tuple) {
        String sentence = tuple.getString(0);
        for (String word : sentence.split(" ")) {
            // Anchor each word tuple to the input sentence tuple.
            _collector.emit(tuple, new Values(word));
        }
        // Ack the input once all word tuples have been emitted.
        _collector.ack(tuple);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
An output tuple can be anchored to more than one input tuple, which is useful for streaming joins and aggregations. If a multi-anchored tuple fails, every spout tuple it is anchored to is replayed.
List<Tuple> anchors = new ArrayList<Tuple>();
anchors.add(tuple1);
anchors.add(tuple2);
_collector.emit(anchors, new Values(1, 2, 3));
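2-Telling Storm when a tuple has been processed
Use the ack and fail methods of OutputCollector. In the SplitSentence example, the input tuple is acked once, with _collector.ack(tuple), after all the word tuples have been emitted.
fail reports a downstream failure back to the spout tuple at the root of the tree. For example, a bolt can catch an exception and explicitly fail the input tuple, so the spout tuple does not have to wait for the timeout before it learns of the failure.
Storm tracks every tuple in memory, so every tuple you process must be acked or failed. Otherwise the task will eventually run out of memory.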
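Why unacked tuples cost memory can be illustrated with a simplified model of the bookkeeping. Storm's acker tasks keep, per spout tuple, a running XOR of the ids of all tuples created and acked in its tree; the tree is complete when that value returns to zero. The sketch below is plain Java under that assumption, with small hand-picked ids instead of Storm's random 64-bit ids and without timeouts; AckTrackerSketch and its method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of ack tracking; not Storm's actual acker implementation.
final class AckTrackerSketch {
    // Spout tuple (root) id -> running XOR of tuple ids still outstanding in its tree.
    private final Map<Long, Long> ackVals = new HashMap<>();

    // Called when the spout emits a root tuple.
    void rootEmitted(long rootId, long tupleId) {
        ackVals.put(rootId, tupleId);
    }

    // Called when a bolt acks ackedId after emitting childIds anchored to it.
    // Returns true once the whole tree has been processed.
    boolean acked(long rootId, long ackedId, long... childIds) {
        long delta = ackedId;
        for (long child : childIds) {
            delta ^= child;
        }
        long updated = ackVals.getOrDefault(rootId, 0L) ^ delta;
        if (updated == 0L) {
            ackVals.remove(rootId); // tree complete: the entry can be dropped
            return true;
        }
        ackVals.put(rootId, updated);
        return false;
    }

    // Number of spout tuples whose trees are still being tracked.
    int pendingRoots() {
        return ackVals.size();
    }

    public static void main(String[] args) {
        AckTrackerSketch tracker = new AckTrackerSketch();
        tracker.rootEmitted(1L, 10L);                          // spout emits sentence tuple 10
        System.out.println(tracker.acked(1L, 10L, 11L, 12L));  // split bolt: two words, ack sentence
        System.out.println(tracker.acked(1L, 11L));            // count bolt acks first word
        System.out.println(tracker.acked(1L, 12L));            // count bolt acks second word
        System.out.println(tracker.pendingRoots());
    }
}
```

In this model, a tuple that is never acked or failed leaves its root entry in the map forever, which is the plain-Java analogue of the memory growth described above.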
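Most bolts read one input tuple, emit tuples based on it, and ack the input at the end of execute. The IBasicBolt interface (which does not cover the multi-anchor case) builds this pattern in: tuples emitted through BasicOutputCollector are anchored to the input automatically, and the input is acked when execute returns. Here is SplitSentence written as a basic bolt: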
public class SplitSentence extends BaseBasicBolt {
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String sentence = tuple.getString(0);
        for (String word : sentence.split(" ")) {
            collector.emit(new Values(word));
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
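Making the word count topology reliable
The spout generates a UUID for every tuple it emits and keeps the tuple in a ConcurrentHashMap keyed by that UUID. When the downstream tuples are acked, the entry is removed; on failure, the tuple is emitted again.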
public class SentenceSpout extends BaseRichSpout {
    // Tuples that have been emitted but not yet acked, keyed by message id.
    private ConcurrentHashMap<UUID, Values> pending;
    private SpoutOutputCollector collector;
    private String[] sentences = {
        "Work all done, care laid by",
        "Never fear no more",
        "Shadows gone, break of day",
        "Real life just begun"
    };
    private int index = 0;

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }

    public void open(Map config, TopologyContext context,
            SpoutOutputCollector collector) {
        this.collector = collector;
        this.pending = new ConcurrentHashMap<UUID, Values>();
    }

    public void nextTuple() {
        Values values = new Values(sentences[index]);
        UUID msgId = UUID.randomUUID();
        this.pending.put(msgId, values);
        // Emitting with a message id turns on ack/fail tracking for this tuple.
        this.collector.emit(values, msgId);
        index++;
        if (index >= sentences.length) {
            index = 0;
        }
        // Brief pause between emits (org.apache.storm.utils.Utils.sleep).
        Utils.sleep(1);
    }

    // The tuple tree completed: the tuple no longer needs to be kept.
    public void ack(Object msgId) {
        this.pending.remove(msgId);
    }

    // The tuple tree failed or timed out: replay the tuple with the same message id.
    public void fail(Object msgId) {
        this.collector.emit(this.pending.get(msgId), msgId);
    }
}
=========Changes in SplitSentenceBolt=========
public void execute(Tuple tuple) {
    String sentence = tuple.getStringByField("sentence");
    String[] words = sentence.split(" ");
    for (String word : words) {
        this.collector.emit(tuple, new Values(word)); // the first argument anchors the output to the input tuple
    }
    this.collector.ack(tuple); // ack the upstream input tuple once processing is done
}
=========Changes in WordCountBolt=========
public void execute(Tuple tuple) {
    String word = tuple.getStringByField("word");
    Long count = this.counts.get(word);
    if (count == null) {
        count = 0L;
    }
    count++;
    this.counts.put(word, count);
    // Emit the anchored output first, then ack the input, so the new tuple is part of the tracked tree.
    this.collector.emit(tuple, new Values(word, count));
    this.collector.ack(tuple);
}
=========Changes in ReportBolt=========
// Assumes ReportBolt has an OutputCollector field named collector that prepare() assigns.
public void execute(Tuple tuple) {
    String word = tuple.getStringByField("word");
    Long count = tuple.getLongByField("count");
    this.counts.put(word, count);
    this.collector.ack(tuple);
}