Preface: these days I blog mainly to summarize what I've learned and to refresh it quickly once I've forgotten it. I'm getting old, my memory isn't what it used to be, and a lot of what I read slips away almost immediately.
Setting up the cluster
Reference: setting up a Storm cluster (apache-storm-0.9.5.tar.gz)
Note: the cluster built here is plain Storm, not Storm on YARN.
Mistakes I made along the way:
- When configuring Storm's environment variables (defining STORM_HOME and PATH): in export PATH=$PATH:$STORM_HOME/bin, note that $PATH must come first on the right-hand side of the equals sign, or the shell reports an error (CentOS 6.5). There must also be no spaces around the equals sign.
- Running the Storm cluster and the Hadoop cluster at the same time caused mutual interference, since this is not Storm on YARN. With both sharing one ZooKeeper ensemble, only after I shut one of them down would the other work properly.
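For reference, the corrected profile lines as a minimal sketch (the install path below is an assumption; adjust it to wherever you unpacked Storm):

```shell
# ~/.bashrc -- the install path is hypothetical; adjust to your setup
export STORM_HOME=/usr/local/apache-storm-0.9.5
# no spaces around '=', and $PATH comes first on the right-hand side
export PATH=$PATH:$STORM_HOME/bin
```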
With that, the cluster came up. One of my virtual machines had been broken earlier (sadly), so I removed it from the cluster list, leaving two nodes. Here are two screenshots taken after the setup succeeded:
Hello World Topology
The goal of the first topology is to count words (it comes from the book Getting Started with Storm). Sounds just like MapReduce, doesn't it? But this one topology already walks through the whole framework. Source code link. The details follow.
Spout:
The first spout method to be called is
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector). It receives the following parameters:
- the configuration object, created when the topology is defined;
- the TopologyContext object, which contains all the topology data;
- the SpoutOutputCollector object, which lets us emit the data that the bolts will process.
public void open(Map conf, TopologyContext context,
        SpoutOutputCollector collector) {
    try {
        this.context = context;
        this.fileReader = new FileReader(conf.get("wordsFile").toString());
    } catch (FileNotFoundException e) {
        throw new RuntimeException("Error reading file [" + conf.get("wordsFile") + "]");
    }
    this.collector = collector;
}
The core spout method: public void nextTuple()
/**
 * The only thing this method does is emit the lines of the file.
 * nextTuple() is called by the task over and over, so when there is no work
 * to do it must release control of the thread so the other methods get a
 * chance to run. The first thing nextTuple does is therefore check whether
 * processing is complete; if so, it sleeps for a second before returning,
 * to reduce processor load. Once the task is done, every line of the file
 * has been read and emitted.
 */
public void nextTuple() {
    /**
     * This method is called over and over; once the whole file has been
     * read, we just wait and return.
     */
    if (completed) {
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            // do nothing
        }
        return;
    }
    String str;
    // Create the reader
    BufferedReader reader = new BufferedReader(fileReader);
    try {
        // Read all the lines
        while ((str = reader.readLine()) != null) {
            /*
             * Emit a new value for each line.
             * NOTE: Values is an ArrayList implementation whose elements are
             * the arguments passed to its constructor.
             * Each emit eventually triggers one ack() or fail() call, which
             * you can see in the Storm logs.
             * List<Integer> emit(List<Object> tuple, Object messageId)
             * Emits a new tuple to the default output stream with the given message ID.
             */
            this.collector.emit(new Values(str), str);
        }
    } catch (Exception e) {
        throw new RuntimeException("Error reading tuple", e);
    } finally {
        completed = true; // mark the task as completed
    }
}
/**
 * The tuple emitted by this spout with the msgId identifier has failed to
 * be fully processed. Typically, an implementation of this method will put
 * that message back on the queue to be replayed at a later time.
 */
public void fail(Object msgId) {
    System.out.println("FAIL:" + msgId);
}
/**
 * Storm has determined that the tuple emitted by this spout with the msgId
 * identifier has been fully processed. Typically, an implementation of this
 * method will take that message off the queue and prevent it from being
 * replayed.
 */
public void ack(Object msgId) {
    // This output shows up in the Storm logs
    System.out.println("OK:" + msgId);
}
/**
 * Called when an ISpout is going to be shut down.
 * There is no guarantee that close will be called, because the supervisor
 * kill -9's worker processes on the cluster.
 * The one context where close is guaranteed to be called is when a
 * topology is killed while running Storm in local mode.
 */
public void close() {
}
Bolts
Now we have a spout that reads a file and emits one tuple per line; we need to create two bolts to process them. Bolts implement the interface backtype.storm.topology.IRichBolt. The most important bolt method is void execute(Tuple input), which is called once for every tuple received and may itself emit further tuples.
NOTE: a bolt or spout can emit as many tuples as necessary; a call to execute or nextTuple may emit zero, one, or many tuples.
The first bolt, WordNormalizer, receives and normalizes each line of text: it splits the line into words, converts them to lowercase, and trims leading and trailing whitespace.
First we declare the bolt's output fields:
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("word"));
}
Here we declare that the bolt will emit a single field named "word".
Next we implement public void execute(Tuple input), which processes the incoming tuples:
public void execute(Tuple input) {
    String sentence = input.getString(0);
    String[] words = sentence.split(" ");
    for (String word : words) {
        word = word.trim();
        if (!word.isEmpty()) {
            word = word.toLowerCase();
            // Emit the word
            collector.emit(new Values(word));
        }
    }
    // Acknowledge the tuple: it was processed successfully
    collector.ack(input);
}
The first line reads a value from the tuple (values can be read by position or by name). The value is then processed and emitted via the collector object. Finally, each time a tuple is handled, the collector's ack() method is called to confirm that it was processed successfully.
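The normalization step itself is plain Java and can be tried outside Storm. A minimal sketch of the same split/trim/lowercase logic (the class and helper names here are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class NormalizeDemo {
    // Hypothetical helper mirroring WordNormalizer's execute() logic:
    // split on spaces, trim each token, drop empties, lowercase the rest.
    static List<String> normalize(String sentence) {
        List<String> out = new ArrayList<>();
        for (String word : sentence.split(" ")) {
            word = word.trim();
            if (!word.isEmpty()) {
                out.add(word.toLowerCase());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(normalize("  My Storm  TOPOLOGY "));
        // prints [my, storm, topology]
    }
}
```

Each element of the returned list corresponds to one tuple the bolt would emit.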
Other methods:
/**
* @param stormConf - The Storm configuration for this bolt. This is the
* configuration provided to the topology merged in with cluster
* configuration on this machine.
* @param context - This object can be used to get information about this
* task's place within the topology, including the task id and
* component id of this task, input and output information, etc.
* @param collector - The collector is used to emit tuples from this bolt.
* Tuples can be emitted at any time, including the prepare and
* cleanup methods. The collector is thread-safe and should be saved
* as an instance variable of this bolt object.
*/
public void prepare(Map stormConf, TopologyContext context,
        OutputCollector collector) {
    this.collector = collector;
}
/**
 * Called when an IBolt is going to be shut down. There is no guarantee that
 * cleanup will be called, because the supervisor kill -9's worker processes
 * on the cluster.
 * The one context where cleanup is guaranteed to be called is when a
 * topology is killed while running Storm in local mode.
 */
public void cleanup() {
}
The next bolt, WordCounter, keeps a count for each word. When the topology finishes (that is, when cleanup() is called), it prints each word's count. Its methods have the same meaning as in the previous bolt; only the business logic differs.
NOTE: this example bolt emits nothing at all; it keeps the counts in a map. In a real scenario you would persist the data to a database or to message middleware such as Kafka or MCQ.
public class WordCounter implements IRichBolt {
    Integer id;
    String name;
    Map<String, Integer> counters;
    private OutputCollector collector;

    /**
     * When this bolt finishes (i.e. when the cluster is shut down),
     * print the word counts.
     */
    public void cleanup() {
        System.out.println("-- Word counts [" + name + "-" + id + "] --");
        for (Map.Entry<String, Integer> entry : counters.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }

    /**
     * Count each word.
     */
    public void execute(Tuple input) {
        String str = input.getString(0).trim();
        /*
         * If the word is not yet in the map, add it with a count of one;
         * otherwise increment its count.
         */
        if (!counters.containsKey(str)) {
            counters.put(str, 1);
        } else {
            Integer c = counters.get(str) + 1;
            counters.put(str, c);
        }
        // Acknowledge the tuple
        collector.ack(input);
    }

    /**
     * Initialization.
     */
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.counters = new HashMap<String, Integer>();
        this.collector = collector;
        this.name = context.getThisComponentId();
        this.id = context.getThisTaskId();
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // This bolt declares no output fields
    }

    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
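The counting logic can likewise be exercised outside Storm. A minimal sketch of the same put-or-increment pattern (the class and method names are made up for illustration):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CountDemo {
    // Hypothetical helper mirroring WordCounter's execute() logic:
    // insert a word with count 1, or increment an existing count.
    static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> counters = new HashMap<>();
        for (String str : words) {
            if (!counters.containsKey(str)) {
                counters.put(str, 1);
            } else {
                counters.put(str, counters.get(str) + 1);
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        // Iteration order of a HashMap is unspecified, so we print the map
        // as a whole rather than line by line.
        System.out.println(count(List.of("storm", "hello", "storm")));
    }
}
```

In the topology this map lives per task, which is why the fieldsGrouping on "word" (shown in the main class below) matters: it guarantees that every occurrence of a given word goes to the same WordCounter task.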
The main class
In the main class we create the topology and a LocalCluster object, which makes it easy to test and debug locally. Combined with the Config object, LocalCluster lets you try out different cluster configurations. For example, by testing the topology with different numbers of worker processes, you can discover a bug such as an accidental reliance on a global or class variable.
NOTE: every process of every topology node must be able to run independently, without relying on shared state (that is, no global or class variables), because when the topology runs on a real cluster those processes may run on different machines.
public class TopologyMain {
    public static void main(String[] args) throws InterruptedException {
        // Define the topology
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("word-reader", new WordReader());
        builder.setBolt("word-normalizer", new WordNormalizer()).shuffleGrouping("word-reader");
        builder.setBolt("word-counter", new WordCounter(), 2).fieldsGrouping("word-normalizer", new Fields("word"));

        // Configuration
        Config conf = new Config();
        conf.put("wordsFile", args[0]);
        conf.setDebug(false);

        // Run the topology
        conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 1);
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("hello storm", conf, builder.createTopology());
        Thread.sleep(30000);
        cluster.shutdown();
    }
}