Preface: these days I blog mainly to summarize what I've learned and to refresh it quickly once I've forgotten it. I'm getting old, my memory isn't what it used to be, and a lot of what I read slips away almost immediately.
Setting up the cluster
Reference: setting up a Storm cluster (apache-storm-0.9.5.tar.gz)
Note: the cluster built here is plain Storm, not Storm on YARN.
Mistakes I made along the way:
- When configuring Storm's environment variables (defining STORM_HOME and PATH): in export PATH=$PATH:$STORM_HOME/bin, note that $PATH must come first on the right-hand side of the equals sign, or the shell reports an error (CentOS 6.5). There must also be no spaces around the equals sign.
- Running the Storm cluster and the Hadoop cluster at the same time caused mutual interference, since this is not Storm on YARN. With both sharing one ZooKeeper ensemble, only after I shut one of them down would the other work properly.
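For reference, the corrected profile lines as a minimal sketch (the install path below is an assumption; adjust it to wherever you unpacked Storm):

```shell
# ~/.bashrc -- the install path is hypothetical; adjust to your setup
export STORM_HOME=/usr/local/apache-storm-0.9.5
# no spaces around '=', and $PATH comes first on the right-hand side
export PATH=$PATH:$STORM_HOME/bin
```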
With that, the cluster came up. One of my virtual machines had been broken earlier (sadly), so I removed it from the cluster list, leaving two nodes. Here are two screenshots taken after the setup succeeded:
Hello World Topology
The goal of the first topology is to count words (it comes from the book Getting Started with Storm). Sounds just like MapReduce, doesn't it? But this one topology already walks through the whole framework. Source code link. The details follow.
Spout:
The first spout method to be called is
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector). It receives the following parameters:
- the configuration object, created when the topology is defined;
- the TopologyContext object, which contains all the topology data;
- the SpoutOutputCollector object, which lets us emit the data that the bolts will process.
public void open(Map conf, TopologyContext context,
        SpoutOutputCollector collector) {
    try {
        this.context = context;
        this.fileReader = new FileReader(conf.get("wordsFile").toString());
    } catch (FileNotFoundException e) {
        throw new RuntimeException("Error reading file [" + conf.get("wordsFile") + "]");
    }
    this.collector = collector;
}
The core spout method: public void nextTuple()
/**
 * The only thing this method does is emit the lines of the file.
 * nextTuple() is called by the task over and over, so when there is no work
 * to do it must release control of the thread so the other methods get a
 * chance to run. The first thing nextTuple does is therefore check whether
 * processing is complete; if so, it sleeps for a second before returning,
 * to reduce processor load. Once the task is done, every line of the file
 * has been read and emitted.
 */
public void nextTuple() {
    /**
     * This method is called over and over; once the whole file has been
     * read, we just wait and return.
     */
    if (completed) {
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            // do nothing
        }
        return;
    }
    String str;
    // Create the reader
    BufferedReader reader = new BufferedReader(fileReader);
    try {
        // Read all the lines
        while ((str = reader.readLine()) != null) {
            /*
             * Emit a new value for each line.
             * NOTE: Values is an ArrayList implementation whose elements are
             * the arguments passed to its constructor.
             * Each emit eventually triggers one ack() or fail() call, which
             * you can see in the Storm logs.
             * List<Integer> emit(List<Object> tuple, Object messageId)
             * Emits a new tuple to the default output stream with the given message ID.
             */
            this.collector.emit(new Values(str), str);
        }
    } catch (Exception e) {
        throw new RuntimeException("Error reading tuple", e);
    } finally {
        completed = true; // mark the task as completed
    }
}
/**
 * The tuple emitted by this spout with the msgId identifier has failed to
 * be fully processed. Typically, an implementation of this method will put
 * that message back on the queue to be replayed at a later time.
 */
public void fail(Object msgId) {
    System.out.println("FAIL:" + msgId);
}
/**
 * Storm has determined that the tuple emitted by this spout with the msgId
 * identifier has been fully processed. Typically, an implementation of this
 * method will take that message off the queue and prevent it from being
 * replayed.
 */
public void ack(Object msgId) {
    // This output shows up in the Storm logs
    System.out.println("OK:" + msgId);
}
/**
 * Called when an ISpout is going to be shut down.
 * There is no guarantee that close will be called, because the supervisor
 * kill -9's worker processes on the cluster.
 * The one context where close is guaranteed to be called is when a
 * topology is killed while running Storm in local mode.
 */
public void close() {
}
Bolts
Now we have a spout that reads a file and emits one tuple per line; we need to create two bolts to process them. Bolts implement the interface backtype.storm.topology.IRichBolt. The most important bolt method is void execute(Tuple input), which is called once for every tuple received and may itself emit further tuples.
NOTE: a bolt or spout can emit as many tuples as necessary; a call to execute or nextTuple may emit zero, one, or many tuples.
The first bolt, WordNormalizer, receives and normalizes each line of text: it splits the line into words, converts them to lowercase, and trims leading and trailing whitespace.
First we declare the bolt's output fields:
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("word"));
}
Here we declare that the bolt will emit a single field named "word".
Next we implement public void execute(Tuple input), which processes the incoming tuples:
public void execute(Tuple input) {
    String sentence = input.getString(0);
    String[] words = sentence.split(" ");
    for (String word : words) {
        word = word.trim();
        if (!word.isEmpty()) {
            word = word.toLowerCase();
            // Emit the word
            collector.emit(new Values(word));
        }
    }
    // Acknowledge the tuple: it was processed successfully
    collector.ack(input);
}
The first line reads a value from the tuple (values can be read by position or by name). The value is then processed and emitted via the collector object. Finally, each time a tuple is handled, the collector's ack() method is called to confirm that it was processed successfully.
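The normalization step itself is plain Java and can be tried outside Storm. A minimal sketch of the same split/trim/lowercase logic (the class and helper names here are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class NormalizeDemo {
    // Hypothetical helper mirroring WordNormalizer's execute() logic:
    // split on spaces, trim each token, drop empties, lowercase the rest.
    static List<String> normalize(String sentence) {
        List<String> out = new ArrayList<>();
        for (String word : sentence.split(" ")) {
            word = word.trim();
            if (!word.isEmpty()) {
                out.add(word.toLowerCase());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(normalize("  My Storm  TOPOLOGY "));
        // prints [my, storm, topology]
    }
}
```

Each element of the returned list corresponds to one tuple the bolt would emit.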
Other methods:
/**
* @param stormConf - The Storm configuration for this bolt. This is the
* configuration provided to the topology merged in with cluster
* configuration on this machine.
* @param context - This object can be used to get information about this
* task's place within the topology, including the task id and
* component id of this task, input and output information, etc.
* @param collector - The collector is used to emit tuples from this bolt.
* Tuples can be emitted at any time, including the prepare and
* cleanup methods. The collector is thread-safe and should be saved
* as an instance variable of this bolt object.
*/
public void prepare(Map stormConf, TopologyContext context,
        OutputCollector collector) {
    this.collector = collector;
}
/**
 * Called when an IBolt is going to be shut down. There is no guarantee that
 * cleanup will be called, because the supervisor kill -9's worker processes
 * on the cluster.
 * The one context where cleanup is guaranteed to be called is when a
 * topology is killed while running Storm in local mode.
 */
public void cleanup() {
}
The next bolt, WordCounter, keeps a count for each word. When the topology finishes (that is, when cleanup() is called), it prints each word's count. Its methods have the same meaning as in the previous bolt; only the business logic differs.
NOTE: this example bolt emits nothing at all; it keeps the counts in a map. In a real scenario you would persist the data to a database or to message middleware such as Kafka or MCQ.
public class WordCounter implements IRichBolt {
    Integer id;
    String name;
    Map<String, Integer> counters;
    private OutputCollector collector;

    /**
     * When this bolt finishes (i.e. when the cluster is shut down),
     * print the word counts.
     */
    public void cleanup() {
        System.out.println("-- Word counts [" + name + "-" + id + "] --");
        for (Map.Entry<String, Integer> entry : counters.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }

    /**
     * Count each word.
     */
    public void execute(Tuple input) {
        String str = input.getString(0).trim();
        /*
         * If the word is not yet in the map, add it with a count of one;
         * otherwise increment its count.
         */
        if (!counters.containsKey(str)) {
            counters.put(str, 1);
        } else {
            Integer c = counters.get(str) + 1;
            counters.put(str, c);
        }
        // Acknowledge the tuple
        collector.ack(input);
    }

    /**
     * Initialization.
     */
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.counters = new HashMap<String, Integer>();
        this.collector = collector;
        this.name = context.getThisComponentId();
        this.id = context.getThisTaskId();
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // This bolt declares no output fields
    }

    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
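The counting logic can likewise be exercised outside Storm. A minimal sketch of the same put-or-increment pattern (the class and method names are made up for illustration):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CountDemo {
    // Hypothetical helper mirroring WordCounter's execute() logic:
    // insert a word with count 1, or increment an existing count.
    static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> counters = new HashMap<>();
        for (String str : words) {
            if (!counters.containsKey(str)) {
                counters.put(str, 1);
            } else {
                counters.put(str, counters.get(str) + 1);
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        // Iteration order of a HashMap is unspecified, so we print the map
        // as a whole rather than line by line.
        System.out.println(count(List.of("storm", "hello", "storm")));
    }
}
```

In the topology this map lives per task, which is why the fieldsGrouping on "word" (shown in the main class below) matters: it guarantees that every occurrence of a given word goes to the same WordCounter task.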
The main class
In the main class we create the topology and a LocalCluster object, which makes it easy to test and debug locally. Combined with the Config object, LocalCluster lets you try out different cluster configurations. For example, by testing the topology with different numbers of worker processes, you can discover a bug such as an accidental reliance on a global or class variable.
NOTE: every process of every topology node must be able to run independently, without relying on shared state (that is, no global or class variables), because when the topology runs on a real cluster those processes may run on different machines.
public class TopologyMain {
    public static void main(String[] args) throws InterruptedException {
        // Define the topology
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("word-reader", new WordReader());
        builder.setBolt("word-normalizer", new WordNormalizer()).shuffleGrouping("word-reader");
        builder.setBolt("word-counter", new WordCounter(), 2).fieldsGrouping("word-normalizer", new Fields("word"));

        // Configuration
        Config conf = new Config();
        conf.put("wordsFile", args[0]);
        conf.setDebug(false);

        // Run the topology
        conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 1);
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("hello storm", conf, builder.createTopology());
        Thread.sleep(30000);
        cluster.shutdown();
    }
}