Storm-Kafka-HBase Basic Programming (Part 1)

The previous article described how Storm guarantees message delivery: messages are tracked through message IDs and the anchored-tuple mechanism, an ack is returned when a tuple tree completes, and a fail is returned on failure so the message can be resent. Even so, as you know, this still cannot guarantee exactly-once processing. Why not? Think it over. The program written in this article therefore does not guarantee exactly-once semantics; if you need that guarantee you have to use the Trident API, which will be introduced next time.


Writing a Storm program is relatively simple, but managing the dependency jars is extremely annoying. I have not found a good way to handle it yet, so I usually package everything into one fat jar. The jar therefore gets quite large, and even so some classes may still turn out to be missing. After you have gained some experience you will know roughly which packages a Storm program needs.

The program shown today reads data from Kafka, filters it in one bolt, hands it to another bolt, and finally stores it in HBase, so we need to prepare the corresponding pom.xml.

pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<repositories>
		<repository>
			<id>cloudera</id>
			<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
		</repository>
	</repositories>
	<groupId>com.isesol</groupId>
	<artifactId>storm</artifactId>
	<version>0.0.1-SNAPSHOT</version>
	<packaging>jar</packaging>

	<name>storm</name>
	<url>http://maven.apache.org</url>

	<properties>
		<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
	</properties>

	<dependencies>
		<dependency>
			<groupId>junit</groupId>
			<artifactId>junit</artifactId>
			<version>3.8.1</version>
			<scope>test</scope>
		</dependency>

		<dependency>
			<groupId>org.apache.storm</groupId>
			<artifactId>storm-core</artifactId>
			<version>1.0.2</version>
			<scope>provided</scope>
		</dependency>
		<!-- 0.9.0-kafka-2.0.1 -->
		<dependency>
			<groupId>org.apache.kafka</groupId>
			<artifactId>kafka_2.10</artifactId>
			<version>0.9.0-kafka-2.0.1</version>
			<exclusions>
				<exclusion>
					<groupId>org.slf4j</groupId>
					<artifactId>slf4j-log4j12</artifactId>
				</exclusion>
				<!-- <exclusion> <groupId>org.apache.zookeeper</groupId> <artifactId>zookeeper</artifactId> 
					</exclusion> -->
			</exclusions>
		</dependency>
		<!-- https://mvnrepository.com/artifact/org.apache.storm/storm-kafka -->
		<dependency>
			<groupId>org.apache.storm</groupId>
			<artifactId>storm-kafka</artifactId>
			<version>1.0.0</version>
		</dependency>

		<dependency>
			<groupId>org.apache.kafka</groupId>
			<artifactId>kafka-clients</artifactId>
			<version>0.9.0-kafka-2.0.0</version>
		</dependency>


		<!-- https://mvnrepository.com/artifact/org.apache.storm/storm-hbase -->
		<dependency>
			<groupId>org.apache.storm</groupId>
			<artifactId>storm-hbase</artifactId>
			<version>1.1.0</version>
			<exclusions>
				<exclusion>
					<groupId>jdk.tools</groupId>
					<artifactId>jdk.tools</artifactId>
				</exclusion>
			</exclusions>
		</dependency>
	</dependencies>

	<build>
		<plugins>
			<plugin>
				<artifactId>maven-assembly-plugin</artifactId>
				<version>2.6</version>
				<configuration>
					<archive>
						<manifest>
							<mainClass>com.isesol.storm.getKafka</mainClass>
						</manifest>

					</archive>
					<!-- <descriptor>assembly.xml</descriptor> -->

					<descriptorRefs>
						<descriptorRef>jar-with-dependencies</descriptorRef>
					</descriptorRefs>
				</configuration>
			</plugin>
		</plugins>
	</build>
</project>


Once the pom.xml is ready, we can start writing the program. Reading from Kafka is done with the storm-kafka integration package and writing to HBase with the storm-hbase integration package, so the whole flow is kafkaspout -> bolt -> bolt -> storm-hbase. Let's first go through the basic programming concepts, starting with the bolts:

class kafkaBolt extends BaseRichBolt {

	private Map conf;
	private TopologyContext context;
	private OutputCollector collector;

	public void execute(Tuple input) {
		// TODO Auto-generated method stub
		try{
			String line = input.getString(0);
			collector.emit(input, new Values(line));
			collector.ack(input);
		} catch (Exception ex){
			collector.fail(input);
		}


	}

	public void prepare(Map arg0, TopologyContext arg1, OutputCollector arg2) {
		// TODO Auto-generated method stub
		this.conf = arg0;
		this.context = arg1;
		this.collector = arg2;
	}

	public void declareOutputFields(OutputFieldsDeclarer declarer) {
		// TODO Auto-generated method stub

		declarer.declare(new Fields("line"));

	}
}

kafkaBolt extends BaseRichBolt and implements three methods: prepare, execute, and declareOutputFields. prepare receives the runtime objects (the configuration, the topology context, and the output collector); there is not much to do there beyond assigning them to fields. execute is where the data is actually processed. declareOutputFields declares the fields of the tuples this bolt emits to the next bolt; if you only have one bolt and it emits nothing downstream, there is obviously nothing to write here, because you have no data to send.


Two things deserve a closer look here: collector.emit(input, new Values(line)) and new Fields("line"):

In collector.emit(input, new Values(line)), the input argument is used to build the anchored tuple, and new Values(...) is the data you want to emit, with the different fields separated by commas. Here there is only one field, line; if you had two fields it would be new Values(line, line1).

new Fields("line")  的意思是你要发送的数据,和 new Values对应。 如果发送2个字段,那么就是 new Fields("line", "line1")  , 里面的名字可以随便给的,这个之后会介绍它的作用。


After kafkaBolt there is one more bolt, which sends the data on to the HBase bolt:

class transferBolt extends BaseRichBolt {

	private Map conf;
	private TopologyContext context;
	private OutputCollector collector;

	public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
		// TODO Auto-generated method stub

		this.conf = stormConf;
		this.context = context;
		this.collector = collector;

	}

	public void execute(Tuple input) {
		// TODO Auto-generated method stub

		try {
			String line = input.getString(0);
			collector.emit(input,
					new Values(UUID.randomUUID().toString() + "-test1", UUID.randomUUID().toString(), line));
			collector.ack(input);
		} catch (Exception ex) {
			collector.fail(input);
		}

	}

	public void declareOutputFields(OutputFieldsDeclarer declarer) {
		// TODO Auto-generated method stub
		declarer.declare(new Fields("rowkey", "linetest", "line"));
	}

}

As you can see, these two bolts don't really do much. kafkaBolt does no processing at all and simply re-emits the tuples it receives; transferBolt, because it passes the data on to the HBase bolt, does one small piece of processing and adds a rowkey.
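
This is also where the field names declared earlier start to matter: a downstream consumer can look values up by name rather than by position, and the storm-hbase mapper used later does exactly that. As a small illustrative sketch (assuming the field names declared by transferBolt), inside a downstream bolt's execute() you could write:

		String rowkey = input.getStringByField("rowkey");   // fetch by declared field name instead of position
		String line = input.getStringByField("line");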


The complete program is as follows:


package com.isesol.storm;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import org.apache.storm.*;
import org.apache.storm.generated.AlreadyAliveException;
import org.apache.storm.generated.AuthorizationException;
import org.apache.storm.generated.InvalidTopologyException;
import org.apache.storm.hbase.bolt.HBaseBolt;
import org.apache.storm.hbase.bolt.mapper.SimpleHBaseMapper;
import org.apache.storm.shade.com.google.common.collect.Maps;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;
import org.apache.storm.kafka.BrokerHosts;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;

public class getKafka {

	public static void main(String[] args)
			throws AlreadyAliveException, InvalidTopologyException, AuthorizationException {
		String zkConnString = "datanode01.isesol.com:2181,datanode02.isesol.com:2181,datanode03.isesol.com:2181,datanode04.isesol.com:2181";
		String topicName = "2001";
		String zkRoot = "/data/storm";
		BrokerHosts hosts = new ZkHosts(zkConnString);

		SpoutConfig spoutConfig = new SpoutConfig(hosts, topicName, zkRoot, "jlwang");
		spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
		spoutConfig.startOffsetTime = kafka.api.OffsetRequest.EarliestTime();
		KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);
		TopologyBuilder builder = new TopologyBuilder();

		List<String> fieldNameList = new ArrayList<String>();
		fieldNameList.add("linetest");
		fieldNameList.add("line");
		builder.setSpout("kafka-reader", kafkaSpout, 1);                                        //kafka读取数据,spout为kafka-reader
		builder.setBolt("word-splitter", new kafkaBolt(), 2).shuffleGrouping("kafka-reader");   //kafkabolt获取kafka-reader数据,bolt名字是word-splitter
		builder.setBolt("word-transfer", new transferBolt(), 2).shuffleGrouping("word-splitter"); //transferbolt获取word-splitter数据
		Config conf = new Config();
		Map<String, String> HBConfig = Maps.newHashMap();
		HBConfig.put("hbase.zookeeper.property.clientPort", "2181");
		HBConfig.put("hbase.zookeeper.quorum",
				"datanode01.isesol.com:2181,datanode02.isesol.com:2181,datanode03.isesol.com:2181,datanode04.isesol.com:2181");
		HBConfig.put("zookeeper.znode.parent", "/hbase");

		conf.put("HBCONFIG", HBConfig);
		SimpleHBaseMapper mapper = new SimpleHBaseMapper();
		mapper.withColumnFamily("cf");                       //设置hbase columnfamily
		mapper.withColumnFields(new Fields(fieldNameList));  //设置hbase的字段,这个值是从transferbolt,根据new Fields定义的名字获取的  
		mapper.withRowKeyField("rowkey");                    //设置rowkey,这个值是从transferbolt的"rowkey"获取的
		HBaseBolt hBaseBolt = new HBaseBolt("test3", mapper).withConfigKey("HBCONFIG");   //test3为hbase表
		hBaseBolt.withFlushIntervalSecs(10);                  //hbase定义的flush时间,10秒
		builder.setBolt("hbase", hBaseBolt, 3).shuffleGrouping("word-transfer");  //hbasebolt 从word-transfer 获取数据进行存储,存储方式按照上面定义的
                 column, column family, rowkey来存储
		String name = getKafka.class.getSimpleName();

		if (args != null && args.length > 0) {

			conf.setNumWorkers(2);
			// conf.setMessageTimeoutSecs(900);
			LocalCluster localCluster = new LocalCluster();
			localCluster.submitTopology(name, conf, builder.createTopology());
			// StormSubmitter.submitTopology(name, conf,
			// builder.createTopology());
			Utils.sleep(9999999);

		}

	}
}

class transferBolt extends BaseRichBolt {

	private Map conf;
	private TopologyContext context;
	private OutputCollector collector;

	public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
		// TODO Auto-generated method stub

		this.conf = stormConf;
		this.context = context;
		this.collector = collector;

	}

	public void execute(Tuple input) {
		// TODO Auto-generated method stub

		try {
			String line = input.getString(0);
			collector.emit(input,
					new Values(UUID.randomUUID().toString() + "-test1", UUID.randomUUID().toString(), line));
			collector.ack(input);
		} catch (Exception ex) {
			collector.fail(input);
		}

	}

	public void declareOutputFields(OutputFieldsDeclarer declarer) {
		// TODO Auto-generated method stub
		declarer.declare(new Fields("rowkey", "linetest", "line"));
	}

}

class kafkaBolt extends BaseRichBolt {

	private Map conf;
	private TopologyContext context;
	private OutputCollector collector;

	public void execute(Tuple input) {
		// TODO Auto-generated method stub
		try{
			String line = input.getString(0);
			collector.emit(input, new Values(line));
			collector.ack(input);
		} catch (Exception ex){
			collector.fail(input);
		}


	}

	public void prepare(Map arg0, TopologyContext arg1, OutputCollector arg2) {
		// TODO Auto-generated method stub
		this.conf = arg0;
		this.context = arg1;
		this.collector = arg2;
	}

	public void declareOutputFields(OutputFieldsDeclarer declarer) {
		// TODO Auto-generated method stub

		declarer.declare(new Fields("line"));

	}

}
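
Note that in the main() above StormSubmitter is commented out, so even when arguments are passed the topology only runs in a LocalCluster. A more common pattern, shown here only as a sketch and not as the original code, is to submit to the real cluster when a topology name is supplied on the command line and fall back to local mode otherwise (it reuses the conf, builder and name variables built in main()):

		if (args != null && args.length > 0) {
			conf.setNumWorkers(2);
			// submit to the running Storm cluster under the name given on the command line
			StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
		} else {
			// no arguments: run locally for testing, then shut down
			LocalCluster localCluster = new LocalCluster();
			localCluster.submitTopology(name, conf, builder.createTopology());
			Utils.sleep(60 * 1000);
			localCluster.shutdown();
		}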



Comparing the figure above with the program we just wrote, here is an explanation of a few concepts shown in that view:

1. emit shows how many tuples have been emitted

2. process latency is the average time taken to process one tuple

3. complete latency is the total time from when a message is emitted until it is fully processed and its ack is received

4. ack shows how many tuples have been confirmed as completely processed

Obviously, by comparing emit with ack and complete latency you can see how much is actually being processed.




The figure above lets you see at a glance how much data has been processed and how efficiently. With Storm you must always keep an eye on how much has been emitted versus how much has been processed, so that you know the relative speed of the spout and the bolts. If 10,000 tuples have been emitted but only 100 processed, there is clearly a serious performance problem: data is piling up and will sooner or later cause an OOM. Watching the emit and processing rates is therefore very important.
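
One setting that helps when the spout outruns the bolts (a suggestion, not something used in the program above) is topology.max.spout.pending, which caps the number of un-acked tuples per spout task so that data cannot pile up without bound:

		// sketch only: the values are illustrative, tune them for your own topology
		Config conf = new Config();
		conf.setMaxSpoutPending(5000);     // at most 5000 in-flight (un-acked) tuples per spout task
		conf.setMessageTimeoutSecs(120);   // tuples not acked within 120 seconds are failed and replayed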
