Apache Storm
What is Storm?
Storm is a free, open-source distributed real-time computation system. Before 2.0.0 the core of the system was implemented in Clojure; starting with the 2.0.0 release the internals were substantially reworked and rewritten in Java 8. Storm is a real-time stream-processing engine that can process records with sub-second latency. It is used for realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. A single compute node can process on the order of one million tuples per second. Storm also integrates with existing data stores (RDBMS/NoSQL) and message queues such as Kafka.
Stream computing
: Analyzing large volumes of continuously moving data in real time, capturing potentially useful information, and sending the results on to the next compute node.
Mainstream stream-computing frameworks: Kafka Streaming, Apache Storm, Spark Streaming, Flink DataStream, etc.
- Kafka Streaming: a stream-computing toolkit shipped as a library (jar) on top of Kafka; simple and easy to integrate.
- Apache Storm/JStorm: stream-processing frameworks that handle streaming data and state management.
- Spark Streaming: a stream-processing framework built on top of Spark's batch engine; it uses micro-batching and is therefore criticized for relatively high latency.
- Flink DataStream/Blink: a third-generation stream-computing framework that draws on the design experience of Spark and Storm; it brings large improvements in latency, usability, and performance, and is currently the strongest stream-computing engine.
Architecture Overview
Apache Storm provides a stream-computing abstraction called a Topology, comparable to Hadoop's MapReduce, with one key difference: a MapReduce job eventually finishes, whereas a Topology keeps running until the user terminates it with the storm kill command. Storm provides a highly reliable, scalable, and fault-tolerant stream-computing service that can guarantee tuple processing (at-least-once or exactly-once). It integrates easily with existing services such as HDFS, Kafka, HBase, Redis, Memcached, and YARN. A single Storm stage can process about one million tuples per second.
Nimbus
: The master node of the cluster; it distributes code, assigns tasks, and monitors the Supervisors for failures.
Supervisor
: Accepts task assignments from Nimbus and starts Worker processes to execute them.
ZooKeeper
: Coordinates Nimbus and the Supervisors. Storm stores the state of the Nimbus and Supervisor processes in ZooKeeper, which makes Nimbus and Supervisor stateless and enables fast failure recovery, giving the stream computation remarkable stability.
Worker
: A Java process launched by a Supervisor for a specific Topology. A Worker runs Executors (threads) to perform the work, and each unit of work is packaged as a Task.
Cluster Setup
- Synchronize clocks
[root@CentOSX ~]# yum install -y ntp
[root@CentOSX ~]# service ntpd start
[root@CentOSX ~]# ntpdate cn.pool.ntp.org
17 Jun 16:06:14 ntpdate[22184]: step time server 120.25.115.20 offset 12.488129 sec
- Install the ZooKeeper cluster
[root@CentOSX ~]# tar -zxf zookeeper-3.4.6.tar.gz -C /usr/
[root@CentOSX ~]# mkdir zkdata
[root@CentOSX ~]# cp /usr/zookeeper-3.4.6/conf/zoo_sample.cfg /usr/zookeeper-3.4.6/conf/zoo.cfg
[root@CentOSX ~]# vi /usr/zookeeper-3.4.6/conf/zoo.cfg
tickTime=2000
dataDir=/root/zkdata
clientPort=2181
[root@CentOSX ~]# /usr/zookeeper-3.4.6/bin/zkServer.sh start zoo.cfg
JMX enabled by default
Using config: /usr/zookeeper-3.4.6/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
[root@CentOSX ~]# /usr/zookeeper-3.4.6/bin/zkServer.sh status zoo.cfg
JMX enabled by default
Using config: /usr/zookeeper-3.4.6/bin/../conf/zoo.cfg
Mode: standalone
- Install JDK 8+
[root@CentOSX ~]# rpm -ivh jdk-8u171-linux-x64.rpm
Preparing... ########################################### [100%]
1:jdk1.8 ########################################### [100%]
Unpacking JAR files...
tools.jar...
plugin.jar...
javaws.jar...
deploy.jar...
rt.jar...
jsse.jar...
charsets.jar...
localedata.jar...
[root@CentOSX ~]# vi .bashrc
JAVA_HOME=/usr/java/latest
CLASSPATH=.
PATH=$PATH:$JAVA_HOME/bin
export JAVA_HOME
export CLASSPATH
export PATH
[root@CentOSX ~]# source .bashrc
- Map hostnames to IP addresses
[root@CentOSX ~]# vi /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.111.128 CentOSA
192.168.111.129 CentOSB
192.168.111.130 CentOSC
- Disable the firewall
[root@CentOSX ~]# service iptables stop
iptables: Setting chains to policy ACCEPT: filter [ OK ]
iptables: Flushing firewall rules: [ OK ]
iptables: Unloading modules: [ OK ]
[root@CentOSX ~]# chkconfig iptables off
- Install and configure Storm
[root@CentOSX ~]# tar -zxf apache-storm-2.0.0.tar.gz -C /usr/
[root@CentOSX ~]# vi .bashrc
STORM_HOME=/usr/apache-storm-2.0.0
JAVA_HOME=/usr/java/latest
CLASSPATH=.
PATH=$PATH:$JAVA_HOME/bin:$STORM_HOME/bin
export JAVA_HOME
export CLASSPATH
export PATH
export STORM_HOME
[root@CentOSX ~]# source .bashrc
[root@CentOSX ~]# storm version
For Storm 2.0.0 you additionally need to install
yum install -y python-argparse
otherwise the storm command fails with:
Traceback (most recent call last):
File "/usr/apache-storm-2.0.0/bin/storm.py", line 20, in <module >
import argparse
ImportError: No module named argparse
- Edit the storm.yaml configuration file
[root@CentOSX ~]# vi /usr/apache-storm-2.0.0/conf/storm.yaml
########### These MUST be filled in for a storm configuration
storm.zookeeper.servers:
    - "CentOSA"
    - "CentOSB"
    - "CentOSC"
storm.local.dir: "/usr/storm-stage"
nimbus.seeds: ["CentOSA","CentOSB","CentOSC"]
supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703
Note: YAML is indentation-sensitive; keep the leading spaces in front of the list entries exactly as shown.
- Start the Storm processes
[root@CentOSX ~]# nohup storm nimbus >/dev/null 2>&1 &     # start the master node
[root@CentOSX ~]# nohup storm supervisor >/dev/null 2>&1 & # start a compute node
[root@CentOSA ~]# nohup storm ui >/dev/null 2>&1 &         # start the web UI
- After startup succeeds, visit http://CentOSA:8080
Topology Concepts
Topology: a Storm topology wires together the data flow of a streaming computation. A Storm topology is analogous to a MapReduce job, with one key difference: a MapReduce job eventually finishes, whereas a topology runs forever (until you kill it, of course).
Streams: a stream is an unbounded sequence of Tuples (comparable to Records in Kafka Streaming), created and processed in parallel in a distributed fashion. Streams are defined with a schema that names the fields of the stream's tuples.
Tuple: a single record in Storm. A tuple stores its values as an array of elements; the elements are read-only and cannot be modified (see the access sketch after this list).
Tuple t = new Tuple(new Object[]{1,"zs",true}); // conceptually: a read-only array of values
Spouts: responsible for producing tuples; a spout is the source of a stream. A spout typically reads data from an external system, wraps it in tuples, and emits them into the topology. Interfaces: IRichSpout | BaseRichSpout.
Bolts: every tuple in a topology is processed by bolts. Bolts are used for filtering, aggregation, functions, joins, writing data to databases, and so on.
IRichBolt | BaseRichBolt: at-most-once processing
IBasicBolt | BaseBasicBolt: at-least-once processing
IStatefulBolt | BaseStatefulBolt: stateful computation
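As a quick illustration of the read-only access pattern, here is a minimal sketch; the field name "line" matches the quick-start spout below, and values are read through getters rather than modified in place:
//inside a bolt's execute(Tuple input):
String byName = input.getStringByField("line"); //access by declared field name
Object byIndex = input.getValue(0);             //access by position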
Quick Start Example
- pom.xml
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-core</artifactId>
<version>2.0.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-client</artifactId>
<version>2.0.0</version>
<scope>provided</scope>
</dependency>
- Write the Spout
public class WordCountSpout extends BaseRichSpout {
private String[] lines={"this is a demo","hello Storm","ni hao"};
//the collector is used to emit tuples downstream
private SpoutOutputCollector collector;
public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
this.collector=collector;
}
//emit a Tuple downstream; the tuple's schema is declared in declareOutputFields
public void nextTuple() {
Utils.sleep(1000);//pause for 1 second
String line=lines[new Random().nextInt(lines.length)];
collector.emit(new Values(line));
}
//describe the fields of the emitted tuples
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("line"));
}
}
- Write the Bolts
LineSplitBolt
public class LineSplitBolt extends BaseRichBolt {
//the collector is used to emit tuples downstream
private OutputCollector collector;
public void prepare(Map<String, Object> topoConf, TopologyContext context, OutputCollector collector) {
this.collector=collector;
}
public void execute(Tuple input) {
String line = input.getStringByField("line");
String[] tokens = line.split("\\W+");
for (String token : tokens) {
collector.emit(new Values(token,1));
}
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word","count"));
}
}
WordCountBolt
public class WordCountBolt extends BaseRichBolt {
//holds the word-count state
private Map<String,Integer> keyValueState;
//the collector is used to emit tuples downstream
private OutputCollector collector;
public void prepare(Map<String, Object> topoConf, TopologyContext context, OutputCollector collector) {
this.collector=collector;
keyValueState=new HashMap<String, Integer>();
}
public void execute(Tuple input) {
String key = input.getStringByField("word");
int count=0;
if(keyValueState.containsKey(key)){
count=keyValueState.get(key);
}
//update the state
int currentCount=count+1;
keyValueState.put(key,currentCount);
//emit the updated count downstream
collector.emit(new Values(key,currentCount));
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("key","result"));
}
}
WordPrintBolt
public class WordPrintBolt extends BaseRichBolt {
public void prepare(Map<String, Object> topoConf, TopologyContext context, OutputCollector collector) {
}
public void execute(Tuple input) {
String word=input.getStringByField("key");
Integer result=input.getIntegerByField("result");
System.out.println(input+"\t"+word+" , "+result);
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
}
}
- Write the Topology
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;
public class WordCountTopology {
public static void main(String[] args) throws Exception {
//1. create the TopologyBuilder
TopologyBuilder builder = new TopologyBuilder();
//2. wire up the stream-processing logic (spouts, bolts, groupings)
builder.setSpout("WordCountSpout",new WordCountSpout(),1);
builder.setBolt("LineSplitBolt",new LineSplitBolt(),3)
.shuffleGrouping("WordCountSpout");//LineSplitBolt receives upstream tuples via shuffle (random) grouping
builder.setBolt("WordCountBolt",new WordCountBolt(),3)
.fieldsGrouping("LineSplitBolt",new Fields("word"));
builder.setBolt("WordPrintBolt",new WordPrintBolt(),4)
.fieldsGrouping("WordCountBolt",new Fields("key"));
//3. submit the streaming computation
Config conf= new Config();
conf.setNumWorkers(3); //number of Worker (JVM) processes the topology needs
conf.setNumAckers(0);  //disable Storm acking (reliability-related)
StormSubmitter.submitTopology("worldcount",conf,builder.createTopology());
}
}
shuffleGrouping
: The downstream LineSplitBolt receives tuples emitted by the upstream spout at random, distributed evenly across its tasks.
fieldsGrouping
: Tuples with the same value of the given fields are always routed to the same bolt task.
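Besides shuffleGrouping and fieldsGrouping, TopologyBuilder exposes a few other common strategies. A minimal sketch (the bolt ids "AuditBolt" and "TotalBolt" are made-up names for illustration only):
//allGrouping: every task of the bolt receives a copy of every tuple (broadcast)
builder.setBolt("AuditBolt",new WordPrintBolt(),2)
.allGrouping("WordCountBolt");
//globalGrouping: all tuples are routed to a single task of the bolt
builder.setBolt("TotalBolt",new WordPrintBolt(),1)
.globalGrouping("WordCountBolt");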
- Submitting the topology
Package the application with mvn package, then upload the resulting jar to any machine in the cluster.
[root@CentOSA ~]# storm jar /root/storm-lowlevel-1.0-SNAPSHOT.jar com.baizhi.demo01.WordCountTopology
....
16:27:28.207 [main] INFO o.a.s.StormSubmitter - Finished submitting topology: worldcount
After the submission succeeds, you can inspect the running topology in the Storm UI at http://centosa:8080/
- List running topologies
[root@CentOSA ~]# storm list
...
Topology_name Status Num_tasks Num_workers Uptime_secs Topology_Id Owner
----------------------------------------------------------------------------------------
worldcount ACTIVE 11 3 66 worldcount-2-1560760048 root
- Kill the Topology
[root@CentOSX ~]# storm kill worldcount
Understanding Task Parallelism
- Parallelism maps one-to-one to executor threads.
- conf.setNumWorkers(3);
determines how many Worker processes the Topology needs. Each Worker belongs to exactly one Topology, and each Worker represents one unit of compute resource, called a Slot.
By default a Supervisor starts/manages at most 4 Worker/Slot processes. Workers cannot be shared across Topologies; every streaming job has its JVM processes allocated before it starts. Worker/Slot is how Storm isolates compute resources.
- Relationship between Tasks and Executors
A Task is an instance of a Spout or Bolt. By default one executor thread runs exactly one Task (a thread holds a single Spout or Bolt instance). For example, the following code:
builder.setBolt("LineSplitBolt",new LineSplitBolt(),3)
.setNumTasks(5)
.shuffleGrouping("WordCountSpout");//设置 LineSplitBolt 接收上游数据通过 随机
gives the LineSplitBolt component 3 executor threads while instantiating 5 LineSplitBolt tasks; the three threads are assigned 2, 2, and 1 tasks respectively.
Topology parallelism therefore involves Workers (processes), Executors (threads), and Tasks (instances).
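For the quick-start topology above that works out to 1 + 3 + 3 + 4 = 11 executors/tasks in total (the spout plus the three bolts), which matches the Num_tasks value of 11 shown by storm list, spread across the 3 Workers requested with conf.setNumWorkers(3).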
Question to consider
: Does having more Workers always mean higher throughput?
- storm rebalance
[root@CentOSA ~]# storm rebalance
usage: storm rebalance [-h] [-w WAIT_TIME_SECS] [-n NUM_WORKERS]
[-e EXECUTORS] [-r RESOURCES] [-t TOPOLOGY_CONF]
[--config CONFIG]
[-storm_config_opts STORM_CONFIG_OPTS]
topology-name
Change the number of Workers
[root@CentOSX ~]# storm rebalance -w 10 -n 6 wordcount02
Change the parallelism of a single component; in general it must not exceed that component's number of Tasks
[root@CentOSX ~]# storm rebalance -w 10 -n 3 -e LineSplitBolt=5 wordcount02
Tuple Reliability
Storm tracks each message tuple with a dedicated __acker bolt that monitors whether the entire tuple tree is fully processed. If processing fails or times out, the __acker bolt calls the fail method of the Spout that emitted the tuple, asking the Spout to re-emit it. By default the __acker parallelism equals the number of Workers; the mechanism can be disabled with config.setNumAckers(0).
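For reference, a minimal sketch of the acker-related knobs when acking is left enabled (the values here are illustrative, not recommendations):
Config conf = new Config();
conf.setNumAckers(3);           //number of acker executors (defaults to the worker count)
conf.setMessageTimeoutSecs(30); //a tuple tree not fully acked within 30s is failed and replayed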
How to emit reliably
- Spout side
- The Spout must provide a msgId when emitting a tuple
- and override the ack and fail methods
public class WordCountSpout extends BaseRichSpout {
private String[] lines={"this is a demo","hello Storm","ni hao"};
private SpoutOutputCollector collector;
public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
this.collector=collector;
}
public void nextTuple() {
Utils.sleep(5000);//pause for 5 seconds
int msgId = new Random().nextInt(lines.length);
String line=lines[msgId];
//emit the Tuple with an explicit msgId
collector.emit(new Values(line),msgId);
}
//describe the fields of the emitted tuples
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("line"));
}
//called back (via the acker) when the tuple tree was fully processed
@Override
public void ack(Object msgId) {
System.out.println("processed successfully: "+msgId);
}
//called back (via the acker) when processing failed or timed out
@Override
public void fail(Object msgId) {
String line = lines[(Integer) msgId];
System.out.println("processing failed: "+msgId+"\t"+line);
}
}
- Bolt side
- Anchor each child tuple to its parent tuple
- Acknowledge the parent tuple's status upstream, either with collector.ack(input); or collector.fail(input);
public void execute(Tuple parentTuple) {
try {
//do sth
//anchor the child tuple to the parent tuple
collector.emit(parentTuple, childTuple);
//acknowledge the parent tuple upstream
collector.ack(parentTuple);
} catch (Exception e) {
collector.fail(parentTuple);
}
}
How the reliability mechanism works
See http://storm.apache.org/releases/2.0.0/Guaranteeing-message-processing.html for details.
BasicBolt | BaseBasicBolt
Many bolts follow a common pattern: read an input tuple, emit tuples anchored to it, and ack it at the end of execute (or fail it on error). Storm therefore provides a shortcut: when you want ack semantics, simply implement the IBasicBolt interface or extend BaseBasicBolt, and the anchoring and acking are handled for you.
public class WordCountBolt extends BaseBasicBolt {
//holds the word-count state
private Map<String,Integer> keyValueState;
@Override
public void prepare(Map<String, Object> topoConf, TopologyContext context) {
keyValueState=new HashMap<String, Integer>();
}
public void execute(Tuple input, BasicOutputCollector collector) {
String key = input.getStringByField("word");
int count=0;
if(keyValueState.containsKey(key)){
count=keyValueState.get(key);
}
//update the state
int currentCount=count+1;
keyValueState.put(key,currentCount);
//emit the updated count downstream
collector.emit(new Values(key,currentCount));
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("key","result"));
}
}
Disabling the acker mechanism
- Set the number of ackers to 0
- Do not provide a msgId when the Spout emits
- Do not anchor tuples in the Bolts
Benefit: higher processing throughput and lower latency, at the cost of reliability. A minimal sketch follows below.
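A minimal sketch of the three steps above, reusing the quick-start components (only the lines that differ from the reliable version are shown):
Config conf = new Config();
conf.setNumAckers(0);             //1. no acker executors
//2. in the Spout: emit without a msgId
collector.emit(new Values(line));
//3. in the Bolt: emit without anchoring to the input tuple
collector.emit(new Values(token,1));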
Storm State Management
Storm provides a mechanism that lets a Bolt store and query the state of its own operations. Storm ships a default in-memory implementation, plus implementations backed by Redis/Memcached and HBase. Storm exposes IStatefulBolt | BaseStatefulBolt for implementing stateful bolts.
public class WordCountBolt extends BaseStatefulBolt<KeyValueState<String,Integer>> {
private KeyValueState<String,Integer> state;
private OutputCollector collector;
public void initState(KeyValueState<String,Integer> state) {
this.state=state;
}
@Override
public void prepare(Map<String, Object> topoConf, TopologyContext context, OutputCollector collector) {
this.collector=collector;
}
public void execute(Tuple input) {
String key = input.getStringByField("word");
Integer count=input.getIntegerByField("count");
Integer historyCount = state.get(key, 0);
Integer currentCount=historyCount+count;
//update the state
state.put(key,currentCount);
//the emit must be anchored to the current input
collector.emit(input,new Values(key,currentCount));
collector.ack(input);
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("key","result"));
}
}
A topology containing stateful bolts must run with acking enabled.
Persisting state with Redis
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-redis</artifactId>
<version>2.0.0</version>
</dependency>
- Install Redis
[root@CentOSA ~]# yum install -y gcc-c++
[root@CentOSA ~]# tar -zxf redis-3.2.9.tar.gz
[root@CentOSA ~]# cd redis-3.2.9
[root@CentOSA redis-3.2.9]# vi redis.conf
bind CentOSA
protected-mode no
daemonize yes
[root@CentOSA redis-3.2.9]# ./src/redis-server redis.conf
[root@CentOSA redis-3.2.9]# ps -aux | grep redis-server
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
root 41601 0.1 0.5 135648 5676 ? Ssl 14:45 0:00 ./src/redis-server CentOSA:6379
root 41609 0.0 0.0 103260 888 pts/1 S+ 14:45 0:00 grep redis-server
- Configuration template from the official documentation (a Map<String, String | Map<String,Object>>):
http://storm.apache.org/releases/2.0.0/State-checkpointing.html
Standalone mode
{
"keyClass": "Optional fully qualified class name of the Key type.",
"valueClass": "Optional fully qualified class name of the Value type.",
"keySerializerClass": "Optional Key serializer implementation class.",
"valueSerializerClass": "Optional Value Serializer implementation class.",
"jedisPoolConfig": {
"host": "localhost",
"port": 6379,
"timeout": 2000,
"database": 0,
"password": "xyz"
}
}
Cluster mode
{
"keyClass": "Optional fully qualified class name of the Key type.",
"valueClass": "Optional fully qualified class name of the Value type.",
"keySerializerClass": "Optional Key serializer implementation class.",
"valueSerializerClass": "Optional Value Serializer implementation class.",
"jedisClusterConfig": {
"nodes": ["localhost:7379", "localhost:7380", "localhost:7381"],
"timeout": 2000,
"maxRedirections": 5
}
}
- Configure the topology
public static void main(String[] args) throws Exception {
//1. create the TopologyBuilder
TopologyBuilder builder = new TopologyBuilder();
//2. wire up the stream-processing logic (spouts, bolts, groupings)
builder.setSpout("WordCountSpout",new WordCountSpout(),1);
builder.setBolt("LineSplitBolt",new LineSplitBolt(),3)
//LineSplitBolt receives upstream tuples via shuffle (random) grouping
.shuffleGrouping("WordCountSpout");
builder.setBolt("WordCountBolt",new WordCountBolt(),3)
.fieldsGrouping("LineSplitBolt",new Fields("word"));
builder.setBolt("WordPrintBolt",new WordPrintBolt(),4)
.fieldsGrouping("WordCountBolt",new Fields("key"));
//3. submit the streaming computation
Config conf= new Config();
//use Redis as the state provider
conf.put(Config.TOPOLOGY_STATE_PROVIDER,"org.apache.storm.redis.state.RedisKeyValueStateProvider");
//build the Redis connection parameters
Map<String,Object> stateConfig=new HashMap<String,Object>();
Map<String,Object> redisConfig=new HashMap<String,Object>();
//redis host
redisConfig.put("host","CentOSF");
redisConfig.put("port",6379);
stateConfig.put("jedisPoolConfig",redisConfig);
//Config.TOPOLOGY_STATE_PROVIDER_CONFIG expects its value as a JSON string, e.g.
// {"jedisPoolConfig":{"port":"6379","host":"CentOSF"}}
//so we use the com.fasterxml.jackson.databind.ObjectMapper class (bundled with storm-redis)
//and mapper.writeValueAsString(stateConfig) to serialize the map to JSON
ObjectMapper objectMapper=new ObjectMapper();
System.out.println(objectMapper.writeValueAsString(stateConfig));
conf.put(Config.TOPOLOGY_STATE_PROVIDER_CONFIG,objectMapper.writeValueAsString(stateConfig));
//interval between state checkpoints
conf.put(Config.TOPOLOGY_STATE_CHECKPOINT_INTERVAL,1000);
//number of Worker (JVM) processes the topology needs
conf.setNumWorkers(3);
conf.setNumAckers(3);
conf.setMessageTimeoutSecs(10000);
LocalCluster localCluster = new LocalCluster();
localCluster.submitTopology("wordcount",conf,builder.createTopology());
}
Persisting state with HBase
- Add the dependency
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-hbase</artifactId>
<version>2.0.0</version>
</dependency>
- Install HBase
- Install Hadoop (omitted)
- Set up the HBase environment (omitted)
- Configure the topology
public class WordCountTopology {
public static void main(String[] args) throws Exception {
//1. create the TopologyBuilder
TopologyBuilder builder = new TopologyBuilder();
//2. wire up the stream-processing logic (spouts, bolts, groupings)
builder.setSpout("WordCountSpout",new WordCountSpout(),1);
builder.setBolt("LineSplitBolt",new LineSplitBolt(),3)
.shuffleGrouping("WordCountSpout");//LineSplitBolt receives upstream tuples via shuffle (random) grouping
builder.setBolt("WordCountBolt",new WordCountBolt(),3)
.fieldsGrouping("LineSplitBolt",new Fields("word"));
builder.setBolt("WordPrintBolt",new WordPrintBolt(),4)
.fieldsGrouping("WordCountBolt",new Fields("key"));
//3. submit the streaming computation
Config conf= new Config();
//configure HBase as the state provider
conf.put(Config.TOPOLOGY_STATE_PROVIDER,"org.apache.storm.hbase.state.HBaseKeyValueStateProvider");
Map<String,Object> hbaseConfig=new HashMap<String,Object>();
hbaseConfig.put("hbase.zookeeper.quorum", "CentOSD");//HBase ZooKeeper quorum
conf.put("hbase.conf", hbaseConfig);
ObjectMapper objectMapper=new ObjectMapper();
Map<String,Object> stateConfig=new HashMap<String,Object>();
stateConfig.put("hbaseConfigKey","hbase.conf");
stateConfig.put("tableName","baizhi:wordcountstate");
stateConfig.put("columnFamily","cf1");
//Config.TOPOLOGY_STATE_PROVIDER_CONFIG expects its value as a JSON string, e.g.
// {"hbaseConfigKey":"hbase.conf","tableName":"baizhi:wordcountstate","columnFamily":"cf1"}
//so we use the com.fasterxml.jackson.databind.ObjectMapper class (bundled with storm-redis)
//and mapper.writeValueAsString(stateConfig) to serialize the map to JSON
conf.put(Config.TOPOLOGY_STATE_PROVIDER_CONFIG,objectMapper.writeValueAsString(stateConfig));
conf.put(Config.TOPOLOGY_STATE_CHECKPOINT_INTERVAL,1000);
//number of Worker (JVM) processes the topology needs
conf.setNumWorkers(1);
//message (ack) timeout in seconds
conf.setMessageTimeoutSecs(10000);
LocalCluster localCluster = new LocalCluster();
localCluster.submitTopology("wordcount11",conf,builder.createTopology());
}
}
Checkpoint Mechanism
Checkpoints are triggered by an internal checkpoint spout at the interval specified by topology.state.checkpoint.interval.ms. If the topology contains at least one IStatefulBolt, the topology builder adds the checkpoint spout automatically. For stateful topologies the builder wraps each IStatefulBolt in a StatefulBoltExecutor, which commits state when it receives checkpoint tuples. Non-stateful bolts are wrapped in a CheckpointTupleForwarder, which simply forwards checkpoint tuples so they can flow through the whole topology DAG. Checkpoint tuples travel on a separate internal stream, $checkpoint; the topology builder wires this stream across the entire topology, with the checkpoint spout at the root.
Note: the configured checkpoint interval must not be larger than
topology.message.timeout.secs.
A minimal configuration sketch follows.
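A minimal sketch of the two settings involved (the interval is in milliseconds while the message timeout is in seconds; the values are illustrative):
Config conf = new Config();
conf.put(Config.TOPOLOGY_STATE_CHECKPOINT_INTERVAL,1000); //checkpoint every 1000 ms
conf.setMessageTimeoutSecs(30); //must be at least as large as the checkpoint interval (1s here)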
Distributed RPC
Storm's DRPC turns a topology into a truly parallel computation service: the topology receives the caller's arguments, performs the computation, and returns the result to the caller as tuples.
- Edit the storm.yaml configuration file
vi /usr/apache-storm-2.0.0/conf/storm.yaml
storm.zookeeper.servers:
    - "CentOSA"
    - "CentOSB"
    - "CentOSC"
storm.local.dir: "/usr/storm-stage"
nimbus.seeds: ["CentOSA","CentOSB","CentOSC"]
supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703
drpc.servers:
    - "CentOSA"
    - "CentOSB"
    - "CentOSC"
storm.thrift.transport: "org.apache.storm.security.auth.plain.PlainSaslTransportPlugin"
Mind the YAML formatting!
- Restart all Storm services
[root@CentOSX ~]# nohup storm drpc >/dev/null 2>&1 &
[root@CentOSX ~]# nohup storm nimbus >/dev/null 2>&1 &
[root@CentOSX ~]# nohup storm supervisor >/dev/null 2>&1 &
[root@CentOSA ~]# nohup storm ui >/dev/null 2>&1 &
DRPC Example Walkthrough
Add the dependency
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-redis</artifactId>
<version>2.0.0</version>
</dependency>
WordCountRedisLookupMapper
import com.google.common.collect.Lists;
import org.apache.storm.redis.common.mapper.RedisDataTypeDescription;
import org.apache.storm.redis.common.mapper.RedisLookupMapper;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.ITuple;
import org.apache.storm.tuple.Values;
import java.util.List;
public class WordCountRedisLookupMapper implements RedisLookupMapper {
// iTuple is the tuple sent from upstream; it is needed to obtain the request id
public List<Values> toTuple(ITuple iTuple, Object value) {
Object id = iTuple.getValue(0);
List<Values> values = Lists.newArrayList();
if(value == null){
value = 0;
}
values.add(new Values(id, value));
return values;
}
//the first field must be named "id"; the remaining field names do not matter
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
outputFieldsDeclarer.declare(new Fields("id", "num"));
}
//describe the Redis data type and key
public RedisDataTypeDescription getDataTypeDescription() {
return new RedisDataTypeDescription(RedisDataTypeDescription.RedisDataType.HASH,"wordcount");
}
public String getKeyFromTuple(ITuple iTuple) {
return iTuple.getString(1);
}
//no implementation needed here; this method is used by RedisStoreBolt
public String getValueFromTuple(ITuple iTuple) {
return null;
}
}
TopologyDRPCStreeamTest
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.drpc.LinearDRPCTopologyBuilder;
import org.apache.storm.redis.bolt.RedisLookupBolt;
import org.apache.storm.redis.common.config.JedisPoolConfig;
public class TopologyDRPCStreeamTest {
public static void main(String[] args) throws Exception {
LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("count");
Config conf = new Config();
conf.setDebug(false);
JedisPoolConfig jedisConfig = new JedisPoolConfig.Builder()
.setHost("CentOSA").setPort(6379).build();
RedisLookupBolt lookupBolt = new RedisLookupBolt(jedisConfig, new WordCountRedisLookupMapper());
builder.addBolt(lookupBolt);
StormSubmitter.submitTopology("drpc-demo", conf, builder.createRemoteTopology());
}
}
- Package the application
- Submit the topology
[root@CentOSA ~]# storm jar storm-lowlevel-1.0-SNAPSHOT.jar com.baizhi.demo07.TopologyDRPCStreeamTest --artifacts 'org.apache.storm:storm-redis:2.0.0'
--artifacts specifies the Maven coordinates the program depends on; the storm script downloads them from the network automatically. If there are multiple dependencies, separate them with
^
If the dependencies live in a private repository, use --artifactRepositories:
[root@CentOSA ~]# storm jar storm-lowlevel-1.0-SNAPSHOT.jar com.baizhi.demo07.TopologyDRPCStreeamTest
--artifacts 'org.apache.storm:storm-redis:2.0.0'
--artifactRepositories 'local^http://192.168.111.1:8081/nexus/content/groups/public/'
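Once the DRPC topology is running, a client can invoke the registered "count" function. A minimal sketch, assuming the default DRPC port 3772 and the storm.yaml configuration above (the argument "hello" is just an example word):
import org.apache.storm.utils.DRPCClient;
import org.apache.storm.utils.Utils;
import java.util.Map;
public class DRPCClientDemo {
public static void main(String[] args) throws Exception {
//read the local storm configuration (picks up storm.thrift.transport, etc.)
Map<String, Object> conf = Utils.readStormConfig();
DRPCClient client = new DRPCClient(conf, "CentOSA", 3772);
//invoke the "count" function registered by LinearDRPCTopologyBuilder("count")
String result = client.execute("count", "hello");
System.out.println("hello -> " + result);
}
}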
Kafka Storm Integration
- Dependencies
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-kafka-client</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>2.2.0</version>
</dependency>
- Build the KafkaSpout
public class KafkaTopologyDemo {
public static void main(String[] args) throws Exception {
//create the TopologyBuilder
TopologyBuilder builder = new TopologyBuilder();
//connection parameters
String bootstrpserver="CentOSD:9092,CentOSE:9092,CentOSF:9092";
String topicName="topic01";
//configure the Kafka spout
KafkaSpoutConfig<String, String> kafkaSpoutConfig = KafkaSpoutConfig.builder(bootstrpserver, topicName)
//deserializer classes
.setProp(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
.setProp(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
.setProp(ConsumerConfig.GROUP_ID_CONFIG, "g1")//consumer group id
.setEmitNullTuples(false)//do not emit tuples with null values
.setFirstPollOffsetStrategy(FirstPollOffsetStrategy.LATEST)//on first subscription only consume records produced after subscribing
//commit offsets only after records have been processed at least once
.setProcessingGuarantee(KafkaSpoutConfig.ProcessingGuarantee.AT_LEAST_ONCE)
.setMaxUncommittedOffsets(10)//once a partition has 10 uncommitted offsets the spout stops polling (back-pressure)
.setRecordTranslator(new MyRecordTranslator<String, String>())
.build();
KafkaSpout<String, String> kafkaSpout = new KafkaSpout<String, String>(kafkaSpoutConfig);
builder.setSpout("KafkaSpout",kafkaSpout,3);//读入
builder.setBolt("KafkaPrintBolt",new KafkaPrintBolt(),1)//打印处理(该类需自定义)
.shuffleGrouping("KafkaSpout");
Config config = new Config();
config.setNumWorkers(3);
LocalCluster localCluster = new LocalCluster();//local test
localCluster.submitTopology("topic01",config,builder.createTopology());
}
}
MyRecordTranslator
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.storm.kafka.spout.DefaultRecordTranslator;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import java.util.List;
public class MyRecordTranslator<K, V> extends DefaultRecordTranslator<K, V> {
//convert each Kafka record into a Storm tuple
@Override
public List<Object> apply(ConsumerRecord<K, V> record) {
return new Values(new Object[]{record.topic(),record.partition(),record.offset(),record.key(),record.value(),record.timestamp()});
}
//declare the fields of the emitted tuples
@Override
public Fields getFieldsFor(String stream) {
return new Fields("topic","partition","offset","key","value","timestamp");
}
}
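KafkaPrintBolt is referenced above but left to the reader; a minimal sketch that simply prints the fields produced by MyRecordTranslator (field names as declared in getFieldsFor):
public class KafkaPrintBolt extends BaseBasicBolt {
//print the Kafka metadata and payload carried by each tuple
public void execute(Tuple input, BasicOutputCollector collector) {
System.out.println(input.getStringByField("topic")+"\t"
+input.getIntegerByField("partition")+"\t"
+input.getLongByField("offset")+"\t"
+input.getValueByField("key")+"\t"
+input.getValueByField("value"));
}
//terminal bolt: nothing to declare downstream
public void declareOutputFields(OutputFieldsDeclarer declarer) {
}
}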
Kafka + HBase + Redis Integration
- Add the dependencies
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-core</artifactId>
<version>2.0.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-client</artifactId>
<version>2.0.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-redis</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-hbase</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-kafka-client</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>2.2.0</version>
</dependency>
WodCountTopology
public class WodCountTopology {
public static void main(String[] args) throws Exception {
//create the TopologyBuilder
TopologyBuilder builder=new TopologyBuilder();
Config conf = new Config();
//Redis-backed state management
conf.put(Config.TOPOLOGY_STATE_PROVIDER,"org.apache.storm.redis.state.RedisKeyValueStateProvider");
Map<String,Object> stateConfig=new HashMap<String,Object>();
Map<String,Object> redisConfig=new HashMap<String,Object>();
//redis host
redisConfig.put("host","CentOSF");
//redis port
redisConfig.put("port",6379);
stateConfig.put("jedisPoolConfig",redisConfig);
ObjectMapper objectMapper=new ObjectMapper();
//serialize the state-provider configuration to JSON
conf.put(Config.TOPOLOGY_STATE_PROVIDER_CONFIG,objectMapper.writeValueAsString(stateConfig));
//HBase connection parameters
Map<String, Object> hbaseConfig = new HashMap<String, Object>();
//HBase ZooKeeper quorum
hbaseConfig.put("hbase.zookeeper.quorum", "CentOSD");
conf.put("hbase.conf", hbaseConfig);
//build the KafkaSpout with the Kafka connection parameters
KafkaSpout<String, String> kafkaSpout = buildKafkaSpout("CentOSD:9092,CentOSE:9092,CentOSF:9092", "topic01");
builder.setSpout("KafkaSpout",kafkaSpout,3);
builder.setBolt("LineSplitBolt",new LineSplitBolt(),3)
//grouping: KafkaSpout tuples carry no key, so distribute them evenly (shuffle/round-robin)
.shuffleGrouping("KafkaSpout");
builder.setBolt("WordCountBolt",new WordCountBolt(),3)
//grouping: route by field (hash of the "word" field emitted by LineSplitBolt)
.fieldsGrouping("LineSplitBolt",new Fields("word"));
//HBase mapping configuration
SimpleHBaseMapper mapper = new SimpleHBaseMapper()
//row key
.withRowKeyField("key")
//column qualifier
.withColumnFields(new Fields("key"))
//these fields must hold numeric values
.withCounterFields(new Fields("result"))
//column family
.withColumnFamily("cf1");
//target table
HBaseBolt haseBolt = new HBaseBolt("baizhi:wordcount", mapper)
//key under which the HBase connection config was stored above
.withConfigKey("hbase.conf");
builder.setBolt("HBaseBolt",haseBolt,3)
//grouping: route by the "key" field emitted by WordCountBolt (hash)
.fieldsGrouping("WordCountBolt",new Fields("key"));
new LocalCluster().submitTopology("wordcount1",conf,builder.createTopology());
}
//build a KafkaSpout that reads from Kafka
public static KafkaSpout<String, String> buildKafkaSpout(String boostrapServers, String topic){
KafkaSpoutConfig<String,String> kafkaspoutConfig=KafkaSpoutConfig.builder(boostrapServers,topic)
//deserializer classes
.setProp(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,"org.apache.kafka.common.serialization.StringDeserializer")
.setProp(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,"org.apache.kafka.common.serialization.StringDeserializer")
//consumer group id
.setProp(ConsumerConfig.GROUP_ID_CONFIG,"g1")
//do not emit tuples with null values
.setEmitNullTuples(false)
//only consume records produced after subscribing
.setFirstPollOffsetStrategy(FirstPollOffsetStrategy.LATEST)
//process records at least once before committing offsets
.setProcessingGuarantee(KafkaSpoutConfig.ProcessingGuarantee.AT_LEAST_ONCE)
//once a partition has 10 uncommitted offsets the spout stops polling (back-pressure)
.setMaxUncommittedOffsets(10)
.build();
return new KafkaSpout<String, String>(kafkaspoutConfig);
}
}
WordCountBolt
public class WordCountBolt extends BaseStatefulBolt<KeyValueState<String,Integer>> {
private KeyValueState<String,Integer> state;
private OutputCollector collector;
public void initState(KeyValueState<String,Integer> state) {
this.state=state;
}
@Override
public void prepare(Map<String, Object> topoConf, TopologyContext context, OutputCollector collector) {
this.collector=collector;
}
public void execute(Tuple input) {
String key = input.getStringByField("word");
Integer count=input.getIntegerByField("count");
Integer historyCount = state.get(key, 0);
Integer currentCount=historyCount+count;
//update the state
state.put(key,currentCount);
//the emit must be anchored to the current input
collector.emit(input,new Values(key,currentCount));
collector.ack(input);
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("key","result"));
}
}
LineSplitBolt
public class LineSplitBolt extends BaseBasicBolt {
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word","count"));
}
public void execute(Tuple input, BasicOutputCollector collector) {
String line = input.getStringByField("value");
String[] tokens = line.split("\\W+");
for (String token : tokens) {
//emit (BasicOutputCollector anchors to the input tuple automatically)
collector.emit(new Values(token,1));
}
}
}
- Download dependencies from Maven at submit time
[root@CentOSC ~]# storm jar storm-lowlevel-1.0-SNAPSHOT.jar com.baizhi.demo09.WodCountTopology --artifacts 'org.apache.storm:storm-redis:2.0.0,org.apache.storm:storm-hbase:2.0.0,org.apache.storm:storm-kafka-client:2.0.0,org.apache.kafka:kafka-clients:2.2.0' --artifactRepositories 'local^http://192.168.111.1:8081/nexus/content/groups/public/'
- Or add the shade plugin to the project to build a fat jar
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.4.1</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
</execution>
</executions>
</plugin>
Storm Window Functions
Storm has core support for processing groups of tuples that fall within a window. Windows are specified with the following two parameters (similar to Kafka Streaming):
- Window length: the length or duration of the window
- Sliding interval: the interval at which the window slides
Sliding Window (hopping time window)
Tuples are grouped into windows, and the window slides forward every sliding interval. Below is a time-based sliding window with a window length of 10 seconds that slides every 5 seconds. As the diagram shows, sliding windows overlap, so a single tuple may belong to one or more windows.
........| e1 e2 | e3 e4 e5 e6 | e7 e8 e9 |...
-5 0 5 10 15 -> time
|<------- w1 -->|
|<---------- w2 ----->|
|<-------------- w3 ---->|
public class WodCountTopology {
public static void main(String[] args) throws Exception {
TopologyBuilder builder=new TopologyBuilder();
Config conf = new Config();
//build the KafkaSpout
KafkaSpout<String, String> kafkaSpout = KafkaSpoutUtils.buildKafkaSpout("CentOSA:9092,CentOSB:9092,CentOSC:9092", "topic01");
builder.setSpout("KafkaSpout",kafkaSpout,3);
builder.setBolt("LineSplitBolt",new LineSplitBolt(),3)
.shuffleGrouping("KafkaSpout");
ClickWindowCountBolt clickWindowCountBolt = new ClickWindowCountBolt();
clickWindowCountBolt.withWindow(BaseWindowedBolt.Duration.seconds(5),BaseWindowedBolt.Duration.seconds(2));
builder.setBolt("ClickWindowCountBolt",clickWindowCountBolt,3)
.fieldsGrouping("LineSplitBolt",new Fields("word"));
builder.setBolt("WordPrintBolt",new WordPrintBolt(),3)
.fieldsGrouping("ClickWindowCountBolt",new Fields("key"));
new LocalCluster().submitTopology("wordcount",conf,builder.createTopology());
}
}
Tumbling Window
Tuples are grouped into windows whose slide interval equals the window length. Consequently, and unlike sliding windows, tumbling windows do not overlap: each tuple belongs to exactly one window.
| e1 e2 | e3 e4 e5 e6 | e7 e8 e9 |...
0 5 10 15 -> time
w1 w2 w3
public class WodCountTopology {
public static void main(String[] args) throws Exception {
TopologyBuilder builder=new TopologyBuilder();
Config conf = new Config();
//build the KafkaSpout
KafkaSpout<String, String> kafkaSpout = KafkaSpoutUtils.buildKafkaSpout("CentOSA:9092,CentOSB:9092,CentOSC:9092", "topic01");
builder.setSpout("KafkaSpout",kafkaSpout,3);
builder.setBolt("LineSplitBolt",new LineSplitBolt(),3)
.shuffleGrouping("KafkaSpout");
ClickWindowCountBolt clickWindowCountBolt = new ClickWindowCountBolt();
//configure a tumbling window
clickWindowCountBolt.withTumblingWindow(BaseWindowedBolt.Duration.seconds(5));
builder.setBolt("ClickWindowCountBolt",clickWindowCountBolt,3)
.fieldsGrouping("LineSplitBolt",new Fields("word"));
builder.setBolt("WordPrintBolt",new WordPrintBolt(),3)
.fieldsGrouping("ClickWindowCountBolt",new Fields("key"));
new LocalCluster().submitTopology("wordcount",conf,builder.createTopology());
}
}
ClickWindowCountBolt
public class ClickWindowCountBolt extends BaseWindowedBolt {
private OutputCollector collector;
@Override
public void prepare(Map<String, Object> topoConf, TopologyContext context, OutputCollector collector) {
this.collector=collector;
}
public void execute(TupleWindow tupleWindow) {
Long startTimestamp = tupleWindow.getStartTimestamp();
Long endTimestamp = tupleWindow.getEndTimestamp();
SimpleDateFormat sdf=new SimpleDateFormat("HH:mm:ss");
System.out.println(sdf.format(startTimestamp)+"\t"+sdf.format(endTimestamp));
HashMap<String,Integer> hashMap=new HashMap<String, Integer>();
List<Tuple> tuples = tupleWindow.get();
for (Tuple tuple : tuples) {
String key = tuple.getStringByField("word");
Integer historyCount = 0;
if (hashMap.containsKey(key)) {
historyCount=hashMap.get(key);
}
int currentCount=historyCount+1;
hashMap.put(key,currentCount);
}
//emit the per-window counts downstream to the print bolt
for (Map.Entry<String, Integer> entry : hashMap.entrySet()) {
collector.emit(tupleWindow.get(),new Values(entry.getKey(),entry.getValue()));
}
for (Tuple tuple : tupleWindow.get()) {
collector.ack(tuple);
}
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("key","result"));
}
}
Silencing the Kafka client logs
For local testing of the Storm project, create a file named log4j2.xml under the test resources directory:
<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="WARN">
<Appenders>
<Console name="Console" target="SYSTEM_OUT">
<PatternLayout pattern="%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n"/>
</Console>
</Appenders>
<Loggers>
<Root level="ERROR">
<AppenderRef ref="Console"/>
</Root>
</Loggers>
</Configuration>
Window time semantics in Storm
- By default, Storm timestamps each tuple with the system time at which it reaches the window bolt. This is only meaningful when the gap between the time a record is produced and the time it is processed is very small; this strategy is known as Processing Time. (By default Storm takes the time the window bolt processes the tuple as the tuple's timestamp; alternatively, the timestamp can be taken from one of the tuple's fields via withTimestampField(String fieldName).)
- In real business scenarios the compute node usually sees the data noticeably later than it was produced, and windows based on processing time lose their meaning. Storm therefore also supports extracting a timestamp carried by the tuple and computing windows over it; this strategy is known as Event Time.
Handling Late Tuples
- Watermark: the watermark is the latest tuple timestamp received so far minus the configured lag; its purpose is to advance and trigger windows. (A watermark is used to decide whether windows should be evaluated: whenever the window bolt receives a watermark it checks which windows are ready to be computed. The watermark emission period can be set with withWatermarkInterval(Duration interval); the default is 1s.)
lag: the delay allowance used when computing the watermark
public class WodCountTopology {
public static void main(String[] args) throws Exception {
TopologyBuilder builder=new TopologyBuilder();
Config conf = new Config();
//build the KafkaSpout
KafkaSpout<String, String> kafkaSpout = KafkaSpoutUtils.buildKafkaSpout("CentOSA:9092,CentOSB:9092,CentOSC:9092", "topic02");
builder.setSpout("KafkaSpout",kafkaSpout,3);
builder.setBolt("ExtractTimeBolt",new ExtractTimeBolt(),3)
.shuffleGrouping("KafkaSpout");
builder.setBolt("ClickWindowCountBolt",new ClickWindowCountBolt()
.withWindow(BaseWindowedBolt.Duration.seconds(10),BaseWindowedBolt.Duration.seconds(5))
//take the tuple's timestamp from this field
.withTimestampField("timestamp")
//allowed lateness (lag)
.withLag(BaseWindowedBolt.Duration.seconds(2))
//watermark emission period
.withWatermarkInterval(BaseWindowedBolt.Duration.seconds(1))
//route late tuples to the "latestream" stream
.withLateTupleStream("latestream")
,1)
.fieldsGrouping("ExtractTimeBolt",new Fields("word"));
//bolt dedicated to handling late tuples
builder.setBolt("lateBolt",new LateBolt(),3)
.shuffleGrouping("ClickWindowCountBolt",
"latestream");
new LocalCluster().submitTopology("wordcount",conf,builder.createTopology());
}
}
ClickWindowCountBolt
public class ClickWindowCountBolt extends BaseWindowedBolt {
private OutputCollector collector;
@Override
public void prepare(Map<String, Object> topoConf, TopologyContext context, OutputCollector collector) {
this.collector=collector;
}
public void execute(TupleWindow tupleWindow) {
Long startTimestamp = tupleWindow.getStartTimestamp();
Long endTimestamp = tupleWindow.getEndTimestamp();
SimpleDateFormat sdf=new SimpleDateFormat("HH:mm:ss");
System.out.println(sdf.format(startTimestamp)+"\t"+sdf.format(endTimestamp)+" \t"+this);
for (Tuple tuple : tupleWindow.get()) {
collector.ack(tuple);
String key = tuple.getStringByField("word");
System.out.println("\t"+key);
}
}
}
ExtractTimeBolt
public class ExtractTimeBolt extends BaseBasicBolt {
public void execute(Tuple input, BasicOutputCollector collector) {
String line = input.getStringByField("value");
String[] tokens = line.split("\\W+");
SimpleDateFormat sdf=new SimpleDateFormat("HH:mm:ss");
Long ts= Long.parseLong(tokens[1]);
System.out.println("收到:"+tokens[0]+"\t"+sdf.format(ts) );
collector.emit(new Values(tokens[0],ts));
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word","timestamp"));
}
}
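ExtractTimeBolt assumes each Kafka record value has the form <word> <epochMillis>, for example a line such as hello 1560760048000 produced to topic02: tokens[0] becomes the word and tokens[1] is parsed as the event-time timestamp.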
LateBolt (handles late tuples)
public class LateBolt extends BaseBasicBolt {
public void execute(Tuple tuple, BasicOutputCollector basicOutputCollector) {
System.out.println("迟到的元素:"+tuple);
}
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
}
}