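1. Basic concepts and principles of Storm transactions
1.1 Transactional batch processing
For fault tolerance, Storm uses a system-level component, the acker, together with an XOR check to determine whether a tuple has been fully processed. If success is not confirmed, the spout can replay the tuple, which guarantees that every tuple is processed at least once. When the requirement is an exact count, however, we want each tuple to take effect exactly once, and at-least-once delivery alone cannot provide that.
Storm 0.7.0 introduced Transactional Topologies to provide exactly-once processing of tuples, so exact computations no longer need to worry about problems such as double counting.
Processing tuples one at a time carries a lot of overhead, and transactional handling of individual tuples is slow. Storm therefore works in batches: a batch of tuples is processed as a unit, and either the whole batch succeeds or, if anything fails, the whole batch is marked as failed. Storm replays failed batches, so that each batch takes effect exactly once.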
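The XOR check used by the acker can be illustrated with a small sketch. The class below is hypothetical and only models the idea (XOR every tuple id once when it is anchored and once when it is acked; the running value returns to zero when the whole tuple tree is done). It is not Storm's actual acker code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of XOR-based ack tracking; not Storm's acker implementation. */
final class XorAckTracker {
    // spout tuple id -> running XOR of every tuple id anchored or acked so far
    private final Map<Long, Long> pending = new HashMap<>();

    /** Called when a tuple belonging to the tree rooted at rootId is emitted. */
    void anchored(long rootId, long tupleId) {
        pending.merge(rootId, tupleId, (a, b) -> a ^ b);
    }

    /** Called when a tuple is acked; returns true once the whole tree is complete. */
    boolean acked(long rootId, long tupleId) {
        long value = pending.merge(rootId, tupleId, (a, b) -> a ^ b);
        if (value == 0L) {
            pending.remove(rootId); // every anchored id has been acked exactly once
            return true;
        }
        return false;
    }
}
```

Because each id is XORed in twice, the tracker needs only one long per spout tuple regardless of how large the tuple tree grows.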
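1.2 How the transaction mechanism works
To get exactly-once semantics, each tuple is in principle processed together with a transaction id (txid). When a transactional update is needed, the bolt checks whether this txid has already been applied successfully, and then stores the result together with the txid. Commits must also be ordered: before the tuple of the current transaction is committed, every tuple with a smaller txid must already have been committed.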
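The txid check described above can be sketched as follows. The class is a hypothetical in-memory stand-in for a database row that stores a count together with the txid that last updated it; the names are assumptions made for illustration.

```java
import java.math.BigInteger;

/** Hypothetical in-memory row: a count plus the txid of the last committed update. */
final class TxidGuardedCounter {
    private BigInteger lastTxid = null;
    private long count = 0;

    /** Applies the partial count unless this txid has already been committed. */
    boolean commit(BigInteger txid, long partialCount) {
        if (txid.equals(lastTxid)) {
            return false; // replayed batch: the update was already applied, skip it
        }
        count += partialCount;
        lastTxid = txid;
        return true;
    }

    long count() {
        return count;
    }
}
```

Storing only the latest txid is sufficient because commits are strictly ordered: a replayed batch can only ever carry the most recently committed txid.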
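In transactional batch processing, a whole batch of tuples is assigned one txid. To increase parallelism between batches, Storm uses a pipeline model so that several transactions can be processed in parallel, while commits still happen strictly in order.
Storm splits transaction handling into two phases, processing and commit. In the processing phase several batches can be computed in parallel; in the commit phase batches must be committed in order.
Processing phase: several batches are computed in parallel. In the figure above, bolt2 is an ordinary batch bolt (it implements IBatchBolt), and multiple batches can execute in parallel across the tasks of bolt2.
Commit phase: batches are forced to commit in order. bolt3 implements IBatchBolt and is marked as transactional (it implements the ICommitter interface, or it is added to the topology through the setCommitterBolt method of TransactionalTopologyBuilder). When Storm decides that the batch can be committed, it calls finishBatch, which is where the txid comparison and the state update are done. batch2 must be committed after batch1.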
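The split between parallel processing and ordered commit can be modelled with a small sketch. The class below is hypothetical and only demonstrates the ordering guarantee: batches may finish processing in any order, but results are committed strictly by increasing txid. In Storm the ordering is driven by the coordinator, not by a buffer like this.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of "process in parallel, commit in txid order". */
final class OrderedCommitSketch {
    private final Map<Long, String> processed = new HashMap<>();
    private final List<String> committed = new ArrayList<>();
    private long nextTxid = 1; // assumption: txids start at 1 and increase by 1

    /** Called when a batch finishes its processing phase, in any order. */
    void processingFinished(long txid, String result) {
        processed.put(txid, result);
        // commit every batch whose predecessors have all been committed
        while (processed.containsKey(nextTxid)) {
            committed.add(processed.remove(nextTxid));
            nextTxid++;
        }
    }

    List<String> committed() {
        return committed;
    }
}
```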
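1.3 Transactional topologies
When you use transactional topologies, Storm does the following for you:
State management: Storm keeps all the state needed to run transactional topologies in ZooKeeper, including the current transaction id and the metadata that defines each batch.
Transaction coordination: Storm manages everything, for example deciding at any point in time whether a transaction should be processing or committing.
Fault detection: Storm uses the acking framework to detect efficiently when a batch has been processed successfully, committed successfully, or has failed, and then replays the batch as needed. You do not need to do any acking or anchoring by hand (anchoring is what happens at emit time).
Built-in batch API: Storm wraps ordinary bolts in an API layer that provides batch processing of tuples.
Storm manages all the coordination, including deciding when a bolt has received every tuple of a given transaction. Storm also automatically handles the intermediate data produced by each transaction.
A transactional spout implements ITransactionalSpout, which contains two nested interfaces, Coordinator and Emitter. While the topology is running, the transactional spout internally contains a sub-topology, structured as shown in the figure:
There are two kinds of tuples here, transactional tuples and the tuples inside a batch:
When the Coordinator starts a transaction and is ready to emit a batch, the transaction enters its processing phase and the Coordinator emits a transactional tuple (TransactionAttempt and metadata) to the "batch emit" stream.
The Emitter subscribes to the coordinator's "batch emit" stream by broadcast and is responsible for actually emitting the tuples of each batch. Every emitted tuple must have the TransactionAttempt as its first field; Storm uses this field to determine which batch the tuple belongs to.
There is only one coordinator, while there can be several emitter instances depending on the parallelism.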
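1.4 TransactionAttempt and metadata
A TransactionAttempt contains two values, a transaction id and an attempt id:
transaction id: unique per batch and shared by every tuple in it; it stays the same no matter how many times the batch is replayed
attempt id: also identifies a batch, but for the same batch the attempt id after a replay differs from the one before the replay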
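The relationship between the two ids can be sketched with a small value class. AttemptSketch is a hypothetical stand-in written for this illustration; it is not Storm's TransactionAttempt class, and the replay method is an assumption used only to show which id changes.

```java
import java.math.BigInteger;
import java.util.Objects;

/** Hypothetical (txid, attemptId) pair; not Storm's TransactionAttempt. */
final class AttemptSketch {
    final BigInteger txid;
    final long attemptId;

    AttemptSketch(BigInteger txid, long attemptId) {
        this.txid = txid;
        this.attemptId = attemptId;
    }

    /** A replay keeps the txid and takes a fresh attempt id supplied by the caller. */
    AttemptSketch replay(long newAttemptId) {
        return new AttemptSketch(txid, newAttemptId);
    }

    /** True when both attempts refer to the same batch, whatever their attempt ids. */
    boolean sameBatch(AttemptSketch other) {
        return txid.equals(other.txid);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof AttemptSketch)) {
            return false;
        }
        AttemptSketch other = (AttemptSketch) o;
        return txid.equals(other.txid) && attemptId == other.attemptId;
    }

    @Override
    public int hashCode() {
        return Objects.hash(txid, attemptId);
    }
}
```

State updates should be keyed on the txid, while the attempt id lets in-flight tuples of an older attempt be told apart from those of the replay.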
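The metadata records the point from which the current transaction's data can be re-emitted. It is stored in ZooKeeper, and the spout serializes and deserializes it to and from ZooKeeper with Kryo.
1.5 Overall execution flow
1.6 Transactional bolts
BaseTransactionalBolt processes the tuples of one batch together. execute is called for each tuple, and finishBatch is called when processing of the whole batch is complete. If the batch bolt is marked as a committer, finishBatch is only called during the commit phase, and Storm guarantees that the commit phase of a batch runs only after the previous batch has finished committing. How, then, does Storm know that processing of a batch is complete, that is, that the bolt has received and handled every tuple in the batch? Internally the bolt is wrapped in a CoordinatedBolt, which works as follows:
1.7 CoordinatedBolt
Each CoordinatedBolt records two things: which tasks send tuples to it, and which tasks it sends tuples to.
Once all tuples have been emitted, the CoordinatedBolt uses a special stream and emitDirect to tell every task it sent tuples to how many tuples it sent to that task. The downstream task compares this number with the number of tuples it has actually received; if they are equal, it has processed all the tuples.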
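The counting handshake can be sketched as below. Sender and Receiver are hypothetical classes that model only the bookkeeping (count per downstream task on one side, compare expected with received on the other); the real CoordinatedBolt does this over Storm streams.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the tuple-count handshake between tasks. */
final class CountingCoordinationSketch {

    /** Upstream side: counts the tuples sent to each downstream task id. */
    static final class Sender {
        private final Map<Integer, Integer> sent = new HashMap<>();

        void send(int taskId) {
            sent.merge(taskId, 1, Integer::sum);
        }

        /** Per-task totals to report once the batch has been fully emitted. */
        Map<Integer, Integer> report() {
            return new HashMap<>(sent);
        }
    }

    /** Downstream side: compares the reported total with what actually arrived. */
    static final class Receiver {
        private int received = 0;
        private Integer expected = null; // unknown until the upstream report arrives

        void onTuple() {
            received++;
        }

        void onReport(int count) {
            expected = count;
        }

        boolean batchComplete() {
            return expected != null && expected == received;
        }
    }
}
```

The check has to tolerate the report arriving before the last data tuple, which is why completeness is re-evaluated on every call rather than only when the report is received.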
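2. Introduction to the Storm transaction API
2.1 Spout
Plain transactional spout
ITransactionalSpout<T>: the plain transactional spout (with two nested interfaces)
-- ITransactionalSpout.Coordinator<X>
-- initializeTransaction(BigInteger txid, X prevMetadata):
Creates new metadata; when isReady() returns true, this metadata (the transactional tuple) is emitted to the "batch emit" stream
-- isReady():
Returns true when a new transaction should start; this is the place to sleep if emission needs to be throttled
-- ITransactionalSpout.Emitter<X>
-- emitBatch(TransactionAttempt tx, X coordinatorMeta, BatchOutputCollector collector):
Emits the tuples of the batch one by one
Partitioned spout
IPartitionedTransactionalSpout<T>: the partitioned transactional spout and the mainstream choice, because mainstream message queues such as Kafka and RocketMQ all support partitions. Partitioning increases MQ throughput (each partition acts as a separate data source).
-- IPartitionedTransactionalSpout.Coordinator
-- isReady(): as above
-- numPartitions(): returns the number of partitions. If a new partition has been added to the source while a transaction is being replayed, the replay does not emit tuples for the new partition, because the transaction already knows how many partitions it covers.
-- IPartitionedTransactionalSpout.Emitter<X>: controls the transaction
-- emitPartitionBatchNew(TransactionAttempt tx, BatchOutputCollector collector, int partition, X lastPartitionMeta):
Emits a new batch and returns its metadata
-- emitPartitionBatch(TransactionAttempt tx, BatchOutputCollector collector, int partition, X partitionMeta):
If the bolts failed to consume this batch, emitPartitionBatch re-emits it
Opaque partitioned transactional spout
IOpaquePartitionedTransactionalSpout<T>: the opaque partitioned transactional spout
-- IOpaquePartitionedTransactionalSpout.Coordinator
-- isReady(): as above
-- IOpaquePartitionedTransactionalSpout.Emitter<X>
-- emitPartitionBatch(TransactionAttempt tx, BatchOutputCollector collector, int partition, X lastPartitionMeta)
-- numPartitions()
It does not distinguish between emitting new messages and re-emitting old ones; emitPartitionBatch handles both. The X returned by emitPartitionBatch is meant for the next batch (it arrives as the fourth parameter of emitPartitionBatch), but X is only written to ZooKeeper after a batch succeeds. If a batch fails and is re-emitted, emitPartitionBatch still reads the old X. The custom X therefore does not need to record both the start of the current batch and the start of the next batch; recording the start of the next batch is enough, for example: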
public class BatchMeta {
public long nextOffset; // offset at which the next batch starts
}
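Comparing IPartitionedTransactionalSpout and IOpaquePartitionedTransactionalSpout
Both IPartitionedTransactionalSpout and IOpaquePartitionedTransactionalSpout group tuples into batches, guarantee that every tuple is fully processed, and support replay. To support transactions, they give each batch a unique transaction id (txid). The txid increases sequentially and batches are completed in strict order: txid=1 must be completely finished before txid=2 can follow.
Differences and usage:
With IPartitionedTransactionalSpout, every tuple is bound to a fixed batch. No matter how many times a tuple is replayed, it stays in the same batch with the same transaction id, and a tuple never appears in more than one batch. A batch likewise keeps one unique, unchanging transaction id however often it is replayed. In other words, a batch contains exactly the same tuples every time it is replayed.
IPartitionedTransactionalSpout does have one problem, although it is very rare. Suppose a batch fails while the bolts are consuming it and the spout has to replay it, and at that moment the messaging middleware has a fault, for example one partition is unreadable. To keep the replayed batch identical, the spout can only wait for the middleware to recover; it is stuck and cannot send anything more to the bolts until then. IOpaquePartitionedTransactionalSpout solves this problem.
To do so, IOpaquePartitionedTransactionalSpout does not guarantee that a replayed batch contains exactly the same tuples. A tuple may first appear in the batch with txid=2 and later appear in the batch with txid=5. This only happens when a batch fails and must be replayed while the middleware happens to be faulty. In that case the spout does not wait for the middleware to recover; it first reads the partitions that are readable. For example, the batch with txid=2 fails during consumption and must be replayed, and 1 of the middleware's 16 partitions (partition=3) has become unreadable. The spout reads the other 15 partitions and completes the batch with txid=2, which now contains fewer tuples than before. If the middleware recovers at txid=5, the tuples from partition=3 that were originally in txid=2 are emitted again as part of the batch with txid=5.
With IOpaquePartitionedTransactionalSpout the mapping between tuples and txids can change, so storing only a txid next to the business result no longer guarantees transactional behaviour. The solution is slightly more involved: besides the business result, two more items must be stored, the business result as of the previous batch and the txid of the current batch.
Let us illustrate with a simpler example that computes a global count. Suppose the current stored result is: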
{
value = 4,
prevValue = 1,
txid = 2
}
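The text stops at the stored state, so the following is a sketch of the update rule usually applied to such a (value, prevValue, txid) triple, offered as an assumption about what the example was leading to. If the incoming txid equals the stored one, the batch is a replay whose content may have changed, so the value is rebuilt from prevValue; otherwise the current value becomes prevValue and the partial count is added on top. The class name is hypothetical.

```java
import java.math.BigInteger;

/** Hypothetical opaque-transactional counter holding value, prevValue and txid. */
final class OpaqueCountSketch {
    long value;
    long prevValue;
    BigInteger txid;

    OpaqueCountSketch(long value, long prevValue, BigInteger txid) {
        this.value = value;
        this.prevValue = prevValue;
        this.txid = txid;
    }

    /** Applies the partial count of the batch identified by currentTxid. */
    void apply(BigInteger currentTxid, long partialCount) {
        if (currentTxid.equals(txid)) {
            // same txid again: discard the earlier attempt and rebuild from prevValue
            value = prevValue + partialCount;
        } else {
            // new txid: the current value becomes the baseline for any later replay
            prevValue = value;
            value = value + partialCount;
            txid = currentTxid;
        }
    }
}
```

Starting from the state above with a partial count of 2, a batch with txid=3 gives value=6, prevValue=4, txid=3, while a replay of txid=2 gives value=3, prevValue=1, txid=2.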
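2.2 Bolt
Plain batch bolt
IBatchBolt<T>: same as BaseBatchBolt<T>, plain batch processing
Transactional bolt
BaseTransactionalBolt: the transactional bolt
ICommitter interface: marks whether an IBatchBolt or BaseTransactionalBolt is a committer CoordinatedBolt
2.3 Plain transaction example (daily count)
2.3.1 Create a Spring Boot project and add the dependencies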
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.hcycom</groupId>
<artifactId>storm-demo-simple2</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>storm-demo-simple2</name>
<description>Demo project for Spring Boot</description>
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>1.5.6.RELEASE</version>
<relativePath/> <!-- lookup parent from repository -->
</parent>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
<java.version>1.8</java.version>
<springboot.version>1.5.9.RELEASE</springboot.version>
<mybatis-spring-boot>1.2.0</mybatis-spring-boot>
<mysql-connector>5.1.44</mysql-connector>
<slf4j.version>1.7.25</slf4j.version>
<logback.version>1.2.3</logback.version>
<kafka.version>1.0.0</kafka.version>
<storm.version>1.2.2</storm.version>
<fastjson.version>1.2.41</fastjson.version>
<druid>1.1.8</druid>
</properties>
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
<!-- Storm dependencies -->
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-core</artifactId>
<version>${storm.version}</version>
<!-- exclude conflicting logging dependencies -->
<exclusions>
<exclusion>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-slf4j-impl</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-1.2-api</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-web</artifactId>
</exclusion>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
<exclusion>
<artifactId>ring-cors</artifactId>
<groupId>ring-cors</groupId>
</exclusion>
</exclusions>
<!-- keep the provided scope commented out during development -->
<!--<scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-classic</artifactId>
<version>${logback.version}</version>
</dependency>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-core</artifactId>
<version>${logback.version}</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
</plugin>
</plugins>
</build>
</project>
2.3.2 Write the spout class
package com.hcycom.stormdemosimple2.transaction1;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.transactional.ITransactionalSpout;
import org.apache.storm.tuple.Fields;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
/**
* @Auther: CYL
* @Date: 2018/7/23 09:07
* @Description:
*/
public class MyTxSpout implements ITransactionalSpout<MyMata> {
// serialization version id
private static final long serialVersionUID = 1L;
/*
data source
*/
Map<Long,String> dbMap = null;
/**
*
* Description: initializes the data by generating 100 consecutive records
*
* @param:
* @return:
* @auther: CYL
* @date: 2018/7/23 14:08
*/
public MyTxSpout(){
Random random = new Random();
dbMap = new HashMap<Long, String> ();
String[] hosts = { "www.taobao.com" };
String[] session_id = { "ABYH6Y4V4SCVXTG6DPB4VH9U123", "XXYH6YCGFJYERTT834R52FDXV9U34", "BBYH61456FGHHJ7JL89RG5VV9UYU7",
"CYYH6Y2345GHI899OFG4V9U567", "VVVYH6Y4V4SFXZ56JIPDPB4V678" };
String[] time = { "2014-01-07 08:40:50", "2014-01-07 08:40:51", "2014-01-07 08:40:52", "2014-01-07 08:40:53",
"2014-01-07 09:40:49", "2014-01-07 10:40:49", "2014-01-07 11:40:49", "2014-01-07 12:40:49" };
for (long i = 0; i < 100; i++) {
dbMap.put(i,hosts[0]+"\t"+session_id[random.nextInt(5)]+"\t"+time[random.nextInt(8)]);
}
}
/**
*
* Description: returns the coordinator, which produces the metadata for each transaction
*
* @param:
* @return:
* @auther: CYL
* @date: 2018/7/23 14:08
*/
@Override
public Coordinator<MyMata> getCoordinator(Map map, TopologyContext topologyContext) {
return new MyCoordinator();
}
/**
*
* Description: returns the emitter, which emits the tuples of each batch
*
* @param:
* @return:
* @auther: CYL
* @date: 2018/7/23 14:08
*/
@Override
public Emitter<MyMata> getEmitter(Map map, TopologyContext topologyContext) {
return new MyEmitter(dbMap);
}
/**
*
* Description: declares the output fields
*
* @param:
* @return:
* @auther: CYL
* @date: 2018/7/23 14:08
*/
@Override
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
outputFieldsDeclarer.declare(new Fields("tx","log"));
}
/**
*
* Description: returns the component-specific Storm configuration (none here)
*
* @param:
* @return:
* @auther: CYL
* @date: 2018/7/23 14:09
*/
@Override
public Map<String, Object> getComponentConfiguration() {
return null;
}
}
2.3.3 Write the MyCoordinator class mentioned above
package com.hcycom.stormdemosimple2.transaction1;
import org.apache.storm.transactional.ITransactionalSpout;
import org.apache.storm.utils.Utils;
import java.math.BigInteger;
/**
* @Auther: CYL
* @Date: 2018/7/23 09:05
* @Description: a nested interface of ITransactionalSpout; decides when to start a transaction and sends it to the "batch emit" stream
*/
public class MyCoordinator implements ITransactionalSpout.Coordinator<MyMata>{
// number of tuples in each batch
public static int BATCH_NUM = 10 ;
/**
*
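* Description: initializes a transaction; in essence it just returns the metadata
*
* @param: bigInteger: the transaction id; myMata: the previous metadata
* @return: MyMata, which defines the start position and the number of tuples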
* @auther: CYL
* @date: 2018/7/23 11:55
*/
@Override
public MyMata initializeTransaction(BigInteger bigInteger, MyMata myMata) {
long beginPoint = 0;
// if there is no previous metadata, start from the beginning; otherwise continue after the previous batch
if(myMata == null){
beginPoint = 0;
} else {
beginPoint = myMata.getBeginPoint() + myMata.getNum();
}
MyMata mata = new MyMata();
mata.setBeginPoint(beginPoint);
mata.setNum(BATCH_NUM);
System.err.println("Starting a transaction: "+mata.toString());
return mata;
}
/**
*
* Description: when this returns true, a transaction starts and the Coordinator emits the transactional tuple to the "batch emit" stream
*
* @param:
* @return:
* @auther: CYL
* @date: 2018/7/23 11:55
*/
@Override
public boolean isReady() {
// sleep for 2 seconds
Utils.sleep(2000);
return true;
}
/**
*
* Description: releases resources
*
* @param:
* @return:
* @auther: CYL
* @date: 2018/7/23 11:55
*/
@Override
public void close() {
}
}
2.3.4 Write the MyEmitter class mentioned above
package com.hcycom.stormdemosimple2.transaction1;
import org.apache.storm.coordination.BatchOutputCollector;
import org.apache.storm.transactional.ITransactionalSpout;
import org.apache.storm.transactional.TransactionAttempt;
import org.apache.storm.tuple.Values;
import java.math.BigInteger;
import java.util.Map;
/**
* @Auther: CYL
* @Date: 2018/7/23 09:05
* @Description: emits the received data to the downstream bolts;
* receives the "batch emit" stream by broadcast
*/
public class MyEmitter implements ITransactionalSpout.Emitter<MyMata> {
Map<Long,String> dbMap = null;
public MyEmitter(Map<Long,String> dbMap){
this.dbMap = dbMap;
}
/**
*
* Description: loops over the tuples of the batch and emits them
*
* @param: transactionAttempt: the transaction identifier; myMata: the metadata; batchOutputCollector: the collector used to emit
* @return:
* @auther: CYL
* @date: 2018/7/23 12:02
*/
@Override
public void emitBatch(TransactionAttempt transactionAttempt, MyMata myMata, BatchOutputCollector batchOutputCollector) {
long beginPoint = myMata.getBeginPoint() ;
int num = myMata.getNum() ;
for (long i = beginPoint; i < num+beginPoint; i++) {
if (dbMap.get(i)==null) {
continue;
}
batchOutputCollector.emit(new Values(transactionAttempt,dbMap.get(i)));
}
}
/**
*
* Description: cleans up the information of earlier transactions
*
* @param:
* @return:
* @auther: CYL
* @date: 2018/7/23 12:02
*/
@Override
public void cleanupBefore(BigInteger bigInteger) {
}
@Override
public void close() {
}
}
2.3.5 Write the MyMata class used above (it holds the batch metadata)
package com.hcycom.stormdemosimple2.transaction1;
import java.io.Serializable;
/**
* @Auther: CYL
* @Date: 2018/7/23 09:06
* @Description: must implement Serializable
*/
public class MyMata implements Serializable {
private static final long serialVersionUID = 1L;
// start position of the transaction
private long beginPoint;
// number of tuples in the batch
private int num;
public long getBeginPoint() {
return beginPoint;
}
public void setBeginPoint(long beginPoint) {
this.beginPoint = beginPoint;
}
public int getNum() {
return num;
}
public void setNum(int num) {
this.num = num;
}
@Override
public String toString() {
return getBeginPoint()+"--------------"+getNum();
}
}
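At this point the spout is essentially complete. Note that the coordinator and the emitter are both instantiated from within the spout. In the 0.9.0 API they are declared as nested interfaces of the spout; they can be implemented as inner classes or, as here, as separate classes.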
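The reason the metadata holds a start position and a count is that a replay must reproduce exactly the same batch. A minimal sketch of that property, using a hypothetical helper that mirrors the range loop of MyEmitter without any Storm classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper showing that (beginPoint, num) fully determines a batch. */
final class ReplayableBatchSketch {

    /** Returns the records in [beginPoint, beginPoint + num), skipping missing keys. */
    static List<String> batchFor(Map<Long, String> source, long beginPoint, int num) {
        List<String> batch = new ArrayList<>();
        for (long i = beginPoint; i < beginPoint + num; i++) {
            String record = source.get(i);
            if (record != null) {
                batch.add(record);
            }
        }
        return batch;
    }
}
```

Calling batchFor twice with the same metadata returns equal lists as long as the source is unchanged, which is the property a transactional (non-opaque) spout relies on.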
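2.3.6 Write the MyDailyBatchBolt class
This class performs the first processing step: its tasks count, in parallel and per day, the tuples each of them receives, and pass the partial counts to the next stage.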
package com.hcycom.stormdemosimple2.daily;
import com.hcycom.stormdemosimple2.tools.DateFmt;
import org.apache.storm.coordination.BatchOutputCollector;
import org.apache.storm.coordination.IBatchBolt;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.transactional.TransactionAttempt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import java.util.HashMap;
import java.util.Map;
public class MyDailyBatchBolt implements IBatchBolt<TransactionAttempt> {
/**
*
*/
private static final long serialVersionUID = 1L;
Map<String, Integer> countMap = new HashMap<String, Integer>();
BatchOutputCollector collector ;
Integer count = null;
String today = null;
TransactionAttempt tx = null;
@Override
public void execute(Tuple tuple) {
// TODO Auto-generated method stub
String log = tuple.getString(1);
tx = (TransactionAttempt)tuple.getValue(0);
if (log != null && log.split("\\t").length >=3 ) {
today = DateFmt.getCountDate(log.split("\\t")[2], DateFmt.date_short) ;
count = countMap.get(today);
if(count == null)
{
count = 0;
}
count ++ ;
countMap.put(today, count);
}
}
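@Override
public void finishBatch() {
// emit one tuple per date counted in this batch, not only the last date seen
for (Map.Entry<String, Integer> entry : countMap.entrySet()) {
collector.emit(new Values(tx, entry.getKey(), entry.getValue()));
}
}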
@Override
public void prepare(Map conf, TopologyContext context,
BatchOutputCollector collector, TransactionAttempt id) {
// TODO Auto-generated method stub
this.collector = collector;
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
// TODO Auto-generated method stub
declarer.declare(new Fields("tx","date","count"));
}
@Override
public Map<String, Object> getComponentConfiguration() {
// TODO Auto-generated method stub
return null;
}
}
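2.3.7 Write the MyDailyCommitterBolt class
This class subscribes to the partial counts produced by the class above, aggregates them per day into the final result, and is registered as the committer, so it is responsible for committing the transaction.
Note: the parallelism of this class in the topology must be 1, otherwise the count is not guaranteed to be accurate.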
package com.hcycom.stormdemosimple2.daily;
import org.apache.storm.coordination.BatchOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseTransactionalBolt;
import org.apache.storm.transactional.ICommitter;
import org.apache.storm.transactional.TransactionAttempt;
import org.apache.storm.tuple.Tuple;
import java.math.BigInteger;
import java.util.HashMap;
import java.util.Map;
public class MyDailyCommitterBolt extends BaseTransactionalBolt implements ICommitter {
/**
*
*/
private static final long serialVersionUID = 1L;
public static final String GLOBAL_KEY = "GLOBAL_KEY";
public static Map<String, DbValue> dbMap = new HashMap<String, DbValue>() ;
Map<String, Integer> countMap = new HashMap<String, Integer>();
TransactionAttempt id ;
BatchOutputCollector collector;
String today = null;
@Override
public void execute(Tuple tuple) {
today = tuple.getString(1) ;
Integer count = tuple.getInteger(2);
id = (TransactionAttempt)tuple.getValue(0);
if (today !=null && count != null) {
Integer batchCount = countMap.get(today) ;
if (batchCount == null) {
batchCount = 0;
}
batchCount += count ;
countMap.put(today, batchCount);
}
}
@Override
public void finishBatch() {
if (countMap.size() > 0) {
DbValue value = dbMap.get(GLOBAL_KEY);
if (value == null || !value.txid.equals(id.getTransactionId())) {
// this txid has not been applied yet: update the "database"
DbValue newValue = new DbValue();
newValue.txid = id.getTransactionId();
newValue.dateStr = today;
// sum every date counted in this batch instead of looking up a hard-coded "2014-01-07"
int batchCount = 0;
for (Integer c : countMap.values()) {
batchCount += c;
}
newValue.count = (value == null) ? batchCount : value.count + batchCount;
dbMap.put(GLOBAL_KEY, newValue);
}
}
// guard against an empty first batch, where nothing has been stored yet
DbValue stored = dbMap.get(GLOBAL_KEY);
if (stored != null) {
System.out.println("total==========================:" + stored.count);
}
}
@Override
public void prepare(Map conf, TopologyContext context,
BatchOutputCollector collector, TransactionAttempt id) {
// TODO Auto-generated method stub
this.id = id ;
this.collector = collector;
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
// TODO Auto-generated method stub
}
public static class DbValue
{
BigInteger txid;
int count = 0;
String dateStr;
}
}
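2.3.8 Create the topology
TransactionalTopologyBuilder was marked as deprecated after version 0.9.0, and transactions were handed over to Trident. Since the goal here is to understand the principles, we do not worry about that.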
package com.hcycom.stormdemosimple2.daily;
import com.hcycom.stormdemosimple2.transaction1.MyTxSpout;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.generated.AlreadyAliveException;
import org.apache.storm.generated.AuthorizationException;
import org.apache.storm.generated.InvalidTopologyException;
import org.apache.storm.transactional.TransactionalTopologyBuilder;
public class MyDailyTopo {
/**
* @param args
*/
public static void main(String[] args) {
// TODO Auto-generated method stub
TransactionalTopologyBuilder builder = new TransactionalTopologyBuilder("ttbId","spoutid",new MyTxSpout(),1);
builder.setBolt("bolt1", new MyDailyBatchBolt(),3).shuffleGrouping("spoutid");
builder.setBolt("committer", new MyDailyCommitterBolt(),1).shuffleGrouping("bolt1") ;
Config conf = new Config() ;
conf.setDebug(true);
if (args.length > 0) {
try {
StormSubmitter.submitTopology(args[0], conf, builder.buildTopology());
} catch (AlreadyAliveException e) {
e.printStackTrace();
} catch (InvalidTopologyException e) {
e.printStackTrace();
} catch (AuthorizationException e) {
e.printStackTrace();
}
}else {
LocalCluster localCluster = new LocalCluster();
localCluster.submitTopology("mytopology", conf, builder.buildTopology());
}
}
}
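2.3.9 Results
The log output is too large to show in full, so only part of it is shown here.
2.3.10 The whole execution process
Summary diagram of the execution process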
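2.4 Partitioned transaction example
2.4.1 Write the spout class
Here the coordinator and emitter are written as inner classes. Comparing them with the earlier coordinator and emitter classes shows the differences:
1) The coordinator gains a numPartitions method that returns the number of partitions, and loses the method that initializes the metadata, so it no longer supplies the start position and the number of tuples.
2) The emitter has two methods, emitPartitionBatchNew and emitPartitionBatch, which emit a new batch (creating its metadata) and re-emit an existing batch.
MyMata can be reused as defined above, so it is not repeated here.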
package com.hcycom.storm.partition;
import org.apache.storm.coordination.BatchOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.transactional.TransactionAttempt;
import org.apache.storm.transactional.partitioned.IPartitionedTransactionalSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
public class MyPtTxSpout implements IPartitionedTransactionalSpout<MyMata>{
private static final long serialVersionUID = 1L;
public static final int BATCH_NUM = 10;
public Map<Integer,Map<Long,String>> PT_DATA_MAP = new HashMap<Integer,Map<Long,String>>();
/**
* Initializes the data: assume the data source has 5 partitions with 100 records each
*/
public MyPtTxSpout(){
Random random = new Random();
String[] hosts = { "www.taobao.com" };
String[] session_id = { "ABYH6Y4V4SCVXTG6DPB4VH9U123", "XXYH6YCGFJYERTT834R52FDXV9U34", "BBYH61456FGHHJ7JL89RG5VV9UYU7",
"CYYH6Y2345GHI899OFG4V9U567", "VVVYH6Y4V4SFXZ56JIPDPB4V678" };
String[] time = { "2014-01-07 08:40:50", "2014-01-07 08:40:51", "2014-01-07 08:40:52", "2014-01-07 08:40:53",
"2014-01-07 09:40:49", "2014-01-07 10:40:49", "2014-01-07 11:40:49", "2014-01-07 12:40:49" };
for (int j=0;j<5;j++) {
// build 5 partitions (matching numPartitions()) with 100 lines each
HashMap<Long,String> dbMap = new HashMap<Long, String> ();
for (long i = 0; i < 100; i++) {
dbMap.put(i,hosts[0]+"\t"+session_id[random.nextInt(5)]+"\t"+time[random.nextInt(8)]);
}
PT_DATA_MAP.put(j,dbMap);
}
}
@Override
public Coordinator getCoordinator(Map map, TopologyContext topologyContext) {
return new MyCoordinator();
}
@Override
public Emitter<MyMata> getEmitter(Map map, TopologyContext topologyContext) {
return new MyEmitter();
}
@Override
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
outputFieldsDeclarer.declare(new Fields("tx","log"));
}
@Override
public Map<String, Object> getComponentConfiguration() {
return null;
}
public class MyCoordinator implements IPartitionedTransactionalSpout.Coordinator{
/**
* Returns the number of partitions
* @return
*/
@Override
public int numPartitions() {
return 5;
}
/**
* Returns whether a new transaction may start emitting
* @return
*/
@Override
public boolean isReady() {
Utils.sleep(1000);
return true;
}
@Override
public void close() {
}
}
public class MyEmitter implements IPartitionedTransactionalSpout.Emitter<MyMata>{
@Override
public MyMata emitPartitionBatchNew(TransactionAttempt transactionAttempt, BatchOutputCollector batchOutputCollector, int i, MyMata myMata) {
long beginPoint = 0;
// myMata is the metadata of the previous batch
if (myMata == null) {
beginPoint = 0 ;
}else {
beginPoint = myMata.getBeginPoint() + myMata.getNum() ;
}
MyMata mata = new MyMata() ;
mata.setBeginPoint(beginPoint);
mata.setNum(BATCH_NUM);
// emit the tuples of this new batch
emitPartitionBatch(transactionAttempt,batchOutputCollector,i,mata);
System.err.println("Starting a transaction: "+mata.toString());
return mata;
}
/**
* Method that actually emits the data
* @param transactionAttempt
* @param batchOutputCollector
* @param i
* @param
*/
@Override
public void emitPartitionBatch(TransactionAttempt transactionAttempt, BatchOutputCollector batchOutputCollector, int i, MyMata myMata) {
System.out.println("emitPartitionBatch partition: "+i);
long beginPoint = myMata.getBeginPoint() ;
int num = myMata.getNum() ;
Map<Long ,String> dbMap = PT_DATA_MAP.get(i);
for (long y = beginPoint; y < num+beginPoint; y++) {
if (dbMap.get(y)==null) {
break;
}
batchOutputCollector.emit(new Values(transactionAttempt,dbMap.get(y)));
}
}
@Override
public void close() {
}
}
}
2.4.2 Write the bolt
The bolt gets a few small adjustments here. They are not Storm-level changes, but the code is listed anyway for convenience.
package com.hcycom.storm.partition;
import com.hcycom.storm.tools.DateFmt;
import org.apache.storm.coordination.BatchOutputCollector;
import org.apache.storm.coordination.IBatchBolt;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.transactional.TransactionAttempt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import java.util.HashMap;
import java.util.Map;
public class MyDailyBatchBolt implements IBatchBolt<TransactionAttempt> {
/**
*
*/
private static final long serialVersionUID = 1L;
Map<String, Integer> countMap = new HashMap<String, Integer>();
BatchOutputCollector collector ;
Integer count = null;
String today = null;
TransactionAttempt tx = null;
@Override
public void execute(Tuple tuple) {
// TODO Auto-generated method stub
String log = tuple.getString(1);
tx = (TransactionAttempt)tuple.getValue(0);
if (log != null && log.split("\\t").length >=3 ) {
today = DateFmt.getCountDate(log.split("\\t")[2], DateFmt.date_short) ;
count = countMap.get(today);
if(count == null)
{
count = 0;
}
count ++ ;
countMap.put(today, count);
}
}
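@Override
public void finishBatch() {
// emit one tuple per date counted in this batch, not only the last date seen
for (Map.Entry<String, Integer> entry : countMap.entrySet()) {
System.out.println(tx+"------------"+entry.getKey()+"-----------"+entry.getValue());
collector.emit(new Values(tx, entry.getKey(), entry.getValue()));
}
}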
@Override
public void prepare(Map conf, TopologyContext context,
BatchOutputCollector collector, TransactionAttempt id) {
// TODO Auto-generated method stub
this.collector = collector;
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
// TODO Auto-generated method stub
declarer.declare(new Fields("tx","date","count"));
}
@Override
public Map<String, Object> getComponentConfiguration() {
// TODO Auto-generated method stub
return null;
}
}
2.4.3 Write the committer bolt
package com.hcycom.storm.partition;
import org.apache.storm.coordination.BatchOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseTransactionalBolt;
import org.apache.storm.transactional.ICommitter;
import org.apache.storm.transactional.TransactionAttempt;
import org.apache.storm.tuple.Tuple;
import java.math.BigInteger;
import java.util.HashMap;
import java.util.Map;
public class MyDailyCommitterBolt extends BaseTransactionalBolt implements ICommitter {
/**
*
*/
private static final long serialVersionUID = 1L;
public static final String GLOBAL_KEY = "GLOBAL_KEY";
public static Map<String, DbValue> dbMap = new HashMap<String, DbValue>() ;
Map<String, Integer> countMap = new HashMap<String, Integer>();
TransactionAttempt id ;
BatchOutputCollector collector;
String today = null;
@Override
public void prepare(Map map, TopologyContext topologyContext, BatchOutputCollector batchOutputCollector, TransactionAttempt transactionAttempt) {
this.id = transactionAttempt ;
this.collector = batchOutputCollector;
}
@Override
public void execute(Tuple tuple) {
today = tuple.getString(1) ;
Integer count = tuple.getInteger(2);
id = (TransactionAttempt)tuple.getValue(0);
if (today !=null && count != null) {
Integer batchCount = countMap.get(today) ;
if (batchCount == null) {
batchCount = 0;
}
batchCount += count ;
countMap.put(today, batchCount);
}
}
@Override
public void finishBatch() {
if (countMap.size() > 0) {
DbValue value = dbMap.get(GLOBAL_KEY);
if (value == null || !value.txid.equals(id.getTransactionId())) {
// this txid has not been applied yet: update the "database"
DbValue newValue = new DbValue();
newValue.txid = id.getTransactionId();
newValue.dateStr = today;
// sum every date counted in this batch instead of looking up a hard-coded "2014-01-07"
int batchCount = 0;
for (Integer c : countMap.values()) {
batchCount += c;
}
newValue.count = (value == null) ? batchCount : value.count + batchCount;
dbMap.put(GLOBAL_KEY, newValue);
}
}
// guard against an empty first batch, where nothing has been stored yet
DbValue stored = dbMap.get(GLOBAL_KEY);
if (stored != null) {
System.err.println("total==========================:" + stored.count);
}
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
// TODO Auto-generated method stub
}
public static class DbValue
{
BigInteger txid;
int count = 0;
String dateStr;
}
}
2.4.4 Write the topology
package com.hcycom.storm.partition;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.generated.AlreadyAliveException;
import org.apache.storm.generated.AuthorizationException;
import org.apache.storm.generated.InvalidTopologyException;
import org.apache.storm.transactional.TransactionalTopologyBuilder;
public class MyDailyTopo {
/**
* @param args
*/
public static void main(String[] args) {
// TODO Auto-generated method stub
TransactionalTopologyBuilder builder = new TransactionalTopologyBuilder("ttbId","spoutid",new MyPtTxSpout(),1);
builder.setBolt("bolt1", new MyDailyBatchBolt(),3).shuffleGrouping("spoutid");
builder.setBolt("committer", new MyDailyCommitterBolt(),1).shuffleGrouping("bolt1") ;
Config conf = new Config() ;
conf.setDebug(true);
if (args.length > 0) {
try {
StormSubmitter.submitTopology(args[0], conf, builder.buildTopology());
} catch (AlreadyAliveException e) {
e.printStackTrace();
} catch (InvalidTopologyException e) {
e.printStackTrace();
} catch (AuthorizationException e) {
e.printStackTrace();
}
}else {
LocalCluster localCluster = new LocalCluster();
localCluster.submitTopology("mytopology", conf, builder.buildTopology());
}
}
}
2.4.5 Results
Part of the output is shown here.
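2.5 Opaque partitioned transaction example
2.5.1 Write the spout class
Here the coordinator and emitter are again written as inner classes. Comparing them with the coordinator and emitter of the partitioned transaction example shows the differences:
1) The coordinator no longer has the numPartitions method.
2) The emitter gains the numPartitions method and loses emitPartitionBatchNew; the number of partitions is now handled by the emitter, which makes it easier to reassign batches.
Only the classes that need to be rewritten are shown here.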
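Before the listing, the opaque behaviour can be sketched without Storm: a batch reads whatever partitions are currently readable and advances only their offsets, so data from an unreadable partition simply shows up in a later batch. The class and its parameters are hypothetical, and the sketch covers only the successful path; in Storm the new offset is persisted only after the batch succeeds.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical model of an opaque reader that skips unreadable partitions. */
final class OpaqueReadSketch {
    private final Map<Integer, List<String>> partitions;
    private final Map<Integer, Integer> nextOffset = new HashMap<>();
    private final int batchSize;

    OpaqueReadSketch(Map<Integer, List<String>> partitions, int batchSize) {
        this.partitions = new TreeMap<>(partitions); // sorted for a stable read order
        this.batchSize = batchSize;
    }

    /** Builds one batch from every partition that is not in the unreadable set. */
    List<String> emitBatch(Set<Integer> unreadable) {
        List<String> batch = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> entry : partitions.entrySet()) {
            int partition = entry.getKey();
            if (unreadable.contains(partition)) {
                continue; // do not wait for the broken partition; its offset stays put
            }
            int from = nextOffset.getOrDefault(partition, 0);
            int to = Math.min(from + batchSize, entry.getValue().size());
            batch.addAll(entry.getValue().subList(from, to));
            nextOffset.put(partition, to);
        }
        return batch;
    }
}
```

Since the content of a batch now depends on which partitions were readable at emit time, downstream state has to keep the previous value in addition to the txid, as described in section 2.1.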
package com.hcycom.storm.opaque;
import com.hcycom.storm.partition.MyMata;
import org.apache.storm.coordination.BatchOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.transactional.TransactionAttempt;
import org.apache.storm.transactional.partitioned.IOpaquePartitionedTransactionalSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
public class MyOpaquePtTxSpout implements IOpaquePartitionedTransactionalSpout<MyMata>{
private static final long serialVersionUID = 1L;
public static int BATCH_NUM = 10;
public Map<Integer,Map<Long,String>> PT_DATA_MAP = new HashMap<Integer,Map<Long,String>>();
public MyOpaquePtTxSpout(){
Random random = new Random();
String[] hosts = { "www.taobao.com" };
String[] session_id = { "ABYH6Y4V4SCVXTG6DPB4VH9U123", "XXYH6YCGFJYERTT834R52FDXV9U34", "BBYH61456FGHHJ7JL89RG5VV9UYU7",
"CYYH6Y2345GHI899OFG4V9U567", "VVVYH6Y4V4SFXZ56JIPDPB4V678" };
String[] time = { "2014-01-07 08:40:50", "2014-01-07 08:40:51", "2014-01-07 08:40:52", "2014-01-07 08:40:53",
"2014-01-07 09:40:49", "2014-01-07 10:40:49", "2014-01-07 11:40:49", "2014-01-07 12:40:49" };
for (int j=0;j<5;j++) {
// build 5 partitions (matching numPartitions()) with 100 lines each
HashMap<Long,String> dbMap = new HashMap<Long, String> ();
for (long i = 0; i < 100; i++) {
dbMap.put(i,hosts[0]+"\t"+session_id[random.nextInt(5)]+"\t"+time[random.nextInt(8)]);
}
PT_DATA_MAP.put(j,dbMap);
}
}
@Override
public Emitter<MyMata> getEmitter(Map map, TopologyContext topologyContext) {
return new MyEmitter();
}
@Override
public Coordinator getCoordinator(Map map, TopologyContext topologyContext) {
return new MyCoordinator();
}
@Override
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
outputFieldsDeclarer.declare(new Fields("tx","log"));
}
@Override
public Map<String, Object> getComponentConfiguration() {
return null;
}
public class MyCoordinator implements IOpaquePartitionedTransactionalSpout.Coordinator{
@Override
public boolean isReady() {
Utils.sleep(1000);
return true;
}
@Override
public void close() {
}
}
public class MyEmitter implements IOpaquePartitionedTransactionalSpout.Emitter<MyMata>{
@Override
public MyMata emitPartitionBatch(TransactionAttempt transactionAttempt, BatchOutputCollector batchOutputCollector, int i, MyMata myMata) {
System.err.println("emitPartitionBatch partition:"+i);
long beginPoint = 0;
if(myMata == null){
beginPoint = 0;
} else {
beginPoint = myMata.getBeginPoint() + myMata.getNum();
}
MyMata mata=new MyMata();
mata.setBeginPoint(beginPoint);
mata.setNum(BATCH_NUM);
System.err.println("Starting a transaction: "+mata.toString());
Map<Long,String> batchMap = PT_DATA_MAP.get(i);
for (long j = mata.getBeginPoint(); j < mata.getBeginPoint()+mata.getNum(); j++) {
// Stop at the end of the partition instead of emitting null rows
if(!batchMap.containsKey(j)){
break;
}
batchOutputCollector.emit(new Values(transactionAttempt,batchMap.get(j)));
}
return mata;
}
/**
* The data is split into 5 partitions.
* @return the number of partitions
*/
@Override
public int numPartitions() {
return 5;
}
@Override
public void close() {
}
}
}
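The emitter above derives each batch from the metadata returned by the previous batch of the same partition: the new batch begins at beginPoint + num. Below is a minimal sketch of that offset arithmetic with no Storm dependency. BatchMeta and PartitionWindowSketch are hypothetical names used only for illustration; BatchMeta stands in for the MyMata class, which is assumed to hold just a start offset and a row count.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for MyMata: where a batch starts and how many rows it covers.
final class BatchMeta {
    final long beginPoint;
    final int num;

    BatchMeta(long beginPoint, int num) {
        this.beginPoint = beginPoint;
        this.num = num;
    }
}

final class PartitionWindowSketch {
    static final int BATCH_NUM = 10;

    // Same arithmetic as emitPartitionBatch: the next batch starts where the last one ended.
    static BatchMeta nextMeta(BatchMeta last) {
        long begin = (last == null) ? 0 : last.beginPoint + last.num;
        return new BatchMeta(begin, BATCH_NUM);
    }

    // Collects the rows covered by the metadata, stopping at the end of the partition.
    static List<String> rowsFor(Map<Long, String> partition, BatchMeta meta) {
        List<String> rows = new ArrayList<>();
        for (long j = meta.beginPoint; j < meta.beginPoint + meta.num; j++) {
            String row = partition.get(j);
            if (row == null) {
                break;
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<Long, String> partition = new HashMap<>();
        for (long i = 0; i < 25; i++) {
            partition.put(i, "row-" + i);
        }
        BatchMeta meta = null;
        for (int batch = 0; batch < 4; batch++) {
            meta = nextMeta(meta);
            System.out.println("begin=" + meta.beginPoint + " rows=" + rowsFor(partition, meta).size());
        }
    }
}
```

With a 25-row partition, four consecutive batches cover 10, 10, 5 and 0 rows. Because the window depends only on the previous metadata, a replayed batch can always be located again from what Storm has stored.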
2.5.2 Writing the batch bolt
package com.hcycom.storm.opaque;
import com.hcycom.storm.tools.DateFmt;
import org.apache.storm.coordination.BatchOutputCollector;
import org.apache.storm.coordination.IBatchBolt;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.transactional.TransactionAttempt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import java.util.HashMap;
import java.util.Map;
public class MyDailyBatchBolt implements IBatchBolt<TransactionAttempt> {
/**
* Counts the log lines of one batch per day and emits the partial count in finishBatch.
*/
private static final long serialVersionUID = 1L;
Map<String, Integer> countMap = new HashMap<String, Integer>();
BatchOutputCollector collector ;
Integer count = null;
String today = null;
TransactionAttempt tx = null;
@Override
public void execute(Tuple tuple) {
// Field 0 is the TransactionAttempt, field 1 is the tab-separated log line
String log = tuple.getString(1);
tx = (TransactionAttempt)tuple.getValue(0);
if (log != null && log.split("\\t").length >=3 ) {
today = DateFmt.getCountDate(log.split("\\t")[2], DateFmt.date_short) ;
count = countMap.get(today);
if(count == null)
{
count = 0;
}
count ++ ;
countMap.put(today, count);
}
}
@Override
public void finishBatch() {
System.out.println(tx+"------------"+today+"-----------"+count);
collector.emit(new Values(tx,today,count));
}
@Override
public void prepare(Map conf, TopologyContext context,
BatchOutputCollector collector, TransactionAttempt id) {
this.collector = collector;
// Keep the attempt so finishBatch has it even when the batch carries no tuples
this.tx = id;
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
// Partial count for one day, tagged with the transaction attempt
declarer.declare(new Fields("tx","date","count"));
}
@Override
public Map<String, Object> getComponentConfiguration() {
// No component-specific configuration
return null;
}
}
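The execute method above boils down to: split the line on tabs, take the third field, reduce it to a date, and increment that date's counter. Here is a minimal sketch of that step without Storm. DateFmt is a project helper that is not shown in this article, so dayOf below is only an assumed equivalent of DateFmt.getCountDate(..., DateFmt.date_short), namely keeping the yyyy-MM-dd part of the timestamp. DailyCountSketch is a hypothetical name.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;

final class DailyCountSketch {
    private static final DateTimeFormatter TS = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Assumed behaviour of DateFmt.getCountDate with date_short: keep only the date part.
    static String dayOf(String timestamp) {
        return LocalDateTime.parse(timestamp, TS).toLocalDate().toString();
    }

    // Counts well-formed lines (host \t session id \t timestamp) per day; other lines are skipped.
    static Map<String, Integer> countByDay(Iterable<String> logs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String log : logs) {
            if (log == null) {
                continue;
            }
            String[] fields = log.split("\t");
            if (fields.length < 3) {
                continue;
            }
            counts.merge(dayOf(fields[2]), 1, Integer::sum);
        }
        return counts;
    }
}
```

Keeping one counter per day in a map, rather than a single count field, means a batch that spans midnight still produces a correct count for each day.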
2.5.3 Writing the committer bolt
package com.hcycom.storm.opaque;
import org.apache.storm.coordination.BatchOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseTransactionalBolt;
import org.apache.storm.transactional.ICommitter;
import org.apache.storm.transactional.TransactionAttempt;
import org.apache.storm.tuple.Tuple;
import java.math.BigInteger;
import java.util.HashMap;
import java.util.Map;
public class MyDailyCommitterBolt extends BaseTransactionalBolt implements ICommitter {
/**
* Sums the partial counts of one batch and, in finishBatch, merges them into the stored total.
*/
private static final long serialVersionUID = 1L;
public static final String GLOBAL_KEY = "GLOBAL_KEY";
public static Map<String, DbValue> dbMap = new HashMap<String, DbValue>() ;
Map<String, Integer> countMap = new HashMap<String, Integer>();
TransactionAttempt id ;
BatchOutputCollector collector;
String today = null;
@Override
public void prepare(Map map, TopologyContext topologyContext, BatchOutputCollector batchOutputCollector, TransactionAttempt transactionAttempt) {
this.id = transactionAttempt ;
this.collector = batchOutputCollector;
}
@Override
public void execute(Tuple tuple) {
today = tuple.getString(1) ;
Integer count = tuple.getInteger(2);
id = (TransactionAttempt)tuple.getValue(0);
if (today !=null && count != null) {
Integer batchCount = countMap.get(today) ;
if (batchCount == null) {
batchCount = 0;
}
batchCount += count ;
countMap.put(today, batchCount);
}
}
@Override
public void finishBatch() {
// For a committer bolt, Storm calls finishBatch in strict txid order
if (countMap.size() > 0) {
DbValue value = dbMap.get(today);
DbValue newValue ;
if (value == null || !value.txid.equals(id.getTransactionId())) {
// Update the database
newValue = new DbValue();
newValue.txid = id.getTransactionId() ;
newValue.dateStr = today;
if (value == null) {
newValue.count = countMap.get(today) ;
newValue.pre_count = 0;
}else {
newValue.count = value.count + countMap.get(today) ;
newValue.pre_count = value.count;
}
dbMap.put(today, newValue);
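}else
{
// Same txid: this batch is being replayed. An opaque spout may replay it with
// different tuples, so rebuild the total from pre_count instead of keeping the old one.
value.count = value.pre_count + countMap.get(today);
newValue = value;
}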
System.err.println("total==========================:"+dbMap.get(today).count);
}
// collector.emit(tuple)
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
// The committer is the last bolt in the topology and emits nothing
}
public static class DbValue
{
BigInteger txid;
int count = 0;
String dateStr;
int pre_count ;
}
}
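The DbValue class stores both count and pre_count next to the txid. Those three fields are exactly what the usual update rule for opaque transactions needs. A minimal sketch of that rule, independent of Storm, follows. OpaqueCountSketch is a hypothetical name, and the rule is the one generally described for opaque transactional state, not necessarily line for line what the bolt above does.

```java
import java.math.BigInteger;

final class OpaqueCountSketch {
    BigInteger txid;  // transaction that last updated the value
    int prevCount;    // value before that transaction was applied
    int count;        // current value

    // Applies the count of one batch under the opaque-transaction rule.
    void apply(BigInteger currentTxid, int batchCount) {
        if (currentTxid.equals(txid)) {
            // Replay of the most recent transaction: its earlier contribution is
            // discarded, because the replayed batch may contain different tuples.
            count = prevCount + batchCount;
        } else {
            // New transaction: remember the old value, then add the batch.
            prevCount = count;
            count = count + batchCount;
            txid = currentTxid;
        }
    }
}
```

For example, applying txid 1 with 10, txid 2 with 5, and then txid 2 again with 7 leaves the count at 17 rather than 22: the second attempt of txid 2 replaces the first instead of being added on top of it.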
2.5.4 Writing the topology
package com.hcycom.storm.opaque;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.generated.AlreadyAliveException;
import org.apache.storm.generated.AuthorizationException;
import org.apache.storm.generated.InvalidTopologyException;
import org.apache.storm.transactional.TransactionalTopologyBuilder;
public class MyDailyTopo {
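/**
* @param args optional topology name; if present, submit to the cluster, otherwise run in a LocalCluster
*/
public static void main(String[] args) {
// Opaque partitioned spout -> 3 batch bolt tasks -> 1 committer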
TransactionalTopologyBuilder builder = new TransactionalTopologyBuilder("ttbId","spoutid",new MyOpaquePtTxSpout(),1);
builder.setBolt("bolt1", new MyDailyBatchBolt(),3).shuffleGrouping("spoutid");
builder.setBolt("committer", new MyDailyCommitterBolt(),1).shuffleGrouping("bolt1") ;
Config conf = new Config() ;
conf.setDebug(false);
if (args.length > 0) {
try {
StormSubmitter.submitTopology(args[0], conf, builder.buildTopology());
} catch (AlreadyAliveException e) {
e.printStackTrace();
} catch (InvalidTopologyException e) {
e.printStackTrace();
} catch (AuthorizationException e) {
e.printStackTrace();
}
}else {
LocalCluster localCluster = new LocalCluster();
localCluster.submitTopology("mytopology", conf, builder.buildTopology());
}
}
}
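2.5.5 Results
Partial output:
Summary:
Across these three kinds of transaction, the main change is that responsibility moves step by step from the coordinator class to the emitter. In terms of practicality and safety, the opaque transaction is the best choice.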