Trident State Mechanism

Quan.S

Article link: https://blog.csdn.net/xianzhen376/article/details/52982188

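Preface

This post records my understanding of Trident state.
Official documentation: http://storm.apache.org/releases/current/Trident-state.html

How should stateful and stateless processing be understood? As a piece of data flows through a topology, the difference is that with state we know where the data came from and which nodes have already processed it. That knowledge is a prerequisite for rolling back after an error.

The official document approaches the topic from two sides, Spout and State, and focuses on data replay and failure handling.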

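1 Background

The three elements of state management:

Trident divides tuples into batches for processing;
Each batch is assigned a transaction ID (txid), and if the batch is replayed, the txid stays the same;
State updates are applied in txid order.

With these three properties we can tell whether a batch has already been processed and, if it has not, how to handle it. Before going into Spout and State, a word on how they divide the work:

Spout splits tuples into batches and assigns the transaction ID;
State provides the storage and retrieval mechanism and manages the state.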

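2 Spouts

Note: when the official document describes Spouts, it mixes in how State stores its data; see the division of work above. For example, the description of transactional spouts is really about the combination of a transactional spout and a transactional state.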

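2.1 Transactional spouts

Characteristics:

When a batch is replayed, its contents must be identical to the original batch;
Batches whose txid is less than or equal to the stored txid are skipped;
Storage format: value + txid

[count=3, txid=3]

Example:

Incoming data count=x, txid=a, with a<=3: not processed;
Incoming data count=x, txid=a, with a=4: the state is updated to count=3+x, txid=4.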
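
The rule above can be sketched in a few lines of Java. This is only an illustration of the comparison logic, not Storm code: the class `TransactionalCounter` and its methods are hypothetical names made up for this post, and a real Trident state would persist the value and txid together in an external store.

```java
// Sketch only: TransactionalCounter is a hypothetical class, not part of Storm.
final class TransactionalCounter {
    private long count;
    private long txid;

    TransactionalCounter(long count, long txid) {
        this.count = count;
        this.txid = txid;
    }

    /** Applies one batch's partial count; returns true if the state changed. */
    boolean apply(long batchTxid, long batchCount) {
        if (batchTxid <= txid) {
            // Already applied. A replayed batch carries identical data,
            // so skipping it loses nothing.
            return false;
        }
        count += batchCount;
        txid = batchTxid;
        return true;
    }

    long count() { return count; }

    long txid() { return txid; }

    public static void main(String[] args) {
        TransactionalCounter state = new TransactionalCounter(3, 3);
        state.apply(3, 2); // replay of batch 3: skipped
        state.apply(4, 2); // new batch 4: applied
        System.out.println("count=" + state.count() + ", txid=" + state.txid());
        // prints "count=5, txid=4"
    }
}
```

Skipping is only safe because the spout guarantees that a replayed batch is identical to the original; without that guarantee the stored count could be wrong.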

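Drawback: if the data source is unreliable and part of the data to be replayed has been lost, the replayed batch can never match the original, and processing stalls at the spout.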

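2.2 Opaque transactional spouts

Compared with transactional spouts there is one improvement: when the replayed data is not strictly identical to the original data, the newly arrived data wins.
The stored state gains one extra field that holds the previous value, for example:

{ value = 6,
  prevValue = 4,
  txid = 3
}

Example:

Incoming data value=x, txid=a, with a<3: not processed;
Incoming data value=x, txid=a, with a=3: the state is updated to {value=4+x, prevValue=4, txid=3};
Incoming data value=x, txid=a, with a=4: the state is updated to {value=6+x, prevValue=6, txid=4}.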
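
The three cases can be written down as a small Java sketch. As before, this is an illustration under assumptions rather than Storm code: `OpaqueCounter` is a hypothetical class invented for this post, and it keeps everything in memory instead of in a real store.

```java
// Sketch only: OpaqueCounter is a hypothetical class, not part of Storm.
final class OpaqueCounter {
    private long value;
    private long prevValue;
    private long txid;

    OpaqueCounter(long value, long prevValue, long txid) {
        this.value = value;
        this.prevValue = prevValue;
        this.txid = txid;
    }

    /** Applies one batch's partial count; returns true if the state changed. */
    boolean apply(long batchTxid, long batchCount) {
        if (batchTxid < txid) {
            return false; // older batch: ignore
        }
        if (batchTxid == txid) {
            // Replay whose contents may differ from the first attempt:
            // discard the earlier result and recompute from prevValue.
            value = prevValue + batchCount;
        } else {
            // New batch: the current value becomes the previous value.
            prevValue = value;
            value = value + batchCount;
            txid = batchTxid;
        }
        return true;
    }

    long value() { return value; }

    long prevValue() { return prevValue; }

    long txid() { return txid; }

    public static void main(String[] args) {
        OpaqueCounter state = new OpaqueCounter(6, 4, 3);
        state.apply(3, 5); // replay of batch 3 with different data: value = 4 + 5
        state.apply(4, 1); // new batch 4: prevValue = 9, value = 9 + 1
        System.out.println("value=" + state.value()
                + ", prevValue=" + state.prevValue()
                + ", txid=" + state.txid());
        // prints "value=10, prevValue=9, txid=4"
    }
}
```

Keeping `prevValue` is what makes it possible to undo the contribution of a batch whose contents changed between attempts.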

Drawback: one extra field has to be stored.

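2.3 Non-transactional spouts

Non-transactional spouts give no guarantee about what each batch contains, so data may be processed more than once (or lost, if failed batches are not replayed).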

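2.4 Summary

So far the discussion has covered Spout and State working together. Defining the two parts separately, and leaving the non-transactional type aside:
Spout summary

Spout type	Key point
Transactional	A replayed batch must be identical to the original
Opaque transactional	A replayed batch may differ from the original

State summary

State type	Key point
Transactional	Stores the result and the txid; no update when the txid is the same
Opaque transactional	Stores the result and the previous result; recomputes from the previous result when the txid is the same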

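3 State APIs

API overview

(1) The State interface

public interface State {
    // called before the state updates for a batch begin
    void beginCommit(Long txid);
    // called after the state updates for a batch have been applied
    void commit(Long txid);
}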
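
To make the call order concrete, here is a minimal sketch of a state that just records what happens to it. It redeclares the two-method interface shown above so that it compiles on its own; `LoggingState` and its `add` method are hypothetical names for this post, and in a real topology it is Trident, not user code, that calls `beginCommit` and `commit`.

```java
import java.util.ArrayList;
import java.util.List;

// Local copy of the two-method interface shown above, so the sketch compiles on its own.
interface State {
    void beginCommit(Long txid);
    void commit(Long txid);
}

// Hypothetical state that records the order of calls; not a Storm class.
final class LoggingState implements State {
    final List<String> log = new ArrayList<String>();
    private Long currentTxid;

    @Override
    public void beginCommit(Long txid) {
        currentTxid = txid;
        log.add("begin " + txid);
    }

    // Stand-in for whatever update method a real state would expose.
    void add(String item) {
        log.add("update " + item + " in tx " + currentTxid);
    }

    @Override
    public void commit(Long txid) {
        log.add("commit " + txid);
        currentTxid = null;
    }

    public static void main(String[] args) {
        // Simulate what the framework does for one batch.
        LoggingState state = new LoggingState();
        state.beginCommit(1L);
        state.add("a");
        state.add("b");
        state.commit(1L);
        for (String line : state.log) {
            System.out.println(line);
        }
    }
}
```

A state backed by a database could use the same two hooks to open and close a write transaction, or to remember the txid it needs to store next to each value.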

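(2) QueryFunction, the state query interface

public interface QueryFunction<S extends State, T> extends EachOperation {
    // first, look up the whole batch at once; the returned list holds one result per input tuple, in the same order
    List<T> batchRetrieve(S state, List<TridentTuple> args);
    // then process each tuple together with its own result
    void execute(TridentTuple tuple, T result, TridentCollector collector);
}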
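
The two-step contract (one bulk lookup, then one `execute` call per tuple) is easier to see with a small driver. The sketch below is a simplified stand-in and not how Storm is implemented: `QuerySketch`, `SimpleQuery` and `runQuery` are hypothetical names, tuples are modelled as `List<Object>`, the collector as a plain list, and the "state" is just a `Map`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: a simplified, hypothetical stand-in for the QueryFunction contract.
final class QuerySketch {

    interface SimpleQuery<S, T> {
        List<T> batchRetrieve(S state, List<List<Object>> args);
        void execute(List<Object> tuple, T result, List<List<Object>> out);
    }

    // Plays the role of the framework: one bulk lookup, then one execute per tuple.
    static <S, T> List<List<Object>> runQuery(SimpleQuery<S, T> query, S state,
                                              List<List<Object>> batch) {
        List<T> results = query.batchRetrieve(state, batch);
        if (results.size() != batch.size()) {
            throw new IllegalStateException("batchRetrieve must return one result per tuple");
        }
        List<List<Object>> out = new ArrayList<List<Object>>();
        for (int i = 0; i < batch.size(); i++) {
            query.execute(batch.get(i), results.get(i), out);
        }
        return out;
    }

    static List<List<Object>> demo() {
        Map<Long, String> locations = new HashMap<Long, String>();
        locations.put(1L, "Beijing");
        locations.put(2L, "Shanghai");

        SimpleQuery<Map<Long, String>, String> query =
                new SimpleQuery<Map<Long, String>, String>() {
            @Override
            public List<String> batchRetrieve(Map<Long, String> state,
                                              List<List<Object>> args) {
                List<String> found = new ArrayList<String>();
                for (List<Object> tuple : args) {
                    found.add(state.get((Long) tuple.get(0)));
                }
                return found;
            }

            @Override
            public void execute(List<Object> tuple, String location,
                                List<List<Object>> out) {
                out.add(Arrays.<Object>asList(tuple.get(0), location));
            }
        };

        List<List<Object>> batch = new ArrayList<List<Object>>();
        batch.add(Arrays.<Object>asList(1L));
        batch.add(Arrays.<Object>asList(2L));
        return runQuery(query, locations, batch);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "[[1, Beijing], [2, Shanghai]]"
    }
}
```

Splitting the work this way lets the state answer a whole batch with a single bulk read instead of one round trip per tuple.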

(3) StateUpdater, the state update interface

public interface StateUpdater<S extends State> extends Operation {
    void updateState(S state, List<TridentTuple> tuples, TridentCollector collector);
}

A fairly long example from the official documentation:

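/* Defines the basic state operations: storing and retrieving */
public class LocationDB implements State {
    public void beginCommit(Long txid) {
    }

    public void commit(Long txid) {
    }

    public void setLocationsBulk(List<Long> userIds, List<String> locations) {
      // set locations in bulk
    }

    public List<String> bulkGetLocations(List<Long> userIds) {
      // get locations in bulk; the original stub has no body, so this placeholder only makes it compile
      throw new UnsupportedOperationException("bulk lookup not implemented in this stub");
    }
}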

/* The corresponding factory class */
public class LocationDBFactory implements StateFactory {
   public State makeState(Map conf, int partitionIndex, int numPartitions) {
      return new LocationDB();
   } 
}

/* The state query function */
public class QueryLocation extends BaseQueryFunction<LocationDB, String> {
    public List<String> batchRetrieve(LocationDB state, List<TridentTuple> inputs) {
        List<Long> userIds = new ArrayList<Long>();
        for(TridentTuple input: inputs) {
            userIds.add(input.getLong(0));
        }
        return state.bulkGetLocations(userIds);
    }

    public void execute(TridentTuple tuple, String location, TridentCollector collector) {
        collector.emit(new Values(location));
    }    
}

/* The state updater class; BaseStateUpdater implements StateUpdater */
public class LocationUpdater extends BaseStateUpdater<LocationDB> {
    public void updateState(LocationDB state, List<TridentTuple> tuples, TridentCollector collector) {
        List<Long> ids = new ArrayList<Long>();
        List<String> locations = new ArrayList<String>();
        for(TridentTuple t: tuples) {
            ids.add(t.getLong(0));
            locations.add(t.getString(1));
        }
        /* store in bulk */
        state.setLocationsBulk(ids, locations);
    }
}

/* Putting the classes above to use:
   1. build a topology with locationsSpout as the source;
   2. take the first value of each tuple as the userid and the second as the location, and store them into the State in bulk */
TridentTopology topology = new TridentTopology();
TridentState locations = 
    topology.newStream("locations", locationsSpout)
        .partitionPersist(new LocationDBFactory(), new Fields("userid", "location"), new LocationUpdater());

/* Putting the classes above to use:
   1. build a topology with spout as the source;
   2. use the userid field of each tuple as the query argument; the result is emitted as the location field */
TridentTopology topology = new TridentTopology();
TridentState locations = topology.newStaticState(new LocationDBFactory());
topology.newStream("myspout", spout)
        .stateQuery(locations, new Fields("userid"), new QueryLocation(), new Fields("location"));

/* Note: the code is taken from the official documentation; the snippets should be combinable into a single topology. */