storm-[6]-Trident API

Official documentation:

Trident API Overview

Trident's core data model is the "stream", which is processed as a series of batches. A stream is partitioned across the nodes of the cluster, and operations on different partitions run in parallel.

Trident has the following five kinds of operations:

  1. Operations that apply locally to each partition and cause no network transfer
  2. Repartitioning operations that repartition a stream but otherwise don't change the contents (involves network transfer)
  3. Aggregation operations that do network transfer as part of the operation
  4. Operations on grouped streams
  5. Merges and joins

Partition-local operations

Partition-local operations involve no network transfer and are applied to each partition independently.

Functions

  • A function takes in a set of input fields and emits zero or more tuples as output
  • The emitted tuples are appended to the original input tuple
  • If a function emits no tuples, the original input tuple is filtered out

Suppose you have this function:

public class MyFunction extends BaseFunction {
    public void execute(TridentTuple tuple, TridentCollector collector) {
        for(int i=0; i < tuple.getInteger(0); i++) {
            collector.emit(new Values(i));
        }
    }
}

Now suppose you have a stream in the variable "mystream" with the fields ["a", "b", "c"] with the following tuples:

[1, 2, 3]
[4, 1, 6]
[3, 0, 8]

If you run this code:

mystream.each(new Fields("b"), new MyFunction(), new Fields("d")))

The resulting tuples would have fields ["a", "b", "c", "d"] and look like this:

[1, 2, 3, 0]
[1, 2, 3, 1]
[4, 1, 6, 0]
The tuple [3, 0, 8] is filtered out here because MyFunction emitted no tuples for it.

Filters

A function can act as a filter (by emitting nothing), but the more general way to filter is with a Filter:

public class MyFilter extends BaseFilter {
    public boolean isKeep(TridentTuple tuple) {
        return tuple.getInteger(0) == 1 && tuple.getInteger(1) == 2;
    }
}

Now suppose you had these tuples with fields ["a", "b", "c"]:

[1, 2, 3]
[2, 1, 1]
[2, 3, 4]

If you ran this code:

mystream.filter(new MyFilter())

The resulting tuples would be:

[1, 2, 3]

map and flatMap

map applies a one-to-one transformation to the tuples of a stream. For example, converting a word to upper case:

public class UpperCase implements MapFunction {
    @Override
    public Values execute(TridentTuple input) {
        return new Values(input.getString(0).toUpperCase());
    }
}

It is applied like this:

mystream.map(new UpperCase())

flatMap performs a one-to-many transformation and flattens the resulting elements into a new stream. For example, converting a stream of sentences into a stream of words:

public class Split implements FlatMapFunction {
    @Override
    public Iterable<Values> execute(TridentTuple input) {
        List<Values> valuesList = new ArrayList<>();
        for (String word : input.getString(0).split(" ")) {
            valuesList.add(new Values(word));
        }
        return valuesList;
    }
}

And applied like this:

mystream.flatMap(new Split())

The calls can of course be chained:

mystream.flatMap(new Split()).map(new UpperCase())

The output fields can also be given names:

mystream.map(new UpperCase(), new Fields("uppercased"))
mystream.flatMap(new Split(), new Fields("word"))

peek

peek performs an additional action on each tuple as it flows past, without consuming or changing it (like peeking at a stack: the element is looked at but not removed).

For example, printing each word before it is grouped:

mystream.flatMap(new Split()).map(new UpperCase())
        .peek(new Consumer() {
            @Override
            public void accept(TridentTuple input) {
                System.out.println(input.getString(0));
            }
        })
        .groupBy(new Fields("word"))
        .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))

min and minBy

The min and minBy operations return the minimum value in each partition of a batch of tuples in a Trident stream.

Suppose a Trident stream contains the fields ["device-id", "count"] and the following partitions of tuples:

Partition 0:
[123, 2]
[113, 54]
[23,  28]
[237, 37]
[12,  23]
[62,  17]
[98,  42]

Partition 1:
[64,  18]
[72,  54]
[2,   28]
[742, 71]
[98,  45]
[62,  12]
[19,  174]


Partition 2:
[27,  94]
[82,  23]
[9,   86]
[53,  71]
[74,  37]
[51,  49]
[37,  98]

The minBy operation can be applied to the above stream as shown below, emitting, for each partition, the tuple with the minimum value of the count field.

  mystream.minBy(new Fields("count"))

The result of the above code on the partitions shown is:

Partition 0:
[123, 2]

Partition 1:
[62,  12]

Partition 2:
[82,  23]
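
minBy compares using the natural ordering of the given field. For a custom ordering, the stream also exposes a min variant that takes a comparator. The sketch below is only an assumption based on the Stream.min(Comparator<TridentTuple>) overload available in recent Storm releases; the CountComparator class is illustrative.

import java.io.Serializable;
import java.util.Comparator;
import org.apache.storm.trident.tuple.TridentTuple;

// Illustrative comparator; it must be Serializable so Storm can ship it with the topology.
public class CountComparator implements Comparator<TridentTuple>, Serializable {
    @Override
    public int compare(TridentTuple t1, TridentTuple t2) {
        return Integer.compare(t1.getIntegerByField("count"), t2.getIntegerByField("count"));
    }
}

Picking the tuple with the smallest count per partition then becomes:

  mystream.min(new CountComparator())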

max and maxBy

The max and maxBy operations return the maximum value in each partition of a batch of tuples in a Trident stream.

Suppose a Trident stream contains the fields ["device-id", "count"] as in the section above.

The max and maxBy operations can be applied to the above stream as shown below, emitting, for each partition, the tuple with the maximum value of the count field.

  mystream.maxBy(new Fields("count"))

The result of the above code on the partitions shown is:

Partition 0:
[113, 54]


Partition 1:
[19,  174]


Partition 2:
[37,  98]

Windowing

Trident can merge tuples from multiple batches into a single window and process them together. It supports two kinds of windows, each of which can be defined by time duration or by tuple count:

  1. Tumbling window
  2. Sliding window

Tumbling window

In a tumbling window, any tuple belongs to exactly one window and is processed only once.


    /**
     * Returns a stream of tuples which are aggregated results of a tumbling window with every {@code windowCount} of tuples.
     */
    public Stream tumblingWindow(int windowCount, WindowsStoreFactory windowStoreFactory,
                                 Fields inputFields, Aggregator aggregator, Fields functionFields);

    /**
     * Returns a stream of tuples which are aggregated results of a window that tumbles at duration of {@code windowDuration}.
     */
    public Stream tumblingWindow(BaseWindowedBolt.Duration windowDuration, WindowsStoreFactory windowStoreFactory,
                                 Fields inputFields, Aggregator aggregator, Fields functionFields);
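
As a rough usage sketch of the count-based signature above: assuming a stream whose tuples carry a "word" field, the in-memory store factory org.apache.storm.trident.windowing.InMemoryWindowsStoreFactory, and the CountAgg Aggregator defined later in the partitionAggregate section, a count is emitted for every window of three tuples. The field names here are illustrative.

  mystream.tumblingWindow(3, new InMemoryWindowsStoreFactory(),
                          new Fields("word"), new CountAgg(), new Fields("count"))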

Sliding window

A sliding window advances by a configured interval, so a tuple may be processed in more than one window:

Examples are given in the Example applications section below.


    /**
     * Returns a stream of tuples which are aggregated results of a sliding window with every {@code windowCount} of tuples
     * and slides the window after {@code slideCount}.
     */
    public Stream slidingWindow(int windowCount, int slideCount, WindowsStoreFactory windowStoreFactory,
                                Fields inputFields, Aggregator aggregator, Fields functionFields);

    /**
     * Returns a stream of tuples which are aggregated results of a window which slides at duration of {@code slidingInterval}
     * and completes a window at {@code windowDuration}.
     */
    public Stream slidingWindow(BaseWindowedBolt.Duration windowDuration, BaseWindowedBolt.Duration slidingInterval,
                                WindowsStoreFactory windowStoreFactory, Fields inputFields, Aggregator aggregator, Fields functionFields);

Common windowing API

The common windowing API takes a WindowConfig describing any of the supported windowing configurations.

A detailed description can be found in the official Storm windowing documentation.
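
As a rough sketch of the common API (assuming the Stream.window(...) overload that takes a WindowConfig, the SlidingCountWindow config class from org.apache.storm.trident.windowing.config, and the in-memory store factory; CountAgg is the Aggregator defined later in this article):

  mystream.window(SlidingCountWindow.of(100, 10),   // window of 100 tuples, sliding every 10 tuples
                  new InMemoryWindowsStoreFactory(),
                  new Fields("word"), new CountAgg(), new Fields("count"))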

Example applications

Example applications: TridentHBaseWindowingStoreTopology and TridentWindowingInmemoryStoreTopology.

partitionAggregate

partitionAggregate runs a function on each partition of a batch of tuples. Unlike functions, the tuples emitted by partitionAggregate replace the input tuples instead of being appended to them. For example:

mystream.partitionAggregate(new Fields("b"), new Sum(), new Fields("sum"))

Suppose the input stream contained fields ["a", "b"] and the following partitions of tuples:

Partition 0:
["a", 1]
["b", 2]

Partition 1:
["a", 3]
["c", 8]

Partition 2:
["e", 1]
["d", 9]
["d", 10]

The output stream would contain a single field called "sum":

Partition 0:
[3]

Partition 1:
[11]

Partition 2:
[20]

Trident has three interfaces for defining aggregators: CombinerAggregator, ReducerAggregator, and Aggregator.

(1) CombinerAggregator:

public interface CombinerAggregator<T> extends Serializable {
    T init(TridentTuple tuple);
    T combine(T val1, T val2);
    T zero();
}

A CombinerAggregator returns a single tuple with a single field as output. It runs the init function on each input tuple and uses the combine function to combine values until only one value is left. If there are no tuples in the partition, it emits the output of the zero function. For example, here's the implementation of Count:

public class Count implements CombinerAggregator<Long> {
    public Long init(TridentTuple tuple) {
        return 1L;
    }

    public Long combine(Long val1, Long val2) {
        return val1 + val2;
    }

    public Long zero() {
        return 0L;
    }
}

(2) ReducerAggregator:

A ReducerAggregator produces an initial value with init and then reduces that value with each input tuple, producing a single value as output.

public interface ReducerAggregator<T> extends Serializable {
    T init();
    T reduce(T curr, TridentTuple tuple);
}

For example, here's how you would define Count as a ReducerAggregator:

public class Count implements ReducerAggregator<Long> {
    public Long init() {
        return 0L;
    }

    public Long reduce(Long curr, TridentTuple tuple) {
        return curr + 1;
    }
}

(3) Aggregator:

public interface Aggregator<T> extends Operation {
    T init(Object batchId, TridentCollector collector);
    void aggregate(T state, TridentTuple tuple, TridentCollector collector);
    void complete(T state, TridentCollector collector);
}

Aggregators can emit any number of tuples, each containing any number of fields. Execution proceeds as follows:

  1. The init method is called before processing the batch. Its return value is an Object representing the state of the aggregation, and it is passed into the aggregate and complete methods.
  2. aggregate is called for each input tuple in the batch partition. It can update the state and optionally emit tuples.
  3. complete is called after aggregate has been run on all tuples in the batch partition.

Here's how to implement Count as an Aggregator:

public class CountAgg extends BaseAggregator<CountState> {
    static class CountState {
        long count = 0;
    }

    public CountState init(Object batchId, TridentCollector collector) {
        return new CountState();
    }

    public void aggregate(CountState state, TridentTuple tuple, TridentCollector collector) {
        state.count+=1;
    }

    public void complete(CountState state, TridentCollector collector) {
        collector.emit(new Values(state.count));
    }
}

Multiple aggregators can be chained and applied to the same input. For example, the following code runs the Count and Sum aggregators on each partition, emitting a single tuple with the fields ["count", "sum"]:

mystream.chainedAgg()
        .partitionAggregate(new Count(), new Fields("count"))
        .partitionAggregate(new Fields("b"), new Sum(), new Fields("sum"))
        .chainEnd()

stateQuery and partitionPersist

stateQuery and partitionPersist query and update sources of state, respectively. You can read about how to use them in the Trident state doc.
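
As a brief illustration of stateQuery (a sketch only, assuming a TridentState named wordCounts produced earlier by persistentAggregate, the built-in MapGet query function, and the Split function from above), a DRPC stream can look up counts from that state:

topology.newDRPCStream("words")
        .flatMap(new Split(), new Fields("word"))
        .stateQuery(wordCounts, new Fields("word"), new MapGet(), new Fields("count"))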

projection

projection keeps only the specified fields of the stream. For a stream with the fields ["a", "b", "c", "d"], running:

mystream.project(new Fields("b", "d"))

The output stream would contain only the fields ["b", "d"].

Repartitioning operations

Repartitioning operations run a function to change how the tuples are partitioned across tasks. The number of partitions can also change as a result of repartitioning (for example, if the parallelism hint is greater after repartitioning). Repartitioning requires network transfer. Here are the repartitioning functions (a short usage sketch follows this list):

  1. shuffle: Use random round robin algorithm to evenly redistribute tuples across all target partitions
  2. broadcast: Every tuple is replicated to all target partitions. This can be useful during DRPC – for example, if you need to do a stateQuery on every partition of data.
  3. partitionBy: partitionBy takes in a set of fields and does semantic partitioning based on that set of fields. The fields are hashed and modded by the number of target partitions to select the target partition. partitionBy guarantees that the same set of fields always goes to the same target partition.
  4. global: All tuples are sent to the same partition. The same partition is chosen for all batches in the stream.
  5. batchGlobal: All tuples in the batch are sent to the same partition. Different batches in the stream may go to different partitions.
  6. partition: This method takes in a custom partitioning function that implements org.apache.storm.grouping.CustomStreamGrouping
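
A minimal usage sketch of the repartitioning calls (mystream, Split, and UpperCase are the examples from earlier in this article; the field names are only illustrative):

mystream.flatMap(new Split(), new Fields("word"))
        .shuffle()                          // evenly redistribute tuples across partitions
        .map(new UpperCase(), new Fields("upper"))
        .partitionBy(new Fields("upper"))   // tuples with the same "upper" value go to the same partition
        .parallelismHint(4)                 // suggested number of partitions after repartitioning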

Aggregation operations

Trident has aggregate and persistentAggregate methods for doing aggregations on Streams. aggregate is run on each batch of the stream in isolation, while persistentAggregate will aggregate across all tuples in all batches of the stream and store the result in a source of state.

Running aggregate on a Stream does a global aggregation. When you use a ReducerAggregator or an Aggregator, the stream is first repartitioned into a single partition, and then the aggregation function is run on that partition. When you use a CombinerAggregator, on the other hand, first Trident will compute partial aggregations of each partition, then repartition to a single partition, and then finish the aggregation after the network transfer. CombinerAggregators are far more efficient and should be used when possible.

Here's an example of using aggregate to get a global count for a batch:

mystream.aggregate(new Count(), new Fields("count"))

Like partitionAggregate, aggregators for aggregate can be chained. However, if you chain a CombinerAggregator with a non-CombinerAggregator, Trident is unable to do the partial aggregation optimization.

You can read more about how to use persistentAggregate in the Trident state doc.

Operations on grouped streams

The groupBy operation repartitions the stream by doing a partitionBy on the specified fields, and then within each partition groups tuples together whose group fields are equal. For example, here's an illustration of a groupBy operation:

(Figure: illustration of a groupBy operation)

If you run aggregators on a grouped stream, the aggregation will be run within each group instead of against the whole batch. persistentAggregate can also be run on a GroupedStream, in which case the results will be stored in a MapState with the key being the grouping fields. You can read more about persistentAggregate in the Trident state doc.

Like regular streams, aggregators on grouped streams can be chained.
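
For example, a per-batch word count over a grouped stream could look like the following sketch (assuming mystream carries sentences, and Split and Count are the classes shown earlier in this article):

mystream.flatMap(new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        .aggregate(new Count(), new Fields("count"))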

Merges and joins

The last part of the API is combining different streams together. The simplest way to combine streams is to merge them into one stream. You can do that with the TridentTopology#merge method, like so:

topology.merge(stream1, stream2, stream3);

Trident will name the output fields of the new, merged stream as the output fields of the first stream.

Another way to combine streams is with a join. A standard join, like the kind from SQL, requires finite input, so it doesn't make sense for infinite streams. Joins in Trident only apply within each small batch that comes off of the spout.

Here's an example join between a stream containing fields ["key", "val1", "val2"] and another stream containing ["x", "val1"]:

topology.join(stream1, new Fields("key"), stream2, new Fields("x"), new Fields("key", "a", "b", "c"));

This joins stream1 and stream2 together using "key" and "x" as the join fields for each respective stream. Then, Trident requires that all the output fields of the new stream be named, since the input streams could have overlapping field names. The tuples emitted from the join will contain:

  1. First, the list of join fields. In this case, "key" corresponds to "key" from stream1 and "x" from stream2.
  2. Next, a list of all non-join fields from all streams, in order of how the streams were passed to the join method. In this case, "a" and "b" correspond to "val1" and "val2" from stream1, and "c" corresponds to "val1" from stream2.

When a join happens between streams originating from different spouts, those spouts will be synchronized with how they emit batches. That is, a batch of processing will include tuples from each spout.

You might be wondering: how do you do something like a "windowed join", where tuples from one side of the join are joined against the last hour of tuples from the other side of the join?

To do this, you would make use of partitionPersist and stateQuery. The last hour of tuples from one side of the join would be stored and rotated in a source of state, keyed by the join field. Then the stateQuery would do lookups by the join field to perform the "join".
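
A schematic sketch of that pattern, reusing the stream1/stream2 join example above; HourlyKeyedStateFactory, StoreByKey, and LookupByKey are hypothetical classes a real implementation would have to provide (a rotating keyed state factory, a StateUpdater, and a QueryFunction):

// All three custom classes below are hypothetical placeholders.
TridentState recent = stream2.partitionPersist(
        new HourlyKeyedStateFactory(),      // keeps and rotates the last hour of tuples, keyed by "x"
        new Fields("x", "val1"),
        new StoreByKey());                  // StateUpdater that writes incoming tuples into the state

stream1.stateQuery(recent, new Fields("key"),
        new LookupByKey(),                  // QueryFunction that looks up stored tuples by the join field
        new Fields("c"));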

