[Storm] Storm Trident: An Introduction with Walkthroughs of the Starter Examples

Latest recommended article published on 2021-01-30 22:57:13

晚风中的自由

Adapted from this article: https://blog.csdn.net/derekjiang/article/details/9126185

I. What Is Storm Trident?

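Trident is a high-level abstraction built on top of Storm for real-time computation. It handles high-throughput data streams while also providing low-latency distributed queries and stateful stream processing. If you are familiar with high-level batch-processing tools such as Pig or Cascading, Trident should be fairly easy to pick up, because many of its concepts and ideas are similar. Trident provides joins, aggregations, grouping, functions, and filters. In addition, it adds primitives for stateful, incremental processing on top of a database or other persistent store.
Trident is fully fault-tolerant and has exactly-once processing semantics; in essence it is a higher-level wrapper around Storm's transactional topologies. This makes it easy to reason about real-time data processing with Trident. Trident persists state in some form, and when a failure occurs it restores that state as needed. If you already know how transactional topologies work, Trident will be easier to learn.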
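
To make the exactly-once idea more concrete, here is a minimal sketch in plain Java (no Storm dependency) of the usual transactional trick: store the id of the last applied batch next to the value, and skip a batch whose id has already been applied. The class TxCounter and its methods are hypothetical names chosen for illustration, and the sketch assumes that batches are applied in order and that a replayed batch carries the same contents; it is not Trident's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a transactional counter; not part of the Storm API.
final class TxCounter {
    // For each key, keep the running count together with the id of the last batch applied to it.
    private static final class Entry {
        long count;
        long lastTxId = -1;
    }

    private final Map<String, Entry> store = new HashMap<>();

    // Apply a batch's partial count for a key. A replayed batch (same txId) is ignored,
    // so the stored count reflects every batch exactly once.
    void apply(long txId, String key, long delta) {
        Entry e = store.computeIfAbsent(key, k -> new Entry());
        if (e.lastTxId == txId) {
            return; // this batch was already applied before the failure; do nothing
        }
        e.count += delta;
        e.lastTxId = txId;
    }

    // Returns 0 for a key that has never been updated.
    long get(String key) {
        Entry e = store.get(key);
        return e == null ? 0 : e.count;
    }

    public static void main(String[] args) {
        TxCounter counter = new TxCounter();
        counter.apply(1, "the", 2);
        counter.apply(2, "the", 3);
        counter.apply(2, "the", 3); // replay of batch 2 after a simulated failure
        System.out.println(counter.get("the")); // prints 5, not 8
    }
}
```

The replayed batch does not change the stored value, which is the property that lets a failed batch be retried safely.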

The rest of this article walks through the Trident examples in the official storm-starter-master project.

II. The TridentWordCount Example

1. Full code

package storm.starter.trident;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.LocalDRPC;
import backtype.storm.StormSubmitter;
import backtype.storm.generated.StormTopology;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentState;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.builtin.Count;
import storm.trident.operation.builtin.FilterNull;
import storm.trident.operation.builtin.MapGet;
import storm.trident.operation.builtin.Sum;
import storm.trident.testing.FixedBatchSpout;
import storm.trident.testing.MemoryMapState;
import storm.trident.tuple.TridentTuple;


public class TridentWordCount {
  public static class Split extends BaseFunction {
    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {
      String sentence = tuple.getString(0);
      for (String word : sentence.split(" ")) {
        collector.emit(new Values(word));
      }
    }
  }

  public static StormTopology buildTopology(LocalDRPC drpc) {
    // Spout: the data source is 5 tuples, each holding one sentence; every batch contains at most 3 tuples
    FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3, 
    	new Values("the cow jumped over the moon"),
        new Values("the man went to the store and bought some candy"), 
        new Values("four score and seven years ago"),
        new Values("how many apples can you eat"), 
        new Values("to be or not to be the person"));
    
    // If set to true, the spout cycles through the sentences and keeps emitting them indefinitely
    spout.setCycle(false); 

    TridentTopology topology = new TridentTopology();
    
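    // Use the spout as the data source with a parallelism hint of 16. Split breaks each sentence on spaces,
    // the words are grouped, and Count aggregates each group; the running counts are kept in memory.
    // The result is exposed as a TridentState.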
    TridentState wordCounts = topology.newStream("spout1", spout)
    		.parallelismHint(16)
    		.each(new Fields("sentence"), new Split(), new Fields("word"))
    		.groupBy(new Fields("word"))
    		.persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
    		.parallelismHint(16);
    
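    // Distributed query.
    // The topology creates a DRPC stream for the DRPC function named "words"; the "args" field holds
    // the argument string sent by the client, which may contain several words.
    // each() applies Split to break the args on spaces and emits the field "word"; groupBy groups by word;
    // stateQuery looks up each query word in the TridentState built above (the word counts from the spout),
    // using MapGet to fetch the value as "count". FilterNull drops words that have no count,
    // and aggregate() with Sum adds up the remaining counts.
    // The final result is the total number of occurrences of the query words in the data source.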
    topology.newDRPCStream("words", drpc)
            .each(new Fields("args"), new Split(), new Fields("word"))
            .groupBy(new Fields("word"))
            .stateQuery(wordCounts, new Fields("word"), new MapGet(), new Fields("count"))
            .each(new Fields("count"), new FilterNull())
            .aggregate(new Fields("count"), new Sum(), new Fields("sum"));
    
    return topology.build();
  }

  public static void main(String[] args) throws Exception {
    Config conf = new Config();
    conf.setMaxSpoutPending(20);
    
    // With no arguments, run in local mode
    if (args.length == 0) {
      LocalDRPC drpc = new LocalDRPC();
      LocalCluster cluster = new LocalCluster();
      cluster.submitTopology("wordCounter", conf, buildTopology(drpc));
      for (int i = 0; i < 100; i++) {
    	  
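        // First argument: the DRPC function name; second argument: the query words, separated by spaces.
        // The result is the sum of the counts of those words. Once all five sentences have been processed,
        // "the" has a count of 5 and "jumped" a count of 1, so the total is 6.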
        System.out.println("DRPC RESULT: " + drpc.execute("words", "cat the dog jumped"));
        Thread.sleep(1000);
      }
    }
    else {    // Otherwise submit to a cluster (distributed mode)
      conf.setNumWorkers(3);
      StormSubmitter.submitTopology(args[0], conf, buildTopology(null));
    }
  }
}

2. Code walkthrough

Let's walk through a Trident example. The example does two things:

Read sentences from a streaming input and compute the count of each word
Support querying the current total count for the words in a given list

For illustration purposes, the example reads sentences from an endless input stream defined as follows:

FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
               new Values("the cow jumped over the moon"),
               new Values("the man went to the store and bought some candy"),
               new Values("four score and seven years ago"),
               new Values("how many apples can you eat"),
               new Values("to be or not to be the person"));
spout.setCycle(true);

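This spout cycles through the listed sentences and emits them into the sentence stream. (The full listing above calls setCycle(false), so there each sentence is emitted only once.) The following code takes that stream as input and computes the count of each word: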

TridentTopology topology = new TridentTopology();        
TridentState wordCounts =
     topology.newStream("spout1", spout)
       .each(new Fields("sentence"), new Split(), new Fields("word"))
       .groupBy(new Fields("word"))
       .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))                
       .parallelismHint(6);

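Let's read through this code. First we create a TridentTopology object, which exposes the interface for constructing Trident computations. Calling its newStream method with a spout creates a new stream of data in the topology, fed by whatever the spout reads from its external source. In this example we use the FixedBatchSpout defined above; the input could equally come from a queue broker such as Kestrel or Kafka. Trident keeps a small amount of state in ZooKeeper to track what has been processed, and the string "spout1" given in the code identifies the znode under which Trident stores that metadata.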

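When processing an input stream, Trident turns it into a series of small batches of tuples. For example, the incoming sentence stream might be divided into several consecutive batches, each holding a few sentence tuples.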

Generally, the size of these small batches is on the order of thousands to millions of tuples, depending entirely on your input throughput.
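
As a rough picture of this batching, the sketch below splits the five example sentences into batches of at most three tuples, which is the batch size passed to FixedBatchSpout in the listing. BatchSketch and toBatches are hypothetical names; the code only imitates the grouping in plain Java and makes no claim about how the spout is implemented internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that mimics how a fixed-size batch spout groups tuples; not Storm code.
final class BatchSketch {
    // Split the input into consecutive batches of at most maxBatchSize elements.
    static List<List<String>> toBatches(List<String> tuples, int maxBatchSize) {
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < tuples.size(); start += maxBatchSize) {
            int end = Math.min(start + maxBatchSize, tuples.size());
            batches.add(new ArrayList<>(tuples.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
            "the cow jumped over the moon",
            "the man went to the store and bought some candy",
            "four score and seven years ago",
            "how many apples can you eat",
            "to be or not to be the person");
        List<List<String>> batches = toBatches(sentences, 3);
        // Five sentences with a batch size of 3 give two batches: one of 3 tuples and one of 2.
        for (int i = 0; i < batches.size(); i++) {
            System.out.println("batch " + i + ": " + batches.get(i));
        }
    }
}
```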

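Trident provides a mature batch-processing API for working with these small batches. The API is very similar to what you find in Pig or Cascading: you can do group-bys, joins, and aggregations, run functions, apply filters, and so on. Processing each small batch in isolation is not very interesting, of course, so Trident also provides functions for aggregating across batches and persisting those aggregates, whether in memory, in Memcached, in Cassandra, or in some other store. Finally, Trident has good support for querying sources of real-time state. That state can be updated by Trident itself, or it can be an independent source of state.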

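Back to the example: the spout emits a stream with a single field named "sentence". On the next line, the topology applies the Split function to every tuple in the stream; Split reads the "sentence" field and breaks it into word tuples. Each sentence tuple may produce many word tuples; for instance, "the cow jumped over the moon" produces six "word" tuples. Here is the definition of Split: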

public class Split extends BaseFunction {
   public void execute(TridentTuple tuple, TridentCollector collector) {
       String sentence = tuple.getString(0);
       for(String word: sentence.split(" ")) {
           collector.emit(new Values(word));                
       }
   }
}

As you can see, it is really simple: it splits the sentence on spaces and emits one tuple per word.

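The rest of the topology computes the word counts and persists the results. First the word stream is grouped by the "word" field, and then each group is persistently aggregated with the Count aggregator. persistentAggregate knows how to store and update the result of an aggregation in a source of state. In this example the word counts are kept in memory, but they could just as easily be kept in another store such as Memcached or Cassandra. To store the counts in Memcached, replace the persistentAggregate line with the one below, where "serverLocations" is the list of host/port pairs of the Memcached cluster: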

.persistentAggregate(MemcachedState.transactional(serverLocations), new Count(), new Fields("count"))        

The values stored by persistentAggregate are the result of aggregating over all batches emitted by the stream.
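
The effect of split, groupBy and persistentAggregate with Count can be imitated in plain Java as below. This is only a sketch: WordCountSketch is a hypothetical class, the HashMap stands in for MemoryMapState, and the two calls to processBatch assume the five sentences arrive as one batch of three and one batch of two.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for split + groupBy + persistentAggregate(Count); not Storm code.
final class WordCountSketch {
    // Plays the role of the in-memory state that outlives individual batches.
    private final Map<String, Long> counts = new HashMap<>();

    // Process one batch: split every sentence, then fold the words into the shared state.
    void processBatch(List<String> sentences) {
        for (String sentence : sentences) {
            for (String word : sentence.split(" ")) {
                counts.merge(word, 1L, Long::sum);
            }
        }
    }

    // Returns null for a word that has never been seen, as a plain map lookup would.
    Long count(String word) {
        return counts.get(word);
    }

    public static void main(String[] args) {
        WordCountSketch state = new WordCountSketch();
        state.processBatch(Arrays.asList(
            "the cow jumped over the moon",
            "the man went to the store and bought some candy",
            "four score and seven years ago"));
        state.processBatch(Arrays.asList(
            "how many apples can you eat",
            "to be or not to be the person"));
        // The counts accumulate across both batches.
        System.out.println("the=" + state.count("the") + ", to=" + state.count("to")); // the=5, to=3
    }
}
```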

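One of the nicest things about Trident is that it is fully fault-tolerant with exactly-once processing semantics, which makes it easy to reason about real-time processing. Trident persists state in a form that lets it restore that state as needed when a failure occurs.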

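The persistentAggregate method turns the stream into a TridentState object. In this example the TridentState object represents all the word counts, and we will use it to implement the distributed-query part of the computation.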

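The next part of the topology implements a low-latency distributed query on the word counts. The query takes a space-separated list of words as input and returns the sum of the current counts of those words. These queries are executed just like ordinary RPC calls, except that they are parallelized in the background. Here is an example of how to invoke one: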

DRPCClient client = new DRPCClient("drpc.server.location", 3772);
System.out.println(client.execute("words", "cat dog the man"));
// prints the JSON-encoded result, e.g.: "[[5078]]"

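As you can see, it looks just like a regular RPC call, except that it executes in parallel across the Storm cluster. The latency of a small query like this is typically around 10 ms. More complex DRPC queries can take longer, of course, although latency depends largely on how many resources you have allocated to the computation.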

The distributed query is implemented as follows:

topology.newDRPCStream("words")
       .each(new Fields("args"), new Split(), new Fields("word"))
       .groupBy(new Fields("word"))
       .stateQuery(wordCounts, new Fields("word"), new MapGet(), new Fields("count"))
       .each(new Fields("count"), new FilterNull())
       .aggregate(new Fields("count"), new Sum(), new Fields("sum"));

The same TridentTopology object is used to create the DRPC stream, and the function is named "words". The function name is the first argument passed to execute when querying through a DRPCClient.

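Each DRPC request is treated as its own small batch containing a single tuple that represents the request. That tuple has one field, "args", holding the argument supplied by the client; in this example the argument is a space-separated list of words.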

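First, the Split function breaks the request arguments into individual words. The stream is then grouped by "word", and stateQuery queries the TridentState object created by the first part of the topology. stateQuery takes a source of state (here, the word counts computed by our topology) and a function for querying that state. In this example the MapGet function fetches the count for each word. Because the DRPC stream is grouped in exactly the same way as the TridentState (by the "word" field), each word query is routed to the partition of the TridentState that manages and updates that word.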

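Next, the FilterNull filter removes words that have never been seen, and the Sum aggregator adds up the remaining counts. Trident then automatically sends the result back to the waiting client.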
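
The query steps described above (split, look up, filter nulls, sum) can be sketched in plain Java as follows. WordsQuerySketch and query are hypothetical names, and the map is filled by hand with the two counts that matter for the query string used in the full listing; nothing here touches the DRPC machinery.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Stream;

// Hypothetical plain-Java analogue of the DRPC query pipeline; not Storm code.
final class WordsQuerySketch {
    // Split the args, look each word up (like MapGet), drop misses (like FilterNull),
    // and add up what is left (like Sum).
    static long query(Map<String, Long> wordCounts, String args) {
        return Stream.of(args.split(" "))
            .map(wordCounts::get)          // null when the word was never counted
            .filter(Objects::nonNull)
            .mapToLong(Long::longValue)
            .sum();
    }

    public static void main(String[] args) {
        Map<String, Long> wordCounts = new HashMap<>();
        wordCounts.put("the", 5L);
        wordCounts.put("jumped", 1L);
        // "cat" and "dog" have no count and are filtered out, leaving 5 + 1.
        System.out.println(query(wordCounts, "cat the dog jumped")); // prints 6
    }
}
```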

Trident is smart about executing a topology with as much performance as possible. Two interesting things happen automatically in this topology:

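Operations that read from or write to state (such as persistentAggregate and stateQuery) automatically batch their accesses to that state. If 20 updates need to be written to the store, Trident groups them together and issues one read and one write instead of 20 reads and 20 writes. You get the convenience of expressing the computation naturally while still getting good performance.
Trident aggregators are heavily optimized. Rather than simply sending every tuple of a group to the same machine and aggregating there, Trident performs partial aggregation before sending anything over the network. For example, the Count aggregator first computes the count on each partition and then sums the partial counts to obtain the final count. This is the same idea as a combiner in MapReduce.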
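
The first of these two points can be pictured with a small store that counts its own round trips. BatchedStoreSketch is a hypothetical class; its multiGet and multiPut methods only imitate the style of a batched map state and are not the Storm interfaces.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical store that counts round trips; it only imitates a multiGet/multiPut style API.
final class BatchedStoreSketch {
    private final Map<String, Long> data = new HashMap<>();
    int reads = 0;
    int writes = 0;

    // One round trip fetches the values for all requested keys (null if absent).
    List<Long> multiGet(List<String> keys) {
        reads++;
        List<Long> values = new ArrayList<>();
        for (String key : keys) {
            values.add(data.get(key));
        }
        return values;
    }

    // One round trip stores all key/value pairs.
    void multiPut(List<String> keys, List<Long> values) {
        writes++;
        for (int i = 0; i < keys.size(); i++) {
            data.put(keys.get(i), values.get(i));
        }
    }

    // Apply the per-key increments of a whole batch with a single read and a single write.
    void applyBatch(Map<String, Long> increments) {
        List<String> keys = new ArrayList<>(increments.keySet());
        List<Long> current = multiGet(keys);
        List<Long> updated = new ArrayList<>();
        for (int i = 0; i < keys.size(); i++) {
            long old = current.get(i) == null ? 0L : current.get(i);
            updated.add(old + increments.get(keys.get(i)));
        }
        multiPut(keys, updated);
    }

    public static void main(String[] args) {
        BatchedStoreSketch store = new BatchedStoreSketch();
        Map<String, Long> increments = new HashMap<>();
        for (int i = 0; i < 20; i++) {
            increments.put("word" + i, 1L);
        }
        store.applyBatch(increments);
        // Twenty updates, but only one read and one write reached the store.
        System.out.println("reads=" + store.reads + ", writes=" + store.writes); // reads=1, writes=1
    }
}
```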
Let's look at another Trident example.

III. The TridentReach Example

1. Full code

package storm.starter.trident;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.LocalDRPC;
import backtype.storm.generated.StormTopology;
import backtype.storm.task.IMetricsContext;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentState;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.CombinerAggregator;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.builtin.MapGet;
import storm.trident.operation.builtin.Sum;
import storm.trident.state.ReadOnlyState;
import storm.trident.state.State;
import storm.trident.state.StateFactory;
import storm.trident.state.map.ReadOnlyMapState;
import storm.trident.tuple.TridentTuple;

import java.util.*;

public class TridentReach {
	
  // Simulated TWEETERS database: for each URL, the list of users who tweeted it
  public static Map<String, List<String>> TWEETERS_DB = new HashMap<String, List<String>>() {{
    put("foo.com/blog/1", Arrays.asList("sally", "bob", "tim", "george", "nathan"));
    put("engineering.twitter.com/blog/5", Arrays.asList("adam", "david", "sally", "nathan"));
    put("tech.backtype.com/blog/123", Arrays.asList("tim", "mike", "john"));
  }};

  // Simulated FOLLOWERS database: for each user, the list of that user's followers
  public static Map<String, List<String>> FOLLOWERS_DB = new HashMap<String, List<String>>() {{
    put("sally", Arrays.asList("bob", "tim", "alice", "adam", "jim", "chris", "jai"));
    put("bob", Arrays.asList("sally", "nathan", "jim", "mary", "david", "vivian"));
    put("tim", Arrays.asList("alex"));
    put("nathan", Arrays.asList("sally", "bob", "adam", "harry", "chris", "vivian", "emily", "jordan"));
    put("adam", Arrays.asList("david", "carissa"));
    put("mike", Arrays.asList("john", "bob"));
    put("john", Arrays.asList("alice", "nathan", "jim", "mike", "bob"));
  }};

  public static class StaticSingleKeyMapState extends ReadOnlyState implements ReadOnlyMapState<Object> {
    public static class Factory implements StateFactory {
      Map _map;

      public Factory(Map map) {
        _map = map;
      }

      @Override
      public State makeState(Map conf, IMetricsContext metrics, int partitionIndex, int numPartitions) {
        return new StaticSingleKeyMapState(_map);
      }

    }

    Map _map;

    public StaticSingleKeyMapState(Map map) {
      _map = map;
    }


    @Override
    public List<Object> multiGet(List<List<Object>> keys) {
      List<Object> ret = new ArrayList();
      for (List<Object> key : keys) {
        Object singleKey = key.get(0);
        ret.add(_map.get(singleKey));
      }
      return ret;
    }

  }

  public static class One implements CombinerAggregator<Integer> {
    @Override
    public Integer init(TridentTuple tuple) {
      return 1;
    }

    @Override
    public Integer combine(Integer val1, Integer val2) {
      return 1;
    }

    @Override
    public Integer zero() {
      return 1;
    }
  }

  public static class ExpandList extends BaseFunction {

    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {
      List l = (List) tuple.getValue(0);
      if (l != null) {
        for (Object o : l) {
          collector.emit(new Values(o));
        }
      }
    }

  }

  public static StormTopology buildTopology(LocalDRPC drpc) {
	  
    // Load the two static state sources
    TridentTopology topology = new TridentTopology();
    TridentState urlToTweeters = topology.newStaticState(new StaticSingleKeyMapState.Factory(TWEETERS_DB));
    TridentState tweetersToFollowers = topology.newStaticState(new StaticSingleKeyMapState.Factory(FOLLOWERS_DB));

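    // Create the DRPC stream for the function "reach". Query urlToTweeters with the value of the
    // args field (the URL) to get the list of tweeters.
    // ExpandList turns that list into one tuple per tweeter, and shuffle() spreads the tweeters out.
    // Query tweetersToFollowers for each tweeter; MapGet returns that tweeter's list of followers,
    // and ExpandList again turns the list into one tuple per follower.
    // groupBy groups the tuples by follower.
    // The first aggregate uses One, which emits a single 1 per group in the field "one".
    // The second aggregate uses Sum over "one", which gives the number of distinct followers.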
    topology.newDRPCStream("reach", drpc)
    		.stateQuery(urlToTweeters, new Fields("args"), new MapGet(), new Fields("tweeters"))
    		.each(new Fields("tweeters"), new ExpandList(), new Fields("tweeter"))
    		.shuffle()
    		.stateQuery(tweetersToFollowers, new Fields("tweeter"), new MapGet(), new Fields("followers"))
    		.each(new Fields("followers"), new ExpandList(), new Fields("follower"))
    		.groupBy(new Fields("follower"))
    		.aggregate(new One(), new Fields("one"))
    		.aggregate(new Fields("one"), new Sum(), new Fields("reach"));
    
    return topology.build();
  }

  public static void main(String[] args) throws Exception {
    LocalDRPC drpc = new LocalDRPC();

    Config conf = new Config();
    LocalCluster cluster = new LocalCluster();

    cluster.submitTopology("reach", conf, buildTopology(drpc));

    Thread.sleep(2000);
    
    // For each URL, print the number of followers reached (after deduplication)
    System.out.println("REACH: " + drpc.execute("reach", "aaa"));
    System.out.println("REACH: " + drpc.execute("reach", "foo.com/blog/1"));
    System.out.println("REACH: " + drpc.execute("reach", "engineering.twitter.com/blog/5"));


    cluster.shutdown();
    drpc.shutdown();
  }
}

2. Code walkthrough

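The next example is a pure DRPC topology that computes the reach of a given URL. Reach is defined here as the number of unique people exposed to a URL on Twitter. To compute it, you fetch everyone who tweeted the URL, fetch all the followers of those people, deduplicate that set of followers, and count what remains. Doing the whole computation on a single machine is too demanding, because it can require thousands of database calls and reading tens of millions of tuples. With Storm and Trident you can parallelize each step of the computation across a cluster.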
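
As a reference point for what the topology computes, here is a single-machine sketch of the same definition of reach, using the sample data from the full listing. ReachSketch is a hypothetical class written for illustration; it runs sequentially and leaves out everything that Storm parallelizes.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical single-machine version of the reach computation; not Storm code.
final class ReachSketch {
    static final Map<String, List<String>> TWEETERS_DB = new HashMap<>();
    static final Map<String, List<String>> FOLLOWERS_DB = new HashMap<>();

    static {
        TWEETERS_DB.put("foo.com/blog/1", Arrays.asList("sally", "bob", "tim", "george", "nathan"));
        TWEETERS_DB.put("engineering.twitter.com/blog/5", Arrays.asList("adam", "david", "sally", "nathan"));
        TWEETERS_DB.put("tech.backtype.com/blog/123", Arrays.asList("tim", "mike", "john"));

        FOLLOWERS_DB.put("sally", Arrays.asList("bob", "tim", "alice", "adam", "jim", "chris", "jai"));
        FOLLOWERS_DB.put("bob", Arrays.asList("sally", "nathan", "jim", "mary", "david", "vivian"));
        FOLLOWERS_DB.put("tim", Arrays.asList("alex"));
        FOLLOWERS_DB.put("nathan", Arrays.asList("sally", "bob", "adam", "harry", "chris", "vivian", "emily", "jordan"));
        FOLLOWERS_DB.put("adam", Arrays.asList("david", "carissa"));
        FOLLOWERS_DB.put("mike", Arrays.asList("john", "bob"));
        FOLLOWERS_DB.put("john", Arrays.asList("alice", "nathan", "jim", "mike", "bob"));
    }

    // Tweeters of the URL -> followers of each tweeter -> distinct followers -> count.
    static int reach(String url) {
        Set<String> distinctFollowers = new HashSet<>();
        for (String tweeter : TWEETERS_DB.getOrDefault(url, Collections.emptyList())) {
            distinctFollowers.addAll(FOLLOWERS_DB.getOrDefault(tweeter, Collections.emptyList()));
        }
        return distinctFollowers.size();
    }

    public static void main(String[] args) {
        System.out.println(reach("foo.com/blog/1"));                 // 16
        System.out.println(reach("engineering.twitter.com/blog/5")); // 14
        System.out.println(reach("aaa"));                            // 0 in this sketch (unknown URL)
    }
}
```

The value 0 for an unknown URL is a choice made by this sketch and says nothing about what the DRPC call returns in that case.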

This topology reads from two sources of state. One database maps URLs to the people who tweeted them; the other maps each person to their followers. The topology is defined as follows:

TridentState urlToTweeters =
       topology.newStaticState(getUrlToTweetersState());
TridentState tweetersToFollowers =
       topology.newStaticState(getTweeterToFollowersState());

topology.newDRPCStream("reach")
       .stateQuery(urlToTweeters, new Fields("args"), new MapGet(), new Fields("tweeters"))
       .each(new Fields("tweeters"), new ExpandList(), new Fields("tweeter"))
       .shuffle()
       .stateQuery(tweetersToFollowers, new Fields("tweeter"), new MapGet(), new Fields("followers"))
       .parallelismHint(200)
       .each(new Fields("followers"), new ExpandList(), new Fields("follower"))
       .groupBy(new Fields("follower"))
       .aggregate(new One(), new Fields("one"))
       .parallelismHint(20)
       .aggregate(new Count(), new Fields("reach"));

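The topology uses the newStaticState method to create TridentState objects that represent the external stores. These objects can then be queried from within the topology. As with all sources of state, lookups against these databases are batched automatically for maximum efficiency.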

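The topology definition is straightforward; it is just a simple batch-processing job. First, the urlToTweeters database is queried for the list of people who tweeted the URL. The query returns a list, so the ExpandList function is used to emit one tuple per returned tweeter.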

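Next, the followers of each tweeter must be fetched. shuffle distributes the tweeters across all the workers of the topology so that this step runs in parallel, and then the followers database is queried to get each tweeter's followers. As you can see, this part of the topology is given a large parallelism hint, because it is the most resource-intensive part of the computation.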

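Then the followers are grouped with a group-by, and an aggregator is applied to each group. The aggregator simply emits a single "one" tuple for each group; counting those "one" tuples then gives the number of distinct followers. The One aggregator is defined as follows: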

public class One implements CombinerAggregator<Integer> {
   public Integer init(TridentTuple tuple) {
       return 1;
   }

   public Integer combine(Integer val1, Integer val2) {
       return 1;
   }

   public Integer zero() {
       return 1;
   }        
}

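This is a "combiner aggregator", which performs partial aggregation locally before results are sent to other workers to be combined, maximizing efficiency. Sum is also a combiner aggregator, so ending the topology with Sum is very efficient. (The full listing ends with Sum over the "one" field, while the shortened snippet ends with Count; since every group contributes exactly one tuple with the value 1, both give the same number.)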
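
To see why a combiner-style aggregator allows partial aggregation, the sketch below folds each partition locally and then combines only the partial results. The Combiner interface here is a hypothetical, simplified model of the three methods shown above (init, combine, zero) with String tuples; it is not the Storm CombinerAggregator interface, and COUNT and ONE are illustrative stand-ins.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical mini version of a combiner-style aggregator; not the Storm interface.
final class CombinerSketch {
    interface Combiner<T> {
        T init(String tuple);   // value contributed by a single tuple
        T combine(T a, T b);    // merge two partial results
        T zero();               // starting value for a fold
    }

    // Counts tuples: each tuple contributes 1 and partial results are added.
    static final Combiner<Long> COUNT = new Combiner<Long>() {
        public Long init(String tuple) { return 1L; }
        public Long combine(Long a, Long b) { return a + b; }
        public Long zero() { return 0L; }
    };

    // Mirrors the One aggregator above: whatever is combined, the result is 1.
    static final Combiner<Integer> ONE = new Combiner<Integer>() {
        public Integer init(String tuple) { return 1; }
        public Integer combine(Integer a, Integer b) { return 1; }
        public Integer zero() { return 1; }
    };

    // Fold one partition locally.
    static <T> T aggregatePartition(Combiner<T> agg, List<String> partition) {
        T result = agg.zero();
        for (String tuple : partition) {
            result = agg.combine(result, agg.init(tuple));
        }
        return result;
    }

    // Aggregate each partition locally, then combine only the partial results.
    static <T> T aggregate(Combiner<T> agg, List<List<String>> partitions) {
        T total = agg.zero();
        for (List<String> partition : partitions) {
            total = agg.combine(total, aggregatePartition(agg, partition));
        }
        return total;
    }

    public static void main(String[] args) {
        // One follower seen three times, spread over two partitions.
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("alice", "alice"),
            Arrays.asList("alice"));
        System.out.println(aggregate(COUNT, partitions)); // 3: partial counts 2 and 1 are added
        System.out.println(aggregate(ONE, partitions));   // 1: the group collapses to a single 1
    }
}
```

Only one partial value per partition crosses the boundary between the local fold and the final combine, which is where the saving comes from.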