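In Storm and JStorm, fieldsGrouping partitions tuples by the specified field(s): tuples with the same value for those fields always go to the same group. But how are different values distributed across the groups? What does the assignment strategy actually do?
My guess is that it hashes the value and takes the hash modulo the number of bolt tasks: values with the same remainder land in the same group, which means they are processed by the same thread.
A simple test:
Storm version: 0.9.5
JStorm version: 2.1.1
The main class: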
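import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.generated.AlreadyAliveException;
import backtype.storm.generated.InvalidTopologyException;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;
import backtype.storm.utils.Utils;

public class FieldGroupTest {
    public static void main(String[] args) throws AlreadyAliveException, InvalidTopologyException {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new SplitSpoultTest(), 1);
        // Use 3 bolt tasks so the result is easy to read
        builder.setBolt("bolt", new PrintBolt(), 3).fieldsGrouping("spout", new Fields("field"));
        Config conf = new Config();
        conf.setDebug(false);
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("toplogy", conf, builder.createTopology());
        // Let the topology run for 60 seconds, then stop the local cluster
        Utils.sleep(60000);
        cluster.shutdown();
    }
}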
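To make the result easy to read, every value in the data source leaves a remainder of 1 when divided by 3, except 27, which is divisible by 3.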
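The spout emits these values as strings, not numbers, so it is worth checking that the string hash follows the same pattern. The following is a minimal sketch in plain Java (no Storm dependency); the class name HashRemainderDemo and the method remainder are made up for this illustration. It prints the numeric remainder next to the remainder of String.hashCode() for each test value.

```java
final class HashRemainderDemo {
    // Remainder of the string's hash code, always in the range 0..n-1
    static int remainder(String value, int n) {
        return Math.floorMod(value.hashCode(), n);
    }

    public static void main(String[] args) {
        // Same values as the spout's test data
        String[] values = {"1", "4", "7", "10", "13", "16", "19", "22", "25", "27", "28", "16"};
        for (String v : values) {
            int numeric = Integer.parseInt(v) % 3;
            System.out.println(v + ": numeric % 3 = " + numeric
                    + ", hashCode % 3 = " + remainder(v, 3));
        }
    }
}
```

For these values the two columns agree: 27 gives 0 and every other value gives 1. This works because String.hashCode() uses the multiplier 31, which leaves remainder 1 when divided by 3, and the character code of '0' is 48, a multiple of 3.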
Spout code:
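import java.util.Map;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class SplitSpoultTest extends BaseRichSpout {
    int count = 0;
    private SpoutOutputCollector collector;
    // Most values leave remainder 1 when divided by 3; 27 is divisible by 3
    String[] array2 = {"1","4","7","10","13","16","19","22","25","27","28","16"};

    @Override
    public void open(Map map, TopologyContext topologyContext, SpoutOutputCollector spoutOutputCollector) {
        this.collector = spoutOutputCollector;
    }

    @Override
    public void nextTuple() {
        // After one pass over the array, pause 10 seconds and start over
        if (count >= array2.length) {
            Utils.sleep(10000);
            count = 0;
        }
        // The second argument is the message id
        collector.emit(new Values(array2[count]), array2[count]);
        count++;
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("field"));
    }
}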
The bolt only prints each tuple it receives.
The code:
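import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;

public class PrintBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // Print the value together with the name of the thread that handled it
        System.out.println("tuple0->" + input.getString(0) + " " + Thread.currentThread().getName());
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
        // This bolt emits nothing
    }
}
Expected result: if fieldsGrouping assigns tuples by hash modulo, only two threads will show up in the output, and all values with remainder 1 will be handled by the same bolt task, that is, by the same thread.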
Actual results:
storm:
tuple0->1 Thread-13-bolt
tuple0->4 Thread-13-bolt
tuple0->7 Thread-13-bolt
tuple0->10 Thread-13-bolt
tuple0->13 Thread-13-bolt
tuple0->16 Thread-13-bolt
tuple0->19 Thread-13-bolt
tuple0->22 Thread-13-bolt
tuple0->25 Thread-13-bolt
tuple0->27 Thread-11-bolt
jstorm:
tuple0->1 bolt:4-BoltExecutors
tuple0->4 bolt:4-BoltExecutors
tuple0->7 bolt:4-BoltExecutors
tuple0->27 bolt:3-BoltExecutors
tuple0->10 bolt:4-BoltExecutors
tuple0->13 bolt:4-BoltExecutors
tuple0->16 bolt:4-BoltExecutors
tuple0->19 bolt:4-BoltExecutors
tuple0->22 bolt:4-BoltExecutors
tuple0->25 bolt:4-BoltExecutors
tuple0->28 bolt:4-BoltExecutors
tuple0->16 bolt:4-BoltExecutors
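The results match the guess.
Summary: in both Storm and JStorm, fieldsGrouping hashes the value of the specified field(s), takes the hash modulo the number of target bolt tasks, and assigns the tuple according to the remainder. This is also why equal values are always processed by the same bolt task.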
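The assignment rule can be modeled in a few lines of plain Java. This is only a sketch of the idea, not Storm's or JStorm's actual source: the class FieldsGroupingSketch and the method chooseTaskIndex are hypothetical names, and hashing the list of grouping values with List.hashCode() is an assumption about how the real implementations combine several fields. The returned index is a position among the bolt's tasks, not a real task id.

```java
import java.util.Arrays;
import java.util.List;

final class FieldsGroupingSketch {
    // Assumed model: hash the list of grouping values, then take it modulo the task count.
    // floorMod keeps the result in 0..numTasks-1 even when the hash is negative.
    static int chooseTaskIndex(List<Object> groupingValues, int numTasks) {
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 3;
        for (String v : new String[] {"1", "4", "27", "1"}) {
            List<Object> key = Arrays.<Object>asList(v);
            System.out.println(v + " -> task index " + chooseTaskIndex(key, numTasks));
        }
    }
}
```

Because the index depends only on the grouping values and the task count, equal values always map to the same index, while different values may or may not share one.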
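Note: when using fieldsGrouping, skewed data (a few values, or values that share one remainder, accounting for most of the tuples) can put too much load on a single bolt task.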
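The test data itself is a small example of such skew. The sketch below counts how many tuples each of 3 tasks would receive, assuming the same simplified model of hashing the list of grouping values and taking it modulo the task count; SkewDemo and countPerTask are hypothetical names used only here, and the real task assignment in Storm or JStorm may number the tasks differently.

```java
import java.util.Arrays;
import java.util.List;

final class SkewDemo {
    // Count tuples per task index under the assumed hash-modulo model
    static int[] countPerTask(List<String> keys, int numTasks) {
        int[] counts = new int[numTasks];
        for (String key : keys) {
            int index = Math.floorMod(Arrays.asList(key).hashCode(), numTasks);
            counts[index]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Same values as the spout's test data
        List<String> keys = Arrays.asList(
                "1", "4", "7", "10", "13", "16", "19", "22", "25", "27", "28", "16");
        System.out.println(Arrays.toString(countPerTask(keys, 3))); // prints [0, 1, 11]
    }
}
```

One task gets 11 of the 12 tuples, one gets a single tuple, and one stays idle, which matches the two threads seen in the test output.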