Programming model:
Spout
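// Package names assume Storm 1.x or later (org.apache.storm); releases before 1.0 use backtype.storm.
import java.util.Map;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

/**
 * @program: WordCountSpout.class
 * @description: Emits data to the bolts. Storm offers the abstract classes BaseRichSpout and
 * BaseRichBolt and the interfaces IRichSpout and IRichBolt; the abstract classes are the usual choice.
 * @author: YCF
 * @create: 2018/12/22
 **/
public class WordCountSpout extends BaseRichSpout {
    // Output collector
    private SpoutOutputCollector collector;

    // Initialization: called once when the spout task starts
    @Override
    public void open(Map map, TopologyContext topologyContext, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    // Called repeatedly by Storm; each call sends one sentence to the bolt
    @Override
    public void nextTuple() {
        // Emit the data
        collector.emit(new Values("I am ycf very hen shuai"));
        // Throttle to one sentence every 500 ms
        try {
            Thread.sleep(500);
        } catch (InterruptedException e) {
            // Restore the interrupt flag instead of swallowing it
            Thread.currentThread().interrupt();
        }
    }

    // Declare the name of the single output field
    @Override
    public void declareOutputFields(OutputFieldsDeclarer out) {
        out.declare(new Fields("itstar"));
    }
}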
Bolt
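// Package names assume Storm 1.x or later (org.apache.storm).
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class WordCountSplitBolt extends BaseRichBolt {
    // Output collector
    private OutputCollector collector;

    // Initialization: called once when the bolt task starts
    @Override
    public void prepare(Map map, TopologyContext topologyContext, OutputCollector collector) {
        this.collector = collector;
    }

    // Business logic: called once per incoming tuple
    @Override
    public void execute(Tuple in) {
        // Read the sentence emitted by the spout
        String line = in.getStringByField("itstar");
        // Split it into words
        String[] fields = line.split(" ");
        // Emit each word paired with a count of 1
        for (String w : fields) {
            collector.emit(new Values(w, 1));
        }
    }

    // Declare the names of the two output fields
    @Override
    public void declareOutputFields(OutputFieldsDeclarer out) {
        out.declare(new Fields("word", "sum"));
    }
}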
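// Package names assume Storm 1.x or later (org.apache.storm).
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class WordCountBolt extends BaseRichBolt {
    // Running count per word, local to this bolt task
    // (renamed from "map" so the prepare() parameter no longer shadows it)
    private final Map<String, Integer> counts = new HashMap<>();

    // Initialization: nothing to keep, because this bolt emits no tuples
    @Override
    public void prepare(Map map, TopologyContext topologyContext, OutputCollector collector) {
    }

    // Business logic: called once per incoming tuple
    @Override
    public void execute(Tuple in) {
        // Read the word and its count
        String word = in.getStringByField("word");
        Integer sum = in.getIntegerByField("sum");
        // Aggregate: add to the existing total, or start a new one
        counts.merge(word, sum, Integer::sum);
        // Print to the console: thread name, the word, and its occurrences so far
        System.err.println(Thread.currentThread().getName() + "\t" + "单词位:" + word + "\t 当前已出现次数为:" + counts.get(word));
    }

    // Terminal bolt: declares no output fields
    @Override
    public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
    }
}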
Driver
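// Package names assume Storm 1.x or later (org.apache.storm).
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountDrive {
    // "throws Exception" keeps this compiling on Storm 2.x, where LocalCluster declares checked exceptions
    public static void main(String[] args) throws Exception {
        // Create the topology builder
        TopologyBuilder builder = new TopologyBuilder();
        // Register the components with their parallelism and grouping strategy
        builder.setSpout("WordCountSpout", new WordCountSpout(), 2);
        // The sentence is always identical, so grouping on "itstar" routes every line to the same split task
        builder.setBolt("WordCountSplitBolt", new WordCountSplitBolt(), 4).fieldsGrouping("WordCountSpout", new Fields("itstar"));
        builder.setBolt("WordCountBolt", new WordCountBolt(), 2).fieldsGrouping("WordCountSplitBolt", new Fields("word"));
        // Create the configuration
        Config config = new Config();
        // Submit the topology to an in-process local cluster
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("WordCountTopology", config, builder.createTopology());
    }
}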
Sample output (excerpt; the labels 单词位 and 当前已出现次数为 mean "word" and "occurrences so far"):
Thread-20-WordCountBolt-executor[2 2] 单词位:ycf 当前已出现次数为:34
Thread-26-WordCountBolt-executor[1 1] 单词位:hen 当前已出现次数为:34
Thread-20-WordCountBolt-executor[2 2] 单词位:very 当前已出现次数为:34
Thread-20-WordCountBolt-executor[2 2] 单词位:shuai 当前已出现次数为:34
Thread-20-WordCountBolt-executor[2 2] 单词位:am 当前已出现次数为:35
Thread-20-WordCountBolt-executor[2 2] 单词位:ycf 当前已出现次数为:35
Thread-26-WordCountBolt-executor[1 1] 单词位:I 当前已出现次数为:35
Thread-20-WordCountBolt-executor[2 2] 单词位:very 当前已出现次数为:35
Thread-26-WordCountBolt-executor[1 1] 单词位:hen 当前已出现次数为:35
Thread-20-WordCountBolt-executor[2 2] 单词位:shuai 当前已出现次数为:35
Thread-26-WordCountBolt-executor[1 1] 单词位:I 当前已出现次数为:36
Thread-26-WordCountBolt-executor[1 1] 单词位:hen 当前已出现次数为:36
Thread-20-WordCountBolt-executor[2 2] 单词位:am 当前已出现次数为:36
Thread-20-WordCountBolt-executor[2 2] 单词位:ycf 当前已出现次数为:36
Thread-20-WordCountBolt-executor[2 2] 单词位:very 当前已出现次数为:36
Thread-20-WordCountBolt-executor[2 2] 单词位:shuai 当前已出现次数为:36
Thread-20-WordCountBolt-executor[2 2] 单词位:am 当前已出现次数为:37
Thread-26-WordCountBolt-executor[1 1] 单词位:I 当前已出现次数为:37
Thread-20-WordCountBolt-executor[2 2] 单词位:ycf 当前已出现次数为:37
Thread-26-WordCountBolt-executor[1 1] 单词位:hen 当前已出现次数为:37
Thread-20-WordCountBolt-executor[2 2] 单词位:very 当前已出现次数为:37
Thread-20-WordCountBolt-executor[2 2] 单词位:shuai 当前已出现次数为:37
Thread-20-WordCountBolt-executor[2 2] 单词位:am 当前已出现次数为:38
Spout -> emits data -> Bolt -> splits each line and pairs every word with 1 (map)
-> Bolt -> aggregates the counts (reduce)
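The map and reduce roles of the two bolts can be sketched in plain Java, without Storm. This is only an illustration: the class name WordCountPipelineSketch and its two helper methods are hypothetical and stand in for the execute() methods of WordCountSplitBolt and WordCountBolt.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the split (map) and count (reduce) steps; no Storm involved.
class WordCountPipelineSketch {

    // "Map" step: split a line on single spaces and pair every word with 1,
    // mirroring the (word, sum) tuples the split bolt emits.
    static List<Map.Entry<String, Integer>> split(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split(" ")) {
            pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // "Reduce" step: add the incoming count to the running total for that word,
    // mirroring the HashMap kept by the counting bolt.
    static void count(Map<String, Integer> totals, Map.Entry<String, Integer> pair) {
        totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> totals = new HashMap<>();
        // Feed the same sentence three times, as three nextTuple() calls would.
        for (int i = 0; i < 3; i++) {
            for (Map.Entry<String, Integer> pair : split("I am ycf very hen shuai")) {
                count(totals, pair);
            }
        }
        // Every one of the six words ends with a total of 3.
        System.out.println(totals);
    }
}
```

In the real topology the two steps run in different threads, so the grouping strategy decides which counting task sees which word.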
Parallelism & grouping strategies
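1) Fields Grouping
Groups by field. Tuples with the same value in the grouping field are always sent to the same task.
2) Shuffle Grouping
Random grouping. Tuples are distributed randomly, in round-robin fashion, so that each bolt task receives roughly the same number of tuples.
3) None Grouping
No grouping preference is stated.
Storm currently treats this the same as shuffle grouping, so which words each bolt task receives varies.
4) All Grouping
Broadcast: every tuple is sent to all tasks of the bolt.
5) Global Grouping
Global grouping: the entire stream goes to a single task.
That task is the one with the lowest task id.
The decision is made on the task id, not on the thread id.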
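How each strategy picks its target can be modeled in a few lines of plain Java. This is a simplified sketch, not Storm's implementation: the class GroupingSketch and its methods are hypothetical, fields grouping is reduced to a hash of one string modulo the task count, and shuffle grouping (which also covers none grouping) is reduced to plain round-robin without the randomization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Simplified model of target-task selection per grouping; not Storm's actual code.
class GroupingSketch {

    // Fields grouping: hash the grouping value so equal values always reach the same task.
    static int fieldsGrouping(String fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    // Shuffle grouping, modeled as plain round-robin over task indexes.
    // tupleIndex is a hypothetical, non-negative running counter of emitted tuples.
    static int shuffleGrouping(long tupleIndex, int numTasks) {
        return (int) (tupleIndex % numTasks);
    }

    // All grouping: broadcast, so every task index receives a copy.
    static List<Integer> allGrouping(int numTasks) {
        List<Integer> targets = new ArrayList<>();
        for (int task = 0; task < numTasks; task++) {
            targets.add(task);
        }
        return targets;
    }

    // Global grouping: the whole stream goes to the lowest task id.
    static int globalGrouping(List<Integer> taskIds) {
        return Collections.min(taskIds);
    }

    public static void main(String[] args) {
        int numTasks = 2;
        for (String word : "I am ycf very hen shuai".split(" ")) {
            System.out.println(word + " -> task " + fieldsGrouping(word, numTasks));
        }
        System.out.println("global -> task id " + globalGrouping(Arrays.asList(7, 3, 5)));
    }
}
```

The property that matters for word count is the first one: calling fieldsGrouping with the same word always returns the same task index.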
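Settings
Number of workers: 2.
Total number of threads: 10. The parallelism hint determines the number of threads, that is executors, so there are 10 executors.
Total number of tasks: 12. Because splitBolt is set to 4 tasks, the sum is 2 + 4 + 6.
One executor can run several tasks, which is why the figure shows the splitBolt tasks two per executor.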
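The executor and task totals can be checked with a small calculation. The sketch below is an assumption-laden reading of the figure: only splitBolt and its 4 tasks are named in the text, so the component names "spout" and "countBolt" and the per-component executor counts (2, 2 and 6) are inferred from the totals of 10 and 12. In Storm itself these values would come from Config.setNumWorkers, the parallelism hint passed to setSpout and setBolt, and setNumTasks.

```java
import java.util.Arrays;
import java.util.List;

// Recomputes the executor and task totals described in the text.
class ParallelismSketch {

    // One topology component: its executor (thread) count and its task count.
    static final class Component {
        final String name;
        final int executors;
        final int tasks;

        Component(String name, int executors, int tasks) {
            this.name = name;
            this.executors = executors;
            this.tasks = tasks;
        }
    }

    static int totalExecutors(List<Component> components) {
        int sum = 0;
        for (Component c : components) {
            sum += c.executors;
        }
        return sum;
    }

    static int totalTasks(List<Component> components) {
        int sum = 0;
        for (Component c : components) {
            sum += c.tasks;
        }
        return sum;
    }

    // Assumed layout: tasks default to the executor count unless set explicitly,
    // which here happens only for splitBolt (4 tasks on 2 executors).
    static List<Component> figureTopology() {
        return Arrays.asList(
                new Component("spout", 2, 2),
                new Component("splitBolt", 2, 4),
                new Component("countBolt", 6, 6));
    }

    public static void main(String[] args) {
        List<Component> topology = figureTopology();
        // Prints: executors=10, tasks=12
        System.out.println("executors=" + totalExecutors(topology) + ", tasks=" + totalTasks(topology));
    }
}
```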
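Each thread runs its own copy of the business logic independently. For this word count program, the shuffle grouping shown in the figure does affect the business logic: words are handed to threads at random, so several threads may receive the same word and each counts it on its own. Every thread then holds only a partial count for that word, and the per-thread results would still have to be added together. Setting the parallelism to 1 removes the problem.
The programming model above uses fields grouping, which does not affect the business logic, because the same word always reaches the same task.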
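The difference can be made visible with a small simulation in plain Java. It is a sketch under stated assumptions: the class GroupingEffectSketch is hypothetical, shuffle grouping is approximated by deterministic round-robin, fields grouping by a hash modulo the task count, and four counting tasks are used (instead of the two in the driver) so that the round-robin stand-in actually spreads one word over several tasks.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simulates per-task word counts under shuffle-like and fields-like routing.
class GroupingEffectSketch {

    // Routes `rounds` copies of the sentence to `numTasks` counting tasks and
    // returns the count map each task ends up with.
    static List<Map<String, Integer>> run(String sentence, int rounds, int numTasks, boolean byField) {
        List<Map<String, Integer>> perTask = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            perTask.add(new HashMap<>());
        }
        long emitted = 0;
        for (int r = 0; r < rounds; r++) {
            for (String word : sentence.split(" ")) {
                int target = byField
                        ? Math.floorMod(word.hashCode(), numTasks) // same word, same task
                        : (int) (emitted % numTasks);              // round-robin stand-in for shuffle
                perTask.get(target).merge(word, 1, Integer::sum);
                emitted++;
            }
        }
        return perTask;
    }

    // Number of tasks that hold a (partial) count for the word.
    static int tasksHolding(List<Map<String, Integer>> perTask, String word) {
        int holders = 0;
        for (Map<String, Integer> counts : perTask) {
            if (counts.containsKey(word)) {
                holders++;
            }
        }
        return holders;
    }

    public static void main(String[] args) {
        String sentence = "I am ycf very hen shuai";
        List<Map<String, Integer>> shuffled = run(sentence, 4, 4, false);
        List<Map<String, Integer>> byField = run(sentence, 4, 4, true);
        System.out.println("shuffle: tasks holding \"I\" = " + tasksHolding(shuffled, "I"));
        System.out.println("fields:  tasks holding \"I\" = " + tasksHolding(byField, "I"));
    }
}
```

With the shuffle-like routing the count for a word is split across tasks and would need a further summing step; with the fields-like routing exactly one task holds the full count.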