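1. Definition
In Flink, a Kafka source produces a non-retract (append-only) stream, whereas a GROUP BY produces a retract stream. A retract stream is a stream that can update results it has already emitted. Updating does not mean modifying records that were sent downstream earlier; once a message has been sent, it cannot be recalled. Instead, once a result already exists for a given key (the fields that follow keyBy / GROUP BY), every new record for that key makes the operator send downstream a delete (retract) message for the old result, followed by an insert message carrying the new result.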
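To make the delete-plus-insert behaviour concrete, here is a minimal sketch in plain Java that imitates what a GROUP BY count sends downstream. It does not use any Flink API: the class name RetractCountSketch, the emit method and the message format are assumptions made for illustration only, with "true" marking an insert and "false" marking a retraction.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch; it only imitates the messages a GROUP BY count emits.
public class RetractCountSketch {

    // Returns the emitted messages in order: "true" = insert, "false" = retraction.
    static List<String> emit(List<String> words) {
        Map<String, Long> counts = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (String word : words) {
            Long previous = counts.get(word);
            if (previous != null) {
                // The key already has a result downstream: withdraw it first.
                out.add("(false,(" + word + "," + previous + "))");
            }
            long updated = (previous == null ? 0L : previous) + 1;
            counts.put(word, updated);
            // Then send the new result for this key.
            out.add("(true,(" + word + "," + updated + "))");
        }
        return out;
    }

    public static void main(String[] args) {
        for (String message : emit(Arrays.asList("hello", "hello", "world"))) {
            System.out.println(message);
        }
    }
}
```

The first record for a key produces only an insert; every later record for the same key produces a retraction of the previous count and then an insert of the new one.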
2. Example
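// Import paths assume Flink 1.9/1.10; from 1.11 on, StreamTableEnvironment
// lives in org.apache.flink.table.api.bridge.java.
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;

public class RetractDemo {
    public static void main(String[] args) throws Exception {
        // set up execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // streaming mode; no planner is selected explicitly, so the version default applies
        EnvironmentSettings settings = EnvironmentSettings.newInstance()
                .inStreamingMode()
                .build();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env, settings);
        // use fromElements to simulate non-retract (append-only) messages
        DataStream<Tuple2<String, Integer>> dataStream = env.fromElements(
                new Tuple2<>("hello", 1), new Tuple2<>("hello", 1), new Tuple2<>("hello", 1));
        tEnv.registerDataStream("tmpTable", dataStream, "word, num");
        // inner query: count per word; outer query: how many words share each count
        Table table = tEnv.sqlQuery("select cnt, count(word) as freq from (select word, count(num) as cnt from tmpTable group by word) group by cnt");
        // convert to a retract stream: each record is (flag, row), true = insert, false = retract
        tEnv.toRetractStream(table, TypeInformation.of(new TypeHint<Tuple2<Long, Long>>() {
        })).print();
        env.execute();
    }
}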
Output:
(true,(1,1))
(false,(1,1))
(true,(2,1))
(false,(2,1))
(true,(3,1))
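The output above can be reproduced by hand. The following plain-Java sketch is a simplified model of the two nested aggregations, not Flink's implementation: the class TwoLevelRetractSketch and its methods are hypothetical names, and the rule that a group whose count drops to zero emits only a retraction is an assumption chosen to match the printed result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of: select cnt, count(word) from (select word, count(num) cnt ... group by word) group by cnt
public class TwoLevelRetractSketch {

    private final Map<String, Long> cntByWord = new HashMap<>(); // inner state: word -> cnt
    private final Map<Long, Long> freqByCnt = new HashMap<>();   // outer state: cnt -> freq
    private final List<String> output = new ArrayList<>();

    // Inner aggregation: append-only input, retract-stream output.
    void onWord(String word) {
        Long oldCnt = cntByWord.get(word);
        if (oldCnt != null) {
            onInnerResult(false, oldCnt); // withdraw the previous (word, cnt) row
        }
        long newCnt = (oldCnt == null ? 0L : oldCnt) + 1;
        cntByWord.put(word, newCnt);
        onInnerResult(true, newCnt);      // send the updated (word, cnt) row
    }

    // Outer aggregation: consumes the retract stream of the inner aggregation.
    void onInnerResult(boolean isInsert, long cnt) {
        Long oldFreq = freqByCnt.get(cnt);
        long newFreq = (oldFreq == null ? 0L : oldFreq) + (isInsert ? 1 : -1);
        if (newFreq == 0) {
            // The group is empty again: withdraw its last result and drop its state.
            freqByCnt.remove(cnt);
            output.add("(false,(" + cnt + "," + oldFreq + "))");
            return;
        }
        if (oldFreq != null) {
            output.add("(false,(" + cnt + "," + oldFreq + "))");
        }
        freqByCnt.put(cnt, newFreq);
        output.add("(true,(" + cnt + "," + newFreq + "))");
    }

    // Feeds all words through both levels and returns the emitted messages.
    static List<String> run(List<String> words) {
        TwoLevelRetractSketch sketch = new TwoLevelRetractSketch();
        for (String word : words) {
            sketch.onWord(word);
        }
        return sketch.output;
    }

    public static void main(String[] args) {
        for (String message : run(Arrays.asList("hello", "hello", "hello"))) {
            System.out.println(message);
        }
    }
}
```

Each new "hello" makes the inner count retract its old value, which empties the outer group for that value, and then insert the new value, which creates a fresh outer group with freq 1. That is why the count 1 and the count 2 are each inserted and then retracted, while (3,1) remains as the final result.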
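2.1 Source code analysis
2.1.1 Retraction in aggregate operators
Consider the following SQL.
First-level count, which consumes the non-retract stream from the Kafka source: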
SELECT region, count(id) AS order_cnt FROM order_tab GROUP BY region
Second-level count, which consumes the retract stream produced by the first-level count:
SELECT order_cnt, count(region) as region_cnt FROM order_count_view GROUP BY order_cnt
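Let's walk through the source code:
- Code generation
When Flink generates the physical execution plan for a SQL statement, it generates the aggregation method GeneratedAggregations#retract() inside AggregateUtil.createGroupAggregateFunction, and finally compiles and runs it with the Janino dynamic compilation framework. Generating GeneratedAggregations: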
object AggregateUtil {
private[flink] def createDataStreamGroupAggregateFunction[K](...generateRetraction: Boolean...