Our project uses Flink SQL for real-time statistics and needs ROW_NUMBER(). I checked the official docs: supported. Great. In practice, however, it did not work out of the box.
First, the job setup: the job uses processing time (TimeCharacteristic.ProcessingTime) and a tumbling window, aggregating each day's data from 00:00:00 to 23:59:59.
Table setup
// Prepare the environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.getConfig().setGlobalJobParameters(param);
env.enableCheckpointing(5 * 60 * 1000);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(2000);
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setCheckpointTimeout(60000);
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
// Tolerate this many checkpoint failures before failing the job
env.getCheckpointConfig().setTolerableCheckpointFailureNumber(5);
// Configure the state backend (RocksDB with incremental checkpoints)
RocksDBStateBackend dbStateBackend = new RocksDBStateBackend("file:///zxj/flink_checkpoint/state", true);
env.setStateBackend(dbStateBackend);
// Create the TableEnvironment
StreamTableEnvironment bsTableEnv = StreamTableEnvironment.create(env);
bsTableEnv.createTemporaryView("test_crm", testDataStream
,$("companyId")
,$("positionId")
,$("orgId")
,$("orgPath")
,$("isLogin")
,$("isVisit")
,$("isLoginVisit")
,$("isPrivate")
,$("isFamily")
,$("rowTime")
,$("dateKey")
,$("currentDate")
,$("proctime").proctime()
);
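For reference, here is a minimal sketch of the event type that testDataStream could carry behind this view. The field names come from the view definition above; the types are my assumptions, not confirmed by the original post:

```java
// Hypothetical event POJO backing testDataStream.
// Field names match the view above; all types are assumptions.
public class CrmEvent {
    public long companyId;
    public long positionId;
    public long orgId;
    public String orgPath;
    public int isLogin;
    public int isVisit;
    public int isLoginVisit;
    public int isPrivate;
    public int isFamily;
    public long rowTime;       // event timestamp, assumed epoch millis
    public String dateKey;     // yyyyMMdd key of the event, e.g. "20200428"
    public String currentDate; // yyyyMMdd of the current day (assumption)
}
```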
// Prepare the SQL (the statement is listed below)
String bdSql = "";
Table bdResult = bsTableEnv.sqlQuery(bdSql);
DataStream<Tuple2<Boolean, Row>> bdResultStream = bsTableEnv.toRetractStream(bdResult, Row.class);
bdResultStream.print();
The Flink SQL:
select
positionId
, count(distinct companyId) filter(where (isFamily = 0 and isLogin = 1)) as login_company_num
from (
select
*
,ROW_NUMBER() OVER(PARTITION BY companyId order by proctime desc) as rownum
from test_crm
) a
where rownum = 1
group by positionId, tumble(proctime, interval '1' day)
The error:
org.apache.flink.table.api.TableException: GroupWindowAggregate doesn't support consuming update and delete changes which is produced by node Rank(strategy=[UndefinedStrategy], rankType=[ROW_NUMBER], rankRange=[rankStart=1, rankEnd=1], partitionBy=[companyId], orderBy=[rowTime DESC], select=[companyId, positionId, isLogin, isFamily, rowTime])
at org.apache.flink.table.planner.plan.optimize.program.FlinkChangelogModeInferenceProgram$SatisfyModifyKindSetTraitVisitor.createNewNode(FlinkChangelogModeInferenceProgram.scala:355)
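The root cause: a top-1 ROW_NUMBER ordered by time descending produces an updating changelog. Every new event for an existing companyId retracts the previous top-1 row and inserts a new one, and GroupWindowAggregate cannot consume those retractions. A plain-Java simulation of that changelog (just an illustration, not Flink code):

```java
import java.util.*;

public class RankUpdates {
    // Simulates why a "latest row per companyId" Rank node emits updates:
    // each new event for an existing key retracts the old top-1 row
    // and inserts the new one. Events are {companyId, value} pairs.
    public static List<String> changelog(List<long[]> events) {
        Map<Long, long[]> top1 = new HashMap<>();
        List<String> log = new ArrayList<>();
        for (long[] e : events) {
            long[] old = top1.put(e[0], e);
            if (old != null) log.add("-D " + Arrays.toString(old)); // retract previous top-1
            log.add("+I " + Arrays.toString(e));                    // insert new top-1
        }
        return log;
    }

    public static void main(String[] args) {
        // Two events for companyId 1: the second one retracts the first.
        changelog(Arrays.asList(new long[]{1, 10}, new long[]{1, 20}))
                .forEach(System.out::println);
        // prints: +I [1, 10] / -D [1, 10] / +I [1, 20]
    }
}
```

A downstream operator that only understands inserts (like the group window aggregate here) chokes on the -D rows, which is exactly what the exception says.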
Then I started searching.
The clearest explanation I found was this post: https://blog.csdn.net/wangpei1949/article/details/105472473/
The only difference between my job and that author's is the time characteristic, which should not affect ROW_NUMBER. At that point I still believed I was simply holding ROW_NUMBER wrong, until I gave up on ROW_NUMBER, rewrote the SQL another way, and hit the exact same error.
Then it became clear: the subquery uses an OVER window and the outer GROUP BY uses a group window. Dropping the group window made the job run normally.
Changes:
- Stop using the group window: remove , tumble(rowTime, interval '1' day) from the GROUP BY.
- Limit the time range with dateKey instead: each record's c_t is converted to a yyyyMMdd value such as 20200428, and the query simply filters on dateKey = currentDate.
- Set a maximum idle state retention time, otherwise state keeps growing and the job will eventually die. (Whether these retention values are reasonable still needs verification; background reading: https://zhuanlan.zhihu.com/p/98788196)
bsTableEnv.getConfig().getConfiguration().addAll(tConfig);
bsTableEnv.getConfig().setIdleStateRetentionTime(Time.hours(24), Time.hours(25));
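For the dateKey filter to work, each event needs a yyyyMMdd key derived from its timestamp. A minimal sketch of that conversion, assuming the timestamp is epoch milliseconds and the job runs in the Asia/Shanghai time zone (both assumptions; the original post does not specify them):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class DateKey {
    // Convert an epoch-millis timestamp (the hypothetical c_t field)
    // to a yyyyMMdd key like "20200428". Time zone is an assumption.
    static String toDateKey(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis)
                .atZone(ZoneId.of("Asia/Shanghai"))
                .toLocalDate()
                .format(DateTimeFormatter.ofPattern("yyyyMMdd"));
    }

    public static void main(String[] args) {
        // 2020-04-28 14:00 in Asia/Shanghai
        System.out.println(toDateKey(1588053600000L)); // prints 20200428
    }
}
```

Computing currentDate the same way for "now" lets the SQL keep only today's records with a plain dateKey = currentDate predicate instead of a window.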
The modified SQL:
select
positionId
, count(distinct companyId) filter(where (isFamily = 0 and isLogin = 1)) as login_company_num
from (
select
*
,ROW_NUMBER() OVER(PARTITION BY companyId order by proctime desc) as rownum
from test_crm
where dateKey = currentDate
) a
where rownum = 1
group by positionId
Local testing passes; the test and production environments are still to be verified.
If anything here is off, corrections from more experienced readers are welcome. Thanks!