Reference article: 记一次Flink 写Mysql优化 (CSDN blog by 三年之期已到)
A common pattern in Flink applications is real-time writing: for every incoming transaction record, the updated aggregate is written to the database. In production the database is shared by many applications, so as more and more Flink jobs are deployed, write pressure on the database keeps growing until jobs start blocking on writes and suffering backpressure; in severe cases this leads to checkpoint timeouts and job restarts.
Starting from this observation, we can optimize Flink applications for certain scenarios to lower the write frequency and reduce the load on the database.
Code to be optimized:
Table clearTransTable = tableEnv.fromDataStream(srcStream, $("stlBranchId"), $("stlOrgId"), $("txnAmt"), $("payTimeTs"), $("bigmccId"));
tableEnv.createTemporaryView("clearTransTable", clearTransTable);
tableEnv.executeSql(
" CREATE TABLE rts_clear_mcc_day_trans(\n" +
" stl_branch_id VARCHAR,\n" +
" stl_org_id VARCHAR,\n" +
" bigmcc_id VARCHAR,\n" +
" time_date VARCHAR,\n" +
" trans_count BIGINT,\n" +
" trans_amount DOUBLE,\n" +
" primary key(time_date,stl_branch_id,stl_org_id) NOT ENFORCED\n"+
") WITH (\n" +
" 'connector' = 'jdbc',\n" +
" 'url' = '" + SINK_JDBC_URL + "',\n" +
" 'table-name' = '" + SINK_TABLE_NAME + "',\n" +
" 'username' = '"+ SINK_JDBC_USERNAME +"',\n" +
" 'password' = '" + SINK_JDBC_PWD + "',\n" +
" 'scan.auto-commit' = 'true',\n" +
" 'sink.buffer-flush.max-rows' = '100',\n" +
" 'sink.buffer-flush.interval' = '1s'\n" +
")");
tableEnv.executeSql("INSERT INTO rts_clear_mcc_day_trans\n" +
" SELECT stlBranchId as stl_branch_id, " +
" stlOrgId as stl_org_id,\n" +
" bigmccId as bigmcc_id,\n" +
" DATE_FORMAT(payTimeTs ,'yyyy-MM-dd') as time_day," +
" count(1) as trans_count,\n" +
" sum(txnAmt) as trans_amount\n" +
" FROM clearTransTable\n" +
" GROUP BY stlBranchId,stlOrgId,bigmccId,\n" +
" DATE_FORMAT(payTimeTs,'yyyy-MM-dd')\n");
Why a plain GROUP BY instead of a window:
Because the upstream transaction records cannot be guaranteed to arrive 100% in time order, a plain GROUP BY is used instead of a window so that no record is missed. The downside is that every single record triggers a recomputation plus a database write, as sketched below.
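For example (values invented): three transactions arriving for the same (branch, org, mcc, day) key cause the same database row to be upserted three times in quick succession:

trans_count=1, trans_amount=10.0   -- after the 1st record (amount 10.0)
trans_count=2, trans_amount=7.0    -- after the 2nd record (amount -3.0)
trans_count=3, trans_amount=15.5   -- after the 3rd record (amount 8.5)

The buffer-flush options batch these writes into fewer round trips (and may collapse updates to the same key within one flush), but with the 1s flush interval configured above, each hot key can still be rewritten up to once per second.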
Why optimize:
This scenario computes per-category (MCC) transaction statistics, so the key space is small, roughly a few hundred to a thousand keys. In other words, every second the statistics of a small set of keys are updated over and over, yet the business does not require these category-level aggregates to be 100% real-time; minute-level freshness is sufficient.
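A rough, illustrative calculation (all figures are assumptions, not measurements): with about 1,000 active keys and 2,000 records per second, the unwindowed GROUP BY pushes on the order of 2,000 updates per second toward the database, whereas a one-minute window caps the writes at about 1,000 rows per minute, i.e. under 20 rows per second, roughly a hundredfold reduction.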
Optimized code:
// Imports used by the optimized version (assuming the Flink 1.11+ flink-connector-jdbc module):
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.connector.jdbc.JdbcStatementBuilder;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.table.api.Table;
import org.apache.flink.util.Collector;
import static org.apache.flink.table.api.Expressions.$;

Table clearTransTable = tableEnv.fromDataStream(srcStream, $("stlBranchId"), $("stlOrgId"), $("txnAmt"), $("payTimeTs"), $("bigmccId"));
tableEnv.createTemporaryView("clearTransTable", clearTransTable);
// Same unwindowed aggregation as before, but kept as an intermediate table
// instead of being written straight to the database.
String aggregateSql = "SELECT stlBranchId, " +
        "stlOrgId, " +
        "bigmccId, " +
        "DATE_FORMAT(payTimeTs,'yyyy-MM-dd') as timeDay, " +
        "count(1) as transCount, " +
        "sum(txnAmt) as transAmount " +
        "FROM clearTransTable " +
        "GROUP BY stlBranchId,stlOrgId,bigmccId," +
        "DATE_FORMAT(payTimeTs,'yyyy-MM-dd')";
Table aggregatedTable = tableEnv.sqlQuery(aggregateSql);
aggregatedTable.printSchema();
// Convert the aggregation result into a retract stream: f0 == true marks an
// insert/update message, f0 == false marks a retraction of a previous result.
SingleOutputStreamOperator<MccResult> res = tableEnv.toRetractStream(aggregatedTable, MccResult.class)
        .flatMap(new FlatMapFunction<Tuple2<Boolean, MccResult>, MccResult>() {
            @Override
            public void flatMap(Tuple2<Boolean, MccResult> value, Collector<MccResult> out) throws Exception {
                if (value.f0) {            // keep only the updated results;
                    out.collect(value.f1); // retraction messages are dropped
                }
            }
        }).returns(MccResult.class);
tableEnv.createTemporaryView("aggregatedTable", res);
// Once per minute, keep only the latest aggregate per key. LAST_VALUE is used
// because transAmount can decrease (amounts may be negative), so MAX/MIN would be wrong.
Table result = tableEnv.sqlQuery(
        "SELECT stlBranchId, stlOrgId, bigmccId, timeDay, " +
        "LAST_VALUE(transCount) AS transCount, " +
        "LAST_VALUE(transAmount) AS transAmount " +
        "FROM aggregatedTable " +
        "GROUP BY TUMBLE(PROCTIME(), INTERVAL '1' MINUTE), stlBranchId, stlOrgId, bigmccId, timeDay");
// The window output is an append-only stream, so it can be written with a plain
// JDBC upsert; at most one row per key now reaches MySQL per minute.
tableEnv.toAppendStream(result, MccResult.class).addSink(JdbcSink.sink(
        "insert into rts_clear_mcc_day_trans" +
        "(stl_branch_id,stl_org_id,bigmcc_id,time_date,trans_count,trans_amount) " +
        "values(?,?,?,?,?,?) " +
        "on duplicate key update trans_count = VALUES(`trans_count`), trans_amount = VALUES(`trans_amount`)",
        (JdbcStatementBuilder<MccResult>) (ps, t) -> {
            ps.setString(1, t.getStlBranchId());
            ps.setString(2, t.getStlOrgId());
            ps.setString(3, t.getBigmccId());
            ps.setString(4, t.getTimeDay());
            ps.setLong(5, t.getTransCount());
            ps.setDouble(6, t.getTransAmount());
        },
        // flush every 10 rows (and on checkpoint); retry a failed batch up to 3 times
        JdbcExecutionOptions.builder().withBatchSize(10).withMaxRetries(3).build(),
        new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                .withUrl(SystemConstant.SINK_JDBC_URL)
                .withDriverName(SystemConstant.SINK_JDBC_DRIVER)
                .withUsername(SystemConstant.SINK_JDBC_USERNAME)
                .withPassword(SystemConstant.SINK_JDBC_PWD)
                .build()
)).uid("jdbcSink").name("jdbcSink");
env.execute("RtsClearMccDayTrans");
1. The GROUP BY aggregation result is not written to the database directly; it is first turned into an intermediate table, aggregatedTable;
2. The intermediate table aggregatedTable is converted into a DataStream;
3. A flatMap operator removes the retraction messages from that DataStream, leaving res, the update stream of aggregation results (see the sketch after this list);
4. A one-minute tumbling window is applied over res, and for each aggregation key the latest record within the minute is taken with the LAST_VALUE() function. MAX or MIN would not do: trans_amount can decrease because amounts may be negative, so neither the maximum nor the minimum within the window is guaranteed to be the most recent value;
5. The windowed result is converted back into a DataStream and sunk to the database via JDBC.
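To make steps 3 and 4 concrete, here is a sketch (all field values invented) of the Tuple2<Boolean, MccResult> elements that toRetractStream emits for one key when a 10.0 payment is followed by a -3.0 refund:

(true,  MccResult{transCount=1, transAmount=10.0})   // new result, kept by the flatMap
(false, MccResult{transCount=1, transAmount=10.0})   // retraction of the old result, dropped
(true,  MccResult{transCount=2, transAmount=7.0})    // updated result, kept

If both updates land in the same one-minute window, LAST_VALUE(transAmount) returns the correct 7.0, whereas MAX(transAmount) would return the stale 10.0.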
This optimization rests on two prerequisites:
1. The key space participating in the aggregation is small;
2. The real-time requirement on the statistics is loose.
For an introduction to retract streams, you can refer to this blog post.