1. Stream Processing
1.1 ETL/Pipeline Stream Processing
Both frameworks have made ETL development straightforward.
Basic source/sink examples:
Spark 3.5 - Structured Streaming
import org.apache.spark.sql.streaming.Trigger

// Assumes a SparkSession named `spark` is already in scope.
def streaming_by_jdbc(): Unit = {
  // ----------------------------------- source -----------------------------
  // NOTE: the built-in jdbc data source does NOT support streamed reading.
  // This readStream call fails with
  // "Data source jdbc does not support streamed reading".
  val df = spark
    .readStream
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/spark")
    .option("user", "changeme")
    .option("password", "changeme")
    .option("dbtable", "(select * from poc_1m.lineitem limit 10) as t")
    .load()
  df.createOrReplaceTempView("jdbc_source_vw")
  // ----------------------------------- process -----------------------------
  // simple map: tag each row with a constant source label
  val sdf1 = spark.sql("select 'JDBC_SOURCE'" +
    ", cast(value as STRING)" +
    ", * " +
    "from jdbc_source_vw")
  // ----------------------------------- sink -----------------------------
  // STREAM SINK 1 (APPEND): print each micro-batch to the console every second
  sdf1
    .writeStream
    .outputMode("append")
    .queryName("JDBC_SOURCE_SINK")
    .trigger(Trigger.ProcessingTime(1000))
    .option("truncate", false)
    .format("console")
    .start()
}
Blink (Aliyun's proprietary Flink distribution)
try {
    ExecutionEnvironment environment = ExecutionEnvironment.getExecutionEnvironment();
    // environment.getConfig().setExecutionMode(ExecutionMode.BATCH);
    JDBCInputFormat.JDBCInputFormatBuilder inputBuilder = JDBCInputFormat.buildJDBCInputFormat()
        .setDrivername("com.mysql.cj.jdbc.Driver")
        .setDBUrl("jdbc:mysql://localhost:3306/blink?characterEncoding=UTF-8&useUnicode=true&useSSL=false&tinyInt1isBit=false&allowPublicKeyRetrieval=true&serverTimezone=Asia/Shanghai")
        .setQuery("select id, device_id, flag from et01_source_tbl")
        .setUsername("changeme")
        .setPassword("changeme")
        .setRowTypeInfo(ROW_TYPE_INFO);
    DataSet<Row> source = environment.createInput(inputBuilder.finish());
    source.print();
    source.output(JDBCOutputFormat.buildJDBCOutputFormat()
        .setDrivername("com.mysql.cj.jdbc.Driver")
        .setDBUrl("jdbc:mysql://localhost:3306/blink?characterEncoding=UTF-8&useUnicode=true&useSSL=false&tinyInt1isBit=false&allowPublicKeyRetrieval=true&serverTimezone=Asia/Shanghai")
        .setUsername("changeme")
        .setPassword("changeme")
        .setQuery("insert into et01_sink_tbl (id, device_id, flag, ts) values (?,?,?,now()) on duplicate key update id=values(id),device_id=values(device_id),flag=values(flag),ts=values(ts) ")
        .setSqlTypes(new int[]{Types.VARCHAR, Types.VARCHAR, Types.VARCHAR})
        .finish());
    environment.execute();
} catch (Exception e) {
    e.printStackTrace();
}
Flink 1.20 Table API (Apache)
// Assumes: import static org.apache.flink.table.api.Expressions.*;
Table orders = tEnv.from("Orders"); // schema (a, b, c, rowtime)
Table result = orders
    // keep only rows where a, b and c are all present
    .filter(
        and(
            $("a").isNotNull(),
            $("b").isNotNull(),
            $("c").isNotNull()
        ))
    .select($("a").lowerCase().as("a"), $("b"), $("rowtime"))
    // one-hour tumbling window on the rowtime attribute
    .window(Tumble.over(lit(1).hours()).on($("rowtime")).as("hourlyWindow"))
    .groupBy($("hourlyWindow"), $("a"))
    .select($("a"), $("hourlyWindow").end().as("hour"), $("b").avg().as("avgBillingAmount"));
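To make the one-hour tumbling window above more concrete, the following is a minimal plain-Java sketch of how an event timestamp can be mapped to its window. It assumes epoch-aligned windows with no offset; the class and method names (TumblingWindowSketch, windowStart, windowEnd) are hypothetical and are not part of the Flink API.

```java
import java.time.Duration;
import java.time.Instant;

public class TumblingWindowSketch {

    /** Start (inclusive) of the epoch-aligned tumbling window containing eventTime. */
    public static Instant windowStart(Instant eventTime, Duration size) {
        long sizeMs = size.toMillis();
        long ts = eventTime.toEpochMilli();
        // floorMod keeps the result correct for timestamps before the epoch as well
        return Instant.ofEpochMilli(ts - Math.floorMod(ts, sizeMs));
    }

    /** End (exclusive) of the same window: start plus the window size. */
    public static Instant windowEnd(Instant eventTime, Duration size) {
        return windowStart(eventTime, size).plus(size);
    }

    public static void main(String[] args) {
        Duration oneHour = Duration.ofHours(1);
        Instant rowtime = Instant.parse("2024-01-01T10:42:17Z");
        System.out.println(windowStart(rowtime, oneHour)); // 2024-01-01T10:00:00Z
        System.out.println(windowEnd(rowtime, oneHour));   // 2024-01-01T11:00:00Z
    }
}
```

Every row whose rowtime falls in [10:00, 11:00) lands in the same group, which is why the query can report one average per value of a per hour.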
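Flink 1.20 SQL (Apache)
// Register a JDBC catalog so that existing MySQL tables are visible without DDL
tableEnv.executeSql(
    "CREATE CATALOG my_catalog "
        + "WITH (\n"
        + " 'type' = 'jdbc',\n"
        + " 'default-database' = 'flink_poc', \n"
        + " 'base-url' = 'jdbc:mysql://localhost:3306',\n"
        + " 'username' = 'changeme',\n"
        + " 'password' = 'changeme'\n"
        + " );");
tableEnv.executeSql("USE CATALOG my_catalog;\n");
tableEnv.sqlQuery("select 'CATALOG_SOURCE', * from lineitem_sink_tbl").execute().print();
// The original snippet does not show where statementSet comes from or where it is
// executed; the two surrounding lines are an assumption added for completeness.
StatementSet statementSet = tableEnv.createStatementSet();
statementSet.addInsertSql("insert into lineitem_sink_tbl_cp1_4_catalog select * from lineitem_sink_tbl");
statementSet.execute();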
Flink CDC 3.2 Pipeline (Apache)
Reference: "Introduction", Apache Flink CDC documentation.
source:
  type: mysql
  hostname: localhost
  port: 3306
  username: root
  password: 123456
  tables: app_db.\.*
  server-id: 5400-5404
  server-time-zone: UTC

sink:
  type: doris
  fenodes: 127.0.0.1:8030
  username: root
  password: ""
  table.create.properties.light_schema_change: true
  table.create.properties.replication_num: 1

pipeline:
  name: Sync MySQL Database to Doris
  parallelism: 2
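The tables value app_db.\.* in the source block is easy to misread: the unescaped dot separates the database name from the table name, while \. stands for a regular-expression dot, so the pattern selects every table in app_db. The plain-Java sketch below illustrates that reading only. It is not Flink CDC's actual implementation, and the class and method names (TablePatternSketch, match) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class TablePatternSketch {

    /** Returns the ids (formatted as "db.table") selected by a pattern such as app_db.\.* */
    public static List<String> match(String pattern, List<String> tableIds) {
        // Split on the first dot that is NOT preceded by a backslash: database part, table part.
        String[] parts = pattern.split("(?<!\\\\)\\.", 2);
        // Turn each escaped dot "\." back into a plain regex dot.
        Pattern db = Pattern.compile(parts[0].replace("\\.", "."));
        Pattern table = Pattern.compile(parts[1].replace("\\.", "."));
        return tableIds.stream()
                .filter(id -> {
                    String[] t = id.split("\\.", 2);
                    return db.matcher(t[0]).matches() && table.matcher(t[1]).matches();
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("app_db.orders", "app_db.products", "other_db.orders");
        // In Java source, the YAML value app_db.\.* is written as "app_db.\\.*"
        System.out.println(match("app_db.\\.*", all)); // [app_db.orders, app_db.products]
    }
}
```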
1.2 Timely Analysis (Real-Time Fraud Detection)
There are many valuable use cases here.
1.2.1 Based on Blink (Aliyun's proprietary Flink distribution)
A custom JDBC data source triggered on a crontab schedule
package com.certusnet.utils.mysql
import org.apache.flink.api.java.io.jdbc.JDBCInputFormat
import org.apache.flink.configuration.Configuration
import org.apache.flink.core.io.InputSplit
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}
import org.apache.flink.types.Row
import scala.util.control.Breaks._
class JDBCSource(var cronExpress: String, val inputFo