20241001 Big Data Stream Processing - SPARK 3.5 vs FLINK 1.20 (Practice)

1. Stream Processing

1.1 ETL / Pipeline Stream Processing

Both frameworks have made ETL development very simple.

Basic SOURCE/SINK:

https://blog.csdn.net/weixin_46449024/article/details/141761576?fromshare=blogdetail&sharetype=blogdetail&sharerId=141761576&sharerefer=PC&sharesource=weixin_46449024&sharefrom=from_link

SPARK 3.5 - Structured Streaming


  // requires: import org.apache.spark.sql.streaming.Trigger
  def streaming_by_jdbc(): Unit = {
    // ----------------------------------- source -----------------------------
    // NOTE: the plain jdbc format is NOT a streaming source; this readStream
    // fails at runtime with "Data source jdbc does not support streamed reading"
    val df = spark
      .readStream
      .format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/spark")
      .option("user", "changeme")
      .option("password", "changeme")
      .option("dbtable", "(select * from poc_1m.lineitem limit 10) as t")
      .load()

    df.createOrReplaceTempView("jdbc_source_vw")

    // ----------------------------------- process -----------------------------
    // simple map: tag each row with a literal source label
    // ('value' assumes the source exposes a value column, as rate/socket sources do)
    val sdf1 = spark.sql("select 'JDBC_SOURCE'" +
      ", cast(value as STRING)" +
      ", * " +
      "from jdbc_source_vw")

    // ----------------------------------- sink -----------------------------
    // STREAM SINK 1 (APPEND): print each micro-batch to the console every second
    sdf1
      .writeStream
      .outputMode("append")
      .queryName("JDBC_SOURCE_SINK")
      .trigger(Trigger.ProcessingTime(1000))
      .option("truncate", false)
      .format("console")
      .start()
      .awaitTermination() // block so the streaming query keeps running
  }
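
Because the jdbc format cannot be streamed, a quick way to exercise the same process/sink path is one of Spark's built-in streaming sources. A minimal sketch with the rate source, which emits timestamp and value columns (matching the cast(value as STRING) projection above); the method name streaming_by_rate is illustrative:

  def streaming_by_rate(): Unit = {
    // the built-in "rate" source is streamable and emits
    // (timestamp TIMESTAMP, value LONG) rows at a fixed rate
    val df = spark
      .readStream
      .format("rate")
      .option("rowsPerSecond", 10)
      .load()

    df.createOrReplaceTempView("rate_source_vw")

    val sdf1 = spark.sql("select 'RATE_SOURCE', cast(value as STRING), * from rate_source_vw")

    sdf1
      .writeStream
      .outputMode("append")
      .queryName("RATE_SOURCE_SINK")
      .trigger(Trigger.ProcessingTime(1000))
      .option("truncate", false)
      .format("console")
      .start()
      .awaitTermination()
  }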


 

Blink (Alibaba Cloud's proprietary Flink)

        try {
            // NOTE: this Blink sample uses the batch DataSet API, so the source
            // table is read once rather than consumed as a continuous stream
            ExecutionEnvironment environment = ExecutionEnvironment.getExecutionEnvironment();
//            environment.getConfig().setExecutionMode(ExecutionMode.BATCH);
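            // ROW_TYPE_INFO is defined outside this snippet; a minimal sketch,
            // with three string columns matching the SELECT below, would be:
            // RowTypeInfo ROW_TYPE_INFO = new RowTypeInfo(
            //         BasicTypeInfo.STRING_TYPE_INFO,
            //         BasicTypeInfo.STRING_TYPE_INFO,
            //         BasicTypeInfo.STRING_TYPE_INFO);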
            JDBCInputFormat.JDBCInputFormatBuilder inputBuilder = JDBCInputFormat.buildJDBCInputFormat()
                    .setDrivername("com.mysql.cj.jdbc.Driver")
                    .setDBUrl("jdbc:mysql://localhost:3306/blink?characterEncoding=UTF-8&useUnicode=true&useSSL=false&tinyInt1isBit=false&allowPublicKeyRetrieval=true&serverTimezone=Asia/Shanghai")
                    .setQuery("select id, device_id, flag from et01_source_tbl")
                    .setUsername("changeme")
                    .setPassword("changeme")
                    .setRowTypeInfo(ROW_TYPE_INFO);
            DataSet<Row> source = environment.createInput(inputBuilder.finish());
            source.print();

            // upsert sink: "on duplicate key update" makes the insert idempotent on re-runs
            source.output(JDBCOutputFormat.buildJDBCOutputFormat()
                    .setDrivername("com.mysql.cj.jdbc.Driver")
                    .setDBUrl("jdbc:mysql://localhost:3306/blink?characterEncoding=UTF-8&useUnicode=true&useSSL=false&tinyInt1isBit=false&allowPublicKeyRetrieval=true&serverTimezone=Asia/Shanghai")
                    .setUsername("changeme")
                    .setPassword("changeme")
                    .setQuery("insert into et01_sink_tbl (id, device_id, flag, ts) values (?,?,?,now()) on duplicate key update id=values(id),device_id=values(device_id),flag=values(flag),ts=values(ts) ")
                    .setSqlTypes(new int[]{Types.VARCHAR, Types.VARCHAR, Types.VARCHAR}) // one JDBC type per ? placeholder
                    .finish());

            environment.execute();
        }catch(Exception e){
            e.printStackTrace();
        }

Flink 1.20 Table API (Apache)

Table API | Apache Flink

Table orders = tEnv.from("Orders"); // schema (a, b, c, rowtime)

Table result = orders
        .filter(
            and(
                $("a").isNotNull(),
                $("b").isNotNull(),
                $("c").isNotNull()
            ))
        .select($("a").lowerCase().as("a"), $("b"), $("rowtime"))
        .window(Tumble.over(lit(1).hours()).on($("rowtime")).as("hourlyWindow"))
        .groupBy($("hourlyWindow"), $("a"))
        .select($("a"), $("hourlyWindow").end().as("hour"), $("b").avg().as("avgBillingAmount"));

Flink 1.20 SQL (Apache)

tableEnv.executeSql(
                "CREATE CATALOG my_catalog "
                        + "WITH (\n"
                        + "   'type' = 'jdbc',\n"
                        + "   'default-database' = 'flink_poc', \n"
                        + "   'base-url' = 'jdbc:mysql://localhost:3306',\n"
                        + "   'username' = 'changeme',\n"
                        + "   'password' = 'changeme'\n"
                        + " );");

        tableEnv.executeSql("USE CATALOG my_catalog;\n");

        tableEnv.sqlQuery("select 'CATALOG_SOURCE', * from lineitem_sink_tbl").execute().print();


        // addInsertSql buffers the INSERT on a StatementSet; execute() submits the job
        StatementSet statementSet = tableEnv.createStatementSet();
        statementSet.addInsertSql("insert into lineitem_sink_tbl_cp1_4_catalog select * from lineitem_sink_tbl");
        statementSet.execute();
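
With the JDBC catalog, both lineitem_sink_tbl and lineitem_sink_tbl_cp1_4_catalog are existing MySQL tables surfaced through the catalog's metadata, so no CREATE TABLE DDL is needed on the Flink side.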

Flink CDC 3.2 Pipeline (Apache)

Introduction | Apache Flink CDC

source:
  type: mysql
  hostname: localhost
  port: 3306
  username: root
  password: 123456
  tables: app_db.\.*
  server-id: 5400-5404
  server-time-zone: UTC

sink:
  type: doris
  fenodes: 127.0.0.1:8030
  username: root
  password: ""
  table.create.properties.light_schema_change: true
  table.create.properties.replication_num: 1

pipeline:
  name: Sync MySQL Database to Doris
  parallelism: 2
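
A definition like this (saved as, e.g., mysql-to-doris.yaml) is submitted to a running Flink cluster with the flink-cdc.sh script shipped in the Flink CDC distribution: bash bin/flink-cdc.sh mysql-to-doris.yaml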

1.2 Real-Time Analysis (Fraud Detection)

There are many high-value scenarios here.

1.2.1 Based on Blink (Alibaba Cloud's proprietary Flink)

A custom JDBC source triggered on a crontab schedule:

package com.certusnet.utils.mysql

import org.apache.flink.api.java.io.jdbc.JDBCInputFormat
import org.apache.flink.configuration.Configuration
import org.apache.flink.core.io.InputSplit
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}
import org.apache.flink.types.Row

import scala.util.control.Breaks._

class JDBCSource(var cronExpress: String, val inputFo
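
The class above is cut off at this point. A minimal self-contained sketch of the same idea, a RichSourceFunction that replays a JDBC read every time a cron expression fires, could look like the following; the class name JDBCCronSource and the arity parameter are illustrative, and Quartz's CronExpression is assumed to be on the classpath for the cron math:

import java.util.Date

import org.apache.flink.api.java.io.jdbc.JDBCInputFormat
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}
import org.apache.flink.types.Row
import org.quartz.CronExpression

// illustrative sketch: re-run a JDBC query on a cron schedule and emit the rows
class JDBCCronSource(val cronExpress: String,
                     val inputFormat: JDBCInputFormat,
                     val arity: Int) extends RichSourceFunction[Row] {

  @volatile private var running = true
  @transient private var cron: CronExpression = _

  override def open(parameters: Configuration): Unit = {
    cron = new CronExpression(cronExpress) // e.g. "0 */5 * * * ?"
  }

  override def run(ctx: SourceFunction.SourceContext[Row]): Unit = {
    while (running) {
      // sleep until the next cron fire time
      val delay = cron.getNextValidTimeAfter(new Date()).getTime - System.currentTimeMillis()
      if (delay > 0) Thread.sleep(delay)

      // replay the whole JDBC query and emit every row downstream
      inputFormat.openInputFormat()
      inputFormat.open(null) // no input split: run the query as-is
      val reuse = new Row(arity)
      while (running && !inputFormat.reachedEnd()) {
        ctx.collect(inputFormat.nextRecord(reuse))
      }
      inputFormat.close()
      inputFormat.closeInputFormat()
    }
  }

  override def cancel(): Unit = running = false
}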