Spark Structured Streaming Join

Joins in Structured Streaming fall into two categories: stream-static joins and stream-stream joins.

This post summarizes both kinds of join.

Stream-Static Join

Test data

// Static data in MySQL
select * from t_user_info;
+--------+-------+-----+---------+---------------------+
| userID | name  | age | address | updated_at          |
+--------+-------+-----+---------+---------------------+
| user_1 | name1 |  22 | CN      | 2020-03-22 09:56:03 |
+--------+-------+-----+---------+---------------------+

// Streaming data from Kafka
{"eventTime": "2016-01-01 10:02:00", "eventType": "click", "userID": "user_1"}

Example: Kafka stream joined with a static MySQL table

package com.bigdata.structured.streaming.join

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.DataType
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.slf4j.LoggerFactory

/**
  * Author: Wang Pei
  * Summary:
  *   Stream-Static Join 
  */
object StreamStaticJoin {

  lazy val logger = LoggerFactory.getLogger(StreamStaticJoin.getClass)

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder().master("local[3]").appName(this.getClass.getSimpleName.replace("$", "")).getOrCreate()
    import spark.implicits._

    // Static data read from MySQL
    val staticDF: DataFrame = spark
      .read
      .format("jdbc")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("url", "jdbc:mysql://localhost:3306/bigdata?characterEncoding=utf8&useSSL=false")
      .option("dbtable", "t_user_info")
      .option("user", "bigdata")
      .option("password", "******")
      .load()

    // Streaming data read from Kafka
    val kafkaJsonSchema =
      """{"type":"struct","fields":[{"name":"eventTime","type":"string","nullable":true},{"name":"eventType","type":"string","nullable":true},{"name":"userID","type":"string","nullable":true}]}"""
    val streamDF: DataFrame = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka01:9092")
      .option("subscribe", "test_1")
      .load()
      .select(from_json(col("value").cast("string"), DataType.fromJson(kafkaJsonSchema)).as("value"))
      .select($"value.*")

    // Stream-Static join
    // Left outer join
    // val joinedDF = streamDF.join(staticDF,streamDF("userID")===staticDF("userID"),"left")

    // Right outer join: not supported with the stream on the left and the static table on the right
    // val joinedDF = streamDF.join(staticDF,streamDF("userID")===staticDF("userID"),"right")

    // Inner join
    val joinedDF = streamDF.join(staticDF, streamDF("userID") === staticDF("userID"))

    val query = joinedDF
      .writeStream
      .outputMode("append")
      .trigger(Trigger.ProcessingTime("2 seconds"))
      .format("console")
      .start()

    query.awaitTermination()

  }
}

Notes

  1. Join type support for stream-static joins:

     | Left Input | Right Input | Inner | Left Outer | Right Outer | Full Outer |
     |------------|-------------|-------|------------|-------------|------------|
     | Stream     | Static      | Yes   | Yes        | No          | No         |
     | Static     | Stream      | Yes   | No         | Yes         | No         |

  2. A stream-static join is stateless: each record arriving on the stream is matched against the entire static dataset.

  3. For example, when a Kafka stream joins a static MySQL table, the lookup is pushed down to MySQL on every micro-batch, so if a value in MySQL changes (e.g. a column is updated), the join picks up the new value. Likewise, when joining a static CSV file, each micro-batch re-reads the file and joins it via broadcast, so changes to the file content are also reflected; a rough sketch follows below.
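
A rough sketch of the CSV case in point 3, reusing spark and streamDF from the example above. The file path and CSV schema below are assumptions made for illustration, not part of the original example:

// Static CSV data; path and schema are hypothetical
import org.apache.spark.sql.functions.broadcast
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val csvSchema = StructType(Seq(
  StructField("userID", StringType, nullable = true),
  StructField("name", StringType, nullable = true)
))

val staticCsvDF = spark
  .read
  .schema(csvSchema)
  .option("header", "true")
  .csv("/data/t_user_info.csv")

// broadcast() is only a hint; the file is re-read on every micro-batch,
// so edits to its content show up in subsequent batches
val joinedCsvDF = streamDF.join(broadcast(staticCsvDF), Seq("userID"), "inner")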

Stream-Stream Join

Stream-stream joins, i.e. joining a streaming Dataset/DataFrame with another streaming Dataset/DataFrame, are supported starting with Spark 2.3.

Test data

// Browse stream
{"browse_user": "user_1", "browse_time": "2016-01-01 00:00:00"}

// Click stream
{"click_user": "user_1", "click_time": "2016-01-01 00:00:00"}

Inner Join example

package com.bigdata.structured.streaming.join

import java.sql.Timestamp
import java.time.{LocalDateTime, ZoneId}
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.{col, expr, from_json, lit}
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.DataType
import org.apache.spark.sql.{SparkSession, functions}
import org.slf4j.LoggerFactory

/**
  * Author: Wang Pei
  * Summary:
  *   Stream-Stream Inner Join
  */
object StreamStreamJoin {

  lazy val logger = LoggerFactory.getLogger(StreamStreamJoin.getClass)

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder().master("local[3]").appName(this.getClass.getSimpleName.replace("$", "")).getOrCreate()
    import spark.implicits._

    // Register the UDF
    spark.udf.register("timezoneToTimestamp", timezoneToTimestamp _)

    // Streaming data: browse stream
    val browseSchema = """{"type":"struct","fields":[{"name":"browse_time","type":"string","nullable":true},{"name":"browse_user","type":"string","nullable":true}]}"""
    val browseStreamDF = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka01:9092,kafka02:9092,kafka03:9092")
      .option("subscribe", "test_1")
      .load()
      .select(from_json(col("value").cast("string"), DataType.fromJson(browseSchema)).as("value"))
      .select($"value.*")
      .withColumn("browse_timestamp", functions.callUDF("timezoneToTimestamp", functions.col("browse_time"),lit("yyyy-MM-dd HH:mm:ss"),lit("GMT+8")))
      .filter($"browse_timestamp".isNotNull && $"browse_user".isNotNull)
      .withWatermark("browse_timestamp", "10 seconds")

    // Streaming data: click stream
    val clickSchema = """{"type":"struct","fields":[{"name":"click_time","type":"string","nullable":true},{"name":"click_user","type":"string","nullable":true}]}"""
    val clickStreamDF = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka01:9092,kafka02:9092,kafka03:9092")
      .option("subscribe", "test_2")
      .load()
      .select(from_json(col("value").cast("string"), DataType.fromJson(clickSchema)).as("value"))
      .select($"value.*")
      .withColumn("click_timestamp", functions.callUDF("timezoneToTimestamp", functions.col("click_time"),lit("yyyy-MM-dd HH:mm:ss"),lit("GMT+8")))
      .filter($"click_timestamp".isNotNull && $"click_user".isNotNull)
      .withWatermark("click_timestamp", "30 seconds")

    // Inner Join
    val joinedDF = browseStreamDF.join(
      clickStreamDF,
      expr("""
        browse_user = click_user AND
        click_timestamp >= browse_timestamp AND
        click_timestamp <= browse_timestamp + interval 20 seconds
        """)
    )

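    // Also write each input stream to the console so the inputs can be inspected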
    browseStreamDF
      .writeStream
      .outputMode("append")
      .format("console")
      .start()

    clickStreamDF
      .writeStream
      .outputMode("append")
      .format("console")
      .start()

    joinedDF
      .writeStream
      .outputMode("append")
      .trigger(Trigger.ProcessingTime("2 seconds"))
      .format("console")
      .start()

    spark.streams.awaitAnyTermination()

  }

  /**
    * Converts a datetime string in a given time zone to a Timestamp.
    *
    * @param dateTime       the datetime string, e.g. "2016-01-01 00:00:00"
    * @param dataTimeFormat the pattern of the datetime string, e.g. "yyyy-MM-dd HH:mm:ss"
    * @param dataTimeZone   the time zone of the datetime string, e.g. "GMT+8"
    * @return the corresponding Timestamp, or null if parsing fails
    */
  def timezoneToTimestamp(dateTime: String, dataTimeFormat: String, dataTimeZone: String): Timestamp = {
    var output: Timestamp = null
    try {
      if (dateTime != null) {
        val format = DateTimeFormatter.ofPattern(dataTimeFormat)
        val eventTime = LocalDateTime.parse(dateTime, format).atZone(ZoneId.of(dataTimeZone))
        output = new Timestamp(eventTime.toInstant.toEpochMilli)
      }
    } catch {
      case ex: Exception => logger.error("Failed to parse datetime: " + dateTime, ex)
    }
    output
  }
}

Outer Join example

This works like the inner join; the only difference is an additional argument specifying the join type as an outer join, as shown below:

val joinedDF = browseStreamDF.join(
  clickStreamDF,
  expr("""
    browse_user = click_user AND
    click_timestamp >= browse_timestamp AND
    click_timestamp <= browse_timestamp + interval 20 seconds
    """),
  "left")

Notes

  1. Stream-stream joins are stateful: by buffering rows in state, the join of two unbounded streams is broken down into joins over bounded sets of data.

  2. For stream-stream inner joins, watermarks and the time constraint are optional. If they are not specified, rows are kept in state indefinitely; if watermarks and a time constraint are set on both sides, state cleanup is enabled accordingly.

  3. For stream-stream left outer joins, the watermark and time constraint are mandatory on the right-side stream; they may optionally be specified on the left-side stream as well.

  4. For stream-stream right outer joins, the watermark and time constraint are mandatory on the left-side stream; they may optionally be specified on the right-side stream as well.

  5. Stream-stream full outer joins are not supported (as of Spark 2.x).

  6. What the watermark and the time constraint do (see the sketch after this list for an alternative way to express the time constraint):

    A. The watermark determines how late data may arrive and when buffered rows are dropped. For example, with a 30-minute watermark, records more than 30 minutes late are dropped or ignored.

    B. The time constraint determines how long rows from the other stream are retained in state so that they can still be matched.
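
A rough sketch related to point 6 (not from the original post): the time constraint can also be expressed by joining on equal event-time windows instead of a time-range condition. It reuses browseStreamDF and clickStreamDF from the inner-join example; the 20-second window size is an assumption:

// Bucket both streams into 20-second event-time windows and join on window equality
import org.apache.spark.sql.functions.window

val browseWindowed = browseStreamDF
  .withColumn("time_window", window($"browse_timestamp", "20 seconds"))

val clickWindowed = clickStreamDF
  .withColumn("time_window", window($"click_timestamp", "20 seconds"))

// Only pairs whose event times fall into the same 20-second window can match;
// together with the watermarks set on both streams, this lets Spark clean up old state
val windowJoinedDF = browseWindowed.join(
  clickWindowed,
  browseWindowed("time_window") === clickWindowed("time_window") &&
    browseWindowed("browse_user") === clickWindowed("click_user")
)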
