Spark Big Data Analysis: Spark Structured Streaming (24) — Inner Joins in Stream-Static and Stream-Stream Mode

Stream-static joins

To join a streaming DataFrame with static data, simply call join on the two DataFrames:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("Chapter9_8_1")
  .getOrCreate()

import spark.implicits._
spark.sparkContext.setLogLevel("WARN")

// Static side: an in-memory table of (name, sex) rows
val javaList = new java.util.ArrayList[Row]()
javaList.add(Row("Alice", "Female"))
javaList.add(Row("Bob", "Male"))
javaList.add(Row("Thomas", "Male"))

val schema = StructType(List(
  StructField("name", StringType, nullable = false),
  StructField("sex", StringType, nullable = false)
))

val staticDataFrame = spark.createDataFrame(javaList, schema)

// Streaming side: "name,age" lines read from a socket
val lines = spark.readStream
  .format("socket")
  .option("host", "linux01")
  .option("port", 9999)
  .load()

val streamDataFrame = lines.as[String].map(s => {
  val arr = s.split(",")
  (arr(0), arr(1).toInt)
}).toDF("name", "age")

// Equi-join on the "name" column between the stream and the static DataFrame
val joinResult = streamDataFrame.join(staticDataFrame, "name")

val query = joinResult.writeStream
  .outputMode("append")
  .trigger(Trigger.ProcessingTime(0))
  .format("console")
  .start()

query.awaitTermination()
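The join semantics above can be previewed with ordinary Scala collections: each incoming (name, age) record is matched against the static (name, sex) table, and records without a match are dropped. This is a plain-Scala sketch of inner-join behaviour, not Spark API code:

```scala
// Static table: name -> sex, mirroring staticDataFrame above.
val staticTable = Map("Alice" -> "Female", "Bob" -> "Male", "Thomas" -> "Male")

// One micro-batch of (name, age) records arriving from the socket.
val batch = Seq(("Alice", 20), ("Carol", 35))

// Inner join on name: rows without a static match are dropped.
val joined = batch.flatMap { case (name, age) =>
  staticTable.get(name).map(sex => (name, age, sex))
}
// joined == Seq(("Alice", 20, "Female")); Carol has no static row, so she is dropped.
```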

Stream-stream joins

Stream-stream inner joins are supported only since Spark 2.3, and the output mode must be Append. When two streams are joined, Structured Streaming keeps state for both streams so that rows arriving later can still join with rows that arrived earlier. Left unchecked, that state grows without bound, so watermarks are used to expire stale state. By default the global watermark is the minimum of the two streams' watermarks; since Spark 2.4 this can be changed through the spark.sql.streaming.multipleWatermarkPolicy configuration.
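If the default minimum-watermark policy is not wanted, Spark 2.4+ lets you switch it per session. A minimal config fragment (assumes an existing SparkSession named spark, as in the code below):

```scala
// Spark 2.4+: how the global watermark is derived when multiple streams have watermarks.
// "min" (default) is safest: nothing is treated as late until the slowest stream catches up.
// "max" advances the watermark faster but may drop late rows from the slower stream.
spark.conf.set("spark.sql.streaming.multipleWatermarkPolicy", "max")
```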

package struct

import java.text.SimpleDateFormat

import org.apache.spark.sql.SparkSession
import java.sql.Timestamp
import org.apache.spark.sql.streaming.Trigger

object StructStream09 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local[*]")
      .appName("Chapter9_8_2")
      .getOrCreate()

    import org.apache.spark.sql.functions._
    import spark.implicits._
    spark.sparkContext.setLogLevel("WARN")

    val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    val streamNameSex = spark.readStream
      .format("socket")
      .option("host", "linux01")
      .option("port", 9998)
      .load()
      .as[String].map(s => {
        val arr = s.split(",")
        val date = sdf.parse(arr(2))
        (arr(0), arr(1), new Timestamp(date.getTime))
      }).toDF("name1", "sex", "ts1")

    val streamNameAge = spark.readStream
      .format("socket")
      .option("host", "linux01")
      .option("port", 9999)
      .load()
      .as[String].map(s => {
        val arr = s.split(",")
        val date = sdf.parse(arr(2))
        (arr(0), arr(1).toInt, new Timestamp(date.getTime))
      }).toDF("name2", "age", "ts2")

    val streamNameSexWithWatermark = streamNameSex.withWatermark("ts1", "2 minutes")
    val streamNameAgeWithWatermark = streamNameAge.withWatermark("ts2", "1 minutes")
    val joinResult = streamNameSexWithWatermark.join(
      streamNameAgeWithWatermark,
      expr(
        """
        name1 = name2 AND
        ts2 >= ts1 AND
        ts2 <= ts1 + interval 1 minutes
        """),
      joinType = "inner")
    val query = joinResult.writeStream
      .outputMode("append")
      .trigger(Trigger.ProcessingTime(0))
      .format("console")
      .start()

    query.awaitTermination()
  }
}

A pair of rows joins only when ts1 <= ts2 <= ts1 + 1 minute and name1 = name2. If joinType is not specified it defaults to an inner join; left_outer and right_outer are also available.
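The effect of the time-range condition can be checked in plain Scala by modelling each row as a name plus an epoch-second timestamp (a standalone sketch of the predicate, not Spark code):

```scala
// The join predicate from the query above, on plain epoch seconds:
// name1 = name2 AND ts2 >= ts1 AND ts2 <= ts1 + 1 minute (60 s).
def matches(name1: String, ts1: Long, name2: String, ts2: Long): Boolean =
  name1 == name2 && ts2 >= ts1 && ts2 <= ts1 + 60

// Alice's age event 30 s after her sex event is inside the window: joined.
val inside = matches("Alice", 100L, "Alice", 130L)   // true
// Two minutes later it falls outside the window: no join.
val outside = matches("Alice", 100L, "Alice", 220L)  // false
```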

Summary

Static join static: all join types are supported.
Stream join static: inner join and left outer join are supported.
Static join stream: inner join and right outer join are supported.

In a stream-stream left outer join, a left-side row that finds no match on the right is emitted with nulls for the right-side columns. For those results to be meaningful, the right-side stream must declare a threshold with withWatermark; output may be delayed until the watermark passes, but the null rows it produces are then trustworthy. A right outer join behaves symmetrically.
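The null-filling behaviour can be imitated with plain Scala collections (an illustration only; a real stream-stream outer join additionally waits for the watermark before emitting the unmatched rows):

```scala
// Left side: (name, sex) rows; right side: name -> age values that have arrived so far.
val leftRows  = Seq(("Alice", "Female"), ("Bob", "Male"))
val rightRows = Map("Alice" -> 20)

// Left outer join: every left row is kept; a missing right value becomes None (null in SQL).
val leftOuter = leftRows.map { case (name, sex) => (name, sex, rightRows.get(name)) }
// leftOuter == Seq(("Alice", "Female", Some(20)), ("Bob", "Male", None))
```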
