Structured Streaming usage notes, part 3: output modes (append, update, complete)

jin6872115

Article link: https://blog.csdn.net/jin6872115/article/details/119139203

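Routine usage is skipped here. These notes focus on using the different output modes for sorting and for updating data, as a supplement to and refinement of part 2.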

1. Sorting. In complete mode the stream is treated as a static table that keeps having rows appended to it, so order by can be used to sort the result.

// Assumes an existing SparkSession named spark. MyUtils.getSqlTimes and directory1
// come from the author's project and are not shown in this post.
import spark.implicits._

spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "*:9092")
  .option("subscribe", "test")
  .option("startingOffsets", "latest")
  .load()
  .selectExpr("CAST(value AS STRING)")
  .as[String]
  // Four fields are read below, so require at least four (the original check was > 2).
  .filter(line => line.split(",").length > 3)
  .map { line =>
    val mes = line.split(",")
    val provincial_centre_id = mes(0).toInt
    val terminal_id = mes(1).toLong
    val timestamp = mes(2).toLong
    val value = mes(3).toInt
    // Converts the raw timestamp into the event-time value used by the watermark.
    val time = MyUtils.getSqlTimes(timestamp)
    (provincial_centre_id, terminal_id, timestamp, value, time)
  }
  .toDF("provincial_centre_id", "terminal_id", "timestamp", "value", "time")
  .withWatermark("time", "10 seconds")
  // createOrReplaceTempView returns Unit, so the result is not assigned to a val.
  .createOrReplaceTempView("t_odscb_sales")
println("=====================>2")

// Alternative query kept from the original: one row per terminal using max().
//    val sql_sale = "select terminal_id,max(provincial_centre_id) as provincial_centre_id," +
//      " max(timestamp) as timestamp,1 as value,max(time) as time ,count(1)" +
//      "  from t_odscb_sales  " +
//      " group by terminal_id "
val sql_sale =
  "select terminal_id, provincial_centre_id, timestamp, time, count(1)" +
    " from t_odscb_sales" +
    " group by terminal_id, provincial_centre_id, timestamp, time" +
    " order by timestamp"

// Complete mode rewrites the whole result table on every trigger, which is why order by is allowed.
val query1 = spark.sql(sql_sale)
  .writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", directory1)
  .start()
println("=====================>3")
spark.streams.awaitAnyTermination()

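The sorted output works as expected.

The official documentation says withWatermark can be combined with aggregation to expire old data. In my experiments this did not work in complete mode: data never expired, and the result table keeps growing with the input (.withWatermark("time","10 seconds") had no effect). This matches the documented behavior that complete mode must preserve all aggregate state, so a watermark cannot drop it there. How best to combine the two is left for further study.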

2. Update mode, used with a max aggregation. Sorting with order by is not supported in this mode. For count statistics, .withWatermark("time","10 seconds") had no effect.
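For the update-mode case above, one likely reason the watermark had no effect is that Spark only cleans aggregation state when the grouping key contains the event-time column or a window over it. The following is a minimal sketch under that assumption, not the author's code: the built-in rate source stands in for Kafka, and the object name, the derived terminal_id, and the 10-second window are illustrative choices.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, lit, max, window}

object UpdateModeWindowedMax {
  // Groups by a 10-second event-time window plus terminal_id, so state for
  // windows older than the watermark becomes eligible for cleanup.
  def aggregate(events: DataFrame): DataFrame =
    events
      .withWatermark("time", "10 seconds")
      .groupBy(window(col("time"), "10 seconds"), col("terminal_id"))
      .agg(max("value").as("max_value"), count(lit(1)).as("cnt"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("update-mode-sketch")
      .getOrCreate()

    // The rate source emits (timestamp, value); derive an assumed terminal_id from value.
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()
      .select(
        (col("value") % 3).as("terminal_id"),
        col("timestamp").as("time"),
        col("value"))

    // Update mode prints only the rows whose aggregates changed in each trigger.
    val query = aggregate(events).writeStream
      .outputMode("update")
      .format("console")
      .option("truncate", "false")
      .start()

    query.awaitTermination(30000) // run for about 30 seconds, then shut down
    query.stop()
    spark.stop()
  }
}
```

Because the aggregation is a plain function over a DataFrame, it can also be checked against a small batch DataFrame, where withWatermark is simply ignored.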

Deduplication works: .dropDuplicates("timestamp")

Deduplication plus a time bound: deduplication still works, but withWatermark("time","10 seconds") had no effect.
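A possible explanation for the deduplication case: streaming deduplication only expires its state when the watermarked event-time column is part of the deduplication key, and dropDuplicates("timestamp") does not include time. A minimal sketch under that assumption, with the rate source standing in for Kafka and eventId as a hypothetical key column:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DedupWithWatermark {
  // Deduplicates on (eventId, time). Including the watermarked column in the
  // key lets Spark drop state for keys older than the watermark.
  def dedup(events: DataFrame): DataFrame =
    events
      .withWatermark("time", "10 seconds")
      .dropDuplicates("eventId", "time")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("dedup-sketch")
      .getOrCreate()

    // The rate source emits (timestamp, value); rename them to the assumed columns.
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()
      .select(col("value").as("eventId"), col("timestamp").as("time"))

    val query = dedup(events).writeStream
      .outputMode("append")
      .format("console")
      .start()

    query.awaitTermination(20000) // run for about 20 seconds, then shut down
    query.stop()
    spark.stop()
  }
}
```

The trade-off is that two records count as duplicates only when both the key and the event time match, so this fits sources that resend identical records rather than ones that restamp them.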

Deduplication combined with a time filter in the SQL statement, so that only recent rows are processed:

.dropDuplicates("timestamp")
" where   timestamp >= unix_timestamp()-60 " +

3. Append mode

With .withWatermark("time","10 seconds"), results are displayed one batch late.

Aggregate functions can only be used in this mode together with .withWatermark("time","10 seconds").
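The append-mode notes above can be sketched in the same temp-view-plus-SQL style as section 1. In append mode a window's row is emitted only once the watermark has passed the end of that window, which may be why the output appears to lag. This is an illustrative sketch, not the author's job: the rate source replaces Kafka, and the view name t_events and the derived terminal_id are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object AppendModeWindowedCount {
  // Registers the watermarked input as a temp view and counts rows per
  // 10-second event-time window with SQL.
  def countPerWindow(spark: SparkSession, events: DataFrame): DataFrame = {
    events
      .withWatermark("time", "10 seconds")
      .createOrReplaceTempView("t_events") // hypothetical view name
    spark.sql(
      "select window(time, '10 seconds') as w, terminal_id, count(1) as cnt" +
        " from t_events" +
        " group by window(time, '10 seconds'), terminal_id")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("append-mode-sketch")
      .getOrCreate()

    // The rate source emits (timestamp, value); derive an assumed terminal_id from value.
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()
      .select((col("value") % 3).as("terminal_id"), col("timestamp").as("time"))

    // Append mode writes each window once, after it has been finalized.
    val query = countPerWindow(spark, events).writeStream
      .outputMode("append")
      .format("console")
      .option("truncate", "false")
      .start()

    query.awaitTermination(40000) // run for about 40 seconds, then shut down
    query.stop()
    spark.stop()
  }
}
```

Dropping the withWatermark call from this sketch should make Spark reject the query at start, since append mode does not accept a streaming aggregation without a watermark.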
