Spark Structured Streaming: writing multiple streams to Kafka

Requirement:

There is an offline rule table A with two columns: gid (the rule id) and gsql (the rule SQL). Given a stream of real-time data, find the rows in the stream that match each rule.

The data in A looks like this:

gid | gsql
----|------------------------------
202 | column0=0.0 and column6>0
201 | column0=0.0 and column7>10.0

The real-time data looks like this:

timestamp           | tagvalue
--------------------|-----------------------------------------------------------------
2021-03-18 20:11:53 | 0.0,0.0,0.0,0.0,1529.0,8472.0,0.932982,56.0,-0.35992295,0.0,0.0
Now we need to find the rows in the real-time data that match each rule:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, split, struct, to_json}

val session: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("testAnalysis")
      .config("hive.metastore.uris", "thrift://172.17.3.205:9083")
      .config("spark.sql.warehouse.dir", "hdfs://172.17.3.205:8020/apps/hive/warehouse")
      .config("spark.debug.maxToStringFields","200")
      .enableHiveSupport().getOrCreate()

    val source: DataFrame = session
      .readStream
      .format("org.apache.spark.sql.execution.streaming.ConsumerSocketSourceProvider")
      .option("host",host)
      .option("port","5002")
      //.option("sendCode",createSendCode(datatype,start.toInt,length.toInt))
      .option("sendCode","3E2A07D10000000B")
      .option("dataType",datatype)
      .option("proppath",path)
      .option("includeTimestamp",true)
      .load()

    // sanity checks: confirm this is a streaming DataFrame and print its schema
    println(source.isStreaming)
    source.printSchema()

    // load the offline rule table from Hive and cache it (it is small and reused below)
    val frame1: DataFrame = session.sql("select * from xiaojihan.f_gz_test")
    frame1.cache()

    import session.implicits._
    // collect the rules to the driver as (gid, generated rule SQL) pairs
    val tuples: Array[(String, String)] = frame1.map(row => {
      val string: String = row.get(8).toString
      (row.get(1).toString, s"select * from tmp_aa where $string")
    }).collect()

    /**
     * Split the single comma-separated tagvalue column into one column per value
     */
    val frame2: DataFrame = source.dropDuplicates()
      .withColumn("columns", split(col("tagvalue"), ","))
      .select(col("timestamp") +: Range(0, length, 1).map(
        i => {
          col("columns").getItem(i).alias(s"column$i")
        }
      ): _*)

    /**
     * When the same stream feeds several downstream queries, register it as a temp view
     */
     frame2.createOrReplaceTempView("tmp_aa")

    /**
     * With multiple streaming queries, start them all first. Chaining .awaitTermination()
     * directly after start() blocks the loop, so the second query never starts.
     * Start everything, then wait once with session.streams.awaitAnyTermination().
     */
    for (i <- tuples.indices) {
      session.sql(tuples(i)._2)
        .withColumn("id", lit(tuples(i)._1))
        .select(to_json(struct("*")).alias("value"))
        .writeStream.format("kafka")
        .option("kafka.bootstrap.servers", boostrap)
        .option("checkpointLocation", checkpointdir + s"$i")
        .option("topic", topic)
        .start()
    }

    session.streams.awaitAnyTermination()
    
  }
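
As an optional sanity check (not part of the streaming job itself), the topic can be read back once as a batch DataFrame; every record's value should be one matching row serialized by to_json(struct("*")) plus the id column. boostrap and topic are the same variables as above.

    // Quick verification: batch read of the Kafka topic and print the JSON payloads
    val check = session.read
      .format("kafka")
      .option("kafka.bootstrap.servers", boostrap)
      .option("subscribe", topic)
      .option("startingOffsets", "earliest")
      .load()
      .selectExpr("CAST(value AS STRING) AS value")   // the JSON written by the queries above

    check.show(false)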

Notes on the errors:

1. If the loop body is written as:

session.sql(tuples(i)._2)
  .withColumn("id", lit(tuples(i)._1))
  .select(to_json(struct("*")).alias("value"))
  .writeStream.format("kafka")
  .option("kafka.bootstrap.servers", boostrap)
  .option("checkpointLocation", checkpointdir + s"$i")
  .option("topic", topic)
  .start().awaitTermination()

then only data for a single rule id will ever show up. When writing multiple streams, start all of the queries first, and only then block with:

session.streams.awaitAnyTermination()
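
For clarity, here is the working pattern above condensed into one sketch (same variables as before): start() returns a StreamingQuery handle immediately, so all queries are started first and the driver blocks only once at the end.

    import org.apache.spark.sql.streaming.StreamingQuery

    // Condensed restatement of the loop in the main code: start() does not block.
    val queries: Seq[StreamingQuery] = tuples.indices.map { i =>
      session.sql(tuples(i)._2)
        .withColumn("id", lit(tuples(i)._1))
        .select(to_json(struct("*")).alias("value"))
        .writeStream.format("kafka")
        .option("kafka.bootstrap.servers", boostrap)
        .option("checkpointLocation", checkpointdir + s"$i")
        .option("topic", topic)
        .start()                                       // returns immediately
    }
    println(s"active queries: ${session.streams.active.length}")  // should equal tuples.length
    session.streams.awaitAnyTermination()              // single blocking call for all of them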

2. When several streaming queries write to the same checkpoint directory, the following error appears (note: Spark creates the checkpoint directory itself, so there is no need to create one manually when starting multiple queries in a loop):

org.apache.spark.SparkException: Writing job aborted.
    at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:112)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
    at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
    at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722)
    at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722)
    at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
    at org.apache.spark.sql.Dataset.collect(Dataset.scala:2722)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:480)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:475)
    at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
    at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:474)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
    at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
    at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
    at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
    at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
    at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
Caused by: org.apache.spark.SparkException: Job 1 cancelled because SparkContext was shut down

21/03/18 20:56:47 INFO HDFSBackedStateStoreProvider: Committed version 1 for HDFSStateStore[id=(op=0,part=4),dir=hdfs://172.17.3.205:8020/tmp/checkpoint/state/0/4] to file hdfs://172.17.3.205:8020/tmp/checkpoint/state/0/4/1.delta
21/03/18 20:56:47 INFO DataWritingSparkTask: Writer for partition 4 is committing.
21/03/18 20:56:47 INFO DataWritingSparkTask: Writer for partition 4 committed.
21/03/18 20:56:47 INFO Executor: Finished task 4.0 in stage 2.0 (TID 12). 2165 bytes result sent to driver
21/03/18 20:56:47 ERROR TaskSchedulerImpl: Exception in statusUpdate
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.scheduler.TaskResultGetter$$anon$3@24be81f rejected from java.util.concurrent.ThreadPoolExecutor@2a26287b[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 12]
    at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
    at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
    at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:413)
    at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:394)
    at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:67)
    at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
    at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
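
A minimal, self-contained illustration of the fix, using the built-in rate source and console sink (the demo-q1/demo-q2 suffixes are arbitrary names): the two queries can only run side by side because each has its own checkpointLocation, exactly like checkpointdir + s"$i" in the loop above.

    // Two queries over the same stream; each needs its own checkpoint directory,
    // otherwise one of them aborts with "Writing job aborted." as shown above.
    val rate = session.readStream.format("rate").option("rowsPerSecond", "1").load()

    rate.writeStream.format("console")
      .option("checkpointLocation", checkpointdir + "demo-q1")
      .start()
    rate.writeStream.format("console")
      .option("checkpointLocation", checkpointdir + "demo-q2")
      .start()

    session.streams.awaitAnyTermination()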

3. If the stream is written to an in-memory table named xiaojihan.tt and you then loop over frame1 directly, using session to query that streaming table, you get no results at all, or even errors.

First, session is not serializable: the SparkSession is created on the driver. The driver splits the job into stages according to the wide and narrow dependencies between operators, each stage is split into a task set by partition (one task per partition), and those tasks are sent to the executors to run, so the session cannot be referenced from code that executes on the executors.

See: https://bbs.csdn.net/topics/392176755
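
To make the first point concrete, this is the driver-side pattern the main code relies on, shown here condensed: collect the small rule table to the driver and build the per-rule queries there, so the SparkSession never has to leave the driver.

    import session.implicits._

    // Runs on the driver: materialize the (gid, where-clause) pairs locally.
    val rules: Array[(String, String)] = frame1
      .map(row => (row.get(1).toString, row.get(8).toString))
      .collect()

    // Still on the driver: session.sql only builds a query plan here; the plan
    // (not the session) is what later executes on the executors.
    rules.foreach { case (gid, condition) =>
      val matched = session.sql(s"select * from tmp_aa where $condition")
        .withColumn("id", lit(gid))
      // ...then matched.writeStream ... .start(), exactly as in the main loop above
    }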

Second, once the job writes to an in-memory table, that table can only be show()-ed; trying to save the result to an external file does not work (I have not found the reason why). For example, the following does not behave as expected:

    // This does NOT work: `session` lives on the driver and cannot be used inside
    // foreachPartition, whose body runs on the executors.
    frame1.foreachPartition(irow => {
      irow.foreach(row => {
        val string: String = row.get(8).toString
        session.sql(s"select * from xiaojihan.tt where $string").show()
      })
    })
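
If you are on Spark 2.4 or later, foreachBatch is a hedged alternative worth trying: it hands every micro-batch to the driver as an ordinary DataFrame, on which batch SQL and a batch Kafka write are allowed. The sketch below assumes a hypothetical ruleConditions array of (gid, raw where-clause) pairs, collected like tuples above but without the surrounding select; the other variables are the same as in the main code.

    import org.apache.spark.sql.functions.expr

    // One function per micro-batch; it runs on the driver and `batch` is an ordinary DataFrame.
    val writeRules: (DataFrame, Long) => Unit = (batch, batchId) => {
      ruleConditions.foreach { case (gid, condition) =>
        batch.where(expr(condition))
          .withColumn("id", lit(gid))
          .select(to_json(struct("*")).alias("value"))
          .write.format("kafka")                        // batch Kafka write is legal here
          .option("kafka.bootstrap.servers", boostrap)
          .option("topic", topic)
          .save()
      }
    }

    frame2.writeStream
      .foreachBatch(writeRules)
      .option("checkpointLocation", checkpointdir + "foreachBatch")
      .start()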

References:

https://www.jianshu.com/p/91d4a61e1461

https://stackoverflow.com/questions/62155006/spark-writestream-into-kafka-difference-between-awaittermination-vs-awaitany