spark structuredstreaming 多流写入kafka

最新推荐文章于 2024-07-20 07:15:00 发布

date-date

最新推荐文章于 2024-07-20 07:15:00 发布

阅读量841

点赞数

分类专栏： Spark

本文链接：https://blog.csdn.net/learner_up/article/details/114989968

版权

Spark 专栏收录该内容

2 篇文章 0 订阅

订阅专栏

需求：

存在一张离线规则表A，A 中存在两列 gid(规则id),gsql（规则sql），一份实时数据，搜索出实时数据中符合规则的数据

A 中数据格式如下：

gid	gsql
202	column0=0.0 and column6>0
201	column0=0.0 and column7>10.0

实时数据格式如下

timestamp	tagvalue
2021-03-18 20:11:53	0.0,0.0,0.0,0.0,1529.0,8472.0,0.932982,56.0,-0.35992295,0.0,0.0

现在需要查出实时数据中符合规则的数据：

val session: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("testAnalysis")
      .config("hive.metastore.uris", "thrift://172.17.3.205:9083")
      .config("spark.sql.warehouse.dir", "hdfs://172.17.3.205:8020/apps/hive/warehouse")
      .config("spark.debug.maxToStringFields","200")
      .enableHiveSupport().getOrCreate()

    val source: DataFrame = session
      .readStream
      .format("org.apache.spark.sql.execution.streaming.ConsumerSocketSourceProvider")
      .option("host",host)
      .option("port","5002")
      //.option("sendCode",createSendCode(datatype,start.toInt,length.toInt))
      .option("sendCode","3E2A07D10000000B")
      .option("dataType",datatype)
      .option("proppath",path)
      .option("includeTimestamp",true)
      .load()

    source.isStreaming
    source.printSchema()

  val frame1: DataFrame = session.sql("select *  from xiaojihan.f_gz_test")
    frame1.cache()

    import session.implicits._
    val tuples: Array[(String, String)] = frame1.map(row => {
      val string: String = row.get(8).toString
      (row.get(1).toString, s"select * from tmp_aa where $string")
    }).collect()

    /**
     * 把 一个由，分隔的一列数据变成多列
     */
    val frame2: DataFrame = source.dropDuplicates()
      .withColumn("columns", split(col("tagvalue"), ","))
      .select(col("timestamp") +: Range(0, length, 1).map(
        i => {
          col("columns").getItem(i).alias(s"column$i")
        }
      ): _*)

    /**
     * 当对同一个流数据存在多个分支操作的时候使用临时表
     */
     frame2.createOrReplaceTempView("tmp_aa")

    /**
     * 当存在多个流时应该先启动起来，如果之间在start 之后跟.awaitTermination()会阻塞第二个分支Query的执行
     * 所以 应当先全部启动之后使用 session.streams.awaitAnyTermination()
     */
    for(i<-0 to tuples.length-1){
        session.sql(tuples(i)._2)
        .withColumn("id", lit(tuples(i)._1))
        .select(to_json(struct("*")).alias("value"))
          .writeStream.format("kafka")
          .option("kafka.bootstrap.servers", boostrap)
          .option("checkpointLocation", checkpointdir+s"$i")
          .option("topic", topic).start()
    }

    session.streams.awaitAnyTermination()
    
  }

错误说明：

1.如果写成：

session.sql(tuples(i)._2)
.withColumn("id", lit(tuples(i)._1))
.select(to_json(struct("*")).alias("value"))
  .writeStream.format("kafka")
  .option("kafka.bootstrap.servers", boostrap)
  .option("checkpointLocation", checkpointdir+s"$i")
  .option("topic", topic).start().awaitTermination()

将永远只能看见一个规则id的数据，多个流数据进行写入时，应该先启动所用流数据，再进行等待阻塞

session.streams.awaitAnyTermination()

2.当多个流数据写入一个checkpoint 目录时会出现如下错误（注，checkpoint 目录会自己创建，所以在开启循环写入多个流时并不需要自己创建）

org.apache.spark.SparkException: Writing job aborted.
   at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:112)
   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
   at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
   at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294)
   at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
   at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722)
   at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722)
   at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
   at org.apache.spark.sql.Dataset.collect(Dataset.scala:2722)
   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:480)
   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:475)
   at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
   at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
   at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:474)
   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
   at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
   at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
   at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
   at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
   at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
   at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
Caused by: org.apache.spark.SparkException: Job 1 cancelled because SparkContext was shut down

21/03/18 20:56:47 INFO HDFSBackedStateStoreProvider: Committed version 1 for HDFSStateStore[id=(op=0,part=4),dir=hdfs://172.17.3.205:8020/tmp/checkpoint/state/0/4] to file hdfs://172.17.3.205:8020/tmp/checkpoint/state/0/4/1.delta
21/03/18 20:56:47 INFO DataWritingSparkTask: Writer for partition 4 is committing.
21/03/18 20:56:47 INFO DataWritingSparkTask: Writer for partition 4 committed.
21/03/18 20:56:47 INFO Executor: Finished task 4.0 in stage 2.0 (TID 12). 2165 bytes result sent to driver
21/03/18 20:56:47 ERROR TaskSchedulerImpl: Exception in statusUpdate
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.scheduler.TaskResultGetter$$anon$3@24be81f rejected from java.util.concurrent.ThreadPoolExecutor@2a26287b[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 12]
   at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
   at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
   at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
   at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
   at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:413)
   at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:394)
   at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:67)
   at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
   at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)

3. 如果把流表写入到内存表之后命名为xiaojihan.tt，直接对frame1 进行循环，使用session 对流表进行查询，会出现根本没有结果，甚至报错。

第一，session 不可序列化，session 是在driver 端进行创建的，然后driver 端会根据算子之间的宽窄依赖去划分stage,然后stage 会根据partition 划分出taskset（包含partition 个task），task 是被发送到exector 中进行运行的

见这个：https://bbs.csdn.net/topics/392176755

第二：当job 被写入内存表之后，内存表只能show，想要把结果保存到外部文件不好用（没有找到原因）

 frame1.foreachPartition(irow => {
      irow.foreach(row => {
          val string: String = row.get(8).toString
        session.sql(s"select * from xiaojihan.tt where $string").show()

      }
      )
    })

参考博客：

https://www.jianshu.com/p/91d4a61e1461

https://stackoverflow.com/questions/62155006/spark-writestream-into-kafka-difference-between-awaittermination-vs-awaitany