需求:
存在一张离线规则表A,A 中存在两列 gid(规则id),gsql(规则sql) ,一份实时数据,搜索出实时数据中符合规则的数据
A 中数据格式如下:
gid | gsql |
202 | column0=0.0 and column6>0 |
201 | column0=0.0 and column7>10.0 |
实时数据格式如下
timestamp | tagvalue |
2021-03-18 20:11:53 | 0.0,0.0,0.0,0.0,1529.0,8472.0,0.932982,56.0,-0.35992295,0.0,0.0 |
现在需要查出实时数据中符合规则的数据:
val session: SparkSession = SparkSession.builder()
.master("local[*]")
.appName("testAnalysis")
.config("hive.metastore.uris", "thrift://172.17.3.205:9083")
.config("spark.sql.warehouse.dir", "hdfs://172.17.3.205:8020/apps/hive/warehouse")
.config("spark.debug.maxToStringFields","200")
.enableHiveSupport().getOrCreate()
val source: DataFrame = session
.readStream
.format("org.apache.spark.sql.execution.streaming.ConsumerSocketSourceProvider")
.option("host",host)
.option("port","5002")
//.option("sendCode",createSendCode(datatype,start.toInt,length.toInt))
.option("sendCode","3E2A07D10000000B")
.option("dataType",datatype)
.option("proppath",path)
.option("includeTimestamp",true)
.load()
source.isStreaming
source.printSchema()
val frame1: DataFrame = session.sql("select * from xiaojihan.f_gz_test")
frame1.cache()
import session.implicits._
val tuples: Array[(String, String)] = frame1.map(row => {
val string: String = row.get(8).toString
(row.get(1).toString, s"select * from tmp_aa where $string")
}).collect()
/**
* 把 一个由,分隔的一列数据变成多列
*/
val frame2: DataFrame = source.dropDuplicates()
.withColumn("columns", split(col("tagvalue"), ","))
.select(col("timestamp") +: Range(0, length, 1).map(
i => {
col("columns").getItem(i).alias(s"column$i")
}
): _*)
/**
* 当对同一个流数据存在多个分支操作的时候使用临时表
*/
frame2.createOrReplaceTempView("tmp_aa")
/**
* 当存在多个流时应该先启动起来,如果之间在start 之后跟.awaitTermination()会阻塞第二个分支Query的执行
* 所以 应当先全部启动之后使用 session.streams.awaitAnyTermination()
*/
for(i<-0 to tuples.length-1){
session.sql(tuples(i)._2)
.withColumn("id", lit(tuples(i)._1))
.select(to_json(struct("*")).alias("value"))
.writeStream.format("kafka")
.option("kafka.bootstrap.servers", boostrap)
.option("checkpointLocation", checkpointdir+s"$i")
.option("topic", topic).start()
}
session.streams.awaitAnyTermination()
}
错误说明:
1.如果写成:
session.sql(tuples(i)._2)
.withColumn("id", lit(tuples(i)._1))
.select(to_json(struct("*")).alias("value"))
.writeStream.format("kafka")
.option("kafka.bootstrap.servers", boostrap)
.option("checkpointLocation", checkpointdir+s"$i")
.option("topic", topic).start().awaitTermination()
将永远只能看见一个规则id的数据,多个流数据进行写入时,应该先启动所用流数据,再进行等待阻塞
session.streams.awaitAnyTermination()
2.当多个流数据写入一个checkpoint 目录时会出现如下错误(注,checkpoint 目录会自己创建,所以在开启循环写入多个流时并不需要自己创建)
org.apache.spark.SparkException: Writing job aborted.
at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:112)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722)
at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2722)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:480)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:475)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:474)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
Caused by: org.apache.spark.SparkException: Job 1 cancelled because SparkContext was shut down
21/03/18 20:56:47 INFO HDFSBackedStateStoreProvider: Committed version 1 for HDFSStateStore[id=(op=0,part=4),dir=hdfs://172.17.3.205:8020/tmp/checkpoint/state/0/4] to file hdfs://172.17.3.205:8020/tmp/checkpoint/state/0/4/1.delta
21/03/18 20:56:47 INFO DataWritingSparkTask: Writer for partition 4 is committing.
21/03/18 20:56:47 INFO DataWritingSparkTask: Writer for partition 4 committed.
21/03/18 20:56:47 INFO Executor: Finished task 4.0 in stage 2.0 (TID 12). 2165 bytes result sent to driver
21/03/18 20:56:47 ERROR TaskSchedulerImpl: Exception in statusUpdate
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.scheduler.TaskResultGetter$$anon$3@24be81f rejected from java.util.concurrent.ThreadPoolExecutor@2a26287b[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 12]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:413)
at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:394)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:67)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
3. 如果把流表写入到内存表之后命名为xiaojihan.tt,直接对frame1 进行循环,使用session 对 流表进行查询,会出现根本没有结果,甚至报错。
第一,session 不可序列化,session 是在driver 端进行创建的,然后driver 端会根据算子之间的宽窄依赖去划分stage,然后stage 会根据partition 划分出taskset(包含partition 个task),task 是被发送到exector 中进行运行的
见这个:https://bbs.csdn.net/topics/392176755
第二:当job 被写入内存表之后,内存表只能show,想要把结果保存到外部文件不好用(没有找到原因)
frame1.foreachPartition(irow => {
irow.foreach(row => {
val string: String = row.get(8).toString
session.sql(s"select * from xiaojihan.tt where $string").show()
}
)
})
参考博客:
https://www.jianshu.com/p/91d4a61e1461