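I. StreamingQueryManager Creates and Starts the Stream
This continues from the previous article, which covered creating the stream's Source and Sink.
After the Sink is created, sessionState.streamingQueryManager.startQuery() is called to create and start the stream.
The corresponding StreamingQueryManager startup flow is:
Main code of startQuery() and createQuery():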
class StreamingQueryManager private[sql] (sparkSession: SparkSession) extends Logging {

  private[sql] def startQuery(
      userSpecifiedName: Option[String],
      userSpecifiedCheckpointLocation: Option[String],
      df: DataFrame,
      extraOptions: Map[String, String],
      sink: BaseStreamingSink,
      outputMode: OutputMode,
      useTempCheckpointLocation: Boolean = false,
      recoverFromCheckpointLocation: Boolean = true,
      trigger: Trigger = ProcessingTime(0),
      triggerClock: Clock = new SystemClock()): StreamingQuery = {
    val query = createQuery(
      userSpecifiedName,
      userSpecifiedCheckpointLocation,
      df,
      extraOptions,
      sink,
      outputMode,
      useTempCheckpointLocation,
      recoverFromCheckpointLocation,
      trigger,
      triggerClock)

    // Register the query under the lock before starting it.
    activeQueriesLock.synchronized {
      activeQueries.put(query.id, query)
    }
    try {
      query.streamingQuery.start()
    } catch {
      case e: Throwable =>
        // Startup failed: remove the query from the active set again.
        activeQueriesLock.synchronized {
          activeQueries -= query.id
        }
        throw e
    }
    query
  }

  private def createQuery(
      userSpecifiedName: Option[String],
      userSpecifiedCheckpointLocation: Option[String],
      df: DataFrame,
      extraOptions: Map[String, String],
      sink: BaseStreamingSink,
      outputMode: OutputMode,
      useTempCheckpointLocation: Boolean,
      recoverFromCheckpointLocation: Boolean,
      trigger: Trigger,
      triggerClock: Clock): StreamingQueryWrapper = {
    // Optional class name of a StreamExecution implementation, read from the query options.
    var streamExecutionCls = extraOptions.getOrElse("stream.execution.class", "")
    var deleteCheckpointOnStop = false
    val checkpointLocation = userSpecifiedCheckpointLocation.map { userSpecified =>
      new Path(userSpecified).toUri.toString
    }.orElse {
      xxxx
    }

    val analyzedPlan = df.queryExecution.analyzed
    df.queryExecution.assertAnalyzed()

    if (sparkSession.sessionState.conf.isUnsupportedOperationCheckEnabled) {
      UnsupportedOperationChecker.checkForStreaming(analyzedPlan, outputMode)
    }

    if (sparkSession.sessionState.conf.adaptiveExecutionEnabled) {
      logWarning(s"${SQLConf.ADAPTIVE_EXECUTION_ENABLED.key} " +
        "is not supported in streaming DataFrames/Datasets and will be disabled.")
    }

    var streamingQueryWrapper: StreamingQueryWrapper = null
    if (streamExecutionCls.length > 0) {
      // A custom execution class was configured: instantiate it reflectively.
      val cls = Utils.classForName(streamExecutionCls)
      val constructor = cls.getConstructor(
        classOf[SparkSession],
        classOf[String],
        classOf[String],
        classOf[LogicalPlan],
        classOf[BaseStreamingSink],
        classOf[Trigger],
        classOf[Clock],
        classOf[OutputMode],
        classOf[Map[String, String]],
        classOf[Boolean])
      val streamExecution = constructor.newInstance(
        sparkSession,
        userSpecifiedName.orNull,
        checkpointLocation,
        analyzedPlan,
        sink,
        trigger,
        triggerClock,
        outputMode,
        extraOptions,
        java.lang.Boolean.valueOf(deleteCheckpointOnStop)).asInstanceOf[StreamExecution]
      streamingQueryWrapper = new StreamingQueryWrapper(streamExecution)
    } else {
      // Otherwise pick the engine from the sink and trigger types.
      streamingQueryWrapper = (sink, trigger) match {
        case (v2Sink: StreamWriteSupport, trigger: ContinuousTrigger) =>
          UnsupportedOperationChecker.checkForContinuous(analyzedPlan, outputMode)
          new StreamingQueryWrapper(new ContinuousExecution(
            sparkSession,
            userSpecifiedName.orNull,
            checkpointLocation,
            analyzedPlan,
            v2Sink,
            trigger,
            triggerClock,
            outputMode,
            extraOptions,
            deleteCheckpointOnStop))
        case _ =>
          new StreamingQueryWrapper(new MicroBatchExecution(
            sparkSession,
            userSpecifiedName.orNull,
            checkpointLocation,
            analyzedPlan,
            sink,
            trigger,
            triggerClock,
            outputMode,
            extraOptions,
            deleteCheckpointOnStop))
      }
    }
    streamingQueryWrapper
  }
}
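The bookkeeping in startQuery() follows a register, start, roll-back-on-failure pattern. The following is a minimal sketch of that pattern only, not Spark code: SketchQuery, SketchQueryRegistry, failOnStart and isActive are hypothetical names introduced for illustration, and the duplicate-id check is an assumption of the sketch rather than something shown in the excerpt above.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a streaming query; not a Spark class.
final class SketchQuery(val id: String, failOnStart: Boolean) {
  @volatile var started = false
  def start(): Unit = {
    if (failOnStart) throw new IllegalStateException(s"query $id failed to start")
    started = true
  }
}

// Hypothetical registry that mirrors the activeQueries bookkeeping.
final class SketchQueryRegistry {
  private val lock = new Object
  private val active = mutable.HashMap.empty[String, SketchQuery]

  def startQuery(query: SketchQuery): SketchQuery = {
    // Register under the lock so concurrent callers see a consistent map.
    lock.synchronized {
      if (active.contains(query.id)) {
        throw new IllegalStateException(s"query ${query.id} is already active")
      }
      active.put(query.id, query)
    }
    try {
      // Start outside the lock, because starting may block for a while.
      query.start()
    } catch {
      case e: Throwable =>
        // Roll back the registration so a failed query is not listed as active.
        lock.synchronized { active -= query.id }
        throw e
    }
    query
  }

  def isActive(id: String): Boolean = lock.synchronized { active.contains(id) }
}
```

With this sketch, a query whose start() throws is removed from the registry again, so isActive returns false for it, while a successfully started query stays registered.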
II. Initialization of StreamExecution
1. StreamExecution Source Code Analysis
The last step of StreamingQueryManager.startQuery(), query.streamingQuery.start(), is what actually starts the stream processing thread of StreamExecution:
abstract class StreamExecution(xxxxx)
  extends StreamingQuery with ProgressReporter with Logging {

  def start(): Unit = {
    logInfo(s"Starting $prettyIdString. Use $resolvedCheckpointRoot to store the query checkpoint.")
    queryExecutionThread.setDaemon(true)
    queryExecutionThread.start()
    startLatch.await() // Wait until thread started and QueryStart event has been posted
  }
}
What actually runs is the run() method of queryExecutionThread:
val queryExecutionThread: QueryExecutionThread =
  new QueryExecutionThread(s"stream execution thread for $prettyIdString") {
    override def run(): Unit = {
      // To fix call site like "run at <unknown>:0", we bridge the call site from the caller
      // thread to this micro batch thread
      sparkSession.sparkContext.setCallSite(callSite)
      runStream()
    }
  }
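The start() method above blocks on startLatch until the execution thread has announced that it is running. The following is a minimal sketch of that handshake using only the JDK; SketchExecution, events and recordedEvents are hypothetical names, and recording the string "started" merely stands in for posting the real start event.

```scala
import java.util.concurrent.CountDownLatch
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for the start()/queryExecutionThread pair; not Spark code.
final class SketchExecution(name: String) {
  private val startLatch = new CountDownLatch(1)
  // Events recorded by the worker thread; access is guarded by `events` itself.
  private val events = ArrayBuffer.empty[String]

  private val worker = new Thread(s"stream execution thread for $name") {
    override def run(): Unit = {
      try {
        // Stands in for posting the start event.
        events.synchronized { events += "started" }
      } finally {
        // Release the caller even if the step above failed, so start() cannot hang.
        startLatch.countDown()
      }
      // The long-running processing loop would follow here.
    }
  }

  def start(): Unit = {
    worker.setDaemon(true) // a daemon thread does not keep the JVM alive
    worker.start()
    startLatch.await()     // block until the worker has announced that it started
  }

  def recordedEvents: List[String] = events.synchronized { events.toList }
}
```

Because the latch is counted down only after the event has been recorded, a caller that returns from start() can rely on the start event already being visible.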
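runStream() consists of environment initialization wrapped in a try {…} catch {…} structure that handles exceptions raised during startup and execution. Its core method is runActivatedStream(sparkSessionForStream), which the two subclasses MicroBatchExecution (micro-batch processing) and ContinuousExecution (continuous processing) each implement in their own way.
runStream() flow:
· Create the metrics
· postEvent(new QueryStartedEvent(id, runId, name)) posts the start event to the listener bus
· Set the other conf variables
· runActivatedStream(sparkSessionForStream) runs the runActivatedStream() of MicroBatchExecution or ContinuousExecution, which keeps querying the stream
· A try {…} catch {…} block captures exceptions raised during startup and execution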
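The flow above is essentially a template method: the base class fixes the order of the steps and the error handling, and each subclass supplies only the processing loop. The following is a simplified sketch of that shape, not the Spark implementation: all class names and log strings are hypothetical, and the final "cleanup" step is an assumption of the sketch.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical base class; the method names mirror the article, the body does not.
abstract class SketchStreamExecution {
  val log = ArrayBuffer.empty[String]
  @volatile var failure: Option[Throwable] = None

  // Each engine supplies its own processing loop.
  protected def runActivatedStream(): Unit

  final def runStream(): Unit = {
    try {
      log += "init metrics"       // create the metrics
      log += "post QueryStarted"  // announce the start on the listener bus
      log += "configure session"  // set the other conf values
      runActivatedStream()        // run the engine-specific loop
    } catch {
      case e: Throwable =>
        // Record startup and runtime failures instead of letting them escape.
        failure = Some(e)
        log += s"failed: ${e.getMessage}"
    } finally {
      log += "cleanup"            // assumed cleanup step, always executed
    }
  }
}

// Hypothetical micro-batch engine that processes a fixed number of batches.
final class SketchMicroBatch(batches: Int) extends SketchStreamExecution {
  protected def runActivatedStream(): Unit =
    (1 to batches).foreach(i => log += s"batch $i")
}

// Hypothetical engine whose loop fails immediately.
final class SketchFailing extends SketchStreamExecution {
  protected def runActivatedStream(): Unit =
    throw new RuntimeException("source unavailable")
}
```

Running new SketchMicroBatch(2).runStream() records the three setup steps, two batches, and the cleanup step in order, while SketchFailing stores its exception in failure and still reaches the cleanup step.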