2021SC@SDUSC
Preface
The previous post analyzed how Spark Streaming receives data. Since Receiver has many subclasses, it is impossible to dissect every one of them, so this post outlines how they work in general.
Workflow of the Receiver class
The previous post showed that the SocketReceiver class must define an onStart method, which launches a background thread. However, onStart is not invoked on the Driver side: the Driver only creates the Receiver object, and that object does not do any work on the Driver. Receiver is a serializable class, and its instances are deserialized and brought to life on the Executors. Digging into the code shows that a Receiver only starts working once ReceiverSupervisor.startReceiver, after a successful ReceiverSupervisorImpl.onReceiverStart, calls this onStart method.
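To make this division of labor concrete, here is a minimal sketch of a custom Receiver in the spirit of SocketReceiver (my own simplified example, not Spark's actual class; the name SketchSocketReceiver is hypothetical): onStart only spawns a background thread and must return quickly, while the thread does the blocking reads and hands records to the framework via store.

// A simplified custom Receiver (illustrative sketch, not Spark's SocketReceiver)
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SketchSocketReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  // Invoked on the Executor by the supervisor, never on the Driver;
  // it must not block, so the real work goes to a background thread
  def onStart(): Unit = {
    new Thread("Sketch Receiver Thread") {
      override def run(): Unit = receive()
    }.start()
  }

  // The receiving thread polls isStopped(), so there is nothing to clean up here
  def onStop(): Unit = {}

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        store(line) // hand the record over to the ReceiverSupervisor
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Trying to connect again")
    } catch {
      case t: Throwable => restart("Error receiving data", t)
    }
  }
}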
Returning to ReceiverTracker.launchReceivers(), we find endpoint.send(StartAllReceivers(receivers)): the ReceiverTrackerEndpoint object is sent a StartAllReceivers message whose payload is all of the new Receivers. The handling on receipt is in ReceiverTrackerEndpoint.receive, where ReceiverSchedulingPolicy.scheduleReceivers decides which Executors each Receiver may run on, and startReceiver is then invoked in a loop to launch each Receiver.
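For reference, the StartAllReceivers branch of ReceiverTrackerEndpoint.receive looks roughly as follows (abridged from the Spark source; names and details vary slightly across versions):

// ReceiverTracker.ReceiverTrackerEndpoint.receive (abridged)
override def receive: PartialFunction[Any, Unit] = {
  case StartAllReceivers(receivers) =>
    // Decide on which Executors each Receiver may run
    val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
    for (receiver <- receivers) {
      val executors = scheduledLocations(receiver.streamId)
      updateReceiverScheduledExecutors(receiver.streamId, executors)
      receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
      // Launch each Receiver in turn
      startReceiver(receiver, executors)
    }
  // ... other messages such as RestartReceiver omitted
}

With that context, startReceiver itself is the interesting part: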
// ReceiverTracker.ReceiverTrackerEndpoint.startReceiver
/**
 * Start a receiver along with its scheduled executors
 */
private def startReceiver(
    receiver: Receiver[_],
    // scheduledLocations specifies the physical machines to run on
    scheduledLocations: Seq[TaskLocation]): Unit = {
  // Check whether the tracker's state still allows starting the Receiver
  def shouldStartReceiver: Boolean = {
    // It's okay to start when trackerState is Initialized or Started
    !(isTrackerStopping || isTrackerStopped)
  }

  val receiverId = receiver.streamId
  if (!shouldStartReceiver) {
    // Called when the Receiver does not need to be started
    onReceiverJobFinish(receiverId)
    return
  }

  val checkpointDirOption = Option(ssc.checkpointDir)
  val serializableHadoopConf =
    new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)

  // startReceiverFunc encapsulates the action of starting the Receiver on a worker
  // Function to start the receiver on the worker node
  val startReceiverFunc: Iterator[Receiver[_]] => Unit =
    (iterator: Iterator[Receiver[_]]) => {
      if (!iterator.hasNext) {
        throw new SparkException(
          "Could not start receiver as object not found.")
      }
      if (TaskContext.get().attemptNumber() == 0) {
        val receiver = iterator.next()
        assert(iterator.hasNext == false)
        // ReceiverSupervisorImpl supervises the Receiver and is also
        // responsible for operations such as writing the received data
        val supervisor = new ReceiverSupervisorImpl(
          receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
        supervisor.start()
        supervisor.awaitTermination()
      } else {
        // Restarting a Receiver must go through the scheduling above again,
        // not through task retry
        // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
      }
    }

  // Build an RDD from the Receiver; this RDD is used to recreate and start
  // the Receiver on a worker
  // Create the RDD using the scheduledLocations to run the receiver in a Spark job
  val receiverRDD: RDD[Receiver[_]] =
    if (scheduledLocations.isEmpty) {
      ssc.sc.makeRDD(Seq(receiver), 1)
    } else {
      val preferredLocations = scheduledLocations.map(_.toString).distinct
      ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
    }
  // As receiverId shows, each such RDD carries exactly one Receiver
  receiverRDD.setName(s"Receiver $receiverId")
  ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
  ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))

  // Each Receiver's startup triggers its own job, rather than one job whose
  // tasks start all the Receivers; an application may have several Receivers.
  // SparkContext.submitJob is called here, so starting a Receiver launches a Spark job
  val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
    receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
  // We will keep restarting the receiver job until ReceiverTracker is stopped
  future.onComplete {
    case Success(_) =>
      if (!shouldStartReceiver) {
        onReceiverJobFinish(receiverId)
      } else {
        logInfo(s"Restarting Receiver $receiverId")
        self.send(RestartReceiver(receiver))
      }
    case Failure(e) =>
      if (!shouldStartReceiver) {
        onReceiverJobFinish(receiverId)
      } else {
        logError("Receiver has been stopped. Try to restart it.", e)
        logInfo(s"Restarting Receiver $receiverId")
        self.send(RestartReceiver(receiver))
      }
  // The job is submitted via a thread pool, so Receivers can be started concurrently
  }(submitJobThreadPool)
  logInfo(s"Receiver ${receiver.streamId} started")
}
As the comments indicate, Spark Streaming itself decides which Executors the Receivers run on; the placement is not left to Spark Core's task scheduling.
Although the function startReceiverFunc is defined before submitJob is called, it is shipped via SparkContext.submitJob to the Executor side for execution. The Receiver is wrapped into an RDD that Spark can process, and that RDD is another parameter of SparkContext.submitJob. In other words, Spark Streaming uses SparkContext.submitJob to start a Receiver on an Executor, and a single job starts exactly one Receiver.
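The submitJob pattern is easy to see in isolation. Below is a minimal, self-contained sketch (my own example, not Spark Streaming code; SubmitJobSketch is a hypothetical name): a one-element, one-partition RDD means the submitted job consists of exactly one task, which is precisely how one job starts exactly one Receiver.

// Sketch: one-partition RDD + submitJob = one long-running task on an executor
import scala.concurrent.Await
import scala.concurrent.duration.Duration

import org.apache.spark.{SparkConf, SparkContext}

object SubmitJobSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[2]"))
    // One element, one partition -- mirrors ssc.sc.makeRDD(Seq(receiver), 1)
    val oneTaskRDD = sc.makeRDD(Seq("payload"), 1)
    val future = sc.submitJob[String, Unit, Unit](
      oneTaskRDD,
      (it: Iterator[String]) => it.foreach(p => println(s"task running with $p")),
      Seq(0),       // only partition 0, i.e. exactly one task
      (_, _) => (), // result handler: nothing to collect
      ())           // overall result
    Await.result(future, Duration.Inf)
    sc.stop()
  }
}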
If a Receiver fails to start, ReceiverTrackerEndpoint is triggered to submit a new Spark job that starts the Receiver again.
When the function runs on an Executor, it calls ReceiverSupervisor.start:
// ReceiverSupervisor.start
/** Start the supervisor */
def start() {
  onStart() // the concrete implementation is provided by subclasses
  startReceiver()
}
Next, look at onStart in ReceiverSupervisorImpl, a subclass of ReceiverSupervisor:
// ReceiverSupervisorImpl.onStart
override protected def onStart() {
  registeredBlockGenerators.foreach { _.start() }
}
This actually starts each of the registered BlockGenerators: the _.start() here is BlockGenerator.start, which starts the generator that assembles the received data into blocks. Following the flow chart, we pause this branch of the analysis here.
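Before returning to the main flow, it is worth a glance at BlockGenerator.start itself (abridged from the Spark source; details vary by version): it flips the state and starts the two internal threads, one timer that slices the incoming records into blocks and one thread that pushes finished blocks.

// BlockGenerator.start (abridged)
/** Start block generating and pushing threads. */
def start(): Unit = synchronized {
  if (state == Initialized) {
    state = Active
    blockIntervalTimer.start() // periodically turns buffered records into a block
    blockPushingThread.start() // pushes completed blocks onward for storage
    logInfo("Started BlockGenerator")
  } else {
    throw new SparkException(
      s"Cannot start BlockGenerator as its not in the Initialized state [state = $state]")
  }
}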
ReceiverSupervisor.start
Back in ReceiverSupervisor.start, now look at startReceiver:
// ReceiverSupervisor.startReceiver
/** Start receiver */
def startReceiver(): Unit = synchronized {
  try {
    if (onReceiverStart()) {
      logInfo("Starting receiver")
      receiverState = Started
      receiver.onStart()
      logInfo("Called receiver onStart")
    } else {
      // The driver refused us
      stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
    }
  } catch {
    case NonFatal(t) =>
      stop("Error starting receiver " + streamId, Some(t))
  }
}
In the normal flow, the condition ReceiverSupervisorImpl.onReceiverStart is satisfied, and receiver.onStart is then called. This receiver is the Receiver subclass defined in a ReceiverInputDStream subclass, and which one is used is determined by the InputDStream that the Spark Streaming application associates with its stream data source.
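SocketInputDStream shows this association concretely (abridged from the Spark source): its getReceiver is what hands the ReceiverTracker the SocketReceiver whose onStart we have been following.

// SocketInputDStream (abridged): the InputDStream decides which Receiver is created
private[streaming]
class SocketInputDStream[T: ClassTag](
    _ssc: StreamingContext,
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[T](_ssc) {

  def getReceiver(): Receiver[T] = {
    new SocketReceiver(host, port, bytesToObjects, storageLevel)
  }
}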
This completes the walkthrough of the flow chart from the previous post.