Spark Analysis (9): Spark Streaming Execution Flow Explained (4)

2021SC@SDUSC

Foreword

The previous post analyzed data reception in Spark Streaming. Since there are too many Receiver subclasses to dissect one by one, this post outlines how they work in general.

Workflow of the Receiver class

The previous post explained that the SocketReceiver class must define an onStart method, which launches a background thread. However, onStart is not called on the Driver side: the Driver only constructs the Receiver object, and the Receiver does no work there. Receiver is a serializable class, so its objects are deserialized and instantiated on the Executors. Looking further, the Receiver only starts working once ReceiverSupervisorImpl.onReceiverStart succeeds and onStart is then invoked.
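To recall what such an onStart looks like, here is a minimal custom Receiver sketch in the spirit of the standard socket example; the class name MyReceiver and its parameters are invented for illustration. Note that onStart only spawns a background thread and returns immediately, and the received records are handed over through store():

// A minimal custom Receiver sketch (MyReceiver is a hypothetical example)

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MyReceiver(host: String, port: Int)
	extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

	override def onStart(): Unit = {
		// Spawn a background thread and return immediately; onStart must not block
		new Thread("My Receiver") {
			override def run(): Unit = receive()
		}.start()
	}

	override def onStop(): Unit = {
		// The receiving thread checks isStopped() and exits on its own
	}

	private def receive(): Unit = {
		try {
			val socket = new Socket(host, port)
			val reader = new BufferedReader(
				new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
			var line = reader.readLine()
			while (!isStopped() && line != null) {
				store(line)	// hand the record to the ReceiverSupervisor
				line = reader.readLine()
			}
			reader.close()
			socket.close()
			restart("Trying to connect again")
		} catch {
			case t: Throwable => restart("Error receiving data", t)
		}
	}
}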
Going back to ReceiverTracker.launchReceivers(), we find endpoint.send(StartAllReceivers(receivers)): the ReceiverTrackerEndpoint sends a StartAllReceivers message whose parameter is the collection of all new Receivers. The handling of this message is in ReceiverTrackerEndpoint.receive, where ReceiverSchedulingPolicy.scheduleReceivers decides which Executors each receiver may run on, and startReceiver is then called in a loop to launch each Receiver.
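Abridged from the Spark 2.x source, the handling of StartAllReceivers in ReceiverTrackerEndpoint.receive looks roughly as follows (other message cases are omitted):

// ReceiverTracker.ReceiverTrackerEndpoint.receive (abridged sketch)

override def receive: PartialFunction[Any, Unit] = {
	case StartAllReceivers(receivers) =>
		// Decide which Executors each receiver may run on
		val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
		for (receiver <- receivers) {
			val executors = scheduledLocations(receiver.streamId)
			updateReceiverScheduledExecutors(receiver.streamId, executors)
			receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
			// Each receiver is launched through its own Spark job
			startReceiver(receiver, executors)
		}
	// ... RestartReceiver, CleanupOldBlocks and other cases omitted
}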

// ReceiverTracker.ReceiverTrackerEndpoint.startReceiver

/**
 * Start a receiver along with its scheduled executors
 */

private def startReceiver(
	receiver: Receiver[_],
	// scheduledLocations specifies the physical machines on which the receiver runs
	scheduledLocations: Seq[TaskLocation]): Unit = {
	// Check whether the ReceiverTracker's state still allows starting receivers
	def shouldStartReceiver: Boolean = {
		// It's okay to start when trackerState is Initialized or Started
		!(isTrackerStopping || isTrackerStopped)
	}

	val receiverId = receiver.streamId
	if (!shouldStartReceiver) {
		// If the receiver should not be started, clean up its job bookkeeping
		onReceiverJobFinish(receiverId)
		return
	}



	val checkpointDirOption = Option(ssc.checkpointDir)
	val serializableHadoopConf = 
		new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)



	// startReceiverFunc encapsulates the action of starting the Receiver on a worker node
	// Function to start the receiver on the worker node
	val startReceiverFunc: Iterator[Receiver[_]] => Unit =
		(iterator: Iterator[Receiver[_]]) => {
			if (!iterator.hasNext) {
				throw new SparkException(
					"Could not start receiver as object not found.")
			}
			if (TaskContext.get().attemptNumber() == 0) {
				val receiver = iterator.next()
				assert(iterator.hasNext == false)
				// ReceiverSupervisorImpl supervises the Receiver and also handles writing the received data
				val supervisor = new ReceiverSupervisorImpl(
					receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
				supervisor.start()
				supervisor.awaitTermination()
			} else {
				// To restart a Receiver, the scheduling above must be redone; a plain Task retry is not wanted
				// It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
			}
		}
	// Build the RDD that wraps the Receiver; it is used to re-create and start the Receiver on a worker
	// Create the RDD using the scheduledLocations to run the receiver in a Spark job
	val receiverRDD: RDD[Receiver[_]] = 
		if (scheduledLocations.isEmpty) {
			ssc.sc.makeRDD(Seq(receiver), 1)
		} else {
			val preferredLocations = scheduledLocations.map(_.toString).distinct
			ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
		}
	// As receiverId indicates, there is only one Receiver in this RDD
	receiverRDD.setName(s"Receiver $receiverId")
	ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
	ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))


	// Starting each Receiver triggers a Job of its own, rather than one job's tasks starting all Receivers
	// An application may have several Receivers
	// SparkContext.submitJob is called, i.e. a Spark job is launched just to start this Receiver
	val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
		receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
	// We will keep restarting the receiver job until ReceiverTracker is stopped
	future.onComplete {
		case Success(_) =>
			if (!shouldStartReceiver) {
				onReceiverJobFinish(receiverId)
			} else {
				logInfo(s"Restarting Receiver $receiverId")
				self.send(RestartReceiver(receiver))
			}
		case Failure(e) =>
			if (!shouldStartReceiver) {
				onReceiverJobFinish(receiverId)
			} else {
				logError("Receiver has been stopped. Try to restart it.", e)
				logInfo(s"Restarting Receiver $receiverId")
				self.send(RestartReceiver(receiver))
			}
	// The job is submitted via a thread pool so that multiple Receivers can be started concurrently
	}(submitJobThreadPool)
	logInfo(s"Receiver ${receiver.streamId} started")

}

As the comments show, Spark Streaming itself specifies which Executors a Receiver runs on, rather than leaving the placement to Spark Core's task scheduling.
Although the function startReceiverFunc is defined before submitJob is called, it is shipped to the Executor side for execution through SparkContext.submitJob. The Receiver is wrapped into an RDD that Spark can process, and this RDD is another argument of SparkContext.submitJob. In other words, Spark Streaming uses SparkContext.submitJob to start a Receiver on an Executor, and each such Job starts exactly one Receiver.
When starting a Receiver fails, the ReceiverTrackerEndpoint is triggered to submit a new Spark Job to start the Receiver again.
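To isolate this mechanism, the following self-contained sketch shows the same SparkContext.submitJob pattern on its own (the object name SubmitJobSketch and the function runOnExecutor are invented for illustration): a function is run against the single partition of an RDD on an executor, and the returned future tells the caller whether the job finished or failed, which is exactly the hook the ReceiverTracker uses to decide about restarting:

// A standalone sketch of the SparkContext.submitJob pattern (hypothetical example)

import scala.concurrent.Await
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success}

import org.apache.spark.{SparkConf, SparkContext}

object SubmitJobSketch {
	def main(args: Array[String]): Unit = {
		val sc = new SparkContext(
			new SparkConf().setAppName("submitJob-sketch").setMaster("local[2]"))

		// A single-partition RDD, analogous to receiverRDD wrapping one Receiver
		val rdd = sc.makeRDD(Seq("payload"), 1)

		// The function shipped to the executor, analogous to startReceiverFunc
		val runOnExecutor: Iterator[String] => Unit =
			(it: Iterator[String]) => it.foreach(v => println(s"running on executor with: $v"))

		// Submit a job over partition 0 only; the future completes when the job ends
		val future = sc.submitJob[String, Unit, Unit](
			rdd, runOnExecutor, Seq(0), (_, _) => (), ())

		future.onComplete {
			case Success(_) => println("job finished; the ReceiverTracker would decide whether to restart")
			case Failure(e) => println(s"job failed: $e")
		}

		Await.ready(future, 1.minute)
		sc.stop()
	}
}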
When the function runs on an Executor, it calls ReceiverSupervisor.start:

// ReceiverSupervisor.start
/** Start the supervisor */

def start() {
	onStart()	// the concrete implementation is provided by the subclass
	startReceiver()
}

Next, look at onStart in ReceiverSupervisorImpl, the subclass of ReceiverSupervisor:

// ReceiverSupervisorImpl.onStart

override protected def onStart() {
	registeredBlockGenerators.foreach { _.start() }
}

This starts every registered BlockGenerator: the _.start() here is BlockGenerator.start, which starts the generator that turns received data into blocks. Following the flow chart, the analysis of this branch stops here.
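For reference, BlockGenerator.start itself is small. Abridged from the Spark 2.x source, it moves the generator into the Active state and starts two threads: a timer that cuts the currently buffered records into a block at every block interval, and a thread that pushes finished blocks onwards (eventually to the BlockManager):

// BlockGenerator.start (abridged sketch)

def start(): Unit = synchronized {
	if (state == Initialized) {
		state = Active
		// Timer that turns the buffered records into a block every blockInterval
		blockIntervalTimer.start()
		// Thread that takes finished blocks off the internal queue and pushes them
		blockPushingThread.start()
		logInfo("Started BlockGenerator")
	} else {
		throw new SparkException(
			s"Cannot start BlockGenerator as its not in the Initialized state [state = $state]")
	}
}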

ReceiverSupervisor.start

Back in ReceiverSupervisor.start, now look at startReceiver:

// ReceiverSupervisor.startReceiver

/** Start receiver */

def startReceiver(): Unit = synchronized {
	try {
		if (onReceiverStart()) {
			logInfo("Starting receiver")
			receiverState = Started
			receiver.onStart()
			logInfo("Called receiver onStart")
		} else {
			// The driver refused us
			stop("Registered unsuccessfully because Driver refused to start receiver "+ streamId,None)
		}
	} catch {
		case NonFatal(t) =>
			stop("Error starting receiver " + streamId, Some(t))
		}

}

Under the normal flow, the condition ReceiverSupervisorImpl.onReceiverStart is satisfied and receiver.onStart is then called. This receiver is the Receiver subclass defined in a ReceiverInputDStream subclass, which is determined by the InputDStream that the Spark Streaming application associates with its stream data source.
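The condition checked here, ReceiverSupervisorImpl.onReceiverStart, registers the receiver with the ReceiverTracker on the Driver and returns whether the Driver accepted it. Abridged from recent Spark 2.x sources (older versions use askWithRetry instead of askSync), it is roughly:

// ReceiverSupervisorImpl.onReceiverStart (abridged sketch)

override protected def onReceiverStart(): Boolean = {
	val msg = RegisterReceiver(
		streamId, receiver.getClass.getSimpleName, host, executorId, endpoint)
	// Ask the driver-side ReceiverTracker; true means the driver accepts this receiver
	trackerEndpoint.askSync[Boolean](msg)
}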
This completes the walkthrough of the flow chart from the previous post.
