This article walks through what happens after Spark Streaming starts: the `ReceiverTracker` launched inside `StreamingContext`, and the detailed process by which the `Receiver`s it manages are generated and distributed. We first cover how Spark Streaming starts the `ReceiverTracker`, then raise a few questions and use them to explore `Receiver` further.
When Spark Streaming starts, the `JobScheduler` creates a `ReceiverTracker` instance, which is responsible for managing all the data receivers (`Receiver`s) running on the executors. Looking at the members of the `ReceiverTracker` class, the main ones are declared as follows:
---> ReceiverTracker
private val receiverInputStreams = ssc.graph.getReceiverInputStreams()
private val receivedBlockTracker = new ReceivedBlockTracker(
  ssc.sparkContext.conf,
  ssc.sparkContext.hadoopConfiguration,
  receiverInputStreamIds,
  ssc.scheduler.clock,
  ssc.isCheckpointPresent,
  Option(ssc.checkpointDir)
)
...
private val schedulingPolicy = new ReceiverSchedulingPolicy()
Here, `receiverInputStreams` is where the `Receiver`s come from, `receivedBlockTracker` manages the blocks of received data, and `schedulingPolicy` decides how receivers are distributed across executors.
The tracker is started by the call to `receiverTracker.start()` in the `start()` method of `JobScheduler`. The `start` method of `ReceiverTracker` mainly creates, on the driver, the `ReceiverTrackerEndpoint` used to communicate with each `Receiver`, then calls the `launchReceivers` method, which sends itself a `StartAllReceivers` message to kick off the distribution of all `Receiver`s to their executors. The placement strategy is the `schedulingPolicy` mentioned above: its `scheduleReceivers` method finds a suitable executor for every `Receiver`. Each `Receiver` has a `preferredLocation` method; when assigning executors, the result of `preferredLocation` is honored, and if no preferred location is specified, the receiver is spread evenly across all executors.
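The placement logic can be sketched outside Spark. Below is a minimal, made-up stand-in for `ReceiverSchedulingPolicy` (the real one also balances load by weighting executors that already host receivers); `FakeReceiver` and the executor names are purely illustrative:

```scala
// Toy version of the scheduling idea: honor preferredLocation when it matches
// a live executor, otherwise fall back to round-robin over all executors.
case class FakeReceiver(streamId: Int, preferredLocation: Option[String])

def scheduleReceivers(
    receivers: Seq[FakeReceiver],
    executors: Seq[String]): Map[Int, String] = {
  var next = 0 // round-robin cursor for receivers without a preference
  receivers.map { r =>
    val host = r.preferredLocation.filter(executors.contains).getOrElse {
      val h = executors(next % executors.size)
      next += 1
      h
    }
    r.streamId -> host
  }.toMap
}

val executors = Seq("exec-1", "exec-2", "exec-3")
val assigned = scheduleReceivers(
  Seq(
    FakeReceiver(0, Some("exec-2")), // pinned by preferredLocation
    FakeReceiver(1, None),           // spread round-robin
    FakeReceiver(2, None)),
  executors)
```

Here `assigned(0)` lands on its preferred `exec-2`, while receivers 1 and 2 are spread round-robin over the executor list.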
At this point the `start` method of `ReceiverTracker` has finished and every `Receiver` has found its home. Strictly speaking, though, we have only seen each `Receiver` get assigned a home; how exactly is the `Receiver` itself then started?
To answer that, let's go back to the handler for the `StartAllReceivers` message:
--> ReceiverTracker.ReceiverTrackerEndpoint.receive()
override def receive: PartialFunction[Any, Unit] = {
  // Local messages
  case StartAllReceivers(receivers) =>
    val scheduledExecutors = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
    for (receiver <- receivers) {
      val executors = scheduledExecutors(receiver.streamId)
      updateReceiverScheduledExecutors(receiver.streamId, executors)
      receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
      startReceiver(receiver, executors)
    }
  ...
}
Clearly, `schedulingPolicy.scheduleReceivers(receivers, getExecutors)` gives us the home of each `Receiver`; we then iterate over `receivers` to do the actual distribution and launch. Let's look at the `startReceiver` method; the details are in the comments:
private def startReceiver(receiver: Receiver[_], scheduledExecutors: Seq[String]): Unit = {
  def shouldStartReceiver: Boolean = {
    // It's okay to start when trackerState is Initialized or Started
    !(isTrackerStopping || isTrackerStopped)
  }

  val receiverId = receiver.streamId
  // If for some reason the ReceiverTracker has stopped or is stopping,
  // skip starting the receiver and just mark its job as finished.
  if (!shouldStartReceiver) {
    onReceiverJobFinish(receiverId)
    return
  }

  val checkpointDirOption = Option(ssc.checkpointDir)
  // Wrap the Hadoop configuration so it can be serialized to the executor
  val serializableHadoopConf =
    new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)

  // Function that starts the receiver; passed as the job body to submitJob below
  val startReceiverFunc: Iterator[Receiver[_]] => Unit =
    (iterator: Iterator[Receiver[_]]) => {
      if (!iterator.hasNext) {
        throw new SparkException(
          "Could not start receiver as object not found.")
      }
      // Only run on the first attempt: attemptNumber is 0 only the first time
      // the task runs, and increases by 1 on every retry.
      if (TaskContext.get().attemptNumber() == 0) {
        val receiver = iterator.next()
        assert(iterator.hasNext == false)
        // The executor-side supervisor of the Receiver, responsible for
        // storing the data the Receiver receives.
        val supervisor = new ReceiverSupervisorImpl(
          receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
        supervisor.start()
        supervisor.awaitTermination()
      } else {
        // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
      }
    }

  // Create the RDD using the scheduledExecutors to run the receiver in a Spark job
  // (a one-element RDD wrapping the receiver, with the scheduled executors
  // as its preferred locations)
  val receiverRDD: RDD[Receiver[_]] =
    if (scheduledExecutors.isEmpty) {
      ssc.sc.makeRDD(Seq(receiver), 1)
    } else {
      ssc.sc.makeRDD(Seq(receiver -> scheduledExecutors))
    }
  receiverRDD.setName(s"Receiver $receiverId")
  ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
  ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))

  // Submit the RDD's task through Spark Core; when the task runs it starts the
  // ReceiverSupervisorImpl, which in turn creates the Receiver it manages and
  // begins receiving data.
  val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
    receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
  // We will keep restarting the receiver job until ReceiverTracker is stopped
  future.onComplete {
    case Success(_) =>
      if (!shouldStartReceiver) {
        onReceiverJobFinish(receiverId)
      } else {
        logInfo(s"Restarting Receiver $receiverId")
        self.send(RestartReceiver(receiver))
      }
    case Failure(e) =>
      ...
  }(submitJobThreadPool)
  logInfo(s"Receiver ${receiver.streamId} started")
}
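The restart-until-stopped pattern in `future.onComplete` can be sketched without Spark. This is a toy stand-in (the function `startReceiverJob` and the stop condition are made up for illustration): the "receiver job" is a `Future` that is resubmitted on completion until the "tracker" stops, which here simply means three launches have happened:

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Success

val launches = new AtomicInteger(0)
val done = Promise[Int]()

// Stand-in for submitJob + onComplete: resubmit the job until the stop
// condition holds, mirroring the shouldStartReceiver check above.
def startReceiverJob(): Unit = {
  Future { launches.incrementAndGet() }.onComplete {
    case Success(n) if n < 3 => startReceiverJob()            // still running: restart
    case _                   => done.success(launches.get())  // "tracker stopped"
  }
}

startReceiverJob()
val total = Await.result(done.future, 10.seconds)
// total == 3: the receiver job was relaunched until the stop condition held
```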
The above is the concrete flow of starting a `Receiver`: each `Receiver` is in fact started via a `ReceiverSupervisorImpl`. Note that the supervisor's `onStart()` method is: `registeredBlockGenerators.foreach { _.start() }`, i.e. it starts the registered `BlockGenerator`s. The job of this class is to gather the data coming from the `Receiver` and merge it into blocks. How blocks are managed and stored is left out here for reasons of space and will be explored in depth in a follow-up article.
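Conceptually, a `BlockGenerator` buffers incoming records and periodically seals the buffer into a block. Here is a toy sketch of that idea (not Spark's actual `BlockGenerator`, which runs a `RecurringTimer` thread and pushes sealed blocks to a listener for storage); the timer tick is driven manually to keep the sketch small:

```scala
import scala.collection.mutable.ArrayBuffer

case class Block(id: Long, records: Seq[String])

class ToyBlockGenerator {
  private val buffer = ArrayBuffer.empty[String]
  private val blocks = ArrayBuffer.empty[Block]
  private var nextBlockId = 0L

  // Called for every record the receiver hands over.
  def addData(record: String): Unit = synchronized { buffer += record }

  // Corresponds to one firing of the recurring timer: seal the current
  // buffer into a new block and start an empty buffer.
  def tick(): Unit = synchronized {
    if (buffer.nonEmpty) {
      blocks += Block(nextBlockId, buffer.toList)
      nextBlockId += 1
      buffer.clear()
    }
  }

  def generatedBlocks: Seq[Block] = synchronized(blocks.toList)
}

val gen = new ToyBlockGenerator
gen.addData("a"); gen.addData("b")
gen.tick() // seals ["a", "b"] into block 0
gen.addData("c")
gen.tick() // seals ["c"] into block 1
```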
With all of that covered, the launch flow of a `Receiver` is basically clear: we know how each `Receiver` is distributed and started. But there are still questions about where they come from. How many `Receiver`s does a job have, and where does each `Receiver` come from?
To answer this, let's keep reading the code. Tracing back to the `launchReceivers()` method, we find that `receivers` comes from:
val receivers = receiverInputStreams.map(nis => {
  val rcvr = nis.getReceiver()
  rcvr.setReceiverId(nis.id)
  rcvr
})
`receiverInputStreams` was mentioned earlier: it is obtained from the `DStreamGraph`, and it is easy to verify that the number of receivers generated for a `DStreamGraph` equals the number of input streams. Tracing each `receiver` to its source, we find it comes from the `getReceiver` method of the abstract class `ReceiverInputDStream`. But here something interesting shows up: looking at the concrete implementations of `ReceiverInputDStream`, there are only a few, such as `KafkaInputDStream` and `SocketInputDStream`. Moreover, `ReceiverInputDStream` is an abstract class extending `InputDStream`, and `InputDStream`, which we have seen before, has direct subclasses of its own. `getReceiver` is an abstract method newly added in `ReceiverInputDStream` as an extension of `InputDStream`, so not every `InputDStream` has a `getReceiver` method: only classes extending `ReceiverInputDStream` do, while classes extending `InputDStream` directly do not. For example, the widely used `DirectKafkaInputDStream` extends `InputDStream` directly, whereas `KafkaInputDStream` extends `ReceiverInputDStream`. So why the split into these two families?
The code itself does not explain this split. The article https://github.com/allwefantasy/my-life/blob/master/Spark-Streaming-Direct-Approach-(No-Receivers)-分析.md makes it clear: it reflects the two ways of ingesting data, the receiver-based approach and the no-receivers (direct) approach. `DirectKafkaInputDStream` is what you get from `createDirectStream`, while `KafkaInputDStream` comes from `createStream`. See that article for a detailed comparison; I won't repeat it here.
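The split can be sketched with stand-in classes (the names mirror Spark's but the bodies are made up for illustration): `getReceiver` exists only on the `ReceiverInputDStream` branch, so a direct stream never contributes a `Receiver` to `launchReceivers`:

```scala
// Minimal mock of the hierarchy described above.
abstract class FakeInputDStream { def name: String }

// The receiver-based branch adds the abstract getReceiver method.
abstract class FakeReceiverInputDStream extends FakeInputDStream {
  def getReceiver(): String // stand-in for Receiver[_]
}

class FakeSocketInputDStream extends FakeReceiverInputDStream {
  val name = "socket"
  def getReceiver(): String = "SocketReceiver"
}

// Direct streams extend InputDStream directly: they compute their input
// (e.g. Kafka offset ranges) themselves and need no Receiver.
class FakeDirectKafkaInputDStream extends FakeInputDStream {
  val name = "directKafka"
}

val streams: Seq[FakeInputDStream] =
  Seq(new FakeSocketInputDStream, new FakeDirectKafkaInputDStream)

// Mirrors launchReceivers: only ReceiverInputDStreams yield receivers.
val receivers = streams.collect {
  case r: FakeReceiverInputDStream => r.getReceiver()
}
```

Only the socket stream contributes a receiver; the direct Kafka stream yields none.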