The Ins and Outs of the Spark Streaming Receiver!

  This article walks through what happens after Spark Streaming starts: the ReceiverTracker launched inside the StreamingContext, and the detailed process by which the Receivers this big brother manages are created and distributed. We first cover how Spark Streaming starts the ReceiverTracker, then raise a few questions and use them to explore the Receiver further.
  When Spark Streaming starts, the JobScheduler creates the ReceiverTracker instance, which is responsible for managing all the Receivers (data receivers) running on the executors. Looking at the ReceiverTracker class, the main member variables are declared as follows:

---> ReceiverTracker
    private val receiverInputStreams = ssc.graph.getReceiverInputStreams()
    private val receivedBlockTracker = new ReceivedBlockTracker(
      ssc.sparkContext.conf,
      ssc.sparkContext.hadoopConfiguration,
      receiverInputStreamIds,
      ssc.scheduler.clock,
      ssc.isCheckpointPresent,
      Option(ssc.checkpointDir)
    )
    ...
    private val schedulingPolicy = new ReceiverSchedulingPolicy()

  Among these, receiverInputStreams is where the Receivers come from, receivedBlockTracker manages the blocks of received data, and schedulingPolicy decides the distribution strategy.
  In JobScheduler.start(), receiverTracker.start() is called. ReceiverTracker.start() mainly creates, on the driver side, the ReceiverTrackerEndpoint used to communicate with the Receivers, and then calls launchReceivers() to send itself a StartAllReceivers message, which kicks off distributing every Receiver to its executor. The placement strategy is exactly the schedulingPolicy mentioned above: its scheduleReceivers method finds a suitable executor for every Receiver. Each Receiver has a preferredLocation method, and when executors are assigned, the result returned by the Receiver's preferredLocation is honored; if preferredLocation specifies nothing, the Receivers are spread evenly at random across all executors (a small sketch follows below).
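
  As a side note, here is a minimal sketch of how user code can influence that placement. It is only an illustration: the host name "worker-1" is a made-up assumption, and the scheduling policy can only honor the hint when an executor actually runs on that host.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A minimal custom receiver that asks the scheduler to place it on a
// specific executor host; returning None instead lets the policy
// spread receivers evenly across all executors.
class PinnedReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  override def preferredLocation: Option[String] = Some("worker-1") // hypothetical host

  override def onStart(): Unit = {
    // Start a background thread here and hand records to Spark via store(...).
  }

  override def onStop(): Unit = {
    // Release anything opened in onStart().
  }
}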
  At this point, ReceiverTracker.start() has finished and every Receiver has found its home. So far, though, we have only seen how a Receiver finds its home and is launched; what does the concrete startup flow of a Receiver actually look like?
  To answer that question, let's go back to the handler for the StartAllReceivers message:

---> ReceiverTracker.ReceiverTrackerEndpoint.receive()
override def receive: PartialFunction[Any, Unit] = {
      // Local messages
      case StartAllReceivers(receivers) =>
        val scheduledExecutors = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
        for (receiver <- receivers) {
          val executors = scheduledExecutors(receiver.streamId)
          updateReceiverScheduledExecutors(receiver.streamId, executors)
          receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
          startReceiver(receiver, executors)
        }
    ...
}

  Clearly, schedulingPolicy.scheduleReceivers(receivers, getExecutors) gives us the placement for every Receiver; we then iterate over receivers and actually distribute and start each one. Let's look at the startReceiver method; the details are in the comments:

private def startReceiver(receiver: Receiver[_], scheduledExecutors: Seq[String]): Unit = {
      def shouldStartReceiver: Boolean = {
        // It's okay to start when trackerState is Initialized or Started
        !(isTrackerStopping || isTrackerStopped)
      }

      val receiverId = receiver.streamId
      // If the ReceiverTracker has stopped or is stopping for some reason, don't start the receiver; just mark its job as finished and return.
      if (!shouldStartReceiver) {
        onReceiverJobFinish(receiverId)
        return
      }

      val checkpointDirOption = Option(ssc.checkpointDir)
      // Wrap the Hadoop configuration so it can be serialized and shipped to executors
      val serializableHadoopConf =
        new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)

      // Function that starts the receiver; passed as the task body to submitJob below
      val startReceiverFunc: Iterator[Receiver[_]] => Unit =
        (iterator: Iterator[Receiver[_]]) => {
          if (!iterator.hasNext) {
            throw new SparkException(
              "Could not start receiver as object not found.")
          }
          // Only the first attempt actually starts the receiver: attemptNumber is 0 on the first launch and is incremented by 1 on every retry
          if (TaskContext.get().attemptNumber() == 0) {
            val receiver = iterator.next()
            assert(iterator.hasNext == false)
            // The Receiver's "supervisor", created on the executor side; it takes care of storing the data the Receiver receives.
            val supervisor = new ReceiverSupervisorImpl(
              receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
            supervisor.start()
            supervisor.awaitTermination()
          } else {
            // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
          }
        }

      // Create the RDD using the scheduledExecutors to run the receiver in a Spark job
      // (a single-element RDD whose only record is the receiver itself; scheduledExecutors become its preferred locations)
      val receiverRDD: RDD[Receiver[_]] =
        if (scheduledExecutors.isEmpty) {
          ssc.sc.makeRDD(Seq(receiver), 1)
        } else {
          ssc.sc.makeRDD(Seq(receiver -> scheduledExecutors))
        }
      receiverRDD.setName(s"Receiver $receiverId")
      ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
      ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))
        
      // Submit the job for this RDD through Spark Core; when the task runs it starts a
      // ReceiverSupervisorImpl as the supervisor, which then creates and starts the Receiver it manages.
      val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
        receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
      // We will keep restarting the receiver job until ReceiverTracker is stopped
      future.onComplete {
        case Success(_) =>
          if (!shouldStartReceiver) {
            onReceiverJobFinish(receiverId)
          } else {
            logInfo(s"Restarting Receiver $receiverId")
            self.send(RestartReceiver(receiver))
          }
        case Failure(e) =>
          ...
      }(submitJobThreadPool)
      logInfo(s"Receiver ${receiver.streamId} started")
    }
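
  The interesting trick above is that ssc.sparkContext.submitJob pins a long-lived function onto a single partition, and hence a single executor. Here is a minimal standalone use of the same API; the app name and the "payload" record are illustrative only.

import org.apache.spark.{SparkConf, SparkContext}

object SubmitJobDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("submit-job-demo").setMaster("local[2]"))
    // One partition => one task => one executor slot, just like a receiver job.
    val oneElementRdd = sc.makeRDD(Seq("payload"), 1)
    val future = sc.submitJob[String, Unit, Unit](
      oneElementRdd,
      (it: Iterator[String]) => it.foreach(println), // task body, like startReceiverFunc
      Seq(0),                                        // run only partition 0
      (_, _) => (),                                  // ignore per-partition results
      ())                                            // overall result
    scala.concurrent.Await.ready(future, scala.concurrent.duration.Duration.Inf)
    sc.stop()
  }
}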

  That, then, is the concrete flow for starting a Receiver: each Receiver is in fact brought up by a ReceiverSupervisorImpl. Note that the supervisor's onStart() method is registeredBlockGenerators.foreach { _.start() }, i.e., it starts the registered BlockGenerators. The job of BlockGenerator is to gather the data coming from a Receiver and merge it into blocks; how blocks are managed and stored is skipped here for reasons of space and will be studied in depth in a follow-up article.
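
  As a teaser for that follow-up, here is a toy sketch of the batching idea only, not Spark's real BlockGenerator (which additionally handles rate limiting, listener callbacks, and so on): buffer incoming records and flush them as one block per interval, which is what the spark.streaming.blockInterval setting (default 200ms) controls.

import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}
import scala.collection.mutable.ArrayBuffer

// Toy version of the batching idea: records accumulate in a buffer and are
// flushed downstream as one "block" per interval.
class ToyBlockGenerator[T <: AnyRef](blockIntervalMs: Long)(onBlock: Seq[T] => Unit) {
  private val buffer = new ConcurrentLinkedQueue[T]()
  private val timer = Executors.newSingleThreadScheduledExecutor()

  def addData(record: T): Unit = buffer.add(record) // fed by the receiver's store(...)

  def start(): Unit = {
    timer.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = {
        val block = ArrayBuffer.empty[T]
        var rec = buffer.poll()
        while (rec != null) { block += rec; rec = buffer.poll() }
        if (block.nonEmpty) onBlock(block.toList) // hand one finished block downstream
      }
    }, blockIntervalMs, blockIntervalMs, TimeUnit.MILLISECONDS)
  }

  def stop(): Unit = timer.shutdown()
}
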
  With all of the above, the startup flow of the Receiver is basically clear: we know how each Receiver is distributed and launched. But a few questions remain about its origin, for example: how many Receivers does one job have, and where does each Receiver come from?
  To answer this, let's keep reading the code. Tracing back to the launchReceivers() method, we find that receivers comes from:

val receivers = receiverInputStreams.map(nis => {
  val rcvr = nis.getReceiver()
  rcvr.setReceiverId(nis.id)
  rcvr
})

  The receiverInputStreams here was mentioned earlier: it comes from the DStreamGraph, and a little tracing shows that the number of receivers a DStreamGraph generates is exactly the number of its input streams (a minimal sketch follows this paragraph). Digging further into where each receiver comes from, we find it originates in the getReceiver method of the abstract class ReceiverInputDStream. And here a question appears: looking for concrete implementations of ReceiverInputDStream, there are only four, such as KafkaInputDStream and SocketInputDStream. ReceiverInputDStream is an abstract class that extends InputDStream, and InputDStream, which we have met before, also has direct subclasses of its own; getReceiver is a new abstract method added in ReceiverInputDStream as an extension of InputDStream. So not every InputDStream has getReceiver: only the classes extending ReceiverInputDStream do, while classes extending InputDStream directly do not. For example, the widely used DirectKafkaInputDStream extends InputDStream directly, whereas KafkaInputDStream extends ReceiverInputDStream.
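
  To make the one-receiver-per-input-stream point concrete, here is a minimal runnable sketch (host names and ports are illustrative): each socketTextStream call registers one ReceiverInputDStream in the DStreamGraph, so this app distributes exactly two Receivers, each occupying one executor slot.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TwoReceivers {
  def main(args: Array[String]): Unit = {
    // Two input streams => two ReceiverInputDStreams in the DStreamGraph
    // => launchReceivers() distributes exactly two Receivers.
    val ssc = new StreamingContext(
      new SparkConf().setAppName("two-receivers").setMaster("local[4]"), Seconds(5))
    val lines1 = ssc.socketTextStream("host-a", 9999)
    val lines2 = ssc.socketTextStream("host-b", 9999)
    lines1.union(lines2).count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}
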
  So why are there these two families? I couldn't find an explanation in the code, but the article https://github.com/allwefantasy/my-life/blob/master/Spark-Streaming-Direct-Approach-(No-Receivers)-分析.md explains it: the split comes from the two ways of ingesting data, the Receiver-based approach and the No Receivers (direct) approach. DirectKafkaInputDStream is obtained through createDirectStream, while KafkaInputDStream is obtained through createStream; see that article for the detailed comparison, which I won't repeat here. A sketch of the two entry points follows below.
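
  For a concrete picture of those two entry points, here is a hedged sketch against the spark-streaming-kafka 0.8 API (the ZooKeeper address, broker list, group id, and topic name are all illustrative, and ssc is a StreamingContext built as in the previous sketch):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Receiver-based: createStream returns a KafkaInputDStream (a
// ReceiverInputDStream), so getReceiver exists and a Receiver is
// launched on an executor through the flow described in this article.
val receiverBased = KafkaUtils.createStream(
  ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1))

// Direct: createDirectStream returns a DirectKafkaInputDStream (a plain
// InputDStream), so there is no Receiver; each batch's tasks read offset
// ranges from Kafka directly.
val direct = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "broker:9092"), Set("my-topic"))
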
  I've set up a source-code reading group (QQ group: 936037639). If you're also reading the source of Spark or other big-data frameworks, come join and let's discuss; some details are really hard to figure out on your own, and many hands make light work!
