Spark Streaming: Receiver Startup and Data Reception

Get into the habit of taking notes!!!

A small example:

def fromsocket(ssc: StreamingContext): Unit = {
  /** A checkpoint directory must be set before using updateStateByKey */
  ssc.checkpoint("hdfs://zhumaster:8020/data/checkpoint")
  /** When using a network socket as the input source, run `nc -lk <port>`
    * to open the listening side. Port 9999 is used here; any free port works.
    * Use `netstat -ap | grep 9999` to check whether port 9999 is listening. */
  val soDS: ReceiverInputDStream[String] = ssc.socketTextStream("zhumaster", 9999)
  val wordCount: DStream[(String, Int)] = soDS.flatMap(_.split(" ")).map(x => (x, 1)).reduceByKey(_ + _)
  wordCount.foreachRDD { rdd =>
    /** Skip empty RDDs so no job is submitted for an empty batch;
      * a job is only submitted when data has actually been received. */
    if (!rdd.isEmpty()) rdd.foreach(println)
  }
  wordCount.print()
  ssc.start()
  ssc.awaitTermination()
  ssc.stop(true)
}
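A minimal driver that wires up the example above might look like the following sketch. The master URL, app name, and batch interval are illustrative choices, not taken from the original post:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    // Locally at least two threads are required: one for the Receiver,
    // one for processing the received batches.
    val conf = new SparkConf().setMaster("local[2]").setAppName("SocketWordCount")
    // 5-second batchDuration is an arbitrary choice for this sketch
    val ssc = new StreamingContext(conf, Seconds(5))
    fromsocket(ssc) // fromsocket itself calls start() and awaitTermination()
  }
}
```

Note that with fewer than two local threads, the Receiver occupies the only core and no batch ever gets processed.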

StreamingContext still uses SparkContext as the entry point into Spark:

def this(conf: SparkConf, batchDuration: Duration) = {
  this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
}

private[streaming] def createNewSparkContext(conf: SparkConf): SparkContext = {
  new SparkContext(conf)
}

StreamingContext's start method mainly calls JobScheduler's start method. When the JobScheduler object starts, it in turn starts the EventLoop, StreamingListenerBus, ReceiverTracker, and JobGenerator objects. JobGenerator is responsible for job generation (turning each batch into a concrete RDD DAG); JobScheduler is responsible for job scheduling, playing a role analogous to the DAGScheduler in Spark Core; ReceiverTracker is responsible for acquiring the data.

def start(): Unit = synchronized {
  if (eventLoop != null) return // scheduler has already been started
  logDebug("Starting JobScheduler")
  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  eventLoop.start() // start the message-loop thread that handles JobScheduler events
  for {
    inputDStream <- ssc.graph.getInputStreams
    rateController <- inputDStream.rateController
  } ssc.addStreamingListener(rateController)

  listenerBus.start(ssc.sparkContext)  // start the listener bus that updates the Streaming tab in the Spark UI
  receiverTracker = new ReceiverTracker(ssc)
  inputInfoTracker = new InputInfoTracker(ssc) // tracks all input streams and their input statistics
  receiverTracker.start() // handles data reception, data caching, and Block generation
  jobGenerator.start() // handles DStreamGraph initialization, DStream-to-RDD conversion, job generation and submission, etc.
  logInfo("Started JobScheduler")
}

This post records how the Receiver is started and how data is received.

A Receiver continuously ingests data flowing in from an external source and keeps reporting the metadata of the received data to the Driver; each batchDuration, jobs are generated from the reported data and the corresponding RDD transformations are executed.
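The Receiver contract itself is small. A sketch of a custom line-oriented Receiver (the class name, host, and port are illustrative; this roughly mirrors what socketTextStream's receiver does) shows the onStart/onStop/store pattern:

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A minimal sketch of a socket Receiver; error handling is kept small.
class LineReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // Receive on a separate thread so onStart returns immediately.
    new Thread("Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  override def onStop(): Unit = {
    // Nothing to do: the receiving thread exits once isStopped() is true.
  }

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        store(line) // hand the record to the ReceiverSupervisor / BlockManager
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      // Ask the supervisor to restart this receiver when the connection drops.
      restart("Connection closed, restarting")
    } catch {
      case e: Throwable => restart("Error receiving data", e)
    }
  }
}
```

store() is the hand-off point: the records go to the ReceiverSupervisor, which assembles them into Blocks, as described below.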

Data reception starts from ReceiverTracker's start method. Receivers are launched on Executors to continuously receive streaming data. The received records arrive one by one, so they must be assembled into Blocks, saved by the BlockManager, and the metadata of the saved Blocks reported to the Driver. ReceiverTracker's start method first checks whether any InputStreams exist, then uses launchReceivers to start the corresponding processing on the Executors:

def start(): Unit = synchronized {
  if (!receiverInputStreams.isEmpty) {
    endpoint = ssc.env.rpcEnv.setupEndpoint("ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
    if (!skipReceiverLaunch) launchReceivers()
    logInfo("ReceiverTracker started")
    trackerState = Started
  }
}

The thread that launches Receivers checks the receiverInputStreams object and only starts Receivers when it is non-empty. receiverInputStreams holds the ReceiverInputDStreams registered with the DStreamGraph, so starting Receivers depends on having input sources.

The launchReceivers method collects the Receiver of each input stream and sends them via the endpoint out to the worker Executors. Inside it, ReceiverTracker's runDummySparkJob method is invoked; the purpose of runDummySparkJob is to ensure the Receivers will be spread across all nodes.

The source of ReceiverTracker's launchReceivers method:

private def launchReceivers(): Unit = {
  val receivers = receiverInputStreams.map(nis => {
    val rcvr = nis.getReceiver()
    rcvr.setReceiverId(nis.id)
    rcvr
  })
  runDummySparkJob()  // prevents all Receivers from landing on the same slave node, ensuring load balancing
  endpoint.send(StartAllReceivers(receivers)) // the endpoint here is in fact the ReceiverTrackerEndpoint
}
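For reference, runDummySparkJob is itself tiny: it runs a throwaway shuffle job so that the BlockManagers on all executors register with the driver before receivers are placed. Shown here approximately; the exact constants may differ between Spark versions:

```scala
// Run a dummy Spark job so that all executors are brought up and known to
// the driver; otherwise every receiver might be scheduled onto one node.
private def runDummySparkJob(): Unit = {
  if (!ssc.sparkContext.isLocal) {
    ssc.sparkContext.makeRDD(1 to 50, 50).map(x => (x, 1)).reduceByKey(_ + _, 20).collect()
  }
  assert(getExecutors.nonEmpty)
}
```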

How ReceiverTrackerEndpoint handles StartAllReceivers, from its source:

override def receive: PartialFunction[Any, Unit] = {
  // Local messages
  case StartAllReceivers(receivers) =>
    val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
    for (receiver <- receivers) {
      val executors = scheduledLocations(receiver.streamId)
      updateReceiverScheduledExecutors(receiver.streamId, executors)
      receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
      startReceiver(receiver, executors)
    }
  // ......

The incoming receivers are assigned to executors according to a scheduling policy. Receiver scheduling is not delegated to the Spark core; it is done by the Spark Streaming framework itself, to prevent the core from treating Receivers as ordinary jobs and scheduling multiple Receivers onto the same node.
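The actual policy lives in ReceiverSchedulingPolicy. Ignoring preferred locations, its effect is roughly a balanced round-robin, which can be sketched in plain Scala (the names here are illustrative, not Spark's API):

```scala
// Illustrative only: spread receiver ids evenly across executor ids,
// round-robin, so no executor receives more than its fair share.
def roundRobinSchedule(
    receiverIds: Seq[Int],
    executors: Seq[String]): Map[Int, String] = {
  require(executors.nonEmpty, "no executors registered yet")
  receiverIds.zipWithIndex.map { case (id, i) =>
    id -> executors(i % executors.size)
  }.toMap
}

// e.g. roundRobinSchedule(Seq(0, 1, 2), Seq("exec-a", "exec-b"))
// places receivers 0 and 2 on "exec-a" and receiver 1 on "exec-b"
```

The real policy additionally honors each receiver's preferredLocation and balances by current load, but the goal is the same: keep receivers off a single node.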

ReceiverTracker's startReceiver method is more involved: it starts the receiver on the planned executors by wrapping the Receiver in an RDD and submitting it as a Spark job constrained to the scheduledLocations. The source of ReceiverTracker's startReceiver method:

private def startReceiver(receiver: Receiver[_], scheduledLocations: Seq[TaskLocation]): Unit = {
  // ......
  val checkpointDirOption = Option(ssc.checkpointDir)
  val serializableHadoopConf =
    new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)

  // The function run on the worker to start the Receiver; it calls
  // ReceiverSupervisorImpl's start to launch the supervisor on that node
  val startReceiverFunc: Iterator[Receiver[_]] => Unit =
    (iterator: Iterator[Receiver[_]]) => {
      // when startReceiverFunc is invoked, the ReceiverSupervisor gets started
      if (TaskContext.get().attemptNumber() == 0) {
        val receiver = iterator.next()
        assert(iterator.hasNext == false)
        val supervisor = new ReceiverSupervisorImpl(
          receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
        supervisor.start()
        supervisor.awaitTermination()
      }
      // ......
    }
  // makeRDD on SparkContext creates an RDD holding a single element: the Receiver object
  val receiverRDD: RDD[Receiver[_]] =
    if (scheduledLocations.isEmpty) {
      ssc.sc.makeRDD(Seq(receiver), 1) // a single Receiver, no location preference
    } else {
      val preferredLocations = scheduledLocations.map(_.toString).distinct
      ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
    }
  receiverRDD.setName(s"Receiver $receiverId")
  ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
  ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))
  // The call to SparkContext.submitJob shows that Streaming launches one job per Receiver,
  // rather than having a Streaming action submit and run the Streaming job
  val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
    receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
  future.onComplete {
    case Success(_) =>
      if (!shouldStartReceiver) {
        onReceiverJobFinish(receiverId)
      } else {
        self.send(RestartReceiver(receiver))
      }
    case Failure(e) =>
      if (!shouldStartReceiver) {
        onReceiverJobFinish(receiverId)
      } else {
        self.send(RestartReceiver(receiver))
      }
  }(submitJobThreadPool)
}

ReceiverSupervisorImpl extends ReceiverSupervisor, and ReceiverSupervisor's start calls the startReceiver method. startReceiver first calls onReceiverStart to register the Receiver with the ReceiverTracker. If registration succeeds, the Receiver's onStart is called to launch the Receiver on the Executor and receive data continuously, handing the received data over to the BlockManager. At this point the Receiver startup is complete.

The Receiver's manager is the ReceiverSupervisor. The receivers are iterated over, and a ReceiverSupervisorImpl (the concrete ReceiverSupervisor) is created for each one, i.e. one ReceiverSupervisor starts and manages exactly one Receiver. ReceiverSupervisorImpl's start is the start method inherited from ReceiverSupervisor, which boots the supervisor itself; its source:

def start() {
  onStart()
  startReceiver()
}
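Here onStart (overridden in ReceiverSupervisorImpl) starts the supervisor's BlockGenerator, which batches incoming records into Blocks; startReceiver then performs the register-then-start sequence described above. Its body looks approximately like this (details vary by Spark version):

```scala
// ReceiverSupervisor.startReceiver, approximately: register with the
// driver-side ReceiverTracker first, and only then start the Receiver.
def startReceiver(): Unit = synchronized {
  try {
    if (onReceiverStart()) {        // register with ReceiverTracker via RPC
      logInfo("Starting receiver")
      receiverState = Started
      receiver.onStart()            // begin receiving on the Executor
      logInfo("Called receiver onStart")
    } else {
      // The driver refused the registration
      stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
    }
  } catch {
    case NonFatal(t) =>
      stop("Error starting receiver " + streamId, Some(t))
  }
}
```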

Returning to ReceiverTracker's startReceiver method: when the Receiver's job completes, regardless of success or failure, as long as the ReceiverTracker has not been stopped, a RestartReceiver message is sent to the ReceiverTrackerEndpoint to restart the Receiver.

Summary: on the Driver side, the ReceiverTracker manages the Receiver tasks on all Executors. It holds a ReceiverTrackerEndpoint message endpoint; in the startReceiver method this endpoint submits the Receiver job to run on a specific Executor, and it receives the messages sent back from the Executor side (such as the Receiver registration result). On the Executor side, a ReceiverSupervisor is dedicated to managing the Receiver, responsible for registering and starting the Receiver and for exchanging information with the ReceiverTracker.
