Lesson 10: Spark Streaming Source Code Explained: A Thorough Study of the Full Lifecycle of Continuous Stream Data Reception




Author: big data R&D engineer 谢彪


1. The ReceiverSupervisor handles the data that the Receiver receives: the Receiver continuously receives data and hands it to the ReceiverSupervisor for processing. After receiving the data, the ReceiverSupervisor stores it and then reports the block metadata to the ReceiverTracker; strictly speaking, the report goes to an RPC endpoint inside the ReceiverTracker.

2"BlockGenerator"是存储数据,它继承于“rateLimiter, rateLimiter是限制存储数据,在Spark Streaming接收数据过程中我们无法控制“接收数据”的速度,但可以控制“rateLimiter”来存储数据速度,从而控制"存储数据"的速度,BlockGenerator的产生是由ReceiverSupervisor其里面有一个类ReceiverSupervisorImplCreatBlockGenerator产生的,代码:

 val newBlockGenerator = new BlockGenerator(blockGeneratorListener, streamId, env.conf)

 

The point is plain: one BlockGenerator serves exactly one InputDStream.
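
Under the hood, Spark's RateLimiter wraps Guava's RateLimiter: before a record is stored, a permit must be acquired, which is what its waitToPush() does. Below is a minimal standalone sketch of that throttling idea; the object name, storeRecord helper, and the 100 records/second limit are illustrative, not Spark internals.

import com.google.common.util.concurrent.RateLimiter

// Standalone sketch of the throttling that BlockGenerator inherits: storing a
// record first acquires a permit, so the store rate never exceeds the limit.
object RateLimitSketch {
  def main(args: Array[String]): Unit = {
    val maxRecordsPerSecond = 100.0              // illustrative limit
    val limiter = RateLimiter.create(maxRecordsPerSecond)

    def storeRecord(record: String): Unit = {
      limiter.acquire()                          // blocks until a permit is available
      // ... append the record to the current block buffer here ...
    }

    (1 to 300).foreach(i => storeRecord(s"record-$i"))
    println("stored 300 records at no more than 100/s")
  }
}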

 

 

When receiving data there must be a loop that keeps pulling data in; once data is received there must be a store for it, and after storing, a report is sent to the Driver. Receiving data and storing data are naturally split into two separate modules.

The overall flow of receiving data in Spark Streaming resembles the MVC pattern: M is the Receiver, V is the Driver, and C is the ReceiverSupervisor.

ReceiverSupervisorImpl is the receiver's supervisor and also handles writing the receiver's data. The function below takes an Iterator; in fact it contains exactly one Receiver:

      val startReceiverFunc: Iterator[Receiver[_]] => Unit =
        (iterator: Iterator[Receiver[_]]) => {
          if (!iterator.hasNext) {
            throw new SparkException(
              "Could not start receiver as object not found.")
          }
          // The attempt number must be 0: a failed Receiver is never retried as a
          // task; it can only be rescheduled and started as a new job.
          if (TaskContext.get().attemptNumber() == 0) {
            // Obtain the receiver. Which receiver it is depends on the input
            // source's InputDStream; for SocketInputDStream it is a
            // SocketReceiver. At this point the receiver is just a reference
            // that has not been started; it is passed into
            // ReceiverSupervisorImpl as a constructor argument.
            val receiver = iterator.next()
            assert(!iterator.hasNext)
            val supervisor = new ReceiverSupervisorImpl(
              receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
            supervisor.start()
            supervisor.awaitTermination()
          } else {
            // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
          }
        }

 

      // Create the RDD using the scheduledLocations to run the receiver in a Spark job
      // Wrap the receiver into a single-element RDD
      val receiverRDD: RDD[Receiver[_]] =
        if (scheduledLocations.isEmpty) {
          ssc.sc.makeRDD(Seq(receiver), 1)
        } else {
          val preferredLocations = scheduledLocations.map(_.toString).distinct
          ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
        }
      receiverRDD.setName(s"Receiver $receiverId")
      ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
      ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))

To start the Receiver, a dedicated Spark job is launched: every Receiver's startup has its own job, and Receivers are started one by one. Starting all Receivers as different tasks of a single job has several weaknesses:

1. One Receiver failing to start could bring down the whole application.

2. Task skew: running all Receivers as tasks of one job uses Spark Core's normal scheduling, and in the unlucky case all Receivers land on a single node. Since a Receiver continuously consumes resources to receive data, that node's load becomes very heavy.

Running each Receiver as its own job balances the load as much as possible. Such a job can still fail, but the failed job is not retried; instead a new job is scheduled and submitted to run the Receiver, and it will not be launched on the executor where it previously ran. As long as the Spark Streaming application is not stopped, a faulty Receiver is rescheduled and restarted indefinitely, guaranteeing that the Receiver eventually starts. One more important point: when a Receiver is restarted, it is launched on a new thread from a thread pool, as the code below shows.

      val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
        receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
      // We will keep restarting the receiver job until ReceiverTracker is stopped
      future.onComplete {
        case Success(_) =>
          if (!shouldStartReceiver) {
            onReceiverJobFinish(receiverId)
          } else {
            logInfo(s"Restarting Receiver $receiverId")
            self.send(RestartReceiver(receiver))
          }
        case Failure(e) =>
          if (!shouldStartReceiver) {
            onReceiverJobFinish(receiverId)
          } else {
            logError("Receiver has been stopped. Try to restart it.", e)
            logInfo(s"Restarting Receiver $receiverId")
            self.send(RestartReceiver(receiver))
          }
      }(submitJobThreadPool)
      logInfo(s"Receiver ${receiver.streamId} started")
    }
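
For readers unfamiliar with submitJob, the call above is SparkContext's generic asynchronous job-submission API: unlike collect(), it returns a future immediately, leaving the driver thread free. Here is a toy use of it; the summing job, object name, and local-mode settings are hypothetical, only the API shape mirrors the receiver code.

import scala.concurrent.Await
import scala.concurrent.duration._
import org.apache.spark.{SparkConf, SparkContext}

// Toy use of SparkContext.submitJob: run a function over one chosen
// partition and get a future back instead of blocking the driver.
object SubmitJobDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("submitJob-demo"))
    val rdd = sc.makeRDD(1 to 100, numSlices = 4)

    val future = sc.submitJob[Int, Int, Unit](
      rdd,
      (it: Iterator[Int]) => it.sum,               // runs on the executor
      Seq(0),                                      // only partition 0, like Seq(0) above
      (index, sum) => println(s"partition $index sum = $sum"),
      ())                                          // overall result: Unit

    Await.result(future, 1.minute)                 // the tracker uses onComplete instead
    sc.stop()
  }
}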

 

ReceiverSupervisorImpl is responsible for processing the data the Receiver receives and, after processing, reporting to the ReceiverTracker. For this it holds an endpoint reference for communicating with the ReceiverTracker. The following reference is used to send messages to the ReceiverTracker:

private val trackerEndpoint = RpcUtils.makeDriverRef("ReceiverTracker", env.conf, env.rpcEnv)

The endpoint below receives messages sent by the ReceiverTracker: CleanupOldBlocks removes the blocks of each batch that has finished running, and UpdateRateLimit adjusts the rate limit at any time (what is limited is really the speed of storing data, not of receiving it).

/** RpcEndpointRef for receiving messages from the ReceiverTracker in the driver */
private val endpoint = env.rpcEnv.setupEndpoint(
  "Receiver-" + streamId + "-" + System.currentTimeMillis(), new ThreadSafeRpcEndpoint {
    override val rpcEnv: RpcEnv = env.rpcEnv

    override def receive: PartialFunction[Any, Unit] = {
      case StopReceiver =>
        logInfo("Received stop signal")
        ReceiverSupervisorImpl.this.stop("Stopped by driver", None)
      case CleanupOldBlocks(threshTime) =>
        logDebug("Received delete old batch signal")
        cleanupOldBlocks(threshTime)
      case UpdateRateLimit(eps) =>
        logInfo(s"Received a new rate limit: $eps.")
        registeredBlockGenerators.foreach { bg =>
          bg.updateRate(eps)
        }
    }
  })
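
On the user side, the static ceiling for this limit comes from configuration, and with backpressure enabled the driver recomputes an effective rate each batch and pushes it down through exactly the UpdateRateLimit message above. A typical setup looks like this; the object name and values are illustrative, the two configuration keys are standard Spark Streaming settings.

import org.apache.spark.SparkConf

// User-side knobs behind UpdateRateLimit: maxRate caps each receiver
// statically; with backpressure on, the driver adjusts the rate per batch.
object RateConfigExample {
  val conf = new SparkConf()
    .setAppName("RateLimitedStreaming")
    .set("spark.streaming.receiver.maxRate", "1000")     // records/second per receiver
    .set("spark.streaming.backpressure.enabled", "true") // dynamic rate updates
}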

Below is the start method of ReceiverSupervisor:

def start() {
  onStart()
  startReceiver()
}

What onStart launches is the BlockGenerator, which assembles the records received one at a time into blocks for storage. One BlockGenerator serves exactly one Receiver, which is why the BlockGenerator must be started before the Receiver.

override protected def onStart() {
  registeredBlockGenerators.foreach { _.start() }
}

There is a timer inside BlockGenerator. At a fixed interval (200ms by default, set by spark.streaming.blockInterval and unrelated to the configured batch duration) it runs the method below, which takes the records that have been appended one by one to the current buffer and merges that buffer into a block. Besides the timer, a separate thread continuously hands the generated blocks to the BlockManager for storage.

/** Change the buffer to which single records are added to. */
private def updateCurrentBuffer(time: Long): Unit = {
  try {
    var newBlock: Block = null
    synchronized {
      if (currentBuffer.nonEmpty) {
        val newBlockBuffer = currentBuffer
        currentBuffer = new ArrayBuffer[Any]
        val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
        listener.onGenerateBlock(blockId)
        newBlock = new Block(blockId, newBlockBuffer)
      }
    }

    if (newBlock != null) {
      blocksForPushing.put(newBlock)  // put is blocking when queue is full
    }
  } catch {
    case ie: InterruptedException =>
      logInfo("Block updating timer thread was interrupted")
    case e: Exception =>
      reportError("Error in block updating thread", e)
  }
}
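
The other half of the pair, the pushing thread, is not shown in this post. Below is a standalone sketch of its loop; the names and the main driver are illustrative, simplified from the idea of BlockGenerator's block-pushing thread rather than copied from it: a dedicated thread drains the bounded queue that updateCurrentBuffer fills and hands each block on, the way the real listener eventually stores blocks through the BlockManager.

import java.util.concurrent.{ArrayBlockingQueue, TimeUnit}
import scala.collection.mutable.ArrayBuffer

object BlockPusherSketch {
  final case class Block(id: Long, records: ArrayBuffer[Any])

  val blocksForPushing = new ArrayBlockingQueue[Block](10)
  @volatile var generating = true

  def pushBlock(b: Block): Unit =
    println(s"storing block ${b.id} with ${b.records.size} records")

  def keepPushingBlocks(): Unit = {
    // Poll with a short timeout so the thread notices shutdown promptly.
    while (generating) {
      Option(blocksForPushing.poll(10, TimeUnit.MILLISECONDS)).foreach(pushBlock)
    }
    // Generation stopped: drain whatever blocks are still queued.
    while (!blocksForPushing.isEmpty) pushBlock(blocksForPushing.take())
  }

  def main(args: Array[String]): Unit = {
    val pusher = new Thread(() => keepPushingBlocks(), "block-pusher")
    pusher.start()
    (1 to 3).foreach { i =>
      blocksForPushing.put(Block(i, ArrayBuffer.fill(5)(i)))  // producer side
      Thread.sleep(200)                                       // one block per interval
    }
    generating = false
    pusher.join()
  }
}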

Next, the startReceiver method:

/** Start receiver */
def startReceiver(): Unit = synchronized {
  try {
    if (onReceiverStart()) {
      logInfo("Starting receiver")
      receiverState = Started
      receiver.onStart()
      logInfo("Called receiver onStart")
    } else {
      // The driver refused us
      stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
    }
  } catch {
    case NonFatal(t) =>
      stop("Error starting receiver " + streamId, Some(t))
  }
}

Before starting the Receiver, the supervisor asks the ReceiverTracker whether the Receiver may start; only if the answer is true does it start. On receiving this message, the ReceiverTracker registers the Receiver's information.

override protected def onReceiverStart(): Boolean = {
  val msg = RegisterReceiver(
    streamId, receiver.getClass.getSimpleName, host, executorId, endpoint)
  trackerEndpoint.askWithRetry[Boolean](msg)
}

Starting the Receiver amounts to nothing more than calling receiver.onStart(); from then on the Receiver runs on the worker node.

SocketReceiver为例我看看它的onStart方法

private[streaming]
class SocketReceiver[T: ClassTag](
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends Receiver[T](storageLevel) with Logging {

  def onStart() {
    // Start the thread that receives data over a connection
    new Thread("Socket Receiver") {
      setDaemon(true)
      override def run() { receive() }
    }.start()
  }

  def onStop() {
    // There is nothing much to do as the thread calling receive()
    // is designed to stop by itself if isStopped() returns false
  }

  /** Create a socket connection and receive data until receiver is stopped */
  def receive() {
    var socket: Socket = null
    try {
      logInfo("Connecting to " + host + ":" + port)
      socket = new Socket(host, port)
      logInfo("Connected to " + host + ":" + port)
      val iterator = bytesToObjects(socket.getInputStream())
      while (!isStopped && iterator.hasNext) {
        store(iterator.next)
      }
      if (!isStopped()) {
        restart("Socket data stream had no more data")
      } else {
        logInfo("Stopped receiving")
      }
    } catch {
      case e: java.net.ConnectException =>
        restart("Error connecting to " + host + ":" + port, e)
      case NonFatal(e) =>
        logWarning("Error receiving data", e)
        restart("Error receiving data", e)
    } finally {
      if (socket != null) {
        socket.close()
        logInfo("Closed socket to " + host + ":" + port)
      }
    }
  }
}
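
To close the loop, here is how user code reaches this receiver: socketTextStream creates a SocketInputDStream whose getReceiver() returns the SocketReceiver above, and ssc.start() kicks off the ReceiverTracker job submission walked through earlier. The host, port, and application name are illustrative.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal driver program: the receiver started by this app goes through the
// full lifecycle described in this post.
object SocketWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SocketWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}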

Author: big data R&D engineer 谢彪

  • Source: DT_大数据梦工厂 (the Spark release customization series)

  • DT_大数据梦工厂 WeChat official account: DT_Spark

  • Sina Weibo: http://www.weibo.com/ilovepains

  • 王家林 gives a free big-data hands-on session every evening at 20:00

YY live channel: 68917580



