Lesson 10: Spark Streaming Source Code Walkthrough — A Thorough Study of the Full Lifecycle of Continuous Data Reception


This lesson continues our study of the lifecycle of data reception in Spark Streaming. We first consider the architectural pattern behind data reception, and then study the source code on that basis.

Big data systems differ from other IT systems. In J2EE or general software development, architecture and design are the first consideration; for a big data application, performance comes first and architectural design second.

 

Spark Streaming (when using receivers) receives data continuously. The Receiver and the Driver generally do not run in the same process, so while receiving data the Receiver must constantly report to the Driver. Even when there is no Receiver (the Direct API approach), information about the data is still reported to the Driver. If received data were not reported, the Driver could not include it in its scheduling. What is reported is metadata: RDDs are continuously constructed from the received data (receiving nothing is still data — it produces an empty RDD).

Receiving data clearly requires a loop that keeps pulling records in, and a store that persists what was received, reporting to the Driver after each store. Receiving and storing naturally belong to two separate modules.

The overall data-reception flow in Spark Streaming resembles the MVC pattern: the Model is the Receiver, the View is the Driver, and the Controller is the ReceiverSupervisor.

 

We need to look at both the truth (the source code itself) and the intention behind it. The code below is the receiver-starting logic in ReceiverTracker's startReceiver method:

 

 

      // Function to start the receiver on the worker node.
      // If the receiver fails after starting, it is not retried in place; it can
      // only be rescheduled as a new job.
      // ReceiverSupervisorImpl is the receiver's supervisor and is also
      // responsible for writing the received data.
      // The function takes an Iterator, but in practice it contains exactly one Receiver.
      val startReceiverFunc: Iterator[Receiver[_]] => Unit =
        (iterator: Iterator[Receiver[_]]) => {
          // The iterator must contain an element
          if (!iterator.hasNext) {
            throw new SparkException(
              "Could not start receiver as object not found.")
          }
          // The attempt number must be 0: a failed Receiver is never retried as
          // the same task, only restarted via rescheduling
          if (TaskContext.get().attemptNumber() == 0) {
            // Get the receiver. It is determined by the input source's InputDStream;
            // for SocketInputDStream, for example, it is a SocketReceiver. At this
            // point the receiver is only a reference, not yet started; it is passed
            // into ReceiverSupervisorImpl as a constructor argument.
            val receiver = iterator.next()
            assert(iterator.hasNext == false)
            val supervisor = new ReceiverSupervisorImpl(
              receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
            supervisor.start()
            supervisor.awaitTermination()
          } else {
            // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
          }
        }

      // Create the RDD using the scheduledLocations to run the receiver in a Spark job
      // Wrap the receiver into an RDD
      val receiverRDD: RDD[Receiver[_]] =
        if (scheduledLocations.isEmpty) {
          ssc.sc.makeRDD(Seq(receiver), 1)
        } else {
          val preferredLocations = scheduledLocations.map(_.toString).distinct
          ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
        }
      receiverRDD.setName(s"Receiver $receiverId")
      ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
      ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))

      // A dedicated Spark job is submitted to start each Receiver, so receivers are
      // started one by one, each by its own job. Starting all receivers as different
      // tasks of a single job would have serious drawbacks:
      // 1. A Receiver failing to start could bring down the whole application.
      // 2. Task skew: under Spark Core's normal scheduling, in an unlucky case all
      //    receivers could land on one node. Receivers continuously receive data and
      //    consume a lot of resources, which would overload that node.
      // Running each Receiver as its own job gives the best possible load balancing.
      // A receiver job can still fail, but the job is not retried; instead a new job
      // is rescheduled to run the Receiver, and not on the executor where it
      // previously ran. As long as the Spark Streaming application is not stopped, a
      // failed Receiver will be rescheduled and restarted again and again, ensuring
      // that it eventually runs.
      // Also note that restarting a Receiver happens on a new thread from a thread pool.
      val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
        receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
      // We will keep restarting the receiver job until ReceiverTracker is stopped
      future.onComplete {
        case Success(_) =>
          if (!shouldStartReceiver) {
            onReceiverJobFinish(receiverId)
          } else {
            logInfo(s"Restarting Receiver $receiverId")
            self.send(RestartReceiver(receiver))
          }
        case Failure(e) =>
          if (!shouldStartReceiver) {
            onReceiverJobFinish(receiverId)
          } else {
            logError("Receiver has been stopped. Try to restart it.", e)
            logInfo(s"Restarting Receiver $receiverId")
            self.send(RestartReceiver(receiver))
          }
      }(submitJobThreadPool)
      logInfo(s"Receiver ${receiver.streamId} started")
    }

 

ReceiverSupervisorImpl processes the data the Receiver receives and reports to the ReceiverTracker afterwards, so it internally holds endpoints for communicating with the ReceiverTracker.

This reference is used to send messages to the ReceiverTracker:

private val trackerEndpoint = RpcUtils.makeDriverRef("ReceiverTracker", env.conf, env.rpcEnv)

This endpoint receives messages from the ReceiverTracker: CleanupOldBlocks removes the blocks of batches that have finished running, and UpdateRateLimit adjusts the rate limit at any time (the rate limit actually bounds the speed at which data is stored).

/** RpcEndpointRef for receiving messages from the ReceiverTracker in the driver */
private val endpoint = env.rpcEnv.setupEndpoint(
  "Receiver-" + streamId + "-" + System.currentTimeMillis(), new ThreadSafeRpcEndpoint {
    override val rpcEnv: RpcEnv = env.rpcEnv

    override def receive: PartialFunction[Any, Unit] = {
      case StopReceiver =>
        logInfo("Received stop signal")
        ReceiverSupervisorImpl.this.stop("Stopped by driver", None)
      case CleanupOldBlocks(threshTime) =>
        logDebug("Received delete old batch signal")
        cleanupOldBlocks(threshTime)
      case UpdateRateLimit(eps) =>
        logInfo(s"Received a new rate limit: $eps.")
        registeredBlockGenerators.foreach { bg =>
          bg.updateRate(eps)
        }
    }
  })
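The idea behind the UpdateRateLimit branch can be illustrated with a small, Spark-free sketch. All names below are hypothetical (this is not Spark's actual RateLimiter API): the generator keeps a mutable records-per-second bound that throttles how fast records are stored, and the driver can replace it at runtime.

```scala
// Illustrative sketch only, not Spark's RateLimiter: a mutable
// records-per-second bound that the driver can update at any time.
class SimpleRateLimit(initialRate: Long) {
  @volatile private var recordsPerSecond: Long = initialRate

  // Called when an UpdateRateLimit-style message arrives.
  def updateRate(newRate: Long): Unit =
    if (newRate > 0) recordsPerSecond = newRate

  def currentRate: Long = recordsPerSecond

  // Milliseconds a caller should spread n records over to respect the bound.
  def millisFor(n: Long): Long = n * 1000 / recordsPerSecond
}
```

In Spark itself the initial bound comes from the spark.streaming.receiver.maxRate setting, and with backpressure enabled (spark.streaming.backpressure.enabled) the driver keeps sending newly estimated rates down to each BlockGenerator exactly as in the endpoint code above.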

Below is the start method of ReceiverSupervisor:

def start() {
  onStart()
  startReceiver()
}

What onStart starts is the BlockGenerator, which assembles the records received one by one into blocks for storage. One BlockGenerator serves exactly one Receiver, so the BlockGenerator must be started before the Receiver itself.

override protected def onStart() {
  registeredBlockGenerators.foreach { _.start() }
}

BlockGenerator contains a timer. At a fixed interval (200 ms by default, set by spark.streaming.blockInterval and unrelated to the configured batch duration) it runs the method below: the records that have been appended one by one into a buffer are merged into a block. Besides the timer there is a separate thread that keeps handing the generated blocks to the BlockManager for storage.

/** Change the buffer to which single records are added to. */
private def updateCurrentBuffer(time: Long): Unit = {
  try {
    var newBlock: Block = null
    synchronized {
      if (currentBuffer.nonEmpty) {
        val newBlockBuffer = currentBuffer
        currentBuffer = new ArrayBuffer[Any]
        val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
        listener.onGenerateBlock(blockId)
        newBlock = new Block(blockId, newBlockBuffer)
      }
    }

    if (newBlock != null) {
      blocksForPushing.put(newBlock)  // put is blocking when queue is full
    }
  } catch {
    case ie: InterruptedException =>
      logInfo("Block updating timer thread was interrupted")
    case e: Exception =>
      reportError("Error in block updating thread", e)
  }
}
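The buffer swap in updateCurrentBuffer can be sketched without Spark as follows. Class and method names here are illustrative, not Spark's API: records accumulate in a buffer; each timer tick swaps the buffer out under the lock and hands the finished block to a bounded queue, which the block-pushing thread would drain into the BlockManager.

```scala
import scala.collection.mutable.ArrayBuffer
import java.util.concurrent.ArrayBlockingQueue

// Spark-free sketch of the batching pattern above (illustrative names).
class BlockBatcher(queueCapacity: Int) {
  private var currentBuffer = new ArrayBuffer[Any]
  val blocksForPushing = new ArrayBlockingQueue[Seq[Any]](queueCapacity)

  def addData(record: Any): Unit = synchronized {
    currentBuffer += record
  }

  // Plays the role of updateCurrentBuffer: swap inside the critical section,
  // enqueue outside it so the lock is held as briefly as possible.
  def tick(): Unit = {
    val newBlock: Seq[Any] = synchronized {
      if (currentBuffer.nonEmpty) {
        val full = currentBuffer
        currentBuffer = new ArrayBuffer[Any]
        full.toSeq
      } else Seq.empty
    }
    if (newBlock.nonEmpty) {
      blocksForPushing.put(newBlock) // blocks when the queue is full
    }
  }
}
```

Note the same design choice as in the real code: the new block is enqueued outside the synchronized block, so a full queue blocks only the timer thread, never the threads appending records.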

Now let's look at the startReceiver method:

/** Start receiver */
def startReceiver(): Unit = synchronized {
  try {
    if (onReceiverStart()) {
      logInfo("Starting receiver")
      receiverState = Started
      receiver.onStart()
      logInfo("Called receiver onStart")
    } else {
      // The driver refused us
      stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
    }
  } catch {
    case NonFatal(t) =>
      stop("Error starting receiver " + streamId, Some(t))
  }
}

Before starting the Receiver, the supervisor asks the ReceiverTracker whether the Receiver may be started; only if the answer is true is it started. On receiving the RegisterReceiver message, the ReceiverTracker registers the Receiver's information.

override protected def onReceiverStart(): Boolean = {
  val msg = RegisterReceiver(
    streamId, receiver.getClass.getSimpleName, host, executorId, endpoint)
  trackerEndpoint.askWithRetry[Boolean](msg)
}
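This registration handshake can be sketched without RPC (all names below are illustrative, not Spark's): the tracker refuses registration once it is shutting down, and the supervisor proceeds only on a true answer.

```scala
// Spark-free sketch of the RegisterReceiver handshake (illustrative names):
// the tracker accepts registrations only while active; the supervisor would
// start the receiver only when the answer is true.
class TrackerSketch {
  private var active = true
  private val registered = scala.collection.mutable.Set.empty[Int]

  // Stands in for the tracker's handling of the RegisterReceiver message.
  def registerReceiver(streamId: Int): Boolean = synchronized {
    if (active) { registered += streamId; true } else false
  }

  def stop(): Unit = synchronized { active = false }

  def isRegistered(streamId: Int): Boolean = synchronized {
    registered.contains(streamId)
  }
}
```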

Starting the Receiver is just a call to receiver.onStart(), after which the Receiver runs on the worker node.

Taking SocketReceiver as an example, let's look at its onStart method:

private[streaming]
class SocketReceiver[T: ClassTag](
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends Receiver[T](storageLevel) with Logging {

  def onStart() {
    // Start the thread that receives data over a connection
    new Thread("Socket Receiver") {
      setDaemon(true)
      override def run() { receive() }
    }.start()
  }

  def onStop() {
    // There is nothing much to do as the thread calling receive()
    // is designed to stop by itself if isStopped() returns false
  }

  /** Create a socket connection and receive data until receiver is stopped */
  def receive() {
    var socket: Socket = null
    try {
      logInfo("Connecting to " + host + ":" + port)
      socket = new Socket(host, port)
      logInfo("Connected to " + host + ":" + port)
      val iterator = bytesToObjects(socket.getInputStream())
      while(!isStopped && iterator.hasNext) {
        store(iterator.next)
      }
      if (!isStopped()) {
        restart("Socket data stream had no more data")
      } else {
        logInfo("Stopped receiving")
      }
    } catch {
      case e: java.net.ConnectException =>
        restart("Error connecting to " + host + ":" + port, e)
      case NonFatal(e) =>
        logWarning("Error receiving data", e)
        restart("Error receiving data", e)
    } finally {
      if (socket != null) {
        socket.close()
        logInfo("Closed socket to " + host + ":" + port)
      }
    }
  }
}
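The control flow above — spawn a daemon thread, then loop "while not stopped, pull a record and store it" — can be mimicked without a socket or Spark. Names below are illustrative; an in-memory iterator replaces the socket and a queue stands in for Receiver.store().

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicBoolean

// Spark-free sketch of the SocketReceiver control flow (illustrative names):
// onStart spawns a daemon thread running the receive loop, which keeps
// pulling records until the source dries up or the receiver is stopped.
class LoopingReceiver[T](source: Iterator[T]) {
  private val stopped = new AtomicBoolean(false)
  val stored = new ConcurrentLinkedQueue[T]() // stands in for Receiver.store()

  def onStart(): Thread = {
    val t = new Thread("Sketch Receiver") {
      setDaemon(true)
      override def run(): Unit = receive()
    }
    t.start()
    t
  }

  def onStop(): Unit = stopped.set(true)

  private def receive(): Unit = {
    while (!stopped.get() && source.hasNext) {
      stored.add(source.next())
    }
    // A real receiver would call restart(...) here if the loop exited
    // unexpectedly, triggering the rescheduling path described earlier.
  }
}
```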

 

 

 
