The previous article discussed how the Receiver of a ReceiverInputDStream is shipped to an Executor and run there. The key piece of code is startReceiverFunc, which creates a ReceiverSupervisorImpl and calls its start() method:
val startReceiverFunc: Iterator[Receiver[_]] => Unit =
  (iterator: Iterator[Receiver[_]]) => {
    if (!iterator.hasNext) {
      throw new SparkException(
        "Could not start receiver as object not found.")
    }
    // Get the currently active TaskContext. attemptNumber is how many times this task
    // has been attempted: the first attempt gets attemptNumber = 0, and subsequent
    // attempts get increasing attempt numbers.
    if (TaskContext.get().attemptNumber() == 0) {
      val receiver = iterator.next()
      // On the first attempt the iterator holds exactly one element, so after the
      // next() above it must be empty; otherwise this assertion fails
      assert(iterator.hasNext == false)
      // ReceiverSupervisorImpl supervises the Receiver and handles writing its data:
      // each block interval, the data the receiver has collected is written into a
      // BlockGenerator, and the compute method of the ReceiverInputDStream later reads
      // each batch's blocks back out and exposes them as an RDD
      val supervisor = new ReceiverSupervisorImpl(
        receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
      // start() is implemented in the parent class ReceiverSupervisor; it calls
      // onStart(), which ReceiverSupervisorImpl overrides, and then startReceiver():
      /* def start() {
           onStart()
           startReceiver()
         } */
      supervisor.start()
      supervisor.awaitTermination()
    } else {
      // attemptNumber > 0 means the task was restarted by the TaskScheduler; the
      // receiver will be rescheduled as a new task, so just exit here.
      // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
    }
  }
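For context, the ReceiverTracker wraps each receiver in a one-partition RDD and submits startReceiverFunc with SparkContext.submitJob, which is what physically lands the receiver on an executor as a long-running task. Below is a minimal, runnable sketch of that same pattern; the object name and the string payload standing in for the receiver are made up for illustration:

import org.apache.spark.{SparkConf, SparkContext, TaskContext}
import scala.concurrent.Await
import scala.concurrent.duration.Duration

object SubmitJobSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("submit-job-sketch").setMaster("local[2]"))
    // A one-partition RDD whose single element plays the role of the receiver
    val payloadRDD = sc.makeRDD(Seq("pretend-receiver"), 1)
    // Runs once on the executor, mirroring startReceiverFunc's attemptNumber guard
    val runOnExecutor: Iterator[String] => Unit = { iter =>
      if (TaskContext.get().attemptNumber() == 0 && iter.hasNext) {
        println(s"started long-running work for ${iter.next()}")
      }
    }
    // Like ReceiverTracker, submit the function as a job over partition 0;
    // the returned future completes only when the "receiver" task exits
    val future = sc.submitJob[String, Unit, Unit](
      payloadRDD, runOnExecutor, Seq(0), (_, _) => (), ())
    Await.ready(future, Duration.Inf)
    sc.stop()
  }
}

In the real ReceiverTracker, completion of this future is what tells the driver the receiver has exited, so it can decide whether to reschedule it.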
1. supervisor.start() is the entry point that starts the Receiver and begins receiving data on the Executor.
start() is implemented in ReceiverSupervisor, the parent class of ReceiverSupervisorImpl:
/** Start the supervisor */
def start() {
  onStart()
  // startReceiver() ultimately calls onStart() of the concrete Receiver implementation,
  // e.g. the onStart() of the SocketReceiver created by SocketInputDStream
  startReceiver()
}
2. Now step into onStart(). Its scaladoc is explicit: this method must be called before receiver.onStart(), so that the BlockGenerator is up and ready before the receiver starts sending data into it.
/**
 * Called when supervisor is started.
 * Note that this must be called before the receiver.onStart() is called to ensure
 * things like [[BlockGenerator]]s are started before the receiver starts sending data.
 */
override protected def onStart() {
  registeredBlockGenerators.foreach { _.start() }
}
a. registeredBlockGenerators is an ArrayBuffer[BlockGenerator]. When the ReceiverSupervisorImpl is constructed, it creates a BlockGenerator and appends it to this buffer, so the _.start() above is BlockGenerator.start().
/**
 * Concrete implementation of [[org.apache.spark.streaming.receiver.ReceiverSupervisor]]
 * which provides all the necessary functionality for handling the data received by
 * the receiver. Specifically, it creates a [[org.apache.spark.streaming.receiver.BlockGenerator]]
 * object that is used to divide the received data stream into blocks of data.
 *
 * a. This ReceiverSupervisor implementation provides everything needed to handle the
 *    data the receiver brings in.
 * b. During construction it calls createBlockGenerator to create a BlockGenerator.
 * c. That BlockGenerator is what chops the receiver's data stream into blocks.
 *
 * ReceiverSupervisorImpl supervises the Receiver and writes its data: each block
 * interval the receiver's data goes into the BlockGenerator, and the compute method of
 * the ReceiverInputDStream later reads each batch's blocks back out as an RDD.
 */
private[streaming] class ReceiverSupervisorImpl(
    receiver: Receiver[_],
    env: SparkEnv,
    hadoopConf: Configuration,
    checkpointDirOption: Option[String]
  ) extends ReceiverSupervisor(receiver, env.conf) with Logging {
  ....
/** Unique block ids if one wants to add blocks directly */
private val newBlockId = new AtomicLong(System.currentTimeMillis())

// Filled by the createBlockGenerator method below: every BlockGenerator created for
// this supervisor is appended to this ArrayBuffer[BlockGenerator]
private val registeredBlockGenerators = new mutable.ArrayBuffer[BlockGenerator]
  with mutable.SynchronizedBuffer[BlockGenerator]

/** Divides received data records into data blocks for pushing in BlockManager.
 * A BlockGeneratorListener that hands finished blocks over to the BlockManager.
 */
private val defaultBlockGeneratorListener = new BlockGeneratorListener {
  def onAddData(data: Any, metadata: Any): Unit = { }

  def onGenerateBlock(blockId: StreamBlockId): Unit = {
    print("onGenerateBlock no impl.....")
  }

  def onError(message: String, throwable: Throwable) {
    reportError(message, throwable)
  }

  def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]) {
    // Push the ArrayBuffer the receiver filled via store() into memory
    pushArrayBuffer(arrayBuffer, None, Some(blockId))
  }
}

// Created once per supervisor via createBlockGenerator, which both registers the new
// BlockGenerator in registeredBlockGenerators and returns it; it is what chops the
// Receiver's incoming data into blocks
private val defaultBlockGenerator = createBlockGenerator(defaultBlockGeneratorListener)
b. Now look at createBlockGenerator(). It creates a BlockGenerator and injects the given BlockGeneratorListener into it:
override def createBlockGenerator(
    blockGeneratorListener: BlockGeneratorListener): BlockGenerator = {
  // Cleanup BlockGenerators that have already been stopped
  // (registeredBlockGenerators is the ArrayBuffer[BlockGenerator] shown above)
  registeredBlockGenerators --= registeredBlockGenerators.filter{ _.isStopped() }
  // Create a new BlockGenerator instance, register it, and return it
  val newBlockGenerator = new BlockGenerator(blockGeneratorListener, streamId, env.conf)
  registeredBlockGenerators += newBlockGenerator
  newBlockGenerator
}
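createBlockGenerator is also the hook a receiver can use to register its own listener, for example to track source offsets per block, as the reliable Kafka receiver in Spark 1.x did. A hedged sketch of such a custom listener follows; the bookkeeping in the comments is hypothetical, and note that BlockGeneratorListener is private[streaming], so code like this only compiles inside that package:

import org.apache.spark.storage.StreamBlockId
import org.apache.spark.streaming.receiver.BlockGeneratorListener
import scala.collection.mutable.ArrayBuffer

// Hypothetical listener a custom receiver might pass to supervisor.createBlockGenerator
val offsetTrackingListener = new BlockGeneratorListener {
  override def onAddData(data: Any, metadata: Any): Unit = {
    // called for every addData(record, metadata); could remember (record, offset) pairs
  }
  override def onGenerateBlock(blockId: StreamBlockId): Unit = {
    // called when a block is cut; could snapshot the offsets belonging to blockId
  }
  override def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]): Unit = {
    // called from the pushing thread; store the block, then commit the offset snapshot
  }
  override def onError(message: String, throwable: Throwable): Unit = { }
}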
3. Now look at BlockGenerator.start():
/**
 * Generates batches of objects received by a
 * [[org.apache.spark.streaming.receiver.Receiver]] and puts them into appropriately
 * named blocks at regular intervals. This class starts two threads,
 * one to periodically start a new batch and prepare the previous batch of as a block,
 * the other to push the blocks into the block manager.
 *
 * In short: it batches the data a Receiver brings in and turns each batch into a named
 * block. Two threads do the work:
 * a. a timer thread periodically starts a new batch and turns the previous batch into
 *    a block -- the updateCurrentBuffer method driven by the RecurringTimer below;
 * b. a pushing thread moves finished blocks into the BlockManager -- the
 *    blockPushingThread below.
 *
 * Note: Do not create BlockGenerator instances directly inside receivers. Use
 * `ReceiverSupervisor.createBlockGenerator` to create a BlockGenerator and use it.
 */
private[streaming] class BlockGenerator(
    listener: BlockGeneratorListener,
    receiverId: Int,
    conf: SparkConf,
    clock: Clock = new SystemClock()
  ) extends RateLimiter(conf) with Logging {

  // id is StreamBlockId(receiverId, next-trigger-time - blockIntervalMs);
  // buffer holds the records the receiver handed over from the source via store()
  private case class Block(id: StreamBlockId, buffer: ArrayBuffer[Any])

  /**
   * The BlockGenerator can be in 5 possible states, in the order as follows.
   * - Initialized: Nothing has been started.
   * - Active: start() has been called, and it is generating blocks on added data.
   * - StoppedAddingData: stop() has been called, the adding of data has been stopped,
   *   but blocks are still being generated and pushed.
   * - StoppedGeneratingBlocks: Generating of blocks has been stopped, but
   *   they are still being pushed.
   * - StoppedAll: Everything has stopped, and the BlockGenerator object can be GCed.
   */
  private object GeneratorState extends Enumeration {
    type GeneratorState = Value
    val Initialized, Active, StoppedAddingData, StoppedGeneratingBlocks, StoppedAll = Value
  }
  import GeneratorState._

  // Interval at which received data is chopped into blocks before being stored in
  // Spark; going below 50ms is not recommended
  private val blockIntervalMs = conf.getTimeAsMs("spark.streaming.blockInterval", "200ms")
  require(blockIntervalMs > 0, s"'spark.streaming.blockInterval' should be a positive value")

  // Calls updateCurrentBuffer once every blockIntervalMs (200ms by default)
  private val blockIntervalTimer =
    new RecurringTimer(clock, blockIntervalMs, updateCurrentBuffer, "BlockGenerator")
  private val blockQueueSize = conf.getInt("spark.streaming.blockQueueSize", 10)
  private val blocksForPushing = new ArrayBlockingQueue[Block](blockQueueSize)
  private val blockPushingThread = new Thread() { override def run() { keepPushingBlocks() } }

  @volatile private var currentBuffer = new ArrayBuffer[Any]
  @volatile private var state = Initialized

  /** Start block generating and pushing threads. */
  def start(): Unit = synchronized {
    if (state == Initialized) {
      state = Active
      blockIntervalTimer.start()
      blockPushingThread.start()
      logInfo("Started BlockGenerator")
    } else {
      throw new SparkException(
        s"Cannot start BlockGenerator as its not in the Initialized state [state = $state]")
    }
  }
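To make the two-thread design concrete, here is a self-contained sketch of the same pattern. It is not Spark code, and the class name is made up: a timer swaps out the current buffer on a fixed interval (the analogue of updateCurrentBuffer), while a pusher thread drains the bounded queue (the analogue of keepPushingBlocks):

import java.util.concurrent.{ArrayBlockingQueue, Executors, TimeUnit}
import scala.collection.mutable.ArrayBuffer

class MiniBlockGenerator(intervalMs: Long, push: ArrayBuffer[Any] => Unit) {
  @volatile private var currentBuffer = new ArrayBuffer[Any]
  @volatile private var generating = true
  private val queue = new ArrayBlockingQueue[ArrayBuffer[Any]](10)
  private val timer = Executors.newSingleThreadScheduledExecutor()

  // Pusher thread: drain the queue and hand each block to `push`
  private val pusher = new Thread(new Runnable {
    override def run(): Unit = {
      while (generating || !queue.isEmpty) {
        Option(queue.poll(10, TimeUnit.MILLISECONDS)).foreach(push)
      }
    }
  })

  def addData(record: Any): Unit = synchronized { currentBuffer += record }

  def start(): Unit = {
    // Timer thread: every intervalMs, swap the buffer and enqueue the old one
    timer.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = {
        val block = MiniBlockGenerator.this.synchronized {
          val old = currentBuffer
          currentBuffer = new ArrayBuffer[Any]
          old
        }
        if (block.nonEmpty) queue.put(block) // blocks while the queue is full
      }
    }, intervalMs, intervalMs, TimeUnit.MILLISECONDS)
    pusher.start()
  }

  def stop(): Unit = {
    timer.shutdown()   // stop cutting new blocks
    generating = false // let the pusher drain whatever is left, then exit
    pusher.join()
  }
}

A new MiniBlockGenerator(200, buf => println(s"block of ${buf.size} records")) would mirror the default 200ms spark.streaming.blockInterval.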
a. First, the blockIntervalTimer: a recurring timer thread that keeps cutting new blocks.
// Interval at which received data is chopped into blocks before being stored in Spark;
// going below 50ms is not recommended
private val blockIntervalMs = conf.getTimeAsMs("spark.streaming.blockInterval", "200ms")
require(blockIntervalMs > 0, s"'spark.streaming.blockInterval' should be a positive value")

// Calls updateCurrentBuffer once every blockIntervalMs (200ms by default)
private val blockIntervalTimer =
  new RecurringTimer(clock, blockIntervalMs, updateCurrentBuffer, "BlockGenerator")
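Both settings read here can be tuned on the application's SparkConf; the values below are only illustrative:

import org.apache.spark.SparkConf

// Illustrative tuning of the two settings BlockGenerator reads
val conf = new SparkConf()
  .set("spark.streaming.blockInterval", "100ms") // cut a block every 100ms
  .set("spark.streaming.blockQueueSize", "20")   // capacity of blocksForPushing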
b. updateCurrentBuffer(): if the receiver has stored any data, a Block is generated every 200ms and put into the ArrayBlockingQueue; whenever that queue has elements, they are continuously drained and stored into the BlockManager.
/** Change the buffer to which single records are added to. */
private def updateCurrentBuffer(time: Long): Unit = {
  try {
    var newBlock: Block = null
    // currentBuffer is the @volatile ArrayBuffer[Any] initialized above. It only gets
    // elements when the concrete Receiver calls store(), which ends up calling this
    // class's addData method
    synchronized {
      if (currentBuffer.nonEmpty) {
        // Hand the filled buffer off to a local val and install a fresh ArrayBuffer
        val newBlockBuffer = currentBuffer
        currentBuffer = new ArrayBuffer[Any]
        // The streaming batch interval should normally not go below 500ms, and
        // blockIntervalMs (default 200ms) should not go below 50ms
        val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
        // For the default BlockGeneratorListener passed in by ReceiverSupervisorImpl,
        // onGenerateBlock is effectively a no-op
        listener.onGenerateBlock(blockId)
        newBlock = new Block(blockId, newBlockBuffer)
      }
    }
    if (newBlock != null) {
      // Enqueue the Block into blocksForPushing, the ArrayBlockingQueue[Block]
      blocksForPushing.put(newBlock) // put is blocking when queue is full
    }
  } catch {
    case ie: InterruptedException =>
      logInfo("Block updating timer thread was interrupted")
    case e: Exception =>
      reportError("Error in block updating thread", e)
  }
}
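As the comment above notes, currentBuffer only fills up when the receiver calls store(). A minimal custom receiver sketch (the class name is hypothetical) shows where that call lives; this uses the public Receiver API:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Each store(record) below lands in BlockGenerator.addData, which appends the record
// to currentBuffer until the next block is cut
class CountingReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {
  override def onStart(): Unit = {
    new Thread(new Runnable {
      override def run(): Unit = {
        var i = 0L
        while (!isStopped()) {
          store(s"record-$i") // appended to the BlockGenerator's currentBuffer
          i += 1
          Thread.sleep(10)
        }
      }
    }).start()
  }
  override def onStop(): Unit = { } // the isStopped() flag ends the thread above
}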
c. Next, the blockPushingThread: it keeps watching the ArrayBlockingQueue[Block] for incoming blocks and pushes them into the BlockManager for storage.
private val blockPushingThread = new Thread() {
  override def run() { keepPushingBlocks() }
}
/** Keep pushing blocks to the BlockManager. */
private def keepPushingBlocks() {
  logInfo("Started block pushing thread")

  def areBlocksBeingGenerated: Boolean = synchronized {
    state != StoppedGeneratingBlocks
  }

  try {
    // While blocks are being generated, keep polling for to-be-pushed blocks and push them.
    // As long as block generation has not been stopped, keep taking blocks out of the
    // blocksForPushing queue. The queue only yields elements once updateCurrentBuffer
    // has found data in currentBuffer; only then does pushBlock() actually run.
    while (areBlocksBeingGenerated) {
      // Poll the head of the queue with a 10ms timeout; returns null on timeout
      Option(blocksForPushing.poll(10, TimeUnit.MILLISECONDS)) match {
        // pushBlock hands the data the receiver store()d over to the BlockManager's
        // DiskStore or MemoryStore
        case Some(block) => pushBlock(block)
        case None =>
      }
    }

    // At this point, state is StoppedGeneratingBlock. So drain the queue of to-be-pushed blocks.
    logInfo("Pushing out the last " + blocksForPushing.size() + " blocks")
    while (!blocksForPushing.isEmpty) {
      val block = blocksForPushing.take()
      logDebug(s"Pushing block $block")
      // Each remaining block is handed to the defaultBlockGeneratorListener that was
      // created at supervisor initialization
      pushBlock(block)
      logInfo("Blocks left to push " + blocksForPushing.size())
    }
    logInfo("Stopped block pushing thread")
  } catch {
    case ie: InterruptedException =>
      logInfo("Block pushing thread was interrupted")
    case e: Exception =>
      reportError("Error in block pushing thread", e)
  }
}
4. Now look at pushBlock(block). It hands the generated block to the defaultBlockGeneratorListener created by ReceiverSupervisorImpl:
// pushBlock hands the data the receiver store()d over to the BlockManager's
// DiskStore or MemoryStore
private def pushBlock(block: Block) {
  // block.id is StreamBlockId(receiverId, time - blockIntervalMs); block.buffer holds
  // the records the receiver store()d. The listener here is the
  // defaultBlockGeneratorListener created when ReceiverSupervisorImpl was initialized.
  listener.onPushBlock(block.id, block.buffer)
  logInfo("Pushed block " + block.id)
}
a. Let's see how the BlockGeneratorListener's onPushBlock() gets the Block data into the BlockManager:
/** Divides received data records into data blocks for pushing in BlockManager. */
private val defaultBlockGeneratorListener = new BlockGeneratorListener {
  def onAddData(data: Any, metadata: Any): Unit = { }

  def onGenerateBlock(blockId: StreamBlockId): Unit = {
    print("onGenerateBlock no impl.....")
  }

  def onError(message: String, throwable: Throwable) {
    reportError(message, throwable)
  }

  def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]) {
    // Push the ArrayBuffer the receiver filled via store() into memory
    pushArrayBuffer(arrayBuffer, None, Some(blockId))
  }
}
b. Step into pushArrayBuffer to see the concrete implementation:
/** Store an ArrayBuffer of received data as a data block into Spark's memory.
 * Called here as:
 * pushArrayBuffer(arrayBuffer[Any], None, Some(StreamBlockId(receiverId, time - blockIntervalMs)))
 */
def pushArrayBuffer(
    arrayBuffer: ArrayBuffer[_],
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  // Report the block to the driver. ArrayBufferBlock is the ReceivedBlock case class
  // that represents a data block held in an ArrayBuffer.
  pushAndReportBlock(ArrayBufferBlock(arrayBuffer), metadataOption, blockIdOption)
}
==> Let's see how pushAndReportBlock stores the block and reports it to the driver:
/** Store block and report it to driver.
 * Called here as:
 * pushAndReportBlock(ArrayBufferBlock(arrayBuffer[Any]), None, Some(StreamBlockId(receiverId, time - blockIntervalMs)))
 */
def pushAndReportBlock(
    receivedBlock: ReceivedBlock,
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  // If no StreamBlockId was handed in, generate a fresh one
  val blockId = blockIdOption.getOrElse(nextBlockId)
  val time = System.currentTimeMillis
  // Store the block in the BlockManager at the configured storageLevel. For the
  // BlockManager-based handler this returns BlockManagerBasedStoreResult(blockId, numRecords),
  // where numRecords is the number of records that were store()d into the block.
  val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
  logDebug(s"Pushed block $blockId in ${(System.currentTimeMillis - time)} ms")
  val numRecords = blockStoreResult.numRecords
  val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
  // Send the ReceivedBlockInfo to the ReceiverTrackerEndpoint's receiveAndReply on the
  // driver, wrapped in an AddBlock message carrying the block's metadata
  trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
  logDebug(s"Reported block $blockId")
}
===> From the source we can see that storing the received data into the BlockManager happens in receivedBlockHandler.storeBlock().
===> First, look at how receivedBlockHandler is constructed. From the source we can tell that in the current case we get a BlockManagerBasedBlockHandler instance:
// ReceivedBlockHandler is the class that stores the blocks received by the receiver
private val receivedBlockHandler: ReceivedBlockHandler = {
  // If spark.streaming.receiver.writeAheadLog.enable is not set in the conf it defaults
  // to false, i.e. no WriteAheadLogBasedBlockHandler is created
  if (WriteAheadLogUtils.enableReceiverLog(env.conf)) {
    if (checkpointDirOption.isEmpty) {
      throw new SparkException(
        "Cannot enable receiver write-ahead log without checkpoint directory set. " +
          "Please use streamingContext.checkpoint() to set the checkpoint directory. " +
          "See documentation for more details.")
    }
    new WriteAheadLogBasedBlockHandler(env.blockManager, receiver.streamId,
      receiver.storageLevel, env.conf, hadoopConf, checkpointDirOption.get)
  } else {
    // Use the Executor's BlockManager from SparkEnv to store the blocks. For a stream
    // created via socketTextStream, the receiver here is a SocketReceiver.
    new BlockManagerBasedBlockHandler(env.blockManager, receiver.storageLevel)
  }
}
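For reference, this is what opting into the WAL-backed handler looks like from the application side; the checkpoint path below is just a placeholder:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Opting in to WriteAheadLogBasedBlockHandler: enable the receiver WAL and set a
// checkpoint directory, otherwise the SparkException above is thrown
val conf = new SparkConf()
  .setAppName("wal-demo")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint") // placeholder path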
===> From the source we can see that storeBlock stores the receiver's data in the Spark cluster via BlockManager.putIterator at the configured storageLevel, then returns the StreamBlockId and the number of stored records wrapped in a BlockManagerBasedStoreResult.
(The code behind BlockManager.putIterator() will be analyzed separately later.)
/**
 * Implementation of a [[org.apache.spark.streaming.receiver.ReceivedBlockHandler]] which
 * stores the received blocks into a block manager with the specified storage level.
 */
private[streaming] class BlockManagerBasedBlockHandler(
    blockManager: BlockManager, storageLevel: StorageLevel)
  extends ReceivedBlockHandler with Logging {

  // The ReceivedBlock is the store()d data wrapped as ArrayBufferBlock(arrayBuffer[Any])
  def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {
    var numRecords = None: Option[Long]

    val putResult: Seq[(BlockId, BlockStatus)] = block match {
      case ArrayBufferBlock(arrayBuffer) =>
        // Number of records that ended up in the arrayBuffer[Any]
        numRecords = Some(arrayBuffer.size.toLong)
        // Store the block in the BlockManager at the given storageLevel
        blockManager.putIterator(blockId, arrayBuffer.iterator, storageLevel,
          tellMaster = true)
      case IteratorBlock(iterator) =>
        ....

    BlockManagerBasedStoreResult(blockId, numRecords)
  }
===> Finally, the ReceiverTrackerEndpoint is used to notify the driver:
// Send the ReceivedBlockInfo to the ReceiverTrackerEndpoint's receiveAndReply on the
// driver; the AddBlock message carries the block's metadata
trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
===> On the driver, the ReceiverTrackerEndpoint's receiveAndReply handles the message and replies with the boolean returned by addBlock:
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  // Remote messages
  case RegisterReceiver(streamId, typ, host, executorId, receiverEndpoint) =>
    val successful =
      registerReceiver(streamId, typ, host, executorId, receiverEndpoint, context.senderAddress)
    context.reply(successful)
  case AddBlock(receivedBlockInfo) =>
    if (WriteAheadLogUtils.isBatchingEnabled(ssc.conf, isDriver = true)) {
      walBatchingThreadPool.execute(new Runnable {
        override def run(): Unit = Utils.tryLogNonFatalError {
          if (active) {
            context.reply(addBlock(receivedBlockInfo))
          } else {
            throw new IllegalStateException("ReceiverTracker RpcEndpoint shut down.")
          }
        }
      })
    } else {
      context.reply(addBlock(receivedBlockInfo))
    }
  case DeregisterReceiver(streamId, message, error) =>
    ....
}
===> addBlock's job is to append this ReceivedBlockInfo metadata to a ReceivedBlockQueue, whose elements are exactly these ReceivedBlockInfo entries. If an exception occurs, the method returns false.
==> The method is implemented in ReceivedBlockTracker:
/** Add received block. This event will get written to the write ahead log (if enabled). */
def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
  try {
    val writeResult = writeToLog(BlockAdditionEvent(receivedBlockInfo))
    if (writeResult) {
      synchronized {
        getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo
      }
      logDebug(s"Stream ${receivedBlockInfo.streamId} received " +
        s"block ${receivedBlockInfo.blockStoreResult.blockId}")
    } else {
      logDebug(s"Failed to acknowledge stream ${receivedBlockInfo.streamId} receiving " +
        s"block ${receivedBlockInfo.blockStoreResult.blockId} in the Write Ahead Log.")
    }
    writeResult
  } catch {
    case NonFatal(e) =>
      logError(s"Error adding block $receivedBlockInfo", e)
      false
  }
}
That concludes how ReceiverSupervisorImpl's onStart() gets the Receiver's data written into Spark's BlockManager.
Next we will analyze startReceiver() in ReceiverSupervisorImpl: how does the data the Receiver store()s end up in an RDD?