Spark Streaming源码解读之Driver中的ReceiverTracker详解

本文深入探讨了Spark Streaming中Driver的ReceiverTracker如何管理接收到的数据。Receiver作为Task启动,ReceiverSupervisorImpl负责接收数据并转化为Block。ReceiverTrackerEndpoint通过RPC接收元数据,并使用ReceivedBlockTracker分配Block到batches。当Driver失败时,可以从checkpoint恢复。JobGenerator根据ReceiverTracker的元数据生成RDD,ReceiverBlockTracker则管理Block分配。ReceiverTracker与ReceivedBlockTracker采用门面设计模式,ReceiverTracker对外提供接口,ReceivedBlockTracker执行实际操作。
摘要由CSDN通过智能技术生成

本篇博文的目标是:
Driver的ReceiverTracker接收到数据之后,下一步对数据是如何进行管理

一:ReceiverTracker的架构设计
1. Driver在Executor启动Receiver方式,每个Receiver都封装成一个Task,此时一个Job中就一个Task,而Task中就一条数据,也就是Receiver数据。由此,多少个Job也就可以启动多少个Receiver.
2. ReceiverTracker在启动Receiver的时候他有ReceiverSupervisor,其实现是ReceiverSupervisorImpl, ReceiverSupervisor本身启动的时候会启动Receiver,Receiver不断的接收数据,通过BlockGenerator将数据转换成Block。定时器会不断的把Block数据通过BlockManager或者WAL进行存储,数据存储之后ReceiverSupervisorImpl会把存储后的数据的元数据Metadate汇报给ReceiverTracker,其实是汇报给ReceiverTracker中的RPC实体ReceiverTrackerEndpoint.
ReceiverSupervisorImpl会将元数据汇报给ReceiverTracker,那么接收到之后,下一步就对数据进行管理。

  1. 通过receivedBlockHandler写数据
private val receivedBlockHandler: ReceivedBlockHandler = {
  if (WriteAheadLogUtils.enableReceiverLog(env.conf)) {
    if (checkpointDirOption.isEmpty) {
      throw new SparkException(
        "Cannot enable receiver write-ahead log without checkpoint directory set. " +
          "Please use streamingContext.checkpoint() to set the checkpoint directory. " +
          "See documentation for more details.")
    }
//WAL
    new WriteAheadLogBasedBlockHandler(env.blockManager, receiver.streamId,
      receiver.storageLevel, env.conf, hadoopConf, checkpointDirOption.get)
  } else {
//BlockManager
    new BlockManagerBasedBlockHandler(env.blockManager, receiver.storageLevel)
  }
}
2.  PushAndReportBlock存储Block数据,且把信息汇报给Driver。
/** Store block and report it to driver */
def pushAndReportBlock(
    receivedBlock: ReceivedBlock,
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  val blockId = blockIdOption.getOrElse(nextBlockId)
  val time = System.currentTimeMillis
  val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
  logDebug(s"Pushed block $blockId in ${(System.currentTimeMillis - time)} ms")
  val numRecords = blockStoreResult.numRecords
  val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
  trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
  logDebug(s"Reported block $blockId")
}
3.  此时trackerEndpoint是ReceiverTrackerEndpoint
/** Remote RpcEndpointRef for the ReceiverTracker */
private val trackerEndpoint = RpcUtils.makeDriverRef("ReceiverTracker", env.conf, env.rpcEnv)
4.  ReceivedBlockInfo:封装Block的存储信息。
/** Information about blocks received by the receiver */
private[streaming] case class ReceivedBlockInfo(
    streamId: Int, //block属于哪个接收的流
    numRecords: Option[Long],//多少条记录
    metadataOption: Option[Any],//元数据信息
    blockStoreResult: ReceivedBlockStoreResult
  ) {
   

  require(numRecords.isEmpty || numRecords.get >= 0, "numRecords must not be negative")

  @volatile private var _isBlockIdValid = true

  def blockId: StreamBlockId = blockStoreResult.blockId

  def walRecordHandleOption: Option[WriteAheadLogRecordHandle] = {
    blockStoreResult match {
      case walStoreResult: WriteAheadLogBasedStoreResult => Some(walStoreResult.walRecordHandle)
      case _ => None
    }
  }

  /** Is the block ID valid, that is, is the block present in the Spark executors. */
  def isBlockIdValid(): Boolean = _isBlockIdValid

  /**
   * Set the block ID as invalid. This is useful when it is known that the block is not present
   * in the Spark executors.
   */
  def setBlockIdInvalid(): Unit = {
    _isBlockIdValid = false
  }
}
5.  ReceivedBlockStoreResult:
/** Trait that represents the metadata related to storage of blocks */
private[streaming] trait ReceivedBlockStoreResult {
  // Any implementation of this trait will store a block id
  def blockId: StreamBlockId
  // Any implementation of this trait will have to return the number of records
  def numRecords: Option[Long]
}

ReceiverTracker的源码源码遍历
1. 下面的消息是完成Receiver和ReceiverTracker之间通信的。

/**
 * Messages used by the NetworkReceiver and the ReceiverTracker to communicate
 * with each other.
 */
//这里使用sealed意思是ReceiverTrackerMessage包含所有的消息。
private[streaming] sealed trait ReceiverTrackerMessage
private[streaming] case class RegisterReceiver(
    streamId: Int,
    typ: String,
    host: String,
    executorId: String,
    receiverEndpoint: RpcEndpointRef
  ) extends ReceiverTrackerMessage
private[streaming] case class AddBlock(receivedBlockInfo: ReceivedBlockInfo)
  extends ReceiverTrackerMessage
private[streaming] case class ReportError(streamId: Int, message: String, error: String)
private[streaming] case class DeregisterReceiver(streamId: Int, msg: String, error: String)
  extends ReceiverTrackerMessage
2.  Driver和ReceiverTrackerEndpoint之间的交流通过ReceiverTrackerLocalMessage。
/**
 * Messages used by the driver and ReceiverTrackerEndpoint to communicate locally.
 */
private[streaming] sealed trait ReceiverTrackerLocalMessage
3.  ReceiverTrackerLocalMessage中的子类
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值