Spark Streaming case study: NetworkWordCount -- how ReceiverInputDStream's compute method fetches the data that the SocketReceiver has pre-stored in the BlockManager

1. Once again, we start from this example:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()
    // Create the context with a 40 second batch size
    val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[5]")
    val ssc = new StreamingContext(sparkConf, Seconds(40))

    // Create a socket stream on target ip:port and count the
    // words in the input stream of \n delimited text (e.g. generated by 'nc')
    // Note that a storage level without replication is acceptable only when running locally;
    // replication is necessary in a distributed setting for fault tolerance.
    val lines = ssc.socketTextStream("192.168.4.41", 9999, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
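
To drive the example, first start a netcat server on the target host (as the comment above suggests), e.g. nc -lk 9999, and type whitespace-separated lines of text; each 40-second batch then prints the word counts it computed.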

2. The article "SparkStream example HdfsWordCount -- the full path from DStream to RDD" explained in detail how the DStreamGraph walks backwards to generate RDDs. A quick recap:

a. DStream.print() ==> the corresponding ForEachDStream's generateJob(time: Time) method is invoked by DStreamGraph.generateJobs(time).

b. In ForEachDStream.generateJob(time: Time) { parent.getOrCompute(time) ... }, the call chases the parent DStreams all the way back to FileInputDStream's compute method, which generates the RDD.

===> In NetworkWordCount the input stream is a ReceiverInputDStream; the DStreamGraph backtracking that generates the RDD works exactly the same way. The difference is that ReceiverInputDStream builds its RDD from data that the SocketReceiver has already stored in Spark's BlockManager.

===> (How a Receiver stores its data was analyzed in the article "startReceiver() in ReceiverSupervisorImpl: how the Receiver store()s data into the RDD".)
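
As a refresher on what that backtracking looks like in code, here is a simplified sketch of DStream.getOrCompute from the Spark 1.x source (the real method also handles the storage level, checkpointing and local properties): each DStream first checks its cache of already-generated RDDs, and otherwise calls its own compute(time), which for a derived DStream recurses into parent.getOrCompute(time) until an input DStream is reached.

// Simplified sketch of DStream.getOrCompute (Spark 1.x); persistence,
// checkpointing and local-property handling are omitted for clarity.
private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
  // Reuse the RDD if this batch time was already materialized...
  generatedRDDs.get(time).orElse {
    // ...otherwise compute it (for a derived DStream this recurses into
    // parent.getOrCompute) and remember it for this batch time.
    if (isTimeValid(time)) {
      val rddOption = compute(time)
      rddOption.foreach { newRDD => generatedRDDs.put(time, newRDD) }
      rddOption
    } else {
      None
    }
  }
}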

3. Let's look directly at the compute method of SocketInputDStream's parent class, ReceiverInputDStream.

abstract class ReceiverInputDStream[T: ClassTag](ssc_ : StreamingContext)
  extends InputDStream[T](ssc_) {
  ...
  /**
   * Generates RDDs with blocks received by the receiver of this stream. */
  override def compute(validTime: Time): Option[RDD[T]] = {
    val blockRDD = {

      if (validTime < graph.startTime) {
        // If this is called for any time before the start time of the context,
        // then this returns an empty RDD. This may happen when recovering from a
        // driver failure without any write ahead log to recover pre-failure data.
        // An empty RDD here typically means the driver restarted after a failure
        // and no WAL was configured.
        new BlockRDD[T](ssc.sc, Array.empty)
      } else {
        // Otherwise, ask the tracker for all the blocks that have been allocated
        // to this stream for this batch.
        val receiverTracker = ssc.scheduler.receiverTracker
        // receiverTracker.getBlocksOfBatch(validTime) returns, for the current batch,
        // every receiverId together with its Seq[ReceivedBlockInfo],
        // i.e. a Map[receiverId, Seq[ReceivedBlockInfo]].
        // The InputDStream's id corresponds to the receiverId.
        val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)
        // Register the input blocks information into InputInfoTracker:
        // the block info is wrapped in a StreamInputInfo, where `id` is this
        // ReceiverInputDStream's id and `numRecords` is the total number of
        // records stored in the BlockManager.
        val inputInfo = StreamInputInfo(id, blockInfos.flatMap(_.numRecords).sum)
        ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)
        // Create the BlockRDD
        createBlockRDD(validTime, blockInfos)
      }
    }
    Some(blockRDD)
  }

4. This piece of code is well worth studying. The goal: obtain the Seq[ReceivedBlockInfo] of every receiverId in the current batch.

==> The ReceivedBlockInfo class carries the streamId, the total number of records store()'d, the BlockId, and other metadata.

val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)
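
For reference, this is roughly the shape of ReceivedBlockInfo in the Spark 1.x source (paraphrased; treat the exact signatures as approximate, they differ slightly between versions):

// Paraphrased from the Spark 1.x source.
private[streaming] case class ReceivedBlockInfo(
    streamId: Int,                              // id of the receiver / input stream
    numRecords: Option[Long],                   // number of records store()'d in this block
    metadataOption: Option[Any],                // optional receiver-supplied metadata
    blockStoreResult: ReceivedBlockStoreResult  // where and how the block was stored
  ) {

  // The BlockId under which the data sits in the BlockManager
  def blockId: StreamBlockId = blockStoreResult.blockId

  // Present only when the block was also written to a write-ahead log
  def walRecordHandleOption: Option[WriteAheadLogRecordHandle] =
    blockStoreResult match {
      case wal: WriteAheadLogBasedStoreResult => Some(wal.walRecordHandle)
      case _ => None
    }
}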

a. ReceiverTracker's getBlocksOfBatch method returns the block info for all input streams.

private[streaming]
class ReceiverTracker(ssc: StreamingContext, skipReceiverLaunch: Boolean = false) extends Logging {
  ...
  private val receivedBlockTracker = new ReceivedBlockTracker(
    ssc.sparkContext.conf,
    ssc.sparkContext.hadoopConfiguration,
    receiverInputStreamIds,
    ssc.scheduler.clock,
    ssc.isCheckpointPresent,
    Option(ssc.checkpointDir)
  )
  ...
  /** Get the blocks for the given batch and all input streams. */
  def getBlocksOfBatch(batchTime: Time): Map[Int, Seq[ReceivedBlockInfo]] = {
    receivedBlockTracker.getBlocksOfBatch(batchTime)
  }

b. The code below shows that the block info of every batch is kept in timeToAllocatedBlocks, a map from batch time to AllocatedBlocks.

private[streaming] class ReceivedBlockTracker(
    conf: SparkConf,
    hadoopConf: Configuration,
    streamIds: Seq[Int],
    clock: Clock,
    recoverFromWriteAheadLog: Boolean,
    checkpointDirOption: Option[String])
  extends Logging {

  private type ReceivedBlockQueue = mutable.Queue[ReceivedBlockInfo]
  private val streamIdToUnallocatedBlockQueues = new mutable.HashMap[Int, ReceivedBlockQueue]
  private val timeToAllocatedBlocks = new mutable.HashMap[Time, AllocatedBlocks]
  ...
  /** Get the blocks allocated to the given batch.
    * For the given batch time, the returned Map holds every receiverId of that
    * batch together with its Seq[ReceivedBlockInfo].
    */
  def getBlocksOfBatch(batchTime: Time): Map[Int, Seq[ReceivedBlockInfo]] = synchronized {
    timeToAllocatedBlocks.get(batchTime).map { _.streamIdToAllocatedBlocks }.getOrElse(Map.empty)
  }

c. Since the data is read out of timeToAllocatedBlocks, who put it there in the first place?

==> Only after the Receiver has stored its data into Spark's BlockManager does the JobGenerator start working. Walking through the execution flow of JobScheduler's start method confirms this.

def start(): Unit = synchronized {
  ...
  listenerBus.start(ssc.sparkContext)
  // Handles ReceiverInputDStream data sources such as SocketInputDStream,
  // FlumePollingInputDStream, FlumeInputDStream, etc. (see the subclasses
  // of ReceiverInputDStream).
  receiverTracker = new ReceiverTracker(ssc)
  inputInfoTracker = new InputInfoTracker(ssc)
  receiverTracker.start()
  jobGenerator.start()
  logInfo("Started JobScheduler")
}

d. Tracing into JobGenerator.generateJobs, the key line is

ReceiverTracker.allocateBlocksToBatch(time). Its comment tells us what the method does: allocate the received blocks to the current batch.

===> allocateBlocksToBatch is called first; ReceiverInputDStream's compute is called afterwards.

/** Generate jobs and perform checkpoint for the given `time`. */
private def generateJobs(time: Time) {
  // Set the SparkEnv in this thread, so that job generation code can access the environment
  // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
  // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
  SparkEnv.set(ssc.env)
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    // Calls graph.generateJobs; scala's Try.apply returns Success(jobs) or Failure(e),
    // where `jobs` is the collection of Job objects produced by that method. If the jobs
    // were created successfully, JobScheduler.submitJobSet submits them to the cluster.
    graph.generateJobs(time) // generate jobs using allocated block
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      // streamIdToInputInfos is the metadata of the store()'d input data.
      // A JobSet represents the batch of jobs for one batch duration: a plain object
      // holding the unsubmitted jobs, the submission time, start/end times, etc.
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  // Post a DoCheckpoint event; it fires once per streaming batch interval.
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
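
As a side note, the Success/Failure handling above is just scala.util.Try's apply pattern. A minimal, self-contained illustration (riskyJobGeneration is a made-up stand-in for graph.generateJobs):

import scala.util.{Failure, Success, Try}

// Hypothetical stand-in for graph.generateJobs, used only for illustration.
def riskyJobGeneration(): Seq[String] = Seq("job-1", "job-2")

// Try.apply wraps the block: it yields Success(value) if the block completes,
// or Failure(exception) if anything inside throws.
Try {
  riskyJobGeneration()
} match {
  case Success(jobs) => println(s"submitting ${jobs.size} jobs")
  case Failure(e)    => println(s"error generating jobs: ${e.getMessage}")
}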

e. How is ReceiverTracker.allocateBlocksToBatch(time) implemented?

==> It allocates all unallocated blocks to the given batch:

/** Allocate all unallocated blocks to the given batch. */
def allocateBlocksToBatch(batchTime: Time): Unit = {
  if (receiverInputStreams.nonEmpty) {
    receivedBlockTracker.allocateBlocksToBatch(batchTime)
  }
}

f. Back into ReceivedBlockTracker:

The allocateBlocksToBatch method fills timeToAllocatedBlocks, a HashMap[Time, AllocatedBlocks]: the key is the batch time, and the value is AllocatedBlocks(streamIdToAllocatedBlocks: Map[Int, Seq[ReceivedBlockInfo]]), which maps every receiverId of the current batch to its Seq[ReceivedBlockInfo].

private[streaming] class ReceivedBlockTracker(
    conf: SparkConf,
    hadoopConf: Configuration,
    streamIds: Seq[Int],
    clock: Clock,
    recoverFromWriteAheadLog: Boolean,
    checkpointDirOption: Option[String])
  extends Logging {

  private type ReceivedBlockQueue = mutable.Queue[ReceivedBlockInfo]
  private val streamIdToUnallocatedBlockQueues = new mutable.HashMap[Int, ReceivedBlockQueue]
  private val timeToAllocatedBlocks = new mutable.HashMap[Time, AllocatedBlocks]
  private var lastAllocatedBatchTime: Time = null
  ...
  /**
   * Allocate all unallocated blocks to the given batch.
   * This event will get written to the write ahead log (if enabled).
   */
  def allocateBlocksToBatch(batchTime: Time): Unit = synchronized {
    if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) {
      // Collect every receiver's id (streamId is the receiver's id) and its
      // ReceivedBlockInfo into a Map[streamId, Seq[ReceivedBlockInfo]].
      val streamIdToBlocks = streamIds.map { streamId =>
          // 1. dequeueAll walks every element of the queue (each a ReceivedBlockInfo,
          //    i.e. block metadata) and passes it to the predicate; elements for which
          //    it returns true are returned AND removed from the queue.
          //    ==> Because elements are removed, it does not matter if block info from
          //    the next batch has already landed in the queue: it is simply treated as
          //    part of the current batch and drained along with it.
          // 2. The queue has data because the receiver first store()s the data into the
          //    BlockManager, then the executor notifies the driver's
          //    ReceiverTrackerEndpoint with AddBlock(ReceivedBlockInfo) ==>
          //    all ReceivedBlockInfo of the current batch end up as the value of a
          //    HashMap[Int, ReceivedBlockQueue] keyed by receiverId.
          (streamId, getReceivedBlockQueue(streamId).dequeueAll(x => true))
      }.toMap
      // Wrap streamIdToBlocks: Map[streamId, Seq[ReceivedBlockInfo]] into the
      // per-batch container AllocatedBlocks.
      val allocatedBlocks = AllocatedBlocks(streamIdToBlocks)
      // BatchAllocationEvent marks this ReceivedBlockTracker state transition
      // ("batch allocation done", i.e. the data is already in the BlockManager);
      // it exists for the WAL. When the WAL is disabled, writeToLog still returns true.
      if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) {
        // timeToAllocatedBlocks is the HashMap[Time, AllocatedBlocks]
        timeToAllocatedBlocks.put(batchTime, allocatedBlocks)
        // lastAllocatedBatchTime has type Time
        lastAllocatedBatchTime = batchTime
      } else {
        logInfo(s"Possibly processed batch $batchTime need to be processed again in WAL recovery")
      }
    } else {
      // This situation occurs when:
      // 1. WAL is ended with BatchAllocationEvent, but without BatchCleanupEvent,
      // possibly processed batch job or half-processed batch job need to be processed again,
      // so the batchTime will be equal to lastAllocatedBatchTime.
      // 2. Slow checkpointing makes recovered batch time older than WAL recovered
      // lastAllocatedBatchTime.
      // This situation will only occurs in recovery time.
      logInfo(s"Possibly processed batch $batchTime need to be processed again in WAL recovery")
    }
  }
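
Since the dequeueAll(x => true) trick carries the whole batching semantics here, a tiny standalone sketch of its behavior (plain scala.collection.mutable.Queue, nothing Spark-specific):

import scala.collection.mutable

// dequeueAll evaluates the predicate on every element; matching elements are
// returned AND removed, so `dequeueAll(_ => true)` drains the queue in one call.
val queue = new mutable.Queue[String]
queue ++= Seq("block-1", "block-2", "block-3")

val drained = queue.dequeueAll(_ => true)
println(drained)       // ArrayBuffer(block-1, block-2, block-3)
println(queue.isEmpty) // true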

5. So, back in ReceiverInputDStream:

receiverTracker.getBlocksOfBatch(time).getOrElse(id) yields the Seq[ReceivedBlockInfo] for this receiverId, i.e. the metadata of the data the SocketReceiver stored into Spark.

abstract class ReceiverInputDStream[T: ClassTag](ssc_ : StreamingContext)
  extends InputDStream[T](ssc_) {
  ...
  override def compute(validTime: Time): Option[RDD[T]] = {
    val blockRDD = {
      if (validTime < graph.startTime) {
        ...
      } else {
        // Ask the ReceiverTracker for all ReceivedBlockInfo of the current batch.
        val receiverTracker = ssc.scheduler.receiverTracker
        // receiverTracker.getBlocksOfBatch(validTime) returns, for the current batch,
        // every receiverId with its Seq[ReceivedBlockInfo], i.e. a
        // Map[receiverId, Seq[ReceivedBlockInfo]].
        // The InputDStream's id corresponds to the receiverId.
        val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)
        // Register the input blocks information into InputInfoTracker,
        // wrapped in a StreamInputInfo: `id` is this ReceiverInputDStream's id,
        // `numRecords` the total number of records stored in the BlockManager.
        val inputInfo = StreamInputInfo(id, blockInfos.flatMap(_.numRecords).sum)
        ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

        // Create the BlockRDD
        createBlockRDD(validTime, blockInfos)
      }
    }
    Some(blockRDD)
  }

6. Next comes the call to createBlockRDD, which builds the BlockRDD from all the blocks of the current batch held in Spark's BlockManager.

// Takes the current batch time and its Seq[ReceivedBlockInfo], i.e. the metadata
// of the blocks stored in Spark.
private[streaming] def createBlockRDD(time: Time, blockInfos: Seq[ReceivedBlockInfo]): RDD[T] = {

  if (blockInfos.nonEmpty) {
    // In this example this yields an Array[StreamBlockId(streamId: Int, uniqueId: Long)]
    val blockIds = blockInfos.map { _.blockId.asInstanceOf[BlockId] }.toArray

    // Are WAL record handles present with all the blocks
    val areWALRecordHandlesPresent = blockInfos.forall { _.walRecordHandleOption.nonEmpty }
    if (areWALRecordHandlesPresent) {
      // If all the blocks have WAL record handle, then create a WALBackedBlockRDD
      val isBlockIdValid = blockInfos.map { _.isBlockIdValid() }.toArray
      val walRecordHandles = blockInfos.map { _.walRecordHandleOption.get }.toArray
      new WriteAheadLogBackedBlockRDD[T](
        ssc.sparkContext, blockIds, walRecordHandles, isBlockIdValid)
    } else {
      // Else, create a BlockRDD. However, if there are some blocks with WAL info but not
      // others then that is unexpected and log a warning accordingly.
      if (blockInfos.find(_.walRecordHandleOption.nonEmpty).nonEmpty) {
        if (WriteAheadLogUtils.enableReceiverLog(ssc.conf)) {
          logError("Some blocks do not have Write Ahead Log information; " +
            "this is unexpected and data may not be recoverable after driver failures")
        } else {
          logWarning("Some blocks have Write Ahead Log information; this is unexpected")
        }
      }
      // Ask the BlockManagerMaster which StreamBlockIds actually exist in the cluster.
      val validBlockIds = blockIds.filter { id =>
        ssc.sparkContext.env.blockManager.master.contains(id)
      }
      // If the recorded Array[StreamBlockId(streamId: Int, uniqueId: Long)] does not
      // match what the cluster holds, log it.
      if (validBlockIds.size != blockIds.size) {
        logWarning("Some blocks could not be recovered as they were not found in memory. " +
          "To prevent such data loss, enable Write Ahead Log (see programming guide " +
          "for more details.")
      }
      // Create the BlockRDD from the StreamBlockIds the cluster actually has.
      new BlockRDD[T](ssc.sc, validBlockIds)
    }
  } else {
    // If no block is ready now, creating WriteAheadLogBackedBlockRDD or BlockRDD
    // according to the configuration
    if (WriteAheadLogUtils.enableReceiverLog(ssc.conf)) {
      new WriteAheadLogBackedBlockRDD[T](
        ssc.sparkContext, Array.empty, Array.empty, Array.empty)
    } else {
      new BlockRDD[T](ssc.sc, Array.empty)
    }
  }
}
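
For completeness, when a job over this BlockRDD actually runs, each partition pulls its block straight out of the BlockManager. A simplified paraphrase of BlockRDD.compute from the Spark 1.x source (details vary slightly by version):

// Paraphrased from Spark 1.x BlockRDD.compute; simplified for illustration.
override def compute(split: Partition, context: TaskContext): Iterator[T] = {
  assertValid()
  val blockManager = SparkEnv.get.blockManager
  // Each partition of a BlockRDD is backed by exactly one BlockId.
  val blockId = split.asInstanceOf[BlockRDDPartition].blockId
  blockManager.get(blockId) match {
    case Some(block) => block.data.asInstanceOf[Iterator[T]]
    case None =>
      throw new Exception("Could not compute split, block " + blockId + " not found")
  }
}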

With that, the flow is complete: ReceiverInputDStream periodically fetches the data that the SocketReceiver stored into Spark's BlockManager ahead of time.

 
