Spark Streaming example: NetworkWordCount -- how ReceiverInputDStream's compute method fetches the data that the socket receiver previously stored in the BlockManager

水中舟_luyl

Category: spark Streaming

Article link: https://blog.csdn.net/luyllyl/article/details/79377995

1. Let's start from this example again

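import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {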
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()
    // Create the context with a 40 second batch size
    val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[5]")
    val ssc = new StreamingContext(sparkConf, Seconds(40))

    // Create a socket stream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    // Note that no duplication in storage level only for running locally.
    // Replication necessary in distributed scenario for fault tolerance.
    val lines = ssc.socketTextStream("192.168.4.41", 9999, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

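2. The article "Spark Streaming example HdfsWordCount -- a full walkthrough from DStream to RDD" explained in detail how DStreamGraph traces back through the DStream lineage to generate RDDs. Here is a brief recap:

a. DStream.print() ==> the generateJob(time: Time) method of the corresponding ForEachDStream is called by DStreamGraph.generateJobs(time)

b. ForEachDStream's generateJob(time: Time) { parent.getOrCompute(time) ... } walks up through the parent DStreams until it reaches FileInputDStream's compute method, which generates the RDD

===> In NetworkWordCount the input stream is a ReceiverInputDStream instead. Tracing back through DStreamGraph to generate the RDD works the same way, except that ReceiverInputDStream builds its RDD from data that SocketReceiver has already stored in Spark's BlockManager.

===> (The article "startReceiver() in ReceiverSupervisorImpl: how a Receiver stores data into an RDD" analyzed how a Receiver stores the data it receives.)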
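The backtracking recapped above can be pictured with a small self-contained model. This is only a sketch: ToyDStream, ToyInputDStream, ToyMappedDStream and the Long batch time are hypothetical stand-ins, not Spark classes, and a plain Seq plays the role of an RDD.

```scala
import scala.collection.mutable

object LineageSketch {
  // Hypothetical stand-in for an RDD: just an in-memory sequence.
  type ToyRDD[T] = Seq[T]

  abstract class ToyDStream[T] {
    // Cache of RDDs already generated, keyed by batch time.
    private val generated = mutable.HashMap.empty[Long, ToyRDD[T]]

    def compute(time: Long): Option[ToyRDD[T]]

    // Return the cached RDD of this batch, or compute and cache it.
    def getOrCompute(time: Long): Option[ToyRDD[T]] =
      generated.get(time).orElse {
        val rdd = compute(time)
        rdd.foreach(generated.put(time, _))
        rdd
      }
  }

  // Input stream: the end of the lineage, where the data actually comes from.
  class ToyInputDStream(batches: Map[Long, Seq[String]]) extends ToyDStream[String] {
    def compute(time: Long): Option[ToyRDD[String]] =
      Some(batches.getOrElse(time, Seq.empty))
  }

  // Transformation: asks its parent for the RDD of the same batch, then transforms it.
  class ToyMappedDStream[T, U](parent: ToyDStream[T], f: T => U) extends ToyDStream[U] {
    def compute(time: Long): Option[ToyRDD[U]] =
      parent.getOrCompute(time).map(_.map(f))
  }

  val input = new ToyInputDStream(Map(40L -> Seq("a", "b")))
  val upper = new ToyMappedDStream[String, String](input, _.toUpperCase)
}
```

Calling LineageSketch.upper.getOrCompute(40L) first asks the mapped stream, which asks its parent, which finally produces the data; a batch with no input yields an empty sequence rather than None.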

3. Let's look directly at the compute method of ReceiverInputDStream, the parent class of SocketInputDStream.

abstract class ReceiverInputDStream[T: ClassTag](ssc_ : StreamingContext)
  extends InputDStream[T](ssc_) {
  // ...
  /**
   * Generates RDDs with blocks received by the receiver of this stream. */
  override def compute(validTime: Time): Option[RDD[T]] = {
    val blockRDD = {

      if (validTime < graph.startTime) {
        // If this is called for any time before the start time of the context,
        // then this returns an empty RDD. This may happen when recovering from a
        // driver failure without any write ahead log to recover pre-failure data.

        // An empty RDD is returned here, e.g. when the driver restarts after a failure and no WAL was enabled
        new BlockRDD[T](ssc.sc, Array.empty)
      } else {
        // Otherwise, ask the tracker for all the blocks that have been allocated to this stream
        // for this batch, i.e. fetch the ReceivedBlockInfo entries of the current batch
        val receiverTracker = ssc.scheduler.receiverTracker
        // receiverTracker.getBlocksOfBatch(validTime) returns Map[receiverId, Seq[ReceivedBlockInfo]]:
        // every receiverId of the current batch together with its Seq[ReceivedBlockInfo].
        // The id of the InputDStream corresponds to the receiverId.
        val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)
        // Register the input blocks information into InputInfoTracker
        // The block information is wrapped in a StreamInputInfo: id is the id of this ReceiverInputDStream,
        // and the second argument is the total number of records stored in the BlockManager for this batch
        val inputInfo = StreamInputInfo(id, blockInfos.flatMap(_.numRecords).sum)
        ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)
        // Create the BlockRDD
        createBlockRDD(validTime, blockInfos)
      }
    }
    Some(blockRDD)
  }

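4. This piece of code is worth studying. Its goal is to obtain, for the current batch, the Seq[ReceivedBlockInfo] that belongs to this stream's receiverId.

==> The ReceivedBlockInfo class holds metadata such as the streamId, the number of records passed to store(), and the BlockId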

val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)
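To make the lookup and the record count concrete, here is a minimal sketch. ToyBlockInfo is a hypothetical, simplified stand-in for ReceivedBlockInfo, and the block id strings are made up for illustration; only the getOrElse and flatMap(_.numRecords).sum idioms mirror the code above.

```scala
object BlockInfoSketch {
  // Hypothetical, simplified stand-in for ReceivedBlockInfo.
  final case class ToyBlockInfo(streamId: Int, numRecords: Option[Long], blockId: String)

  // Blocks of one batch, keyed by stream id (illustrative data only).
  val blocksOfBatch: Map[Int, Seq[ToyBlockInfo]] = Map(
    0 -> Seq(
      ToyBlockInfo(0, Some(3L), "input-0-1"),
      ToyBlockInfo(0, None, "input-0-2"), // record count unknown
      ToyBlockInfo(0, Some(5L), "input-0-3")))

  // Blocks of the given stream, or an empty Seq if it received nothing in this batch.
  def blocksFor(id: Int): Seq[ToyBlockInfo] =
    blocksOfBatch.getOrElse(id, Seq.empty)

  // flatMap drops the None entries, so only the known counts are summed.
  def recordCount(id: Int): Long =
    blocksFor(id).flatMap(_.numRecords).sum
}
```

With this data, stream 0 reports 8 records and an unknown stream id reports 0, so a batch without input still produces a well-defined StreamInputInfo-style count.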

a. In ReceiverTracker, the getBlocksOfBatch method returns the blocks of all input streams for the given batch

private[streaming]
class ReceiverTracker(ssc: StreamingContext, skipReceiverLaunch: Boolean = false) extends Logging {
  // ...
  private val receivedBlockTracker = new ReceivedBlockTracker(
    ssc.sparkContext.conf,
    ssc.sparkContext.hadoopConfiguration,
    receiverInputStreamIds,
    ssc.scheduler.clock,
    ssc.isCheckpointPresent,
    Option(ssc.checkpointDir)
  )

  // ...

  /** Get the blocks for the given batch and all input streams. */
  def getBlocksOfBatch(batchTime: Time): Map[Int, Seq[ReceivedBlockInfo]] = {
    receivedBlockTracker.getBlocksOfBatch(batchTime)
  }

b. The code below shows that the block information of every batch is kept in the timeToAllocatedBlocks map, whose values are AllocatedBlocks objects.

private[streaming] class ReceivedBlockTracker(
    conf: SparkConf,
    hadoopConf: Configuration,
    streamIds: Seq[Int],
    clock: Clock,
    recoverFromWriteAheadLog: Boolean,
    checkpointDirOption: Option[String])
  extends Logging {

  private type ReceivedBlockQueue = mutable.Queue[ReceivedBlockInfo]
  private val streamIdToUnallocatedBlockQueues = new mutable.HashMap[Int, ReceivedBlockQueue]
  private val timeToAllocatedBlocks = new mutable.HashMap[Time, AllocatedBlocks]
  // ...

  /**
   * Get the blocks allocated to the given batch.
   * The returned Map holds every receiverId of that batch and its Seq[ReceivedBlockInfo].
   */
  def getBlocksOfBatch(batchTime: Time): Map[Int, Seq[ReceivedBlockInfo]] = synchronized {
    timeToAllocatedBlocks.get(batchTime).map { _.streamIdToAllocatedBlocks }.getOrElse(Map.empty)
  }

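b. Since the data is read from timeToAllocatedBlocks, who puts the data of the current batch into it?

==> The JobGenerator only starts working after the ReceiverTracker has been started, so the Receiver is already storing data in Spark's BlockManager by the time jobs are generated. The execution order in JobScheduler's start method shows this.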

def start(): Unit = synchronized {
  // ...
  listenerBus.start(ssc.sparkContext)
  // Handles the receiver-based input sources, i.e. the subclasses of ReceiverInputDStream
  // such as SocketInputDStream, FlumePollingInputDStream, and FlumeInputDStream
  receiverTracker = new ReceiverTracker(ssc)
  inputInfoTracker = new InputInfoTracker(ssc)
  receiverTracker.start()
  jobGenerator.start()
  logInfo("Started JobScheduler")
}

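c. Next, trace into the JobGenerator.generateJobs method. The key call is

ReceiverTracker.allocateBlocksToBatch(time). According to its comment, this method allocates the received blocks to the current batch

===> allocateBlocksToBatch is called first; ReceiverInputDStream's compute is called afterwards.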

/** Generate jobs and perform checkpoint for the given `time`.  */
private def generateJobs(time: Time) {
  // Set the SparkEnv in this thread, so that job generation code can access the environment
  // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
  // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
  SparkEnv.set(ssc.env)
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
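    // Call graph.generateJobs. Scala's Try.apply wraps the result as Success(jobs) or Failure(e),
    // where jobs is the collection of Job objects returned by that method. If the jobs were created
    // successfully, JobScheduler.submitJobSet submits them to the cluster for execution.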
    graph.generateJobs(time) // generate jobs using allocated block
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
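      // streamIdToInputInfos is the metadata of the data received via store()
      // A JobSet represents the jobs of one batch duration. It is a plain object that holds the
      // jobs, the submission time, the processing start and end times, and similar information.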
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  // Post a DoCheckpoint event; it is posted once per streaming batch interval
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
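The Try / Success / Failure pattern used by generateJobs can be tried out in isolation. The sketch below is not Spark code: generate, submitOrReport and the fail flag are hypothetical names, and the jobs are plain strings.

```scala
import scala.util.{Failure, Success, Try}

object TrySketch {
  // Hypothetical stand-in for "allocate blocks, then generate jobs".
  // Any exception thrown inside Try { ... } is captured as a Failure.
  def generate(batchTime: Long, fail: Boolean): Try[Seq[String]] = Try {
    if (fail) throw new IllegalStateException(s"cannot generate jobs for $batchTime")
    Seq(s"job-$batchTime-0")
  }

  // Mirrors the match in generateJobs: submit on success, report on failure.
  def submitOrReport(batchTime: Long, fail: Boolean): String =
    generate(batchTime, fail) match {
      case Success(jobs) => s"submitted ${jobs.size} job(s) for time $batchTime"
      case Failure(e) => s"Error generating jobs for time $batchTime: ${e.getMessage}"
    }
}
```

Because both steps sit inside the same Try block, a failure in either one ends up in the Failure branch and nothing is submitted for that batch.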

d. How is ReceiverTracker.allocateBlocksToBatch(time) implemented?

==> It allocates all unallocated blocks to the given batch

/** Allocate all unallocated blocks to the given batch. */
def allocateBlocksToBatch(batchTime: Time): Unit = {
  if (receiverInputStreams.nonEmpty) {
    receivedBlockTracker.allocateBlocksToBatch(batchTime)
  }
}

f. Back in ReceivedBlockTracker again:

The allocateBlocksToBatch method fills timeToAllocatedBlocks, a HashMap[Time, AllocatedBlocks]. The key is the batch time and the value is an AllocatedBlocks. AllocatedBlocks(streamIdToAllocatedBlocks: Map[Int, Seq[ReceivedBlockInfo]]) maps every receiverId of the current batch to its Seq[ReceivedBlockInfo].

private[streaming] class ReceivedBlockTracker(
    conf: SparkConf,
    hadoopConf: Configuration,
    streamIds: Seq[Int],
    clock: Clock,
    recoverFromWriteAheadLog: Boolean,
    checkpointDirOption: Option[String])
  extends Logging {

  private type ReceivedBlockQueue = mutable.Queue[ReceivedBlockInfo]
  private val streamIdToUnallocatedBlockQueues = new mutable.HashMap[Int, ReceivedBlockQueue]
  private val timeToAllocatedBlocks = new mutable.HashMap[Time, AllocatedBlocks]
  private var lastAllocatedBatchTime: Time = null
  // ...

  /**
   * Allocate all unallocated blocks to the given batch.
   * This event will get written to the write ahead log (if enabled).
   */
  def allocateBlocksToBatch(batchTime: Time): Unit = synchronized {
    if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) {
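      // Put every receiver id (the streamId is the receiver's id) and its ReceivedBlockInfo entries
      // into a Map[streamId, Seq[ReceivedBlockInfo]]
      val streamIdToBlocks = streamIds.map { streamId =>
          // 1. dequeueAll visits every element of the queue (each ReceivedBlockInfo, i.e. the block
          //    metadata) and passes it to the predicate. When the predicate returns true, the
          //    element is returned and removed from the queue.
          //    ==> Because elements are removed, it does not matter if a ReceivedBlockInfo of the
          //    next batch is already in the queue: it is handled as part of the current batch and
          //    then removed.
          // 2. The queue has data because the receiver first stores the data in the BlockManager
          //    via store(). The executor then sends AddBlock(ReceivedBlockInfo) to the
          //    ReceiverTrackerEndpoint on the driver, which appends the ReceivedBlockInfo to the
          //    value of a HashMap[Int, ReceivedBlockQueue] whose key is the receiverId.
          (streamId, getReceivedBlockQueue(streamId).dequeueAll(x => true))
      }.toMap
      // Wrap streamIdToBlocks (Map[streamId, Seq[ReceivedBlockInfo]]) in the per-batch class AllocatedBlocks
      val allocatedBlocks = AllocatedBlocks(streamIdToBlocks)
      // BatchAllocationEvent records that the blocks have been allocated to this batch (the data
      // itself is already in the BlockManager). It is the event that gets written to the WAL.
      // As far as I can tell, writeToLog returns true when the WAL is disabled, and returns false
      // only when writing to an enabled WAL fails.
      if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) {
        // timeToAllocatedBlocks is a HashMap[Time, AllocatedBlocks]
        timeToAllocatedBlocks.put(batchTime, allocatedBlocks)
        // lastAllocatedBatchTime is of type Time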
        lastAllocatedBatchTime = batchTime
      } else {
        logInfo(s"Possibly processed batch $batchTime need to be processed again in WAL recovery")
      }
    } else {
      // This situation occurs when:
      // 1. WAL is ended with BatchAllocationEvent, but without BatchCleanupEvent,
      // possibly processed batch job or half-processed batch job need to be processed again,
      // so the batchTime will be equal to lastAllocatedBatchTime.
      // 2. Slow checkpointing makes recovered batch time older than WAL recovered
      // lastAllocatedBatchTime.
      // This situation will only occurs in recovery time.
      logInfo(s"Possibly processed batch $batchTime need to be processed again in WAL recovery")
    }
  }
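The queue-draining behaviour is easier to see in a stripped-down model. The following is a sketch under simplifying assumptions: ToyBlockTracker and ToyBlockInfo are hypothetical, batch times are plain Long values, and there is no write ahead log. It only reproduces the "dequeue everything into the current batch" idea.

```scala
import scala.collection.mutable

object AllocationSketch {
  // Hypothetical, simplified stand-in for ReceivedBlockInfo.
  final case class ToyBlockInfo(streamId: Int, blockId: String)

  class ToyBlockTracker(streamIds: Seq[Int]) {
    private val unallocated = mutable.HashMap.empty[Int, mutable.Queue[ToyBlockInfo]]
    private val allocated = mutable.HashMap.empty[Long, Map[Int, Seq[ToyBlockInfo]]]
    private var lastAllocatedBatchTime: Option[Long] = None

    private def queueOf(streamId: Int): mutable.Queue[ToyBlockInfo] =
      unallocated.getOrElseUpdate(streamId, mutable.Queue.empty[ToyBlockInfo])

    // Called when a receiver reports a newly stored block.
    def addBlock(info: ToyBlockInfo): Unit = synchronized {
      queueOf(info.streamId) += info
    }

    // Drain every queue into the given batch; a later batch only sees newer blocks.
    def allocateBlocksToBatch(batchTime: Long): Unit = synchronized {
      if (lastAllocatedBatchTime.forall(batchTime > _)) {
        val streamIdToBlocks: Map[Int, Seq[ToyBlockInfo]] =
          streamIds.map(id => id -> queueOf(id).dequeueAll(_ => true).toList).toMap
        allocated.put(batchTime, streamIdToBlocks)
        lastAllocatedBatchTime = Some(batchTime)
      }
    }

    // Blocks allocated to the batch, or an empty Map if nothing was allocated.
    def getBlocksOfBatch(batchTime: Long): Map[Int, Seq[ToyBlockInfo]] = synchronized {
      allocated.getOrElse(batchTime, Map.empty[Int, Seq[ToyBlockInfo]])
    }
  }
}
```

Adding two blocks, allocating batch 40, adding a third block and allocating batch 80 leaves the first two blocks in batch 40 and only the third in batch 80, which is why no block is processed twice.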

5. So, back in ReceiverInputDStream,

receiverTracker.getBlocksOfBatch(time).getOrElse(id, Seq.empty) returns the Seq[ReceivedBlockInfo] for this receiverId, i.e. the metadata of the data that SocketReceiver stored in Spark

abstract class ReceiverInputDStream[T: ClassTag](ssc_ : StreamingContext)
  extends InputDStream[T](ssc_) {
  // ...
  override def compute(validTime: Time): Option[RDD[T]] = {
    val blockRDD = {
      if (validTime < graph.startTime) {
        // ...
      } else {
        // Otherwise, fetch the ReceivedBlockInfo entries of the current batch through the ReceiverTracker
        val receiverTracker = ssc.scheduler.receiverTracker
        // receiverTracker.getBlocksOfBatch(validTime) returns Map[receiverId, Seq[ReceivedBlockInfo]]:
        // every receiverId of the current batch together with its Seq[ReceivedBlockInfo].
        // The id of the InputDStream corresponds to the receiverId.
        val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)
        // Register the input blocks information into InputInfoTracker
        // The block information is wrapped in a StreamInputInfo: id is the id of this ReceiverInputDStream,
        // and the second argument is the total number of records stored in the BlockManager for this batch
        val inputInfo = StreamInputInfo(id, blockInfos.flatMap(_.numRecords).sum)
        ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

        // Create the BlockRDD
        createBlockRDD(validTime, blockInfos)
      }
    }
    Some(blockRDD)
  }

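6. Next, createBlockRDD is called. It creates a BlockRDD from the blocks of the current batch that are held in Spark's BlockManager

// Takes the current batch time and the corresponding Seq[ReceivedBlockInfo], i.e. the metadata of the blocks stored in Spark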
private[streaming] def createBlockRDD(time: Time, blockInfos: Seq[ReceivedBlockInfo]): RDD[T] = {

  if (blockInfos.nonEmpty) {
    // In this example this yields an Array[StreamBlockId(streamId: Int, uniqueId: Long)]
    val blockIds = blockInfos.map { _.blockId.asInstanceOf[BlockId] }.toArray

    // Are WAL record handles present with all the blocks
    val areWALRecordHandlesPresent = blockInfos.forall { _.walRecordHandleOption.nonEmpty }
    if (areWALRecordHandlesPresent) {
      // If all the blocks have WAL record handle, then create a WALBackedBlockRDD
      val isBlockIdValid = blockInfos.map { _.isBlockIdValid() }.toArray
      val walRecordHandles = blockInfos.map { _.walRecordHandleOption.get }.toArray
      new WriteAheadLogBackedBlockRDD[T](
        ssc.sparkContext, blockIds, walRecordHandles, isBlockIdValid)
    } else {
      // Else, create a BlockRDD. However, if there are some blocks with WAL info but not
      // others then that is unexpected and log a warning accordingly.
      if (blockInfos.find(_.walRecordHandleOption.nonEmpty).nonEmpty) {
        if (WriteAheadLogUtils.enableReceiverLog(ssc.conf)) {
          logError("Some blocks do not have Write Ahead Log information; " +
            "this is unexpected and data may not be recoverable after driver failures")
        } else {
          logWarning("Some blocks have Write Ahead Log information; this is unexpected")
        }
      }
      // Ask the BlockManagerMaster whether each StreamBlockId still exists in the cluster
      val validBlockIds = blockIds.filter { id =>
        ssc.sparkContext.env.blockManager.master.contains(id)
      }
      // If the recorded Array[StreamBlockId(streamId: Int, uniqueId: Long)] does not match what is in the cluster, log a warning
      if (validBlockIds.size != blockIds.size) {
        logWarning("Some blocks could not be recovered as they were not found in memory. " +
          "To prevent such data loss, enabled Write Ahead Log (see programming guide " +
          "for more details.")
      }
      // Create the BlockRDD from the Array[StreamBlockId(streamId: Int, uniqueId: Long)] that the cluster actually holds
      new BlockRDD[T](ssc.sc, validBlockIds)
    }
  } else {
    // If no block is ready now, creating WriteAheadLogBackedBlockRDD or BlockRDD
    // according to the configuration
    if (WriteAheadLogUtils.enableReceiverLog(ssc.conf)) {
      new WriteAheadLogBackedBlockRDD[T](
        ssc.sparkContext, Array.empty, Array.empty, Array.empty)
    } else {
      new BlockRDD[T](ssc.sc, Array.empty)
    }
  }
}
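The non-WAL branch boils down to "keep only the block ids the cluster still knows about, and warn if some are missing". A minimal sketch of that filter follows; validBlocks is a hypothetical helper, block ids are plain strings, and a Set stands in for the BlockManagerMaster lookup.

```scala
object ValidBlockSketch {
  // recorded:  block ids tracked on the driver for this batch
  // inCluster: hypothetical stand-in for BlockManagerMaster.contains
  // Returns the ids that can still be read, plus a flag telling whether any were lost.
  def validBlocks(recorded: Seq[String], inCluster: Set[String]): (Seq[String], Boolean) = {
    val valid = recorded.filter(inCluster.contains)
    (valid, valid.size != recorded.size)
  }
}
```

When every recorded id is still present the flag is false; as soon as one id is missing the flag is true, which corresponds to the logWarning call in createBlockRDD.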

This completes the flow in which ReceiverInputDStream periodically fetches the data that SocketReceiver previously stored in Spark's BlockManager.
