A Thorough Study of the Full Lifecycle of RDD Generation (Part 8)

A DStream is a template for RDDs: every batchInterval, a corresponding RDD is generated from the DStream template. In essence, a DStream is a collection of RDDs with a time dimension added, which we can also see from the generatedRDDs data structure inside DStream:

 // RDDs generated, marked as private[streaming] so that testsuites can access it
  @transient
  private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]] ()

Next, let's look at how RDDs are generated through a simple Spark Streaming example.

val lines = ssc.socketTextStream("localhost", 9999)
// lines is a SocketInputDStream
val words = lines.flatMap(_.split(" "))
// SocketInputDStream --> FlatMappedDStream
val pairs = words.map(word => (word, 1))
// FlatMappedDStream --> MappedDStream
val wordCounts = pairs.reduceByKey(_ + _)
// MappedDStream --> ShuffledDStream
wordCounts.print()

This simple example shows how DStreams are transformed into one another. Now let's start from the source code behind wordCounts.print() to uncover how RDDs are generated.
Since this code has already been analyzed in earlier posts, here we only focus on the key parts.

  private def foreachRDD(
      foreachFunc: (RDD[T], Time) => Unit,
      displayInnerRDDOps: Boolean): Unit = {
    new ForEachDStream(this,
      context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
  }

Let's briefly walk through the call path:

wordCounts.print();
    -->  print(10)
            -->foreachRDD
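
For reference, the two print overloads in DStream look roughly like this in the Spark 1.x source (exact details may differ slightly between versions): print() simply delegates to print(10), which builds a foreachFunc that takes the first few elements of each batch's RDD and hands it to foreachRDD.

  def print(): Unit = ssc.withScope {
    print(10)
  }

  def print(num: Int): Unit = ssc.withScope {
    def foreachFunc: (RDD[T], Time) => Unit = {
      (rdd: RDD[T], time: Time) => {
        val firstNum = rdd.take(num + 1)
        println("-------------------------------------------")
        println(s"Time: $time")
        println("-------------------------------------------")
        firstNum.take(num).foreach(println)
        if (firstNum.length > num) println("...")
        println()
      }
    }
    foreachRDD(context.sparkContext.clean(foreachFunc), displayInnerRDDOps = false)
  }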

That is the call path from print() to foreachRDD. Next, let's look at the register() method:

/**
   * Register this streaming as an output stream. This would ensure that RDDs of this
   * DStream will be generated.
   */
  private[streaming] def register(): DStream[T] = {
    ssc.graph.addOutputStream(this)
    this
  }

This logic has also been covered earlier, but let's summarize it briefly here. For an output DStream (here a ForEachDStream), calling generateJob may produce not only a Job but also an RDD. The first thing generateJob does is call getOrCompute, defined in the DStream base class, on its parent DStream.
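
As a reference point, ForEachDStream.generateJob looks roughly like this in the Spark 1.x source (minor details may vary by version): it first asks its parent DStream for this batch's RDD via getOrCompute, and only if an RDD is produced does it wrap foreachFunc into a Job.

  override def generateJob(time: Time): Option[Job] = {
    parent.getOrCompute(time) match {
      case Some(rdd) =>
        val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
          foreachFunc(rdd, time)
        }
        Some(new Job(time, jobFunc))
      case None => None
    }
  }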

/**
   * Get the RDD corresponding to the given time; either retrieve it from cache
   * or compute-and-cache it.
   */
  private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
    // If RDD was already generated, then retrieve it from HashMap,
    // or else compute the RDD
    generatedRDDs.get(time).orElse {
      // Compute the RDD if time is valid (e.g. correct time in a sliding window)
      // of RDD generation, else generate nothing.
      if (isTimeValid(time)) {

        val rddOption = createRDDWithLocalProperties(time, displayInnerRDDOps = false) {
          // Disable checks for existing output directories in jobs launched by the streaming
          // scheduler, since we may need to write output to an existing directory during checkpoint
          // recovery; see SPARK-4835 for more details. We need to have this call here because
          // compute() might cause Spark jobs to be launched.
          PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
            compute(time)
          }
        }

        rddOption.foreach { case newRDD =>
          // Register the generated RDD for caching and checkpointing
          if (storageLevel != StorageLevel.NONE) {
            newRDD.persist(storageLevel)
            logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")
          }
          if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
            newRDD.checkpoint()
            logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
          }
          generatedRDDs.put(time, newRDD)
        }
        rddOption
      } else {
        None
      }
    }
  }

Here we finally see where the RDD is produced. This is the last DStream in the Spark Streaming program, since DStream.print() is an output (action-like) operation. Does executing an action on an RDD produce a new RDD? It does not, and likewise a DStream does not produce a new DStream after an output operation, because a DStream is itself a template for RDDs. The RDD a DStream records for each batchDuration is in fact the last RDD in that batch's lineage; we can trace backwards through the dependencies between DStreams to form a computation chain. This traceback also involves the transformations between DStreams, and each such transformation necessarily calls the DStream's compute method. So what does compute actually do?
Take FlatMappedDStream's compute method as an example:

override def compute(validTime: Time): Option[RDD[U]] = {
    parent.getOrCompute(validTime).map(_.flatMap(flatMapFunc))
  }

Again we can see that it first calls getOrCompute on its parent DStream, and getOrCompute is where the new RDD is produced. RDDs generated this way come from transformations between DStreams, and such DStreams have dependency relationships with one another.
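
To make that dependency relationship explicit, here is FlatMappedDStream in its entirety, roughly as it appears in the Spark source. Note how it declares its parent in dependencies; this is what allows the traceback along the DStream chain described above.

private[streaming]
class FlatMappedDStream[T: ClassTag, U: ClassTag](
    parent: DStream[T],
    flatMapFunc: T => TraversableOnce[U]
  ) extends DStream[U](parent.ssc) {

  override def dependencies: List[DStream[_]] = List(parent)

  override def slideDuration: Duration = parent.slideDuration

  override def compute(validTime: Time): Option[RDD[U]] = {
    parent.getOrCompute(validTime).map(_.flatMap(flatMapFunc))
  }
}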

Next let's look at SocketInputDStream. As the source of the data, SocketInputDStream itself has no upstream dependencies:

override def compute(validTime: Time): Option[RDD[T]] = {
    val blockRDD = {

      if (validTime < graph.startTime) {
        // If this is called for any time before the start time of the context,
        // then this returns an empty RDD. This may happen when recovering from a
        // driver failure without any write ahead log to recover pre-failure data.
        new BlockRDD[T](ssc.sc, Array.empty)
      } else {
        // Otherwise, ask the tracker for all the blocks that have been allocated to this stream
        // for this batch
        val receiverTracker = ssc.scheduler.receiverTracker
        val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)

        // Register the input blocks information into InputInfoTracker
        val inputInfo = StreamInputInfo(id, blockInfos.flatMap(_.numRecords).sum)
        ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

        // Create the BlockRDD
        createBlockRDD(validTime, blockInfos)
      }
    }
    Some(blockRDD)
  }

As the first DStream in a Spark Streaming program, it has no dependencies, so its compute method does not call a parent's getOrCompute. Instead, it runs its own logic: based on the block metadata collected by the ReceiverTracker for this batch, it builds the concrete RDD (a BlockRDD).
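
The actual createBlockRDD in ReceiverInputDStream also handles write-ahead-log-backed blocks and validates that the reported blocks still exist. A greatly condensed sketch of the non-WAL path (the method name and types follow the Spark source; the body is simplified here) looks like this:

  private[streaming] def createBlockRDD(time: Time, blockInfos: Seq[ReceivedBlockInfo]): RDD[T] = {
    // In the simple (non-WAL) case, the block ids reported by the ReceiverTracker
    // for this batch are wrapped directly into a BlockRDD.
    val blockIds = blockInfos.map(_.blockId.asInstanceOf[BlockId]).toArray
    new BlockRDD[T](ssc.sc, blockIds)
  }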
