A DStream is a template for RDDs: every batchInterval, a new RDD is generated from the DStream template. In effect, a DStream is a collection of RDDs with a time dimension added. This is visible in the generatedRDDs data structure inside DStream:
// RDDs generated, marked as private[streaming] so that testsuites can access it
@transient
private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()
Let's look at how RDDs are generated, starting from a simple Spark Streaming word-count example (written here against the user-facing API, with the underlying DStream class noted for each step):
// `ssc` is an existing StreamingContext
val lines = ssc.socketTextStream("localhost", 9999)
// lines is a SocketInputDStream
val words = lines.flatMap(_.split(" "))
// SocketInputDStream --> FlatMappedDStream
val pairs = words.map(word => (word, 1))
// FlatMappedDStream --> MappedDStream
val wordCounts = pairs.reduceByKey(_ + _)
// MappedDStream --> ShuffledDStream
wordCounts.print()
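Each of these user-facing operators simply constructs the corresponding DStream subclass. For example, DStream.flatMap in Spark's source is roughly the following (lightly abridged; details vary by version):

def flatMap[U: ClassTag](flatMapFunc: T => TraversableOnce[U]): DStream[U] = ssc.withScope {
  // Wrap this DStream as the parent of a new FlatMappedDStream
  new FlatMappedDStream(this, context.sparkContext.clean(flatMapFunc))
}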
This simple example shows DStreams being transformed into one another. Now let's trace the source code behind wordCounts.print() to see how the RDDs are generated. Since this code was analyzed in an earlier section, we will focus only on the key points here:
private def foreachRDD(
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean): Unit = {
  new ForEachDStream(this,
    context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
}
Let's briefly trace the call chain:

wordCounts.print()
--> print(10)
--> foreachRDD
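For reference, print(num) builds a function that takes the first num records of each batch's RDD and prints them, then hands that function to foreachRDD. Lightly condensed from DStream.scala (details vary across Spark versions):

def print(num: Int): Unit = ssc.withScope {
  def foreachFunc: (RDD[T], Time) => Unit = {
    (rdd: RDD[T], time: Time) => {
      // Take one extra record so we know whether to print "..."
      val firstNum = rdd.take(num + 1)
      println("-------------------------------------------")
      println(s"Time: $time")
      println("-------------------------------------------")
      firstNum.take(num).foreach(println)
      if (firstNum.length > num) println("...")
      println()
    }
  }
  foreachRDD(context.sparkContext.clean(foreachFunc), displayInnerRDDOps = false)
}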
That is the call path from print() to foreachRDD. Next, let's look at the register() method:
/**
 * Register this streaming as an output stream. This would ensure that RDDs of this
 * DStream will be generated.
 */
private[streaming] def register(): DStream[T] = {
  ssc.graph.addOutputStream(this)
  this
}
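Registration matters because, at every batch interval, the JobGenerator asks the DStreamGraph to produce jobs for each registered output stream. Roughly, from DStreamGraph.generateJobs (paraphrased; logging omitted, details vary by version):

def generateJobs(time: Time): Seq[Job] = {
  this.synchronized {
    // Ask every registered output stream (e.g. our ForEachDStream)
    // to generate a Job for this batch time
    outputStreams.flatMap { outputStream =>
      val jobOption = outputStream.generateJob(time)
      jobOption.foreach(_.setCallSite(outputStream.creationSite))
      jobOption
    }
  }
}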
As explained earlier, when generateJob is called on an output DStream (here, ForEachDStream), it may produce not only a Job but also the RDDs behind it; the first thing generateJob does is call getOrCompute, which is defined in the parent class DStream.
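For reference, ForEachDStream.generateJob looks roughly like this (paraphrased from ForEachDStream.scala; details vary by version). Note how the batch's RDD is materialized via parent.getOrCompute before the Job is even run:

override def generateJob(time: Time): Option[Job] = {
  parent.getOrCompute(time) match {
    case Some(rdd) =>
      // Wrap the user's foreachFunc and the batch's RDD into a Job;
      // the RDD itself was just created (or fetched) by getOrCompute
      val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
        foreachFunc(rdd, time)
      }
      Some(new Job(time, jobFunc))
    case None => None
  }
}

getOrCompute itself looks like this: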
/**
 * Get the RDD corresponding to the given time; either retrieve it from cache
 * or compute-and-cache it.
 */
private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
  // If RDD was already generated, then retrieve it from HashMap,
  // or else compute the RDD
  generatedRDDs.get(time).orElse {
    // Compute the RDD if time is valid (e.g. correct time in a sliding window)
    // of RDD generation, else generate nothing.
    if (isTimeValid(time)) {
      val rddOption = createRDDWithLocalProperties(time, displayInnerRDDOps = false) {
        // Disable checks for existing output directories in jobs launched by the streaming
        // scheduler, since we may need to write output to an existing directory during checkpoint
        // recovery; see SPARK-4835 for more details. We need to have this call here because
        // compute() might cause Spark jobs to be launched.
        PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
          compute(time)
        }
      }

      rddOption.foreach { case newRDD =>
        // Register the generated RDD for caching and checkpointing
        if (storageLevel != StorageLevel.NONE) {
          newRDD.persist(storageLevel)
          logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")
        }
        if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
          newRDD.checkpoint()
          logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
        }
        generatedRDDs.put(time, newRDD)
      }
      rddOption
    } else {
      None
    }
  }
}
Here we finally see an RDD being produced. This happens at the last DStream in the program, because print() is an output (action-like) operation. Just as calling an action on an RDD does not produce a new RDD, an output operation on a DStream does not produce a new DStream. Since a DStream is the template for its RDDs, the RDD it records for each batchDuration is the last RDD in that batch's lineage; by walking back through the dependencies between DStreams we recover the whole compute chain. Every DStream-to-DStream transformation goes through the compute method, so what does compute actually do?
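Before looking at a concrete implementation, it helps to see how the recursion unwinds for one batch time t of the word-count example above. This is a conceptual trace, not real code:

ForEachDStream.generateJob(t)
  --> ShuffledDStream.getOrCompute(t)          --> compute(t)
    --> MappedDStream.getOrCompute(t)          --> compute(t)
      --> FlatMappedDStream.getOrCompute(t)    --> compute(t)
        --> SocketInputDStream.getOrCompute(t) --> compute(t)  // builds the BlockRDD

Each compute call wraps the parent's RDD in a new transformed RDD, so the RDD lineage of the batch mirrors the DStream graph.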
Take FlatMappedDStream's compute method as an example:
override def compute(validTime: Time): Option[RDD[U]] = {
  parent.getOrCompute(validTime).map(_.flatMap(flatMapFunc))
}
Again, the first call is to getOrCompute on the parent DStream; the new RDD for this batch is produced by transforming the parent's RDD, which is exactly where the dependency between DStreams shows up.
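For context, here is the whole FlatMappedDStream class (as it appears in Spark's source, modulo version differences). Note the dependencies method, which is what makes the backtracking described above possible:

private[streaming]
class FlatMappedDStream[T: ClassTag, U: ClassTag](
    parent: DStream[T],
    flatMapFunc: T => TraversableOnce[U]
  ) extends DStream[U](parent.ssc) {

  // The parent DStream is the only dependency of this DStream
  override def dependencies: List[DStream[_]] = List(parent)

  override def slideDuration: Duration = parent.slideDuration

  override def compute(validTime: Time): Option[RDD[U]] = {
    parent.getOrCompute(validTime).map(_.flatMap(flatMapFunc))
  }
}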
Now let's look at SocketInputDStream. As the source of the data, it has no parent DStream to depend on; its compute method (inherited from ReceiverInputDStream) is:
override def compute(validTime: Time): Option[RDD[T]] = {
  val blockRDD = {
    if (validTime < graph.startTime) {
      // If this is called for any time before the start time of the context,
      // then this returns an empty RDD. This may happen when recovering from a
      // driver failure without any write ahead log to recover pre-failure data.
      new BlockRDD[T](ssc.sc, Array.empty)
    } else {
      // Otherwise, ask the tracker for all the blocks that have been allocated to this stream
      // for this batch
      val receiverTracker = ssc.scheduler.receiverTracker
      val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)

      // Register the input blocks information into InputInfoTracker
      val inputInfo = StreamInputInfo(id, blockInfos.flatMap(_.numRecords).sum)
      ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

      // Create the BlockRDD
      createBlockRDD(validTime, blockInfos)
    }
  }
  Some(blockRDD)
}
As the first DStream in a Spark Streaming program, it has no dependencies; its compute does not delegate to a parent but runs its own logic, building the concrete RDD from the block metadata the ReceiverTracker has collected for this batch.
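The final step is createBlockRDD, which decides what kind of RDD to build from that block metadata. A condensed sketch of ReceiverInputDStream.createBlockRDD (paraphrased; error handling and logging omitted, details vary by version):

private[streaming] def createBlockRDD(
    time: Time, blockInfos: Seq[ReceivedBlockInfo]): RDD[T] = {
  if (blockInfos.nonEmpty) {
    val blockIds = blockInfos.map(_.blockId.asInstanceOf[BlockId]).toArray
    if (blockInfos.forall(_.walRecordHandleOption.nonEmpty)) {
      // Every block is backed by the write ahead log, so build a
      // WriteAheadLogBackedBlockRDD that can re-read the data after a driver failure
      new WriteAheadLogBackedBlockRDD[T](
        ssc.sparkContext,
        blockIds,
        blockInfos.map(_.walRecordHandleOption.get).toArray,
        blockInfos.map(_.isBlockIdValid()).toArray)
    } else {
      // Otherwise build a plain BlockRDD over the blocks that are still
      // present in the executors' block managers
      val validBlockIds = blockIds.filter { id =>
        ssc.sparkContext.env.blockManager.master.contains(id)
      }
      new BlockRDD[T](ssc.sc, validBlockIds)
    }
  } else {
    // No blocks were reported for this batch: return an empty RDD
    new BlockRDD[T](ssc.sc, Array.empty)
  }
}

Either way, a batch's RDD now exists and is recorded in generatedRDDs for this time, and the transformed DStreams up the chain wrap it step by step into the final RDD that the output Job acts on.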