Spark 2.2.0 Source Code Reading --- spark core package --- partial/rdd

danlial

Article link: https://blog.csdn.net/dianlial/article/details/80436931

1. Goal of this article and other notes:

This article introduces the classes in the partial and rdd packages.

2. Data structures in the partial package

private[spark] trait ApproximateEvaluator[U, R] {
  def merge(outputId: Int, taskResult: U): Unit
  def currentResult(): R
}

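The two methods of this interface incrementally merge the results produced by different tasks. merge is called once each time a task finishes, and currentResult returns the result built from everything merged so far.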
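To make the merge/currentResult contract concrete, here is a minimal standalone sketch. The trait is copied without the private[spark] modifier so that it compiles outside Spark, and NaiveCountEvaluator is a hypothetical evaluator written for this note, not a class from the Spark source: it simply scales the partial sum by the fraction of tasks seen so far.

```scala
// Same shape as the trait above, minus private[spark] so it compiles standalone.
trait ApproximateEvaluator[U, R] {
  def merge(outputId: Int, taskResult: U): Unit
  def currentResult(): R
}

// Hypothetical evaluator: sums per-task counts and extrapolates to all tasks.
class NaiveCountEvaluator(totalOutputs: Int) extends ApproximateEvaluator[Long, Double] {
  private var outputsMerged = 0 // number of task results merged so far
  private var sum = 0L          // running total of the per-task counts

  override def merge(outputId: Int, taskResult: Long): Unit = {
    outputsMerged += 1
    sum += taskResult
  }

  // Scale the partial sum by totalOutputs / outputsMerged; 0.0 before any merge.
  override def currentResult(): Double =
    if (outputsMerged == 0) 0.0
    else sum.toDouble * totalOutputs / outputsMerged
}

object EvaluatorDemo {
  def main(args: Array[String]): Unit = {
    val evaluator = new NaiveCountEvaluator(totalOutputs = 4)
    evaluator.merge(0, 10L) // task 0 counted 10 elements
    evaluator.merge(1, 30L) // task 1 counted 30 elements
    println(evaluator.currentResult()) // estimate based on 2 of 4 tasks
  }
}
```

currentResult can be called at any point, which is what allows a caller to ask for an answer before every task has reported.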

private[spark] class ApproximateActionListener[T, U, R](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    evaluator: ApproximateEvaluator[U, R],
    timeout: Long)
  extends JobListener {

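Returns a result within the specified timeout. The result may cover all partitions, or only the partitions whose tasks finished within the time limit (hence the package name partial).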
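The "answer within a deadline" behaviour can be sketched with plain JVM monitors. TimedCollector below is a hypothetical stand-in for the listener, written only to illustrate the idea; it does not reproduce the JobListener API or the real class body.

```scala
// Hypothetical collector: gathers task results and answers when the deadline passes.
class TimedCollector(totalTasks: Int, timeoutMs: Long) {
  private val startTime = System.currentTimeMillis()
  private var finished = 0
  private var sum = 0L

  // Called once per finished task; wakes the waiter when the last task reports.
  def taskSucceeded(result: Long): Unit = synchronized {
    finished += 1
    sum += result
    if (finished == totalTasks) notifyAll()
  }

  // Blocks until all tasks finished or the timeout elapsed.
  // Returns the sum so far and a flag telling whether it is the final value.
  def awaitResult(): (Long, Boolean) = synchronized {
    val deadline = startTime + timeoutMs
    var remaining = deadline - System.currentTimeMillis()
    while (finished < totalTasks && remaining > 0) {
      wait(remaining) // may wake early, so the loop re-checks both conditions
      remaining = deadline - System.currentTimeMillis()
    }
    (sum, finished == totalTasks)
  }
}
```

The flag in the returned pair plays the role of "is this the complete answer or only a partial one".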

class BoundedDouble(val mean: Double, val confidence: Double, val low: Double, val high: Double) {

This class wraps mean, confidence, low, and high. Its equals method returns true only when all four values are equal.
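An equality check of that kind can be written as follows. BoundedDoubleSketch is a name chosen for this illustration, and the hashCode shown is one reasonable choice that stays consistent with equals, not necessarily the one used in Spark.

```scala
// Illustrative value holder: two instances are equal only if all four fields match.
class BoundedDoubleSketch(val mean: Double, val confidence: Double,
                          val low: Double, val high: Double) {

  override def equals(that: Any): Boolean = that match {
    case b: BoundedDoubleSketch =>
      mean == b.mean && confidence == b.confidence && low == b.low && high == b.high
    case _ => false
  }

  // Derived from the same four fields so that equal objects share a hash code.
  override def hashCode: Int = (mean, confidence, low, high).hashCode

  override def toString: String = s"[$low, $high] (mean=$mean, confidence=$confidence)"
}
```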

private[spark] class CountEvaluator(totalOutputs: Int, confidence: Double)
  extends ApproximateEvaluator[Long, BoundedDouble] {

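Effectively returns the number of elements: the per-task counts are merged into an estimate of the total, reported as a BoundedDouble.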

private[spark] class GroupedCountEvaluator[T : ClassTag](totalOutputs: Int, confidence: Double)
  extends ApproximateEvaluator[OpenHashMap[T, Long], Map[T, BoundedDouble]] {

Accumulates the counts separately for each key.
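The per-key accumulation can be sketched with an ordinary mutable map. NaiveGroupedCounter is a hypothetical class for this note; it uses scala.collection.mutable.Map in place of Spark's OpenHashMap and leaves out the confidence bounds entirely.

```scala
import scala.collection.mutable

// Hypothetical per-key counter: each task contributes a map of key -> count.
class NaiveGroupedCounter[T] {
  private val sums = mutable.Map.empty[T, Long]

  // Add one task's counts into the running totals, key by key.
  def merge(taskResult: Map[T, Long]): Unit =
    taskResult.foreach { case (key, count) =>
      sums(key) = sums.getOrElse(key, 0L) + count
    }

  // Immutable snapshot of the totals merged so far.
  def currentResult(): Map[T, Long] = sums.toMap
}
```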

private[spark] class MeanEvaluator(totalOutputs: Int, confidence: Double)
  extends ApproximateEvaluator[StatCounter, BoundedDouble] {

Returns the mean of the results.

private[spark] class SumEvaluator(totalOutputs: Int, confidence: Double)
  extends ApproximateEvaluator[StatCounter, BoundedDouble] {

Returns the sum of the elements.
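Both evaluators above receive one StatCounter per task, which suggests why mean and sum can share the same merging logic: summary statistics from separate partitions combine cheaply. The sketch below uses a hypothetical PartialStats case class holding only a count and a sum; the real StatCounter tracks more than that.

```scala
// Hypothetical per-partition summary: enough to recover both the sum and the mean.
final case class PartialStats(count: Long, sum: Double) {
  // Combining two summaries is just field-wise addition.
  def merge(other: PartialStats): PartialStats =
    PartialStats(count + other.count, sum + other.sum)

  // NaN for an empty summary, otherwise the arithmetic mean.
  def mean: Double = if (count == 0) Double.NaN else sum / count
}

object PartialStatsDemo {
  def main(args: Array[String]): Unit = {
    val perTask = Seq(PartialStats(2, 3.0), PartialStats(3, 12.0))
    val merged = perTask.reduce(_ merge _)
    println(s"sum=${merged.sum}, mean=${merged.mean}")
  }
}
```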

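3. Data structures in the rdd package

This package contains many concrete RDD implementations that all follow the same pattern, so only a couple of them are examined here.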

private[spark] class BlockRDDPartition(val blockId: BlockId, idx: Int) extends Partition {
  val index = idx
}

The partition type of BlockRDD. blockId identifies the data block that this partition refers to (its parent block). idx is the index of this partition within the BlockRDD.

private[spark]
class BlockRDD[T: ClassTag](sc: SparkContext, @transient val blockIds: Array[BlockId])
  extends RDD[T](sc, Nil) {

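@transient val blockIds: Array[BlockId] is the source of the data blocks, which can be thought of as the data of the parent RDD. Concrete RDD implementations typically override three methods: getPartitions / compute / getPreferredLocations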
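The three-method pattern can be seen in isolation in the toy model below. MiniRDD, ChunkPartition, and ChunkRDD are hypothetical names invented for this note, and TaskContext is left out; the point is only the division of labour between listing partitions, computing one partition, and reporting where its data lives.

```scala
trait Partition { def index: Int }

// Toy base class: subclasses describe their partitions and how to compute each one.
abstract class MiniRDD[T] {
  protected def getPartitions: Array[Partition]
  def compute(split: Partition): Iterator[T]
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil

  final def partitions: Array[Partition] = getPartitions
  final def preferredLocations(split: Partition): Seq[String] = getPreferredLocations(split)
  def collect(): Seq[T] = partitions.toSeq.flatMap(p => compute(p))
}

// Partition that points at a named chunk, similar in spirit to BlockRDDPartition.
class ChunkPartition(val index: Int, val chunkId: String) extends Partition

// Toy RDD backed by in-memory chunks; hosts maps a chunk id to the nodes holding it.
class ChunkRDD[T](chunks: Map[String, Seq[T]], hosts: Map[String, Seq[String]])
  extends MiniRDD[T] {

  private val chunkIds = chunks.keys.toArray.sorted

  // One partition per chunk, numbered in sorted chunk-id order.
  override protected def getPartitions: Array[Partition] =
    chunkIds.zipWithIndex.map { case (id, i) => new ChunkPartition(i, id): Partition }

  // Look up the chunk referenced by the partition and iterate over its elements.
  override def compute(split: Partition): Iterator[T] = {
    val id = split.asInstanceOf[ChunkPartition].chunkId
    chunks.getOrElse(id, throw new NoSuchElementException(s"chunk $id not found")).iterator
  }

  // Hosts that hold the chunk, or Nil when nothing is known.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    hosts.getOrElse(split.asInstanceOf[ChunkPartition].chunkId, Nil)
}
```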

override def getPartitions: Array[Partition] = {
  assertValid()
  (0 until blockIds.length).map { i =>
    new BlockRDDPartition(blockIds(i), i).asInstanceOf[Partition]
  }.toArray
}

Generates all of the partitions for the current RDD, one per block ID.

override def compute(split: Partition, context: TaskContext): Iterator[T] = {
  assertValid()
  val blockManager = SparkEnv.get.blockManager
  val blockId = split.asInstanceOf[BlockRDDPartition].blockId
  blockManager.get[T](blockId) match {
    case Some(block) => block.data.asInstanceOf[Iterator[T]]
    case None =>
      throw new Exception(s"Could not compute split, block $blockId of RDD $id not found")
  }
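}

split is the partition to compute (a BlockRDDPartition, which references the underlying block); context is the execution context of the task that computes it.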

override def getPreferredLocations(split: Partition): Seq[String] = {
  assertValid()
  _locations(split.asInstanceOf[BlockRDDPartition].blockId)
}

Returns the preferred locations, that is, where the block referenced by the partition is physically stored.

private[spark] case class NarrowCoGroupSplitDep(
    @transient rdd: RDD[_],
    @transient splitIndex: Int,
    var split: Partition
  ) extends Serializable {

  @throws(classOf[IOException])
  private def writeObject(oos: ObjectOutputStream): Unit = Utils.tryOrIOException {
    // Update the reference to parent split at the time of task serialization
    split = rdd.partitions(splitIndex)
    oos.defaultWriteObject()
  }
}

Holds a single partition, which is a partition of the parent RDD.

private[spark] class CoGroupPartition(
    override val index: Int, val narrowDeps: Array[Option[NarrowCoGroupSplitDep]])
  extends Partition with Serializable {
  override def hashCode(): Int = index
  override def equals(other: Any): Boolean = super.equals(other)
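}

Represents one partition of CoGroupedRDD. narrowDeps has one entry per parent RDD: Some(...) for a parent reached through a narrow dependency, None for a parent reached through a shuffle.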

class CoGroupedRDD[K: ClassTag](
    @transient var rdds: Seq[RDD[_ <: Product2[K, _]]],
    part: Partitioner)
  extends RDD[(K, Array[Iterable[_]])](rdds.head.context, Nil) {

This is the RDD produced by the cogroup operator.

override def getDependencies: Seq[Dependency[_]] = {
  rdds.map { rdd: RDD[_] =>
    if (rdd.partitioner == Some(part)) {
      logDebug("Adding one-to-one dependency with " + rdd)
      new OneToOneDependency(rdd)
    } else {
      logDebug("Adding shuffle dependency with " + rdd)
      new ShuffleDependency[K, Any, CoGroupCombiner](
        rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
    }
  }
}

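Returns the dependencies (lineage) of this RDD: a one-to-one dependency for each parent that already uses the same partitioner, and a shuffle dependency for every other parent.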
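The decision made in getDependencies comes down to comparing each parent's partitioner with the target partitioner. The sketch below isolates that comparison; ParentRDD, HashPartitionerSketch, and the DependencyKind objects are hypothetical stand-ins rather than Spark types.

```scala
// Stand-ins for the two kinds of dependency chosen in getDependencies.
sealed trait DependencyKind
case object OneToOne extends DependencyKind
case object Shuffle extends DependencyKind

// Hypothetical partitioner: two instances are equal when numPartitions matches.
final case class HashPartitionerSketch(numPartitions: Int)

// Hypothetical parent RDD descriptor: a name and an optional partitioner.
final case class ParentRDD(name: String, partitioner: Option[HashPartitionerSketch])

object DependencyChoice {
  // A parent already partitioned by `part` can be read in place (narrow);
  // any other parent has to be repartitioned (shuffle).
  def dependenciesFor(parents: Seq[ParentRDD],
                      part: HashPartitionerSketch): Seq[DependencyKind] =
    parents.map { parent =>
      if (parent.partitioner == Some(part)) OneToOne else Shuffle
    }
}
```

A parent without a partitioner, or with a different one, ends up on the shuffle path, which matches the else branch in the code above.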

override def getPartitions: Array[Partition] = {
  val array = new Array[Partition](part.numPartitions)
  for (i <- 0 until array.length) {
    // Each CoGroupPartition will have a dependency per contributing RDD
    array(i) = new CoGroupPartition(i, rdds.zipWithIndex.map { case (rdd, j) =>
      // Assume each RDD contributed a single dependency, and get it
      dependencies(j) match {
        case s: ShuffleDependency[_, _, _] =>
          None
        case _ =>
          Some(new NarrowCoGroupSplitDep(rdd, i, rdd.partitions(i)))
      }
    }.toArray)
  }
  array
}

Generates the partitions of the current CoGroupedRDD.

override def compute(s: Partition, context: TaskContext): Iterator[(K, Array[Iterable[_]])] = {

Computes the data of partition s.

class DoubleRDDFunctions(self: RDD[Double]) extends Logging with Serializable {

This class provides functions such as sum, mean, variance, and standard deviation. RDDs gain these functions through an implicit conversion.
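The "extra methods through an implicit conversion" mechanism can be demonstrated on a plain Seq[Double]. NumbersFunctions and NumbersImplicits are hypothetical names for this sketch; in Spark the corresponding conversions are, as far as I recall, defined in the RDD companion object, so no extra import is needed there.

```scala
import scala.language.implicitConversions

// Wrapper that adds statistics to a Seq[Double], in the style of DoubleRDDFunctions.
class NumbersFunctions(self: Seq[Double]) {
  def mean: Double = self.sum / self.size

  // Population variance: the average squared distance from the mean.
  def variance: Double = {
    val m = mean
    self.map(x => (x - m) * (x - m)).sum / self.size
  }

  def stdev: Double = math.sqrt(variance)
}

object NumbersImplicits {
  // With this in scope, the compiler wraps a Seq[Double] whenever a missing method is called.
  implicit def seqToNumbersFunctions(xs: Seq[Double]): NumbersFunctions =
    new NumbersFunctions(xs)
}

object ImplicitDemo {
  import NumbersImplicits._

  def main(args: Array[String]): Unit = {
    val xs = Seq(1.0, 2.0, 3.0, 4.0)
    // mean and variance are not members of Seq; the implicit conversion supplies them.
    println(s"mean=${xs.mean}, variance=${xs.variance}")
  }
}
```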

class OrderedRDDFunctions[K : Ordering : ClassTag,
                          V: ClassTag,
                          P <: Product2[K, V] : ClassTag] @DeveloperApi() (
    self: RDD[P])
  extends Logging with Serializable {

The sorting helper for RDDs. It also extends RDD through an implicit conversion and is intended for data in key-value pair form.

class PairRDDFunctions[K, V](self: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null)
  extends Logging with Serializable {

Dedicated to key-value data. It provides the pair operators, which RDDs gain through an implicit conversion.

private[spark] class ReliableRDDCheckpointData[T: ClassTag](@transient private val rdd: RDD[T])
  extends RDDCheckpointData[T](rdd) with Logging {

Writes the data to the checkpoint location, which is an external storage system such as HDFS.
