How Spark RDD Operators Such as count, sample, coalesce, distinct and order by Are Implemented

编程小王子啊

Article link: https://blog.csdn.net/u012361112/article/details/120454521

I. Overview of RDD operators

II. How RDD operators are implemented

1. How map, filter, flatMap and mapPartitions work

2. How combineByKey, reduceByKey and groupByKey work

3. How coalesce and repartition work

4. How count works

5. How sortByKey works

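Preface

When we write Spark code to process data, most of the work consists of calling Spark APIs to transform the data and then collecting the final result. These API functions are called operators (operations).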

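I. Overview of RDD operators

Spark RDD operators fall into the following 3 categories:

Non-shuffle transformation operators, represented by map, filter and flatMap. They do not trigger any RDD computation; they only turn one RDD into another, and the two RDDs are linked by a narrow dependency (NarrowDependency).
Shuffle transformation operators, represented by groupByKey, reduceByKey and repartition. They also turn one RDD into another, but they introduce a shuffle, and the two RDDs are linked by a wide dependency (ShuffleDependency).
Action operators, represented by count, take, collect and saveAsTextFile. Calling any of them ends up calling SparkContext.runJob(), which triggers the actual computation of the RDD. This is the direct expression of Spark's lazy evaluation: a job is computed only when a result is requested.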

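II. How RDD operators are implemented

1. How map, filter, flatMap and mapPartitions work

All of these methods are defined in the RDD class (RDD.scala). They are four simple transformation methods without a shuffle, and each of them turns the current RDD into a MapPartitionsRDD. The two RDDs are linked by a narrow dependency, as pictured below:

[Figure: narrow dependency between the parent RDD and the MapPartitionsRDD]

Taking map as an example, its implementation is: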

def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}

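As the code shows, map takes one parameter f, a function that converts a value of type T into a value of type U. map does not call f right away. Instead it creates a new MapPartitionsRDD and passes a new function, (context, pid, iter) => iter.map(cleanF), as a constructor argument. During the iterative computation of the RDD, the compute method of MapPartitionsRDD uses that function to traverse the Iterator[T] and produce a new Iterator[U]: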

private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    var prev: RDD[T],
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
    preservesPartitioning: Boolean = false,
    isOrderSensitive: Boolean = false)
  extends RDD[U](prev) {
  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    f(context, split.index, firstParent[T].iterator(split, context))
}

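MapPartitionsRDD overrides the compute method of its parent class RDD. compute is called while a task is running: when Spark submits a job, the RDDs of each stage are wrapped in tasks (ShuffleMapTask for a stage that ends in a shuffle, ResultTask for the final stage). The task's runTask method calls RDD.iterator(), which in turn calls RDD.compute() unless the partition can be read from the cache or a checkpoint.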
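The lazy chaining described above can be sketched without Spark. The following is a toy model only: MiniRDD, SourceRDD and MappedRDD are hypothetical names invented for this illustration, and partitions are plain in-memory vectors. It shows that map merely builds a new node and that f runs only when an action pulls the iterators.

```scala
// Toy model of lazy RDD chaining. MiniRDD, SourceRDD and MappedRDD are
// hypothetical names for this sketch, not Spark classes.
object LazyRddSketch {
  abstract class MiniRDD[T] {
    def numPartitions: Int
    // Produces the elements of one partition, in the spirit of RDD.compute.
    def compute(split: Int): Iterator[T]
    // Transformation: only builds a new node; f is not called here.
    def map[U](f: T => U): MiniRDD[U] =
      new MappedRDD[U, T](this, (_, iter) => iter.map(f))
    // Action: walks every partition and therefore drains the iterators.
    def count(): Long =
      (0 until numPartitions).map(i => compute(i).foldLeft(0L)((n, _) => n + 1)).sum
  }

  final class SourceRDD[T](parts: Vector[Vector[T]]) extends MiniRDD[T] {
    def numPartitions: Int = parts.length
    def compute(split: Int): Iterator[T] = parts(split).iterator
  }

  final class MappedRDD[U, T](prev: MiniRDD[T], f: (Int, Iterator[T]) => Iterator[U])
      extends MiniRDD[U] {
    def numPartitions: Int = prev.numPartitions
    // Same shape as MapPartitionsRDD.compute: apply f to the parent's iterator.
    def compute(split: Int): Iterator[U] = f(split, prev.compute(split))
  }

  // Returns (calls to f before the action, calls to f after the action, count).
  def demo(): (Int, Int, Long) = {
    var calls = 0
    val source = new SourceRDD(Vector(Vector(1, 2), Vector(3, 4, 5)))
    val mapped = source.map { x => calls += 1; x * 2 }
    val before = calls
    val n = mapped.count()
    (before, calls, n)
  }
}
```

In this sketch demo() reports zero calls to f before count() and five afterwards, one per element.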

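2. How combineByKey, reduceByKey and groupByKey work

First, these methods are defined in the PairRDDFunctions class; the RDD class itself does not have them. Second, only key-value RDDs of type RDD[(K, V)] can call them. How does that work? The answer is implicit conversion: the implicit method rddToPairRDDFunctions in the RDD companion object converts an RDD[(K, V)] into a PairRDDFunctions, whose methods are then called.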
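The mechanism can be illustrated with a small stand-alone sketch. Box and PairBoxFunctions are hypothetical stand-ins for RDD and PairRDDFunctions, and sumByKey is an invented method; only the pattern (extra methods that become available when the element type is a pair) mirrors what Spark does.

```scala
import scala.language.implicitConversions

object ImplicitPairSketch {
  // Stand-in for RDD[T]; it has no key-based methods of its own.
  final class Box[T](val items: Seq[T])

  // Stand-in for PairRDDFunctions: only meaningful for (K, V) elements.
  final class PairBoxFunctions[K, V](self: Box[(K, V)]) {
    def keys: Seq[K] = self.items.map(_._1)
    def sumByKey(implicit num: Numeric[V]): Map[K, V] =
      self.items.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }
  }

  // Plays the role of the implicit rddToPairRDDFunctions conversion.
  implicit def boxToPairFunctions[K, V](box: Box[(K, V)]): PairBoxFunctions[K, V] =
    new PairBoxFunctions(box)

  val pairs = new Box(Seq("a" -> 1, "b" -> 2, "a" -> 3))
  // Compiles only because the compiler inserts boxToPairFunctions(pairs).
  val totals: Map[String, Int] = pairs.sumByKey
  // new Box(Seq(1, 2, 3)).sumByKey   // would not compile: elements are not pairs
}
```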

Let us start with combineByKey, which is defined as follows:

def combineByKey[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null): RDD[(K, C)] = self.withScope {
    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners,
      partitioner, mapSideCombine, serializer)(null)
}

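Functionally, this method turns an RDD[(K, V)] into a new RDD[(K, C)]. The process involves a shuffle: records with the same key are written to the same reduce partition and aggregated there, and C is the result of aggregating every V that belongs to the same K. The first 3 parameters are functions: createCombiner builds the initial C from the first V seen for a key, mergeValue folds another V into an existing C, and mergeCombiners merges two C values. partitioner is the partitioner that decides which partition a key goes to, and mapSideCombine says whether map-side aggregation is enabled. If mapSideCombine is true, the map side performs a partial aggregation with createCombiner and mergeValue, and the reduce side performs the final aggregation with mergeCombiners; otherwise all 3 functions are applied on the reduce side.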
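To make the roles of the three functions concrete, here is a minimal local simulation on plain collections. It is not Spark code: the helper names and the sample data are assumptions made for this sketch, and C is chosen as a (sum, count) pair, as one would do to compute a per-key average.

```scala
object CombineByKeySketch {
  // Aggregates raw (K, V) records into a Map[K, C] with createCombiner and mergeValue.
  def combineValuesByKey[K, V, C](records: Iterator[(K, V)],
                                  createCombiner: V => C,
                                  mergeValue: (C, V) => C): Map[K, C] =
    records.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(c => mergeValue(c, v)).getOrElse(createCombiner(v)))
    }

  // Merges already-combined values that come from several map-side partitions.
  def combineCombinersByKey[K, C](combined: Iterator[(K, C)],
                                  mergeCombiners: (C, C) => C): Map[K, C] =
    combined.foldLeft(Map.empty[K, C]) { case (acc, (k, c)) =>
      acc.updated(k, acc.get(k).map(old => mergeCombiners(old, c)).getOrElse(c))
    }

  // Input split into two "map-side" partitions (made-up sample data).
  val partitions: Seq[Seq[(String, Int)]] =
    Seq(Seq("a" -> 1, "b" -> 2, "a" -> 3), Seq("a" -> 5, "b" -> 4))

  // C = (sum, count).
  val createCombiner: Int => (Int, Int) = v => (v, 1)
  val mergeValue: ((Int, Int), Int) => (Int, Int) = (c, v) => (c._1 + v, c._2 + 1)
  val mergeCombiners: ((Int, Int), (Int, Int)) => (Int, Int) =
    (x, y) => (x._1 + y._1, x._2 + y._2)

  // mapSideCombine = true: partial aggregation per partition, then a final merge.
  val withMapSide: Map[String, (Int, Int)] = {
    val partial = partitions.map(p => combineValuesByKey(p.iterator, createCombiner, mergeValue))
    combineCombinersByKey(partial.iterator.flatMap(_.iterator), mergeCombiners)
  }

  // mapSideCombine = false: raw records are moved and aggregated on the reduce side.
  val reduceSideOnly: Map[String, (Int, Int)] =
    combineValuesByKey(partitions.iterator.flatMap(_.iterator), createCombiner, mergeValue)
}
```

Both paths give the same result; the difference is only where the work happens and how many records have to cross the shuffle.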

Here is the core code of combineByKeyWithClassTag, the method it calls:

def combineByKeyWithClassTag[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
    val aggregator = new Aggregator[K, V, C](
      self.context.clean(createCombiner),
      self.context.clean(mergeValue),
      self.context.clean(mergeCombiners))
    if (self.partitioner == Some(partitioner)) {
      self.mapPartitions(iter => {
        val context = TaskContext.get()
        new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
      }, preservesPartitioning = true)
    } else {
      new ShuffledRDD[K, V, C](self, partitioner)
        .setSerializer(serializer)
        .setAggregator(aggregator)
        .setMapSideCombine(mapSideCombine)
    }
  }

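The method first creates an Aggregator instance. It then checks whether the partitioner of the current RDD equals the partitioner passed in (for HashPartitioner, equal means the same partitioner type and the same number of partitions). This can only be true if the RDD has already been partitioned by a partitioner, typically by an earlier shuffle; otherwise its partitioner is None. If they are equal, the method calls mapPartitions and produces a MapPartitionsRDD, so no further shuffle happens because the data is already partitioned the right way. Otherwise it produces a ShuffledRDD.

So when does the shuffle actually happen? Again, during the actual job execution. When the DAGScheduler splits the job into stages, it sees that the ShuffledRDD and the prev RDD it depends on are linked by a wide dependency (ShuffledRDD overrides getDependencies), and it cuts the job into two stages at that point, as pictured below:

[Figure: the job split into Stage0 and Stage1 at the ShuffledRDD]

When a ShuffleMapTask of Stage0 runs, it passes the result of rdd.iterator() to ShuffleWriter.write(), which writes it to files on the local disk. This is the shuffle write. When a task of Stage1 runs, it calls the compute() method of ShuffledRDD, which, following the partitioning rule, fetches from the output of every Stage0 map task the data that belongs to its own partition. This is called the shuffle read: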

override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
    val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
    SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context)
      .read()
      .asInstanceOf[Iterator[(K, C)]]
  }

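reduceByKey and groupByKey both end up calling combineByKeyWithClassTag. The difference is that reduceByKey leaves mapSideCombine at true and requires a function func: (V, V) => V that says how values with the same key are aggregated. groupByKey passes mapSideCombine = false, so nothing is aggregated on the map side, and it stores the values of each key in a CompactBuffer, an append-only data structure similar to ArrayBuffer that is backed by an array.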
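The practical consequence of the mapSideCombine flag is the number of records that cross the shuffle. The sketch below models this on local collections with made-up data; a Vector stands in for CompactBuffer, and none of the names are Spark APIs.

```scala
object ShuffleVolumeSketch {
  // Two "map-side" partitions (made-up sample data).
  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq("a" -> 1, "a" -> 2, "a" -> 3, "b" -> 4),
    Seq("a" -> 5, "b" -> 6, "b" -> 7))

  // reduceByKey style: combine inside each partition first (mapSideCombine = true),
  // so at most one record per key leaves each partition.
  def mapSideReduce(part: Seq[(String, Int)], func: (Int, Int) => Int): Seq[(String, Int)] =
    part.groupBy(_._1).toSeq.map { case (k, kvs) => k -> kvs.map(_._2).reduce(func) }

  val shuffledWithCombine: Seq[(String, Int)] =
    partitions.flatMap(p => mapSideReduce(p, _ + _))

  // groupByKey style: no map-side combine, every record is shuffled as is.
  val shuffledWithoutCombine: Seq[(String, Int)] = partitions.flatten

  // Reduce side of reduceByKey: merge the partial results with the same function (+).
  val reduced: Map[String, Int] =
    shuffledWithCombine.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }

  // Reduce side of groupByKey: append every value to a per-key buffer.
  val grouped: Map[String, Vector[Int]] =
    shuffledWithoutCombine.foldLeft(Map.empty[String, Vector[Int]]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, Vector.empty) :+ v)
    }
}
```

Here 4 records are shuffled with map-side combine and 7 without it, which is why reduceByKey is usually preferred when the values can be merged early.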

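3. How coalesce and repartition work

The coalesce function decreases or increases the number of partitions of an RDD and thereby changes the parallelism of the job. Its source code is: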

def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](
          mapPartitionsWithIndexInternal(distributePartition, isOrderSensitive = true),
          new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }

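The function takes 3 parameters: the first is the new number of partitions, the second says whether to shuffle, and the third is an optional PartitionCoalescer that decides how partitions are merged.

With shuffle = false no shuffle takes place, and coalesce can only reduce the number of partitions: if numPartitions is larger than the current number of partitions, the partition count stays unchanged. How does CoalescedRDD merge partitions without a shuffle? It is worth reading the CoalescedRDD class, especially its getPartitions and compute methods. If a Spark job has too many small map tasks, this operator can reduce the number of map partitions without a shuffle. Note, however, that if the data of the RDD is unevenly distributed, the default merge strategy of CoalescedRDD (DefaultPartitionCoalescer) can make the skew worse. For example, an RDD has 4 partitions and its data is concentrated in the first and second partitions. Coalescing it into two partitions maps the old partitions to the new ones as [0, 1] => [0] and [2, 3] => [1], so the first partition of the new RDD has to process almost all of the data. When the data is unevenly distributed, you should therefore implement your own PartitionCoalescer and pass it in, or set shuffle = true.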
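The 4-to-2 example above can be reproduced with a simplified model. The sketch assumes that neighbouring parent partitions are grouped into contiguous ranges and ignores locality preferences, which the real DefaultPartitionCoalescer also takes into account; contiguousGroups and the partition sizes are invented for illustration.

```scala
object CoalesceSketch {
  // Groups parent partition indices into `target` contiguous ranges
  // (simplified: no locality preferences are considered).
  def contiguousGroups(parentCount: Int, target: Int): Seq[Seq[Int]] =
    (0 until target).map { i =>
      val start = (i.toLong * parentCount / target).toInt
      val end = ((i.toLong + 1) * parentCount / target).toInt
      (start until end).toList
    }

  // Record counts of four parent partitions; the data sits in the first two.
  val parentSizes = Vector(1000, 1000, 10, 10)

  // Old partitions [0, 1] go to new partition 0, [2, 3] to new partition 1.
  val groups: Seq[Seq[Int]] = contiguousGroups(parentSizes.length, 2)

  // Resulting record counts per new partition: the skew has become worse.
  val newSizes: Seq[Int] = groups.map(_.map(parentSizes).sum)
}
```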

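With shuffle = true a shuffle takes place and the RDD goes through 3 transformations: current RDD -> MapPartitionsRDD (created by mapPartitionsWithIndexInternal in the code) -> ShuffledRDD -> CoalescedRDD, followed by .values, which strips the temporary keys again. Why the extra map step? Look at the distributePartition function: it gives every record an increasing Int key that starts at a random position, so the HashPartitioner spreads the records evenly over the output partitions and the shuffle does not produce skewed partitions.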
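The effect of distributePartition can be checked with a small stand-alone simulation. The function body follows the source shown above, while targetPartition (a non-negative modulo, as a hash partitioner would apply to an Int key) and the skewed sample input are assumptions of this sketch.

```scala
import scala.util.Random
import scala.util.hashing

object DistributeSketch {
  // Tags every record with an increasing Int key, starting from a pseudo-random
  // position that depends on the index of the input partition.
  def distributePartition[T](index: Int, items: Iterator[T],
                             numPartitions: Int): Iterator[(Int, T)] = {
    var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
    items.map { t =>
      position += 1
      (position, t)
    }
  }

  // What a hash partitioner does with an Int key: a non-negative modulo.
  def targetPartition(key: Int, numPartitions: Int): Int = {
    val raw = key % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Skewed input: 3 partitions holding 100, 3 and 1 records.
  val input: Vector[Vector[String]] =
    Vector(Vector.fill(100)("x"), Vector.fill(3)("y"), Vector.fill(1)("z"))
  val numPartitions = 4

  // Number of records that land in each of the 4 output partitions.
  val outputSizes: Vector[Int] = {
    val targets = input.zipWithIndex.flatMap { case (part, idx) =>
      distributePartition(idx, part.iterator, numPartitions).map { case (key, _) =>
        targetPartition(key, numPartitions)
      }
    }
    Vector.tabulate(numPartitions)(p => targets.count(_ == p))
  }
}
```

Each input partition deals its records out round-robin, so the output sizes can differ by at most the number of input partitions, however skewed the input is.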

repartition simply calls coalesce:

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }

4. How count works

count is an action operator, so it triggers the submission of a job. Its source code is:

def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

As you can see, it calls SparkContext.runJob(), which has several overloads:

class SparkContext(config: SparkConf) extends Logging {
  def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
    runJob(rdd, func, 0 until rdd.partitions.length)
  }

  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: Iterator[T] => U,
      partitions: Seq[Int]): Array[U] = {
    val cleanedFunc = clean(func)
    runJob(rdd, (ctx: TaskContext, it: Iterator[T]) => cleanedFunc(it), partitions)
  }

  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int]): Array[U] = {
    val results = new Array[U](partitions.size)
    runJob[T, U](rdd, func, partitions, (index, res) => results(index) = res)
    results
  }

  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit = {
    if (stopped.get()) {
      throw new IllegalStateException("SparkContext has been shutdown")
    }
    val callSite = getCallSite
    val cleanedFunc = clean(func)
    logInfo("Starting job: " + callSite.shortForm)
    if (conf.getBoolean("spark.logLineage", false)) {
      logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
    }
    dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
    progressBar.foreach(_.finishAll())
    rdd.doCheckpoint()
  }

}

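Start with the last overload, which takes 4 parameters:

The first parameter, rdd, is the RDD to submit, which is straightforward.
The second parameter is a function, func: (TaskContext, Iterator[T]) => U. It is applied to each partition of the job's final output and turns the partition's output Iterator[T] into a result U. count passes Utils.getIteratorSize _, which only returns the size of the iterator. func is called in ResultTask.runTask.
The third parameter is the sequence of partition indices. The first overload above passes 0 until rdd.partitions.length, meaning that by default every partition takes part in the computation.
The fourth parameter is also a function, resultHandler: (Int, U) => Unit. It is a little harder to explain, so let us start with when it is called. Whenever a ResultTask finishes, the statusUpdate method of TaskSchedulerImpl is called. If the task succeeded rather than failed, it calls the enqueueSuccessfulTask method of TaskResultGetter to fetch the output of that partition's ResultTask (the result U of applying func, mentioned above). The fetch runs asynchronously on a thread pool. Once it completes, it calls back into DAGScheduler, and after several layers of event passing the handleTaskCompletion method of DAGScheduler is called, which calls the taskSucceeded method of JobWaiter: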

 override def taskSucceeded(index: Int, result: Any): Unit = {
    // resultHandler call must be synchronized in case resultHandler itself is not thread safe.
    synchronized {
      resultHandler(index, result.asInstanceOf[T])
    }
    if (finishedTasks.incrementAndGet() == totalTasks) {
      jobPromise.success(())
    }
  }

This call shows that the first argument of resultHandler is the task index and the second is the task's output. Now look at how the third runJob overload passes in resultHandler:

val results = new Array[U](partitions.size)
runJob[T, U](rdd, func, partitions, (index, res) => results(index) = res)
results

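Clearly, the function passed here stores each output in the results array, and after the inner runJob call returns, the array is returned. count then sums the per-partition sizes in that array. With that, the way count works should be clear.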
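The interplay of func, resultHandler and the results array can be modelled in a few lines of plain Scala. This is a sketch under simplifying assumptions: partitions are in-memory sequences, tasks run sequentially on the caller's thread, and runJobWithHandler, runJob and getIteratorSize are local functions that only borrow their names from the Spark methods discussed above.

```scala
import scala.reflect.ClassTag

object RunJobSketch {
  // Innermost step: run func on each requested partition and hand every result
  // to resultHandler together with the task index.
  def runJobWithHandler[T, U](partitions: Seq[Seq[T]], func: Iterator[T] => U,
                              indices: Seq[Int], resultHandler: (Int, U) => Unit): Unit =
    indices.zipWithIndex.foreach { case (partitionId, taskIndex) =>
      resultHandler(taskIndex, func(partitions(partitionId).iterator))
    }

  // Collects the per-partition results into an array via the handler.
  def runJob[T, U: ClassTag](partitions: Seq[Seq[T]], func: Iterator[T] => U): Array[U] = {
    val results = new Array[U](partitions.size)
    runJobWithHandler[T, U](partitions, func, partitions.indices,
      (index: Int, res: U) => results(index) = res)
    results
  }

  // Counts elements by exhausting the iterator.
  def getIteratorSize[T](iter: Iterator[T]): Long = {
    var n = 0L
    while (iter.hasNext) { n += 1L; iter.next() }
    n
  }

  // count = sum of the per-partition sizes.
  def count[T](partitions: Seq[Seq[T]]): Long =
    runJob(partitions, (it: Iterator[T]) => getIteratorSize(it)).sum
}
```

Any other per-partition function can be passed to runJob in the same way, for example one that sums the elements of each partition.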

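5. How sortByKey works

sortByKey is defined in the OrderedRDDFunctions class and sorts the whole RDD by key; the Spark RDD API has no orderByKey. A global sort obviously requires repartitioning by key, and the default hash partitioning cannot do the job because it does not keep keys in order across partitions. What about pulling the output of every task to the driver and sorting it there? That would work, but with a large data set the driver runs out of memory. So how does Spark solve this? Let us look at the source code of sortByKey: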

def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
      : RDD[(K, V)] = self.withScope
  {
    val part = new RangePartitioner(numPartitions, self, ascending)
    new ShuffledRDD[K, V, V](self, part)
      .setKeyOrdering(if (ascending) ordering else ordering.reverse)
  }

The method first creates a RangePartitioner. Here is its getPartition method:

def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[K]
    var partition = 0
    if (rangeBounds.length <= 128) {
      // If we have less than 128 partitions naive search
      while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
        partition += 1
      }
    } else {
      // Determine which binary search method to use only once.
      partition = binarySearch(rangeBounds, k)
      // binarySearch either returns the match location or -[insertion point]-1
      if (partition < 0) {
        partition = -partition-1
      }
      if (partition > rangeBounds.length) {
        partition = rangeBounds.length
      }
    }
    if (ascending) {
      partition
    } else {
      rangeBounds.length - partition
    }
  }

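A key variable in this method is rangeBounds, an array sorted in ascending order. Its length is at most the number of partitions minus 1, and each value is the upper bound of one partition. It is built by sampling the original data: the sampled keys are collected into an array, and from them the boundary key of each partition is computed, according to the number of partitions, and stored in rangeBounds.

With rangeBounds available it is easy to find the partition of a key. If the array has at most 128 elements, it is scanned from front to back until the key is no longer greater than a boundary value, and the corresponding partition is returned. Otherwise a binary search is used.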
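The lookup can be tried out on Int keys with the following sketch. determineBounds here is a deliberately naive quantile picker written for this illustration (Spark's own sampling and weighting are more involved), and getPartition covers only the ascending case of the method shown above.

```scala
import java.util.Arrays

object RangePartitionSketch {
  // Naive stand-in for bound selection: pick numPartitions - 1 evenly spaced
  // keys from a sorted sample.
  def determineBounds(sample: Seq[Int], numPartitions: Int): Array[Int] = {
    require(sample.nonEmpty, "sample must not be empty")
    val sorted = sample.sorted
    (1 until numPartitions)
      .map(i => sorted(math.min(sorted.length - 1, i * sorted.length / numPartitions)))
      .distinct
      .toArray
  }

  // Ascending lookup: linear scan for small arrays, binary search otherwise.
  def getPartition(key: Int, rangeBounds: Array[Int]): Int =
    if (rangeBounds.length <= 128) {
      var partition = 0
      while (partition < rangeBounds.length && key > rangeBounds(partition)) partition += 1
      partition
    } else {
      val found = Arrays.binarySearch(rangeBounds, key)
      // binarySearch returns the match index, or -(insertion point) - 1.
      if (found < 0) -found - 1 else found
    }

  // Three bounds describe four partitions.
  val bounds: Array[Int] = determineBounds(1 to 12, 4)
  // More than 128 bounds, to exercise the binary-search branch.
  val manyBounds: Array[Int] = (1 to 200).map(_ * 10).toArray
}
```

A key equal to a bound stays in the lower partition, because the scan stops as soon as the key is no longer greater than the bound.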

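After reading getPartition of RangePartitioner it is clear that, once keys are split by this rule, the keys increase from one partition to the next, but the keys inside a partition are still unsorted. Back in sortByKey, after the ShuffledRDD instance is created, setKeyOrdering is called. What does it do? Look at the source: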

class ShuffledRDD[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient var prev: RDD[_ <: Product2[K, V]],
    part: Partitioner)
  extends RDD[(K, C)](prev.context, Nil) {
  def setKeyOrdering(keyOrdering: Ordering[K]): ShuffledRDD[K, V, C] = {
    this.keyOrdering = Option(keyOrdering)
    this
  }

  override def getDependencies: Seq[Dependency[_]] = {
    val serializer = userSpecifiedSerializer.getOrElse {
      val serializerManager = SparkEnv.get.serializerManager
      if (mapSideCombine) {
        serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[C]])
      } else {
        serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[V]])
      }
    }
    List(new ShuffleDependency(prev, part, serializer, keyOrdering, aggregator, mapSideCombine))
  }
}

As shown, getDependencies of ShuffledRDD passes keyOrdering to the ShuffleDependency it creates. When compute is called, the shuffle read runs first, and the shuffle reader decides from this value whether to sort the records.
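Putting the two steps together, range partitioning followed by a sort inside each partition yields a global order. The sketch below demonstrates this on a local collection; the bounds and keys are made-up values, and partitionOf is a compact equivalent of the linear scan for ascending order.

```scala
object TotalSortSketch {
  val rangeBounds = Vector(10, 20)                     // assumed bounds for 3 partitions
  val keys = Vector(25, 3, 14, 20, 7, 31, 11, 10)      // made-up input keys

  // Number of bounds the key exceeds = index of its range partition.
  def partitionOf(key: Int): Int = rangeBounds.count(key > _)

  // Shuffle write: route each key to its range partition; order inside is arbitrary.
  val shuffled: Vector[Vector[Int]] =
    Vector.tabulate(rangeBounds.length + 1)(p => keys.filter(partitionOf(_) == p))

  // Shuffle read with a key ordering defined: each partition sorts its own keys.
  val sortedPartitions: Vector[Vector[Int]] = shuffled.map(_.sorted)

  // Reading the partitions in index order gives a globally sorted sequence.
  val result: Vector[Int] = sortedPartitions.flatten
}
```

No single machine ever has to hold or sort the whole data set, which is what makes this approach scale where sorting on the driver does not.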
