1. Spark Source Code Study Notes: reduceByKey

zerg_ling

Article link: https://blog.csdn.net/firstblood1/article/details/53444048

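This article walks through Spark's reduceByKey operation. It starts with the difference between transformations and actions, then focuses on the reduceByKey source code, covering how the Partitioner is obtained and how the child RDD is built. The Partitioner part explains the rules of defaultPartitioner and, in particular, the partitioning strategy of HashPartitioner. The child-RDD part looks at how incorrect partitioning is avoided and at the aggregation performed inside mapPartitions. The article also discusses how data is spilled to disk when it no longer fits in memory, and how to tune performance by adjusting the number of partitions and related parameters.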

Chapter 0: Prerequisites (readers already familiar with this material can skip this chapter)

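Spark RDDs support two kinds of operations: transformations and actions. Put simply, a transformation derives a new RDD from an existing one (map, for example), while an action runs a computation on an RDD and returns the result to the driver (reduce, for example). The official documentation (http://spark.apache.org/docs/latest/programming-guide.html) lists which operations fall into each category. When Spark encounters a transformation, it does not execute it right away. It only records which RDD the transformation applies to, and runs the recorded transformations together once an action is reached.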

By default, every action re-executes all the transformations it depends on, even when two actions depend on the same transformation, unless you call cache or persist to keep the intermediate RDD around.
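The two behaviours described above (lazy transformations, and recomputation on every action unless the RDD is cached) can be observed with a small experiment. The following is only a sketch under some assumptions: it needs Spark 2.0 or later on the classpath (for `longAccumulator`), it runs in local mode, and the object and accumulator names are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Returns the accumulator value observed at three points:
  // before any action, after two actions on an uncached RDD,
  // and after two further actions on a cached RDD.
  def run(sc: SparkContext): (Long, Long, Long) = {
    // Counts how many times the map function is actually invoked.
    val calls = sc.longAccumulator("map calls")

    // Transformation only: Spark records the lineage, nothing is computed yet.
    val doubled = sc.parallelize(1 to 4, 2).map { x => calls.add(1); x * 2 }
    val beforeAction = calls.value.longValue

    // Two actions on an uncached RDD: the map runs again for each action.
    doubled.count()
    doubled.count()
    val afterUncached = calls.value.longValue

    // Same pipeline, but cached: the second action reads the cached partitions.
    val cached = sc.parallelize(1 to 4, 2).map { x => calls.add(1); x * 2 }.cache()
    cached.count()
    cached.count()
    val afterCached = calls.value.longValue

    (beforeAction, afterUncached, afterCached)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-evaluation-sketch")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

In a local run without task retries, the first value should be 0 (defining the map ran nothing), the second should reflect two full passes over the four elements, and the third should grow by only one more pass because the second action on the cached RDD does not re-run the map. Accumulator updates made inside transformations can be applied more than once if tasks are retried, so treat the counts as a demonstration rather than a guarantee.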

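This article starts from a transformation and reads through the source code along the path of a single job's execution. The transformation chosen here is a fairly representative one: reduceByKey.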

What reduceByKey does is covered in the official documentation, so it is not repeated here.

Reading this part of the source lets us check two questions (the answers are the passages set in bold in the text):

1. Is a transformation executed immediately or not?

2. How does the number of partitions of the RDD produced by a transformation relate to that of its parent RDD?

Spark's reduceByKey method has three overloads (shown here as declared on JavaPairRDD):

def reduceByKey(partitioner: Partitioner, func: JFunction2[V, V, V]): JavaPairRDD[K, V]

def reduceByKey(func: JFunction2[V, V, V], numPartitions: Int): JavaPairRDD[K, V]

def reduceByKey(func: JFunction2[V, V, V]): JavaPairRDD[K, V]

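Besides the aggregation function, the first two forms let the caller specify either a partitioner or the number of partitions of the RDD that reduceByKey produces. The third form takes only the function and is implemented as follows: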

  def reduceByKey(func: JFunction2[V, V, V]): JavaPairRDD[K, V] = {
    fromRDD(reduceByKey(defaultPartitioner(rdd), func))
  }
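To make the difference between the three forms concrete, here is a minimal sketch that calls each of them and inspects the partition count of the result. It makes a few assumptions: it uses the Scala API (`PairRDDFunctions`) rather than the JavaPairRDD wrappers shown above, it runs in local mode, `spark.default.parallelism` is not set, and the object and method names are invented for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object ReduceByKeyOverloadsSketch {
  // Returns the partition counts of the RDDs produced by the three overloads.
  def partitionCounts(sc: SparkContext): (Int, Int, Int) = {
    // Parent RDD with 4 partitions and no partitioner of its own.
    val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3, "c" -> 4), 4)

    // Function only: the partitioner comes from defaultPartitioner.
    val byDefault = pairs.reduceByKey(_ + _)
    // Function plus an explicit partition count.
    val byCount = pairs.reduceByKey(_ + _, 3)
    // Explicit partitioner plus function.
    val byPartitioner = pairs.reduceByKey(new HashPartitioner(5), _ + _)

    // Reading partitions.length here does not submit a job; no action is called.
    (byDefault.partitions.length, byCount.partitions.length, byPartitioner.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("reduce-by-key-overloads")
    val sc = new SparkContext(conf)
    try println(partitionCounts(sc)) finally sc.stop()
  }
}
```

Under the stated assumptions the first count should match the parent's 4 partitions, while the other two follow the explicit arguments (3 and 5). Note that none of the three calls triggers a computation by itself, which bears on the first question posed earlier.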

Chapter 1: Obtaining the Partitioner

When the caller specifies neither a partitioner nor a partition count, Spark calls defaultPartitioner(rdd) to obtain a default partitioner. The source of defaultPartitioner is:

  /**
   * Choose a partitioner to use for a cogroup-like operation between a number of RDDs.
   *
   * If any of the RDDs already has a partitioner, choose that one.
   *
   * Otherwise, we use a default HashPartitioner. For the number of partitions, if
   * spark.default.parallelism is set, then we'll use the value from SparkContext
   * defaultParallelism, otherwise we'll use the max number of upstream partitions.
   *
   * Unless spark.default.parallelism is set, the number of partitions will be the
   * same as the number of partitions in the largest upstream RDD, as this should
   * be least likely to cause out-of-memory errors.
   *
   * We use two method parameters (rdd, others) to enforce callers passing at least 1 RDD.
   */
  def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.length).reverse
    for (r <- bySize if r.partitioner.isDefined && r.partitioner.get.numPartitions > 0) {
      return r.partitioner.get
    }
    if (rdd.context.conf.contains("spark.default.parallelism")) {
      new HashPartitioner(rdd.context.defaultParallelism)
    } else {
      new HashPartitioner(bySize.head.partitions.length)
    }
  }

The comment on the method describes its overall logic (it can be read directly above):

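The method takes one required RDD plus any number of additional RDDs, so callers must pass at least one. It first combines all of them into a single sequence and sorts that sequence by partition count in descending order. It then walks the sequence and returns the partitioner of the first RDD that has one (with numPartitions > 0), that is, the partitioner of the RDD with the most partitions among those that have a partitioner. If none of the RDDs has a partitioner, it returns a new HashPartitioner: if spark.default.parallelism is configured, the partition count is the value of rdd.context.defaultParallelism; otherwise, the partition count of the new RDD is the largest partition count among its parent RDDs.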
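The selection rule can be modelled without Spark at all. The sketch below mirrors the logic of the source shown above, using simplified stand-in types: `SketchRdd`, `SketchPartitioner`, and `choose` are hypothetical names invented for this illustration and are not Spark classes, and the `defaultParallelism` option stands in for "spark.default.parallelism is set".

```scala
// Simplified stand-ins for illustration only; these are NOT Spark classes.
final case class SketchPartitioner(numPartitions: Int)
final case class SketchRdd(name: String, numPartitions: Int, partitioner: Option[SketchPartitioner])

object DefaultPartitionerSketch {
  // defaultParallelism = Some(n) models "spark.default.parallelism is set to n".
  def choose(rdd: SketchRdd, others: Seq[SketchRdd], defaultParallelism: Option[Int]): SketchPartitioner = {
    // Largest RDD first, as in the original sortBy(...).reverse.
    val bySize = (rdd +: others).sortBy(_.numPartitions).reverse
    bySize
      .flatMap(_.partitioner)      // keep only RDDs that already have a partitioner
      .find(_.numPartitions > 0)   // the first hit comes from the largest such RDD
      .getOrElse {
        // No existing partitioner: fall back to a hash partitioner whose size is
        // the configured parallelism, or else the largest upstream partition count.
        SketchPartitioner(defaultParallelism.getOrElse(bySize.head.numPartitions))
      }
  }

  def main(args: Array[String]): Unit = {
    val a = SketchRdd("a", 8, None)
    val b = SketchRdd("b", 4, Some(SketchPartitioner(4)))
    println(choose(a, Seq(b), None))        // b's partitioner wins although a is larger
    println(choose(a, Seq.empty, None))     // falls back to a's 8 partitions
    println(choose(a, Seq.empty, Some(16))) // configured parallelism takes precedence
  }
}
```

The first call shows a detail that is easy to miss: an existing partitioner is reused even when it belongs to an RDD with fewer partitions than the largest input, because only RDDs that already have a partitioner are considered in the first step.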

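Going one step further, let's look at the partitioning rule of HashPartitioner (HashPartitioner is a subclass of Partitioner, defined alongside it in Partitioner.scala, not an inner class). The source first: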

/**
 * A [[org.apache.spark.Partitioner]] that implements hash-based partitioning using
 * Java's `Object.hashCode`.
 *
 * Java arrays have hashCodes that are based on the arrays' identities rather than their contents,
 * so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will
 * produce an unexpected or incorrect result.
 */
class HashPartitioner(partitions: Int) extends Partitioner {
  require(pa