Overview diagram

Spark's partitioners break down as follows:

- `HashPartitioner` and `RangePartitioner` under `org.apache.spark`
- `CoalescedPartitioner` under `org.apache.spark.scheduler`
- `CoalescedPartitioner` under `org.apache.spark.sql.execution`
- `GridPartitioner` under `org.apache.spark.mllib.linalg.distributed`
- `PartitionIdPassthrough` under `org.apache.spark.sql.execution`
- `PythonPartitioner` under `org.apache.spark.api.python`

That makes seven partitioners in total. This article focuses on `HashPartitioner` and `RangePartitioner` under `org.apache.spark`.

Partitioners only apply to operations on RDDs of (K, V) pairs, as the sketch below illustrates.
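A minimal sketch of that constraint: `partitionBy` is defined on pair RDDs, and attaching a partitioner is what records how the data is laid out. The `local[2]` master, app name, and sample data here are made-up placeholder values.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// A pair RDD can carry a partitioner; partitionBy attaches one explicitly.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("partitioner-demo"))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val partitioned = pairs.partitionBy(new HashPartitioner(4))

println(pairs.partitioner)       // None -- a plain pair RDD has no partitioner
println(partitioned.partitioner) // Some(org.apache.spark.HashPartitioner@...)
```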
Partitioner
`Partitioner` is an abstract class that defines the members every partitioner must provide:
```scala
abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}
```
- `numPartitions`: returns the number of partitions.
- `getPartition`: maps a key to its partition ID.
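Implementing just these two members yields a working custom partitioner. Below is a minimal sketch (not part of Spark; the class name and first-letter routing scheme are made up for illustration):

```scala
import org.apache.spark.Partitioner

// Routes string keys by their first letter; anything else falls back to hashCode.
class FirstLetterPartitioner(override val numPartitions: Int) extends Partitioner {
  require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")

  def getPartition(key: Any): Int = key match {
    case null => 0
    case s: String if s.nonEmpty => (s.head.toLower - 'a').max(0) % numPartitions
    case other => (other.hashCode & Integer.MAX_VALUE) % numPartitions
  }
}
```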
The `Partitioner` class also has a companion object, which supplies a default partitioner:
```scala
object Partitioner {
  /**
   * Choose a partitioner to use for a cogroup-like operation between a number of RDDs.
   *
   * If any of the RDDs already has a partitioner, choose that one.
   *
   * Otherwise, we use a default HashPartitioner. For the number of partitions, if
   * spark.default.parallelism is set, then we'll use the value from SparkContext
   * defaultParallelism, otherwise we'll use the max number of upstream partitions.
   *
   * Unless spark.default.parallelism is set, the number of partitions will be the
   * same as the number of partitions in the largest upstream RDD, as this should
   * be least likely to cause out-of-memory errors.
   *
   * We use two method parameters (rdd, others) to enforce callers passing at least 1 RDD.
   */
  def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    val rdds = (Seq(rdd) ++ others)
    val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))
    if (hasPartitioner.nonEmpty) {
      hasPartitioner.maxBy(_.partitions.length).partitioner.get
    } else {
      if (rdd.context.conf.contains("spark.default.parallelism")) {
        new HashPartitioner(rdd.context.defaultParallelism)
      } else {
        new HashPartitioner(rdds.map(_.partitions.length).max)
      }
    }
  }
}
```
The `defaultPartitioner` method spells out the strategy for producing the default partitioner. If any of the parent RDDs already has a partitioner, it returns the partitioner of the parent with the largest number of partitions. If none of the parents has one (e.g., none is a partitioned pair RDD), it returns a `HashPartitioner`, whose partition count is determined in one of two ways: if `spark.default.parallelism` is set, that value is used; otherwise, the largest partition count among the parent RDDs is used.
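A small sketch of that resolution order (the RDDs and partition counts are made-up values, reusing the `sc` from the sketch above):

```scala
import org.apache.spark.Partitioner

// Neither parent has a partitioner and spark.default.parallelism is not set,
// so the fallback is a HashPartitioner sized by the largest parent.
val rddA = sc.parallelize(Seq(("a", 1), ("b", 2)), numSlices = 4)
val rddB = sc.parallelize(Seq(("a", 3), ("c", 4)), numSlices = 8)

val p = Partitioner.defaultPartitioner(rddA, rddB)
println(p.numPartitions) // 8 -- the max partition count among the parents
```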
HashPartitioner
```scala
class HashPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }

  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner =>
      h.numPartitions == numPartitions
    case _ =>
      false
  }

  override def hashCode: Int = numPartitions
}
```
This one is fairly simple: the partition ID is the key's `hashCode` modulo the number of partitions (with `null` keys always mapped to partition 0). If the remainder is negative, the number of partitions is added to it, so the final value is always a valid partition ID for that key.
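That is exactly what `Utils.nonNegativeMod` does (the real helper lives in `org.apache.spark.util.Utils`); a minimal standalone sketch:

```scala
// Java/Scala's % can return a negative remainder for negative operands,
// so a negative result is shifted back into the range [0, mod).
def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}

nonNegativeMod("spark".hashCode, 4) // always in 0..3, whatever the sign of hashCode
nonNegativeMod(-7, 4)               // -7 % 4 == -3, so the result is -3 + 4 == 1
```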
RangePartitioner
```scala
class RangePartitioner[K : Ordering : ClassTag, V](
    partitions: Int,
    rdd: RDD[_ <: Product2[K, V]],
    private var ascending: Boolean = true)
  extends Partitioner {

  // We allow partitions = 0, which happens when sorting an empty RDD under the default settings.
  require(partitions >= 0, s"Number of partitions cannot be negative but found $partitions.")

  private var ordering = implicitly[Ordering[K]]

  // An array of upper bounds for the first (partitions - 1) partitions
  private var rangeBounds: Array[K] = {
    if (partitions <= 1) {
      Array.empty
    } else {
      // This is the sample size we need to have roughly balanced output partitions, capped at 1M.
      val sampleSize = math.min(20.0 * partitions, 1e6)
      // Assume the input partitions are roughly balanced and over-sample a little bit.
      val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
      val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
      if (numItems == 0L) {
        Array.empty
      } else {
        // If a partition contains much more than the average number of items, we re-sample from it
        // to ensure that enough items are collected from that partition.
        val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
        val candidates = ArrayBuffer.empty[(K, Float)]
        val imbalancedPartitions = mutable.Set.empty[Int]
        sketched.foreach { case (idx, n, sample) =>
          if (fraction * n > sampleSizePerPartition) {
            imbalancedPartitions += idx
          } else {
            // The weight is 1 over the sampling probability.
            val weight = (n.toDouble / sample.length).toFloat
            for (key <- sample) {
              candidates += ((key, weight))
            }
          }
        }
        if (imbalancedPartitions.nonEmpty) {
          // Re-sample imbalanced partitions with the desired sampling probability.
          val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
          val seed = byteswap32(-rdd.id - 1)
          val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
          val weight = (1.0 / fraction).toFloat
          candidates ++= reSampled.map(x => (x, weight))
        }
        RangePartitioner.determineBounds(candidates, partitions)
      }
    }
  }

  def numPartitions: Int = rangeBounds.length + 1

  private var binarySearch: ((Array[K], K) => Int) = CollectionsUtils.makeBinarySearch[K]

  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[K]
    var partition = 0
    if (rangeBounds.length <= 128) {
      // If we have less than 128 partitions naive search
      while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
        partition += 1
      }
    } else {
      // Determine which binary search method to use only once.
      partition = binarySearch(rangeBounds, k)
      // binarySearch either returns the match location or -[insertion point]-1
      if (partition < 0) {
        partition = -partition - 1
      }
      if (partition > rangeBounds.length) {
        partition = rangeBounds.length
      }
    }
    if (ascending) {
      partition
    } else {
      rangeBounds.length - partition
    }
  }

  // equals, hashCode, and the custom Java serialization hooks are omitted here.
}
```
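A small usage sketch before digging into the sampling logic (the keys are made up; it reuses the `sc` from the earlier sketch): construct a `RangePartitioner` directly and probe where a few keys land.

```scala
import org.apache.spark.RangePartitioner

// Build a RangePartitioner over a small pair RDD and probe a few keys.
val kv = sc.parallelize(Seq(5 -> "e", 1 -> "a", 9 -> "i", 3 -> "c", 7 -> "g"))
val rp = new RangePartitioner(2, kv)

println(rp.numPartitions)   // 2, assuming the sample found a usable boundary
println(rp.getPartition(1)) // 0 -- small keys go to the first partition
println(rp.getPartition(9)) // 1 -- large keys go to the last partition
```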
I won't repeat every detail here; the referenced articles already cover them. The most important piece is how the boundaries are determined: the sampling step uses reservoir sampling, an algorithm for drawing a fixed-size sample from a stream whose total size is not known in advance (see the sketch below).

If you only want to understand the partitioning strategy, you can read the Spark 1.1 code directly. The code after 1.1 is a performance optimization (reservoir sampling cuts down the number of full passes over the data), but the strategy is essentially the same: sample the keys to obtain the boundary values between partitions, then, for each incoming key, determine which boundary range it falls into and store it in that partition. It follows that the partitions are ordered relative to each other -- every element in partition A is smaller than every element in the partition B that follows it -- while the data within a partition is unordered.
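A minimal sketch of reservoir sampling (Spark's real version is `reservoirSampleAndCount` in `org.apache.spark.util.random.SamplingUtils`; the standalone function below is a simplified illustration):

```scala
import scala.reflect.ClassTag
import scala.util.Random

// Keep a uniform sample of size k from a stream whose length is unknown up front:
// fill the reservoir with the first k items, then replace a random slot with
// probability k / (i + 1) for the i-th item seen.
def reservoirSample[T: ClassTag](input: Iterator[T], k: Int, seed: Long = 42L): (Array[T], Long) = {
  val reservoir = new Array[T](k)
  val rand = new Random(seed)
  var i = 0L
  for (item <- input) {
    if (i < k) {
      reservoir(i.toInt) = item
    } else {
      val j = (rand.nextDouble() * (i + 1)).toLong
      if (j < k) reservoir(j.toInt) = item
    }
    i += 1
  }
  // If the stream was shorter than k, trim the unfilled tail.
  (reservoir.take(math.min(i, k.toLong).toInt), i)
}

val (sample, total) = reservoirSample((1 to 100000).iterator, 20)
println(s"sampled ${sample.length} of $total items")
```

Because each item ends up in the sample with equal probability, the sample is uniform without knowing the stream length in advance, which is exactly what `RangePartitioner` needs when sketching each input partition in a single pass.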