Overview diagram

Spark's partitioners break down as follows:

- `HashPartitioner` and `RangePartitioner` under `org.apache.spark`
- `CoalescedPartitioner` under `org.apache.spark.scheduler`
- `CoalescedPartitioner` under `org.apache.spark.sql.execution`
- `GridPartitioner` under `org.apache.spark.mllib.linalg.distributed`
- `PartitionIdPassthrough` under `org.apache.spark.sql.execution`
- `PythonPartitioner` under `org.apache.spark.api.python`

That makes seven partitioners in total. This article focuses on `HashPartitioner` and `RangePartitioner` under `org.apache.spark`.

Partitioners only apply to operations on RDDs of (K, V) pairs, as the sketch below illustrates.
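A minimal sketch of that constraint: `partitionBy` is defined on pair RDDs, and attaching a partitioner is what records how the data is laid out. The `local[2]` master, app name, and sample data here are made-up placeholder values.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// A pair RDD can carry a partitioner; partitionBy attaches one explicitly.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("partitioner-demo"))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val partitioned = pairs.partitionBy(new HashPartitioner(4))

println(pairs.partitioner)       // None -- a plain pair RDD has no partitioner
println(partitioned.partitioner) // Some(org.apache.spark.HashPartitioner@...)
```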
Partitioner
`Partitioner` is an abstract class that defines the members every partitioner must provide:
```scala
abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}
```
- `numPartitions`: returns the number of partitions.
- `getPartition`: maps a key to its partition ID.
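Implementing just these two members yields a working custom partitioner. Below is a minimal sketch (not part of Spark; the class name and first-letter routing scheme are made up for illustration):

```scala
import org.apache.spark.Partitioner

// Routes string keys by their first letter; anything else falls back to hashCode.
class FirstLetterPartitioner(override val numPartitions: Int) extends Partitioner {
  require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")

  def getPartition(key: Any): Int = key match {
    case null => 0
    case s: String if s.nonEmpty => (s.head.toLower - 'a').max(0) % numPartitions
    case other => (other.hashCode & Integer.MAX_VALUE) % numPartitions
  }
}
```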
The `Partitioner` class also has a companion object, which supplies a default partitioner:
```scala
object Partitioner {
  /**
   * Choose a partitioner to use for a cogroup-like operation between a number of RDDs.
   *
   * If any of the RDDs already has a partitioner, choose that one.
   *
   * Otherwise, we use a default HashPartitioner. For the number of partitions, if
   * spark.default.parallelism is set, then we'll use the value from SparkContext
   * defaultParallelism, otherwise we'll use the max number of upstream partitions.
   *
   * Unless spark.default.parallelism is set, the number of partitions will be the
   * same as the number of partitions in the largest upstream RDD, as this should
   * be least likely to cause out-of-memory errors.
   *
   * We use two method parameters (rdd, others) to enforce callers passing at least 1 RDD.
   */
  def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    val rdds = (Seq(rdd) ++ others)
    val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))
    if (hasPartitioner.nonEmpty) {
      hasPartitioner.maxBy(_.partitions.length).partitioner.get
    } else {
      if (rdd.context.conf.contains("spark.default.parallelism")) {
        new HashPartitioner(rdd.context.defaultParallelism)
      } else {
        new HashPartitioner(rdds.map(_.partitions.length).max)
      }
    }
  }
}
```
The `defaultPartitioner` method spells out the strategy for producing the default partitioner. If any of the parent RDDs already has a partitioner, it returns the partitioner of the parent with the largest number of partitions. If none of the parents has one (e.g., none is a partitioned pair RDD), it returns a `HashPartitioner`, whose partition count is determined in one of two ways: if `spark.default.parallelism` is set, that value is used; otherwise, the largest partition count among the parent RDDs is used.
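A small sketch of that resolution order (the RDDs and partition counts are made-up values, reusing the `sc` from the sketch above):

```scala
import org.apache.spark.Partitioner

// Neither parent has a partitioner and spark.default.parallelism is not set,
// so the fallback is a HashPartitioner sized by the largest parent.
val rddA = sc.parallelize(Seq(("a", 1), ("b", 2)), numSlices = 4)
val rddB = sc.parallelize(Seq(("a", 3), ("c", 4)), numSlices = 8)

val p = Partitioner.defaultPartitioner(rddA, rddB)
println(p.numPartitions) // 8 -- the max partition count among the parents
```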
HashPartitioner
```scala
class HashPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }

  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner =>
      h.numPartitions == numPartitions
    case _ =>
      false
  }

  override def hashCode: Int = numPartitions
}
```
This one is fairly simple: the partition ID is the key's `hashCode` modulo the number of partitions (with `null` keys always mapped to partition 0). If the remainder is negative, the number of partitions is added to it, so the final value is always a valid partition ID for that key.
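That is exactly what `Utils.nonNegativeMod` does (the real helper lives in `org.apache.spark.util.Utils`); a minimal standalone sketch:

```scala
// Java/Scala's % can return a negative remainder for negative operands,
// so a negative result is shifted back into the range [0, mod).
def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}

nonNegativeMod("spark".hashCode, 4) // always in 0..3, whatever the sign of hashCode
nonNegativeMod(-7, 4)               // -7 % 4 == -3, so the result is -3 + 4 == 1
```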
RangePartitioner
```scala
class RangePartitioner[K : Ordering : ClassTag, V](
    partitions: Int,
    rdd: RDD[_ <: Product2[K, V]],
    private var ascending: Boolean = true)
  extends Partitioner {

  // We allow partitions = 0, which happens when sorting an empty RDD under the default settings.
  require(partitions >= 0, s"Number of partitions cannot be negative but found $partitions.")

  private var ordering = implicitly[Ordering[K]]

  // An array of upper bounds for the first (partitions - 1) partitions
  private var rangeBounds: Array[K] = {
    if (partitions <= 1) {
      Array.empty
    } else {
      // This is the sample size we need to have roughly balanced output partitions, capped at 1M.
      val sampleSize = math.min(20.0 * partitions, 1e6)
      // Assume the input partitions are roughly balanced and over-sample a little bit.
      val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
      val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
      if (numItems == 0L) {
        Array.empty
      } else {
        // If a partition contains much more than the average number of items, we re-sample from it
        // to ensure that enough items are collected from that partition.
        val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
        val candidates = ArrayBuffer.empty[(K, Float)]
        val imbalancedPartitions = mutable.Set.empty[Int]
        sketched.foreach { case (idx, n, sample) =>
          if (fraction * n > sampleSizePerPartition) {
            imbalancedPartitions += idx
          } else {
            // The weight is 1 over the sampling probability.
            val weight = (n.toDouble / sample.length).toFloat
            for (key <- sample) {
              candidates += ((key, weight))
            }
          }
        }
        if (imbalancedPartitions.nonEmpty) {
          // Re-sample imbalanced partitions with the desired sampling probability.
          val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
          val seed = byteswap32(-rdd.id - 1)
          val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
          val weight = (1.0 / fraction).toFloat
          candidates ++= reSampled.map(x => (x, weight))
        }
        RangePartitioner.determineBounds(candidates, partitions)
      }
    }
  }

  def numPartitions: Int = rangeBounds.length + 1

  private var binarySearch: ((Array[K], K) => Int) = CollectionsUtils.makeBinarySearch[K]

  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[K]
    var partition = 0
    if (rangeBounds.length <= 128) {
      // If we have less than 128 partitions naive search
      while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
        partition += 1
      }
    } else {
      // Determine which binary search method to use only once.
      partition = binarySearch(rangeBounds, k)
      // binarySearch either returns the match location or -[insertion point]-1
      if (partition < 0) {
        partition = -partition - 1
      }
      if (partition > rangeBounds.length) {
        partition = rangeBounds.length
      }
    }
    if (ascending) {
      partition
    } else {
      rangeBounds.length - partition
    }
  }

  // equals, hashCode, and the custom Java serialization hooks are omitted here.
}
```
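A small usage sketch before digging into the sampling logic (the keys are made up; it reuses the `sc` from the earlier sketch): construct a `RangePartitioner` directly and probe where a few keys land.

```scala
import org.apache.spark.RangePartitioner

// Build a RangePartitioner over a small pair RDD and probe a few keys.
val kv = sc.parallelize(Seq(5 -> "e", 1 -> "a", 9 -> "i", 3 -> "c", 7 -> "g"))
val rp = new RangePartitioner(2, kv)

println(rp.numPartitions)   // 2, assuming the sample found a usable boundary
println(rp.getPartition(1)) // 0 -- small keys go to the first partition
println(rp.getPartition(9)) // 1 -- large keys go to the last partition
```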
I won't repeat every detail here; the referenced articles already cover them. The most important piece is how the boundaries are determined: the sampling step uses reservoir sampling, an algorithm for drawing a fixed-size sample from a stream whose total size is not known in advance (see the sketch below).

If you only want to understand the partitioning strategy, you can read the Spark 1.1 code directly. The code after 1.1 is a performance optimization (reservoir sampling cuts down the number of full passes over the data), but the strategy is essentially the same: sample the keys to obtain the boundary values between partitions, then, for each incoming key, determine which boundary range it falls into and store it in that partition. It follows that the partitions are ordered relative to each other -- every element in partition A is smaller than every element in the partition B that follows it -- while the data within a partition is unordered.
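A minimal sketch of reservoir sampling (Spark's real version is `reservoirSampleAndCount` in `org.apache.spark.util.random.SamplingUtils`; the standalone function below is a simplified illustration):

```scala
import scala.reflect.ClassTag
import scala.util.Random

// Keep a uniform sample of size k from a stream whose length is unknown up front:
// fill the reservoir with the first k items, then replace a random slot with
// probability k / (i + 1) for the i-th item seen.
def reservoirSample[T: ClassTag](input: Iterator[T], k: Int, seed: Long = 42L): (Array[T], Long) = {
  val reservoir = new Array[T](k)
  val rand = new Random(seed)
  var i = 0L
  for (item <- input) {
    if (i < k) {
      reservoir(i.toInt) = item
    } else {
      val j = (rand.nextDouble() * (i + 1)).toLong
      if (j < k) reservoir(j.toInt) = item
    }
    i += 1
  }
  // If the stream was shorter than k, trim the unfilled tail.
  (reservoir.take(math.min(i, k.toLong).toInt), i)
}

val (sample, total) = reservoirSample((1 to 100000).iterator, 20)
println(s"sampled ${sample.length} of $total items")
```

Because each item ends up in the sample with equal probability, the sample is uniform without knowing the stream length in advance, which is exactly what `RangePartitioner` needs when sketching each input partition in a single pass.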