Custom Partitioners
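When calling partitionBy(), we have to pass in a partitioner. In some scenarios we need specific records to land in one particular partition, and the built-in partitioners cannot express that requirement.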
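For orientation, here is a minimal sketch of passing a built-in partitioner to partitionBy(). It assumes Spark is on the classpath and runs in local mode; the object name PartitionByExample, the helper keysPerPartition, and the sample data are made up for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical example object; not part of Spark.
object PartitionByExample {
  // Repartitions a small pair RDD by key hash and returns the keys
  // that ended up in each partition.
  def keysPerPartition(sc: SparkContext, numPartitions: Int): Array[Array[String]] = {
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("a", 4)))
    // partitionBy is only available on RDDs of key-value pairs.
    val partitioned = pairs.partitionBy(new HashPartitioner(numPartitions))
    // glom() turns each partition into an array so it can be inspected.
    partitioned.keys.glom().collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partitionBy-example").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      keysPerPartition(sc, 2).zipWithIndex.foreach { case (keys, idx) =>
        println(s"partition $idx: ${keys.mkString(", ")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

Records with the same key always land in the same partition, but which partition that is depends only on the key's hash. That is the limitation a custom partitioner removes.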
Source analysis: the org.apache.spark.HashPartitioner partitioner
class HashPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }

  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner =>
      h.numPartitions == numPartitions
    case _ =>
      false
  }

  override def hashCode: Int = numPartitions
}
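The getPartition logic above can be tried out without a Spark cluster. The following is a small standalone sketch: Utils is internal to Spark, so nonNegativeMod is re-implemented here to mirror its behaviour, and the object name HashPartitionSketch is invented for this example.

```scala
// Hypothetical standalone object; it only imitates HashPartitioner.getPartition.
object HashPartitionSketch {
  // The JVM's % operator can return a negative result for a negative hash code,
  // so the result is shifted back into the range [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  // Same shape as HashPartitioner.getPartition: null keys go to partition 0,
  // every other key goes to hashCode modulo the number of partitions.
  def getPartition(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case _    => nonNegativeMod(key.hashCode, numPartitions)
  }

  def main(args: Array[String]): Unit = {
    println(getPartition("a", 3))  // "a".hashCode is 97, and 97 % 3 = 1
    println(getPartition(-7, 3))   // -7 % 3 is -1 on the JVM, shifted to 2
    println(getPartition(null, 3)) // null keys always map to 0
  }
}
```

The key point is that the partition index is derived purely from the key's hash code, so the caller has no say over which keys share a partition.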
HashPartitioner extends Partitioner, so let's look at the source of that class next:
abstract class Partitioner extends Serializ