repartition:
/**
* Return a new RDD that has exactly numPartitions partitions.
*
* Can increase or decrease the level of parallelism in this RDD. Internally, this uses
* a shuffle to redistribute data.
*
* If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
* which can avoid performing a shuffle.
*
* TODO Fix the Shuffle+Repartition data loss issue described in SPARK-23207.
*/
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
coalesce(numPartitions, shuffle = true)
}
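As the body shows, repartition(n) is nothing more than coalesce(n, shuffle = true). A minimal sketch of the equivalence (assuming an existing SparkContext named sc; names and numbers are illustrative):

// repartition always shuffles; coalesce only shuffles when asked to.
val rdd = sc.parallelize(1 to 1000, 10)

val byRepartition = rdd.repartition(4)
val byCoalesce    = rdd.coalesce(4, shuffle = true)

println(byRepartition.getNumPartitions) // 4
println(byCoalesce.getNumPartitions)    // 4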
coalesce:
/**
* Return a new RDD that is reduced into `numPartitions` partitions.
*
* This results in a narrow dependency, e.g. if you go from 1000 partitions
* to 100 partitions, there will not be a shuffle, instead each of the 100
* new partitions will claim 10 of the current partitions. If a larger number
* of partitions is requested, it will stay at the current number of partitions.
*
* However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
* this may result in your computation taking place on fewer nodes than
* you like (e.g. one node in the case of numPartitions = 1). To avoid this,
* you can pass shuffle = true. This will add a shuffle step, but means the
* current upstream partitions will be executed in parallel (per whatever
* the current partitioning is).
*
 * My understanding: when collapsing many partitions down to a single partition, setting
 * shuffle = true avoids having one executor pull all of its parent RDD's data over Netty.
 * With shuffle = true, the executors that hold the parent partitions instead send their
 * data toward the single target partition, and the upstream work still runs on all of them.
*
* @note With shuffle = true, you can actually coalesce to a larger number
* of partitions. This is useful if you have a small number of partitions,
* say 100, potentially with a few partitions being abnormally large. Calling
* coalesce(1000, shuffle = true) will result in 1000 partitions with the
* data distributed using a hash partitioner. The optional partition coalescer
* passed in must be serializable.
*
 * If some partitions are abnormally large, you can spread the data across more partitions
 * by redistributing it with a hash partitioner.
*/
def coalesce(numPartitions: Int, shuffle: Boolean = false,
partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
(implicit ord: Ordering[T] = null)
: RDD[T] = withScope {
require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
if (shuffle) {
/** Distributes elements evenly across output partitions, starting from a random partition. */
val distributePartition = (index: Int, items: Iterator[T]) => {
var position: Int = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
items.map { t =>
// Note that the hash code of the key will just be the key itself. The HashPartitioner
// will mod it with the number of total partitions.
position = position + 1
(position, t)
}
}: Iterator[(Int, T)]
// include a shuffle step so that our upstream tasks are still distributed
new CoalescedRDD(
new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
new HashPartitioner(numPartitions)),
numPartitions,
partitionCoalescer).values
} else {
new CoalescedRDD(this, numPartitions, partitionCoalescer)
}
}
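With shuffle = true, each input partition assigns its elements round-robin keys starting from a pseudo-random offset derived from the partition index, and the HashPartitioner then takes those keys modulo numPartitions, so the data ends up spread roughly evenly. A small plain-Scala sketch of that key assignment outside Spark (simulateKeys is a made-up helper for illustration, not Spark API):

import scala.util.Random
import scala.util.hashing

def simulateKeys(partitionIndex: Int, items: Seq[String], numPartitions: Int): Seq[(Int, String)] = {
  // same seeding as distributePartition above
  var position = new Random(hashing.byteswap32(partitionIndex)).nextInt(numPartitions)
  items.map { t =>
    position += 1
    // HashPartitioner later takes position % numPartitions, so consecutive elements
    // of one input partition land in consecutive output partitions
    (position, t)
  }
}

simulateKeys(0, Seq("a", "b", "c", "d"), 3).foreach(println)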
CoalescedRDD:
/**
* Represents a coalesced RDD that has fewer partitions than its parent RDD
* This class uses the PartitionCoalescer class to find a good partitioning of the parent RDD
* so that each new partition has roughly the same number of parent partitions and that
* the preferred location of each new partition overlaps with as many preferred locations of its
* parent partitions
 * For example: group as many parent partitions as possible into the same new partition. If a
 * 10-partition parent RDD is shrunk to a single partition and parent partitions 1-6 live on the
 * same executor, the new partition is placed where partitions 1-6 are; the data of parent
 * partitions 7-10 is then fetched over Netty to that location. Because nothing is written to
 * disk, the disk I/O of a shuffle is avoided, which makes this noticeably faster than a shuffle.
*
* @param prev RDD to be coalesced
* @param maxPartitions number of desired partitions in the coalesced RDD (must be positive)
* @param partitionCoalescer [[PartitionCoalescer]] implementation to use for coalescing
*/
private[spark] class CoalescedRDD[T: ClassTag](
@transient var prev: RDD[T],
maxPartitions: Int,
partitionCoalescer: Option[PartitionCoalescer] = None)
extends RDD[T](prev.context, Nil) { // Nil since we implement getDependencies
require(maxPartitions > 0 || maxPartitions == prev.partitions.length,
s"Number of partitions ($maxPartitions) must be positive.")
if (partitionCoalescer.isDefined) {
require(partitionCoalescer.get.isInstanceOf[Serializable],
"The partition coalescer passed in must be serializable.")
}
override def getPartitions: Array[Partition] = {
val pc = partitionCoalescer.getOrElse(new DefaultPartitionCoalescer())
pc.coalesce(maxPartitions, prev).zipWithIndex.map {
case (pg, i) =>
val ids = pg.partitions.map(_.index).toArray
new CoalescedRDDPartition(i, prev, ids, pg.prefLoc)
}
}
override def compute(partition: Partition, context: TaskContext): Iterator[T] = {
    // flatMap the elements of all parent partitions together into this single new partition
partition.asInstanceOf[CoalescedRDDPartition].parents.iterator.flatMap { parentPartition =>
firstParent[T].iterator(parentPartition, context)
}
}
override def getDependencies: Seq[Dependency[_]] = {
Seq(new NarrowDependency(prev) {
def getParents(id: Int): Seq[Int] =
partitions(id).asInstanceOf[CoalescedRDDPartition].parentsIndices
})
}
override def clearDependencies() {
super.clearDependencies()
prev = null
}
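The narrow dependency is easy to see from the lineage. A minimal sketch (assuming an existing SparkContext sc):

// coalesce without shuffle stays a narrow dependency; shuffle = true inserts a ShuffledRDD
val base = sc.parallelize(1 to 100, 8)

println(base.coalesce(2).toDebugString)                 // CoalescedRDD sits directly on the parent
println(base.coalesce(2, shuffle = true).toDebugString) // lineage now contains a ShuffledRDD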
In practice you can use coalesce flexibly according to your needs:
If parallelism is insufficient, raise the number of partitions.
If there are too many small tasks, merge partitions to shrink their number and cut task-scheduling overhead; since this avoids a shuffle, it is noticeably more efficient than RDD.repartition.
If one partition is heavily skewed, repartitioning also helps: raising the partition count with repartition spreads the data out and relieves the skew. (See the short sketch below.)
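A minimal sketch of both cases (assuming an existing SparkContext sc; the path and numbers are illustrative):

// too many small tasks: merge partitions without a shuffle
val manySmall = sc.textFile("hdfs:///logs/*", minPartitions = 2000)
val merged    = manySmall.coalesce(200)

// skewed data: repartition (shuffle) to spread rows over more partitions
val skewed   = sc.parallelize(Seq.fill(1000000)(("hot_key", 1)), 4)
val balanced = skewed.repartition(64)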
Example:
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object CoalesceTest extends App {
  val sparkConf = new SparkConf()
    .setAppName("CoalesceTest")
    .setMaster("local[6]")
  val spark = SparkSession
    .builder()
    .config(sparkConf)
    .getOrCreate()

  val value: RDD[Int] = spark.sparkContext.parallelize(List(9, 2, 3, 5, 8, 1), 3)

  // A proper comparator (the original returned x or y instead of a comparison result).
  // Note that coalesce never actually applies this ordering -- see below.
  val s = new Ordering[Int] {
    override def compare(x: Int, y: Int): Int = x.compareTo(y)
  }

  val coalesceValue: RDD[Int] = value.coalesce(2)(s)
  coalesceValue.foreach(println(_))
}
coalesce takes an optional implicit parameter,
(implicit ord: Ordering[T] = null). In the example above a custom Ordering is supplied, yet the output is not sorted: after the partitions are merged, the elements keep their original positions.
https://blog.csdn.net/u012684933/article/details/51028707