The difference between repartition and partitionBy in Spark


ImBetter


Article link: https://blog.csdn.net/ImBetter/article/details/79981088


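repartition and partitionBy both redistribute data across partitions, and both typically end up going through a HashPartitioner (repartition creates one internally; partitionBy uses the partitioner it is given). The obvious difference is that partitionBy is only available on a pair RDD. The less obvious one is that, when both are applied to the same pair RDD, they place the records differently: partitionBy places each record according to its key, while repartition does not.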
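To make "placed according to its key" concrete, here is a minimal plain-Scala model of hash partitioning. It is not Spark code: `HashPlacementSketch`, `partitionFor` and `placeByKey` are hypothetical names introduced for illustration, and the sample pairs are made up. The only thing taken from Spark is the rule itself, namely that HashPartitioner maps a key to `key.hashCode` modulo the number of partitions, adjusted to be non-negative, with a null key going to partition 0.

```scala
object HashPlacementSketch {
  // Models the HashPartitioner rule: hashCode modulo numPartitions,
  // shifted into the range [0, numPartitions). A null key goes to partition 0.
  def partitionFor(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case k =>
      val raw = k.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw
  }

  // Groups key-value pairs by the partition that their key maps to.
  def placeByKey[K, V](pairs: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
    pairs.groupBy { case (k, _) => partitionFor(k, numPartitions) }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample data: Int keys, so hashCode is the key itself.
    val pairs = Seq((1, "a"), (2, "b"), (3, "c"), (4, "d"), (1, "e"), (2, "f"))
    placeByKey(pairs, 3).toSeq.sortBy(_._1).foreach { case (p, items) =>
      println(s"partition $p: ${items.mkString(", ")}")
    }
  }
}
```

In this model, records that share a key always land in the same partition, which is the layout one would expect from partitionBy. On a real RDD, the per-partition contents can be inspected with `rdd.glom().collect()`.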

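It is easy to see that the result of partitionBy is the one we would expect. Why does repartition behave differently? Let's open the source code of repartition and take a look: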

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}

def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = (new Random(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
        numPartitions).values
    } else {
      new CoalescedRDD(this, numPartitions)
    }
}

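Did you notice the line `(position, t)` inside distributePartition? repartition does not use the original key at all! Instead it wraps every record `t` (for a pair RDD, the whole key-value pair) in a new tuple whose key is a generated integer: `position` starts at a random value seeded by the index of the input partition and is incremented by one for each record. The HashPartitioner is then applied to this generated key, so the records are spread evenly across the output partitions regardless of their original keys.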
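The effect of the generated key can be modeled in plain Scala without Spark. This is only a sketch of the logic in distributePartition quoted above, under the assumption that an input partition can be represented as an in-memory `Seq`; `RepartitionSketch`, `syntheticKeys` and `targets` are hypothetical names, not Spark APIs.

```scala
import scala.util.Random

object RepartitionSketch {
  // Non-negative modulo, the same rule a hash partitioner applies to an Int key.
  private def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Models distributePartition for one input partition: pick a start position
  // seeded by the partition index, then assign consecutive integers as keys.
  def syntheticKeys[T](index: Int, items: Seq[T], numPartitions: Int): Seq[(Int, T)] = {
    var position = new Random(index).nextInt(numPartitions)
    items.map { t =>
      position = position + 1
      (position, t)
    }
  }

  // Output partition of every record: the generated key modulo numPartitions.
  // The original key inside the record plays no part in the placement.
  def targets[T](index: Int, items: Seq[T], numPartitions: Int): Seq[(Int, T)] =
    syntheticKeys(index, items, numPartitions).map { case (k, t) =>
      (nonNegativeMod(k, numPartitions), t)
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical input partition 0: three records that all share the key 1.
    val items = Seq((1, "a"), (1, "b"), (1, "c"))
    targets(0, items, 3).foreach { case (p, record) =>
      println(s"record $record -> partition $p")
    }
  }
}
```

Because the generated keys are consecutive, three records with the same original key end up in three different output partitions in this model. That is why repartition balances partition sizes but does not group records by key.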

The Spark version used here is 1.6.1.