浪尖's followers probably haven't seen a Spark source-code walkthrough from 浪尖 for quite a while. Today I'm sharing an article to help you better understand how an RDD is actually computed in Spark, and to explain how coalesce reduces the number of partitions and the pitfalls of using it.
The reason is that someone in the Knowledge Planet group asked about how coalesce works and how to use it, and had picked up a wrong understanding from incorrect explanations of coalesce floating around online, so 浪尖 is taking a moment to clear things up.
浪尖 recommends spending more time reading the Spark source code. In my view it is one of the best-commented codebases, and its overall logic is quite clear. The heavy use of Scala higher-order functions can make the early reading painful, but there is no denying that Spark is a good reference for learning to write idiomatic Scala.
I have to grumble a bit here: Flink's code is poorly written and poorly commented, and doesn't feel well suited for reading and learning.
1. The coalesce function
When using Spark operators, you should regularly go back to the comments in the source code and understand how each operator is implemented. In many cases the comment already explains the operator's use cases and principle clearly. For example, the comment on the coalesce function discussed in this article reads:
/**
* Return a new RDD that is reduced into `numPartitions` partitions.
*
* This results in a narrow dependency, e.g. if you go from 1000 partitions
* to 100 partitions, there will not be a shuffle, instead each of the 100
* new partitions will claim 10 of the current partitions. If a larger number
* of partitions is requested, it will stay at the current number of partitions.
*
* However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
* this may result in your computation taking place on fewer nodes than
* you like (e.g. one node in the case of numPartitions = 1). To avoid this,
* you can pass shuffle = true. This will add a shuffle step, but means the
* current upstream partitions will be executed in parallel (per whatever
* the current partitioning is).
*
* @note With shuffle = true, you can actually coalesce to a larger number
* of partitions. This is useful if you have a small number of partitions,
* say 100, potentially with a few partitions being abnormally large. Calling
* coalesce(1000, shuffle = true) will result in 1000 partitions with the
* data distributed using a hash partitioner. The optional partition coalescer
* passed in must be serializable.
*/
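To make the behavior described in the comment concrete, here is a minimal, runnable sketch (the object name CoalesceDemo, the local[4] master, and the example data are my own illustration, not from the Spark source or this article) that exercises the three cases: narrow coalesce, requesting more partitions without a shuffle, and coalescing upward with shuffle = true.

import org.apache.spark.sql.SparkSession

object CoalesceDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CoalesceDemo")
      .master("local[4]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Parent RDD with 1000 partitions, matching the example in the comment.
    val parent = sc.parallelize(1 to 100000, 1000)

    // Narrow dependency: each of the 100 new partitions claims about 10
    // parent partitions; no shuffle is triggered.
    val narrowed = parent.coalesce(100)
    println(narrowed.getNumPartitions) // 100

    // Without a shuffle, asking for MORE partitions has no effect:
    // the RDD stays at the current number of partitions.
    println(parent.coalesce(2000).getNumPartitions) // still 1000

    // With shuffle = true, a shuffle step is added, so the partition count
    // can actually grow, and the upstream stage keeps its original parallelism.
    val reshuffled = parent.coalesce(2000, shuffle = true)
    println(reshuffled.getNumPartitions) // 2000

    spark.stop()
  }
}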
The gist of the comment: suppose the parent RDD has 1000 partitions