Spark partitioning operators: partitionBy, coalesce, repartition

南风知我意丿

Last modified 2022-07-22 15:40:26

First published 2022-07-22 15:39:43

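Background

A few days ago I got curious about how df.show() works and found an article on the topic. While reading it, I started to question some of the code in that article.

1. The code in question: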

val df = Seq((5,5), (6,6), (7,7), (8,8), (1,1), (2,2), (3,3), (4,4)).toDF("col1", "col2")
val df3 = df.repartition(3)

// let's see the partition structure
df3.rdd.glom().collect()
/*
Array(Array([8,8], [1,1], [2,2]), Array([5,5], [6,6]), Array([7,7], [3,3], [4,4]))
*/

// and let's see the top 4 rows this time
df3.show(4, false)
/*
+----+----+
|col1|col2|
+----+----+
|8   |8   |
|1   |1   |
|2   |2   |
|5   |5   |
+----+----+
*/

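In the code below, why do 8, 1, and 2 end up in the same partition? That is not what hash partitioning would produce. Could it be range partitioning? Let's test it.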

df3.rdd.glom().collect()
/*
Array(Array([8,8], [1,1], [2,2]), Array([5,5], [6,6]), Array([7,7], [3,3], [4,4]))
*/

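2. Testing the question:

import org.apache.spark.{HashPartitioner, RangePartitioner}
import org.apache.spark.rdd.RDD

// Test RDD partitioning
val rdd1: RDD[(Int, Int)] = sc.makeRDD(Seq((5, 5), (6, 6), (7, 7), (8, 8), (1, 1), (2, 2), (3, 3), (4, 4)), 8)
println(rdd1.getNumPartitions) // 8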

HashPartitioner

    // Hash partitioning
    val rdd2: RDD[(Int, Int)] = rdd1.partitionBy(new HashPartitioner(3))
    println(rdd2.getNumPartitions)//3
    rdd2.mapPartitionsWithIndex((index,iter)=>{
      val str: String = iter.map(_._2).mkString(",")
      Iterator((index,str))
    }).toDF("index","Value").show()
    
	+-----+-----+
	|index|Value|
	+-----+-----+
	|    0|  6,3|
	|    1|7,1,4|
	|    2|5,8,2|
	+-----+-----+
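
The layout above can be reproduced without Spark. The following is a minimal sketch, assuming the partition index is the key's hashCode taken modulo the partition count and adjusted to be non-negative, with null keys sent to partition 0. The object and method names are made up for illustration.

```scala
object HashPartitionSketch {
  // Map a hash code to an index in [0, mod), even when the hash code is negative.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Partition index for a key: null keys go to partition 0, others by hashCode.
  def partitionOf(key: Any, numPartitions: Int): Int =
    if (key == null) 0 else nonNegativeMod(key.hashCode, numPartitions)

  // Group keys by partition index, keeping input order inside each group.
  def layout(keys: Seq[Int], numPartitions: Int): Seq[(Int, Seq[Int])] =
    keys.groupBy(partitionOf(_, numPartitions)).toSeq.sortBy(_._1)

  def main(args: Array[String]): Unit =
    layout(Seq(5, 6, 7, 8, 1, 2, 3, 4), 3).foreach { case (p, ks) =>
      println(s"$p -> ${ks.mkString(",")}")
    }
}
```

For the eight keys used in this post the sketch prints 0 -> 6,3, then 1 -> 7,1,4, then 2 -> 5,8,2, which is the same grouping as the table above.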

RangePartitioner

    // Range partitioning
    val rdd3: RDD[(Int, Int)] = rdd1.partitionBy(new RangePartitioner(3, rdd1))
    println(rdd3.getNumPartitions)//3
    rdd3.mapPartitionsWithIndex((index,iter)=>{
      val str: String = iter.map(_._2).mkString(",")
      Iterator((index,str))
    }).toDF("index","Value").show()
    
	+-----+-----+
	|index|Value|
	+-----+-----+
	|    0|1,2,3|
	|    1|5,6,4|
	|    2|  7,8|
	+-----+-----+
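
A range partitioner sends each key to the first range whose upper bound is not smaller than the key. The sketch below is plain Scala and only illustrates that lookup. The upper bounds 3 and 6 are an assumption read off the output above; the real RangePartitioner derives its bounds by sampling the RDD, so they can differ from run to run. All names are made up for illustration.

```scala
object RangePartitionSketch {
  // Index of the first range whose upper bound is >= key;
  // keys above every bound fall into the last partition.
  def partitionOf(key: Int, upperBounds: Seq[Int]): Int = {
    var p = 0
    while (p < upperBounds.length && key > upperBounds(p)) p += 1
    p
  }

  // Group keys by range index, keeping input order inside each group.
  def layout(keys: Seq[Int], upperBounds: Seq[Int]): Seq[(Int, Seq[Int])] =
    keys.groupBy(partitionOf(_, upperBounds)).toSeq.sortBy(_._1)

  def main(args: Array[String]): Unit = {
    // Assumed bounds: two bounds give three ranges (<=3, 4..6, >6).
    val assumedBounds = Seq(3, 6)
    layout(Seq(5, 6, 7, 8, 1, 2, 3, 4), assumedBounds).foreach { case (p, ks) =>
      println(s"$p -> ${ks.mkString(",")}")
    }
  }
}
```

With those assumed bounds the sketch groups the keys as 1,2,3 / 5,6,4 / 7,8, matching the table above.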

repartition

scala> val rdd100 = sc.makeRDD(Seq((5, 5), (6, 6), (7, 7), (8, 8), (1, 1), (2, 2), (3, 3), (4, 4)),8)
rdd100: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[11] at makeRDD at <console>:24

scala> val rdd101 = rdd100.repartition(3)
rdd101: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[15] at repartition at <console>:25

scala> rdd101.glom.collect
res3: Array[Array[(Int, Int)]] = Array(Array(), Array((5,5), (7,7), (2,2), (3,3), (4,4)), Array((6,6), (8,8), (1,1)))

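3. Conclusion

partitionBy places each record according to the partitioner you pass in. repartition does not look at the record keys: it spreads records round-robin from a random starting position and cannot be given a partitioner. In the run above every input partition holds a single record, so each record's target depends only on that random start, which is why the three output partitions come out uneven.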
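
One way to see the difference is to inspect the partitioner field of the resulting RDDs. The sketch below assumes Spark is on the classpath and runs in local mode; the object name, app name, and master setting are illustrative choices, not part of the original test.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionerCheck {
  // Returns (partitionBy result has a partitioner, repartition result has a partitioner).
  def run(): (Boolean, Boolean) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("partitioner-check")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.makeRDD(Seq((5, 5), (6, 6), (7, 7), (8, 8), (1, 1), (2, 2), (3, 3), (4, 4)), 8)
      val byHash = rdd.partitionBy(new HashPartitioner(3)) // keeps the partitioner it was given
      val reparted = rdd.repartition(3)                    // keys are ignored, no partitioner kept
      (byHash.partitioner.isDefined, reparted.partitioner.isDefined)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The expected result is (true, false): only the RDD produced by partitionBy remembers how its keys were placed.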

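Partition operators in detail

1. partitionBy

Difference from repartition: partitionBy distributes records according to a partitioner, whereas repartition uses a random round-robin strategy and cannot be given a partitioner.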

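2. coalesce

coalesce changes the number of partitions. Without a shuffle it merges existing partitions; with shuffle = true it distributes records evenly using generated keys. It takes only a target partition count and cannot be given a partitioner.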

val sc = new SparkContext()
val inputRDD = sc.parallelize(Array[(Int, Char)]((3, 'c'), (3, 'f'), (1, 'a'), (4, 'd'), (1, 'h'), (2, 'b'), (5, 'e'), (2, 'g')), 5)
var coalesceRDD = inputRDD.coalesce(2) // panel 1 of Figure 3.19
coalesceRDD = inputRDD.coalesce(6) // panel 2 of Figure 3.19
coalesceRDD = inputRDD.coalesce(2, true) // panel 3 of Figure 3.19
coalesceRDD = inputRDD.coalesce(6, true) // panel 4 of Figure 3.19

Reducing the number of partitions

 var coalesceRDD = inputRDD.coalesce(2) // panel 1 of Figure 3.19

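[Figure 3.19, panel 1]

As the first panel of Figure 3.19 shows, rdd1 has 5 partitions. When coalesce(2) reduces it to two partitions, Spark merges adjacent partitions directly to produce rdd2, and the resulting data dependency is a many-to-one NarrowDependency. The drawback is that when the partitions of rdd1 differ widely in size, merging them directly can easily cause data skew (some partitions of rdd2 hold far more or far less data than others).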
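
The merge of adjacent partitions, and the skew it can produce, can be sketched in plain Scala. This is an illustration only: it assumes parents are split into contiguous index ranges and that no locality information influences the grouping. The record counts per partition are hypothetical.

```scala
object NarrowCoalesceSketch {
  // Split parent partition indices 0 until numParents into contiguous groups.
  // A narrow coalesce cannot split a parent, so it never yields more groups than parents.
  def groupParents(numParents: Int, target: Int): Seq[Seq[Int]] =
    if (target >= numParents) (0 until numParents).map(i => Seq(i))
    else (0 until target).map { i =>
      val start = ((i.toLong * numParents) / target).toInt
      val end = (((i.toLong + 1) * numParents) / target).toInt
      (start until end).toList
    }

  // Total record count of each merged partition, given the record count of each parent.
  def mergedSizes(parentSizes: Seq[Int], target: Int): Seq[Int] =
    groupParents(parentSizes.length, target).map(_.map(parentSizes).sum)

  def main(args: Array[String]): Unit = {
    println(groupParents(5, 2).map(_.mkString("[", ",", "]")).mkString(" "))
    // Hypothetical record counts: one large parent and four small ones.
    println(mergedSizes(Seq(1000, 10, 10, 10, 10), 2).mkString(", "))
  }
}
```

With five parents and a target of two, the groups are [0,1] and [2,3,4]. With the hypothetical sizes, the merged partitions hold 1010 and 30 records, which is the kind of skew described above.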

Increasing the number of partitions

 coalesceRDD = inputRDD.coalesce(6)

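[Figure 3.19, panel 2]

As the second panel of Figure 3.19 shows, when coalesce(6) is used to raise rdd1's partition count to 6, the resulting rdd2 still has 5 partitions. This is because coalesce() uses a NarrowDependency by default and cannot split one partition into several.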

Using shuffle to reduce the number of partitions

coalesceRDD = inputRDD.coalesce(2, true)

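[Figure 3.19, panel 3]

As the third panel of Figure 3.19 shows, to address the data skew problem we can use coalesce(2, shuffle = true) to reduce the number of partitions. With shuffle = true, Spark scatters the data so that the partitions of the resulting RDD are fairly balanced. Concretely, it attaches a special key to every record in rdd1 (see the MapPartitionsRDD in the third panel). The key is an Int. The first key is drawn at random from [0, numPartitions), as in <3,f> => <2,(3,f)>, where 2 is the randomly generated key, and each following record's key is incremented by 1, as in <1,a> => <3,(1,a)>. Spark can then distribute rdd1's data to the partitions of rdd2 by the hash of the key, and finally strips the key again (see the last MapPartitionsRDD).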
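
The key-tagging scheme just described can be sketched in plain Scala. This is not Spark's code: it assumes scala.util.Random seeded with the source partition index as a stand-in for whatever generator Spark uses, and it routes by key modulo the target count, which is what hashing an Int key amounts to. All names are illustrative.

```scala
import scala.util.Random

object ShuffleCoalesceSketch {
  // Tag the records of one source partition: random start in [0, numPartitions), then +1 per record.
  def assignKeys[T](partitionIndex: Int, records: Seq[T], numPartitions: Int): Seq[(Int, T)] = {
    val start = new Random(partitionIndex).nextInt(numPartitions)
    records.zipWithIndex.map { case (r, i) => (start + i, r) }
  }

  // Route every tagged record to key % numPartitions, then drop the key.
  def redistribute[T](parts: Seq[Seq[T]], numPartitions: Int): Seq[Seq[T]] = {
    val keyed = parts.zipWithIndex.flatMap { case (recs, idx) => assignKeys(idx, recs, numPartitions) }
    (0 until numPartitions).map(p => keyed.filter(_._1 % numPartitions == p).map(_._2))
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical skewed input: one partition with six records, two with one record each.
    val skewed = Seq(Seq('a', 'b', 'c', 'd', 'e', 'f'), Seq('g'), Seq('h'))
    println(redistribute(skewed, 2).map(_.size).mkString(", "))
  }
}
```

Because consecutive records of one source partition receive consecutive keys, each source partition is dealt out round-robin, so its contribution to any two target partitions differs by at most one record.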

Using shuffle to increase the number of partitions

coalesceRDD = inputRDD.coalesce(6, true)

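[Figure 3.19, panel 4]

As the fourth panel of Figure 3.19 shows, a ShuffleDependency allows partitions to be split and recombined, which solves the problem that the partition count could not be increased.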
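
The four cases can be checked together by comparing partition counts. The sketch below assumes Spark is on the classpath and runs in local mode; the object name, app name, and master setting are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceCounts {
  // Partition counts for coalesce(2), coalesce(6), coalesce(2, shuffle) and coalesce(6, shuffle).
  def run(): Seq[Int] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("coalesce-counts")
    val sc = new SparkContext(conf)
    try {
      val inputRDD = sc.parallelize(
        Seq((3, 'c'), (3, 'f'), (1, 'a'), (4, 'd'), (1, 'h'), (2, 'b'), (5, 'e'), (2, 'g')), 5)
      Seq(
        inputRDD.coalesce(2).getNumPartitions,                 // narrow merge
        inputRDD.coalesce(6).getNumPartitions,                 // cannot grow without a shuffle
        inputRDD.coalesce(2, shuffle = true).getNumPartitions, // shuffle, fewer partitions
        inputRDD.coalesce(6, shuffle = true).getNumPartitions  // shuffle, more partitions
      )
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run().mkString(", "))
}
```

The expected counts are 2, 5, 2, 6: only the shuffled variant can go above the original five partitions.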

3. repartition

repartition = coalesce(numPartitions,true)

repartition(numPartitions): Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
By design, repartition is meant to distribute data more evenly.

3.1 Using coalesce and repartition

Semantically: repartition = coalesce(numPartitions, true)

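coalesce
By default, coalesce can only reduce the number of partitions. If shuffle is false and the argument is larger than the partition count of the calling RDD, the partition count does not change.
coalesce is typically used to reduce the number of partitions, for example to merge small files on output. There are two ways to reduce the partition count:
Case 1: the default coalesce(parNum, false). There is only one stage, and the parallelism of that whole stage is set to parNum, which may hurt performance. Within a single stage, the partition count set by coalesce takes the highest priority.
Case 2: coalesce(parNum, true). There are two stages: stage 0 keeps the original parallelism, and the partitions are merged afterwards to reduce parallelism. This suits a large dataset that becomes small after filtering.
For example, an RDD that originally has 200 partitions may shrink drastically after a filter, so that 200 partitions are too many. In that case use coalesce(parNum, true).

repartition
repartition returns an RDD with exactly parNum partitions and always shuffles. It is generally used to increase the partition count and raise parallelism.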
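
The filter-then-compact scenario above can be sketched as follows. The numbers (100000 records, 200 partitions, a filter that keeps every thousandth record, a target of 10) are made up for illustration, and the sketch assumes Spark is on the classpath and runs in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FilterThenCoalesce {
  // Returns (partitions after filter, partitions after coalesce, records kept).
  def run(): (Int, Int, Long) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("filter-then-coalesce")
    val sc = new SparkContext(conf)
    try {
      val big = sc.parallelize(1 to 100000, 200)
      // filter keeps the 200 partitions even though only 100 records survive
      val small = big.filter(_ % 1000 == 0)
      // the filter still runs with 200 tasks; the shuffle then compacts the result
      val compacted = small.coalesce(10, shuffle = true)
      (small.getNumPartitions, compacted.getNumPartitions, compacted.count())
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The expected result is (200, 10, 100). Using coalesce(10) without the shuffle would also give 10 partitions, but the filter would then run with only 10 tasks, which is the performance concern raised in case 1.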