11. The repartition operation
Create an RDD consisting of the numbers 1 to 100 with 10 partitions. Then call repartition to reduce the number of partitions to 5, and then increase it to 7, observing the effect of each operation.
scala> val rddData1 = sc.parallelize(1 to 100,10)
rddData1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at parallelize at <console>:24
scala> rddData1.partitions.length
res4: Int = 10
scala> val rddData2 = rddData1.repartition(5)
rddData2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[23] at repartition at <console>:26
scala> rddData2.partitions.length
res7: Int = 5
scala> val rddData3 = rddData2.repartition(7)
rddData3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[27] at repartition at <console>:28
scala> rddData3.partitions.length
res8: Int = 7
Explanation:
As you can see, repartition behaves much like coalesce, but repartition is more flexible: it can both increase and decrease the number of partitions. In fact, coalesce can do both as well, but increasing the partition count requires enabling its shuffle parameter, e.g. coalesce(7, true); with shuffle enabled, coalesce can likewise scale the partition count up or down.
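To make the difference concrete, here is a minimal sketch (the RDD names rddA through rddD are illustrative and not part of the session above); it assumes a SparkContext named sc is already available, as in the spark-shell:

// Start from an RDD with 5 partitions.
val rddA = sc.parallelize(1 to 100, 5)

// Without shuffle, coalesce can only reduce the partition count;
// asking for more partitions leaves it unchanged.
val rddB = rddA.coalesce(7)
println(rddB.partitions.length)    // expected: 5

// With shuffle enabled, coalesce can also increase the partition count.
val rddC = rddA.coalesce(7, true)
println(rddC.partitions.length)    // expected: 7

// repartition(n) shuffles the data, so it can grow the partition count directly.
val rddD = rddA.repartition(7)
println(rddD.partitions.length)    // expected: 7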