SparkCore: RDD shuffle operations, the coalesce and repartition operators, and reduceByKey vs. groupByKey


Official docs: Shuffle operations
http://spark.apache.org/docs/2.4.2/rdd-programming-guide.html#shuffle-operations

1. Shuffle operations

  • Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.

  • To understand what happens during the shuffle we can consider the example of the reduceByKey operation. The reduceByKey operation generates a new RDD where all values for a single key are combined into a tuple - the key and the result of executing a reduce function against all values associated with that key. The challenge is that not all values for a single key necessarily reside on the same partition, or even the same machine, but they must be co-located to compute the result.
    In other words, the values for a given key may sit in different partitions or even on different machines, but reduceByKey has to pull all of them together before the result for that key can be computed (see the quick lineage check after this list).

  • Operations which can cause a shuffle include repartition operations like repartition and coalesce, ‘ByKey operations (except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join.
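
A quick way to see whether an operation introduces a shuffle is to inspect the RDD lineage with toDebugString. The following is a minimal spark-shell sketch (the sample pairs are made up for illustration); a ShuffledRDD in the output marks the stage boundary that reduceByKey creates.

scala> val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)), 2)
scala> val counts = pairs.reduceByKey(_ + _)   // wide dependency: all values for a key must be co-located
scala> println(counts.toDebugString)           // lineage should show a ShuffledRDD, i.e. a shuffle boundary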

2. Performance Impact


  • The Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark’s map and reduce operations.

  • Internally, results from individual map tasks are kept in memory until they can’t fit. Then, these are sorted based on the target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks.

  • Certain shuffle operations can consume significant amounts of heap memory since they employ in-memory data structures to organize records before or after transferring them. Specifically, reduceByKey and aggregateByKey create these structures on the map side, and 'ByKey operations generate these on the reduce side. When data does not fit in memory Spark will spill these tables to disk, incurring the additional overhead of disk I/O and increased garbage collection.

  • Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don’t need to be re-created if the lineage is re-computed. Garbage collection may happen only after a long period of time, if the application retains references to these RDDs or if GC does not kick in frequently. This means that long-running Spark jobs may consume a large amount of disk space. The temporary storage directory is specified by the spark.local.dir configuration parameter when configuring the Spark context.

  • Shuffle behavior can be tuned by adjusting a variety of configuration parameters. See the ‘Shuffle Behavior’ section within the Spark Configuration Guide.
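
The temporary directory and the shuffle-related settings above are ordinary Spark configuration entries. A minimal sketch of setting them when the context is created (the path and values here are placeholders, not tuning advice):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-config-demo")
  .set("spark.local.dir", "/tmp/spark-scratch")   // where shuffle and spill files are written (placeholder path)
  .set("spark.shuffle.compress", "true")          // compress map output files (already the default)
  .set("spark.shuffle.spill.compress", "true")    // compress data spilled to disk during shuffles
val sc = new SparkContext(conf)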

3. The coalesce and repartition operators

Both operators return a new RDD with the requested number of partitions, so the result has to be received into a new RDD variable.
coalesce(1) reduces the number of partitions. To increase the number of partitions you have to write coalesce(4, true), where the true is shuffle = true (the default is false).
repartition is for increasing the number of partitions, but under the hood it calls coalesce(numPartitions, shuffle = true), so it always produces a shuffle (a quick sanity check follows).
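
A sketch that checks the claim above (the partition counts in the comments are what I expect, not captured output):

scala> val two = sc.parallelize(1 to 10, 2)
scala> two.coalesce(4).partitions.size        // shuffle defaults to false, so this should stay at 2
scala> two.coalesce(4, true).partitions.size  // with shuffle = true the RDD really ends up with 4 partitions
scala> two.repartition(4).partitions.size     // same effect: repartition delegates to coalesce(4, shuffle = true)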

  • From the RDD.scala source:
  /**
   * Return a new RDD that has exactly numPartitions partitions.
   *
   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
   * a shuffle to redistribute data.
   *
   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
   * which can avoid performing a shuffle.
   *
   * TODO Fix the Shuffle+Repartition data loss issue described in SPARK-23207.
   */
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }

The coalesce operation:

[hadoop@vm01 ~]$ spark-shell --master local[2]
scala> val a = sc.textFile("hdfs://192.168.137.130:9000/test.txt")
scala> a.partitions.size
res0: Int = 2

scala> val a1= a.coalesce(1)
scala> a1.collect


scala> val a1=a.coalesce(4,true)
scala> a1.partitions.size
res6: Int = 4
scala> a1.collect

The repartition operation, which produces a shuffle:

scala> a.partitions.size
res10: Int = 2

scala> val a3=a.repartition(5)
a3: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[19] at repartition at <console>:25

scala> a3.partition
partitioner   partitions

scala> a3.partitions.size
res11: Int = 5

scala> a3.collect


Does coalesce produce a shuffle when it reduces the number of partitions?

scala> val students=sc.parallelize(List("17er","laoer","benzeguo","jeff","zz","woodtree"),3) 
scala> import scala.collection.mutable.ListBuffer
scala>     students.coalesce(2).mapPartitionsWithIndex((index,partition)=>{
     |       val stus=new ListBuffer[String]
     |       while(partition.hasNext){
     |         stus += ("~~~~"+partition.next()+",哪个组:"+(index+1))
     |       }
     |       stus.iterator
     |     }).foreach(println)
~~~~17er,哪个组:1
~~~~laoer,哪个组:1
~~~~benzeguo,哪个组:2
~~~~jeff,哪个组:2
~~~~zz,哪个组:2
~~~~woodtree,哪个组:2

This example shows that coalesce does not produce a shuffle when reducing the number of partitions; in the web UI you can see there is only one stage.
After a filter, coalesce is typically used to shrink the number of partitions and avoid producing lots of small files, or even empty ones (see the sketch below).
repartition can be used to increase the parallelism of the data.
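
A sketch of that pattern, reusing the sample file from above (the filter condition and output path are made up for illustration):

scala> val lines = sc.textFile("hdfs://192.168.137.130:9000/test.txt")
scala> val kept = lines.filter(_.contains("hello"))                                  // filtering can leave many near-empty partitions
scala> kept.coalesce(1).saveAsTextFile("hdfs://192.168.137.130:9000/out_coalesced")  // shrink before writing to avoid small files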

4. reduceByKey vs. groupByKey

reduceByKey: does a partial aggregation within each partition first and shuffles afterwards, so the amount of data sent across the network is small.
groupByKey: shuffles first, sending the full detail records (a much larger volume of data), and only aggregates afterwards.
That is why reduceByKey is used far more often than groupByKey.
reduceByKey: implemented on top of combineByKey with mapSideCombine = true.
groupByKey: implemented on top of combineByKey with mapSideCombine = false (see the sketch below).
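The mapSideCombine difference can be reproduced with the public combineByKey overload that exposes the flag. This is a rough illustration, not the actual Spark source:

scala> import org.apache.spark.HashPartitioner
scala> val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)), 2)

scala> // reduceByKey-style: combine inside each partition first, then shuffle only the partial sums
scala> val likeReduce = pairs.combineByKey(
     |   (v: Int) => v, (c: Int, v: Int) => c + v, (c1: Int, c2: Int) => c1 + c2,
     |   new HashPartitioner(2), mapSideCombine = true)

scala> // groupByKey-style: ship every (key, 1) record across the network, collect them only after the shuffle
scala> val likeGroup = pairs.combineByKey(
     |   (v: Int) => List(v), (c: List[Int], v: Int) => v :: c, (c1: List[Int], c2: List[Int]) => c1 ::: c2,
     |   new HashPartitioner(2), mapSideCombine = false)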

scala> val a = sc.textFile("hdfs://192.168.137.130:9000/test.txt")
scala> a.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)

// groupByKey returns an RDD[(String, Iterable[Int])]
scala> val b =a.flatMap(_.split(" ")).map((_,1)).groupByKey()
b: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[30] at groupByKey at <console>:26

scala> val b =a.flatMap(_.split(" ")).map((_,1)).groupByKey().collect
b: Array[(String, Iterable[Int])] = Array((hive,CompactBuffer(1)), (hello,CompactBuffer(1, 1, 1, 1, 1)), (yarn,CompactBuffer(1)), (spark,CompactBuffer(1, 1)), (mr,CompactBuffer(1)))

// word count: sum the values for each key
scala> val b =a.flatMap(_.split(" ")).map((_,1)).groupByKey().map(x=>(x._1,x._2.sum)).collect
b: Array[(String, Int)] = Array((hive,1), (hello,5), (yarn,1), (spark,2), (mr,1))