Spark Shuffle

Gru杨

Category: Spark

Article link: https://blog.csdn.net/weixin_43517453/article/details/93876613


A shuffle is the process of re-distributing data across partitions:
map tasks organize the data
reduce tasks aggregate it

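Results from individual map tasks are kept in memory until they can't fit.
They are then sorted by target partition and written to a single file.
Reduce tasks read the relevant sorted blocks.
Shuffle is a complex and costly operation:
it involves copying data across executors and machines.
That means disk I/O, data serialization, and network I/O (data has to be serialized before it crosses the network).
Avoid shuffle operators whenever you can; wherever there is a shuffle, data skew can also show up.
A shuffle writes data to disk.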
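The map-side and reduce-side steps above can be sketched in memory. This is only an illustration under simplifying assumptions: real Spark spills to disk and moves blocks over the network, while this sketch keeps everything in local collections. `HashPartitioner` is Spark's own class; the names `mapOutput`, `sorted`, `blocks`, and `reduced` are made up for the example.

```scala
import org.apache.spark.HashPartitioner

// Two reduce-side partitions; a key goes to hashCode mod 2.
val partitioner = new HashPartitioner(2)

// Pretend this is the output of one map task.
val mapOutput = Seq("a" -> 1, "b" -> 1, "a" -> 1, "c" -> 1)

// Map side: order the records by their target partition.
val sorted = mapOutput.sortBy { case (k, _) => partitioner.getPartition(k) }

// Reduce side: each reduce task reads only the block for its own partition...
val blocks = sorted.groupBy { case (k, _) => partitioner.getPartition(k) }

// ...and aggregates the values per key.
val reduced = blocks.map { case (p, recs) =>
  p -> recs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
}

reduced.toSeq.sortBy(_._1).foreach(println)
```

Every record with the same key ends up in the same block, which is why the per-key aggregation on the reduce side is complete.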
Which operators trigger a shuffle:
repartition operations: repartition, coalesce
ByKey operations: groupByKey, reduceByKey
join operations: cogroup, join
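One way to check this list yourself is to walk an RDD's lineage and look for a `ShuffleDependency`. The sketch below assumes a local master and a small in-memory data set; `hasShuffle` is a hypothetical helper written for this example, not a Spark API.

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Reuses the running context (e.g. in spark-shell) or starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("shuffle-check"))

// Hypothetical helper: true if any dependency in the lineage is a shuffle.
def hasShuffle(rdd: RDD[_]): Boolean =
  rdd.dependencies.exists { dep =>
    dep.isInstanceOf[ShuffleDependency[_, _, _]] || hasShuffle(dep.rdd)
  }

val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3), 2)

val checks = Map(
  "mapValues"   -> hasShuffle(pairs.mapValues(_ + 1)),
  "coalesce"    -> hasShuffle(pairs.coalesce(1)),
  "repartition" -> hasShuffle(pairs.repartition(4)),
  "groupByKey"  -> hasShuffle(pairs.groupByKey()),
  "reduceByKey" -> hasShuffle(pairs.reduceByKey(_ + _)),
  "join"        -> hasShuffle(pairs.join(pairs))
)

checks.foreach { case (op, shuffles) => println(s"$op shuffles: $shuffles") }
```

Note that `coalesce` with its default arguments reports no shuffle here; it only shuffles when asked to.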

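repartition:
coalesce (a narrow dependency by default, so no shuffle happens):
By default it reduces the RDD's partition count to the given number; it cannot increase it.
Going from more partitions to fewer does not require shuffling the data.
To increase the partition count, set the second parameter (shuffle) to true.
Typical use: merging small files (no shuffle, many partitions down to few).
data.partitions.length // 2
val data1 = data.coalesce(1) // reduces the partition count to 1
val data2 = data.coalesce(4) // does NOT increase the partition count to 4; it stays at 2
val data3 = data.coalesce(4, true) // shuffle = true: increases the partition count to 4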

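Under the hood, repartition calls coalesce(num, true).
It can increase or decrease the partition count, and it always shuffles.
repartition is used to raise parallelism and to deal with data skew.

val info4 = info.repartition(5)
info4.partitions.length // 5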
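To see what repartition does to unevenly filled partitions, you can compare per-partition record counts before and after. This is a minimal sketch assuming a local master; the data set and the names `skewed`, `before`, `balanced`, and `after` are made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the running context (e.g. in spark-shell) or starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("repartition-demo"))

// After the filter, most surviving records sit in the last of the 4 partitions.
val skewed = sc.parallelize(1 to 100, 4).filter(_ > 70)
val before = skewed.glom().map(_.length).collect()

// repartition shuffles the records and spreads them over 5 partitions.
val balanced = skewed.repartition(5)
val after = balanced.glom().map(_.length).collect()

println(s"before: ${before.mkString(", ")}")
println(s"after:  ${after.mkString(", ")}")
```

The total number of records is unchanged; only their placement differs, and that placement is paid for with a shuffle.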

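import scala.collection.mutable.ListBuffer

val students = sc.parallelize(List("a", "b", "c", "d", "e", "f"), 3)
// Tag every element with the (1-based) index of the partition it lives in
students.mapPartitionsWithIndex((index, partition) => {
  val stus = new ListBuffer[String]
  while (partition.hasNext) {
    stus += ("—" + partition.next() + ", group: " + (index + 1))
  }
  stus.iterator
}).foreach(println)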

ByKey:
reduceByKey:
sc.textFile("").flatMap(_.split("\t")).map((_, 1)).reduceByKey(_ + _).collect
groupByKey:
sc.textFile("").flatMap(_.split("\t")).map((_, 1)).groupByKey().map(x => (x._1, x._2.sum)).collect

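reduceByKey is not only simpler, it also shuffles less data than groupByKey. The reason is that reduceByKey first aggregates locally on the map side (a combiner) and only the combined results go through the shuffle, so less data is moved.
So prefer reduceByKey in day-to-day work.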
```
 reduceByKey: data goes through the combiner first, so only locally aggregated data is shuffled
 groupByKey: the full data set is shuffled
```
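The difference can be observed on the shuffle dependency itself, which carries a `mapSideCombine` flag. The sketch below assumes a local master and a tiny in-memory word list instead of a text file; `mapSideCombine` as a function is a hypothetical helper written for this example.

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Reuses the running context (e.g. in spark-shell) or starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("bykey-demo"))

// Hypothetical helper: the mapSideCombine flag of the RDD's direct shuffle dependency, if any.
def mapSideCombine(rdd: RDD[_]): Option[Boolean] =
  rdd.dependencies.collectFirst {
    case s: ShuffleDependency[_, _, _] => s.mapSideCombine
  }

val words = sc.parallelize(Seq("a", "b", "a", "c", "a"), 2).map((_, 1))

val reduced = words.reduceByKey(_ + _) // combines on the map side before shuffling
val grouped = words.groupByKey()       // shuffles every (word, 1) record

println(s"reduceByKey mapSideCombine: ${mapSideCombine(reduced)}")
println(s"groupByKey  mapSideCombine: ${mapSideCombine(grouped)}")

// Both routes give the same word counts.
val viaReduce = reduced.collect().toMap
val viaGroup = grouped.mapValues(_.sum).collect().toMap
```

Both versions return the same counts; what changes is how many records have to cross the shuffle to get there.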
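What impact does a shuffle have:
Shuffle is the performance killer of big-data computation and is usually where the bottleneck lies.
Wherever there is a wide dependency, a shuffle takes place.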

shuffle1：SparkCore 5.11
