Estimating the JVM heap size an RDD needs
import org.apache.spark.util.SizeEstimator

val result = sc.textFile("")
  .flatMap(_.split("\t"))
  .map((_, 1))
  .reduceByKey(_ + _)
SizeEstimator.estimate(result)
Operators
-
zip: pairs up two RDDs element-wise, like a zipper. The partition counts must match (otherwise: Can't zip RDDs with unequal numbers of partitions: List(4, 2)), and so must the number of elements per partition (otherwise: Can only zip RDDs with same number of elements in each partition)
val rddzip1 = sc.parallelize(List("fei","jim","jack"))
val rddzip2 = sc.parallelize(List(30,18,20))
val rddzip3 = rddzip1.zip(rddzip2)
/**
 * zipWithIndex additionally pairs each element with its index
 */
val rddzip4 = rddzip1.zip(rddzip2).zipWithIndex()
rddzip3.foreach(println(_))
println("-----------------------------")
rddzip4.foreach(println(_))

(fei,30)
(jim,18)
(jack,20)
-----------------------------
((fei,30),0)
((jim,18),1)
((jack,20),2)
-
union: concatenates the two RDDs without deduplicating; the partition counts add up
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,6),3)
scala> val rdd2 = sc.parallelize(List(3,4,5,6,7,8,8),2)
scala> val rdd3 = rdd1.union(rdd2)
scala> rdd3.collect
res1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 3, 4, 5, 6, 7, 8, 8)
// partition counts add up
scala> rdd3.partitions.length
res2: Int = 5
-
intersection: the intersection, i.e. returns the elements present in both RDDs; the result keeps rdd1's partition count
scala> val rdd4 = rdd1.intersection(rdd2)
scala> rdd4.collect
res3: Array[Int] = Array(6, 3, 4, 5)
// result keeps rdd1's partition count
scala> rdd4.partitions.length
res5: Int = 3
-
subtract: the difference, i.e. returns the elements in rdd1 that are not in rdd2; the result keeps rdd1's partition count
scala> val rdd5 = rdd1.subtract(rdd2)
scala> rdd5.collect
res6: Array[Int] = Array(1, 2)
scala> rdd5.partitions.length
res7: Int = 3
-
cartesian: the Cartesian product, i.e. returns every combination of an element from rdd1 with an element from rdd2; the partition counts multiply (3 * 2 = 6)
scala> val rdd6 = rdd1.cartesian(rdd2)
scala> rdd6.collect
res8: Array[(Int, Int)] = Array((1,3), (1,4), (1,5), (2,3), (2,4), (2,5), (1,6), (1,7), (1,8), (1,8), (2,6), (2,7), (2,8), (2,8), (3,3), (3,4), (3,5), (4,3), (4,4), (4,5), (3,6), (3,7), (3,8), (3,8), (4,6), (4,7), (4,8), (4,8), (5,3), (5,4), (5,5), (6,3), (6,4), (6,5), (5,6), (5,7), (5,8), (5,8), (6,6), (6,7), (6,8), (6,8))
scala> rdd6.partitions.length
res9: Int = 6
-
distinct: deduplicates; it can take a target partition count
scala> rdd2.distinct.collect
res12: Array[Int] = Array(4, 6, 8, 3, 7, 5)
scala> rdd3.collect
res1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 3, 4, 5, 6, 7, 8, 8)
scala> rdd3.distinct(4).mapPartitionsWithIndex((index,partition)=>{
     | partition.map(x=>"partition:"+index + "___" + "element:" + x)
     | }).collect
res15: Array[String] = Array(partition:0___element:4, partition:0___element:8, partition:1___element:1, partition:1___element:5, partition:2___element:6, partition:2___element:2, partition:3___element:3, partition:3___element:7)
Summary: partitions generally get an even share of the data, with any extras going to the last partition
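The rule behind where the "extra" elements land can be sketched in pure Scala. This is a minimal re-implementation of the slicing formula used by Spark's ParallelCollectionRDD for sc.parallelize (slice boundaries at i * len / n), not Spark's actual code:

```scala
// For partition i of n over a collection of length len, Spark slices
// at positions i * len / n; larger slices therefore cluster at the end.
def sliceSizes(len: Int, numSlices: Int): Seq[Int] =
  (0 until numSlices).map { i =>
    val start = (i * len) / numSlices
    val end   = ((i + 1) * len) / numSlices
    end - start
  }

println(sliceSizes(13, 5)) // Vector(2, 3, 2, 3, 3)
println(sliceSizes(10, 3)) // Vector(3, 3, 4)
```

sliceSizes(10, 3) matches the partitioning of sc.parallelize(1 to 10, 3) used later in these notes: (1,2,3), (4,5,6), (7,8,9,10).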
-
collect
Internally, collect calls runJob, which is what actually submits the job to Spark.
Key operators
-
coalesce: changes an RDD's partition count. When the target is smaller than the current count it can merge small files without a shuffle (a narrow dependency); when the target is larger, the second parameter must be set to shuffle = true (otherwise the call has no effect), so a shuffle is required
scala> val rdd7 = rdd3.coalesce(1)
scala> rdd7.partitions.length
res17: Int = 1
scala> rdd3.partitions.length
res18: Int = 5
scala> val rdd8 = rdd3.coalesce(6,true)
rdd8: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[45] at coalesce at <console>:25
scala> rdd8.partitions.length
res20: Int = 6
-
repartition: repartitions with a shuffle; under the hood it simply calls coalesce(numPartitions, shuffle = true)
scala> val rdd9 = rdd3.repartition(3)
scala> rdd9.partitions.length
res21: Int = 3
scala> val rdd10 = rdd3.repartition(6)
scala> rdd10.partitions.length
res22: Int = 6
Sorting: sortBy and sortByKey; sortBy is implemented on top of sortByKey
scala> val rdd = sc.parallelize(List(("CK85",30),("LUCKY",18),("AK47",60)))
// ascending by default; pass false as the second argument for descending
scala> rdd.sortBy(_._2).collect
res5: Array[(String, Int)] = Array((LUCKY,18), (CK85,30), (AK47,60))
// ascending by default; for descending, call rdd.sortByKey(false)
scala> rdd.sortByKey().collect
res6: Array[(String, Int)] = Array((AK47,60), (CK85,30), (LUCKY,18))
-
reduceByKey(func, numPartitions=None)和groupByKey(numPartitions=None)
-
reduceByKey merges the values of each key; crucially, it can merge locally first (a map-side pre-aggregation, i.e. a combiner), and the merge operation is a user-defined function
-
groupByKey also operates per key, but only produces a sequence; it takes no user function, so to apply custom logic you first build the grouped RDD with groupByKey and then run your function over it with map.
-
groupByKey does no combining before the shuffle; the data is shuffled as-is. reduceByKey combines before shuffling (so groupByKey's shuffle volume is noticeably larger than reduceByKey's), which cuts shuffle I/O and makes it more efficient
Demo data:
1.txt
hello1 tom1
hello2 tom2
hello3 tom3
hello4 tom4
2.txt
hello tom
hello jack
hello kitty
hello jerry
3.txt
hello3
scala> val rdd1 = sc.textFile("file:///home/hadoop/data/wc").flatMap(_.split(" ")).map((_,1))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:24
scala> rdd1.reduceByKey(_+_).collect
scala> rdd1.groupByKey().mapValues(_.sum).collect
Checking the Spark UI from spark-shell:
reduceByKey: shuffle read is 372 B / 14 records = 8 + 4 + 2 (4 for hello, 2 for hello3)
groupByKey: shuffle read is 374 B / 17 records = 8 + 8 + 1, which is 2 B / 3 records more than reduceByKey, because under groupByKey 2.txt gets no local merge first, so its shuffle read carries 3 extra records (8 - 5)
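The record counts above can be reproduced without Spark. This pure-Scala sketch assumes one input split per file: a map-side combiner collapses each split's (word, 1) pairs to one record per distinct word, while groupByKey ships every pair as-is.

```scala
// Count how many records each input split contributes to the shuffle,
// with and without a map-side combiner. File contents are the demo data.
val file1 = Seq("hello1 tom1", "hello2 tom2", "hello3 tom3", "hello4 tom4")
val file2 = Seq("hello tom", "hello jack", "hello kitty", "hello jerry")
val file3 = Seq("hello3")

def words(lines: Seq[String]): Seq[String] = lines.flatMap(_.split(" "))

// groupByKey: every (word, 1) pair is shuffled unchanged
def shuffledWithoutCombiner(lines: Seq[String]): Int = words(lines).size

// reduceByKey: the combiner collapses duplicate words within the split
def shuffledWithCombiner(lines: Seq[String]): Int = words(lines).distinct.size

val files = Seq(file1, file2, file3)
println(files.map(shuffledWithCombiner).sum)    // 14 (= 8 + 5 + 1)
println(files.map(shuffledWithoutCombiner).sum) // 17 (= 8 + 8 + 1)
```

Only 2.txt differs between the two runs (5 vs 8 records), which is exactly the 3-record gap observed in the Spark UI.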
- join: under the hood, join calls cogroup and then flatMapValues
scala> val joinrdd1 = sc.parallelize(List(("fei","bj"),("jim","sz")))
scala> val joinrdd2 = sc.parallelize(List(("fei",30),("jack",18),("jim",17)))
scala> joinrdd1.join(joinrdd2).collect
res2: Array[(String, (String, Int))] = Array((jim,(sz,17)), (fei,(bj,30)))
scala> joinrdd1.leftOuterJoin(joinrdd2).collect
res3: Array[(String, (String, Option[Int]))] = Array((jim,(sz,Some(17))), (fei,(bj,Some(30))))
scala> joinrdd1.rightOuterJoin(joinrdd2).collect
res4: Array[(String, (Option[String], Int))] = Array((jim,(Some(sz),17)), (jack,(None,18)), (fei,(Some(bj),30)))
scala> joinrdd1.fullOuterJoin(joinrdd2).collect
res6: Array[(String, (Option[String], Option[Int]))] = Array((jim,(Some(sz),Some(17))), (jack,(None,Some(18))), (fei,(Some(bj),Some(30))))
scala> joinrdd1.cogroup(joinrdd2).collect
res7: Array[(String, (Iterable[String], Iterable[Int]))] = Array((jim,(CompactBuffer(sz),CompactBuffer(17))), (jack,(CompactBuffer(),CompactBuffer(18))), (fei,(CompactBuffer(bj),CompactBuffer(30))))
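The claim that join = cogroup + flatMapValues can be sketched with plain Scala collections; this is an illustrative emulation, not Spark's code, and the grouping step stands in for cogroup:

```scala
// Sketch of join as cogroup followed by a flatMap over the values.
val left  = Seq(("fei", "bj"), ("jim", "sz"))
val right = Seq(("fei", 30), ("jack", 18), ("jim", 17))

// "cogroup": for every key, gather the values from both sides
val cogrouped: Map[String, (Seq[String], Seq[Int])] =
  (left.map(_._1) ++ right.map(_._1)).distinct.map { k =>
    k -> (left.collect { case (`k`, v) => v },
          right.collect { case (`k`, v) => v })
  }.toMap

// "flatMapValues": the per-key cross product; keys present on only one
// side yield nothing, which is exactly an inner join
val joined = cogrouped.toSeq.flatMap { case (k, (ls, rs)) =>
  for (l <- ls; r <- rs) yield (k, (l, r))
}

joined.sortBy(_._1).foreach(println)
// (fei,(bj,30))
// (jim,(sz,17))
```

Replacing the cross product with Option-wrapped values gives the outer-join variants shown above.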
- Implementing deduplication without distinct
The underlying implementation of distinct is:
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
}
scala> val rdd1 = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 7, 8, 8, 8, 9))
scala> rdd1.distinct().sortBy(-_).collect
res11: Array[Int] = Array(9, 8, 7, 6, 5, 4, 3, 2, 1)
scala> val rdd2 = rdd1.map(x => (x, null)).reduceByKey((x,y)=>x).map(_._1).sortBy(-_)
scala> rdd2.collect
res12: Array[Int] = Array(9, 8, 7, 6, 5, 4, 3, 2, 1)
- aggregate
aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U)
val rdd1 = sc.parallelize(1 to 10,3)
/**
 * partition 0: 1,2,3
 * partition 1: 4,5,6
 * partition 2: 7,8,9,10
 */
def func1(a:Int,b:Int):Int = a * b
def func2(a:Int,b:Int):Int = a + b
// partition 0: 3 * (1*2*3) = 18
// partition 1: 3 * (4*5*6) = 360
// partition 2: 3 * (7*8*9*10) = 15120
// 18 + 360 + 15120 = 15498
// global merge: 15498 + 3 = 15501
rdd1.aggregate(3)(func1,func2)
scala> val rdd2 = sc.parallelize(List(List(1,3),List(2,4),List(3,5)),3)
scala> def func3(a:Int,b:List[Int]):Int = {
| a.max(b.max)
| }
func3: (a: Int, b: List[Int])Int
scala> def func4(a:Int,b:Int):Int = a + b
func4: (a: Int, b: Int)Int
scala> rdd2.aggregate(3)(func3,func4)
res0: Int = 15
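The zero value's double role above, seeding every partition and then the final merge, can be checked with a small pure-Scala emulation of aggregate (a sketch with the partitioning hard-coded, not Spark's implementation):

```scala
// Emulate RDD.aggregate: seqOp folds each partition starting from the
// zero value, then combOp folds the partition results, again starting
// from the zero value, so the zero is applied numPartitions + 1 times.
def emulateAggregate[T, U](partitions: Seq[Seq[T]], zero: U)
                          (seqOp: (U, T) => U, combOp: (U, U) => U): U =
  partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)

val parts = Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9, 10))
println(emulateAggregate(parts, 3)(_ * _, _ + _)) // 3 + 18 + 360 + 15120 = 15501
```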
- aggregateByKey
aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
combOp: (U, U) => U). The zero value participates only in the per-partition computation, not in the global merge; that is the difference from aggregate above.
scala> val rdd3 = sc.parallelize(List(("a",3),("a",2),("c",4),("b",3),("c",6),("c",8)),2)
scala> rdd3.aggregateByKey(10)(math.max(_,_),_+_)
scala> res1.collect
res2: Array[(String, Int)] = Array((b,10), (a,10), (c,20))
====================>
partition 0
("a",3),("a",2),("c",4)
a:
max(10, max(3,2)) ==> 10
c:
max(10, 4) ==> 10
-----------------------------------------------
partition 1
("b",3),("c",6),("c",8)
b:
max(10, 3) ==> 10
c:
max(10, max(6,8)) ==> 10
===============================================
scala> rdd3.aggregateByKey(5)(math.max(_,_),_+_)
scala> res3.collect
res5: Array[(String, Int)] = Array((b,5), (a,5), (c,13))
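That the zero value is applied once per key per partition, and never in the global merge, can be checked with a pure-Scala emulation (a sketch with the partitioning hard-coded, not Spark's implementation):

```scala
// Emulate RDD.aggregateByKey: within each partition, each key's values
// are folded starting from the zero value; partition results are then
// merged with combOp, with no zero value involved.
def emulateAggregateByKey[K, V, U](partitions: Seq[Seq[(K, V)]], zero: U)
                                  (seqOp: (U, V) => U, combOp: (U, U) => U): Map[K, U] = {
  val perPartition: Seq[Map[K, U]] = partitions.map { part =>
    part.groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).foldLeft(zero)(seqOp)
    }
  }
  perPartition.flatten.groupBy(_._1).map { case (k, kus) =>
    k -> kus.map(_._2).reduce(combOp) // reduce, not fold: no zero here
  }
}

val kvParts = Seq(Seq(("a", 3), ("a", 2), ("c", 4)),
                  Seq(("b", 3), ("c", 6), ("c", 8)))
emulateAggregateByKey(kvParts, 10)(math.max, _ + _).toSeq.sortBy(_._1).foreach(println)
// (a,10)  (b,10)  (c,20)
emulateAggregateByKey(kvParts, 5)(math.max, _ + _).toSeq.sortBy(_._1).foreach(println)
// (a,5)   (b,5)   (c,13)
```

With zero = 10, c caps both partitions at 10 and sums to 20; with zero = 5, partition 1's max(5, 6, 8) = 8 survives, giving 5 + 8 = 13.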
On whether a join shuffles
Narrow vs. wide dependencies in a shuffle
Narrow dependency
- Each partition of a parent RDD is used at most once, by a single partition of the child RDD
- One parent partition maps one-to-one to one child partition; the typical example is map
- Partitions of several parent RDDs map one-to-one to child partitions; the typical example is union
Wide dependency - a partition of the parent RDD is used by multiple child partitions
Among narrow dependencies there is a special join that does not go through a shuffle.
This shuffle-free join requires three conditions:
- RDD1's partition count = RDD2's partition count
- RDD1's partition count = the join's partition count
- RDD2's partition count = the join's partition count
/**
 * rdd1, rdd2, and the join all have the same partition count, so there is no shuffle
*/
val rdd1 = sc.parallelize(List(("香蕉",20), ("苹果",50), ("菠萝",30), ("猕猴桃", 50)),2)
val rdd2 = sc.parallelize(List(("草莓",90), ("苹果",25), ("菠萝",25), ("猕猴桃", 30), ("西瓜", 45)),2)
val rdd3 = rdd1.reduceByKey(_ + _)
val rdd4 = rdd2.reduceByKey(_ + _)
val joinRDD = rdd3.join(rdd4,2)
joinRDD.collect()
In the application's DAG, the path from the two reduceByKey steps to the join stays within a single stage, showing that no shuffle occurred.
Apart from the case above where all three conditions hold, any other join is a wide dependency; for example, specifying 3 partitions for the join turns it into one:
val joinRDD = rdd3.join(rdd4,3)
ByKey
Operators such as reduceByKey and groupByKey are all implemented on top of combineByKeyWithClassTag
createCombiner: V => C, builds the initial accumulator for a key (and fixes the accumulator type) from its first value
mergeValue: (C, V) => C, per-partition aggregation
mergeCombiners: (C, C) => C, global (cross-partition) merge
To understand combineByKeyWithClassTag, look at combineByKey: it simply forwards the same arguments.
Implementing reduceByKey-style summation
scala> val rdd2= sc.parallelize(List((1,3),(1,4),(1,2),(2,3),(3,6),(3,8)),3)
scala> rdd2.reduceByKey(_+_).collect
res1: Array[(Int, Int)] = Array((3,14), (1,9), (2,3))
// the same result with combineByKey
scala> rdd2.combineByKey(
| x=>x,
| (a:Int,b:Int)=> a+b,
| (x:Int,y:Int)=> x+y
| ).collect
res2: Array[(Int, Int)] = Array((3,14), (1,9), (2,3))
Computing per-key averages
scala> val rdd3= sc.parallelize(List(("a",88),("b",95),("a",91),("b",93),("a",95),("b",98)),2)
rdd3: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[4] at parallelize at <console>:24
scala> rdd3.combineByKey(
| (_,1),
| (a:(Int,Int),b)=>(a._1+b,a._2+1),
| (x:(Int,Int),y:(Int,Int))=>(x._1+y._1,x._2+y._2)
| ).map{
| case (k,v) => (k,v._1/v._2.toDouble)
| }.collect
res3: Array[(String, Double)] = Array((b,95.33333333333333), (a,91.33333333333333))
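The three combineByKey functions can be exercised without Spark. This sketch emulates the (sum, count) accumulator from the average example above (illustrative only; partitioning is hard-coded to match the 2-partition RDD):

```scala
// Emulate combineByKey: createCombiner seeds an accumulator from a key's
// first value in a partition, mergeValue folds the rest of that
// partition's values, and mergeCombiners merges the per-partition results.
def emulateCombineByKey[K, V, C](partitions: Seq[Seq[(K, V)]])
                                (createCombiner: V => C,
                                 mergeValue: (C, V) => C,
                                 mergeCombiners: (C, C) => C): Map[K, C] =
  partitions
    .map(_.groupBy(_._1).map { case (k, kvs) =>
      val vs = kvs.map(_._2)
      k -> vs.tail.foldLeft(createCombiner(vs.head))(mergeValue)
    })
    .flatten.groupBy(_._1)
    .map { case (k, kcs) => k -> kcs.map(_._2).reduce(mergeCombiners) }

val scoreParts = Seq(Seq(("a", 88), ("b", 95), ("a", 91)),
                     Seq(("b", 93), ("a", 95), ("b", 98)))
val avgs = emulateCombineByKey(scoreParts)(
  v => (v, 1),                                              // createCombiner
  (c: (Int, Int), v: Int) => (c._1 + v, c._2 + 1),          // mergeValue
  (x: (Int, Int), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2) // mergeCombiners
).map { case (k, (sum, n)) => k -> sum.toDouble / n }

avgs.toSeq.sortBy(_._1).foreach(println)
// (a,91.33333333333333)
// (b,95.33333333333333)
```

The results match the spark-shell output above: a = (88 + 91 + 95) / 3 and b = (95 + 93 + 98) / 3.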