Estimating the JVM heap size an RDD needs
import org.apache.spark.util.SizeEstimator

val result = sc.textFile("")
  .flatMap(_.split("\t"))
  .map((_, 1))
  .reduceByKey(_ + _)
SizeEstimator.estimate(result)
Operators
-
zip: pairs up two RDDs element-wise, like a zipper. The partition counts must match (otherwise: Can't zip RDDs with unequal numbers of partitions: List(4, 2)), and so must the number of elements per partition (otherwise: Can only zip RDDs with same number of elements in each partition)
val rddzip1 = sc.parallelize(List("fei","jim","jack"))
val rddzip2 = sc.parallelize(List(30,18,20))
val rddzip3 = rddzip1.zip(rddzip2)
/**
 * zipWithIndex additionally pairs each element with its index
 */
val rddzip4 = rddzip1.zip(rddzip2).zipWithIndex()
rddzip3.foreach(println(_))
println("-----------------------------")
rddzip4.foreach(println(_))

(fei,30)
(jim,18)
(jack,20)
-----------------------------
((fei,30),0)
((jim,18),1)
((jack,20),2)
-
union: concatenates the two RDDs without deduplicating; the partition counts add up
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,6),3)
scala> val rdd2 = sc.parallelize(List(3,4,5,6,7,8,8),2)
scala> val rdd3 = rdd1.union(rdd2)
scala> rdd3.collect
res1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 3, 4, 5, 6, 7, 8, 8)
// partition counts add up
scala> rdd3.partitions.length
res2: Int = 5
-
intersection: the intersection, i.e. returns the elements present in both RDDs; the result keeps rdd1's partition count
scala> val rdd4 = rdd1.intersection(rdd2)
scala> rdd4.collect
res3: Array[Int] = Array(6, 3, 4, 5)
// result keeps rdd1's partition count
scala> rdd4.partitions.length
res5: Int = 3
-
subtract: the difference, i.e. returns the elements in rdd1 that are not in rdd2; the result keeps rdd1's partition count
scala> val rdd5 = rdd1.subtract(rdd2)
scala> rdd5.collect
res6: Array[Int] = Array(1, 2)
scala> rdd5.partitions.length
res7: Int = 3
-
cartesian: the Cartesian product, i.e. returns every combination of an element from rdd1 with an element from rdd2; the partition counts multiply (3 * 2 = 6)
scala> val rdd6 = rdd1.cartesian(rdd2)
scala> rdd6.collect
res8: Array[(Int, Int)] = Array((1,3), (1,4), (1,5), (2,3), (2,4), (2,5), (1,6), (1,7), (1,8), (1,8), (2,6), (2,7), (2,8), (2,8), (3,3), (3,4), (3,5), (4,3), (4,4), (4,5), (3,6), (3,7), (3,8), (3,8), (4,6), (4,7), (4,8), (4,8), (5,3), (5,4), (5,5), (6,3), (6,4), (6,5), (5,6), (5,7), (5,8), (5,8), (6,6), (6,7), (6,8), (6,8))
scala> rdd6.partitions.length
res9: Int = 6
-
distinct: deduplicates; it can take a target partition count
scala> rdd2.distinct.collect
res12: Array[Int] = Array(4, 6, 8, 3, 7, 5)
scala> rdd3.collect
res1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 3, 4, 5, 6, 7, 8, 8)
scala> rdd3.distinct(4).mapPartitionsWithIndex((index,partition)=>{
     | partition.map(x=>"partition:"+index + "___" + "element:" + x)
     | }).collect
res15: Array[String] = Array(partition:0___element:4, partition:0___element:8, partition:1___element:1, partition:1___element:5, partition:2___element:6, partition:2___element:2, partition:3___element:3, partition:3___element:7)
Summary: partitions generally get an even share of the data, with any extras going to the last partition
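The rule behind where the "extra" elements land can be sketched in pure Scala. This is a minimal re-implementation of the slicing formula used by Spark's ParallelCollectionRDD for sc.parallelize (slice boundaries at i * len / n), not Spark's actual code:

```scala
// For partition i of n over a collection of length len, Spark slices
// at positions i * len / n; larger slices therefore cluster at the end.
def sliceSizes(len: Int, numSlices: Int): Seq[Int] =
  (0 until numSlices).map { i =>
    val start = (i * len) / numSlices
    val end   = ((i + 1) * len) / numSlices
    end - start
  }

println(sliceSizes(13, 5)) // Vector(2, 3, 2, 3, 3)
println(sliceSizes(10, 3)) // Vector(3, 3, 4)
```

sliceSizes(10, 3) matches the partitioning of sc.parallelize(1 to 10, 3) used later in these notes: (1,2,3), (4,5,6), (7,8,9,10).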
-
collect
Internally, collect calls runJob, which is what actually submits the job to Spark.
Key operators
-
coalesce: changes an RDD's partition count. When the target is smaller than the current count it can merge small files without a shuffle (a narrow dependency); when the target is larger, the second parameter must be set to shuffle = true (otherwise the call has no effect), so a shuffle is required
scala> val rdd7 = rdd3.coalesce(1)
scala> rdd7.partitions.length
res17: Int = 1
scala> rdd3.partitions.length
res18: Int = 5
scala> val rdd8 = rdd3.coalesce(6,true)
rdd8: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[45] at coalesce at <console>:25
scala> rdd8.partitions.length
res20: Int = 6
-
repartition: repartitions with a shuffle; under the hood it simply calls coalesce(numPartitions, shuffle = true)
scala> val rdd9 = rdd3.repartition(3)
scala> rdd9.partitions.length
res21: Int = 3
scala> val rdd10 = rdd3.repartition(6)
scala> rdd10.partitions.length
res22: Int = 6
Sorting: sortBy and sortByKey; sortBy is implemented on top of sortByKey
scala> val rdd = sc.parallelize(List(("CK85",30),("LUCKY",18),("AK47",60)))
// ascending by default; pass false as the second argument for descending
scala> rdd.sortBy(_._2).collect
res5: Array[(String, Int)] = Array((LUCKY,18), (CK85,30), (AK47,60))
// ascending by default; for descending, call rdd.sortByKey(false)
scala> rdd.sortByKey().collect
res6: Array[(String, Int)] = Array((AK47,60), (CK85,30), (LUCKY,18))
-
reduceByKey(func, numPartitions=None)和groupByKey(numPartitions=None)
-
reduceByKey merges the values of each key; crucially, it can merge locally first (a map-side pre-aggregation, i.e. a combiner), and the merge operation is a user-defined function
-
groupByKey also operates per key, but only produces a sequence; it takes no user function, so to apply custom logic you first build the grouped RDD with groupByKey and then run your function over it with map.
-
groupByKey does no combining before the shuffle; the data is shuffled as-is. reduceByKey combines before shuffling (so groupByKey's shuffle volume is noticeably larger than reduceByKey's), which cuts shuffle I/O and makes it more efficient
Demo data:
1.txt
hello1 tom1
hello2 tom2
hello3 tom3
hello4 tom4
2.txt
hello tom
hello jack
hello kitty
hello jerry
3.txt
hello3
scala> val rdd1 = sc.textFile("file:///home/hadoop/data/wc").flatMap(_.split(" ")).map((_,1))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:24
scala> rdd1.reduceByKey(_+_).collect
scala> rdd1.groupByKey().mapValues(_.sum).collect
Checking the Spark UI from spark-shell:
reduceByKey: shuffle read is 372 B / 14 records = 8 + 4 + 2 (4 for hello, 2 for hello3)
groupByKey: shuffle read is 374 B / 17 records = 8 + 8 + 1, which is 2 B / 3 records more than reduceByKey, because under groupByKey 2.txt gets no local merge first, so its shuffle read carries 3 extra records (8 - 5)
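The record counts above can be reproduced without Spark. This pure-Scala sketch assumes one input split per file: a map-side combiner collapses each split's (word, 1) pairs to one record per distinct word, while groupByKey ships every pair as-is.

```scala
// Count how many records each input split contributes to the shuffle,
// with and without a map-side combiner. File contents are the demo data.
val file1 = Seq("hello1 tom1", "hello2 tom2", "hello3 tom3", "hello4 tom4")
val file2 = Seq("hello tom", "hello jack", "hello kitty", "hello jerry")
val file3 = Seq("hello3")

def words(lines: Seq[String]): Seq[String] = lines.flatMap(_.split(" "))

// groupByKey: every (word, 1) pair is shuffled unchanged
def shuffledWithoutCombiner(lines: Seq[String]): Int = words(lines).size

// reduceByKey: the combiner collapses duplicate words within the split
def shuffledWithCombiner(lines: Seq[String]): Int = words(lines).distinct.size

val files = Seq(file1, file2, file3)
println(files.map(shuffledWithCombiner).sum)    // 14 (= 8 + 5 + 1)
println(files.map(shuffledWithoutCombiner).sum) // 17 (= 8 + 8 + 1)
```

Only 2.txt differs between the two runs (5 vs 8 records), which is exactly the 3-record gap observed in the Spark UI.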
- join: under the hood, join calls cogroup and then flatMapValues
scala> val joinrdd1 = sc.parallelize(List(("fei","bj"),("jim","sz")))
scala> val joinrdd2 = sc.parallelize(List(("fei",30),("jack",18),("jim",17)))
scala> joinrdd1.join(joinrdd2).collect
res2: Array[(String, (String, Int))] = Array((jim,(sz,17)), (fei,(bj,30)))
scala> joinrdd1.leftOuterJoin(joinrdd2).collect
res3: Array[(String, (String, Option[Int]))] = Array((jim,(sz,Some(17))), (fei,(bj,Some(30))))
scala> joinrdd1.rightOuterJoin(joinrdd2).collect
res4: Array[(String, (Option[String], Int))] = Array((jim,(Some(sz),17)), (jack,(None,18)), (fei,(Some(bj),30)))
scala> joinrdd1.fullOuterJoin(joinrdd2).collect
res6: Array[(String, (Option[String], Option[Int]))] = Array((jim,(Some(sz),Some(17))), (jack,(None,Some(18))), (fei,(Some(bj),Some(30))))
scala> joinrdd1.cogroup(joinrdd2).collect
res7: Array[(String, (Iterable[String], Iterable[Int]))] = Array((jim,(CompactBuffer(sz),CompactBuffer(17))), (jack,(CompactBuffer(),CompactBuffer(18))), (fei,(CompactBuffer(bj),CompactBuffer(30))))
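The claim that join = cogroup + flatMapValues can be sketched with plain Scala collections; this is an illustrative emulation, not Spark's code, and the grouping step stands in for cogroup:

```scala
// Sketch of join as cogroup followed by a flatMap over the values.
val left  = Seq(("fei", "bj"), ("jim", "sz"))
val right = Seq(("fei", 30), ("jack", 18), ("jim", 17))

// "cogroup": for every key, gather the values from both sides
val cogrouped: Map[String, (Seq[String], Seq[Int])] =
  (left.map(_._1) ++ right.map(_._1)).distinct.map { k =>
    k -> (left.collect { case (`k`, v) => v },
          right.collect { case (`k`, v) => v })
  }.toMap

// "flatMapValues": the per-key cross product; keys present on only one
// side yield nothing, which is exactly an inner join
val joined = cogrouped.toSeq.flatMap { case (k, (ls, rs)) =>
  for (l <- ls; r <- rs) yield (k, (l, r))
}

joined.sortBy(_._1).foreach(println)
// (fei,(bj,30))
// (jim,(sz,17))
```

Replacing the cross product with Option-wrapped values gives the outer-join variants shown above.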
- Implementing deduplication without distinct
The underlying implementation of distinct is:
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
}
scala> val rdd1 = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 7, 8, 8, 8, 9))
scala> rdd1.distinct().sortBy(-_).collect
res11: Array[Int] = Array(9, 8, 7, 6, 5, 4, 3, 2, 1)
scala> val rdd2 = rdd1.map(x => (x, null)).reduceByKey((x,y)=>x).map(_._1).sortBy(-_)
scala> rdd2.collect
res12: Array[Int] = Array(9, 8, 7, 6, 5, 4, 3, 2, 1)
- aggregate
aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U)
val rdd1 = sc.parallelize(1 to 10,3)
/**
 * partition 0: 1,2,3
 * partition 1: 4,5,6
 * partition 2: 7,8,9,10
 */
def func1(a:Int,b:Int):Int = a * b
def func2(a:Int,b:Int):Int = a + b
// partition 0: 3 * (1*2*3) = 18
// partition 1: 3 * (4*5*6) = 360
// partition 2: 3 * (7*8*9*10) = 15120
// 18 + 360 + 15120 = 15498
// global merge: 15498 + 3 = 15501
rdd1.aggregate(3)(func1,func2)
scala> val rdd2 = sc.parallelize(List(List(1,3),List(2,4),List(3,5)),3)
scala> def func3(a:Int,b:List[Int]):Int = {
| a.max(b.max)
| }
func3: (a: Int, b: List[Int])Int
scala> def func4(a:Int,b:Int):Int = a + b
func4: (a: Int, b: Int)Int
scala> rdd2.aggregate(3)(func3,func4)
res0: Int = 15
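The zero value's double role above, seeding every partition and then the final merge, can be checked with a small pure-Scala emulation of aggregate (a sketch with the partitioning hard-coded, not Spark's implementation):

```scala
// Emulate RDD.aggregate: seqOp folds each partition starting from the
// zero value, then combOp folds the partition results, again starting
// from the zero value, so the zero is applied numPartitions + 1 times.
def emulateAggregate[T, U](partitions: Seq[Seq[T]], zero: U)
                          (seqOp: (U, T) => U, combOp: (U, U) => U): U =
  partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)

val parts = Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9, 10))
println(emulateAggregate(parts, 3)(_ * _, _ + _)) // 3 + 18 + 360 + 15120 = 15501
```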
- aggregateByKey
aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
combOp: (U, U) => U). The zero value participates only in the per-partition computation, not in the global merge; that is the difference from aggregate above.
scala> val rdd3 = sc.parallelize(List(("a",3),("a",2),("c",4),("b",3),("c",6),("c",8)),2)
scala> rdd3.aggregateByKey(10)(math.max(_,_),_+_)
scala> res1.collect
res2: Array[(String, Int)] = Array((b,10), (a,10), (c,20))
====================>
partition 0
("a",3),("a",2),("c",4)
a:
max(10, max(3,2)) ==> 10
c:
max(10, 4) ==> 10
-----------------------------------------------
partition 1
("b",3),("c",6),("c",8)
b:
max(10, 3) ==> 10
c:
max(10, max(6,8)) ==> 10
===============================================
scala> rdd3.aggregateByKey(5)(math.max(_,_),_+_)
scala> res3.collect
res5: Array[(String, Int)] = Array((b,5), (a,5), (c,13))
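That the zero value is applied once per key per partition, and never in the global merge, can be checked with a pure-Scala emulation (a sketch with the partitioning hard-coded, not Spark's implementation):

```scala
// Emulate RDD.aggregateByKey: within each partition, each key's values
// are folded starting from the zero value; partition results are then
// merged with combOp, with no zero value involved.
def emulateAggregateByKey[K, V, U](partitions: Seq[Seq[(K, V)]], zero: U)
                                  (seqOp: (U, V) => U, combOp: (U, U) => U): Map[K, U] = {
  val perPartition: Seq[Map[K, U]] = partitions.map { part =>
    part.groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).foldLeft(zero)(seqOp)
    }
  }
  perPartition.flatten.groupBy(_._1).map { case (k, kus) =>
    k -> kus.map(_._2).reduce(combOp) // reduce, not fold: no zero here
  }
}

val kvParts = Seq(Seq(("a", 3), ("a", 2), ("c", 4)),
                  Seq(("b", 3), ("c", 6), ("c", 8)))
emulateAggregateByKey(kvParts, 10)(math.max, _ + _).toSeq.sortBy(_._1).foreach(println)
// (a,10)  (b,10)  (c,20)
emulateAggregateByKey(kvParts, 5)(math.max, _ + _).toSeq.sortBy(_._1).foreach(println)
// (a,5)   (b,5)   (c,13)
```

With zero = 10, c caps both partitions at 10 and sums to 20; with zero = 5, partition 1's max(5, 6, 8) = 8 survives, giving 5 + 8 = 13.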
On whether a join shuffles
Narrow vs. wide dependencies in a shuffle
Narrow dependency
- Each partition of a parent RDD is used at most once, by a single partition of the child RDD
- One parent partition maps one-to-one to one child partition; the typical example is map
- Partitions of several parent RDDs map one-to-one to child partitions; the typical example is union
Wide dependency - a partition of the parent RDD is used by multiple child partitions
Among narrow dependencies there is a special join that does not go through a shuffle.
This shuffle-free join requires three conditions:
- RDD1's partition count = RDD2's partition count
- RDD1's partition count = the join's partition count
- RDD2's partition count = the join's partition count
/**
 * rdd1, rdd2, and the join all have the same partition count, so there is no shuffle
*/
val rdd1 = sc.parallelize(List(("香蕉",20), ("苹果",50), ("菠萝",30), ("猕猴桃", 50)),2)
val rdd2 = sc.parallelize(List(("草莓",90), ("苹果",25), ("菠萝",25), ("猕猴桃", 30), ("西瓜", 45)),2)
val rdd3 = rdd1.reduceByKey(_ + _)
val rdd4 = rdd2.reduceByKey(_ + _)
val joinRDD = rdd3.join(rdd4,2)
joinRDD.collect()
In the application's DAG, the path from the two reduceByKey steps to the join stays within a single stage, showing that no shuffle occurred.
Apart from the case above where all three conditions hold, any other join is a wide dependency; for example, specifying 3 partitions for the join turns it into one:
val joinRDD = rdd3.join(rdd4,3)
ByKey
Operators such as reduceByKey and groupByKey are all implemented on top of combineByKeyWithClassTag
createCombiner: V => C, builds the initial accumulator for a key (and fixes the accumulator type) from its first value
mergeValue: (C, V) => C, per-partition aggregation
mergeCombiners: (C, C) => C, global (cross-partition) merge
To understand combineByKeyWithClassTag, look at combineByKey: it simply forwards the same arguments.
Implementing reduceByKey-style summation
scala> val rdd2= sc.parallelize(List((1,3),(1,4),(1,2),(2,3),(3,6),(3,8)),3)
scala> rdd2.reduceByKey(_+_).collect
res1: Array[(Int, Int)] = Array((3,14), (1,9), (2,3))
// the same result with combineByKey
scala> rdd2.combineByKey(
| x=>x,
| (a:Int,b:Int)=> a+b,
| (x:Int,y:Int)=> x+y
| ).collect
res2: Array[(Int, Int)] = Array((3,14), (1,9), (2,3))
Computing per-key averages
scala> val rdd3= sc.parallelize(List(("a",88),("b",95),("a",91),("b",93),("a",95),("b",98)),2)
rdd3: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[4] at parallelize at <console>:24
scala> rdd3.combineByKey(
| (_,1),
| (a:(Int,Int),b)=>(a._1+b,a._2+1),
| (x:(Int,Int),y:(Int,Int))=>(x._1+y._1,x._2+y._2)
| ).map{
| case (k,v) => (k,v._1/v._2.toDouble)
| }.collect
res3: Array[(String, Double)] = Array((b,95.33333333333333), (a,91.33333333333333))
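The three combineByKey functions can be exercised without Spark. This sketch emulates the (sum, count) accumulator from the average example above (illustrative only; partitioning is hard-coded to match the 2-partition RDD):

```scala
// Emulate combineByKey: createCombiner seeds an accumulator from a key's
// first value in a partition, mergeValue folds the rest of that
// partition's values, and mergeCombiners merges the per-partition results.
def emulateCombineByKey[K, V, C](partitions: Seq[Seq[(K, V)]])
                                (createCombiner: V => C,
                                 mergeValue: (C, V) => C,
                                 mergeCombiners: (C, C) => C): Map[K, C] =
  partitions
    .map(_.groupBy(_._1).map { case (k, kvs) =>
      val vs = kvs.map(_._2)
      k -> vs.tail.foldLeft(createCombiner(vs.head))(mergeValue)
    })
    .flatten.groupBy(_._1)
    .map { case (k, kcs) => k -> kcs.map(_._2).reduce(mergeCombiners) }

val scoreParts = Seq(Seq(("a", 88), ("b", 95), ("a", 91)),
                     Seq(("b", 93), ("a", 95), ("b", 98)))
val avgs = emulateCombineByKey(scoreParts)(
  v => (v, 1),                                              // createCombiner
  (c: (Int, Int), v: Int) => (c._1 + v, c._2 + 1),          // mergeValue
  (x: (Int, Int), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2) // mergeCombiners
).map { case (k, (sum, n)) => k -> sum.toDouble / n }

avgs.toSeq.sortBy(_._1).foreach(println)
// (a,91.33333333333333)
// (b,95.33333333333333)
```

The results match the spark-shell output above: a = (88 + 91 + 95) / 3 and b = (95 + 93 + 98) / 3.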