Action Operators
- Action operators trigger the computation of an RDD and return the computed result;
- Each Action triggers a Job. A Spark program (Driver program) runs as many Jobs as it contains Action operators;
- Typical Action operators: collect / count
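As a minimal illustration (a sketch assuming a spark-shell session where sc is the SparkContext), the snippet below contains two Actions, so it triggers two Jobs, which appear as two entries on the Jobs page of the Spark UI:

val rdd = sc.range(1, 101)
rdd.count     // Action 1 -> Job 0
rdd.collect   // Action 2 -> Job 1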
Common Action Operators
- stats: returns summary statistics. It is only available on RDD[Double] (numeric RDDs such as the RDD[Long] below are converted implicitly)
scala> val rdd1 = sc.range(1, 101)
rdd1: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[84] at range at <console>:24

scala> rdd1.stats
res21: org.apache.spark.util.StatCounter = (count: 100, mean: 50.500000, stdev: 28.866070, max: 100.000000, min: 1.000000)
- count: can be called on an RDD of any type
scala> val rdd2 = sc.range(1, 101)
rdd2: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[88] at range at <console>:24

scala> rdd1.zip(rdd2).count
res22: Long = 100
- Aggregation operations: reduce(func) / fold(func) / aggregate(func)
scala> val rdd = sc.makeRDD(1 to 10, 2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[90] at makeRDD at <console>:24

scala> rdd.reduce(_+_)
res23: Int = 55

scala> rdd.fold(0)(_+_)
res35: Int = 55

scala> rdd.fold(1)(_+_)
res39: Int = 58

scala> rdd.fold(1)((x, y) => {
     |   println(s"x=$x, y=$y")
     |   x+y
     | })
x=1, y=16
x=17, y=41
res40: Int = 58

scala> rdd.aggregate(0)(_+_, _+_)
res41: Int = 55

scala> rdd.aggregate(1)(_+_, _+_)
res42: Int = 58

scala> rdd.aggregate(1)(
     |   (a, b) => {
     |     println(s"a=$a, b=$b")
     |     a+b
     |   },
     |   (x, y) => {
     |     println(s"x=$x, y=$y")
     |     x+y
     |   })
x=1, y=16
x=17, y=41
res43: Int = 58
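Why fold(1) returns 58 instead of 55: the zero value is applied once inside each of the 2 partitions and once more when the partial results are merged on the driver, so 3 is added in total. aggregate generalizes this by allowing the result type to differ from the element type; a small sketch (assuming the same rdd with 2 partitions):

// fold(1)(_+_) with 2 partitions:
//   partition 1: 1 + (1+2+3+4+5)  = 16
//   partition 2: 1 + (6+7+8+9+10) = 41
//   driver:      1 + 16 + 41      = 58

// aggregate can return a different type, e.g. (sum, count) in one pass:
val sumCount = rdd.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),    // seqOp: fold one element into the accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2)     // combOp: merge per-partition accumulators
)
// sumCount == (55, 10)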
- first / take(n) / top(n): retrieve elements from the RDD
scala> rdd.first
res44: Int = 1

scala> rdd.take(10)
res45: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> rdd.take(5)
res46: Array[Int] = Array(1, 2, 3, 4, 5)

scala> rdd.top(5)
res47: Array[Int] = Array(10, 9, 8, 7, 6)

scala> rdd.top(10)
res48: Array[Int] = Array(10, 9, 8, 7, 6, 5, 4, 3, 2, 1)
- takeSample: takes a sample of the RDD and returns the result
scala> rdd.takeSample(false, 5)
res49: Array[Int] = Array(2, 9, 10, 4, 8)

scala> rdd.takeSample(false, 5)
res50: Array[Int] = Array(6, 7, 10, 9, 8)
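takeSample also accepts a withReplacement flag and an optional seed; a sketch (the seed value 42 is arbitrary) of reproducible sampling:

// sample 5 elements with replacement, using a fixed seed so the result is repeatable
rdd.takeSample(true, 5, 42)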
- saveAsTextFile: saves the RDD to the given path (one output file is written per partition; beware of the small-files problem when saving). A way to reduce the number of output files is shown after the example.
scala> rdd.saveAsTextFile("data/t1")
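One common way to mitigate the small-files problem is to reduce the number of partitions before saving; a sketch (the output path data/t2 is just an example):

// coalesce to a single partition so only one part file is written
rdd.coalesce(1).saveAsTextFile("data/t2")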
PairRDD Operations
- Overall, RDDs are divided into Value types and Key-Value types. Key-value RDDs are also called PairRDDs.
- Operations on Value-type RDDs are mostly defined in RDD.scala;
- Operations on key-value RDDs are defined in PairRDDFunctions.scala;
- PairRDDs also have their own Transformation and Action operators;
val arr = (1 to 10).toArray
val arr1 = arr.map(x => (x, x*10, x*100))
// rdd1 is not a PairRDD (its elements are 3-tuples)
val rdd1 = sc.makeRDD(arr1)
// rdd2 is a PairRDD (its elements are key-value 2-tuples)
val arr2 = arr.map(x => (x, (x*10, x*100)))
val rdd2 = sc.makeRDD(arr2)
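The distinction matters because only rdd2 picks up the PairRDD operators (via the implicit conversion to PairRDDFunctions); a sketch:

// rdd2: RDD[(Int, (Int, Int))] is a PairRDD, so reduceByKey is available
rdd2.reduceByKey((a, b) => a).collect
// rdd1: RDD[(Int, Int, Int)] holds 3-tuples, so rdd1.reduceByKey(...) does not compile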
PairRDD Transformation Operations
- Map-like operations: mapValues / flatMapValues / keys / values. All of them can be implemented with map; they are provided as convenience operations.
scala> val a = sc.parallelize(List((1,2),(3,4),(5,6)))
a: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[100] at parallelize at <console>:24

scala> val b = a.mapValues(x=>1 to x)
b: org.apache.spark.rdd.RDD[(Int, scala.collection.immutable.Range.Inclusive)] = MapPartitionsRDD[101] at mapValues at <console>:25

scala> b.collect
res52: Array[(Int, scala.collection.immutable.Range.Inclusive)] = Array((1,Range 1 to 2), (3,Range 1 to 4), (5,Range 1 to 6))

// implementing the same mapValues operation with map
scala> val b = a.map(x => (x._1, 1 to x._2))
b: org.apache.spark.rdd.RDD[(Int, scala.collection.immutable.Range.Inclusive)] = MapPartitionsRDD[102] at map at <console>:25

scala> b.collect
res53: Array[(Int, scala.collection.immutable.Range.Inclusive)] = Array((1,Range 1 to 2), (3,Range 1 to 4), (5,Range 1 to 6))

scala> val b = a.map{case (k, v) => (k, 1 to v)}
b: org.apache.spark.rdd.RDD[(Int, scala.collection.immutable.Range.Inclusive)] = MapPartitionsRDD[103] at map at <console>:25

scala> b.collect
res54: Array[(Int, scala.collection.immutable.Range.Inclusive)] = Array((1,Range 1 to 2), (3,Range 1 to 4), (5,Range 1 to 6))
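keys and values, also listed above, simply extract one side of each pair; a sketch using the same RDD a:

a.keys.collect     // Array(1, 3, 5)
a.values.collect   // Array(2, 4, 6)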
- flatMapValues flattens the values
scala> val c = a.flatMapValues(x=>1 to x)
c: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[105] at flatMapValues at <console>:25

scala> c.collect
res56: Array[(Int, Int)] = Array((1,1), (1,2), (3,1), (3,2), (3,3), (3,4), (5,1), (5,2), (5,3), (5,4), (5,5), (5,6))

scala> val c = a.mapValues(x=>1 to x).flatMap{case (k, v) => v.map(x=> (k, x))}
c: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[109] at flatMap at <console>:25

scala> c.collect
res57: Array[(Int, Int)] = Array((1,1), (1,2), (3,1), (3,2), (3,3), (3,4), (5,1), (5,2), (5,3), (5,4), (5,5), (5,6))

scala> c.map{case (k, v) => k}.collect
res60: Array[Int] = Array(1, 1, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5)

scala> c.map{case (k, _) => k}.collect
res61: Array[Int] = Array(1, 1, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5)

scala> c.map{case (_, v) => v}.collect
res62: Array[Int] = Array(1, 2, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6)
- PairRDD aggregation operations. PairRDD (k, v) has a wide range of uses; aggregation: groupByKey / reduceByKey / foldByKey / aggregateByKey
- combineByKey (old) / combineByKeyWithClassTag (new) => the underlying implementation of the operators above (see the combineByKey sketch after the four solutions below)
Given the data: ("spark", 12), ("hadoop", 26), ("hadoop", 23), ("spark", 15), ("scala", 26), ("spark", 25), ("spark", 23), ("hadoop", 16), ("scala", 24), ("spark", 16), where the key of each pair is a book title and the value is that book's sales on a given day, compute the average value for each key, i.e. the average daily sales of each book.
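A minimal setup for the four solutions below (assuming the pairs are loaded into an RDD named rdd, as the examples use):

val rdd = sc.makeRDD(List(
  ("spark", 12), ("hadoop", 26), ("hadoop", 23), ("spark", 15), ("scala", 26),
  ("spark", 25), ("spark", 23), ("hadoop", 16), ("scala", 24), ("spark", 16)
))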
Solution 1: groupByKey + mapValues
scala> rdd.groupByKey.mapValues(v => v.sum.toDouble/v.size).collect
res65: Array[(String, Double)] = Array((scala,25.0), (hadoop,21.666666666666668), (spark,18.2))
Solution 2: reduceByKey + mapValues
scala> rdd.mapValues((_, 1)).reduceByKey((x, y)=> (x._1+y._1, x._2+y._2)).mapValues(x => (x._1.toDouble / x._2)).collect
res73: Array[(String, Double)] = Array((scala,25.0), (hadoop,21.666666666666668), (spark,18.2))
Solution 3: foldByKey + mapValues
scala> rdd.mapValues((_, 1)).foldByKey((0, 0))((x, y) => {(x._1+y._1, x._2+y._2)}).mapValues(x=>x._1.toDouble/x._2).collect
res74: Array[(String, Double)] = Array((scala,25.0), (hadoop,21.666666666666668), (spark,18.2))
Solution 4: aggregateByKey + mapValues
scala> rdd.mapValues((_, 1)).aggregateByKey((0,0))((x, y) => (x._1 + y._1, x._2 + y._2),(a, b) => (a._1 + b._1, a._2 + b._2)).mapValues(x=>x._1.toDouble / x._2).collect
res75: Array[(String, Double)] = Array((scala,25.0), (hadoop,21.666666666666668), (spark,18.2))
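For completeness, the same average can be computed with combineByKey, the primitive the operators above are built on; a sketch (not part of the original transcript):

rdd.combineByKey(
  (v: Int) => (v, 1),                                            // createCombiner: first value seen for a key in a partition
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),         // mergeValue: add another value for the same key
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)   // mergeCombiners: merge accumulators across partitions
).mapValues(x => x._1.toDouble / x._2).collect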
- subtractByKey: similar to subtract; removes elements from the RDD whose key also appears in the other RDD
scala> val rdd1 = sc.makeRDD(Array(("spark", 12), ("hadoop", 26), ("hadoop", 23), ("spark", 15)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[144] at makeRDD at <console>:24

scala> val rdd2 = sc.makeRDD(Array(("spark", 100), ("hadoop", 300)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[145] at makeRDD at <console>:24

scala> rdd1.subtractByKey(rdd2).collect()
res78: Array[(String, Int)] = Array()

scala> val rdd = sc.makeRDD(Array(("a",1), ("b",2), ("c",3), ("a",5), ("d",5)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[149] at makeRDD at <console>:24

scala> val other = sc.makeRDD(Array(("a",10), ("b",20), ("c",30)))
other: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[150] at makeRDD at <console>:24

scala> rdd.subtractByKey(other).collect()
res81: Array[(String, Int)] = Array((d,5))
Sorting Operations
- sortByKey: operates on a PairRDD and sorts it by key. Implemented in org.apache.spark.rdd.OrderedRDDFunctions:
scala> val a = sc.parallelize(List("wyp", "iteblog", "com", "397090770", "test"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[152] at parallelize at <console>:24

scala> val b = sc.parallelize(1 to a.count.toInt)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[153] at parallelize at <console>:26

scala> val c = a.zip(b)
c: org.apache.spark.rdd.RDD[(String, Int)] = ZippedPartitionsRDD2[154] at zip at <console>:27

scala> c.sortByKey().collect
res82: Array[(String, Int)] = Array((397090770,4), (com,3), (iteblog,2), (test,5), (wyp,1))

scala> c.sortByKey(false).collect
res83: Array[(String, Int)] = Array((wyp,1), (test,5), (iteblog,2), (com,3), (397090770,4))
Join operations: cogroup / join / leftOuterJoin / rightOuterJoin / fullOuterJoin
scala> val rdd1 = sc.makeRDD(Array(("1","Spark"),("2","Hadoop"),("3","Scala"),("4","Java")))
rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[0] at makeRDD at <console>:24
scala> val rdd2 = sc.makeRDD(Array(("3","20K"),("4","18K"),("5","25K"),("6","10K")))
rdd2: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[1] at makeRDD at <console>:24
scala> rdd1.join(rdd2).collect
res0: Array[(String, (String, String))] = Array((3,(Scala,20K)), (4,(Java,18K)))
scala> rdd1.leftOuterJoin(rdd2).collect
res1: Array[(String, (String, Option[String]))] = Array((1,(Spark,None)), (2,(Hadoop,None)), (3,(Scala,Some(20K))), (4,(Java,Some(18K))))
scala> rdd1.rightOuterJoin(rdd2).collect
res2: Array[(String, (Option[String], String))] = Array((6,(None,10K)), (3,(Some(Scala),20K)), (4,(Some(Java),18K)), (5,(None,25K)))
scala> rdd1.fullOuterJoin(rdd2).collect
res3: Array[(String, (Option[String], Option[String]))] = Array((6,(None,Some(10K))), (1,(Some(Spark),None)), (2,(Some(Hadoop),None)), (3,(Some(Scala),Some(20K))), (4,(Some(Java),Some(18K))), (5,(None,Some(25K))))
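cogroup, which the joins above are built on, groups the values from both RDDs by key and keeps every key from either side; a sketch using the same rdd1 and rdd2:

rdd1.cogroup(rdd2).collect
// each element is (key, (values from rdd1, values from rdd2)),
// e.g. key "3" -> (Iterable("Scala"), Iterable("20K")), key "5" -> (Iterable(), Iterable("25K"))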
PairRDD Action Operations
- collectAsMap / countByKey / lookup(key)
scala> val rdd1 = sc.makeRDD(Array(("1","Spark"),("2","Hadoop"),("3","Scala"),("4","Java"))) rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[19] at makeRDD at <console>:24 scala> rdd1.collectAsMap res11: scala.collection.Map[String,String] = Map(2 -> Hadoop, 1 -> Spark, 4 -> Java, 3 -> Scala) scala> val rdd1 = sc.makeRDD(Array(("1","Spark"),("2","Hadoop"),("3","Scala"),("4","Java"))) rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[22] at makeRDD at <console>:24 scala> rdd1.countByKey res12: scala.collection.Map[String,Long] = Map(1 -> 1, 2 -> 1, 3 -> 1, 4 -> 1)
- lookup(key): an efficient lookup method; it only scans the partition that contains the key (if the RDD has a partitioner)
scala> rdd1.lookup("1")
res13: Seq[String] = WrappedArray(Spark)
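When the RDD does have a partitioner, lookup only runs a job on the single partition that owns the key; a sketch (the partition count 4 is arbitrary):

import org.apache.spark.HashPartitioner

val partitioned = rdd1.partitionBy(new HashPartitioner(4))
partitioned.lookup("1")   // only the partition that hashes key "1" is scanned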