Spark: Action Operators, PairRDD, PairRDD Transformations, PairRDD Actions

Action Operators

  1. An Action triggers the actual computation of an RDD and returns the result;
  2. An Action triggers a Job: a Spark program (Driver program) runs as many Jobs as it contains Action operators;
  3. Typical Action operators: collect / count
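
A minimal sketch of point 2, assuming the spark-shell's predefined SparkContext sc (variable names here are illustrative): every action call shows up as its own job.

    val nums = sc.parallelize(1 to 100)
    nums.count()     // first action  -> first job
    nums.collect()   // second action -> second job (a separate job in the Spark UI)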

Common Action Operators

  1. stats returns summary statistics (count, mean, stdev, max, min). It is only available on RDDs of numeric element types, which Spark implicitly converts for DoubleRDDFunctions, as with the RDD[Long] below.

    scala> val rdd1 = sc.range(1, 101)
    rdd1: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[84] at range at <console>:24
    
    scala> rdd1.stats
    res21: org.apache.spark.util.StatCounter = (count: 100, mean: 50.500000, stdev: 28.866070, max: 100.000000, min: 1.000000)
    
  2. count can be called on an RDD of any element type

    scala> val rdd2 = sc.range(1, 101)
    rdd2: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[88] at range at <console>:24
    
    scala> rdd1.zip(rdd2).count
    res22: Long = 100
    
  3. Aggregation: reduce(func) / fold(func) / aggregate(func)

    scala> val rdd = sc.makeRDD(1 to 10, 2)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[90] at makeRDD at <console>:24
    
    scala> rdd.reduce(_+_)
    res23: Int = 55
    scala> rdd.fold(0)(_+_)
    res35: Int = 55
    scala> rdd.fold(1)(_+_)
    res39: Int = 58
    scala> rdd.fold(1)((x, y) => {
         | println(s"x=$x, y=$y")
         | x+y
         | })
    x=1, y=16
    x=17, y=41
    res40: Int = 58
    // 58 = 55 + 3*1: with 2 partitions, the zero value is applied once per partition
    // (15+1=16, 40+1=41) and once more in the driver-side merge (1+16=17, 17+41=58).
    // The same reasoning explains aggregate(1) returning 58 below.
    
    scala> rdd.aggregate(0)(_+_, _+_)
    res41: Int = 55
    
    scala> rdd.aggregate(1)(_+_, _+_)
    res42: Int = 58
    
    scala> rdd.aggregate(1)(
         | (a, b) => {
         | println(s"a=$a, b=$b")
         | a+b
         | },
         | (x, y) => {
         | println(s"x=$x, y=$y")
         | x+y
         | })
    x=1, y=16
    x=17, y=41
    res43: Int = 58
    
    
  4. first / take(n) / top(n): fetch elements from the RDD; take(n) returns the first n elements and top(n) the n largest, in descending order.

    scala> rdd.first
    res44: Int = 1
    
    scala> rdd.take(10)
    res45: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    
    scala> rdd.take(5)
    res46: Array[Int] = Array(1, 2, 3, 4, 5)
    
    scala> rdd.top(5)
    res47: Array[Int] = Array(10, 9, 8, 7, 6)
    
    scala> rdd.top(10)
    res48: Array[Int] = Array(10, 9, 8, 7, 6, 5, 4, 3, 2, 1)
    
  5. takeSample(withReplacement, num) draws a random sample and returns it to the driver

    scala> rdd.takeSample(false, 5)
    res49: Array[Int] = Array(2, 9, 10, 4, 8)
    
    scala> rdd.takeSample(false, 5)
    res50: Array[Int] = Array(6, 7, 10, 9, 8)
    
  6. Save the RDD to the given path. One output file is written per partition, so be mindful of the small-files problem when saving (see the coalesce sketch below the example).

    scala> rdd.saveAsTextFile("data/t1")
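
    One common way to mitigate the small-files problem is to reduce the number of partitions before saving, e.g. with coalesce. A minimal sketch (the output path is illustrative):

    // write a single output file instead of one per partition
    rdd.coalesce(1).saveAsTextFile("data/t2")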
    

PairRDD Operations

  1. RDDs are broadly divided into Value RDDs and Key-Value RDDs. A key-value RDD is also called a PairRDD.

  2. Operations on Value RDDs are mostly defined in RDD.scala;

  3. Operations on key-value RDDs are defined in PairRDDFunctions.scala;

  4. Pair RDDs also have Transformation and Action operators of their own;

    val arr = (1 to 10).toArray
    val arr1 = arr.map(x => (x, x*10, x*100))
    // rdd1 is not a Pair RDD (its elements are 3-tuples, not key-value pairs)
    val rdd1 = sc.makeRDD(arr1)
    
    // rdd2 is a Pair RDD (its elements are (key, value) pairs)
    val arr2 = arr.map(x => (x, (x*10, x*100)))
    val rdd2 = sc.makeRDD(arr2)
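
    A quick check of the difference (a sketch; expected results are shown as comments): the pair-RDD operators are added only for RDDs of 2-tuples, so rdd2 gets them while rdd1 does not.

    rdd2.keys.collect       // Array(1, 2, ..., 10)
    rdd2.values.collect     // Array((10,100), (20,200), ..., (100,1000))
    // rdd1.keys            // does not compile: RDD[(Int, Int, Int)] is not a pair RDD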
    

PairRDD Transformations

  1. map-like operations: mapValues / flatMapValues / keys / values. These are all convenience operations that could also be written with a plain map.

    scala> val a = sc.parallelize(List((1,2),(3,4),(5,6)))
    a: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[100] at parallelize at <console>:24
    
    scala> val b = a.mapValues(x=>1 to x)
    b: org.apache.spark.rdd.RDD[(Int, scala.collection.immutable.Range.Inclusive)] = MapPartitionsRDD[101] at mapValues at <console>:25
    
    scala> b.collect
    res52: Array[(Int, scala.collection.immutable.Range.Inclusive)] = Array((1,Range 1 to 2), (3,Range 1 to 4), (5,Range 1 to 6))
    
    // implementing mapValues with plain map
    scala> val b = a.map(x => (x._1, 1 to x._2))
    b: org.apache.spark.rdd.RDD[(Int, scala.collection.immutable.Range.Inclusive)] = MapPartitionsRDD[102] at map at <console>:25
    
    scala> b.collect
    res53: Array[(Int, scala.collection.immutable.Range.Inclusive)] = Array((1,Range 1 to 2), (3,Range 1 to 4), (5,Range 1 to 6))
    
    scala> val b = a.map{case (k, v) => (k, 1 to v)}
    b: org.apache.spark.rdd.RDD[(Int, scala.collection.immutable.Range.Inclusive)] = MapPartitionsRDD[103] at map at <console>:25
    
    scala> b.collect
    res54: Array[(Int, scala.collection.immutable.Range.Inclusive)] = Array((1,Range 1 to 2), (3,Range 1 to 4), (5,Range 1 to 6))
    
    
  2. flatMapValues flattens each value while keeping its key

    scala> val c = a.flatMapValues(x=>1 to x)
    c: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[105] at flatMapValues at <console>:25
    
    scala> c.collect
    res56: Array[(Int, Int)] = Array((1,1), (1,2), (3,1), (3,2), (3,3), (3,4), (5,1), (5,2), (5,3), (5,4), (5,5), (5,6))
    
    scala> val c = a.mapValues(x=>1 to x).flatMap{case (k, v) => v.map(x=> (k, x))}
    c: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[109] at flatMap at <console>:25
    
    scala> c.collect
    res57: Array[(Int, Int)] = Array((1,1), (1,2), (3,1), (3,2), (3,3), (3,4), (5,1), (5,2), (5,3), (5,4), (5,5), (5,6))
    
    scala> c.map{case (k, v) => k}.collect
    res60: Array[Int] = Array(1, 1, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5)
    
    scala> c.map{case (k, _) => k}.collect
    res61: Array[Int] = Array(1, 1, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5)
    
    scala> c.map{case (_, v) => v}.collect
    res62: Array[Int] = Array(1, 2, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6)
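    
    These two map calls are exactly what the built-in keys and values methods (listed in item 1 above) provide, e.g.:
    
    c.keys.collect      // Array(1, 1, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5)
    c.values.collect    // Array(1, 2, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6)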
    
    
  3. PairRDD aggregation. PairRDD(k, v) operations are very widely used; the aggregation operators are groupByKey / reduceByKey / foldByKey / aggregateByKey

  4. combineByKey (old) / combineByKeyWithClassTag (new) are the underlying implementation of these aggregations (a sketch with combineByKey follows the four solutions below)

Given the data ("spark", 12), ("hadoop", 26), ("hadoop", 23), ("spark", 15), ("scala", 26), ("spark", 25), ("spark", 23), ("hadoop", 16), ("scala", 24), ("spark", 16), where the key is a book title and the value is that book's sales for one day, compute the average value per key, i.e. the average daily sales of each book.
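
The four solutions below assume this data has been parallelized into a pair RDD, for example:

val rdd = sc.makeRDD(List(("spark", 12), ("hadoop", 26), ("hadoop", 23), ("spark", 15),
  ("scala", 26), ("spark", 25), ("spark", 23), ("hadoop", 16), ("scala", 24), ("spark", 16)))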

Solution 1: groupByKey + mapValues

scala> rdd.groupByKey.mapValues(v => v.sum.toDouble/v.size).collect
res65: Array[(String, Double)] = Array((scala,25.0), (hadoop,21.666666666666668), (spark,18.2))

Solution 2: reduceByKey + mapValues

scala> rdd.mapValues((_, 1)).reduceByKey((x, y)=> (x._1+y._1, x._2+y._2)).mapValues(x => (x._1.toDouble / x._2)).collect
res73: Array[(String, Double)] = Array((scala,25.0), (hadoop,21.666666666666668), (spark,18.2))

Solution 3: foldByKey + mapValues

scala> rdd.mapValues((_, 1)).foldByKey((0, 0))((x, y) => {(x._1+y._1, x._2+y._2)}).mapValues(x=>x._1.toDouble/x._2).collect
res74: Array[(String, Double)] = Array((scala,25.0), (hadoop,21.666666666666668), (spark,18.2))

Solution 4: aggregateByKey + mapValues

scala> rdd.mapValues((_, 1)).aggregateByKey((0,0))((x, y) => (x._1 + y._1, x._2 + y._2),(a, b) => (a._1 + b._1, a._2 + b._2)).mapValues(x=>x._1.toDouble / x._2).collect
res75: Array[(String, Double)] = Array((scala,25.0), (hadoop,21.666666666666668), (spark,18.2))
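
Since combineByKey is listed above as the underlying primitive, the same average can also be written with it directly. This is a sketch, not part of the original session:

rdd.combineByKey(
  (v: Int) => (v, 1),                                            // createCombiner: first value seen for a key -> (sum, count)
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),         // mergeValue: fold another value into (sum, count)
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)   // mergeCombiners: merge partial (sum, count) pairs
).mapValues(x => x._1.toDouble / x._2).collect
// expected: Array((scala,25.0), (hadoop,21.666666666666668), (spark,18.2))
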
  1. subtractByKey: like subtract, removes every element of the RDD whose key also appears in the other RDD

    scala> val rdd1 = sc.makeRDD(Array(("spark", 12), ("hadoop", 26),("hadoop", 23), ("spark", 15)))
    rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[144] at makeRDD at <console>:24
    
    scala> val rdd2 = sc.makeRDD(Array(("spark", 100), ("hadoop", 300)))
    rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[145] at makeRDD at <console>:24
    
    scala> rdd1.subtractByKey(rdd2).collect()
    res78: Array[(String, Int)] = Array()
    // empty: both of rdd1's keys ("spark", "hadoop") also appear in rdd2
    
    scala> val rdd = sc.makeRDD(Array(("a",1), ("b",2), ("c",3), ("a",5),("d",5)))
    rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[149] at makeRDD at <console>:24
    
    scala> val other = sc.makeRDD(Array(("a",10), ("b",20), ("c",30))) 
    other: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[150] at makeRDD at <console>:24
    
    scala> rdd.subtractByKey(other).collect()
    res81: Array[(String, Int)] = Array((d,5))
    

Sorting

  1. sortByKey: sortByKey operates on a PairRDD and sorts by key; it is implemented in org.apache.spark.rdd.OrderedRDDFunctions:

    scala> val a = sc.parallelize(List("wyp", "iteblog", "com","397090770", "test"))
    a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[152] at parallelize at <console>:24
    
    scala> val b = sc.parallelize (1 to a.count.toInt)
    b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[153] at parallelize at <console>:26
    
    scala> val c = a.zip(b)
    c: org.apache.spark.rdd.RDD[(String, Int)] = ZippedPartitionsRDD2[154] at zip at <console>:27
    
    scala> c.sortByKey().collect
    res82: Array[(String, Int)] = Array((397090770,4), (com,3), (iteblog,2), (test,5), (wyp,1))
    
    scala> c.sortByKey(false).collect
    res83: Array[(String, Int)] = Array((wyp,1), (test,5), (iteblog,2), (com,3), (397090770,4))
    
    

Join operations: cogroup / join / leftOuterJoin / rightOuterJoin / fullOuterJoin

scala> val rdd1 = sc.makeRDD(Array(("1","Spark"),("2","Hadoop"),("3","Scala"),("4","Java")))
rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[0] at makeRDD at <console>:24

scala> val rdd2 = sc.makeRDD(Array(("3","20K"),("4","18K"),("5","25K"),("6","10K")))
rdd2: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[1] at makeRDD at <console>:24

scala> rdd1.join(rdd2).collect
res0: Array[(String, (String, String))] = Array((3,(Scala,20K)), (4,(Java,18K)))

scala> rdd1.leftOuterJoin(rdd2).collect
res1: Array[(String, (String, Option[String]))] = Array((1,(Spark,None)), (2,(Hadoop,None)), (3,(Scala,Some(20K))), (4,(Java,Some(18K))))

scala> rdd1.rightOuterJoin(rdd2).collect
res2: Array[(String, (Option[String], String))] = Array((6,(None,10K)), (3,(Some(Scala),20K)), (4,(Some(Java),18K)), (5,(None,25K)))

scala> rdd1.fullOuterJoin(rdd2).collect
res3: Array[(String, (Option[String], Option[String]))] = Array((6,(None,Some(10K))), (1,(Some(Spark),None)), (2,(Some(Hadoop),None)), (3,(Some(Scala),Some(20K))), (4,(Some(Java),Some(18K))), (5,(None,Some(25K))))
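
cogroup, also listed above, groups the values for each key from both RDDs into a pair of Iterables. A sketch with the same rdd1 and rdd2 (expected shape shown as a comment):

rdd1.cogroup(rdd2).collect
// e.g. (3,(CompactBuffer(Scala),CompactBuffer(20K))), (1,(CompactBuffer(Spark),CompactBuffer())), ...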

PairRDD Actions

  1. collectAsMap / countByKey / lookup(key)

    scala> val rdd1 = sc.makeRDD(Array(("1","Spark"),("2","Hadoop"),("3","Scala"),("4","Java")))
    rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[19] at makeRDD at <console>:24
    
    scala> rdd1.collectAsMap
    res11: scala.collection.Map[String,String] = Map(2 -> Hadoop, 1 -> Spark, 4 -> Java, 3 -> Scala)
    
    scala> val rdd1 = sc.makeRDD(Array(("1","Spark"),("2","Hadoop"),("3","Scala"),("4","Java")))
    rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[22] at makeRDD at <console>:24
    
    scala> rdd1.countByKey
    res12: scala.collection.Map[String,Long] = Map(1 -> 1, 2 -> 1, 3 -> 1, 4 -> 1)
    
  2. lookup(key): an efficient lookup; when the RDD has a partitioner, only the partition that the key belongs to is searched

    
    scala> rdd1.lookup("1")
    res13: Seq[String] = WrappedArray(Spark)
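    
    A sketch of exploiting that: give the RDD a partitioner first, and lookup then scans only the partition the key hashes to (the variable name is illustrative).
    
    import org.apache.spark.HashPartitioner
    
    val partitioned = rdd1.partitionBy(new HashPartitioner(4)).cache()
    partitioned.lookup("1")    // only the partition that "1" hashes to is scanned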
    