Underlying Implementation of Spark Operators

深海里的小菜鸟

Category: spark

Article link: https://blog.csdn.net/EveryDayALittle/article/details/107885652

Transformation operators (these do not trigger an action)

Operators that do not shuffle

map: internally creates a MapPartitionsRDD. TaskContext gives access to the context of the running task.


    import org.apache.spark.TaskContext

    val list = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7), 2) // two partitions
    list.map(x => {
      (TaskContext.getPartitionId(), x * 10) // (partition ID, value * 10)
    }) // collected result: ArrayBuffer((0,10), (0,20), (0,30), (1,40), (1,50), (1,60), (1,70))

    // Inside RDD.map; arguments: (parent RDD, function of (TaskContext, partition index, iterator))
    new MapPartitionsRDD[U, T](this, (_, _, iter) => iter.map(cleanF))

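mapPartitions: applies a function once per partition instead of once per element. It takes two parameters: 1) a function over the iterator that holds the partition's data; 2) preservesPartitioning: Boolean, whether to keep the parent pair RDD's partitioner (default false). Internally it also creates a MapPartitionsRDD.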

    val list = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7), 2)
    list.mapPartitions(it => {
      val index = TaskContext.getPartitionId() // evaluated once per partition
      val iterator = it.map(x => { // Iterator.map; other iterator methods work as well
        (index, x * 10)
      })
      iterator
    }) // collected result: ArrayBuffer((0,10), (0,20), (0,30), (1,40), (1,50), (1,60), (1,70))

mapPartitionsWithIndex: essentially the same as mapPartitions, but the partition index is passed to the function directly.
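
The relationship between these three operators can be sketched without Spark. The block below is a plain-Scala model, not Spark code: an "RDD" is assumed to be just a `Vector` of partitions, and `PartitionModel` and its method names are hypothetical. It only illustrates that `map` and `mapPartitions` both reduce to one function of (partition index, iterator) applied per partition.

```scala
// Plain-Scala model (no Spark dependency): a dataset is a vector of partitions.
object PartitionModel {
  type Partitions[T] = Vector[Vector[T]]

  // Counterpart of MapPartitionsRDD: run f(partitionIndex, iterator) once per partition.
  def mapPartitionsWithIndex[T, U](parts: Partitions[T])(
      f: (Int, Iterator[T]) => Iterator[U]): Partitions[U] =
    parts.zipWithIndex.map { case (part, idx) => f(idx, part.iterator).toVector }

  // mapPartitions is the same thing with the index ignored.
  def mapPartitions[T, U](parts: Partitions[T])(
      f: Iterator[T] => Iterator[U]): Partitions[U] =
    mapPartitionsWithIndex(parts)((_, it) => f(it))

  // map pushes the element function through the partition iterator.
  def map[T, U](parts: Partitions[T])(f: T => U): Partitions[U] =
    mapPartitions(parts)(_.map(f))
}
```

With `Vector(Vector(1, 2, 3), Vector(4, 5, 6, 7))` as input, `PartitionModel.mapPartitionsWithIndex(parts)((idx, it) => it.map(x => (idx, x * 10)))` produces the same (partition, value) pairs as the Spark examples above.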

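flatMap: equivalent to map followed by flatten; internally it also creates a MapPartitionsRDD. flatMapValues applies the same operation to the values of a pair RDD and leaves the keys unchanged.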

    import org.apache.spark.rdd.RDD

    val arr: Array[String] = Array("spark hadoop flink spark", "spark hadoop flinl", "spark hadoop hadoop")
    val lines: RDD[String] = sc.parallelize(arr)
    val flat: RDD[String] = lines.flatMap(x => x.split(" "))
    val r = flat.collect() // returns Array(spark,hadoop,flink,spark,spark,hadoop,flinl,spark,hadoop,hadoop)
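
The "map, then flatten" reading of flatMap, and the way flatMapValues keeps the key, can be checked on ordinary Scala collections. This is a sketch on `Seq`, not on an RDD; `FlatMapModel` and its method names are made up for illustration.

```scala
// Plain-Scala sketch: flatMap and flatMapValues expressed with map and flatten.
object FlatMapModel {
  // flatMap is map followed by flatten.
  def flatMapViaMap[T, U](xs: Seq[T])(f: T => Seq[U]): Seq[U] =
    xs.map(f).flatten

  // flatMapValues expands each value and pairs every result with the original key.
  def flatMapValues[K, V, U](pairs: Seq[(K, V)])(f: V => Seq[U]): Seq[(K, U)] =
    pairs.flatMap { case (k, v) => f(v).map(u => (k, u)) }
}
```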

Operators that shuffle

groupByKey: internally calls combineByKeyWithClassTag, which creates a ShuffledRDD instance.

    val num = sc.parallelize(List(1, 2, 3, 1, 3, 2, 5, 2, 3, 4, 5, 2), 3)
    num.map((_, 1)).groupByKey() // collected result: ArrayBuffer((3,CompactBuffer(1, 1, 1)), (4,CompactBuffer(1)), (1,CompactBuffer(1, 1)), (5,CompactBuffer(1, 1)), (2,CompactBuffer(1, 1, 1, 1)))

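    // The same effect implemented directly with ShuffledRDD
    import org.apache.spark.{Aggregator, HashPartitioner}
    import org.apache.spark.rdd.ShuffledRDD
    import scala.collection.mutable.ArrayBuffer

    val tp = num.map((_, 1))
    // Type parameters: key type, value type, combined (aggregated) type
    val shuffleRdd: ShuffledRDD[Int, Int, ArrayBuffer[Int]] = new ShuffledRDD[Int, Int, ArrayBuffer[Int]](tp, new HashPartitioner(tp.partitions.length))
    // createCombiner: put the first value of a key into a new ArrayBuffer (Spark itself uses CompactBuffer)
    val createCombiner = (x: Int) => ArrayBuffer(x)
    // mergeValue: append a further value of the same key to the existing buffer
    val mergeValue = (ab: ArrayBuffer[Int], e: Int) => ab += e
    // mergeCombiners: merge two buffers of the same key
    val mergeCombiners = (ab1: ArrayBuffer[Int], ab2: ArrayBuffer[Int]) => ab1 ++= ab2
    // Pass in the aggregator
    shuffleRdd.setAggregator(new Aggregator[Int, Int, ArrayBuffer[Int]](createCombiner, mergeValue, mergeCombiners))
    // Disable map-side combine: nothing is merged before the shuffle write
    shuffleRdd.setMapSideCombine(false)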
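
To make the roles of the three aggregator functions and of map-side combine more concrete, here is a simplified plain-Scala model. It is not Spark code: `CombineByKeyModel` is a hypothetical name, partitions are plain `Seq`s, and the "shuffle" is just a count of the records that would leave the map side. Under those assumptions, it shows that with map-side combine the three functions are split across the two sides, while without it only `createCombiner` and `mergeValue` run, after the shuffle.

```scala
// Plain-Scala model of combineByKey with and without map-side combine.
object CombineByKeyModel {
  // Fold a stream of (key, value) records into key -> combiner.
  private def combineValues[K, V, C](
      records: Seq[(K, V)],
      createCombiner: V => C,
      mergeValue: (C, V) => C): Map[K, C] =
    records.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(c => mergeValue(c, v)).getOrElse(createCombiner(v)))
    }

  /** Returns (result, number of records crossing the simulated shuffle). */
  def combineByKey[K, V, C](
      partitions: Seq[Seq[(K, V)]],
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      mapSideCombine: Boolean): (Map[K, C], Int) =
    if (mapSideCombine) {
      // Map side: one combiner per key per partition is shuffled.
      val shuffled: Seq[(K, C)] =
        partitions.flatMap(p => combineValues(p, createCombiner, mergeValue).toSeq)
      // Reduce side: combiners of the same key are merged.
      val result = shuffled.foldLeft(Map.empty[K, C]) { case (acc, (k, c)) =>
        acc.updated(k, acc.get(k).map(prev => mergeCombiners(prev, c)).getOrElse(c))
      }
      (result, shuffled.size)
    } else {
      // Every raw record is shuffled; combining happens only on the reduce side.
      val shuffled = partitions.flatten
      (combineValues(shuffled, createCombiner, mergeValue), shuffled.size)
    }
}
```

For the twelve-element example above, split as (1,2,3,1), (3,2,5,2), (3,4,5,2), a groupByKey-style call without map-side combine moves all 12 records, while a sum with map-side combine moves 10 partial sums.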

reduceByKey

    val num = sc.parallelize(List(1, 2, 3, 1, 3, 2, 5, 2, 3, 4, 5, 2), 3)
    val tp = num.map((_, 1))
    val res = tp.reduceByKey(_ + _)

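    // foldByKey is like reduceByKey but takes an initial value, applied once per key in each partition
    val folded = tp.foldByKey(100)(_ + _)

    // aggregateByKey takes an initial value, a within-partition function and a cross-partition function;
    // the initial value is again applied once per key in each partition
    val aggregated = tp.aggregateByKey(100)(_ + _, _ + _)

    // reduceByKey implemented with groupByKey: group first, then sum the values; shuffles more data than reduceByKey
    val grouped = tp.groupByKey().mapValues(_.sum)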

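    // reduceByKey implemented with combineByKey
    // createCombiner: the first value seen for a key has nothing to merge with, so return it unchanged
    val f1 = (x: Int) => x
    // mergeValue: aggregation within a partition
    val f2 = (x: Int, y: Int) => x + y
    // mergeCombiners: aggregation across partitions
    val f3 = (a: Int, b: Int) => a + b
    val combined = tp.combineByKey(f1, f2, f3)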

    // reduceByKey implemented with ShuffledRDD
    val shuffleRdd: ShuffledRDD[Int, Int, Int] = new ShuffledRDD[Int, Int, Int](tp, new HashPartitioner(tp.partitions.length))
    shuffleRdd.setMapSideCombine(true) // combine on the map side, before the shuffle write
    val initSum: Int => Int = (x: Int) => x
    val addValue = (x: Int, y: Int) => x + y
    val addSums = (x: Int, y: Int) => x + y
    val viaShuffledRdd = shuffleRdd.setAggregator(new Aggregator[Int, Int, Int](initSum, addValue, addSums))
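
How the initial value of foldByKey enters the result is easy to get wrong, so here is a small plain-Scala model of it. It is a sketch on `Seq`s, not Spark code, and `FoldByKeyModel` is a hypothetical name; it assumes the initial value is folded in when a key is first seen in a partition, and that partial results are then merged with the same function and no initial value.

```scala
// Plain-Scala model of foldByKey: zero is used once per key in each partition.
object FoldByKeyModel {
  // Within one partition: the first value of a key is combined with zero,
  // later values of that key are merged into the running result.
  def foldPartition[K, V](part: Seq[(K, V)], zero: V)(func: (V, V) => V): Map[K, V] =
    part.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      acc.updated(k, func(acc.getOrElse(k, zero), v))
    }

  // Across partitions: partial results per key are merged with func, without zero.
  def foldByKey[K, V](partitions: Seq[Seq[(K, V)]], zero: V)(func: (V, V) => V): Map[K, V] =
    partitions
      .map(p => foldPartition(p, zero)(func))
      .flatMap(_.toSeq)
      .groupBy(_._1)
      .map { case (k, kvs) => (k, kvs.map(_._2).reduce(func)) }
}
```

With the example data split as (1,2,3,1), (3,2,5,2), (3,4,5,2) and every key paired with 1, an initial value of 100 gives 304 for key 2 (it occurs in three partitions, so 4 + 3 * 100) and 102 for key 1 (one partition only).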

cogroup: similar to groupByKey, but it groups several RDDs at once. For each key the value is a tuple holding one Iterable per input RDD.

    val rdd1 = sc.parallelize(List(("tom", 1), ("tom", 2), ("jerry", 3), ("kitty", 2)))
    val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))
    val rdd3 = rdd1.cogroup(rdd2)
    // ArrayBuffer((tom,(CompactBuffer(1, 2),CompactBuffer(1))), (kitty,(CompactBuffer(2),CompactBuffer())), (jerry,(CompactBuffer(3),CompactBuffer(2))), (shuke,(CompactBuffer(),CompactBuffer(2))))
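
The shape of a cogroup result can be reproduced on plain collections. The following is a sketch, not Spark's implementation: `CogroupModel` is a hypothetical name, and `Seq` stands in for both the RDDs and the per-key buffers.

```scala
// Plain-Scala model of cogroup on two keyed sequences.
object CogroupModel {
  // For every key present on either side, collect the values of each side separately.
  def cogroup[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      val vs = left.filter(_._1 == k).map(_._2)
      val ws = right.filter(_._1 == k).map(_._2)
      (k, (vs, ws))
    }.toMap
  }
}
```

A key that exists on only one side still appears in the result, with an empty sequence for the other side; the outer joins rely on exactly this.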

join: internally calls cogroup

    val li1 = sc.parallelize(List(("spark", 1), ("hadoop", 1), ("spark", 2), ("hive", 2), ("flink", 2)), 2)
    val li2 = sc.parallelize(List(("spark", 3), ("hive", 3), ("hadoop", 4)), 2)
    val li3 = li1.join(li2)

    // join-like behavior implemented with cogroup
    val joined = li1.cogroup(li2).flatMapValues(t => {
      for (x <- t._1; y <- t._2) yield (x, y) // yield collects every (x, y) pair produced by the for comprehension
    })

    // leftOuterJoin-like behavior implemented with cogroup: left values are kept even without a match on the right
    val leftJoined = li1.cogroup(li2).flatMapValues { pair =>
      if (pair._2.isEmpty) {
        pair._1.iterator.map(v => (v, None))
      } else {
        for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, Some(w))
      }
    }

    // fullOuterJoin-like behavior implemented with cogroup: unmatched values from either side are kept
    val fullJoined = li1.cogroup(li2).flatMapValues {
      case (vs, Seq()) => vs.iterator.map(v => (Some(v), None))
      case (Seq(), ws) => ws.iterator.map(w => (None, Some(w)))
      case (vs, ws) => for (v <- vs.iterator; w <- ws.iterator) yield (Some(v), Some(w))
    }
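
The three join variants differ only in what they do with an empty side of the cogrouped value. The plain-Scala sketch below makes that explicit. It is not Spark code: `JoinModel` and `Cogrouped` are hypothetical names, and the input is assumed to have the shape of a cogroup result, so every key has at least one non-empty side.

```scala
// Plain-Scala sketch: inner, left outer and full outer join derived from a cogrouped map.
object JoinModel {
  type Cogrouped[K, V, W] = Map[K, (Seq[V], Seq[W])]

  // Inner join: cross product per key; an empty side yields nothing.
  def join[K, V, W](cg: Cogrouped[K, V, W]): Seq[(K, (V, W))] =
    cg.toSeq.flatMap { case (k, (vs, ws)) => for (v <- vs; w <- ws) yield (k, (v, w)) }

  // Left outer join: an empty right side is replaced by a single None.
  def leftOuterJoin[K, V, W](cg: Cogrouped[K, V, W]): Seq[(K, (V, Option[W]))] =
    cg.toSeq.flatMap { case (k, (vs, ws)) =>
      val rights: Seq[Option[W]] = if (ws.isEmpty) Seq(None) else ws.map(Some(_))
      for (v <- vs; w <- rights) yield (k, (v, w))
    }

  // Full outer join: either empty side is replaced by a single None.
  def fullOuterJoin[K, V, W](cg: Cogrouped[K, V, W]): Seq[(K, (Option[V], Option[W]))] =
    cg.toSeq.flatMap { case (k, (vs, ws)) =>
      val lefts: Seq[Option[V]] = if (vs.isEmpty) Seq(None) else vs.map(Some(_))
      val rights: Seq[Option[W]] = if (ws.isEmpty) Seq(None) else ws.map(Some(_))
      for (v <- lefts; w <- rights) yield (k, (v, w))
    }
}
```

A right outer join would mirror `leftOuterJoin`, replacing an empty left side by a single None instead.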

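sortBy/sortByKey: both use a RangePartitioner. Building this partitioner requires sampling the data, which runs a job once: RangePartitioner has a sketch method that calls collect. The sort operators themselves are still lazy transformations.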
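
Why a range partitioner has to look at the data before it can exist can be seen in a small plain-Scala model. This is a deliberately simplified sketch, not Spark's algorithm (Spark samples each partition and weights the candidates); `RangePartitionModel` and its method names are hypothetical, and keys are assumed to be `Int`.

```scala
// Simplified plain-Scala model of range partitioning from a sample of the keys.
object RangePartitionModel {
  // Pick (numPartitions - 1) upper bounds from the sorted sample.
  def rangeBounds(sample: Seq[Int], numPartitions: Int): Vector[Int] = {
    require(sample.nonEmpty && numPartitions > 0)
    val sorted = sample.sorted.toVector
    (1 until numPartitions)
      .map(i => sorted(math.max(i * sorted.length / numPartitions - 1, 0)))
      .toVector
  }

  // A key goes to the first partition whose upper bound is >= key;
  // keys above every bound go to the last partition.
  def getPartition(key: Int, bounds: Vector[Int]): Int = {
    val idx = bounds.indexWhere(key <= _)
    if (idx >= 0) idx else bounds.length
  }
}
```

The bounds depend on the sampled keys, so they cannot be computed from the lineage alone. Once they are known, partition i holds only keys smaller than those of partition i + 1, and sorting each partition locally yields a globally sorted result.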