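Transformation operators (lazy: they do not trigger an Action)
Operators that do not shuffle
map: implemented on top of MapPartitionsRDD. TaskContext gives access to the running task's context, for example the partition ID.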
val list = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7), 2) // two partitions
list.map(x => {
  (TaskContext.getPartitionId(), x * 10) // (partition ID, element * 10)
}) // collected result: ArrayBuffer((0,10), (0,20), (0,30), (1,40), (1,50), (1,60), (1,70))
// Inside RDD.map: the arguments are the parent RDD and a function of (TaskContext, partition index, iterator)
new MapPartitionsRDD[U, T](this, (_, _, iter) => iter.map(cleanF))
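mapPartitions: runs the function once per partition rather than once per element. It takes two parameters: 1) a function from the partition's iterator to a new iterator, and 2) preservesPartitioning: Boolean, which says whether to keep the pair RDD's original partitioner (default false). It is also implemented on top of MapPartitionsRDD.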
val list = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7), 2)
list.mapPartitions(it => {
  val index = TaskContext.getPartitionId() // evaluated once per partition
  val iterator = it.map(x => { // calls the iterator's own map; other iterator methods work too
    (index, x * 10)
  })
  iterator
}) // collected result: ArrayBuffer((0,10), (0,20), (0,30), (1,40), (1,50), (1,60), (1,70))
mapPartitionsWithIndex: almost the same as mapPartitions, but the partition index is passed to the function directly.
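As a sketch of that difference, the mapPartitions example above can be rewritten with mapPartitionsWithIndex so that TaskContext is no longer needed. The object name, the method name tagWithPartition, the app name and the local[2] master are illustrative assumptions, and Spark core is assumed to be on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapPartitionsWithIndexSketch {
  // Tags every element with the index of the partition it lives in.
  def tagWithPartition(sc: SparkContext): Array[(Int, Int)] = {
    val list = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7), 2)
    list.mapPartitionsWithIndex { (index, it) =>
      // index is supplied by Spark; the function runs once per partition
      it.map(x => (index, x * 10))
    }.collect()
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a quick experiment
    val conf = new SparkConf().setMaster("local[2]").setAppName("mapPartitionsWithIndex-sketch")
    val sc = new SparkContext(conf)
    try {
      // Same pairs as in the mapPartitions example:
      // (0,10), (0,20), (0,30), (1,40), (1,50), (1,60), (1,70)
      println(tagWithPartition(sc).mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```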
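flatMap: equivalent to map followed by flatten; it is also implemented on top of MapPartitionsRDD. flatMapValues does the same on the values of a pair RDD and keeps the keys.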
val arr: Array[String] = Array("spark hadoop flink spark", "spark hadoop flinl", "spark hadoop hadoop")
val lines: RDD[String] = sc.parallelize(arr)
val flat: RDD[String] = lines.flatMap(x => x.split(" "))
val r = flat.collect() // returns Array(spark, hadoop, flink, spark, spark, hadoop, flinl, spark, hadoop, hadoop)
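The listing above only covers flatMap. A minimal sketch of flatMapValues on a pair RDD could look like this; the keys doc1 and doc2, the object and method names and the local[2] master are made up for illustration, and Spark core is assumed to be on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FlatMapValuesSketch {
  // Splits each value into words; every word keeps the key of its line.
  def splitValues(sc: SparkContext): Array[(String, String)] = {
    val lines = sc.parallelize(List(("doc1", "spark hadoop"), ("doc2", "flink")))
    lines.flatMapValues(line => line.split(" ").toSeq).collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("flatMapValues-sketch")
    val sc = new SparkContext(conf)
    try {
      // (doc1,spark), (doc1,hadoop), (doc2,flink)
      println(splitValues(sc).mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```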
Operators that shuffle
groupByKey: calls combineByKeyWithClassTag under the hood, which creates a ShuffledRDD instance.
val num = sc.parallelize(List(1, 2, 3, 1, 3, 2, 5, 2, 3, 4, 5, 2), 3)
num.map((_, 1)).groupByKey() // collected result: ArrayBuffer((3,CompactBuffer(1, 1, 1)), (4,CompactBuffer(1)), (1,CompactBuffer(1, 1)), (5,CompactBuffer(1, 1)), (2,CompactBuffer(1, 1, 1, 1)))
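// The same effect built directly on ShuffledRDD
import org.apache.spark.{Aggregator, HashPartitioner}
import org.apache.spark.rdd.ShuffledRDD
import scala.collection.mutable.ArrayBuffer
val tp = num.map((_, 1)) // the pair RDD that gets shuffled
// Type parameters: key type, value type, combined (aggregated) type
val shuffleRdd: ShuffledRDD[Int, Int, ArrayBuffer[Int]] = new ShuffledRDD[Int, Int, ArrayBuffer[Int]](tp, new HashPartitioner(tp.partitions.length))
// createCombiner: put the first value of a key into a new ArrayBuffer (Spark itself uses CompactBuffer); later values of that key go through mergeValue
val createCombiner = (x: Int) => ArrayBuffer(x)
// mergeValue: local aggregation within a partition
val mergeValue = (ab: ArrayBuffer[Int], e: Int) => ab += e
// mergeCombiners: global aggregation across partitions
val mergeCombiners = (ab1: ArrayBuffer[Int], ab2: ArrayBuffer[Int]) => ab1 ++= ab2
// Pass the three functions in as an Aggregator
shuffleRdd.setAggregator(new Aggregator[Int, Int, ArrayBuffer[Int]](createCombiner, mergeValue, mergeCombiners))
// Disable map-side combine: values are not merged before the shuffle write
shuffleRdd.setMapSideCombine(false)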
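What setMapSideCombine changes can be sketched without Spark. The plain-Scala model below counts how many records would leave the map side with and without map-side combine. It is a simplified model, not Spark's implementation: the object and method names are hypothetical, and it assumes that the 12 elements of num are split into 3 consecutive partitions of 4.

```scala
object ShuffleVolumeModel {
  type Partition = Seq[(Int, Int)]

  // Map-side combine off (groupByKey): every record crosses the shuffle.
  def recordsWithoutMapSideCombine(partitions: Seq[Partition]): Int =
    partitions.map(_.size).sum

  // Map-side combine on (reduceByKey): values of the same key are merged inside
  // each partition first, so one record per distinct key leaves each partition.
  def recordsWithMapSideCombine(partitions: Seq[Partition]): Int =
    partitions.map(_.map(_._1).distinct.size).sum

  def main(args: Array[String]): Unit = {
    val tp = List(1, 2, 3, 1, 3, 2, 5, 2, 3, 4, 5, 2).map((_, 1))
    val partitions = tp.grouped(4).toList // assumed partition layout
    println(recordsWithoutMapSideCombine(partitions)) // 12
    println(recordsWithMapSideCombine(partitions))    // 10
  }
}
```

On such a small data set the saving is modest; it grows with the number of repeated keys per partition.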
reduceByKey
val num = sc.parallelize(List(1, 2, 3, 1, 3, 2, 5, 2, 3, 4, 5, 2), 3)
val tp = num.map((_, 1))
val res = tp.reduceByKey(_ + _)
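// foldByKey: like reduceByKey, but with an initial value; the initial value is applied once per key in each partition
val folded = tp.foldByKey(100)(_ + _)
// aggregateByKey: takes an initial value, a local (per-partition) aggregation function and a global aggregation function; the initial value is applied once per key in each partition
val aggregated = tp.aggregateByKey(100)(_ + _, _ + _)
// reduceByKey via groupByKey: group first, then sum the values; this shuffles more data than reduceByKey because nothing is combined on the map side
val grouped = tp.groupByKey().mapValues(_.sum)
// reduceByKey via combineByKey
// createCombiner: the first value of a key has nothing to merge with yet, so it is returned unchanged
val f1 = (x: Int) => x
// mergeValue: local aggregation within a partition
val f2 = (x: Int, y: Int) => x + y
// mergeCombiners: global aggregation across partitions
val f3 = (a: Int, b: Int) => a + b
val combined = tp.combineByKey(f1, f2, f3)
// reduceByKey via ShuffledRDD, reusing f1, f2 and f3 from above
val reduceShuffleRdd: ShuffledRDD[Int, Int, Int] = new ShuffledRDD[Int, Int, Int](tp, new HashPartitioner(tp.partitions.length))
reduceShuffleRdd.setMapSideCombine(true)
val shuffled = reduceShuffleRdd.setAggregator(new Aggregator[Int, Int, Int](f1, f2, f3))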
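The remark about the initial value of foldByKey and aggregateByKey is easier to see with numbers. The following plain-Scala model, not Spark itself, mimics foldByKey(100)(_ + _) on the tp data. The object name and the helper foldByKeyModel are hypothetical, and the model assumes that the 12 elements are split into 3 consecutive partitions of 4.

```scala
object FoldByKeyModel {
  // Hypothetical helper: models foldByKey(zero)(f) over data already split into partitions.
  def foldByKeyModel(partitions: Seq[Seq[(Int, Int)]], zero: Int)(f: (Int, Int) => Int): Map[Int, Int] = {
    // Map side: inside a partition, the first value seen for a key is merged with zero.
    val partials: Seq[Map[Int, Int]] = partitions.map { part =>
      part.foldLeft(Map.empty[Int, Int]) { case (acc, (k, v)) =>
        acc.updated(k, f(acc.getOrElse(k, zero), v))
      }
    }
    // Reduce side: partial results of a key are merged with f alone; zero is not used again.
    partials.flatten.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).reduce(f)) }
  }

  def main(args: Array[String]): Unit = {
    val tp = List(1, 2, 3, 1, 3, 2, 5, 2, 3, 4, 5, 2).map((_, 1))
    val partitions = tp.grouped(4).toList // assumed partition layout
    val result = foldByKeyModel(partitions, 100)(_ + _)
    println(result.toList.sortBy(_._1).mkString(", "))
    // (1,102), (2,304), (3,303), (4,101), (5,202)
  }
}
```

Key 2 occurs in all three partitions, so 100 is added three times on top of its four occurrences (304). Key 1 occurs only in the first partition, so 100 is added once (102).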
cogroup: similar to groupByKey, but it groups several RDDs at once. For each key the value is a tuple holding one Iterable per input RDD.
val rdd1 = sc.parallelize(List(("tom", 1), ("tom", 2), ("jerry", 3), ("kitty", 2)))
val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))
val rdd3 = rdd1.cogroup(rdd2)
//ArrayBuffer((tom,(CompactBuffer(1, 2),CompactBuffer(1))), (kitty,(CompactBuffer(2),CompactBuffer())), (jerry,(CompactBuffer(3),CompactBuffer(2))), (shuke,(CompactBuffer(),CompactBuffer(2))))
join: implemented on top of cogroup
val li1 = sc.parallelize(List(("spark", 1), ("hadoop", 1), ("spark", 2), ("hive", 2), ("flink", 2)), 2)
val li2 = sc.parallelize(List(("spark", 3), ("hive", 3), ("hadoop", 4)), 2)
val li3 = li1.join(li2)
// A join-like operation written with cogroup
val joined = li1.cogroup(li2).flatMapValues(t => {
  for (x <- t._1; y <- t._2) yield (x, y) // yield collects one (x, y) pair per combination
})
// A leftOuterJoin-like operation written with cogroup: left values with no match on the right are paired with None
val leftJoined = li1.cogroup(li2).flatMapValues { pair =>
  if (pair._2.isEmpty) {
    pair._1.iterator.map(v => (v, None))
  } else {
    for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, Some(w))
  }
}
// A fullOuterJoin-like operation written with cogroup: both sides are wrapped in Option
val fullJoined = li1.cogroup(li2).flatMapValues[(Option[Int], Option[Int])] {
  case (vs, Seq()) => vs.iterator.map(v => (Some(v), None))
  case (Seq(), ws) => ws.iterator.map(w => (None, Some(w)))
  case (vs, ws) => for (v <- vs.iterator; w <- ws.iterator) yield (Some(v), Some(w))
}
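The relationship between cogroup and the join variants can also be tried without a cluster. The sketch below is a plain-Scala model over Seq instead of RDD; the object and all method names are hypothetical stand-ins that mirror the three cogroup-based snippets above.

```scala
object CogroupJoinModel {
  // Stand-in for RDD.cogroup: for every key, one buffer of values per input.
  def cogroup[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      (k, (left.filter(_._1 == k).map(_._2), right.filter(_._1 == k).map(_._2)))
    }.toMap
  }

  // Inner join: keys missing on either side produce nothing.
  def innerJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    cogroup(left, right).toSeq.flatMap { case (k, (vs, ws)) =>
      for (v <- vs; w <- ws) yield (k, (v, w))
    }

  // Left outer join: left values without a match are paired with None.
  def leftOuterJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] =
    cogroup(left, right).toSeq.flatMap { case (k, (vs, ws)) =>
      val pairs: Seq[(V, Option[W])] =
        if (ws.isEmpty) vs.map(v => (v, None))
        else for (v <- vs; w <- ws) yield (v, Some(w))
      pairs.map(p => (k, p))
    }

  // Full outer join: unmatched values from either side are kept.
  def fullOuterJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (Option[V], Option[W]))] =
    cogroup(left, right).toSeq.flatMap { case (k, (vs, ws)) =>
      val pairs: Seq[(Option[V], Option[W])] =
        if (ws.isEmpty) vs.map(v => (Some(v), None))
        else if (vs.isEmpty) ws.map(w => (None, Some(w)))
        else for (v <- vs; w <- ws) yield (Some(v), Some(w))
      pairs.map(p => (k, p))
    }

  def main(args: Array[String]): Unit = {
    val li1 = Seq(("spark", 1), ("hadoop", 1), ("spark", 2), ("hive", 2), ("flink", 2))
    val li2 = Seq(("spark", 3), ("hive", 3), ("hadoop", 4))
    println(innerJoin(li1, li2).sortBy(_._1).mkString(", "))
    println(leftOuterJoin(li1, li2).sortBy(_._1).mkString(", "))
    println(fullOuterJoin(li1, li2).sortBy(_._1).mkString(", "))
  }
}
```

With this data, flink appears only in li1, so it is dropped by the inner join and kept with None by the two outer joins.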
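sortBy/sortByKey: both use RangePartitioner. Building this partitioner requires sampling the data, which triggers a job: the sketch method in RangePartitioner calls collect. The sort itself is still lazy and only runs when an Action is called.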
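One way to observe the sampling job is to count, with an accumulator, how many elements have been computed before any Action is called on the sorted RDD. This is only a sketch: the object and method names, the accumulator name and the local[2] master are assumptions, Spark core is assumed to be on the classpath, and the exact count may differ between Spark versions, so the code only checks that it is greater than zero.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SortByKeySamplingSketch {
  // Returns how many elements were computed before any action was called on the sorted RDD.
  def elementsTouchedBeforeAction(sc: SparkContext): Long = {
    val touched = sc.longAccumulator("touched")
    val pairs = sc.parallelize(1 to 1000, 4).map { x =>
      touched.add(1L) // counts every element that a task actually computes
      (x, x)
    }
    val sorted = pairs.sortByKey() // no action is called on `sorted` here
    touched.value                  // already non-zero if the sampling job ran
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("sortByKey-sampling-sketch")
    val sc = new SparkContext(conf)
    try {
      val touched = elementsTouchedBeforeAction(sc)
      println(s"elements computed before any action: $touched")
      assert(touched > 0, "expected the RangePartitioner sampling job to have run")
    } finally {
      sc.stop()
    }
  }
}
```

The input uses 4 partitions on purpose, so that the range bounds have to be computed from a sample of the data.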