Spark operators fall into two categories:
- transformation: all transformations on an RDD are lazily evaluated. Converting one RDD into another does not execute immediately; Spark only records the logical operation on the data. The computation actually runs when an action requires a result to be returned to the Driver.
- action: the recorded operations are executed only when an action is triggered.
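Spark's own laziness needs a SparkContext to demonstrate, but the deferred-evaluation idea can be sketched with a plain Scala iterator (an analogy only, not Spark API):

```scala
// A plain-Scala analogy for lazy evaluation (not Spark code): building the
// mapped iterator only records the operation; nothing runs until it is consumed.
var executed = 0
val source = Iterator(1, 2, 3)
val transformed = source.map { x => executed += 1; x * 2 } // "transformation": deferred

val before = executed           // 0 -- no element has been mapped yet
val result = transformed.toList // "action": forces evaluation
// result == List(2, 4, 6), executed == 3
```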
Commonly used transformation operators:
1. textFile(path: String): RDD[String]: reads text data from HDFS and returns an RDD of String elements, one element per line of text (strictly speaking, a method on SparkContext that creates an RDD):

```scala
val lineRdd: RDD[String] = sc.textFile(file_path, numPartitions)
```
2. mapPartitions[U: ClassTag](f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]: similar to map, but f is applied once per partition to an iterator over that partition's elements, so a resource can be set up once and shared by every element in the partition. When per-element processing needs an expensive external object (e.g. a connection), mapPartitions is more efficient than map.
mapPartitionsWithIndex[U: ClassTag](f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]: f takes two arguments, the partition index and an iterator over that partition's data, so the partition index can be attached during the transformation; preservesPartitioning indicates whether the parent RDD's partitioner is preserved.
```scala
val rdd = sc.parallelize(List(1, 2, 3, 4, 5, 6), 3)

// mapPartitions: set up a per-partition resource once, then map each element
rdd.mapPartitions { iterator =>
  val value = 10 // placeholder for a per-partition resource, e.g. a connection
  iterator.map(_ * value)
}

// mapPartitionsWithIndex
val partitionIndex = (index: Int, iter: Iterator[Int]) => {
  iter.toList.map(item => "index:" + index + ": value: " + item).iterator
}
rdd.mapPartitionsWithIndex(partitionIndex, true).foreach(println(_))
/**
index:0: value: 1
index:0: value: 2
index:1: value: 3
index:1: value: 4
index:2: value: 5
index:2: value: 6
*/
```
3. filterByRange(lower: K, upper: K): RDD[P]: filters the elements of a (key, value) RDD by key range, inclusive of both the lower and upper bounds:

```scala
val rdd = sc.parallelize(List((2, 21), (9, 2), (5, 3), (6, 3), (3, 21), (10, 21)), 2)
rdd.filterByRange(3, 9)
```
4. flatMapValues[U](f: V => TraversableOnce[U]): RDD[(K, U)]: applies f to each tuple's value to produce a collection, then pairs each produced element with the original key:

```scala
val rdd = sc.parallelize(List((2, "a b c"), (5, "q w e"), (2, "x y z"), (6, "t y")), 2)
rdd.flatMapValues(_.split(" ")).collect()
/**
Array((2,a), (2,b), (2,c), (5,q), (5,w), (5,e), (2,x), (2,y), (2,z), (6,t), (6,y))
*/
```
5. combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, ...): RDD[(K, C)]: a low-level operator on which groupByKey and reduceByKey are built. Before the shuffle, each partition locally aggregates its data by key, producing one intermediate value C per key per partition; after the shuffle, each key's per-partition Cs are merged. Three functions drive this: within a partition, when a key appears for the first time, createCombiner converts its value V into a C; when the key appears again in the same partition, mergeValue folds the value V into the existing C. After all partitions finish, if a key occurs in two or more partitions, mergeCombiners merges that key's per-partition Cs into one.
reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]: aggregates the values of each key with func, first within each partition and then across partitions; the result has the same type as the values;
foldByKey(zeroValue: V, ...)(func: (V, V) => V): RDD[(K, V)]: implemented on top of combineByKey. Aggregation happens per partition first: for the first occurrence of a key in a partition, the createCombiner step folds zeroValue into the value (V => func(zeroValue, V)); the remaining values for that key in the partition are merged in via func (the mergeValue step); finally, the per-partition results for each key are merged, again via func (the mergeCombiners step).

```scala
/**
 * First occurrence of a key in a partition: convert the first value V into a C
 */
def createCombiner(value: Int): List[Int] = {
  println("create value:" + value)
  List(value)
}

/**
 * Key seen again in the same partition: fold V into the existing C
 */
def mergeValue(list: List[Int], value: Int): List[Int] = {
  println("merge value:" + value)
  list :+ value
}

/**
 * Key present in 2+ partitions: merge the per-partition results C
 */
def mergeCombiners(a: List[Int], b: List[Int]): List[Int] = {
  println("a:" + a.toBuffer + "\tb:" + b.toBuffer)
  a ++ b
}

// rdd data
val rdd = sc.parallelize(List(("a", 21), ("b1", 2), ("a", 22), ("b2", 1),
  ("a", 23), ("c1", 1), ("a", 24), ("c2", 1)), 2)

// combineByKey
val rdd2 = rdd.combineByKey(createCombiner, mergeValue, mergeCombiners)
println("combineByKey result:" + rdd2.collect().toBuffer)

// reduceByKey
val rdd3 = rdd.reduceByKey((pre: Int, after: Int) => pre + after)
println("reduceByKey result:" + rdd3.collect().toBuffer)

// foldByKey
val rdd4 = rdd.foldByKey(100)(_ + _)
println("foldByKey result:" + rdd4.collect().toBuffer)

/**
create value:21
create value:1
merge value:22
create value:1
create value:23
create value:1
merge value:24
create value:1
a:ArrayBuffer(121, 22)  b:ArrayBuffer(123, 24)
combineByKey result:ArrayBuffer((b2,List(101)), (c1,List(101)), (a,List(121, 22, 123, 24)), (c2,List(101)), (b1,List(101)))
reduceByKey result:ArrayBuffer((b2,1), (c1,1), (a,90), (c2,1), (b1,1))
foldByKey result:ArrayBuffer((b2,101), (c1,101), (a,290), (c2,101), (b1,101))
*/
```
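The three-function flow can also be seen without a cluster by simulating it with plain Scala collections. This is a sketch, not Spark API: `localCombine` and the two hypothetical partitions below are illustrative assumptions matching the 2-partition example above.

```scala
// Simulate combineByKey over two hypothetical partitions (not Spark API).
def localCombine(part: Seq[(String, Int)]): Map[String, List[Int]] =
  part.foldLeft(Map.empty[String, List[Int]]) { case (acc, (k, v)) =>
    acc.get(k) match {
      case None    => acc + (k -> List(v))  // createCombiner: first V in this partition -> C
      case Some(c) => acc + (k -> (c :+ v)) // mergeValue: fold further Vs into C
    }
  }

val part0 = Seq(("a", 21), ("b1", 2), ("a", 22), ("b2", 1))
val part1 = Seq(("a", 23), ("c1", 1), ("a", 24), ("c2", 1))

// mergeCombiners: after the "shuffle", merge each key's per-partition Cs
val merged: Map[String, List[Int]] =
  (localCombine(part0).toSeq ++ localCombine(part1).toSeq)
    .groupBy(_._1)
    .map { case (k, cs) => k -> cs.flatMap(_._2).toList }
// merged("a") == List(21, 22, 23, 24); singleton keys keep a one-element C
```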
6. aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]: groups values with the same key; the flow is similar to aggregate. Within each partition, seqOp first folds the key's first value into zeroValue and then folds in the remaining values for that key; combOp then merges the per-partition results of each key, and zeroValue does not participate at that stage.
```scala
/**
 * Within a partition, fold each of a key's values into zeroValue in turn
 */
def seqOp(zeroValue: ArrayBuffer[Int], value: Int): ArrayBuffer[Int] = {
  println("zeroValue:" + zeroValue + "\tvalue:" + value)
  zeroValue += value
}

/**
 * Merge the per-partition results of each key
 */
def combOp(a: ArrayBuffer[Int], b: ArrayBuffer[Int]): ArrayBuffer[Int] = {
  println("a:" + a + "\tb:" + b)
  a ++ b
}

val rdd = sc.parallelize(List(("a", 21), ("b1", 2), ("a", 22), ("b2", 1),
  ("a", 23), ("c1", 1), ("a", 24), ("c2", 1)), 2)
val rdd2 = rdd.aggregateByKey(ArrayBuffer[Int](88))(seqOp, combOp)
println(rdd2.collect().toBuffer)
/**
zeroValue:ArrayBuffer(88)  value:21
zeroValue:ArrayBuffer(88)  value:2
zeroValue:ArrayBuffer(88, 21)  value:22
zeroValue:ArrayBuffer(88)  value:1
zeroValue:ArrayBuffer(88)  value:23
zeroValue:ArrayBuffer(88)  value:1
zeroValue:ArrayBuffer(88, 23)  value:24
zeroValue:ArrayBuffer(88)  value:1
a:ArrayBuffer(88, 21, 22)  b:ArrayBuffer(88, 23, 24)
ArrayBuffer((b2,ArrayBuffer(88, 1)), (c1,ArrayBuffer(88, 1)), (a,ArrayBuffer(88, 21, 22, 88, 23, 24)), (c2,ArrayBuffer(88, 1)), (b1,ArrayBuffer(88, 2)))
*/
```

Note: within each partition, zeroValue is folded in once per key before that key's values; when the per-partition results are then merged, zeroValue is not added again, which is why key "a" shows two 88s (one from each of the 2 partitions).
7. coalesce(numPartitions: Int, shuffle: Boolean = false): RDD[T]: repartitions the RDD; by default there is no shuffle, in which case the partition count can only be reduced;
repartition(numPartitions: Int): RDD[T]: repartitions with a shuffle (equivalent to coalesce(numPartitions, shuffle = true))

```scala
rdd.partitions.length // number of partitions
```
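The difference can be sketched with plain Scala collections by modelling an RDD's layout as a list of partitions. This is an analogy only; the adjacent-merge and round-robin strategies below are illustrative assumptions, not Spark's actual placement algorithm.

```scala
// Model an RDD's data layout as a Vector of partitions (not Spark API).
val partitions = Vector(Vector(1, 2), Vector(3, 4), Vector(5, 6), Vector(7, 8))

// coalesce(2) without shuffle: adjacent partitions are merged locally,
// so the partition count can shrink cheaply but cannot grow.
val coalesced = partitions.grouped(2).map(_.flatten).toVector
// Vector(Vector(1, 2, 3, 4), Vector(5, 6, 7, 8))

// repartition(2) (= coalesce with shuffle = true): every element may move,
// e.g. round-robin here, rebalancing data at the cost of a full shuffle.
val repartitioned = partitions.flatten.zipWithIndex
  .groupBy { case (_, i) => i % 2 }
  .toVector.sortBy(_._1)
  .map { case (_, pairs) => pairs.map(_._1) }
// Vector(Vector(1, 3, 5, 7), Vector(2, 4, 6, 8))
```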
8. keyBy[K](f: T => K): RDD[(K, T)]: computes a key for each RDD element with f; the original element becomes the value, yielding a new RDD of (key, element) tuples
keys: RDD[K]: returns a new RDD containing just the keys
values: RDD[V]: returns a new RDD containing just the values

```scala
val rdd = sc.parallelize(List("abc", "abcd", "ab", "bcd", "bc", "bcde"), 2)
// keyBy
val rdd2 = rdd.keyBy(_.size)
println(rdd2.collect().toBuffer)
// keys
val keys = rdd2.keys
println(keys.collect().toBuffer)
// values
val values = rdd2.values
println(values.collect().toBuffer)
/**
ArrayBuffer((3,abc), (4,abcd), (2,ab), (3,bcd), (2,bc), (4,bcde))
ArrayBuffer(3, 4, 2, 3, 2, 4)
ArrayBuffer(abc, abcd, ab, bcd, bc, bcde)
*/
```
To be continued.