Transformation Operators
A transformation produces a new RDD from an existing RDD. Transformations are lazy (deferred evaluation): the code of a transformation operator is not executed right away. Only when the program reaches an action operator does the chain of code actually run. This design lets Spark execute more efficiently.
The key-value (KV) operators covered here apply to RDDs whose elements are key-value pairs.
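As a quick illustration of this laziness, here is a minimal sketch (assuming a SparkContext named sc is already available, for example in spark-shell): the map call only records the lineage, and nothing is computed until the collect action runs.
// Transformation only: no job is launched yet, Spark just records the lineage
val words = sc.parallelize(List("spark", "rdd", "transformation"))
val lengths = words.map(w => (w, w.length))   // lazy, returns a new RDD immediately
// Action: this is the point where the job actually runs
lengths.collect().foreach(println)            // e.g. (spark,5) (rdd,3) (transformation,14)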
Official documentation:
RDD Programming Guide - Spark 3.5.1 Documentation
1. groupByKey
groupByKey groups the values of identical keys together: for each key, the resulting value is a collection of all values with that key.
val rdd = sc.parallelize(Array(("red",1), ("yellow",2), ("red", 3),("yellow", 4)))
rdd.groupByKey().collect()
Array[(String, Iterable[Int])] = Array((red,CompactBuffer(1, 3)), (yellow,CompactBuffer(2, 4)))
CompactBuffer: an append-only buffer similar to ArrayBuffer, but more memory-efficient for small buffers.
Extension:
Group and sort List(("a", 50), ("b", 70), ("c", 30), ("c", 90), ("b", 20), ("a", 80)): group by the string key and sort the integers within each group in descending order.
sc.makeRDD(List(("a", 50), ("b", 70), ("c", 30), ("c", 90),("b", 20), ("a", 80))).groupByKey().map(x => (x._1, x._2.toList.sorted.reverse)).foreach(println)
toList: converts the CompactBuffer to a List;
sorted: sorts the List in ascending order;
reverse: reverses the elements, producing descending order.
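An equivalent way to write the descending sort is shown below (a sketch on the same data; sortWith is a standard Scala collection method, not something Spark-specific):
// sortWith(_ > _) sorts each group's values in descending order directly,
// replacing the separate sorted + reverse steps
sc.makeRDD(List(("a", 50), ("b", 70), ("c", 30), ("c", 90), ("b", 20), ("a", 80)))
  .groupByKey()
  .map { case (k, vs) => (k, vs.toList.sortWith(_ > _)) }
  .foreach(println)
// Expected output (print order may vary): (a,List(80, 50)) (b,List(70, 20)) (c,List(90, 30))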
Source code analysis
/**
 * Group the values for each key in the RDD into a single sequence. Hash-partitions the
 * resulting RDD with the existing partitioner/parallelism level. The ordering of elements
 * within each group is not guaranteed, and may even differ each time the resulting RDD is
 * evaluated.
 *
 * @note This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
 * or `PairRDDFunctions.reduceByKey` will provide much better performance.
 */
def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
  groupByKey(defaultPartitioner(self))
}

/**
 * Group the values for each key in the RDD into a single sequence. Allows controlling the
 * partitioning of the resulting key-value pair RDD by passing a Partitioner.
 * The ordering of elements within each group is not guaranteed, and may even differ
 * each time the resulting RDD is evaluated.
 *
 * @note This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
 * or `PairRDDFunctions.reduceByKey` will provide much better performance.
 *
 * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
 * key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
 */
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
CompactBuffer analysis
/**
 * An append-only buffer similar to ArrayBuffer, but more memory-efficient for small buffers.
 * ArrayBuffer always allocates an Object array to store the data, with 16 entries by default,
 * so it has about 80-100 bytes of overhead. In contrast, CompactBuffer can keep up to two
 * elements in fields of the main object, and only allocates an Array[AnyRef] if there are more
 * entries than that. This makes it more efficient for operations like groupBy where we expect
 * some keys to have very few elements.
 */
private[spark] class CompactBuffer[T: ClassTag] extends Seq[T] with Serializable {
2. reduceByKey(func)
Called on an RDD of (K, V) pairs, it returns an RDD of (K, V) pairs in which the values of each key are aggregated together using the given function.
val rdd = sc.parallelize(Array(("red",1), ("yellow",2), ("red", 3),("yellow", 4)))
rdd.reduceByKey((x,y) => x + y).collect
Array[(String, Int)] = Array((red,4), (yellow,6))
What do x and y stand for?
Here x stands for the result accumulated so far, while y is the next value being visited for that key.
At the start, when the first value of a key is visited, there is no accumulated result yet, so that value is simply carried over as the result.
When the second value is visited, x is the previous result, which is combined with y, and so on.
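A small trace makes this concrete (a sketch over the "red" values from the example above; the actual evaluation order inside Spark can differ, which is why the function should be associative and commutative):
// For key "red" the values are 1 and 3:
//   step 1: only the value 1 has been seen, so the intermediate result is 1
//   step 2: x = 1 (previous result), y = 3 (next value) => x + y = 4
// So ("red", 1) and ("red", 3) collapse into ("red", 4).
val sum = Seq(1, 3).reduce((x, y) => x + y)   // 4, the same folding reduceByKey applies per key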
What if I want to combine data that is not in key-value form?
You can use reduce:
val rdd5 = sc.parallelize(List(1, 2, 3, 4, 5))
rdd5.reduce(_ + _)
res4: Int = 15
Source code analysis
/**
 * Merge the values for each key using an associative and commutative reduce function. This will
 * also perform the merging locally on each mapper before sending results to a reducer, similarly
 * to a "combiner" in MapReduce.
 */
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}

/**
 * Merge the values for each key using an associative and commutative reduce function. This will
 * also perform the merging locally on each mapper before sending results to a reducer, similarly
 * to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
 * parallelism level.
 */
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  reduceByKey(defaultPartitioner(self), func)
}
3. aggregateByKey(zeroValue)(seqOp, combOp)
Within each partition, the values are aggregated per key with seqOp, starting from the initial value zeroValue; the per-partition results are then merged across partitions with combOp.
In other words: within every partition, values with the same key are aggregated first; the per-partition results are then aggregated again.
val pairRDD = sc.parallelize(List(("cat", 2), ("cat", 5), ("dog", 12),("mouse", 4), ("cat", 12), ("mouse", 2)), 2)
pairRDD.aggregateByKey(1)(_+_, _ + _).collect
res1: Array[(String, Int)] = Array((dog,13), (cat,21), (mouse,7))
Let's walk through the process above:
1. First partition: ("cat", 2), ("cat", 5), ("dog", 12); after aggregation: ("cat", 8), ("dog", 13)
2. Second partition: ("mouse", 4), ("cat", 12), ("mouse", 2); after aggregation: ("mouse", 7), ("cat", 13)
3. Final result: Array[(String, Int)] = Array((dog,13), (cat,21), (mouse,7))
Extension
scala> val pairRDD = sc.parallelize(List(("cat", 2), ("cat", 5), ("dog", 12),("mouse", 4), ("cat", 12), ("mouse", 2)), 2)
scala> pairRDD.aggregateByKey(2)(_+_, _ * _).collect
Result: Array[(String, Int)] = Array((dog,14), (cat,126), (mouse,8))
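Because seqOp and combOp are independent, they do not have to be the same operation. A common illustration (a sketch on the same pairRDD) is taking the maximum value per key inside each partition, then summing the per-partition maxima across partitions:
// seqOp: keep the larger of the running max and the next value (within a partition)
// combOp: add the per-partition maxima together (across partitions)
pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect
// Partition 1: cat -> max(0,2,5) = 5, dog -> 12; Partition 2: mouse -> max(0,4,2) = 4, cat -> 12
// Result: e.g. Array((dog,12), (cat,17), (mouse,4))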
Source code analysis
/**
* Aggregate the values of each key, using given combine functions and a neutral "zero value".
* This function can return a different result type, U, than the type of the values in this RDD,
* V. Thus, we need one operation for merging a V into a U and one operation for merging two U's,
* as in scala.TraversableOnce. The former operation is used for merging values within a
* partition, and the latter is used for merging values between partitions. To avoid memory
* allocation, both of these functions are allowed to modify and return their first argument
* instead of creating a new U.
*/
def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
  // Serialize the zero value to a byte array so that we can get a new clone of it on each key
  val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
  val zeroArray = new Array[Byte](zeroBuffer.limit)
  zeroBuffer.get(zeroArray)
  lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
  val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))
  // We will clean the combiner closure later in `combineByKey`
  val cleanedSeqOp = self.context.clean(seqOp)
  combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
    cleanedSeqOp, combOp, partitioner)
}
4. foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)]
foldByKey is a simplified form of aggregateByKey. It also aggregates the values of each key, but takes only one function: the data inside each partition is aggregated per key with this function, and the per-partition results are then aggregated again with the same function.
zeroValue is the initial value: inside each partition, the values of a key are combined with zeroValue before being folded with func.
val pairRDD = sc.parallelize(List(("cat", 2), ("cat", 5), ("dog", 12),("mouse", 4), ("cat", 12), ("mouse", 2)), 2)
pairRDD.foldByKey(1)(_+_).collect
Analysis:
First partition: (cat,2) (cat,5) (dog,12); cat: 1+2+5=8; dog: 1+12=13
Second partition: (mouse,4) (cat,12) (mouse,2); mouse: 1+4+2=7; cat: 1+12=13
Merging the partitions: cat: 8+13=21; dog: 13; mouse: 7
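Since the zero value is only threaded into each partition's fold, foldByKey with a neutral zero value behaves like reduceByKey (a sketch on the same pairRDD; with a non-neutral zero such as 1 above, the zero is added once per key per partition, which is why cat ends up at 21 instead of 19):
// Zero value 0 is neutral for addition, so this matches reduceByKey(_ + _)
pairRDD.foldByKey(0)(_ + _).collect
// e.g. Array((dog,12), (cat,19), (mouse,6))
pairRDD.reduceByKey(_ + _).collect
// e.g. Array((dog,12), (cat,19), (mouse,6))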
Source code analysis
/**
* Merge the values for each key using an associative function and a neutral "zero value" which
* may be added to the result an arbitrary number of times, and must not change the result
* (e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication.).
*/
def foldByKey(
    zeroValue: V,
    partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  // Serialize the zero value to a byte array so that we can get a new clone of it on each key
  val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
  val zeroArray = new Array[Byte](zeroBuffer.limit)
  zeroBuffer.get(zeroArray)
  // When deserializing, use a lazy val to create just one instance of the serializer per task
  lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
  val createZero = () => cachedSerializer.deserialize[V](ByteBuffer.wrap(zeroArray))
  val cleanedFunc = self.context.clean(func)
  combineByKeyWithClassTag[V]((v: V) => cleanedFunc(createZero(), v),
    cleanedFunc, cleanedFunc, partitioner)
}
Comparing the underlying implementations of aggregateByKey and foldByKey below shows that foldByKey is a simplified version of aggregateByKey: aggregateByKey passes two different functions (seqOp for merging within a partition, combOp for merging across partitions), while foldByKey passes the same function for both.
// aggregateByKey
combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
  cleanedSeqOp, combOp, partitioner)
// foldByKey
combineByKeyWithClassTag[V]((v: V) => cleanedFunc(createZero(), v),
  cleanedFunc, cleanedFunc, partitioner)
5. sortByKey([ascending])
Called on an RDD of (K, V) pairs, it returns an RDD of (K, V) pairs sorted by key.
val rdd = sc.parallelize(Array((3,"aa"),(6,"cc"),(2,"bb"),(1,"dd")))
rdd.sortByKey(true).collect
res2: Array[(Int, String)] = Array((1,dd), (2,bb), (3,aa), (6,cc))
rdd.sortByKey(false).collect
res1: Array[(Int, String)] = Array((6,cc), (3,aa), (2,bb), (1,dd))
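As the source below shows, sortByKey also takes a numPartitions argument and range-partitions the data. A small sketch forcing a single output partition, so the whole result is one ordered range:
// Sort ascending into a single partition; with more partitions each partition
// holds a sorted, non-overlapping range of keys (RangePartitioner)
rdd.sortByKey(true, 1).collect
// Array((1,dd), (2,bb), (3,aa), (6,cc))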
Source code analysis
/**
 * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
 * `collect` or `save` on the resulting RDD will return or output an ordered list of records
 * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
 * order of the keys).
 */
// TODO: this currently doesn't work on P other than Tuple2!
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
    : RDD[(K, V)] = self.withScope {
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part)
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}
6. sortBy
Sorts the RDD by the result of the given function func.
val rdd = sc.parallelize(List(2, 1, 4, 3))
rdd.sortBy(x => x).collect()
res4: Array[Int] = Array(1, 2, 3, 4)
rdd.sortBy(x => x % 2).collect()
res5: Array[Int] = Array(2, 4, 1, 3)
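sortBy is also handy for pair RDDs, because the key function can reach into the tuple. A sketch sorting word-count style pairs by their value in descending order (the ascending parameter is part of the signature shown in the source below):
val counts = sc.parallelize(List(("red", 4), ("yellow", 6), ("blue", 1)))
// Sort by the second tuple element (the count), largest first
counts.sortBy(_._2, ascending = false).collect
// Array((yellow,6), (red,4), (blue,1))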
Source code analysis
/**
 * Return this RDD sorted by the given key function.
 */
def sortBy[K](
    f: (T) => K,
    ascending: Boolean = true,
    numPartitions: Int = this.partitions.length)
    (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
  this.keyBy[K](f)
    .sortByKey(ascending, numPartitions)
    .values
}
7. mapValues
Applies the given function to the value of each key-value element in the RDD; the original key stays unchanged and, together with the new value, forms an element of the new RDD. This operator therefore only applies to RDDs whose elements are key-value pairs: for an RDD of type (K, V), only V is transformed.
val rdd1 = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val rdd2 = rdd1.map(x => (x.length, x))
rdd2.mapValues("|" + _ + "|").collect
Array[(Int, String)] = Array((3,|dog|), (5,|tiger|), (4,|lion|), (3,|cat|), (7,|panther|), (5,|eagle|))
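The practical difference from a plain map is that mapValues cannot change the keys, so Spark can keep the existing partitioner (preservesPartitioning = true in the source below). A sketch showing that map drops the partitioner while mapValues keeps it:
import org.apache.spark.HashPartitioner
val partitioned = rdd2.partitionBy(new HashPartitioner(2))
partitioned.mapValues(_.toUpperCase).partitioner          // Some(HashPartitioner): partitioner preserved
partitioned.map { case (k, v) => (k, v.toUpperCase) }.partitioner   // None: Spark must assume keys changed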
Source code analysis
/**
 * Pass each value in the key-value pair RDD through a map function without changing the keys;
 * this also retains the original RDD's partitioning.
 */
def mapValues[U](f: V => U): RDD[(K, U)] = self.withScope {
  val cleanF = self.context.clean(f)
  new MapPartitionsRDD[(K, U), (K, V)](self,
    (context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) },
    preservesPartitioning = true)
}
8. join(otherDataset)
Called on RDDs of type (K, V) and (K, W), it returns an RDD of (K, (V, W)) in which all pairs of elements sharing the same key are combined.
val rdd1 = sc.parallelize(Array((1,"a"),(2,"b"),(3,"c"),(4, "f")))
val rdd2 = sc.parallelize(Array((1,4),(2,5),(3,6)))
rdd1.join(rdd2).collect()
res7: Array[(Int, (String, Int))] = Array((1, (a, 4)), (2,(b,5)), (3,(c,6)))
val rdd1 = sc.parallelize(Array((1,"a"), (3,"b"), (3,"c")))
val rdd2 = sc.parallelize(Array((1,4), (1,5), (3,6), (3,7)))
rdd1.join(rdd2).collect()
res8: Array[(Int, (String, Int))] = Array((1,(a,4)), (1,(a,5)), (3,(b,6)), (3,(b,7)), (3,(c,6)), (3,(c,7)))
val a = sc.parallelize(List("dog", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
b.join(d).collect
res9: Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (3,(dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)), (3,(rat,cat)), (3,(rat,gnu)), (3,(rat,bee)))
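As the source below shows, join is built on top of cogroup, which exposes the grouped values of both RDDs for each key. A sketch on the rdd1/rdd2 pair from the second join example above (the exact ordering and rendering of the output may vary):
// cogroup keeps the values of both sides grouped per key instead of producing pairs
rdd1.cogroup(rdd2).collect
// e.g. Array((1,(CompactBuffer(a),CompactBuffer(4, 5))), (3,(CompactBuffer(b, c),CompactBuffer(6, 7))))
// join then simply emits the cross product of the two groups for every key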
Source code analysis
/**
 * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
 * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
 * (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
 */
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues( pair =>
    for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
  )
}

/**
 * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
 * list of values for that key in `this` as well as `other`.
 */
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
  if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
    throw new SparkException("HashPartitioner cannot partition array keys.")
  }
  val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
  cg.mapValues { case Array(vs, w1s) =>
    (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
  }
}
9. leftOuterJoin(otherDataset)
A left outer join: every element of the left RDD is kept. If the key also appears in the right RDD, the matching values are joined; if not, the right side is filled with None (the right-hand value is wrapped in an Option).
Continuing with the RDDs b and d defined above:
b.leftOuterJoin(d).collect
res94: Array[(Int, (String, Option[String]))] = Array((6,(salmon,Some(salmon))), (6,(salmon,Some(rabbit))), (6,(salmon,Some(turkey))), (3,(dog,Some(dog))), (3,(dog,Some(cat))), (3,(dog,Some(gnu))), (3,(dog,Some(bee))), (3,(rat,Some(dog))), (3,(rat,Some(cat))), (3,(rat,Some(gnu))), (3,(rat,Some(bee))), (8,(elephant,None)))
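Because the right-hand side is an Option, a follow-up mapValues is often used to unwrap it with a default value (a minimal sketch continuing from b and d above; the placeholder string is arbitrary):
// Replace a missing right-hand side with a placeholder instead of None
b.leftOuterJoin(d).mapValues { case (left, rightOpt) => (left, rightOpt.getOrElse("-")) }.collect
// e.g. (8,(elephant,-)) appears instead of (8,(elephant,None))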
Source code analysis
/**
 * Perform a left outer join of `this` and `other`. For each element (k, v) in `this`, the
 * resulting RDD will either contain all pairs (k, (v, Some(w))) for w in `other`, or the
 * pair (k, (v, None)) if no elements in `other` have key k. Uses the given Partitioner to
 * partition the output RDD.
 */
def leftOuterJoin[W](
    other: RDD[(K, W)],
    partitioner: Partitioner): RDD[(K, (V, Option[W]))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues { pair =>
    if (pair._2.isEmpty) {
      pair._1.iterator.map(v => (v, None))
    } else {
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, Some(w))
    }
  }
}
10. rightOuterJoin(otherDataset)
A right outer join: every element of the right RDD is kept. If the key also appears in the left RDD, the matching values are joined; if not, the left side is filled with None (the left-hand value is wrapped in an Option).
b.rightOuterJoin(d).collect
Array[(Int, (Option[String], String))] = Array((6,(Some(salmon),salmon)), (6,(Some(salmon),rabbit)), (6,(Some(salmon),turkey)), (3,(Some(dog),dog)), (3,(Some(dog),cat)), (3,(Some(dog),gnu)), (3,(Some(dog),bee)), (3,(Some(rat),dog)), (3,(Some(rat),cat)), (3,(Some(rat),gnu)), (3,(Some(rat),bee)), (4,(None,wolf)), (4,(None,bear)))
Source code analysis
/**
 * Perform a right outer join of `this` and `other`. For each element (k, w) in `other`, the
 * resulting RDD will either contain all pairs (k, (Some(v), w)) for v in `this`, or the
 * pair (k, (None, w)) if no elements in `this` have key k. Uses the given Partitioner to
 * partition the output RDD.
 */
def rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Option[V], W))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues { pair =>
    if (pair._1.isEmpty) {
      pair._2.iterator.map(w => (None, w))
    } else {
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (Some(v), w)
    }
  }
}
Code debugging
pom.xml
<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>2.11.8</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.4.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.4.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.2.0</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.15</version>
    </dependency>
</dependencies>
Create the KVRDD object
package com.soft863

import org.apache.spark.{SparkConf, SparkContext}

object KVRDD {

  def main(args: Array[String]): Unit = {
    // Create the configuration
    val conf: SparkConf = new SparkConf().setAppName("KV RDD DEMO").setMaster("local[*]")
    // Create the SparkContext, the entry point for submitting the job
    val sc = new SparkContext(conf)

    val rdd1 = sc.parallelize(Array(("red", 1), ("yellow", 2), ("red", 3), ("yellow", 4)))
    rdd1.groupByKey().collect()

    val rdd2 = sc.parallelize(Array(("red", 1), ("yellow", 2), ("red", 3), ("yellow", 4)))
    rdd2.reduceByKey((x, y) => x + y).collect

    val pairRDD3 = sc.parallelize(List(("cat", 2), ("cat", 5), ("dog", 12), ("mouse", 4), ("cat", 12), ("mouse", 2)), 2)
    pairRDD3.aggregateByKey(1)(_ + _, _ + _).collect

    val pairRDD4 = sc.parallelize(List(("cat", 2), ("cat", 5), ("dog", 12), ("mouse", 4), ("cat", 12), ("mouse", 2)), 2)
    pairRDD4.foldByKey(2)(_ + _).collect

    val rdd5 = sc.parallelize(Array((3, "aa"), (6, "cc"), (2, "bb"), (1, "dd")))
    rdd5.sortByKey(true).collect
    rdd5.sortByKey(false).collect

    val rdd6 = sc.parallelize(List(2, 1, 4, 3))
    rdd6.sortBy(x => x).collect()
    rdd6.sortBy(x => x % 2).collect()

    val rdd7 = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
    val rdd71 = rdd7.map(x => (x.length, x))
    rdd71.mapValues("|" + _ + "|").collect

    val rdd8 = sc.parallelize(Array((1, "a"), (2, "b"), (3, "c"), (4, "f")))
    val rdd81 = sc.parallelize(Array((1, 4), (2, 5), (3, 6)))
    rdd8.join(rdd81).collect()

    val a = sc.parallelize(List("dog", "salmon", "rat", "elephant"), 3)
    val b = a.keyBy(_.length)
    val c = sc.parallelize(List("dog", "cat", "gnu", "salmon", "rabbit", "turkey", "wolf", "bear", "bee"), 3)
    val d = c.keyBy(_.length)
    b.join(d).collect
    b.leftOuterJoin(d).collect
    b.rightOuterJoin(d).collect

    // Release the SparkContext when done
    sc.stop()
  }
}