Transformation Operators
A transformation produces a new RDD from an existing RDD. Transformations are lazy (deferred evaluation): the code of a transformation operator is not executed right away. Only when the program reaches an action operator does the chain of code actually run. This design lets Spark execute more efficiently.
The key-value (KV) operators covered here apply to RDDs whose elements are key-value pairs.
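As a quick illustration of this laziness, here is a minimal sketch (assuming a SparkContext named sc is already available, for example in spark-shell): the map call only records the lineage, and nothing is computed until the collect action runs.
// Transformation only: no job is launched yet, Spark just records the lineage
val words = sc.parallelize(List("spark", "rdd", "transformation"))
val lengths = words.map(w => (w, w.length))   // lazy, returns a new RDD immediately
// Action: this is the point where the job actually runs
lengths.collect().foreach(println)            // e.g. (spark,5) (rdd,3) (transformation,14)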
Official documentation:
RDD Programming Guide - Spark 3.5.1 Documentation
1. groupByKey
groupByKey groups the values of identical keys together: for each key, the resulting value is a collection of all values with that key.
val rdd = sc.parallelize(Array(("red",1), ("yellow",2), ("red", 3),("yellow", 4)))
rdd.groupByKey().collect()
Array[(String, Iterable[Int])] = Array((red,CompactBuffer(1, 3)), (yellow,CompactBuffer(2, 4)))
CompactBuffer: an append-only buffer similar to ArrayBuffer, but more memory-efficient for small buffers.
Extension:
Group and sort List(("a", 50), ("b", 70), ("c", 30), ("c", 90), ("b", 20), ("a", 80)): group by the string key and sort the integers within each group in descending order.
sc.makeRDD(List(("a", 50), ("b", 70), ("c", 30), ("c", 90),("b", 20), ("a", 80))).groupByKey().map(x => (x._1, x._2.toList.sorted.reverse)).foreach(println)
toList: converts the CompactBuffer to a List;
sorted: sorts the List in ascending order;
reverse: reverses the elements, producing descending order.
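An equivalent way to write the descending sort is shown below (a sketch on the same data; sortWith is a standard Scala collection method, not something Spark-specific):
// sortWith(_ > _) sorts each group's values in descending order directly,
// replacing the separate sorted + reverse steps
sc.makeRDD(List(("a", 50), ("b", 70), ("c", 30), ("c", 90), ("b", 20), ("a", 80)))
  .groupByKey()
  .map { case (k, vs) => (k, vs.toList.sortWith(_ > _)) }
  .foreach(println)
// Expected output (print order may vary): (a,List(80, 50)) (b,List(70, 20)) (c,List(90, 30))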
Source code analysis
/**
 * Group the values for each key in the RDD into a single sequence. Hash-partitions the
 * resulting RDD with the existing partitioner/parallelism level. The ordering of elements
 * within each group is not guaranteed, and may even differ each time the resulting RDD is
 * evaluated.
 *
 * @note This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
 * or `PairRDDFunctions.reduceByKey` will provide much better performance.
 */
def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
  groupByKey(defaultPartitioner(self))
}

/**
 * Group the values for each key in the RDD into a single sequence. Allows controlling the
 * partitioning of the resulting key-value pair RDD by passing a Partitioner.
 * The ordering of elements within each group is not guaranteed, and may even differ
 * each time the resulting RDD is evaluated.
 *
 * @note This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
 * or `PairRDDFunctions.reduceByKey` will provide much better performance.
 *
 * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
 * key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
 */
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
CompactBuffer analysis
/**
 * An append-only buffer similar to ArrayBuffer, but more memory-efficient for small buffers.
 * ArrayBuffer always allocates an Object array to store the data, with 16 entries by default,
 * so it has about 80-100 bytes of overhead. In contrast, CompactBuffer can keep up to two
 * elements in fields of the main object, and only allocates an Array[AnyRef] if there are more
 * entries than that. This makes it more efficient for operations like groupBy where we expect
 * some keys to have very few elements.
 */
private[spark] class CompactBuffer[T: ClassTag] extends Seq[T] with Serializable {
2. reduceByKey(func)
Called on an RDD of (K, V) pairs, it returns an RDD of (K, V) pairs in which the values of each key are aggregated together using the given function.
val rdd = sc.parallelize(Array(("red",1), ("yellow",2), ("red", 3),("yellow", 4)))
rdd.reduceByKey((x,y) => x + y).collect
Array[(String, Int)] = Array((red,4), (yellow,6))
What do x and y stand for?
Here x stands for the result accumulated so far, while y is the next value being visited for that key.
At the start, when the first value of a key is visited, there is no accumulated result yet, so that value is simply carried over as the result.
When the second value is visited, x is the previous result, which is combined with y, and so on.
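A small trace makes this concrete (a sketch over the "red" values from the example above; the actual evaluation order inside Spark can differ, which is why the function should be associative and commutative):
// For key "red" the values are 1 and 3:
//   step 1: only the value 1 has been seen, so the intermediate result is 1
//   step 2: x = 1 (previous result), y = 3 (next value) => x + y = 4
// So ("red", 1) and ("red", 3) collapse into ("red", 4).
val sum = Seq(1, 3).reduce((x, y) => x + y)   // 4, the same folding reduceByKey applies per key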
What if I want to combine data that is not in key-value form?
You can use reduce:
val rdd5 = sc.parallelize(List(1, 2, 3, 4, 5))
rdd5.reduce(_ + _)
res4: Int = 15
Source code analysis
/**
 * Merge the values for each key using an associative and commutative reduce function. This will
 * also perform the merging locally on each mapper before sending results to a reducer, similarly
 * to a "combiner" in MapReduce.
 */
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}

/**
 * Merge the values for each key using an associative and commutative reduce function. This will
 * also perform the merging locally on each mapper before sending results to a reducer, similarly
 * to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
 * parallelism level.
 */
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  reduceByKey(defaultPartitioner(self), func)
}
3. aggregateByKey(zeroValue)(seqOp, combOp)
Within each partition, the values are aggregated per key with seqOp, starting from the initial value zeroValue; the per-partition results are then merged across partitions with combOp.
In other words: within every partition, values with the same key are aggregated first; the per-partition results are then aggregated again.
val pairRDD = sc.parallelize(List(("cat", 2), ("cat", 5), ("dog", 12),("mouse", 4), ("cat", 12), ("mouse", 2)), 2)
pairRDD.aggregateByKey(1)(_+_, _ + _).collect
res1: Array[(String, Int)] = Array((dog,13), (cat,21), (mouse,7))
Let's walk through the process above:
1. First partition: ("cat", 2), ("cat", 5), ("dog", 12); after aggregation: ("cat", 8), ("dog", 13)
2. Second partition: ("mouse", 4), ("cat", 12), ("mouse", 2); after aggregation: ("mouse", 7), ("cat", 13)
3. Final result: Array[(String, Int)] = Array((dog,13), (cat,21), (mouse,7))
Extension
scala> val pairRDD = sc.parallelize(List(("cat", 2), ("cat", 5), ("dog", 12),("mouse", 4), ("cat", 12), ("mouse", 2)), 2)
scala> pairRDD.aggregateByKey(2)(_+_, _ * _).collect
Result: Array[(String, Int)] = Array((dog,14), (cat,126), (mouse,8))
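Because seqOp and combOp are independent, they do not have to be the same operation. A common illustration (a sketch on the same pairRDD) is taking the maximum value per key inside each partition, then summing the per-partition maxima across partitions:
// seqOp: keep the larger of the running max and the next value (within a partition)
// combOp: add the per-partition maxima together (across partitions)
pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect
// Partition 1: cat -> max(0,2,5) = 5, dog -> 12; Partition 2: mouse -> max(0,4,2) = 4, cat -> 12
// Result: e.g. Array((dog,12), (cat,17), (mouse,4))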
Source code analysis
/**
* Aggregate the values of each key, using given combine functions and a neutral "zero value".
* This function can return a different result type, U, than the type of the values in this RDD,
* V. Thus, we need one operation for merging a V into a U and one operation for merging two U's,
* as in scala.TraversableOnce. The former operation is used for merging values within a
* partition, and the latter is used for merging values between partitions. To avoid memory
* allocation, both of these functions are allowed to modify and return their first argument
* instead of creating a new U.
*/
def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
  // Serialize the zero value to a byte array so that we can get a new clone of it on each key
  val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
  val zeroArray = new Array[Byte](zeroBuffer.limit)
  zeroBuffer.get(zeroArray)
  lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
  val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))
  // We will clean the combiner closure later in `combineByKey`
  val cleanedSeqOp = self.context.clean(seqOp)
  combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
    cleanedSeqOp, combOp, partitioner)
}
4. foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)]
foldByKey is a simplified form of aggregateByKey. It also aggregates the values of each key, but takes only one function: the data inside each partition is aggregated per key with this function, and the per-partition results are then aggregated again with the same function.
zeroValue is the initial value: inside each partition, the values of a key are combined with zeroValue before being folded with func.
val pairRDD = sc.parallelize(List(("cat", 2), ("cat", 5), ("dog", 12),("mouse", 4), ("cat", 12), ("mouse", 2)), 2)
pairRDD.foldByKey(1)(_+_).collect
Analysis:
First partition: (cat,2) (cat,5) (dog,12); cat: 1+2+5=8; dog: 1+12=13
Second partition: (mouse,4) (cat,12) (mouse,2); mouse: 1+4+2=7; cat: 1+12=13
Merging the partitions: cat: 8+13=21; dog: 13; mouse: 7
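Since the zero value is only threaded into each partition's fold, foldByKey with a neutral zero value behaves like reduceByKey (a sketch on the same pairRDD; with a non-neutral zero such as 1 above, the zero is added once per key per partition, which is why cat ends up at 21 instead of 19):
// Zero value 0 is neutral for addition, so this matches reduceByKey(_ + _)
pairRDD.foldByKey(0)(_ + _).collect
// e.g. Array((dog,12), (cat,19), (mouse,6))
pairRDD.reduceByKey(_ + _).collect
// e.g. Array((dog,12), (cat,19), (mouse,6))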
Source code analysis
/**
* Merge the values for each key using an associative function and a neutral "zero value" which
* may be added to the result an arbitrary number of times, and must not change the result
* (e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication.).
*/
def foldByKey(
    zeroValue: V,
    partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  // Serialize the zero value to a byte array so that we can get a new clone of it on each key
  val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
  val zeroArray = new Array[Byte](zeroBuffer.limit)
  zeroBuffer.get(zeroArray)
  // When deserializing, use a lazy val to create just one instance of the serializer per task
  lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
  val createZero = () => cachedSerializer.deserialize[V](ByteBuffer.wrap(zeroArray))
  val cleanedFunc = self.context.clean(func)
  combineByKeyWithClassTag[V]((v: V) => cleanedFunc(createZero(), v),
    cleanedFunc, cleanedFunc, partitioner)
}
Comparing the underlying implementations of aggregateByKey and foldByKey below shows that foldByKey is a simplified version of aggregateByKey: aggregateByKey passes two different functions (seqOp for merging within a partition, combOp for merging across partitions), while foldByKey passes the same function for both.
// aggregateByKey
combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
  cleanedSeqOp, combOp, partitioner)
// foldByKey
combineByKeyWithClassTag[V]((v: V) => cleanedFunc(createZero(), v),
  cleanedFunc, cleanedFunc, partitioner)
5. sortByKey([ascending])
Called on an RDD of (K, V) pairs, it returns an RDD of (K, V) pairs sorted by key.
val rdd = sc.parallelize(Array((3,"aa"),(6,"cc"),(2,"bb"),(1,"dd")))
rdd.sortByKey(true).collect
res2: Array[(Int, String)] = Array((1,dd), (2,bb), (3,aa), (6,cc))
rdd.sortByKey(false).collect
res1: Array[(Int, String)] = Array((6,cc), (3,aa), (2,bb), (1,dd))
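As the source below shows, sortByKey also takes a numPartitions argument and range-partitions the data. A small sketch forcing a single output partition, so the whole result is one ordered range:
// Sort ascending into a single partition; with more partitions each partition
// holds a sorted, non-overlapping range of keys (RangePartitioner)
rdd.sortByKey(true, 1).collect
// Array((1,dd), (2,bb), (3,aa), (6,cc))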
Source code analysis
/**
 * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
 * `collect` or `save` on the resulting RDD will return or output an ordered list of records
 * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
 * order of the keys).
 */
// TODO: this currently doesn't work on P other than Tuple2!
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
    : RDD[(K, V)] = self.withScope {
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part)
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}
6. sortBy
Sorts the RDD by the result of the given function func.
val rdd = sc.parallelize(List(2, 1, 4, 3))
rdd.sortBy(x => x).collect()
res4: Array[Int] = Array(1, 2, 3, 4)
rdd.sortBy(x => x % 2).collect()
res5: Array[Int] = Array(2, 4, 1, 3)
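sortBy is also handy for pair RDDs, because the key function can reach into the tuple. A sketch sorting word-count style pairs by their value in descending order (the ascending parameter is part of the signature shown in the source below):
val counts = sc.parallelize(List(("red", 4), ("yellow", 6), ("blue", 1)))
// Sort by the second tuple element (the count), largest first
counts.sortBy(_._2, ascending = false).collect
// Array((yellow,6), (red,4), (blue,1))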
Source code analysis
/**
 * Return this RDD sorted by the given key function.
 */
def sortBy[K](
    f: (T) => K,
    ascending: Boolean = true,
    numPartitions: Int = this.partitions.length)
    (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
  this.keyBy[K](f)
    .sortByKey(ascending, numPartitions)
    .values
}
7. mapValues
Applies the given function to the value of each key-value element in the RDD; the original key stays unchanged and, together with the new value, forms an element of the new RDD. This operator therefore only applies to RDDs whose elements are key-value pairs: for an RDD of type (K, V), only V is transformed.
val rdd1 = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val rdd2 = rdd1.map(x => (x.length, x))
rdd2.mapValues("|" + _ + "|").collect
Array[(Int, String)] = Array((3,|dog|), (5,|tiger|), (4,|lion|), (3,|cat|), (7,|panther|), (5,|eagle|))
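The practical difference from a plain map is that mapValues cannot change the keys, so Spark can keep the existing partitioner (preservesPartitioning = true in the source below). A sketch showing that map drops the partitioner while mapValues keeps it:
import org.apache.spark.HashPartitioner
val partitioned = rdd2.partitionBy(new HashPartitioner(2))
partitioned.mapValues(_.toUpperCase).partitioner          // Some(HashPartitioner): partitioner preserved
partitioned.map { case (k, v) => (k, v.toUpperCase) }.partitioner   // None: Spark must assume keys changed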
Source code analysis
/**
 * Pass each value in the key-value pair RDD through a map function without changing the keys;
 * this also retains the original RDD's partitioning.
 */
def mapValues[U](f: V => U): RDD[(K, U)] = self.withScope {
  val cleanF = self.context.clean(f)
  new MapPartitionsRDD[(K, U), (K, V)](self,
    (context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) },
    preservesPartitioning = true)
}
8. join(otherDataset)
Called on RDDs of type (K, V) and (K, W), it returns an RDD of (K, (V, W)) in which all pairs of elements sharing the same key are combined.
val rdd1 = sc.parallelize(Array((1,"a"),(2,"b"),(3,"c"),(4, "f")))
val rdd2 = sc.parallelize(Array((1,4),(2,5),(3,6)))
rdd1.join(rdd2).collect()
res7: Array[(Int, (String, Int))] = Array((1, (a, 4)), (2,(b,5)), (3,(c,6)))
val rdd1 = sc.parallelize(Array((1,"a"), (3,"b"), (3,"c")))
val rdd2 = sc.parallelize(Array((1,4), (1,5), (3,6), (3,7)))
rdd1.join(rdd2).collect()
res8: Array[(Int, (String, Int))] = Array((1,(a,4)), (1,(a,5)), (3,(b,6)), (3,(b,7)), (3,(c,6)), (3,(c,7)))
val a = sc.parallelize(List("dog", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
b.join(d).collect
res9: Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (3,(dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)), (3,(rat,cat)), (3,(rat,gnu)), (3,(rat,bee)))
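As the source below shows, join is built on top of cogroup, which exposes the grouped values of both RDDs for each key. A sketch on the rdd1/rdd2 pair from the second join example above (the exact ordering and rendering of the output may vary):
// cogroup keeps the values of both sides grouped per key instead of producing pairs
rdd1.cogroup(rdd2).collect
// e.g. Array((1,(CompactBuffer(a),CompactBuffer(4, 5))), (3,(CompactBuffer(b, c),CompactBuffer(6, 7))))
// join then simply emits the cross product of the two groups for every key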
Source code analysis
/**
 * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
 * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
 * (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
 */
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues( pair =>
    for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
  )
}

/**
 * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
 * list of values for that key in `this` as well as `other`.
 */
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
  if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
    throw new SparkException("HashPartitioner cannot partition array keys.")
  }
  val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
  cg.mapValues { case Array(vs, w1s) =>
    (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
  }
}
9. leftOuterJoin(otherDataset)
A left outer join: every element of the left RDD is kept. If the key also appears in the right RDD, the matching values are joined; if not, the right side is filled with None (the right-hand value is wrapped in an Option).
Continuing with the RDDs b and d defined above:
b.leftOuterJoin(d).collect
res94: Array[(Int, (String, Option[String]))] = Array((6,(salmon,Some(salmon))), (6,(salmon,Some(rabbit))), (6,(salmon,Some(turkey))), (3,(dog,Some(dog))), (3,(dog,Some(cat))), (3,(dog,Some(gnu))), (3,(dog,Some(bee))), (3,(rat,Some(dog))), (3,(rat,Some(cat))), (3,(rat,Some(gnu))), (3,(rat,Some(bee))), (8,(elephant,None)))
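Because the right-hand side is an Option, a follow-up mapValues is often used to unwrap it with a default value (a minimal sketch continuing from b and d above; the placeholder string is arbitrary):
// Replace a missing right-hand side with a placeholder instead of None
b.leftOuterJoin(d).mapValues { case (left, rightOpt) => (left, rightOpt.getOrElse("-")) }.collect
// e.g. (8,(elephant,-)) appears instead of (8,(elephant,None))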
Source code analysis
/**
 * Perform a left outer join of `this` and `other`. For each element (k, v) in `this`, the
 * resulting RDD will either contain all pairs (k, (v, Some(w))) for w in `other`, or the
 * pair (k, (v, None)) if no elements in `other` have key k. Uses the given Partitioner to
 * partition the output RDD.
 */
def leftOuterJoin[W](
    other: RDD[(K, W)],
    partitioner: Partitioner): RDD[(K, (V, Option[W]))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues { pair =>
    if (pair._2.isEmpty) {
      pair._1.iterator.map(v => (v, None))
    } else {
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, Some(w))
    }
  }
}
10. rightOuterJoin(otherDataset)
A right outer join: every element of the right RDD is kept. If the key also appears in the left RDD, the matching values are joined; if not, the left side is filled with None (the left-hand value is wrapped in an Option).
b.rightOuterJoin(d).collect
Array[(Int, (Option[String], String))] = Array((6,(Some(salmon),salmon)), (6,(Some(salmon),rabbit)), (6,(Some(salmon),turkey)), (3,(Some(dog),dog)), (3,(Some(dog),cat)), (3,(Some(dog),gnu)), (3,(Some(dog),bee)), (3,(Some(rat),dog)), (3,(Some(rat),cat)), (3,(Some(rat),gnu)), (3,(Some(rat),bee)), (4,(None,wolf)), (4,(None,bear)))
Source code analysis
/**
 * Perform a right outer join of `this` and `other`. For each element (k, w) in `other`, the
 * resulting RDD will either contain all pairs (k, (Some(v), w)) for v in `this`, or the
 * pair (k, (None, w)) if no elements in `this` have key k. Uses the given Partitioner to
 * partition the output RDD.
 */
def rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Option[V], W))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues { pair =>
    if (pair._1.isEmpty) {
      pair._2.iterator.map(w => (None, w))
    } else {
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (Some(v), w)
    }
  }
}
Code debugging
pom.xml
<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>2.11.8</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.4.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.4.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.2.0</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.15</version>
    </dependency>
</dependencies>
Create the KVRDD object
package com.soft863

import org.apache.spark.{SparkConf, SparkContext}

object KVRDD {

  def main(args: Array[String]): Unit = {
    // Create the configuration
    val conf: SparkConf = new SparkConf().setAppName("KV RDD DEMO").setMaster("local[*]")
    // Create the SparkContext, the entry point for submitting the job
    val sc = new SparkContext(conf)

    val rdd1 = sc.parallelize(Array(("red", 1), ("yellow", 2), ("red", 3), ("yellow", 4)))
    rdd1.groupByKey().collect()

    val rdd2 = sc.parallelize(Array(("red", 1), ("yellow", 2), ("red", 3), ("yellow", 4)))
    rdd2.reduceByKey((x, y) => x + y).collect

    val pairRDD3 = sc.parallelize(List(("cat", 2), ("cat", 5), ("dog", 12), ("mouse", 4), ("cat", 12), ("mouse", 2)), 2)
    pairRDD3.aggregateByKey(1)(_ + _, _ + _).collect

    val pairRDD4 = sc.parallelize(List(("cat", 2), ("cat", 5), ("dog", 12), ("mouse", 4), ("cat", 12), ("mouse", 2)), 2)
    pairRDD4.foldByKey(2)(_ + _).collect

    val rdd5 = sc.parallelize(Array((3, "aa"), (6, "cc"), (2, "bb"), (1, "dd")))
    rdd5.sortByKey(true).collect
    rdd5.sortByKey(false).collect

    val rdd6 = sc.parallelize(List(2, 1, 4, 3))
    rdd6.sortBy(x => x).collect()
    rdd6.sortBy(x => x % 2).collect()

    val rdd7 = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
    val rdd71 = rdd7.map(x => (x.length, x))
    rdd71.mapValues("|" + _ + "|").collect

    val rdd8 = sc.parallelize(Array((1, "a"), (2, "b"), (3, "c"), (4, "f")))
    val rdd81 = sc.parallelize(Array((1, 4), (2, 5), (3, 6)))
    rdd8.join(rdd81).collect()

    val a = sc.parallelize(List("dog", "salmon", "rat", "elephant"), 3)
    val b = a.keyBy(_.length)
    val c = sc.parallelize(List("dog", "cat", "gnu", "salmon", "rabbit", "turkey", "wolf", "bear", "bee"), 3)
    val d = c.keyBy(_.length)
    b.join(d).collect
    b.leftOuterJoin(d).collect
    b.rightOuterJoin(d).collect

    // Release the SparkContext when done
    sc.stop()
  }
}