Spark Core: Pair RDD

Latest recommended article published on 2022-06-15 16:19:13

Icedzzz

Article link: https://blog.csdn.net/Zeroowt/article/details/104663290

Pair RDD

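A Pair RDD is an RDD whose elements are key-value pairs. Spark provides operations specific to such RDDs, which let you act on each key in parallel or regroup data across nodes.
Common Pair RDD methods include: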

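Function	Purpose
reduceByKey	Merges the values that share the same key
groupByKey	Groups the values that share the same key and returns them as an iterable of values
combineByKey	Merges the values that share the same key, possibly into a different result type
mapValues	Applies a function to each value of the pair RDD without changing the key
sortByKey	Returns an RDD sorted by key
join	Performs an inner join between two RDDs
cogroup	Groups the data sharing the same key from both RDDs and returns one iterable of values per RDD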
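
To make the table concrete, here is a minimal sketch that calls each method on two small RDDs. The `sales` and `prices` data, the `local[2]` master and the application name are assumptions made for illustration only. In spark-shell the `sc` value already exists, so the first three lines can be skipped there.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("PairRddMethods")
val sc = new SparkContext(conf)

// Hypothetical sample data, not taken from the article.
val sales = sc.parallelize(Seq(("apple", 3), ("pear", 5), ("apple", 4)))
val prices = sc.parallelize(Seq(("apple", 0.5), ("pear", 0.8), ("kiwi", 1.2)))

// reduceByKey: merge the values of each key with a binary function.
val totals = sales.reduceByKey(_ + _).collect().toMap

// groupByKey: gather all values of a key into one Iterable.
val grouped = sales.groupByKey().mapValues(_.toList.sorted).collect().toMap

// mapValues: transform the values, leave the keys untouched.
val doubled = sales.mapValues(_ * 2).collect().toList.sorted

// sortByKey: order the pairs by key.
val sortedKeys = sales.sortByKey().keys.collect().toList

// join: inner join, so only keys present in both RDDs survive.
val joinedKeys = sales.join(prices).keys.collect().toSet

// cogroup: every key from either RDD, with one Iterable per input RDD.
val cogroupSizes = sales.cogroup(prices)
  .mapValues { case (qs, ps) => (qs.size, ps.size) }
  .collect().toMap

sc.stop()
```

Note that `join` drops `kiwi` because it has no sales, whereas `cogroup` keeps it with an empty iterable on the `sales` side.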

1. Tuning the level of parallelism

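Every RDD has a fixed number of partitions, and that number determines the degree of parallelism when operations run on the RDD. When performing an aggregation or grouping operation, you can ask Spark to use a specific number of partitions. Spark always tries to infer a sensible default from the size of the cluster, but in some cases you will want to tune the level of parallelism to get better performance.
Besides the grouping and aggregation methods, you can also repartition an RDD with the repartition() and coalesce() functions.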
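
As a rough illustration of asking for a given number of partitions during an aggregation, the sketch below passes an explicit partition count to `reduceByKey`. The data, the partition counts and the `local[2]` master are assumptions chosen for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("ParallelismDemo")
val sc = new SparkContext(conf)

// Hypothetical data: 100 pairs with 5 distinct keys, spread over 4 partitions.
val pairs = sc.parallelize((1 to 100).map(i => (i % 5, i)), 4)

// Let Spark pick the number of output partitions.
val defaultSums = pairs.reduceByKey(_ + _)
// Explicitly request 8 output partitions for the aggregation.
val tunedSums = pairs.reduceByKey(_ + _, 8)

val inputPartitions = pairs.getNumPartitions      // 4
val tunedPartitions = tunedSums.getNumPartitions  // 8

// The partition count changes how the work is split, not the result.
val sameResult = defaultSums.collect().toMap == tunedSums.collect().toMap

sc.stop()
```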

The coalesce function
Source:

def coalesce(numPartitions: Int, shuffle: Boolean = false,
             partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
            (implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
  if (shuffle) {
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]

    // include a shuffle step so that our upstream tasks are still distributed
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](
        mapPartitionsWithIndexInternal(distributePartition, isOrderSensitive = true),
        new HashPartitioner(numPartitions)),
      numPartitions,
      partitionCoalescer).values
  } else {
    new CoalescedRDD(this, numPartitions, partitionCoalescer)
  }
}

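coalesce returns an RDD with numPartitions partitions. The second parameter, shuffle, defaults to false; in that case the result is a CoalescedRDD with a narrow dependency on its parent, no shuffle takes place, and the method can only reduce the number of partitions (requesting more partitions than the RDD currently has leaves the count unchanged). To increase the number of partitions, pass shuffle = true: each element is then assigned an integer key and redistributed by a HashPartitioner across numPartitions partitions, which raises parallelism at the cost of a shuffle.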
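
The effect of the shuffle flag on the partition count can be checked with a small sketch like the following. The input range, the partition counts and the `local[2]` master are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("CoalesceDemo")
val sc = new SparkContext(conf)

// Hypothetical input: 1000 integers in 8 partitions.
val rdd = sc.parallelize(1 to 1000, 8)

// Fewer partitions, no shuffle (narrow dependency).
val fewer = rdd.coalesce(2).getNumPartitions
// Asking for more partitions without a shuffle keeps the current count.
val notMore = rdd.coalesce(16).getNumPartitions
// With shuffle = true the partition count can grow.
val more = rdd.coalesce(16, shuffle = true).getNumPartitions

println(s"fewer=$fewer notMore=$notMore more=$more")

sc.stop()
```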

The repartition function
Source:

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}

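As the source shows, repartition simply calls coalesce with shuffle = true. It returns an RDD with exactly numPartitions partitions and redistributes the data through a shuffle, so it can either increase or decrease the partition count. When you only need to decrease the number of partitions, prefer coalesce, which avoids the shuffle.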
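
One way to see the difference is to inspect the lineage with `toDebugString`: a sketch, assuming the lineage text names the RDD classes involved, which is how current Spark versions print it. The data and partition counts are made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("RepartitionDemo")
val sc = new SparkContext(conf)

// Hypothetical input: 1000 integers in 4 partitions.
val rdd = sc.parallelize(1 to 1000, 4)

val grown = rdd.repartition(8)             // more partitions, via a shuffle
val shrunkWithShuffle = rdd.repartition(2) // fewer partitions, still shuffles
val shrunkNarrow = rdd.coalesce(2)         // fewer partitions, no shuffle

val grownPartitions = grown.getNumPartitions
val shrunkPartitions = shrunkWithShuffle.getNumPartitions

// Look for a ShuffledRDD in the lineage of each result.
val repartitionShuffles = shrunkWithShuffle.toDebugString.contains("ShuffledRDD")
val coalesceShuffles = shrunkNarrow.toDebugString.contains("ShuffledRDD")

println(shrunkWithShuffle.toDebugString)
println(shrunkNarrow.toDebugString)

sc.stop()
```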

2. combineByKey (aggregation)

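combineByKey() is the most general of the per-key aggregation functions: other per-key aggregations such as reduceByKey() are built on the same mechanism. It lets the result type differ from the type of the input values. combineByKey() walks through all elements of a partition, so for each element the key is either one it has not seen before or the same as that of an earlier element.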

Source:

/**
 * :: Experimental ::
 * Generic function to combine the elements for each key using a custom set of aggregation
 * functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C
 *
 * Users provide three functions:
 *
 *  - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
 *  - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
 *  - `mergeCombiners`, to combine two C's into a single one.
 *
 * In addition, users can control the partitioning of the output RDD, and whether to perform
 * map-side aggregation (if a mapper can produce multiple items with the same key).
 *
 * @note V and C can be different -- for example, one might group an RDD of type
 * (Int, Int) into an RDD of type (Int, Seq[Int]).
 */
def combineByKey[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C): RDD[(K, C)] = self.withScope {
  combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners)(null)
}

combineByKey requires three functions:

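createCombiner: combineByKey walks through all elements of each partition. When it meets a key it has not yet seen in that partition, it calls createCombiner() to create the initial accumulator for that key (the "combined type" C). Note that this happens the first time the key appears in each partition, not the first time it appears in the whole RDD.
mergeValue: when the key has already been seen in the current partition, mergeValue merges the new value into the accumulator (C) already held for that key.
mergeCombiners: each partition is processed independently, so the same key can end up with several accumulators. If more than one partition has an accumulator for the same key, mergeCombiners merges the per-partition results.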
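
The interplay of the three functions can be sketched with plain Scala collections. This is a simulation of the semantics described above, not Spark's actual implementation; the helper name `simulateCombineByKey` and the sample partitions are hypothetical.

```scala
// Simulates combineByKey over explicitly listed partitions (illustration only).
def simulateCombineByKey[K, V, C](
    partitions: Seq[Seq[(K, V)]],
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C): Map[K, C] = {
  // Phase 1: inside each partition, keep one accumulator per key.
  val perPartition: Seq[Map[K, C]] = partitions.map { part =>
    part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      acc.get(k) match {
        case None    => acc.updated(k, createCombiner(v)) // key is new in this partition
        case Some(c) => acc.updated(k, mergeValue(c, v))  // key already seen here
      }
    }
  }
  // Phase 2: merge the accumulators of the same key across partitions.
  perPartition.foldLeft(Map.empty[K, C]) { (merged, partial) =>
    partial.foldLeft(merged) { case (acc, (k, c)) =>
      acc.get(k) match {
        case None        => acc.updated(k, c)
        case Some(other) => acc.updated(k, mergeCombiners(other, c))
      }
    }
  }
}

// Two hypothetical partitions; key "a" appears in both.
val partitions = Seq(
  Seq(("a", 1), ("a", 2)),
  Seq(("b", 3), ("a", 9)))

// Accumulator is (sum, count).
val sumAndCount = simulateCombineByKey[String, Int, (Int, Int)](
  partitions,
  v => (v, 1),
  (acc, v) => (acc._1 + v, acc._2 + 1),
  (x, y) => (x._1 + y._1, x._2 + y._2))

println(sumAndCount) // Map(a -> (12,3), b -> (3,1))
```

Here createCombiner runs twice for key "a" (once per partition), mergeValue runs once (for the second "a" in the first partition), and mergeCombiners runs once to join the two partial results.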

Example:
Compute the average of the values for each key

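import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[*]").setAppName("PairRdd")
val sc = new SparkContext(conf)
// Four pairs spread over 2 partitions
val pairRdd = sc.parallelize(List(("coffee", 1), ("coffee", 2), ("panda", 3), ("coffee", 9)), 2)

val result = pairRdd.combineByKey(
  // createCombiner: the first value of a key in a partition becomes (sum, count)
  (v: Int) => (v, 1),
  // mergeValue: add a further value to the running sum and bump the count
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),
  // mergeCombiners: add up the (sum, count) pairs from different partitions
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
).map { case (key, value) => (key, value._1 / value._2.toFloat) } // average = sum / count
val r = result.collect
println(r.toBuffer)
sc.stop()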

Steps:

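Create an RDD by parallelizing a collection, with the number of partitions set to 2.
Aggregate with combineByKey. The first function takes the value of a newly seen key and returns (value, 1) as the accumulator, where the second field counts how many values have been seen for that key. The second function adds each further value of the same key to the running sum and increments the count by one. The third function merges the results from different partitions.
Use map to take each (sum, count) result and compute the average.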
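
For comparison, the same per-key average can also be written with mapValues and reduceByKey. This is an alternative formulation, not the approach the article uses: it fixes the accumulator shape to (sum, count) up front instead of supplying three separate functions. The application name is an assumption.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[*]").setAppName("PairRddAverage")
val sc = new SparkContext(conf)
val pairRdd = sc.parallelize(List(("coffee", 1), ("coffee", 2), ("panda", 3), ("coffee", 9)), 2)

val averages = pairRdd
  .mapValues(v => (v, 1))                                // value -> (sum, count)
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))     // add sums and counts per key
  .mapValues { case (sum, count) => sum / count.toFloat } // average = sum / count
  .collect()
  .toMap

println(averages) // coffee -> 4.0, panda -> 3.0
sc.stop()
```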
