An In-Depth Look at Five Common and Easily Confused Key-Value Operators in Spark

iParadiser

Article link: https://blog.csdn.net/weixin_46020333/article/details/105886036

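Originally published on my WeChat official account "大数据学习应用".
Reply "spark源码" in the official account's chat to see the Spark source code analysis series.
This article is my own original work; please do not repost it without permission.

The original article is about 4,400 characters long.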

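Preface

Spark ships with a large number of useful built-in operators, and most business logic can be implemented by combining them.

Spark programming ultimately comes down to using these operators, so it is well worth mastering them.

This article focuses on the following Spark operators: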

groupByKey
reduceByKey
aggregateByKey
foldByKey
combineByKey

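All of these operators work on RDDs of (k, v) pairs.

They all merge values per key, but they differ in the parameters they take and in how values are combined within a partition and across partitions.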

groupByKey()

Function signatures

def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
    groupByKey(defaultPartitioner(self))
}

def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = self.withScope {
    groupByKey(new HashPartitioner(numPartitions))
}

def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
    // groupByKey shouldn't use map side combine because map side combine does not
    // reduce the amount of data shuffled and requires all map side data be inserted
    // into a hash table, leading to more objects in the old gen.
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
        createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}

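Description

groupByKey() groups the values of each key.
For every key it returns an Iterable[V].
The Iterable[V] holds all the values that were paired with that key.
If you print the result directly, each group shows up as a CompactBuffer.
groupByKey() has to wait until all records with the same key have arrived before it can continue.
groupByKey() redistributes the data, that is, it involves a shuffle. Because the data cannot all be held in memory while waiting, the shuffle data is written to disk.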
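The behaviour described above can be modelled without Spark. The following is only a plain-Scala sketch, not Spark's implementation: the object name GroupByKeyModel and the split of the records into two partitions are assumptions made for illustration. It shows that every record crosses the shuffle unchanged and that the values of a key are only collected afterwards.

```scala
// A plain-Scala model of groupByKey (no Spark dependency).
// The two partitions below are a hypothetical layout of five sample records.
object GroupByKeyModel {
  val partitions: List[List[(String, Int)]] = List(
    List(("hello", 1), ("hello", 2), ("hadoop", 2)),
    List(("hadoop", 2), ("hadoop", 4))
  )

  // No map-side combine: every record is shuffled exactly as it is
  def shuffled: List[(String, Int)] = partitions.flatten

  // After the shuffle, all values of a key are collected into one sequence
  def grouped: Map[String, List[Int]] =
    shuffled.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

  def main(args: Array[String]): Unit = {
    println(s"records shuffled: ${shuffled.size}")
    grouped.foreach(println)
  }
}
```

In this model the number of shuffled records always equals the number of input records, which is the cost that map-side combining avoids.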

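About CompactBuffer

CompactBuffer is a data structure internal to Spark. It extends Seq[T] (and Serializable), so the returned value is a collection that can be iterated over.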

/**
* An append-only buffer similar to ArrayBuffer, but more memory-efficient for small buffers.
* ArrayBuffer always allocates an Object array to store the data, with 16 entries by default,
* so it has about 80-100 bytes of overhead. In contrast, CompactBuffer can keep up to two
* elements in fields of the main object, and only allocates an Array[AnyRef] if there are more
* entries than that. This makes it more efficient for operations like groupBy where we expect
* some keys to have very few elements.
*/
private[spark] class CompactBuffer[T: ClassTag] extends Seq[T] with Serializable

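Code example

import org.apache.spark.rdd.RDD

// Assumes an existing SparkContext named sc (for example in spark-shell)
val rdd = sc.makeRDD(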
    List(
        ("hello", 1),
        ("hello", 2),
        ("hadoop", 2),
        ("hadoop", 2),
        ("hadoop", 4)
    )
)

// Group the values by key
val rdd1: RDD[(String, Iterable[Int])] = rdd.groupByKey()
rdd1.collect().foreach(println)
// The grouped RDD can be printed directly. Result:
//(hadoop,CompactBuffer(2, 2, 4))
//(hello,CompactBuffer(1, 2))

val rdd2 = rdd1.mapValues(
    datas => {
        datas.sum
    }
)
rdd2.collect().foreach(println)
// The grouped values can also be iterated over and processed further before printing. Result:
// (hadoop,8)
// (hello,3)

reduceByKey()

Function signatures

def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    reduceByKey(defaultPartitioner(self), func)
}

def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {
    reduceByKey(new HashPartitioner(numPartitions), func)
}

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}

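Description

reduceByKey() can be seen as an upgrade of groupByKey() followed by a per-key reduction.
The difference is that groupByKey() passes mapSideCombine = false,
which means groupByKey() does no map-side pre-aggregation and shuffles every record as is.
reduceByKey() first pre-aggregates within each partition and then aggregates again after the shuffle; the result is an RDD[(K, V)].
- Aggregation within a partition is usually called pre-aggregation (combine).
Prefer reduceByKey() where it fits: less data is written to disk during the shuffle, so there is less disk I/O and performance is better.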
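To make the effect of pre-aggregation concrete, here is a plain-Scala sketch that counts how many records would cross the shuffle. It is a model under stated assumptions, not Spark code: the object ReduceByKeyModel, its helper names and the way the seven records are split into two partitions are all hypothetical.

```scala
// A plain-Scala model of reduceByKey with map-side combine (no Spark dependency).
// The partition layout is a hypothetical split of seven sample records.
object ReduceByKeyModel {
  val partitions: List[List[(String, Int)]] = List(
    List(("Hello", 1), ("Hadoop", 2), ("Hello", 3), ("Hadoop", 4)),
    List(("Hadoop", 5), ("Hello", 6), ("Hadoop", 7))
  )

  // Reduce the values of each key within one collection of records
  def reducePerKey(records: List[(String, Int)], func: (Int, Int) => Int): Map[String, Int] =
    records.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).reduce(func)) }

  // Map side: each partition is reduced on its own before the shuffle
  def preAggregated(func: (Int, Int) => Int): List[(String, Int)] =
    partitions.flatMap(p => reducePerKey(p, func).toList)

  // Reduce side: the same function merges the partial results of each key
  def reduceByKey(func: (Int, Int) => Int): Map[String, Int] =
    reducePerKey(preAggregated(func), func)

  def main(args: Array[String]): Unit = {
    println(s"input records:    ${partitions.flatten.size}")
    println(s"records shuffled: ${preAggregated(_ + _).size}")
    println(reduceByKey(_ + _))
  }
}
```

With this layout only one record per key per partition is shuffled, instead of every input record as in the groupByKey() case.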

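Code example

import org.apache.spark.rdd.RDD

// Assumes an existing SparkContext named sc (for example in spark-shell)
val rdd = sc.makeRDD(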
    List(
        ("Hello", 1),
        ("Hadoop", 2),
        ("Hello", 3),
        ("Hadoop", 4),
        ("Hadoop", 5),
        ("Hello", 6),
        ("Hadoop", 7)
    )
)

// In Spark, every xxxByKey operator must be called on an RDD of key-value pairs
// reduceByKey = grouping + aggregation
// Spark groups by key automatically, then reduces the values of each key pairwise
val rdd1: RDD[(String, Int)] = rdd.reduceByKey(_ + _)

rdd1.collect().foreach(println)
// Result:
// (Hadoop,18)
// (Hello,10)

aggregateByKey()

Function signatures

def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
                                              combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    aggregateByKey(zeroValue, defaultPartitioner(self))(seqOp, combOp)
}

def aggregateByKey[U: ClassTag](zeroValue: U, numPartitions: Int)(seqOp: (U, V) => U,
                                                                  combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    aggregateByKey(zeroValue, new HashPartitioner(numPartitions))(seqOp, combOp)
}

def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
                                                                        combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    // Serialize the zero value to a byte array so that we can get a new clone of it on each key
    val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
    val zeroArray = new Array[Byte](zeroBuffer.limit)
    zeroBuffer.get(zeroArray)

    lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
    val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))

    // We will clean the combiner closure later in `combineByKey`
    val cleanedSeqOp = self.context.clean(seqOp)
    combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
                                cleanedSeqOp, combOp, partitioner)
}

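Description

aggregateByKey() lets you use one rule for the computation within a partition and a different rule for the computation across partitions, and lets you supply an initial value, zeroValue.
zeroValue: the initial value given to each key in each partition.

seqOp: the function used within each partition to fold the values of a key into an accumulator, starting from zeroValue.

combOp: the function used to merge the per-partition results of a key.
Within each partition, values are grouped by key. For each key, seqOp is first applied to zeroValue and the first value, then to the running result and each following value. This yields one (key, partial result) pair per key per partition.
The partial results are then grouped by key across partitions and passed to combOp (the first two partial results are combined, the result is combined with the next one, and so on). Each key and its final result are output as a new key-value pair.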
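The two phases described above can be sketched in plain Scala. This is a simplified model, not Spark's implementation: the object AggregateByKeyModel and the representation of partitions as a List of Lists are assumptions made for illustration.

```scala
// A plain-Scala model of aggregateByKey (no Spark dependency).
object AggregateByKeyModel {
  def aggregateByKey[K, V, U](partitions: List[List[(K, V)]], zeroValue: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): Map[K, U] = {
    // Within each partition: fold the values of every key, starting from zeroValue
    val partials: List[(K, U)] = partitions.flatMap { p =>
      p.groupBy(_._1).map { case (k, kvs) =>
        (k, kvs.map(_._2).foldLeft(zeroValue)(seqOp))
      }.toList
    }
    // Across partitions: merge the partial results of each key with combOp
    partials.groupBy(_._1).map { case (k, kus) => (k, kus.map(_._2).reduce(combOp)) }
  }

  def main(args: Array[String]): Unit = {
    val data = List(
      List(("a", 1), ("a", 2), ("c", 3)),
      List(("b", 4), ("c", 5), ("c", 6))
    )
    // Maximum per key within a partition (floor of 5), then sum across partitions
    println(aggregateByKey(data, 5)((x, y) => math.max(x, y), (x, y) => x + y))
  }
}
```

Note that zeroValue enters the computation once per key in every partition where that key occurs, so a key spread over several partitions sees it several times.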

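Code example

// Take the maximum value per key within each partition, then add the maxima across partitions
// aggregateByKey is curried and has two parameter lists
// 1. The first parameter list holds the initial value for each key
// 2. The second parameter list holds two functions
//    2.1 The first is the rule applied within a partition
//    2.2 The second is the rule applied across partitions
val rdd =
sc.makeRDD(List(
    ("a", 1), ("a", 2), ("c", 3),
    ("b", 4), ("c", 5), ("c", 6)
), 2)
// The initial value 5 takes part in the max, so it acts as a floor
// Partition 0: ("a",1),("a",2),("c",3) => (a,5)(c,5)
//                                                    => (a,5)(b,5)(c,11)
// Partition 1: ("b",4),("c",5),("c",6) => (b,5)(c,6)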

val resultRDD =
rdd.aggregateByKey(5)(
    (x, y) => math.max(x,y),
    (x, y) => x + y
)

resultRDD.collect().foreach(println)
// Result:
// (b,5)
// (a,5)
// (c,11)

foldByKey()

Function signatures

def foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    foldByKey(zeroValue, defaultPartitioner(self))(func)
}

def foldByKey(zeroValue: V, numPartitions: Int)(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    foldByKey(zeroValue, new HashPartitioner(numPartitions))(func)
}

def foldByKey(
    zeroValue: V,
    partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    // Serialize the zero value to a byte array so that we can get a new clone of it on each key
    val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
    val zeroArray = new Array[Byte](zeroBuffer.limit)
    zeroBuffer.get(zeroArray)

    // When deserializing, use a lazy val to create just one instance of the serializer per task
    lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
    val createZero = () => cachedSerializer.deserialize[V](ByteBuffer.wrap(zeroArray))

    val cleanedFunc = self.context.clean(func)
    combineByKeyWithClassTag[V]((v: V) => cleanedFunc(createZero(), v),
                                cleanedFunc, cleanedFunc, partitioner)
}

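Description

When the rule within a partition and the rule across partitions are the same, aggregateByKey() can be simplified to foldByKey().
If, in addition, the initial value has no effect on the result,
- for example the rule is addition and the initial value is 0, then foldByKey() is equivalent to reduceByKey().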
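The role of the initial value is easy to check with a small plain-Scala model. This is a sketch under assumptions, not Spark code: the object FoldByKeyModel and its fixed two-partition sample data are hypothetical.

```scala
// A plain-Scala model of foldByKey (no Spark dependency).
object FoldByKeyModel {
  // Hypothetical layout: key "c" occurs in both partitions
  val partitions: List[List[(String, Int)]] = List(
    List(("a", 1), ("a", 2), ("c", 3)),
    List(("b", 4), ("c", 5), ("c", 6))
  )

  def foldByKey(zeroValue: Int)(func: (Int, Int) => Int): Map[String, Int] = {
    // Within each partition: fold the values of every key, starting from zeroValue
    val partials: List[(String, Int)] = partitions.flatMap { p =>
      p.groupBy(_._1).map { case (k, kvs) =>
        (k, kvs.map(_._2).foldLeft(zeroValue)(func))
      }.toList
    }
    // Across partitions: the same function merges the partial results
    partials.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).reduce(func)) }
  }

  def main(args: Array[String]): Unit = {
    println(foldByKey(0)(_ + _)) // neutral initial value: same as a plain per-key sum
    println(foldByKey(5)(_ + _)) // 5 is added once per key per partition
  }
}
```

With an initial value of 0 the model returns the plain per-key sums; with 5, key "c" receives the initial value twice because it appears in two partitions.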

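Code example

// When the within-partition and cross-partition rules of aggregateByKey are the same,
// a simpler operator can be used instead

val rdd =
sc.makeRDD(List(
    ("a", 1), ("a", 2), ("c", 3),
    ("b", 4), ("c", 5), ("c", 6)
), 2)

// With addition, an initial value of 0 would make this equivalent to reduceByKey(_ + _).
// Here the initial value is 5. It is applied once per key in each partition,
// so "c", which occurs in both partitions, receives it twice: (5 + 3) + (5 + 5 + 6) = 24
val resultRDD = rdd.foldByKey(5)(_ + _)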

resultRDD.collect().foreach(println)
// Result:
// (b,9)
// (a,8)
// (c,24)

combineByKey()

Function signatures

def combineByKey[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C): RDD[(K, C)] = self.withScope {
    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners)(null)
}

def combineByKey[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    numPartitions: Int): RDD[(K, C)] = self.withScope {
    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners, numPartitions)(null)
}

def combineByKey[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null): RDD[(K, C)] = self.withScope {
    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners,
                             partitioner, mapSideCombine, serializer)(null)
}

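Description

combineByKey() is the most general aggregation function for key-value RDDs.
Compared with aggregateByKey(): combineByKey() applies its first argument, createCombiner, to the first value seen for each key in a partition (and may change the type in doing so), whereas aggregateByKey() gives each key in each partition an initial value and leaves the original values untouched.
If the result type is the same as the value type, aggregateByKey() is usually the simpler choice.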

Code example

import org.apache.spark.rdd.RDD

// Goal: compute the average value per key via an intermediate (total, count) pair
val rdd =
    sc.makeRDD(
        List(
            ("a", 88), ("b", 95), ("a", 91),
            ("b", 93), ("a", 95), ("b", 98))
        , 2)

val rdd1: RDD[(String, (Int, Int))] = rdd.combineByKey(
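    // Convert the first value seen for a key within a partition
    // Here the Int becomes a tuple (Int, Int) = (total, count)
    (x: Int) => (x, 1),
    // Within a partition: add the value to the total and increase the count by 1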
    (x: (Int, Int), y: Int) => {
        (x._1 + y, x._2 + 1)
    },
    // Across partitions: add the totals and add the counts
    (x: (Int, Int), y: (Int, Int)) => {
        (x._1 + y._1, x._2 + y._2)
    }
)
// Integer division: the average is truncated to an Int
val resultRDD = rdd1.mapValues(
    t => t._1 / t._2
)
resultRDD.collect().foreach(println)
// Result:
// (b,95)
// (a,91)

Comparison

Key code side by side

// --- Key code side by side ---
// groupByKey
combineByKeyWithClassTag[CompactBuffer[V]](
        createCombiner, 
        mergeValue, 
        mergeCombiners, 
        partitioner, 
    	// map-side combining is disabled here
        mapSideCombine = false)

// reduceByKey 
combineByKeyWithClassTag[V]((v: V) => 
               v, 
               func, 
               func, 
               partitioner)
// aggregateByKey
combineByKeyWithClassTag[U]((v: V) => 
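               // createCombiner: apply the within-partition rule to the initial value and v
               cleanedSeqOp(createZero(), v),
               // then keep applying the within-partition rule inside the partition
               cleanedSeqOp,
               // cross-partition rule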
               combOp, 
               partitioner)

// foldByKey
combineByKeyWithClassTag[V]((v: V) => 
               cleanedFunc(createZero(), v),
               // same rule within and across partitions
               cleanedFunc, 
               cleanedFunc, 
               partitioner)
// combineByKey
combineByKeyWithClassTag(
    			// the first argument processes the first value seen for a key and may change its type
    			createCombiner, 
                mergeValue, 
                mergeCombiners,
                partitioner, 
    			// map-side pre-aggregation; the calls above omit this argument and so use the default, true
                mapSideCombine, 
    			// serializer, null by default
                serializer)(null)

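Comparing the five operators

Underneath, all five use the same logic (combineByKeyWithClassTag).
groupByKey does no map-side pre-aggregation.
reduceByKey leaves the first value untouched and uses the same rule within and across partitions.
aggregateByKey combines the initial value with the first value seen for each key using the within-partition rule; the rules within and across partitions may differ.
foldByKey uses the same rule within and across partitions and applies that same rule to the initial value and the first value; it is a simplified aggregateByKey.
combineByKey's first argument processes the first value seen for each key in a partition and may change its type, so no initial value is needed.
The four operators other than groupByKey all pre-aggregate on the map side, so their shuffles are cheaper.
All of the operators above can be used to implement WordCount.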
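The shared structure of the five operators can be illustrated with a plain-Scala model in which one generic combineByKey function backs the other four, followed by a word count written with each of them. This is a sketch under assumptions, not Spark's code: the object CombineByKeyModel, the List-of-Lists partitions and the sample words are hypothetical, and the model ignores the mapSideCombine = false setting of groupByKey because that setting changes what is shuffled, not the result.

```scala
// A plain-Scala model: four operators expressed through one combineByKey (no Spark dependency).
object CombineByKeyModel {
  def combineByKey[K, V, C](partitions: List[List[(K, V)]])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    val partials: List[(K, C)] = partitions.flatMap { p =>
      p.groupBy(_._1).map { case (k, kvs) =>
        val values = kvs.map(_._2)
        // The first value of a key in this partition goes through createCombiner,
        // every later value is merged in with mergeValue
        (k, values.tail.foldLeft(createCombiner(values.head))(mergeValue))
      }.toList
    }
    // Across partitions, the combiners of each key are merged with mergeCombiners
    partials.groupBy(_._1).map { case (k, kcs) => (k, kcs.map(_._2).reduce(mergeCombiners)) }
  }

  def reduceByKey[K, V](ps: List[List[(K, V)]])(func: (V, V) => V): Map[K, V] =
    combineByKey[K, V, V](ps)(v => v, func, func)

  def foldByKey[K, V](ps: List[List[(K, V)]], zero: V)(func: (V, V) => V): Map[K, V] =
    combineByKey[K, V, V](ps)(v => func(zero, v), func, func)

  def aggregateByKey[K, V, U](ps: List[List[(K, V)]], zero: U)(
      seqOp: (U, V) => U, combOp: (U, U) => U): Map[K, U] =
    combineByKey[K, V, U](ps)(v => seqOp(zero, v), seqOp, combOp)

  def groupByKey[K, V](ps: List[List[(K, V)]]): Map[K, List[V]] =
    combineByKey[K, V, List[V]](ps)(v => List(v), (buf, v) => buf :+ v, (b1, b2) => b1 ++ b2)

  // Hypothetical input: words spread over two partitions, each mapped to (word, 1)
  val pairs: List[List[(String, Int)]] =
    List(List("a", "b", "a"), List("b", "a")).map(_.map(w => (w, 1)))

  def main(args: Array[String]): Unit = {
    println(groupByKey(pairs).map { case (w, ones) => (w, ones.sum) })
    println(reduceByKey(pairs)(_ + _))
    println(aggregateByKey(pairs, 0)(_ + _, _ + _))
    println(foldByKey(pairs, 0)(_ + _))
    println(combineByKey[String, Int, Int](pairs)(v => v, _ + _, _ + _))
  }
}
```

All five calls in main produce the same word counts; they differ only in which of the three functions the caller has to supply.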

Comparing the four aggregation operators

Operator	Initial value	Within-partition rule	Cross-partition rule	Same rule for both?
reduceByKey	none	func: (V, V) => V	func: (V, V) => V	yes
aggregateByKey	zeroValue: U	seqOp: (U, V) => U	combOp: (U, U) => U	no
foldByKey	zeroValue: V	func: (V, V) => V	func: (V, V) => V	yes
combineByKey	createCombiner: V => C	mergeValue: (C, V) => C	mergeCombiners: (C, C) => C	no
