PairRDDFunctions
PairRDDFunctions extends the RDD API with extra methods for key-value RDDs (RDD[(K, V)]).
These methods become available on any RDD of pairs through Scala implicit conversion.
combineByKeyWithClassTag
def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)
The main parameters are the three functions:
- createCombiner: takes the current value and turns it into a new combiner value
- mergeValue: merges a value (within a partition) into a combiner previously produced by createCombiner
- mergeCombiners: merges (across partitions) the outputs produced by the mergeValue calls
- partitioner: the partitioner for the output RDD
- mapSideCombine: whether to enable map-side aggregation
- serializer: the serializer to use
combineByKeyWithClassTag is the foundation on which the functions below are built; they differ mainly in how the three functions are implemented.
These operators can often substitute for one another, so when there is a choice, prefer the one that performs better.
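To make the three functions concrete, here is a Spark-free sketch that simulates the map-side / reduce-side combine over a list of "partitions" in plain Scala, computing a per-key average. The object name, the `averageByKey` helper, and the partition layout are illustrative, not Spark API:

```scala
// Spark-free sketch of combineByKeyWithClassTag's three functions,
// computing a per-key average with a (sum, count) combiner.
object CombineByKeySketch {
  // createCombiner: the first value seen for a key becomes a (sum, count) pair
  val createCombiner: Int => (Int, Int) = v => (v, 1)
  // mergeValue: fold another value from the SAME partition into the combiner
  val mergeValue: ((Int, Int), Int) => (Int, Int) =
    (c, v) => (c._1 + v, c._2 + 1)
  // mergeCombiners: merge combiners coming from DIFFERENT partitions
  val mergeCombiners: ((Int, Int), (Int, Int)) => (Int, Int) =
    (c1, c2) => (c1._1 + c2._1, c1._2 + c2._2)

  def averageByKey(partitions: Seq[Seq[(String, Int)]]): Map[String, Double] = {
    // map side: combine inside each partition
    val local = partitions.map { part =>
      part.foldLeft(Map.empty[String, (Int, Int)]) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).fold(createCombiner(v))(mergeValue(_, v)))
      }
    }
    // reduce side: merge the per-partition combiners
    val merged = local.reduce { (m1, m2) =>
      m2.foldLeft(m1) { case (acc, (k, c)) =>
        acc.updated(k, acc.get(k).fold(c)(mergeCombiners(_, c)))
      }
    }
    merged.map { case (k, (sum, cnt)) => k -> sum.toDouble / cnt }
  }
}
```

For example, `averageByKey(Seq(Seq("a" -> 0, "a" -> 2, "b" -> 5), Seq("a" -> 1, "b" -> 8)))` yields a -> (0+2+1)/3 = 1.0 and b -> (5+8)/2 = 6.5.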
combineByKey
A simplified version of combineByKeyWithClassTag that hash-partitions the output RDD.
/**
 * Simplified version of combineByKeyWithClassTag that hash-partitions the output RDD.
 * This method is here for backward compatibility. It does not provide combiner
 * classtag information to the shuffle.
 *
 * @see `combineByKeyWithClassTag`
 */
def combineByKey[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    numPartitions: Int): RDD[(K, C)] = self.withScope {
  combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners, numPartitions)(null)
}

/**
 * Simplified version of combineByKeyWithClassTag that hash-partitions the output RDD.
 */
def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    numPartitions: Int)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners,
    new HashPartitioner(numPartitions))
}
reduceByKey
reduceByKey takes a single function func: (V, V) => V; both the in-partition and cross-partition merges use this same function.
/**
 * Merge the values for each key using an associative and commutative reduce function. This will
 * also perform the merging locally on each mapper before sending results to a reducer, similarly
 * to a "combiner" in MapReduce.
 */
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
aggregateByKey
aggregateByKey takes an initial value (each partition gets its own copy of the initial value)
and two functions, corresponding to combineByKeyWithClassTag's mergeValue and mergeCombiners;
the two functions may have different processing logic.
def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
  // Serialize the zero value to a byte array so that we can get a new clone of it on each key
  val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
  val zeroArray = new Array[Byte](zeroBuffer.limit)
  zeroBuffer.get(zeroArray)
  lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
  val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))
  // We will clean the combiner closure later in `combineByKey`
  val cleanedSeqOp: (U, V) => U = self.context.clean(seqOp)
  combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
    cleanedSeqOp, combOp, partitioner)
}
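The key point, that the zero value is injected once per key per partition (which is why the trace later in this post shows mergeValue(2, ...) at the start of every key in every partition), can be sketched in plain Scala. `AggregateByKeySketch` and its helper are hypothetical names:

```scala
// Spark-free sketch of aggregateByKey: the zero value is used once per
// key PER PARTITION, so if a key appears in P partitions, the zero is
// folded into its result P times.
object AggregateByKeySketch {
  def aggregateByKey[K, V, U](partitions: Seq[Seq[(K, V)]], zero: U)(
      seqOp: (U, V) => U, combOp: (U, U) => U): Map[K, U] = {
    // map side: seqOp starts from a fresh copy of zero for each key
    val local = partitions.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zero), v))
      }
    }
    // reduce side: combOp merges per-partition aggregates (no zero here)
    local.reduce { (m1, m2) =>
      m2.foldLeft(m1) { case (acc, (k, u)) =>
        acc.updated(k, acc.get(k).fold(u)(combOp(_, u)))
      }
    }
  }
}
```

With zero = 2 and addition, `Seq(Seq("a" -> 0, "a" -> 2), Seq("a" -> 1))` gives (2+0+2) + (2+1) = 7, not 5: the zero is counted once per partition containing the key.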
foldByKey
foldByKey has an initial value and a single func: (V, V) => V; the in-partition and cross-partition logic are identical, so foldByKey can be viewed as a special case of aggregateByKey.
def foldByKey(
    zeroValue: V,
    partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  // Serialize the zero value to a byte array so that we can get a new clone of it on each key
  val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
  val zeroArray = new Array[Byte](zeroBuffer.limit)
  zeroBuffer.get(zeroArray)
  // When deserializing, use a lazy val to create just one instance of the serializer per task
  lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
  val createZero = () => cachedSerializer.deserialize[V](ByteBuffer.wrap(zeroArray))
  val cleanedFunc = self.context.clean(func)
  combineByKeyWithClassTag[V](
    (v: V) => cleanedFunc(createZero(), v), // createCombiner
    cleanedFunc,                            // mergeValue
    cleanedFunc,                            // mergeCombiners
    partitioner)
}
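Under that reading, foldByKey is just aggregateByKey specialised to U == V with seqOp == combOp == func. A sketch (the `FoldByKeySketch` name and helper are illustrative):

```scala
// Spark-free sketch: foldByKey is aggregateByKey with U == V and the
// same func used both within and across partitions. The zero value is
// still applied once per key per partition.
object FoldByKeySketch {
  def foldByKey[K, V](partitions: Seq[Seq[(K, V)]], zero: V)(func: (V, V) => V): Map[K, V] = {
    // map side: fold each partition's values for a key, starting from zero
    val local = partitions.map { part =>
      part.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
        acc.updated(k, func(acc.getOrElse(k, zero), v))
      }
    }
    // reduce side: merge per-partition results with the same func
    local.reduce { (m1, m2) =>
      m2.foldLeft(m1) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).fold(v)(func(_, v)))
      }
    }
  }
}
```

With zero = 0 and addition this behaves like reduceByKey; with a non-neutral zero, the same per-partition caveat as aggregateByKey applies.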
groupByKey
The three functions of groupByKey:
- createCombiner: wraps the incoming value in a CompactBuffer
- mergeValue: appends a new value to the buffer produced by createCombiner
- mergeCombiners: concatenates the per-partition buffers into a single sequence
A further drawback of groupByKey is that it cannot enable map-side aggregation (mapSideCombine = false).
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
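The buffer-based combiners can be sketched in plain Scala, with ArrayBuffer standing in for Spark's CompactBuffer (the `GroupByKeySketch` object and helper are illustrative names):

```scala
import scala.collection.mutable.ArrayBuffer

// Spark-free sketch of groupByKey's combiners: create a one-element
// buffer, append within a partition, concatenate across partitions.
object GroupByKeySketch {
  def groupByKey[K, V](partitions: Seq[Seq[(K, V)]]): Map[K, Seq[V]] = {
    val createCombiner = (v: V) => ArrayBuffer(v)
    val mergeValue = (buf: ArrayBuffer[V], v: V) => buf += v
    val mergeCombiners = (b1: ArrayBuffer[V], b2: ArrayBuffer[V]) => b1 ++= b2
    // within each partition: build a buffer per key (no aggregation happens,
    // every value is retained -- this is why map-side combine would not help)
    val local = partitions.map { part =>
      part.foldLeft(Map.empty[K, ArrayBuffer[V]]) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).fold(createCombiner(v))(mergeValue(_, v)))
      }
    }
    // across partitions: concatenate the per-partition buffers
    local.reduce { (m1, m2) =>
      m2.foldLeft(m1) { case (acc, (k, b)) =>
        acc.updated(k, acc.get(k).fold(b)(mergeCombiners(_, b)))
      }
    }.map { case (k, buf) => k -> buf.toSeq }
  }
}
```

Note that unlike the sum-style combiners above, the buffers only grow: shuffled data volume is not reduced, which matches the comment in the Spark source about why mapSideCombine is disabled.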
Code example
val seq = Seq("a" -> 0, "a" -> 2, "b" -> 5, "a" -> 1, "b" -> 8, "a" -> 4, "b" -> 6, "b" -> 7, "a" -> 3, "b" -> 9)
val groups = sc.parallelize(seq, numSlices = 2)
def createCombiner(n: Int) = {
println(s"createCombiner($n)")
n
}
def mergeValue(n1: Int, n2: Int) = {
println(s"mergeValue($n1, $n2)")
n1 + n2
}
def mergeCombiners(c1: Int, c2: Int) = {
println(s"mergeCombiners($c1, $c2)")
c1 + c2
}
val combineByKeyWithClassTagRDD = groups.combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners)
println("======combineByKeyWithClassTagRDD======")
combineByKeyWithClassTagRDD.foreach(println)
//
println("======combineByKeyRDD======")
// combineByKey
val combineByKeyRDD = groups.combineByKey(createCombiner, mergeValue, mergeCombiners)
combineByKeyRDD.foreach(println)
println("======aggregateByKeyRDD======")
// aggregateByKeyRDD
val aggregateByKeyRDD = groups.aggregateByKey(2)(mergeValue, mergeCombiners)
aggregateByKeyRDD.foreach(println)
println("=====reduceByKey=======")
// reduceByKey
val reduceByKeyRDD = groups.reduceByKey(mergeCombiners)
reduceByKeyRDD.foreach(println)
println("======groupByKey======")
// groupByKey
val groupByKeyRDD = groups.groupByKey()
groupByKeyRDD.foreach(println)
println("======foldByKeyRDD======")
// foldByKeyRDD
val foldByKeyRDD = groups.foldByKey(0)(mergeCombiners)
foldByKeyRDD.foreach(println)
======combineByKeyWithClassTagRDD======
createCombiner(4)
createCombiner(0)
mergeValue(0, 2)
createCombiner(6)
mergeValue(6, 7)
createCombiner(5)
mergeValue(2, 1)
mergeValue(4, 3)
mergeValue(5, 8)
mergeValue(13, 9)
mergeCombiners(3, 7)
mergeCombiners(13, 22)
(a,10)
(b,35)
======combineByKeyRDD======
createCombiner(0)
createCombiner(4)
mergeValue(0, 2)
createCombiner(6)
createCombiner(5)
mergeValue(6, 7)
mergeValue(2, 1)
mergeValue(4, 3)
mergeValue(5, 8)
mergeValue(13, 9)
mergeCombiners(13, 22)
mergeCombiners(3, 7)
(b,35)
(a,10)
======aggregateByKeyRDD======
mergeValue(2, 4)
mergeValue(2, 0)
mergeValue(2, 2)
mergeValue(2, 6)
mergeValue(2, 5)
mergeValue(8, 7)
mergeValue(4, 1)
mergeValue(6, 3)
mergeValue(7, 8)
mergeValue(15, 9)
mergeCombiners(15, 24)
mergeCombiners(5, 9)
(a,14)
(b,39)
=====reduceByKey=======
mergeCombiners(0, 2)
mergeCombiners(6, 7)
mergeCombiners(2, 1)
mergeCombiners(4, 3)
mergeCombiners(5, 8)
mergeCombiners(13, 9)
mergeCombiners(13, 22)
mergeCombiners(3, 7)
(b,35)
(a,10)
======groupByKey======
(a,CompactBuffer(0, 2, 1, 4, 3))
(b,CompactBuffer(5, 8, 6, 7, 9))
======foldByKeyRDD======
mergeCombiners(0, 4)
mergeCombiners(0, 0)
mergeCombiners(0, 6)
mergeCombiners(0, 2)
mergeCombiners(6, 7)
mergeCombiners(4, 3)
mergeCombiners(0, 5)
mergeCombiners(13, 9)
mergeCombiners(2, 1)
mergeCombiners(5, 8)
mergeCombiners(3, 7)
mergeCombiners(13, 22)
(a,10)
(b,35)