1. combineByKey: combines the values for each key using a user-supplied set of aggregation functions, turning an input of type RDD[(K, V)] into an RDD[(K, C)].
Function prototypes

def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C, numPartitions: Int): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C, partitioner: Partitioner,
    mapSideCombine: Boolean = true, serializer: Serializer = null): RDD[(K, C)]

The first two overloads are implemented in terms of the third, using a HashPartitioner and a null Serializer. With the third overload you can choose the partitioner yourself and, if needed, specify the serializer. combineByKey is an important function: familiar operations such as aggregateByKey, foldByKey and reduceByKey are all built on top of it. By default, combining is also performed on the map side.
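For intuition, here is a rough sketch of how reduceByKey can be expressed through combineByKey (a simplification for illustration, not the exact Spark source; it assumes Spark 1.3+ so the pair-RDD implicits are in scope):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// The combiner for the first value of a key is the value itself,
// and both merge steps reuse the same reduce function.
def myReduceByKey[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)],
    func: (V, V) => V): RDD[(K, V)] =
  rdd.combineByKey((v: V) => v, func, func)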
Example:
scala> val data = sc.parallelize(List((1, "www"), (1, "iteblog"), (1, "com"),
     | (2, "bbs"), (2, "iteblog"), (2, "com"), (3, "good")))
data: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[15] at parallelize at <console>:12

scala> val result = data.combineByKey(List(_),
     | (x: List[String], y: String) => y :: x, (x: List[String], y: List[String]) => x ::: y)
result: org.apache.spark.rdd.RDD[(Int, List[String])] = ShuffledRDD[19] at combineByKey at <console>:14

scala> result.collect
res20: Array[(Int, List[String])] = Array((1,List(www, iteblog, com)), (2,List(bbs, iteblog, com)), (3,List(good)))
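To make the roles of the three arguments explicit, the same call can be written with named functions (a restatement of the example above, not new behavior):

// createCombiner: turns the first value seen for a key into the initial combiner
val createCombiner = (v: String) => List(v)
// mergeValue: folds another value for the same key into an existing combiner (within one partition)
val mergeValue = (c: List[String], v: String) => v :: c
// mergeCombiners: merges combiners produced on different partitions
val mergeCombiners = (c1: List[String], c2: List[String]) => c1 ::: c2

val result = data.combineByKey(createCombiner, mergeValue, mergeCombiners)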
scala> val data = sc.parallelize(List(("iteblog", 1), ("bbs", 1), ("iteblog", 3)))
data: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[24] at parallelize at <console>:12

scala> val result = data.combineByKey(x => x,
     | (x: Int, y: Int) => x + y, (x: Int, y: Int) => x + y)
result: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[25] at combineByKey at <console>:14

scala> result.collect
res27: Array[(String, Int)] = Array((iteblog,4), (bbs,1))

This second example is really just a word count. In fact, reduceByKey performs the same kind of computation, and (x: Int, y: Int) => x + y is exactly the function we would pass to reduceByKey.
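As a quick check, the equivalent reduceByKey call would be (a sketch, not part of the original session):

// Equivalent word count via reduceByKey; yields Array((iteblog,4), (bbs,1))
data.reduceByKey((x: Int, y: Int) => x + y).collect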
2. collect: converts the RDD into a Scala array and returns it.
Function prototypes

def collect(): Array[T]
def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U]
Note: avoid calling collect when the dataset is large, because pulling all the data back to the driver can cause an out-of-memory error on the driver side.
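A minimal sketch of collect in action (this session is illustrative, not from the original article):

scala> val nums = sc.parallelize(1 to 5)
scala> nums.collect
res0: Array[Int] = Array(1, 2, 3, 4, 5)

For large RDDs, prefer something like nums.take(10) to inspect a small sample instead of materializing everything on the driver.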
3. collectAsMap: similar in spirit to collect. This function is used on pair RDDs and returns a result of type Map.
Function prototype
def collectAsMap(): Map[K, V]
Example:
scala> val data = sc.parallelize(List((1, "www"), (1, "iteblog"), (1, "com"),
     | (2, "bbs"), (2, "iteblog"), (2, "com"), (3, "good")))
data: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[26] at parallelize at <console>:12
scala> data.collectAsMap
res28: scala.collection.Map[Int,String] = Map(2 -> com, 1 -> com, 3 -> good)

As the result shows, when the same key maps to multiple values in the RDD, later values overwrite earlier ones, so each key in the final Map is unique and maps to a single value.
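If you need to keep every value for a key rather than only the last one, one option (a sketch, not from the original text) is to group first:

// Keeps all values per key, giving a Map[Int, Iterable[String]]
data.groupByKey().collectAsMap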
4. cogroup: groups together the values associated with the same key across several RDDs.
Example:
scala> val data1 = sc.parallelize(List((1, "www"), (2, "bbs")))
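The example breaks off here; a minimal sketch of how it could continue (data2 and the output below are assumptions for illustration, not from the original):

scala> val data2 = sc.parallelize(List((1, "iteblog"), (3, "good")))
scala> data1.cogroup(data2).collect
// For each key, cogroup pairs up the Iterables of values from both RDDs, e.g.:
// Array((1,(CompactBuffer(www),CompactBuffer(iteblog))),
//       (2,(CompactBuffer(bbs),CompactBuffer())),
//       (3,(CompactBuffer(),CompactBuffer(good))))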