RDD Transformation Operators: Key-Value Type
1. partitionBy(partitioner)
- Purpose: Repartitions the RDD. If the RDD's existing partitioner is the same as the one passed in, no repartitioning is done; otherwise a ShuffledRDD is generated, i.e., a shuffle takes place.
- Example:
```scala
scala> val rdd = sc.makeRDD(Array((1,"a"), (2,"b"), (3,"c"), (4,"d")))
rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[121] at makeRDD at <console>:24

scala> rdd.partitions.length
res59: Int = 4

scala> val newRdd = rdd.partitionBy(new org.apache.spark.HashPartitioner(3))
newRdd: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[122] at partitionBy at <console>:26

scala> newRdd.partitions.length
res60: Int = 3
```
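As a quick check of the no-shuffle claim, the sketch below (assuming the `newRdd` from the session above) re-applies an equal `HashPartitioner` and verifies that Spark hands back the very same RDD object instead of scheduling another shuffle:

```scala
import org.apache.spark.HashPartitioner

// HashPartitioner equality is based on the number of partitions, so an
// equal partitioner makes partitionBy return the original RDD untouched.
val same = newRdd.partitionBy(new HashPartitioner(3))
println(same eq newRdd)   // true  -- no new ShuffledRDD is created

val repartitioned = newRdd.partitionBy(new HashPartitioner(5))
println(repartitioned eq newRdd)   // false -- a fresh ShuffledRDD
```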
2. reduceByKey(func, [numTasks])
- Purpose: When called on an RDD of (K,V) pairs, returns an RDD of (K,V) pairs in which the values of each key are aggregated with the given reduce function. The number of reduce tasks can be set through the optional second parameter.
- Example:
scala> val rdd = sc.makeRDD(Array(("male",1), ("female",3), ("female",2), ("male",5))) rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[123] at makeRDD at <console>:24 scala> val newRdd = rdd.reduceByKey(_+_) newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[124] at reduceByKey at <console>:26 scala> newRdd.collect res61: Array[(String, Int)] = Array((female,5), (male,6))
3. groupByKey()
- Purpose: Groups the values by key.
- Example:
```scala
scala> val rdd1 = sc.makeRDD(Array("hello","world","hello","spark","hello","scala"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[126] at makeRDD at <console>:24

scala> val rdd2 = rdd1.map((_,1))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[127] at map at <console>:26

scala> val rdd3 = rdd2.groupByKey
rdd3: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[128] at groupByKey at <console>:28

scala> rdd3.collect
res62: Array[(String, Iterable[Int])] = Array((spark,CompactBuffer(1)), (scala,CompactBuffer(1)), (hello,CompactBuffer(1, 1, 1)), (world,CompactBuffer(1)))

scala> val newRdd = rdd3.map(t => (t._1, t._2.sum))
newRdd: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[129] at map at <console>:30

scala> newRdd.collect
res63: Array[(String, Int)] = Array((spark,1), (scala,1), (hello,3), (world,1))
```
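Note that the groupByKey-then-sum pipeline above can be collapsed into a single reduceByKey, which pre-aggregates on the map side and shuffles far less data. A minimal sketch, reusing `rdd2` from the session above:

```scala
// Equivalent to rdd3.map(t => (t._1, t._2.sum)), but values are combined
// within each partition before the shuffle instead of being shipped raw.
val wordCounts = rdd2.reduceByKey(_ + _)
wordCounts.collect   // Array((spark,1), (scala,1), (hello,3), (world,1))
```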
4. aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
- Purpose: Aggregates the values of each key using the given combine functions and an initial zero value.
  1. zeroValue: the initial value given to each key in every partition.
  2. seqOp: folds each value into the accumulator, starting from the initial value; it merges values within a single partition.
  3. combOp: merges the accumulated results of the individual partitions; it merges across partitions.
- Example:
```scala
// Create a pair RDD; within each partition take the maximum value for each
// key, then add the per-partition maxima together.
scala> val rdd = sc.makeRDD(Array(("a",3),("a",2),("c",4),("b",3),("c",6),("c",8)), 2)
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[130] at makeRDD at <console>:24

scala> val newRdd = rdd.aggregateByKey(Int.MinValue)(math.max(_,_), _+_)
newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[132] at aggregateByKey at <console>:26

scala> newRdd.collect
res65: Array[(String, Int)] = Array((b,3), (a,3), (c,12))
```
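To see where (c,12) comes from, the sketch below (assuming the `rdd` from the session above) prints the contents of the two partitions: the maximum for "c" is 4 in one partition and 8 in the other, and combOp adds them.

```scala
// glom turns each partition into an array so it can be inspected directly.
rdd.glom().collect().zipWithIndex.foreach { case (part, i) =>
  println(s"partition $i: ${part.mkString(", ")}")
}
// partition 0: (a,3), (a,2), (c,4)
// partition 1: (b,3), (c,6), (c,8)
// seqOp (max) gives c -> 4 and c -> 8; combOp (+) gives 4 + 8 = 12
```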
5. foldByKey(zeroValue)(func)
- Purpose: A simplified form of aggregateByKey in which seqOp and combOp are the same function.
- Example:
scala> val rdd = sc.makeRDD(Array(("a",3),("a",2),("c",4),("b",3),("c",6),("c",8))) rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[133] at makeRDD at <console>:24 scala> val newRdd = rdd.foldByKey(0)(_+_) newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[134] at foldByKey at <console>:26 scala> newRdd.collect res66: Array[(String, Int)] = Array((a,5), (b,3), (c,18))
6. combineByKey[C]
- Purpose: For each key K, merges the values V into a combined type C, producing an RDD[(K,C)]. It takes three functions: one that creates a combiner from the first value of a key (V => C), one that merges a value into a combiner within a partition ((C,V) => C), and one that merges combiners across partitions ((C,C) => C).
- Example:
```scala
// Create a pair RDD and compute the average value for each key: first count
// each key's occurrences and sum its values, then divide the two.
scala> val rdd = sc.makeRDD(Array(("a",88),("b",95),("a",91),("b",93),("a",95),("b",98)), 2)
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[150] at makeRDD at <console>:24

scala> val newRdd = rdd.combineByKey((_,1), (acc:(Int,Int), v) => (acc._1+v, acc._2+1), (acc1:(Int,Int), acc2:(Int,Int)) => (acc1._1+acc2._1, acc1._2+acc2._2))
newRdd: org.apache.spark.rdd.RDD[(String, (Int, Int))] = ShuffledRDD[151] at combineByKey at <console>:26

scala> newRdd.collect
res71: Array[(String, (Int, Int))] = Array((b,(286,3)), (a,(274,3)))

scala> val resRdd = newRdd.map(item => (item._1, item._2._1.toInt / item._2._2))
resRdd: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[152] at map at <console>:28

scala> resRdd.collect
res72: Array[(String, Int)] = Array((b,95), (a,91))
```
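Note that the integer division in the last step truncates (274 / 3 = 91). A minimal variant reusing `newRdd` from the session above uses mapValues to keep the keys and produce an exact Double average:

```scala
// mapValues transforms only the (sum, count) pair; the key is untouched.
val avgRdd = newRdd.mapValues { case (sum, count) => sum.toDouble / count }
avgRdd.collect   // Array((b,95.33...), (a,91.33...))
```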
7. sortByKey
- Purpose: When called on an RDD of (K,V) pairs where K implements the Ordered trait (or has an implicit Ordering[K] in scope), returns an RDD of (K,V) pairs sorted by key.
- Example:
scala> val rdd = sc.makeRDD(Array((1,"a"),(5,"e"),(4,"c"),(2,"b"),(10,"s"))) rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[135] at makeRDD at <console>:24 scala> val newRdd = rdd.sortByKey() newRdd: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[138] at sortByKey at <console>:26 scala> newRdd.collect res67: Array[(Int, String)] = Array((1,a), (2,b), (4,c), (5,e), (10,s))
8. mapValues
- Purpose: For an RDD of (K,V) pairs, applies a function to the values only; the keys are left unchanged.
- Example:
scala> val rdd = sc.makeRDD(Array((1,"a"),(5,"e"),(4,"c"),(2,"b"),(10,"s"))) rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[139] at makeRDD at <console>:24 scala> val newRdd = rdd.mapValues("<" + _ + ">") newRdd: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[140] at mapValues at <console>:26 scala> newRdd.collect res68: Array[(Int, String)] = Array((1,<a>), (5,<e>), (4,<c>), (2,<b>), (10,<s>))
9. join(otherDataSet, [numTasks])
- Purpose: An inner join. When called on RDDs of types (K,V) and (K,W), returns an RDD of (K,(V,W)) pairs containing all pairs of elements for each matching key.
- Example:
```scala
scala> var rdd1 = sc.parallelize(Array((1, "a"), (1, "b"), (2, "c")))
rdd1: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[141] at parallelize at <console>:24

scala> var rdd2 = sc.parallelize(Array((1, "aa"), (3, "bb"), (2, "cc")))
rdd2: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[142] at parallelize at <console>:24

scala> var newRdd = rdd1.join(rdd2)
newRdd: org.apache.spark.rdd.RDD[(Int, (String, String))] = MapPartitionsRDD[145] at join at <console>:28

scala> newRdd.collect
res69: Array[(Int, (String, String))] = Array((1,(a,aa)), (1,(b,aa)), (2,(c,cc)))
```
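Keys that exist on only one side are dropped by the inner join (key 3 above). The outer-join variants keep them, wrapping the possibly-missing side in Option; a minimal sketch reusing rdd1 and rdd2 from the session above:

```scala
// Every key of rdd1 survives; the right side becomes an Option.
val left = rdd1.leftOuterJoin(rdd2)
left.collect
// Array((1,(a,Some(aa))), (1,(b,Some(aa))), (2,(c,Some(cc)))) -- order may vary

// fullOuterJoin keeps keys from both sides.
val full = rdd1.fullOuterJoin(rdd2)
full.collect   // additionally yields (3,(None,Some(bb))) for the key only in rdd2
```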
10. cogroup(otherDataSet, [numTasks])
- Purpose: When called on RDDs of types (K,V) and (K,W), returns an RDD of type (K,(Iterable[V],Iterable[W])).
- Example:
```scala
scala> val rdd1 = sc.parallelize(Array((1, 10),(2, 20),(1, 100),(3, 30)), 1)
rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[146] at parallelize at <console>:24

scala> val rdd2 = sc.parallelize(Array((1, "a"),(2, "b"),(1, "aa"),(3, "c")), 1)
rdd2: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[147] at parallelize at <console>:24

scala> val newRdd = rdd1.cogroup(rdd2)
newRdd: org.apache.spark.rdd.RDD[(Int, (Iterable[Int], Iterable[String]))] = MapPartitionsRDD[149] at cogroup at <console>:28

scala> newRdd.collect
res70: Array[(Int, (Iterable[Int], Iterable[String]))] = Array((1,(CompactBuffer(10, 100),CompactBuffer(a, aa))), (3,(CompactBuffer(30),CompactBuffer(c))), (2,(CompactBuffer(20),CompactBuffer(b))))
```
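Unlike join, cogroup also emits keys present in only one of the two RDDs, pairing them with an empty Iterable; a minimal sketch with a hypothetical rdd3, reusing rdd1 from the session above:

```scala
// rdd3 shares key 1 with rdd1 but introduces key 4.
val rdd3 = sc.parallelize(Array((1, "x"), (4, "y")), 1)
rdd1.cogroup(rdd3).collect.foreach(println)
// (1,(CompactBuffer(10, 100),CompactBuffer(x)))
// (2,(CompactBuffer(20),CompactBuffer()))   -- key 2 only in rdd1
// (3,(CompactBuffer(30),CompactBuffer()))   -- key 3 only in rdd1
// (4,(CompactBuffer(),CompactBuffer(y)))    -- key 4 only in rdd3
// (output order may vary)
```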