Value-Type RDD Transformation Operators (Part 1): map, mapPartitions, mapPartitionsWithIndex, flatMap, glom, groupBy

Article link: https://blog.csdn.net/wx1528159409/article/details/87205926

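RDDs fall broadly into Value types and Key-Value types, and the Value category also includes double-Value types (operators that work on two Value-type RDDs). What follows is a summary of the transformation operators for Value-type RDDs: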

1. map(func)

2. mapPartitions(func)

3. mapPartitionsWithIndex(func)

4. flatMap(func)

eg1: Telling an RDD of Int apart from an RDD of Range

eg2: For each element of 1, 2, 3, 4, 5, generate the sequence (1 to that element) (flatMap usage 1: element -> sequence)

eg3: map versus flatMap (flatMap usage 2: element -> flattened split)

5. glom()

6. groupBy(func)

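1. map(func)

map applies func to every element of an RDD and returns a new RDD made up of the results. It operates only on the individual elements.

eg: use map to multiply every element of rdd1 by 3 and return the transformed RDD, map1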

scala> var rdd1 = sc.makeRDD(1 to 10)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at makeRDD at <console>:24

scala> rdd1.collect
res1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> val map1 = rdd1.map(_*3)
map1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2] at map at <console>:26

scala> map1.collect
res2: Array[Int] = Array(3, 6, 9, 12, 15, 18, 21, 24, 27, 30)

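2. mapPartitions(func)

The difference from map: map calls func once for every element of the RDD, whereas mapPartitions hands func a whole partition at a time. If a partition holds 10,000 records, the func passed to map runs 10,000 times; with mapPartitions, the task that processes the partition calls func once, and func receives an iterator over all of the partition's data. This tends to be more efficient, especially when func has setup work that can then be done once per partition instead of once per element.

The catch is that the data of a partition can only be released once the whole partition has been processed, which may cause an OOM (out of memory) error, so mapPartitions is generally used when plenty of memory is available.

On the Executor, map handles one record at a time, and each record's reference is dropped as soon as it has been used, so it can be garbage collected; mapPartitions only drops its references once the entire partition has been processed, so it can run out of memory when memory is tight. This applies chiefly when func collects the partition into an in-memory buffer; a func that only chains lazy iterator operations, such as x.map(_*2) in the example below, still streams through the records one at a time.

Function type of func: Iterator[T] => Iterator[U]

ps: in Scala code, func must return an Iterator; if you gather the results in a buffer, return buffer.toIterator

eg: create rdd2 and build a new RDD in which every element is multiplied by 2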

scala> var rdd2 = sc.makeRDD(Array(1,2,3,4,5))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at makeRDD at <console>:24

scala> rdd2.mapPartitions(x => x.map(_*2))
res3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[5] at mapPartitions at <console>:27

scala> res3.collect
res4: Array[Int] = Array(2, 4, 6, 8, 10)
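To make the note about returning an Iterator concrete, here is a minimal sketch of two functions with the shape Iterator[Int] => Iterator[Int]. The names doubleLazily and doubleBuffered are made up for this illustration, and they are exercised on plain Scala iterators so that the snippet runs without Spark.

```scala
import scala.collection.mutable.ArrayBuffer

// Lazy variant (hypothetical name): wraps the incoming iterator, so elements
// are transformed one at a time as the consumer pulls them.
def doubleLazily(items: Iterator[Int]): Iterator[Int] = items.map(_ * 2)

// Buffered variant (hypothetical name): materializes every result of the
// partition in memory before handing back an iterator over the buffer.
def doubleBuffered(items: Iterator[Int]): Iterator[Int] = {
  val buffer = ArrayBuffer.empty[Int]
  items.foreach(x => buffer += x * 2)
  buffer.iterator
}

println(doubleLazily(Iterator(1, 2, 3, 4, 5)).toList)   // List(2, 4, 6, 8, 10)
println(doubleBuffered(Iterator(1, 2, 3, 4, 5)).toList) // List(2, 4, 6, 8, 10)
```

Assuming the rdd2 from the example above, either function could be passed directly in spark-shell, e.g. rdd2.mapPartitions(doubleLazily).collect. The buffered version is the one that carries the memory risk, since it holds a whole partition's results at once.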

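3. mapPartitionsWithIndex(func)

Similar to mapPartitions, but func takes two parameters, index and items, where index is the index of the partition.

Function type of func: (Int, Iterator[T]) => Iterator[U]

eg: create rdd3 and build a new RDD in which every element is paired in a tuple with the index of its partition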

scala> val rdd3 = sc.parallelize(Array(1,2,3,4))
rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:24

scala> val indexRdd = rdd3.mapPartitionsWithIndex((index,items)=>(items.map((index,_))))
indexRdd: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[11] at mapPartitionsWithIndex at <console>:30

scala> indexRdd.collect
res10: Array[(Int, Int)] = Array((1,1), (3,2), (5,3), (7,4))

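rdd.partitions.length shows that the RDD has 8 partitions (the number of cores on this machine). With only 4 elements spread over 8 partitions, half of the partitions are empty, which is why only the indices 1, 3, 5 and 7 appear above.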

scala> indexRdd.partitions.length
res11: Int = 8
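The partition indices 1, 3, 5, 7 in the output can be reproduced with a little arithmetic. The sketch below assumes that a parallelized collection of n elements gives partition i the elements from position i*n/p up to (but excluding) (i+1)*n/p, where p is the number of partitions. That is my understanding of how Spark slices local collections, and the helper sliceWithIndex is hypothetical, not a Spark API.

```scala
// Hypothetical helper: pair each element with the index of the partition it
// would land in, assuming partition i covers positions [i*n/p, (i+1)*n/p).
def sliceWithIndex[T](data: Seq[T], numSlices: Int): List[(Int, T)] = {
  val n = data.length.toLong
  (0 until numSlices).toList.flatMap { i =>
    val start = ((i * n) / numSlices).toInt
    val end = (((i + 1) * n) / numSlices).toInt
    // An empty slice (start == end) contributes no pairs.
    data.slice(start, end).map(x => (i, x))
  }
}

println(sliceWithIndex(Seq(1, 2, 3, 4), 8)) // List((1,1), (3,2), (5,3), (7,4))
println(sliceWithIndex(Seq(1, 2, 3, 4), 2)) // List((0,1), (0,2), (1,3), (1,4))
```

Under this assumption, asking for 2 partitions instead of 8 would put 1 and 2 in partition 0 and 3 and 4 in partition 1.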

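4. flatMap(func)

flatMap is typically paired with split to cut up fields and then flatten the result.

It applies a map to every element and then flattens. Unlike map, which produces exactly one output element per input element, flatMap produces a sequence of sub-elements for every element and then flattens those sequences by iterating over each sub-element.

ps: what flattening means in Scala: each element is mapped to its sub-elements; put simply, every element is split into its sub-elements, removing one level of nesting.

That may still sound vague, so here is the source code: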

  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
  }

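f has type T => TraversableOnce[U], i.e. it turns each element of the RDD into something iterable (eg: arrays, collections and Range are all iterable, and a String can be treated as a sequence of Char; iterating means traversing each sub-element in turn. A primitive type such as Int is a single value, cannot be iterated, and therefore cannot be flattened);

This is exactly where it differs from map: with map, func may be Int => Int, but with flatMap the result has to be an iterable type, e.g. Int => String by way of x => x.toString.

eg1: Telling an RDD of Int apart from an RDD of Range

(1) RDD of Int: flatMap fails, because the right-hand side of => must be an iterable type but x is an Int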

scala> val rdd4 = sc.makeRDD(Array(1,2,3,4))
rdd4: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at makeRDD at <console>:24

scala> rdd4.flatMap(x=>x).collect
<console>:27: error: type mismatch;
 found   : Int
 required: TraversableOnce[?]
       rdd4.flatMap(x=>x).collect

(2) RDD of Range: flatMap successfully flattens it into individual elements

scala> val rdd5 = sc.makeRDD(Array(1 to 5))
rdd5: org.apache.spark.rdd.RDD[scala.collection.immutable.Range.Inclusive] = ParallelCollectionRDD[13] at makeRDD at <console>:24

scala> rdd5.flatMap(x=>x).collect
res13: Array[Int] = Array(1, 2, 3, 4, 5)

eg2: For each element of 1, 2, 3, 4, 5, generate the sequence (1 to that element) (flatMap usage 1: element -> sequence)

scala> val rdd6 = sc.makeRDD(1 to 5)
rdd6: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[15] at makeRDD at <console>:24

scala> val flat = rdd6.flatMap(1 to _)
flat: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[16] at flatMap at <console>:26

scala> flat.collect
res16: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5)

This is really x => (1 to x): every Int element => a sequence

eg3: map versus flatMap (flatMap usage 2: element -> flattened split)

scala> val arr1 = sc.makeRDD(Array(("A",1),("B",2),("C",3)))
arr1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[17] at makeRDD at <console>:24

scala> arr1.flatMap(x=>(x._1+x._2)).collect
res18: Array[Char] = Array(A, 1, B, 2, C, 3)

scala> arr1.map(x=>(x._1+x._2)).collect
res19: Array[String] = Array(A1, B2, C3)

eg3 makes the difference clear: map goes element => element, whereas flatMap goes element => sequence => each sub-element in turn (the string A1 is iterated into A and 1)
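The same contrast can be tried on ordinary Scala collections, whose map and flatMap behave analogously for this purpose. This is only a sketch without Spark: the names pairs, mapped, flattened and expanded are chosen for illustration, and the String-to-characters step is written out explicitly with toList rather than left to an implicit conversion.

```scala
val pairs = List(("A", 1), ("B", 2), ("C", 3))

// map: one output element per input element.
val mapped: List[String] = pairs.map(x => x._1 + x._2)

// flatMap: each element becomes a sequence (here the characters of the
// concatenated string), and the sequences are joined end to end.
val flattened: List[Char] = pairs.flatMap(x => (x._1 + x._2).toList)

// flatMap with element -> Range, as in eg2.
val expanded: List[Int] = List(1, 2, 3).flatMap(x => 1 to x)

println(mapped)    // List(A1, B2, C3)
println(flattened) // List(A, 1, B, 2, C, 3)
println(expanded)  // List(1, 1, 2, 1, 2, 3)
```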

5. glom()

Turns each partition into an array: RDD[T] => RDD[Array[T]]

eg: create an RDD with 4 partitions and put the data of each partition into its own array

scala> val rdd5 = sc.makeRDD(1 to 12,4)
rdd5: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at makeRDD at <console>:24

scala> rdd5.glom().collect
res21: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9), Array(10, 11, 12))
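One common reason to reach for glom is to compute something per partition, such as each partition's maximum. The sketch below models that idea on plain Scala collections rather than on a real RDD: the nested list partitions stands in for the 4 partitions shown above, and all names are illustrative assumptions.

```scala
// Stand-in for the 4 partitions of 1 to 12 (not a real RDD).
val partitions: List[List[Int]] = (1 to 12).toList.grouped(3).toList

// glom analogue: one array per partition.
val glommed: List[Array[Int]] = partitions.map(_.toArray)

// Per-partition maximum, then a total over those maxima.
val maxPerPartition: List[Int] = glommed.map(_.max)
val sumOfMaxes: Int = maxPerPartition.sum

println(maxPerPartition) // List(3, 6, 9, 12)
println(sumOfMaxes)      // 30
```

In spark-shell, with the 4-partition rdd5 from the example above, the corresponding expression would be along the lines of rdd5.glom().map(_.max).collect.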

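6. groupBy(func)

Groups elements by the return value of func: values that share the same key (the key being the result of func) are placed into one Iterable

eg: create an RDD and group its elements using the element modulo 2 as the key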

scala> val rdd6 = sc.makeRDD(1 to 5)
rdd6: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[23] at makeRDD at <console>:24

scala> val groupBy = rdd6.groupBy(_%2)
groupBy: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[25] at groupBy at <console>:26

scala> groupBy.collect
res22: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(2, 4)), (1,CompactBuffer(1, 3, 5)))

2 and 4 have remainder 0 and form one group; 1, 3 and 5 have remainder 1 and form another
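The key-then-collect behaviour can also be explored on a plain Scala List, whose groupBy returns a Map from key to the grouped values instead of an RDD of (key, Iterable) pairs. This is a sketch without Spark; the names numbers, byRemainder and sums are illustrative, and the follow-up step of summing each group is an added example rather than part of the original walkthrough.

```scala
val numbers = List(1, 2, 3, 4, 5)

// The key is the result of the function: here, the remainder modulo 2.
val byRemainder: Map[Int, List[Int]] = numbers.groupBy(_ % 2)

// A typical follow-up: reduce each group to a single value.
val sums: Map[Int, Int] = byRemainder.map { case (key, values) => key -> values.sum }

println(byRemainder(0)) // List(2, 4)
println(byRemainder(1)) // List(1, 3, 5)
println(sums(0))        // 6
println(sums(1))        // 9
```

On a real RDD the grouping involves a shuffle, which is why the REPL output above reports a ShuffledRDD.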