02 Spark: RDD Transformation Operators for Single-Value Types


朱古力...

Article link: https://blog.csdn.net/weixin_44781238/article/details/106349267

RDD Transformation Operators for Single-Value Types

1. map(func)

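Purpose: Returns a new RDD formed by passing each element of the source RDD through func. In other words, it transforms the data in the RDD element by element.

Example:

// Create an RDD containing 1 to 10, then multiply each element by 2 to form a new RDD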

scala> val rdd = sc.makeRDD(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[40] at makeRDD at <console>:24

scala> val newRdd = rdd.map(_*2)
newRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[41] at map at <console>:26

scala> newRdd.collect
res31: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)

2. mapPartitions(func)

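Purpose: Similar to map(func), but runs separately on each partition, so func must have type Iterator[T] => Iterator[U]. For an RDD with N elements in M partitions, the function passed to map is called N times, while the function passed to mapPartitions is called M times.

Example:

// Create an RDD containing 1 to 10 in 4 partitions, then multiply each element by 2 to form a new RDD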

scala> val rdd = sc.makeRDD(1 to 10, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[42] at makeRDD at <console>:24

scala> val newRdd = rdd.mapPartitions(par=>par.map(_*2))
newRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[43] at mapPartitions at <console>:26

scala> newRdd.collect
res32: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
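
The difference in call counts can be made visible without a cluster. The following is a minimal sketch in plain Scala, not Spark code: the object CallCountSketch and its nested Seq standing in for an RDD's partitions are hypothetical names introduced only for illustration.

```scala
object CallCountSketch {
  // Hypothetical stand-in for an RDD: each inner Seq is one partition.
  val partitions: Seq[Seq[Int]] =
    Seq(Seq(1, 2), Seq(3, 4, 5), Seq(6, 7), Seq(8, 9, 10))

  // Applies f to every element, as map does.
  // Returns the result together with the number of calls made to f.
  def mapStyle(f: Int => Int): (List[Int], Int) = {
    var calls = 0
    val out = partitions.toList.flatMap(_.map { x => calls += 1; f(x) })
    (out, calls)
  }

  // Applies f once to the iterator of each partition, as mapPartitions does.
  // Returns the result together with the number of calls made to f.
  def mapPartitionsStyle(f: Iterator[Int] => Iterator[Int]): (List[Int], Int) = {
    var calls = 0
    val out = partitions.toList.flatMap { p => calls += 1; f(p.iterator).toList }
    (out, calls)
  }

  def main(args: Array[String]): Unit = {
    println(mapStyle(_ * 2)._2)                  // prints 10 (one call per element)
    println(mapPartitionsStyle(_.map(_ * 2))._2) // prints 4 (one call per partition)
  }
}
```

Both versions produce the same elements; only the number of times the user function is entered differs. This is why per-partition setup work, such as opening a connection, is usually placed inside mapPartitions rather than map.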

3. mapPartitionsWithIndex(func)

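Purpose: Similar to mapPartitions(func), but func additionally receives the partition index as an Int, so func must have type (Int, Iterator[T]) => Iterator[U].

Example:

// Create an RDD containing 1 to 10, then build a new RDD of (partition index, element) pairs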

scala> val rdd = sc.makeRDD(1 to 10, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[44] at makeRDD at <console>:24


scala> val newRdd = rdd.mapPartitionsWithIndex((index, par) => par.map((index, _)))
newRdd: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[45] at mapPartitionsWithIndex at <console>:26

scala> newRdd.collect
res33: Array[(Int, Int)] = Array((0,1), (0,2), (1,3), (1,4), (1,5), (2,6), (2,7), (3,8), (3,9), (3,10))

4. flatMap(func)

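Purpose: Similar to map(func), but each input element can be mapped to zero or more output elements, so func must return a sequence rather than a single element; its type is T => TraversableOnce[U].

Example:

// Create an RDD containing 1 to 10, then build a new RDD holding the square and the cube of each element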

scala> val rdd = sc.makeRDD(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[46] at makeRDD at <console>:24

scala> val newRdd = rdd.flatMap(x => Array(x*x, x*x*x))
newRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[48] at flatMap at <console>:26

scala> newRdd.collect
res35: Array[Int] = Array(1, 1, 4, 8, 9, 27, 16, 64, 25, 125, 36, 216, 49, 343, 64, 512, 81, 729, 100, 1000)

5. glom

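Purpose: Merges the elements of each partition into an array; the resulting RDD has type RDD[Array[T]].

Example:

// Create an RDD with 4 partitions and collect the data of each partition into an array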

scala> val rdd = sc.makeRDD(1 to 10, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[49] at makeRDD at <console>:24

scala> val newRdd= rdd.glom
newRdd: org.apache.spark.rdd.RDD[Array[Int]] = MapPartitionsRDD[50] at glom at <console>:26

scala> newRdd.collect
res36: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4, 5), Array(6, 7), Array(8, 9, 10))
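
The uneven partition sizes in this output (2, 3, 2, 3) follow from integer division when a local collection is split. The sketch below is plain Scala, not Spark code. It assumes that slice i covers the index range from i * n / numSlices up to (but excluding) (i + 1) * n / numSlices, which matches the output shown here; treat the formula as an assumption about makeRDD rather than a documented guarantee. SliceSketch and slice are hypothetical names.

```scala
object SliceSketch {
  // Splits data into numSlices contiguous slices. Slice i covers the index
  // range [i * n / numSlices, (i + 1) * n / numSlices), using integer division.
  def slice[T](data: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    require(numSlices >= 1, "numSlices must be positive")
    val n = data.length.toLong
    (0 until numSlices).map { i =>
      val start = ((i * n) / numSlices).toInt
      val end = (((i + 1) * n) / numSlices).toInt
      data.slice(start, end)
    }
  }

  def main(args: Array[String]): Unit = {
    // Prints one line per slice of 1 to 10 split four ways.
    slice(1 to 10, 4).foreach(s => println(s.mkString(", ")))
  }
}
```

Under this assumption, 1 to 10 in 4 slices yields the index ranges [0, 2), [2, 5), [5, 7) and [7, 10), which are exactly the four arrays returned by glom.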

6. groupBy(func)

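Purpose: Groups the elements by the return value of func. The return value serves as the key, and the elements that map to that key are collected into an Iterable, so the resulting RDD has type RDD[(K, Iterable[T])]. The order of elements within each group is not guaranteed and may even differ from one evaluation to the next.

Example:

// Create an RDD and group its elements by parity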

scala> val rdd = sc.makeRDD(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at makeRDD at <console>:24

scala> val newRdd = rdd.groupBy(e => if(e % 2 == 0) "even" else "odd")
newRdd: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[53] at groupBy at <console>:26

scala> newRdd.collect
res37: Array[(String, Iterable[Int])] = Array((even,CompactBuffer(2, 4, 6, 8, 10)), (odd,CompactBuffer(1, 3, 5, 7, 9)))

7. filter(func)

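Purpose: Filters the RDD: the new RDD consists of the elements for which func returns true.

Example:

// Create an RDD of strings, then filter it into a new RDD of the strings that contain the substring "xiao"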

scala> val rdd = sc.makeRDD(Array("xiaozhu", "xiaozhang", "zhu", "zhang", "xiaohong"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[54] at makeRDD at <console>:24

scala> val newRdd = rdd.filter(_.contains("xiao"))
newRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[55] at filter at <console>:26

scala> newRdd.collect
res38: Array[String] = Array(xiaozhu, xiaozhang, xiaohong)

8. sample(withReplacement, fraction, seed)

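Purpose:

1.1 Randomly samples roughly a fraction of the data using the given random seed. The expected number of sampled elements is size * fraction, but the result is not guaranteed to match that proportion exactly.
1.2 withReplacement specifies whether sampled elements are put back: true samples with replacement, false samples without replacement. With replacement, an element may be drawn more than once; without replacement, it cannot. If withReplacement is false, fraction must lie in [0, 1]; if it is true, any value greater than or equal to 0 is allowed.
1.3 seed sets the seed of the random number generator. Usually the default is used, or the current timestamp is passed in.

Example:

// Sampling without replacement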

scala> val rdd = sc.makeRDD(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[56] at makeRDD at <console>:24

scala> val newRdd = rdd.sample(false, 0.5)
newRdd: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[57] at sample at <console>:26

scala> newRdd.collect
res39: Array[Int] = Array(1, 2, 4, 5, 6, 7, 8, 9)

// Sampling with replacement

scala> val rdd = sc.makeRDD(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[58] at makeRDD at <console>:24

scala> val newRdd = rdd.sample(true, 1.5)
newRdd: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[59] at sample at <console>:26

scala> newRdd.collect
res40: Array[Int] = Array(1, 1, 1, 2, 5, 5, 7, 8, 8, 9, 9, 9, 9, 10)
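
The without-replacement run above returned 8 of 10 elements for a fraction of 0.5, which shows that the proportion is not exact. One way to see why is to model sampling without replacement as an independent keep-or-drop decision per element. The sketch below is plain Scala under that assumption; it uses scala.util.Random rather than Spark's own generator, so it will not reproduce Spark's selections. BernoulliSampleSketch and sampleWithoutReplacement are hypothetical names.

```scala
import scala.util.Random

object BernoulliSampleSketch {
  // Keeps each element independently with probability fraction.
  // The same seed always yields the same selection.
  def sampleWithoutReplacement[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    require(fraction >= 0.0 && fraction <= 1.0, "fraction must be in [0, 1]")
    val rng = new Random(seed)
    data.filter(_ => rng.nextDouble() < fraction)
  }

  def main(args: Array[String]): Unit = {
    val data = 1 to 10
    // Different seeds can select different numbers of elements,
    // so the sample size equals fraction * size only on average.
    (1L to 5L).foreach { seed =>
      println(sampleWithoutReplacement(data, 0.5, seed).mkString(", "))
    }
  }
}
```

Because every element is decided on its own, no element can appear twice, and fixing the seed makes a run repeatable.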

9. distinct([numTasks])

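Purpose: Removes duplicate elements from the RDD. The optional parameter sets the number of tasks and defaults to the number of partitions.

Example:

// Remove duplicates from the elements of an RDD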

scala> val rdd = sc.makeRDD(Array(1, 1, 1, 2, 5, 5, 7, 8, 8, 9, 9, 9, 9, 10))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[60] at makeRDD at <console>:24

scala> val newRdd = rdd.distinct
newRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[63] at distinct at <console>:26

scala> newRdd.collect
res41: Array[Int] = Array(8, 1, 9, 5, 10, 2, 7)

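10. coalesce(numPartitions, shuffle)

Purpose: Reduces the number of partitions to the given number. It is typically used after filtering a large dataset, to make processing of the resulting smaller dataset more efficient. The second parameter specifies whether to shuffle. If it is omitted or false, no shuffle takes place; in that case reducing the number of partitions works, but increasing it has no effect.

Example:

// Reduce 4 partitions to 2 partitions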

scala> val rdd = sc.makeRDD(1 to 10, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[64] at makeRDD at <console>:24

scala> rdd.partitions.length
res42: Int = 4

scala> val newRdd = rdd.coalesce(2)
newRdd: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[65] at coalesce at <console>:26

scala> newRdd.partitions.length
res43: Int = 2

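11. repartition(numPartitions)

Purpose: Reshuffles all the data into the new number of partitions. This operation always moves data over the network. The new number of partitions may be larger or smaller than before.

Example:

// Expand 2 partitions to 4 partitions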

scala> val rdd = sc.makeRDD(1 to 10, 2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[66] at makeRDD at <console>:24

scala> rdd.partitions.length
res44: Int = 2

scala> val newRdd = rdd.repartition(4)
newRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[70] at repartition at <console>:26

scala> newRdd.partitions.length
res45: Int = 4
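
In Spark, repartition(numPartitions) is equivalent to coalesce(numPartitions, shuffle = true), which is why it can grow the partition count while a coalesce without shuffle cannot. The sketch below models that difference in plain Scala. It is not Spark code: PartitionCountSketch and its methods are hypothetical names, and merging adjacent partitions and redistributing round-robin are simplifications of what Spark actually does.

```scala
object PartitionCountSketch {
  type Partitions = Seq[Seq[Int]]

  // Model of coalesce without a shuffle: existing partitions can only be merged,
  // so asking for more partitions than already exist leaves the data unchanged.
  def coalesceNoShuffle(parts: Partitions, target: Int): Partitions = {
    require(target >= 1, "target must be positive")
    if (target >= parts.length) parts
    else {
      // Merge adjacent partitions into target groups (a simplification).
      val n = parts.length
      (0 until target).map { i =>
        parts.slice(i * n / target, (i + 1) * n / target).flatten
      }
    }
  }

  // Model of a shuffle-based repartition: every element is redistributed,
  // here round-robin, so the partition count can go up as well as down.
  def repartitionWithShuffle(parts: Partitions, target: Int): Partitions = {
    require(target >= 1, "target must be positive")
    val all = parts.flatten
    (0 until target).map { i =>
      all.zipWithIndex.collect { case (x, idx) if idx % target == i => x }
    }
  }

  def main(args: Array[String]): Unit = {
    val two: Partitions = Seq(Seq(1, 2, 3, 4, 5), Seq(6, 7, 8, 9, 10))
    println(coalesceNoShuffle(two, 4).length)      // prints 2: growing without a shuffle has no effect
    println(repartitionWithShuffle(two, 4).length) // prints 4: the shuffle redistributes every element
  }
}
```

The model also shows the cost difference: the no-shuffle path only concatenates whole partitions, while the shuffle path touches every element.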

12. sortBy(func, [ascending], [numTasks])

Purpose: Applies func to each element and sorts the elements by comparing the results. The order is ascending by default.

Example:

scala> val rdd = sc.makeRDD(Array(4,6,9,1,4,8,5,9,0))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[71] at makeRDD at <console>:24

// Ascending (default)
scala> val newRdd = rdd.sortBy(x => x)
newRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[76] at sortBy at <console>:26

scala> newRdd.collect
res46: Array[Int] = Array(0, 1, 4, 4, 5, 6, 8, 9, 9)

// Ascending (ascending passed explicitly as true)
scala> val newRdd = rdd.sortBy(x => x, true)
newRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[81] at sortBy at <console>:26

scala> newRdd.collect
res47: Array[Int] = Array(0, 1, 4, 4, 5, 6, 8, 9, 9)

// Descending
scala> val newRdd = rdd.sortBy(x => x, false)
newRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[86] at sortBy at <console>:26

scala> newRdd.collect
res48: Array[Int] = Array(9, 9, 8, 6, 5, 4, 4, 1, 0)
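
The examples above all pass the identity function, so the role of func is easy to miss: func only computes the key used for comparison, and the elements themselves are returned unchanged. The sketch below illustrates that with plain Scala collections, whose sortBy takes a key function in the same spirit. It is not Spark code, and SortByModelSketch and sortByModel are hypothetical names.

```scala
object SortByModelSketch {
  // Orders the elements by the key that key computes.
  // The elements themselves are returned unchanged.
  def sortByModel[T, K: Ordering](data: Seq[T], ascending: Boolean = true)(key: T => K): Seq[T] = {
    val ord = if (ascending) Ordering[K] else Ordering[K].reverse
    data.sortBy(key)(ord)
  }

  def main(args: Array[String]): Unit = {
    val words = Seq("pear", "fig", "banana")
    // Sort by string length rather than alphabetically.
    println(sortByModel(words)(_.length).mkString(", "))                    // prints fig, pear, banana
    println(sortByModel(words, ascending = false)(_.length).mkString(", ")) // prints banana, pear, fig
  }
}
```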

13. pipe(command, [envVars])

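Purpose: Pipes each partition through a shell command or script and returns the output as an RDD. The command runs once per partition, so an RDD with a single partition runs it exactly once.

Example: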

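2.1 Create a script file pipe.sh (it must be executable, for example via chmod +x pipe.sh)

#!/bin/bash
# Print a greeting once per invocation, then echo every input line with a prefix.
echo "hello"
while read -r line; do
  echo ">>>$line"
done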

2.2 An RDD with only one partition

scala> val rdd = sc.makeRDD(1 to 5, 1)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[91] at makeRDD at <console>:24

scala> val newRdd = rdd.pipe("./pipe.sh")
newRdd: org.apache.spark.rdd.RDD[String] = PipedRDD[92] at pipe at <console>:26

scala> newRdd.collect
res51: Array[String] = Array(hello, >>>1, >>>2, >>>3, >>>4, >>>5)

2.3 An RDD with multiple partitions

scala> val rdd = sc.makeRDD(1 to 5, 2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[93] at makeRDD at <console>:24

scala> val newRdd = rdd.pipe("./pipe.sh")
newRdd: org.apache.spark.rdd.RDD[String] = PipedRDD[94] at pipe at <console>:26

scala> newRdd.collect
res52: Array[String] = Array(hello, >>>1, >>>2, hello, >>>3, >>>4, >>>5)
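
The position of the second "hello" in this output marks the partition boundary: the script starts once per partition and prints its greeting each time. The sketch below reproduces that behavior in plain Scala without launching a process. It is not Spark code: PipeSketch is a hypothetical name, script is a stand-in for pipe.sh, and the partition contents are assumed to be (1, 2) and (3, 4, 5), which is consistent with the output above.

```scala
object PipeSketch {
  // Plain-Scala stand-in for pipe.sh: emit "hello" once, then prefix every input line.
  def script(lines: Iterator[String]): Iterator[String] =
    Iterator("hello") ++ lines.map(">>>" + _)

  // Runs the stand-in script once per partition and concatenates the outputs.
  def pipeModel(partitions: Seq[Seq[Int]]): List[String] =
    partitions.toList.flatMap(p => script(p.iterator.map(_.toString)).toList)

  def main(args: Array[String]): Unit = {
    // One partition: the greeting appears once.
    println(pipeModel(Seq(Seq(1, 2, 3, 4, 5))).mkString(", "))
    // Two partitions: the greeting appears once per partition.
    println(pipeModel(Seq(Seq(1, 2), Seq(3, 4, 5))).mkString(", "))
  }
}
```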
