Spark Series - 3 - RDD Operators

IfNotExists

Category: Spark Series
Tags: spark

Article link: https://blog.csdn.net/weixin_42633805/article/details/131307455

This article walks through the commonly used RDD operators with examples.

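As mentioned earlier, RDD operators fall into two groups: transformations and actions. A transformation turns an RDD into a new RDD and is lazy: it only records the operation, and nothing is computed until an action runs. An action is what triggers the actual Spark computation; under the hood it calls the runJob method of the SparkContext, which creates a job and submits it.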
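The lazy behaviour can be illustrated without Spark. The sketch below is only an analogy, assuming a plain Scala Iterator stands in for an RDD: Iterator.map is lazy like a transformation, and toList plays the role of an action. LazyAnalogy and its counter are hypothetical names used for illustration.

```scala
object LazyAnalogy {
  // Returns (function calls before the terminal step, calls after it, result).
  def run(): (Int, Int, List[Int]) = {
    var calls = 0
    // Iterator.map is lazy, like an RDD transformation: nothing runs yet.
    val doubled = Iterator(1, 2, 3).map { x => calls += 1; x * 2 }
    val before = calls
    // toList consumes the iterator, playing the role of an action.
    val result = doubled.toList
    (before, calls, result)
  }
}
```

Calling LazyAnalogy.run() shows that the mapping function has not been called at all before the terminal step and has been called once per element after it.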

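The common operators are introduced below. For readability the code is written out in full; once you are comfortable with it, you can shorten it. The examples show only part of what each operator can do; see each operator's method signature for the full behaviour.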
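The snippets in this article use a SparkContext named sc without defining it. A minimal sketch of one way to obtain it in a standalone program follows; the local[*] master, the application name and the object name RddOperatorsSetup are assumptions for illustration (in spark-shell, sc is already provided).

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddOperatorsSetup {
  // Builds a SparkContext that runs locally on all available cores.
  // The master URL and application name are illustrative choices.
  def createContext(): SparkContext = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("rdd-operators")
    new SparkContext(conf)
  }

  def main(args: Array[String]): Unit = {
    val sc = createContext()
    try {
      // The map example from this article, run end to end.
      sc.makeRDD(1 to 5).map(x => x * 2).collect().foreach(println)
    } finally {
      sc.stop() // release the local Spark resources
    }
  }
}
```

This needs the spark-core dependency on the classpath; the operator snippets can then be pasted into main in place of the map example.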

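I. Transformations

Value types

1. map

Description:

Applies a function to every element and returns a new RDD. The number of partitions is unchanged. Elements within a partition are processed one after another, in order; there is no ordering guarantee between partitions. Commonly used for format conversion.

Example:

Take the sequence 1 to 5 and return a new sequence with every element multiplied by 2.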

sc.makeRDD(1 to 5)
.map(x => x * 2)
.collect().foreach(println)

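2. flatMap

Description: Maps each element to a sequence and flattens the results into a new RDD. Often used in word counts.

Example:

Split every element on spaces and flatten the result.

sc.makeRDD(List("hello python", "hello java"))
  .flatMap(x => x.split(" "))
  .collect().foreach(println)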

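3. mapPartitions

Description: A variant of map that works one partition at a time: the function receives an iterator over a whole partition and returns an iterator, producing a new RDD. If the function materializes that iterator, the whole partition is held in memory until the function finishes, which can exhaust memory on large partitions.

Example:

Find the maximum of each partition: (1, 2), (3, 4, 5) => 2, 5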

sc.makeRDD(1 to 5, 2)
  .mapPartitions(
  iter => {
    List(iter.max).iterator
  }
).collect().foreach(println)
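The split (1, 2), (3, 4, 5) used above can be reproduced in plain Scala. The sketch below assumes contiguous slicing by index, which to my understanding is how makeRDD splits a local collection; SliceSketch, slice and mapPartitions are hypothetical helpers written for illustration, not Spark APIs.

```scala
object SliceSketch {
  // Splits data into numSlices contiguous ranges: slice i covers the
  // indices from i * n / numSlices (inclusive) to (i + 1) * n / numSlices (exclusive).
  def slice[T](data: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    val n = data.length.toLong
    (0 until numSlices).map { i =>
      val start = (i * n / numSlices).toInt
      val end = ((i + 1) * n / numSlices).toInt
      data.slice(start, end)
    }
  }

  // Plain-Scala stand-in for mapPartitions: apply f to each partition's iterator.
  def mapPartitions[T, U](parts: Seq[Seq[T]])(f: Iterator[T] => Iterator[U]): Seq[U] =
    parts.flatMap(p => f(p.iterator))
}
```

For example, SliceSketch.mapPartitions(SliceSketch.slice(1 to 5, 2))(it => Iterator(it.max)) yields the two partition maxima 2 and 5.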

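4. mapPartitionsWithIndex

Description: Works like mapPartitions, but the function also receives the index of the partition.

Example:

Pair each element with its partition index: (0, 1) (0, 2) (0, 3) (1, 4) ...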

sc.makeRDD(1 to 6, 2)
    .mapPartitionsWithIndex(
      (index, iter) => iter.map(
        num => {
          (index, num)
        }
      )
    ).collect().foreach(println)

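5. glom

Description: Collects the elements of each partition into an Array.

Example:

Sum of the partition maxima: (1, 2), (3, 4, 5) => 2 + 5

val maxSum = sc.makeRDD(1 to 5, 2)
  .glom()
  .map(x => x.max)
  .collect().sum
println(maxSum)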

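6. union

Description: Union of two RDDs. Duplicates are kept (3, 4 and 5 appear twice in the example); apply distinct to remove them.

Example: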

val arr1 = sc.makeRDD(1 to 5)
val arr2 = sc.makeRDD(3 to 7)
arr1.union(arr2).collect().foreach(println)

7. intersection

Description: Intersection of two RDDs.

Example:

val arr1 = sc.makeRDD(1 to 5)
val arr2 = sc.makeRDD(3 to 7)
arr1.intersection(arr2).collect().foreach(println)

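8. subtract

Description: Difference: the elements of the left RDD that do not appear in the right one. The order of the two operands matters.

Example: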

val arr1 = sc.makeRDD(1 to 5)
val arr2 = sc.makeRDD(3 to 7)
arr1.subtract(arr2).collect().foreach(println)

9. groupBy

Description: Groups elements by a key computed from each element. Involves a shuffle.

Example:

Group by first letter: (H, (Hello, Hello)) (S, (Spark, Scala))

sc.makeRDD(List("Hello", "Spark", "Scala", "Hello"))
  .groupBy(
    x => {
      x.charAt(0)
    }
).collect().foreach(println)

10. groupByKey

Description: Puts the values of each key into one group: (("a", 1), ("a", 3), ("b", 1)) => ("a", (1, 3)), ("b", (1))

Example:

Group the values by key.

sc.makeRDD(List(("a", 1), ("a", 3), ("b", 1)))
  .groupByKey()
  .collect().foreach(println)

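11. filter

Description: Keeps the elements that satisfy a predicate. The partitioning is unchanged, but the data may end up unevenly distributed across partitions (skew).

Example:

Get the even numbers in the sequence.

sc.makeRDD(1 to 6).filter(x => x % 2 == 0).collect().foreach(println)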

12. distinct

Description: Removes duplicates.

Example:

Remove the duplicate values from a list.

sc.makeRDD(List(1, 2, 1, 0, 3, 5)).distinct().collect().foreach(println)

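13. coalesce

Description: Reduces the number of partitions. By default (shuffle = false) it merges existing partitions without a shuffle, which can leave the data skewed; pass shuffle = true to redistribute the data evenly. It can also increase the number of partitions, but only with shuffle = true.

Example:

Reduce the number of partitions.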

sc.makeRDD(1 to 10, 5)
  .coalesce(2)
  .mapPartitionsWithIndex(
    (index, iter)=> iter.map(
      x => (index, x)
    )
  ).collect().foreach(println)

Increase the number of partitions.

sc.makeRDD(1 to 10, 2)
  .coalesce(5, true)
  .mapPartitionsWithIndex(
    (index, iter)=> iter.map(
      x => (index, x)
    )
  ).collect().foreach(println)

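14. repartition

Description: Changes the number of partitions and always shuffles; it is equivalent to coalesce with shuffle = true. Typically used to increase the number of partitions.

Example:

Increase the number of partitions.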

sc.makeRDD(1 to 10, 2)
  .repartition(5)
  .mapPartitionsWithIndex(
  (index, iter) => iter.map(
    x => (index, x)
  )
).collect().foreach(println)

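15. sortBy

Description: Sorts by the value the given function returns; ascending by default.

Example:

Sort by the first element of each tuple, descending => (7, 2) (3, 0) (1, 5)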

sc.makeRDD(List((1, 5), (3, 0), (7, 2)), 2)
.sortBy(t=>t._1, false)
.collect().foreach(println)

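16. zip

Description: Zips two RDDs into an RDD of pairs. Both RDDs must have the same number of partitions and the same number of elements in each partition.

Example:

import org.apache.spark.rdd.RDD

val a1: RDD[(String, Int)] = sc.makeRDD(List(("a", 1), ("b", 2), ("c", 3)), 2)
val a2: RDD[(String, Int)] = sc.makeRDD(List(("aa", 3), ("d", 1), ("f", 0)), 2)
a1.zip(a2).collect().foreach(println)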

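Key-value types

1. mapValues

Description: For key-value pairs, applies a function to the value only; the keys are left unchanged.

Example:

Result: add 1 to each value => ("a", 2), ("a", 3), ("b", 4)

Code: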

sc.makeRDD(List(("a", 1), ("a", 2), ("b", 3))).mapValues(x => x+1).collect().foreach(println)

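2. reduceByKey

Description: Groups by key and aggregates. Compared with groupByKey it adds the aggregation step, and it pre-aggregates within each partition before the shuffle, which reduces the amount of data written to disk.

Example:

Result: ("a", 1), ("a", 2), ("b", 3) => ("a", 3), ("b", 3)

Code: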

sc.makeRDD(List(("a", 1), ("a", 2), ("b", 3))).reduceByKey(
  (x: Int, y: Int) => {
    x + y
  }
).collect().foreach(println)
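The effect of pre-aggregating before the shuffle can be made concrete by counting records. The sketch below is a plain-Scala model, not Spark code: the two-partition layout in sample and all names in MapSideCombineSketch are assumptions for illustration.

```scala
object MapSideCombineSketch {
  type Partition = Seq[(String, Int)]

  // Hypothetical layout: six records spread over two partitions.
  val sample: Seq[Partition] = Seq(
    Seq("a" -> 1, "a" -> 2, "b" -> 3),
    Seq("b" -> 4, "b" -> 5, "a" -> 6))

  // Merges the values of each key with op.
  def combine(records: Seq[(String, Int)], op: (Int, Int) => Int): Map[String, Int] =
    records.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(op) }

  // groupByKey-style: every record crosses the shuffle.
  def recordsShuffledWithoutCombine(parts: Seq[Partition]): Int =
    parts.map(_.size).sum

  // reduceByKey-style: one record per distinct key per partition crosses the shuffle.
  def recordsShuffledWithCombine(parts: Seq[Partition], op: (Int, Int) => Int): Int =
    parts.map(p => combine(p, op).size).sum

  // Pre-aggregate inside each partition, then merge the partial results.
  def reduceByKey(parts: Seq[Partition], op: (Int, Int) => Int): Map[String, Int] =
    combine(parts.flatMap(p => combine(p, op).toSeq), op)
}
```

With the sample layout, six records would be shuffled without pre-aggregation and four with it, while the final sums per key are the same either way.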

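3. combineByKey

Description: Aggregates the values of each key; the result type may differ from the value type.

It takes three parameters:

createCombiner: V => C,  // converts the first value seen for a key in a partition into the result structure
mergeValue: (C, V) => C,  // rule applied within a partition
mergeCombiners: (C, C) => C, // rule applied between partitions

Example:

Compute the average of the values for each key.

val rdd: RDD[(String, Int)] = sc.makeRDD(List(
  ("a", 1), ("a", 2), ("b", 3),
  ("b", 4), ("b", 5), ("a", 6)
), 2)

rdd.combineByKey(
  v => (v, 1),   // first value of a key in a partition becomes (value, 1), e.g. (1, 1) and (6, 1) for "a"
  (p: (Int, Int), v) => {
    (p._1 + v, p._2 + 1)   // within a partition: running sum and count for the key
  },
  (p1: (Int, Int), p2: (Int, Int)) => {
    (p1._1 + p2._1, p1._2 + p2._2)  // between partitions: total sum and total count for the key
  }
)    // (a,(9,3)) (b,(12,3))
  .mapValues { case (sum, cnt) => sum / cnt }   // average, using integer division: (a,3) (b,4)
  .collect().foreach(println)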
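How the three functions cooperate can be traced in plain Scala. The sketch below models the partitions as nested sequences and is an illustration of the rules described above, not Spark's implementation; CombineByKeySketch, sample and sumsAndCounts are hypothetical names.

```scala
object CombineByKeySketch {
  def combineByKey[K, V, C](parts: Seq[Seq[(K, V)]])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Within each partition: the first value of a key goes through createCombiner,
    // later values of that key go through mergeValue.
    val perPartition: Seq[Map[K, C]] = parts.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        val next = acc.get(k) match {
          case Some(c) => mergeValue(c, v)
          case None    => createCombiner(v)
        }
        acc.updated(k, next)
      }
    }
    // Between partitions: combiners of the same key are merged with mergeCombiners.
    perPartition.flatMap(_.toSeq).groupBy(_._1).map { case (k, kcs) =>
      k -> kcs.map(_._2).reduce(mergeCombiners)
    }
  }

  // The data of the example, laid out as two partitions of three records each.
  val sample: Seq[Seq[(String, Int)]] = Seq(
    Seq("a" -> 1, "a" -> 2, "b" -> 3),
    Seq("b" -> 4, "b" -> 5, "a" -> 6))

  // (sum, count) per key, from which the average follows.
  def sumsAndCounts: Map[String, (Int, Int)] =
    combineByKey(sample)(
      (v: Int) => (v, 1),
      (c: (Int, Int), v: Int) => (c._1 + v, c._2 + 1),
      (c1: (Int, Int), c2: (Int, Int)) => (c1._1 + c2._1, c1._2 + c2._2))
}
```

For key "a", partition 0 produces (1, 1) and then (3, 2), partition 1 produces (6, 1), and merging the two gives (9, 3).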

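4. partitionBy

Description: Repartitions the data according to the given partitioner.

Example:

Partition a sequence with a hash partitioner.

import org.apache.spark.HashPartitioner

sc.makeRDD(1 to 5)
  .map(x => (x, 1))  // convert to pairs: (1, 1) (2, 1) (3, 1) ...
  .partitionBy(new HashPartitioner(2))  // hash partitioning into 2 partitions
  .mapPartitionsWithIndex((index, iter) => iter.map(x => (index, x))) // attach the partition index
  .collect().foreach(println)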
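Which partition each key lands in can be predicted by hand. The sketch below assumes the usual hash-partitioning rule, the key's hashCode modulo the number of partitions, kept non-negative, with null keys going to partition 0; it is an illustration of that rule and not Spark's source. HashPartitionSketch and partitionFor are hypothetical names.

```scala
object HashPartitionSketch {
  // Partition index for a key: key.hashCode modulo numPartitions, kept non-negative.
  def partitionFor(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case k =>
      val raw = k.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw
  }
}
```

Under this rule the keys 1 to 5 with two partitions go to partitions 1, 0, 1, 0, 1, because an Int hashes to itself.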

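5. cogroup

Description: For each key, collects the values from every RDD into one group per RDD. A single call can combine the RDD with up to three other RDDs. Keys that appear in only one RDD are kept, with an empty group for the other side.

Example:

Result: (a, (1), (4)) (b, (2), (5)) (c, (3), (6, 7)) (d, (8), ())

Code: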

val rdd1: RDD[(String, Int)] = sc.makeRDD(List(
  ("a", 1), ("b", 2), ("c", 3), ("d", 8)
))
val rdd2: RDD[(String, Int)] = sc.makeRDD(List(
  ("a", 4), ("b", 5), ("c", 6), ("c", 7)
))
rdd1.cogroup(rdd2).collect().foreach(println)

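6. join

Description: Joins two pair RDDs from different sources by key. For every key present in both, each pair of values forms a tuple; keys without a match do not appear in the result. When a key occurs several times on both sides, every combination is produced, so the result for that key is a Cartesian product. Related operators are leftOuterJoin, rightOuterJoin and fullOuterJoin.

Example:

Result: (a,(1,4)) (b,(2,5)) (c,(3,6)) (c,(3,7))

Code: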

    val rdd1: RDD[(String, Int)] = sc.makeRDD(List(
      ("a", 1), ("b", 2), ("c", 3), ("d", 8)
    ))

    val rdd2: RDD[(String, Int)] = sc.makeRDD(List(
      ("a", 4), ("b", 5), ("c", 6), ("c", 7)
    ))

    rdd1.join(rdd2).collect().foreach(println)
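The matching rule of join, including the per-key Cartesian product, can be written down in a few lines of plain Scala. This is a sketch over local sequences for illustration only; JoinSketch is a hypothetical name, and unlike Spark it returns rows in a fixed order.

```scala
object JoinSketch {
  // Inner join on plain sequences: every left value is paired with every
  // right value that has the same key; unmatched keys produce nothing.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for {
      (k, v) <- left
      (k2, w) <- right
      if k == k2
    } yield (k, (v, w))
}
```

Joining the two lists of the example drops key "d" and yields two rows for key "c". A key that occurs twice on each side would yield four rows.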

7. sortByKey

Description: Sorts key-value pairs by key; ascending by default.

Example:

Sort by key, descending.

val rdd: RDD[(String, Int)] = sc.makeRDD(List(
      ("a", 1), ("a", 2), ("b", 3),
      ("b", 4), ("b", 5), ("a", 6)
    ),2)
rdd.sortByKey(false).collect().foreach(println)

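8. aggregateByKey

Description: Aggregates values per key, with one rule applied within each partition and another applied between partitions.

It takes two parameter lists (three parameters in total):

(zeroValue: U)  // initial value, the starting point for each key within each partition
(seqOp: (U, V) => U,  // rule applied within a partition
 combOp: (U, U) => U)  // rule applied between partitions

Example:

For each key, sum the per-partition maxima.

Result:

Partition 0 maxima: ("a", 2), ("b", 3)

Partition 1 maxima: ("a", 6), ("b", 5)

=> ("a", 8) ("b", 8)

Code: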

val rdd: RDD[(String, Int)] = sc.makeRDD(List(
  ("a", 1), ("a", 2), ("b", 3),
  ("b", 4), ("b", 5), ("a", 6)
), 2)

rdd.aggregateByKey(0)(
  (x, y) => math.max(x, y),
  (x, y) => x + y
).collect().foreach(println)

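9. foldByKey

Description: A simplified form of aggregateByKey, for when the rule within a partition and the rule between partitions are the same.

Example:

Maximum per key.

Partition 0 maxima: ("a", 2), ("b", 3)

Partition 1 maxima: ("a", 6), ("b", 5)

=> ("a", 6) ("b", 5)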

    val rdd: RDD[(String, Int)] = sc.makeRDD(List(
      ("a", 1), ("a", 2), ("b", 3),
      ("b", 4), ("b", 5), ("a", 6)
    ), 2)

    rdd.foldByKey(0)(
      (x, y) => math.max(x, y)
    ).collect().foreach(println)
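The zero value deserves some care. The sketch below is a plain-Scala model of the rule described above, assuming the zero value is applied once per key in every partition that contains the key; FoldByKeySketch and sample are hypothetical names, and the partition layout is the one from the example.

```scala
object FoldByKeySketch {
  def foldByKey[K, V](parts: Seq[Seq[(K, V)]], zero: V)(op: (V, V) => V): Map[K, V] = {
    // Within a partition, each key starts from zero and folds its values in.
    val perPartition = parts.map { part =>
      part.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
        acc.updated(k, op(acc.getOrElse(k, zero), v))
      }
    }
    // Between partitions, the partial results of a key are merged with the same op.
    perPartition.flatMap(_.toSeq).groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).reduce(op)
    }
  }

  val sample: Seq[Seq[(String, Int)]] = Seq(
    Seq("a" -> 1, "a" -> 2, "b" -> 3),
    Seq("b" -> 4, "b" -> 5, "a" -> 6))
}
```

In this model, a zero of 0 with max gives the expected maxima for the sample, but it would be wrong for all-negative values, where Int.MinValue is the neutral choice. A zero of 10 with addition adds 10 once per key per partition, so it is counted twice for each key here.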

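II. Actions

Saving to files

1. saveAsTextFile

Description: Writes the RDD as text into the given directory, one part file per partition.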

xxxRDD.saveAsTextFile("output")

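Operations that return collections or values

1. collect

Description: Gathers the data of all partitions into the memory of the driver, in partition order, and returns all elements of the dataset as an array. It is used throughout the transformation examples above, so no separate example is given here.

2. foreach

Description: Calls the given function on every element of the RDD, distributed across the executors. Note that the transformation examples above call foreach on the array returned by collect, which runs on the driver; calling foreach directly on an RDD runs on the executors instead, with no ordering guarantee between partitions.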

3. count

Description: Counts the elements in the RDD.

Example:

println(sc.makeRDD(1 to 5, 2).count())

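4. countByKey

Description: For key-value pairs, counts how many times each key occurs.

Example:

val rdd: RDD[(String, Int)] = sc.makeRDD(List(
  ("a", 1), ("a", 2), ("b", 3),
  ("b", 4), ("b", 5), ("a", 6)
), 2)
println(rdd.countByKey())  // => Map(b -> 3, a -> 3); the order of the entries may vary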

5. countByValue

Description: Counts how many times each value occurs.

Example:

println(sc.makeRDD(List(1, 1, 2, 1)).countByValue())

=> Map(1 -> 3, 2 -> 1)

6. first

Description: Returns the first element of the RDD.

Example:

println(sc.makeRDD(1 to 5).first())

7. take

Description: Returns the first n elements of the RDD.

Example:

sc.makeRDD(1 to 5).take(2).foreach(println)

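8. takeOrdered

Description: Sorts the elements and returns the first n; ascending by default.

Example:

sc.makeRDD(List(5, 2, 1, 8)).takeOrdered(2).foreach(println) // ascending, first 2 elements: 1, 2

sc.makeRDD(List(5, 2, 1, 8)).takeOrdered(2)(Ordering.Int.reverse).foreach(println) // reversed ordering, first 2 elements: 8, 5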

9. reduce

Description: Aggregates all elements of the RDD, first within each partition and then between partitions.

Example:

println(sc.makeRDD(1 to 10, 3).reduce((x, y) => x - y))

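=> partitions (1, 2, 3), (4, 5, 6), (7, 8, 9, 10) => (1 - 2 - 3), (4 - 5 - 6), (7 - 8 - 9 - 10) => -4 - (-7) - (-20) = 23

This result assumes the partition results are merged in partition order. Spark expects the function passed to reduce to be commutative and associative; subtraction is neither, so it is used here only to make the two stages visible, and the printed value is not guaranteed.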
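The two stages, and the effect of the merge order, can be checked in plain Scala. The sketch below hard-codes the three partitions from the example and is an illustration only; ReduceSketch and reduceParts are hypothetical names.

```scala
object ReduceSketch {
  // The three partitions of 1 to 10 from the example.
  val parts: Seq[Seq[Int]] = Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9, 10))

  // Stage 1: reduce inside every partition. Stage 2: reduce the partial
  // results in the order in which the partitions are given.
  def reduceParts(partitions: Seq[Seq[Int]])(op: (Int, Int) => Int): Int =
    partitions.map(_.reduce(op)).reduce(op)
}
```

Merging the partial results -4, -7 and -20 in partition order gives 23, while merging them in the order -7, -4, -20 gives 17. With a commutative and associative function such as addition the order makes no difference.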

The operators listed above are some of the most common ones; for the rest, see the official documentation.
