Common Spark operators
----The difference between map and flatMap in Spark:
map applies the given function to each input element and returns exactly one output object per input.
flatMap is a combination of two operations, performing "map first, then flatten":
Step 1: like map, apply the given function to each input element, returning one result (typically a collection) per input;
Step 2: flatten all of those results into a single flat collection.
scala> rdd5.map(t=>{
t.split("\t")}).collect
res9: Array[Array[String]] = Array(Array(75, 2018-09-17, BK181713017, 小一),
Array(75, 2018-09-17, BK181913016, 小二),
Array(75, 2018-09-17, BK181913062, 小四))
scala> rdd5.flatMap(t=>{
t.split("\t")}).collect
res8: Array[String] =
Array(75, 2018-09-17, BK181713017, 小一, 75, 2018-09-17, BK181913016, 小二, 75, 2018-09-17, BK181913062, 小四, 75, 2018-09-17, BK181913007)
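Plain Scala collections share the same map/flatMap semantics as RDDs, so the difference can be checked without a Spark shell. A minimal standalone sketch (the sample lines below are hypothetical stand-ins for rdd5's records):

```scala
// Hypothetical tab-separated records, mimicking rdd5's data above.
val lines = Seq("75\t2018-09-17\tBK181713017\t小一",
                "75\t2018-09-17\tBK181913016\t小二")

// map: one Array[String] per input line -> a nested structure.
val mapped: Seq[Array[String]] = lines.map(_.split("\t"))

// flatMap: split each line, then flatten everything into one flat sequence.
val flatMapped: Seq[String] = lines.flatMap(_.split("\t"))

println(mapped.length)     // 2 (one array per line)
println(flatMapped.length) // 8 (all fields flattened together)
```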
scala> val a = sc.parallelize(1 to 10, 3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
----glom: gathers the elements of each partition into an array, making the partition layout visible
scala> a.glom.collect
res0: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10))
----map operator
----x: each element of the RDD
scala> a.map(x=>(x*2)).glom.collect
res1: Array[Array[Int]] = Array(Array(2, 4, 6), Array(8, 10, 12), Array(14, 16, 18, 20))
map and mapPartitions differ mainly in call granularity:
map's transformation function is applied to every element of the RDD, while mapPartitions' function is applied once per partition, receiving that partition's elements as an iterator.
For parallelize(1 to 10, 3), the map function executes 10 times, while the mapPartitions function executes 3 times.
----mapPartitions operator
----x: the iterator over one partition's elements (the function is called once per partition)
----y: each element within that partition
scala> a.mapPartitions(x=>(x.map(y=>(y*2)))).glom.collect
res2: Array[Array[Int]] = Array(Array(2, 4, 6), Array(8, 10, 12), Array(14, 16, 18, 20))
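The call-granularity difference can be simulated with plain Scala, using grouped(4) as a stand-in for the 3 partitions of parallelize(1 to 10, 3). This is a sketch of the semantics only; in Spark the partitioning is handled by the scheduler:

```scala
var mapCalls = 0
var partitionCalls = 0

val data = (1 to 10).toList

// map-style: the function body runs once per element.
val doubled = data.map { x => mapCalls += 1; x * 2 }

// mapPartitions-style: the function body runs once per "partition",
// receiving that partition's elements as a whole.
val doubledByPart = data.grouped(4).toList.flatMap { part =>
  partitionCalls += 1
  part.map(_ * 2)
}

println(mapCalls)        // 10: once per element
println(partitionCalls)  // 3: once per simulated partition
```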
----mapPartitionsWithIndex operator
----index: the index number of each partition
----x: the iterator over one partition's elements
----y: each element within that partition
scala> a.mapPartitionsWithIndex((index,x)=>x.map(y=>(index,y))).glom.collect
res4: Array[Array[(Int, Int)]] = Array(Array((0,1), (0,2), (0,3)), Array((1,4), (1,5), (1,6)), Array((2,7), (2,8), (2,9), (2,10)))
----union: merges two RDDs (duplicates are kept)
scala> val b = sc.makeRDD(1 to 5,2)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at makeRDD at <console>:24
scala> b.glom.collect
res5: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4, 5))
----union produces a new RDD
scala> a.union(b)
res6: org.apache.spark.rdd.RDD[Int] = UnionRDD[12] at union at <console>:28
scala> res6.glom.collect
res7: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10), Array(1, 2), Array(3, 4, 5))
----intersection: returns the elements common to both RDDs (deduplicated)
scala> a.intersection(b)
res8: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[19] at intersection at <console>:28
scala> res8.collect
res9: Array[Int] = Array(3, 4, 1, 5, 2)
scala> res8.glom.collect
res10: Array[Array[Int]] = Array(Array(3), Array(4, 1), Array(5, 2))
----subtract: set-difference operator
----keeps the elements of the first RDD that do not also appear in the second RDD
scala> val rdd1 = sc.parallelize(1 to 7)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[104] at parallelize at <console>:24
scala> val rdd2 = sc.makeRDD(3 to 6)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[106] at makeRDD at <console>:24
scala> rdd1.subtract(rdd2).collect
res55: Array[Int] = Array(1, 2, 7)
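On plain Scala collections, the same semantics can be expressed as a filter against the second collection's elements (a sketch of what subtract computes, not Spark's implementation):

```scala
val r1 = (1 to 7).toList
val r2 = (3 to 6).toSet

// Keep the elements of r1 that do not appear in r2.
val left = r1.filterNot(r2.contains)
println(left) // List(1, 2, 7)
```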
----sortBy operator: ascending by default (true); pass false for descending
----sortBy source signature:
def sortBy[K](
    f: (T) => K,
    ascending: Boolean = true,                    // sort ascending? defaults to true
    numPartitions: Int = this.partitions.length)  // optionally redefine the number of partitions
scala> val c = a.intersection(b)
c: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[27] at intersection at <console>:27
scala> c.collect
res12: Array[Int] = Array(3, 4, 1, 5, 2)
----sortBy: without the flag it defaults to true (ascending)
scala> c.sortBy(x=>x).collect
res14: Array[Int] = Array(1, 2, 3, 4, 5)
----passing true explicitly gives the same result
scala> c.sortBy(x=>x,true).collect
res15: Array[Int] = Array(1, 2, 3, 4, 5)
----sortBy: false (descending)
scala> c.sortBy(x=>x,false).collect
res16: Array[Int] = Array(5, 4, 3, 2, 1)
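Scala collections offer an analogous sortBy; instead of a Boolean flag it takes an implicit Ordering, so descending order comes from Ordering.Int.reverse. A standalone sketch reproducing the results above:

```scala
val xs = List(3, 4, 1, 5, 2)

// Ascending is the default ordering.
val asc = xs.sortBy(identity)
// Descending: pass the reversed ordering explicitly.
val desc = xs.sortBy(identity)(Ordering.Int.reverse)

println(asc)  // List(1, 2, 3, 4, 5)
println(desc) // List(5, 4, 3, 2, 1)
```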
----distinct: deduplication operator
----(since deduplication is bound to reduce the data volume, you can pass a parameter to redefine the default number of partitions)
scala> val c = a.union(b)
c: org.apache.spark.rdd.RDD[Int] = UnionRDD[48] at union at <console>:27
scala> c.collect
res17: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5)
scala> c.distinct.collect
res18: Array