5.Spark Core Explained: Common RDD Actions

        Actions are the execution part of an RDD: calling methods such as count, reduce, or collect is what actually triggers the computation on the data.

1.reduce(func)

Aggregates all elements of the RDD with the function func; the function must be commutative and associative so that it can be computed correctly in parallel.

scala> val rdd1 = sc.parallelize(1 to 10, 2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:24

scala> rdd1.reduce(_+_)
res5: Int = 55

scala> val rdd2 = sc.parallelize(Array(("a", 1), ("a", 3), ("c", 3), ("d", 5)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[11] at parallelize at <console>:24

scala> rdd2.reduce((x, y) => ((x._1 + y._1), x._2 + y._2))
res7: (String, Int) = (aacd,12)
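
Because reduce gives no guarantee about the order in which elements and partial results are combined, a non-commutative function such as the string concatenation above may produce a differently ordered first component on different runs. A minimal sketch of a deterministic alternative when only the numeric sum is needed, assuming the rdd2 defined above:

// sum only the values; addition is commutative and associative,
// so the result is always the same
val total = rdd2.map(_._2).reduce(_ + _)   // 12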
2.collect()

Returns all elements of the dataset to the driver program as an array.

scala> val rdd = sc.parallelize(1 to 10, 2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at parallelize at <console>:24

scala> rdd.collect()
res8: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
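
Since collect() materializes the whole dataset in the driver's memory, it should only be used on small RDDs. A sketch of an alternative for walking over a larger RDD on the driver without holding it all at once, assuming the rdd above:

// toLocalIterator fetches one partition at a time instead of
// building the entire array in driver memory
rdd.toLocalIterator.foreach(println)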
3.count()

Returns the number of elements in the RDD.

scala> val rdd = sc.parallelize(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[13] at parallelize at <console>:24

scala> rdd.count()
res9: Long = 10
4.first()

Returns the first element of the RDD (similar to take(1)).

scala> val rdd = sc.parallelize(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[13] at parallelize at <console>:24

scala> rdd.first()
res10: Int = 1
5.take(n)

Returns an array consisting of the first n elements of the dataset, without any sorting.

scala> val rdd = sc.parallelize(Array(5, 1, 8, 4, 10, 2))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:24

scala> rdd.take(1)
res6: Array[Int] = Array(5)

scala> rdd.take(3)
res7: Array[Int] = Array(5, 1, 8)
6.top(n)

Returns an array of the top n elements of the dataset, in descending order by default; a custom Ordering can be supplied to change the sort rule.

scala> val rdd = sc.parallelize(Array(5, 1, 8, 4, 10, 2))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:24

scala> rdd.top(1)
res9: Array[Int] = Array(10)

scala> rdd.top(2)
res10: Array[Int] = Array(10, 8)

// make top use ascending order by reversing the default Ordering[Int]
scala> implicit val myOrd = implicitly[Ordering[Int]].reverse
myOrd: scala.math.Ordering[Int] = scala.math.Ordering$$anon$4@140a2add

scala> rdd.top(1)
res11: Array[Int] = Array(1)

scala> rdd.top(2)
res12: Array[Int] = Array(1, 2)
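
The Ordering can also be passed to top explicitly instead of (or in addition to) defining an implicit value, which avoids leaving a reversed implicit in scope for later calls. A minimal sketch using the same rdd:

// descending (the natural ordering) and ascending, passing the Ordering explicitly
rdd.top(2)(Ordering[Int])          // Array(10, 8)
rdd.top(2)(Ordering[Int].reverse)  // Array(1, 2)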
7.takeOrdered(n)

Similar to top, but with the opposite default: it returns the first n elements according to the implicit Ordering, i.e. the n smallest elements in ascending order. The output below is descending only because the reversed implicit Ordering defined in the previous example is still in scope in this shell session.

scala> val rdd = sc.parallelize(Array(5, 1, 8, 4, 10, 2))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at parallelize at <console>:26

scala> rdd.takeOrdered(1)
res13: Array[Int] = Array(10)

scala> rdd.takeOrdered(2)
res14: Array[Int] = Array(10, 8)
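
A sketch that passes the Ordering explicitly, so the result does not depend on which implicits happen to be in scope:

// with the natural ordering, takeOrdered returns the n smallest elements (ascending)
rdd.takeOrdered(2)(Ordering[Int])          // Array(1, 2)
// with the reversed ordering it behaves like top
rdd.takeOrdered(2)(Ordering[Int].reverse)  // Array(10, 8)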
8.takeSample(withReplacement, num, [seed])

Returns an array of num elements randomly sampled from the dataset; withReplacement controls whether sampling is done with replacement, and seed optionally fixes the random number generator seed.
Parameters:

  1. withReplacement: if true, elements may be sampled more than once (sampling with replacement)
  2. num: the number of elements to return
  3. seed: seed for the random number generator
    Note: use this method only when the expected result array is small, because all sampled data is loaded into the driver's memory.

scala> var rdd = sc.parallelize(1 to 10, 2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.collect()
res0: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)     

scala> rdd.takeSample(true, 5, 3)
res1: Array[Int] = Array(3, 5, 5, 9, 7)

scala> rdd.takeSample(false, 5, 3)
res2: Array[Int] = Array(6, 4, 2, 3, 5)
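
When an exact sample size is not required, the sample transformation is usually preferable: it takes a fraction rather than a count, stays distributed, and does not pull results back to the driver. A minimal sketch, assuming the same rdd:

// sample is a transformation: it returns a new RDD of roughly
// fraction * count elements, computed lazily on the executors
val sampled = rdd.sample(withReplacement = false, fraction = 0.3, seed = 3)
sampled.collect()   // roughly 3 of the 10 elements; the exact contents depend on the seed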
9.aggregate (zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)

The aggregate function folds the elements of each partition with seqOp, starting from the initial value zeroValue, and then merges the per-partition results, again together with zeroValue, using the combOp function. The final return type does not have to match the element type of the RDD.

scala> var rdd = sc.makeRDD(1 to 4, 2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at makeRDD at <console>:24

scala> rdd.collect()
res0: Array[Int] = Array(1, 2, 3, 4)                                            

scala> rdd.aggregate(1)(
     | {(x: Int, y: Int) => x + y},
     | {(a: Int, b: Int) => a + b}
     | )
res1: Int = 13

scala> rdd.aggregate(1)(
     | {(x: Int, y: Int) => x * y},
     | {(a: Int, b: Int) => a + b}
     | )
res1: Int = 15
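
The value 13 above can be traced through the partition layout: with two partitions holding (1, 2) and (3, 4), seqOp produces 1+1+2 = 4 and 1+3+4 = 8, and combOp then computes 1+4+8 = 13, so zeroValue is applied once per partition plus once more in the combine step. A sketch for inspecting how the elements are partitioned:

// glom groups each partition into an array, making the split visible
rdd.glom().collect()   // Array(Array(1, 2), Array(3, 4))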
10.fold(zeroValue)(func)

A fold operation: a simplified form of aggregate in which seqOp and combOp are the same function.

scala> var rdd = sc.makeRDD(1 to 4, 2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at makeRDD at <console>:24

scala> rdd.collect()
res0: Array[Int] = Array(1, 2, 3, 4)              
         
scala> rdd.fold(1)((x: Int, y:Int) => x + y)
res2: Int = 13

scala> rdd.fold(1)(_+_)
res3: Int = 13
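
As with aggregate, zeroValue = 1 is applied once per partition and once more in the final combine, giving 1 + (1+1+2) + (1+3+4) = 13. A quick sketch showing that a neutral zeroValue yields the plain sum:

// with the identity element 0 as zeroValue, fold behaves like reduce(_ + _)
rdd.fold(0)(_ + _)   // 10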
11.saveAsTextFile(path)

Writes the elements of the dataset as a text file (a set of part files) to HDFS or any other supported file system. For each element, Spark calls its toString method to convert it into a line of text in the file.

scala> var rdd = sc.parallelize(1 to 10, 2)

scala> rdd.saveAsTextFile("hdfs://harvey:9000/rdd")

[hadoop@harvey ~]$ hadoop fs -text /rdd/*
1
2
3
4
5
6
7
8
9
10
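
The saved directory can be read back with sc.textFile; note that every element comes back as a String and has to be re-parsed. A minimal sketch, assuming the same HDFS path:

// read the text files back; each line is a String and must be converted again
val loaded = sc.textFile("hdfs://harvey:9000/rdd").map(_.toInt)
loaded.collect()   // the same ten numbers, re-parsed as Ints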
12.saveAsSequenceFile(path)

Writes the elements of the dataset to the given directory as a Hadoop SequenceFile; the directory can be on HDFS or any other Hadoop-supported file system.

scala> var rdd = sc.parallelize(List((1, 2), (2, 5), (1,6), (3, 4), (2, 8), (3, 1), (3, 5)), 3)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[18] at parallelize at <console>:26

scala> rdd.saveAsSequenceFile("hdfs://harvey:9000/seqFile")

[hadoop@harvey ~]$ hadoop fs -text /seqFile/*
1	2
2	5
1	6
3	4
2	8
3	1
3	5
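
A SequenceFile written this way can be loaded back as an RDD of pairs with sc.sequenceFile, giving the key and value types explicitly. A sketch, assuming the same path:

// read the SequenceFile back as an RDD[(Int, Int)]
val pairs = sc.sequenceFile[Int, Int]("hdfs://harvey:9000/seqFile")
pairs.collect()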
13.saveAsObjectFile(path)

Serializes the elements of the RDD and saves them to a file as serialized objects.

scala> var rdd = sc.parallelize(List((1, 2), (2, 5), (1,6), (3, 4), (2, 8), (3, 1), (3, 5)), 3)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[18] at parallelize at <console>:26
                                                                                
scala> rdd.saveAsObjectFile("hdfs://harvey:9000/objFile")

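Files written with saveAsObjectFile can be loaded back with sc.objectFile, specifying the element type. A sketch, assuming the same path:

// read the serialized objects back as an RDD[(Int, Int)]
val restored = sc.objectFile[(Int, Int)]("hdfs://harvey:9000/objFile")
restored.collect()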

14.countByKey()

Only available for RDDs of type (K, V); returns a Map of (K, Long) pairs giving, for each key, the number of elements with that key.

scala> var rdd = sc.parallelize(List((1, 2), (2, 5), (1,6), (3, 4), (2, 8), (3, 1), (3, 5)), 3)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[22] at parallelize at <console>:26

scala> rdd.countByKey()
res19: scala.collection.Map[Int,Long] = Map(3 -> 3, 1 -> 2, 2 -> 2)
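
countByKey returns the whole map to the driver, so it should only be used when the number of distinct keys is small. A sketch of a distributed alternative that keeps the counts as an RDD, assuming the same rdd:

// count per key without collecting everything to the driver
val counts = rdd.mapValues(_ => 1L).reduceByKey(_ + _)
counts.collect()   // Array((1,2), (2,2), (3,3)), in some order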
15.foreach(func)

Runs the function func on each element of the dataset, typically for side effects such as updating an accumulator or writing to an external storage system.

scala> var rdd = sc.makeRDD(1 to 10, 2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[26] at makeRDD at <console>:26

scala> var sum = sc.accumulator(0)
warning: there were two deprecation warnings; re-run with -deprecation for details
sum: org.apache.spark.Accumulator[Int] = 0

scala> rdd.foreach(sum+=_)

scala> sum.value
res21: Int = 55

scala> rdd.collect().foreach(println)
1
2
3
4
5
6
7
8
9
10
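
Because foreach runs on the executors, side effects on ordinary local variables are not visible back on the driver, which is why an accumulator is used above (and why collect().foreach(println) is needed to print on the driver). The sc.accumulator API is deprecated, as the warning above shows; a sketch of the same sum with the newer longAccumulator API, assuming Spark 2.x:

// LongAccumulator is the non-deprecated replacement for sc.accumulator(0)
val acc = sc.longAccumulator("sum")
rdd.foreach(x => acc.add(x))
acc.value   // 55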
16.Actions on numeric RDDs

Spark provides a set of descriptive statistics operations for RDDs containing numeric data. These actions are only available on numeric RDDs and are not supported for other element types.

Spark's numeric operations are implemented with a streaming algorithm that builds up the statistics one element at a time. All of these statistics are computed in a single pass over the data when stats() is called and are returned as a StatsCounter object.

scala> var rdd = sc.parallelize(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.count()
res0: Long = 10

scala> rdd.mean()
res1: Double = 5.5

scala> rdd.sum()
res2: Double = 55.0

scala> rdd.max()
res3: Int = 10

scala> rdd.min()
res4: Int = 1

scala> rdd.variance()
res5: Double = 8.25

scala> rdd.sampleVariance()
res6: Double = 9.166666666666666

scala> rdd.stdev()
res7: Double = 2.8722813232690143

scala> rdd.sampleStdev()
res8: Double = 3.0276503540974917
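
All of the figures above can also be obtained in one pass by calling stats(), which returns the StatsCounter mentioned earlier. A brief sketch, assuming the same rdd:

// a single pass over the data; the StatsCounter carries count, mean, sum,
// max, min, variance and standard deviation
val st = rdd.stats()
st.mean    // 5.5
st.stdev   // 2.8722813232690143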