Spark Operators

Data preparation

result_math.txt on HDFS is the math score table: column 1 is the student ID, column 2 the subject (math), and column 3 the score.

result_bigdata.txt on HDFS is the big-data score table: column 1 is the student ID, column 2 the subject (big data), and column 3 the score.

1. textFile: read an HDFS file into an RDD

scala> val math=sc.textFile("test1/result_math.txt")
scala> math.collect
Result:
res0: Array[String] = Array(1001	math	96, 1002	math	94, 1003	math	100, 1004	math	100, 1005	math	94, 1006	math	80, 1007	math	90, 1008	math	94, 1009	math	84, 1010	math	86, 1011	math	79, 1012	math	91)

2. map: transforms every element of the RDD with a user-defined function, producing a new RDD; map does not change the number of partitions

Split each element of math on \t and convert the third field to Int:
scala> val math_map=math.map{x=>val line=x.split("\t");(line(0),line(1),line(2).toInt)}
Inspect math_map:
scala> math_map.collect
Result:
res0: Array[(String, String, Int)] = Array((1001,math,96), (1002,math,94), (1003,math,100), (1004,math,100), (1005,math,94), (1006,math,80), (1007,math,90), (1008,math,94), (1009,math,84), (1010,math,86), (1011,math,79), (1012,math,91))
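The parsing step can be sketched outside Spark with plain Python; the two tab-separated lines below are hypothetical sample rows in the format of result_math.txt:

```python
# Plain-Python sketch of the map step: split each tab-separated
# line and convert the third field to an integer.
lines = ["1001\tmath\t96", "1002\tmath\t94"]  # hypothetical sample rows

def parse(line):
    sid, subject, score = line.split("\t")
    return (sid, subject, int(score))

math_map = [parse(line) for line in lines]
print(math_map)  # [('1001', 'math', 96), ('1002', 'math', 94)]
```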

3. sortBy: sorts an RDD

sortBy takes three parameters:

1. f: (T) => K — a function mapping each element of the RDD to the key to sort by

2. ascending — true for ascending order, false for descending

3. the number of partitions of the sorted RDD
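The first two parameters map directly onto Python's built-in sorted (its key and reverse arguments); the partition count has no local analogue. A plain-Python sketch on a few hypothetical records:

```python
# sortBy sketch: key function plays the role of f: (T) => K,
# reverse=True corresponds to ascending=false.
records = [("1001", "math", 96), ("1003", "math", 100), ("1011", "math", 79)]
by_score_desc = sorted(records, key=lambda x: x[2], reverse=True)
print(by_score_desc)  # [('1003', 'math', 100), ('1001', 'math', 96), ('1011', 'math', 79)]
```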

Sort by score (the third column is the score):
scala> val math_sort = math_map.sortBy(x=>x._3,false,1)
scala> math_sort.collect
Result:
res1: Array[(String, String, Int)] = Array((1003,math,100), (1004,math,100), (1001,math,96), (1002,math,94), (1005,math,94), (1008,math,94), (1012,math,91), (1007,math,90), (1010,math,86), (1009,math,84), (1006,math,80), (1011,math,79))


4. union: concatenates two RDDs of the same element type

First create an RDD with the same element type as math:
scala> val bigdata=sc.textFile("test1/result_bigdata.txt")
scala> val bigdata_map=bigdata.map{x=>val line=x.split("\t");(line(0),line(1),line(2).toInt)}
scala> bigdata_map.collect
res2: Array[(String, String, Int)] = Array((1001,hadoop,90), (1002,hadoop,94), (1003,hadoop,100), (1004,hadoop,99), (1005,hadoop,90), (1006,hadoop,94), (1007,hadoop,100), (1008,hadoop,93), (1009,hadoop,89), (1010,hadoop,78), (1011,hadoop,91), (1012,hadoop,84))

Concatenate math_sort and bigdata_map:
scala> val grage=math_sort.union(bigdata_map)
scala> grage.collect
res4: Array[(String, String, Int)] = Array((1003,math,100), (1004,math,100), (1001,math,96), (1008,math,94), (1002,math,94), (1005,math,94), (1012,math,91), (1007,math,90), (1010,math,86), (1009,math,84), (1006,math,80), (1011,math,79), (1001,hadoop,90), (1002,hadoop,94), (1003,hadoop,100), (1004,hadoop,99), (1005,hadoop,90), (1006,hadoop,94), (1007,hadoop,100), (1008,hadoop,93), (1009,hadoop,89), (1010,hadoop,78), (1011,hadoop,91), (1012,hadoop,84))
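Note that union simply concatenates the two RDDs and, unlike SQL's UNION, keeps duplicates. In plain Python it is just list concatenation:

```python
# union sketch: concatenation, duplicates survive
rdd1 = [1, 2, 3]
rdd2 = [2, 3, 4]
unioned = rdd1 + rdd2
print(unioned)  # [1, 2, 3, 2, 3, 4]
```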


5. filter: keeps only the elements that satisfy a predicate

Find the records with a perfect score:
scala> val mangfen=grage.filter(x=>x._3==100)
scala> mangfen.collect
res5: Array[(String, String, Int)] = Array((1003,math,100), (1004,math,100), (1003,hadoop,100), (1007,hadoop,100))

6. distinct: removes duplicate elements

Take the student IDs with a score of 100 and deduplicate them:
scala> mangfen.map(x=>x._1).distinct.collect
res9: Array[String] = Array(1003, 1004, 1007)    

7. subtract: returns the difference of two RDDs

scala> val rdd1=sc.parallelize(List(1,2,3))
scala> val rdd2=sc.parallelize(List(2,3,4))
scala> rdd2.subtract(rdd1).collect
res10: Array[Int] = Array(4)                                                    

scala> rdd1.subtract(rdd2).collect
res11: Array[Int] = Array(1)

8. intersection: returns the intersection of two RDDs

scala> rdd1.intersection(rdd2).collect
res13: Array[Int] = Array(3, 2)

9. cartesian: returns the Cartesian product of two RDDs

scala> rdd1.cartesian(rdd2).collect
res14: Array[(Int, Int)] = Array((1,2), (1,3), (1,4), (2,2), (2,3), (2,4), (3,2), (3,3), (3,4))

10. flatMap: applies a function to every element, then flattens the results

scala> math.flatMap(x=>x.split("\t")).collect
res17: Array[String] = Array(1001, math, 96, 1002, math, 94, 1003, math, 100, 1004, math, 100, 1005, math, 94, 1006, math, 80, 1007, math, 90, 1008, math, 94, 1009, math, 84, 1010, math, 86, 1011, math, 79, 1012, math, 91)

For comparison, the same split without flattening:
scala> math.map(x=>x.split("\t")).collect
res19: Array[Array[String]] = Array(Array(1001, math, 96), Array(1002, math, 94), Array(1003, math, 100), Array(1004, math, 100), Array(1005, math, 94), Array(1006, math, 80), Array(1007, math, 90), Array(1008, math, 94), Array(1009, math, 84), Array(1010, math, 86), Array(1011, math, 79), Array(1012, math, 91))
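The difference is exactly one level of nesting; in plain Python, map keeps each split as its own list while flatMap concatenates all of them into one flat list:

```python
# map vs flatMap sketch on hypothetical sample rows
lines = ["1001\tmath\t96", "1002\tmath\t94"]
mapped = [line.split("\t") for line in lines]               # map: a list of lists
flat = [tok for line in lines for tok in line.split("\t")]  # flatMap: one flat list
print(mapped)  # [['1001', 'math', '96'], ['1002', 'math', '94']]
print(flat)    # ['1001', 'math', '96', '1002', 'math', '94']
```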

11. join: inner-joins two pair RDDs on their keys, pairing up the values of matching keys

scala> val math=sc.textFile("test1/result_math.txt").map{x=>val line=x.split("\t");(line(0),line(2).toInt)}

scala> val bigdata=sc.textFile("test1/result_bigdata.txt").map{x=>val line=x.split("\t");(line(0),line(2).toInt)}

scala> math.join(bigdata).collect
res1: Array[(String, (Int, Int))] = Array((1005,(94,90)), (1012,(91,84)), (1001,(96,90)), (1009,(84,89)), (1010,(86,78)), (1003,(100,100)), (1007,(90,100)), (1008,(94,93)), (1002,(94,94)), (1011,(79,91)), (1004,(100,99)), (1006,(80,94)))
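join is an inner join: only keys present in both RDDs appear in the output. A plain-Python sketch with dicts (the ID "1099", present only on the math side, is a hypothetical sample and is dropped):

```python
# inner-join sketch: keep only keys found in both sides
math = {"1001": 96, "1002": 94, "1099": 50}
bigdata = {"1001": 90, "1002": 94}
joined = [(k, (math[k], bigdata[k])) for k in math if k in bigdata]
print(sorted(joined))  # [('1001', (96, 90)), ('1002', (94, 94))] — '1099' is gone
```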

12. mapValues: applies a map function to the values of a pair RDD, leaving the keys untouched

Sum the values (the two scores per student):
scala> math.join(bigdata).mapValues(x=>x._1+x._2).collect
res2: Array[(String, Int)] = Array((1005,184), (1012,175), (1001,186), (1009,173), (1010,164), (1003,200), (1007,190), (1008,187), (1002,188), (1011,170), (1004,199), (1006,174))

13. reduceByKey: merges the values of each key with a binary function, operating only on the values

Concatenate the math and big-data scores, attach a 1 to each score to mark one course taken, then add up the scores and counts of identical student IDs, yielding each student's total score and total course count:
scala> math.union(bigdata).mapValues(x=>(x,1)).reduceByKey((x,y)=>(x._1+y._1,x._2+y._2)).collect
res5: Array[(String, (Int, Int))] = Array((1005,(184,2)), (1012,(175,2)), (1001,(186,2)), (1009,(173,2)), (1002,(188,2)), (1006,(174,2)), (1010,(164,2)), (1003,(200,2)), (1007,(190,2)), (1008,(187,2)), (1011,(170,2)), (1004,(199,2)))
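What reduceByKey computes here can be sketched in plain Python: fold the (score, 1) pairs per key with the same merge function. The sample pairs below are hypothetical (one student deliberately has a single course so the count matters):

```python
# reduceByKey sketch: fold (score, count) pairs per key with
# (x, y) => (x._1 + y._1, x._2 + y._2)
pairs = [("1001", (96, 1)), ("1001", (90, 1)), ("1002", (94, 1))]
totals = {}
for key, (score, count) in pairs:
    s, c = totals.get(key, (0, 0))
    totals[key] = (s + score, c + count)
print(totals)  # {'1001': (186, 2), '1002': (94, 1)}
```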

14. Saving RDD results to HDFS


1. Set the number of partitions to 1 before saving; without repartitioning, the output is split into one file per partition:
val a=math.union(bigdata).combineByKey(x=>(x,1),(x:(Int,Int),y:Int)=>(x._1+y,x._2+1),(x:(Int,Int),y:(Int,Int))=>(x._1+y._1,x._2+y._2))
a.mapValues(x=>x._1/x._2).repartition(1).saveAsTextFile("test1/out")
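The combineByKey call above takes three functions: createCombiner (build the initial combiner from a key's first value), mergeValue (fold another value into the combiner within a partition), and mergeCombiners (merge combiners across partitions). A plain-Python sketch of the same average computation on hypothetical sample pairs:

```python
# combineByKey sketch for computing per-key averages
def create(v):            # createCombiner: x => (x, 1)
    return (v, 1)

def merge_value(c, v):    # mergeValue: ((sum, count), x) => (sum + x, count + 1)
    return (c[0] + v, c[1] + 1)

def merge_comb(a, b):     # mergeCombiners: add sums and counts (needed across partitions)
    return (a[0] + b[0], a[1] + b[1])

pairs = [("1001", 96), ("1001", 90), ("1002", 94)]
combined = {}
for k, v in pairs:
    combined[k] = merge_value(combined[k], v) if k in combined else create(v)
# integer division, matching x._1 / x._2 on Ints in the Scala code
averages = {k: s // n for k, (s, n) in combined.items()}
print(averages)  # {'1001': 93, '1002': 94}
```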

2. Save the result in SequenceFile format:
import org.apache.hadoop.io.{IntWritable,Text}
a.mapValues(x=>x._1/x._2).saveAsSequenceFile("test1/outse")
Read the SequenceFile back:
val outse=sc.sequenceFile("test1/outse",classOf[Text],classOf[IntWritable]).map{case (x,y)=>(x.toString,y.get())}
