Data preparation
result_math.txt on HDFS is the math score table: column 1 is the student ID, column 2 the subject (math), column 3 the score.
result_bigdata.txt on HDFS is the big-data score table: column 1 is the student ID, column 2 the subject (big data), column 3 the score.
1. Read a file on HDFS into an RDD with textFile
scala> val math=sc.textFile("test1/result_math.txt")
scala> math.collect
Result:
res0: Array[String] = Array(1001 math 96, 1002 math 94, 1003 math 100, 1004 math 100, 1005 math 94, 1006 math 80, 1007 math 90, 1008 math 94, 1009 math 84, 1010 math 86, 1011 math 79, 1012 math 91)
2. map: transforms each element of the RDD with a user-defined function, producing a new RDD; map does not change the number of partitions.
Split each element of math on the tab character and convert the third field to Int.
scala> val math_map=math.map{x=>val line=x.split("\t");(line(0),line(1),line(2).toInt)}
Inspect math_map
scala> math_map.collect
Result:
res0: Array[(String, String, Int)] = Array((1001,math,96), (1002,math,94), (1003,math,100), (1004,math,100), (1005,math,94), (1006,math,80), (1007,math,90), (1008,math,94), (1009,math,84), (1010,math,86), (1011,math,79), (1012,math,91))
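The parsing step above can be sketched as a standalone function, outside Spark. `parseLine` is an illustrative name, not part of the shell session; it applies the same split-and-convert logic to one tab-separated line.

```scala
// Hypothetical helper mirroring the map function used above:
// split a tab-separated line into (student ID, subject, score).
def parseLine(line: String): (String, String, Int) = {
  val fields = line.split("\t")
  (fields(0), fields(1), fields(2).toInt)
}

val row = parseLine("1001\tmath\t96")
println(row)
```

Spark's map simply applies such a function to every element of the RDD, one partition at a time.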
3. sortBy sorts an RDD.
sortBy takes three parameters:
1. f: (T) => K — takes each element of the RDD and returns the key to sort by
2. the sort direction: true for ascending, false for descending
3. the number of partitions after sorting
Sort by score (the third column):
scala> val math_sort = math_map.sortBy(x=>x._3,false,1)
scala> math_sort.collect
Result:
res1: Array[(String, String, Int)] = Array((1003,math,100), (1004,math,100), (1001,math,96), (1002,math,94), (1005,math,94), (1008,math,94), (1012,math,91), (1007,math,90), (1010,math,86), (1009,math,84), (1006,math,80), (1011,math,79))
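Plain Scala collections have a sortBy with the same key-extractor idea; the difference is that descending order is expressed with a reversed Ordering rather than a boolean flag. A minimal sketch, using a few rows from the table above:

```scala
// sortBy on a plain List: the key extractor picks the score (_._3),
// and Ordering[Int].reverse gives descending order.
val scores = List(("1001", "math", 96), ("1003", "math", 100), ("1011", "math", 79))
val descending = scores.sortBy(_._3)(Ordering[Int].reverse)
println(descending)
```

Spark's extra third parameter exists because, unlike a local List, the sorted result is distributed and must be split into some number of partitions.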
4. union: merges two RDDs of the same type.
First create a dataset with the same shape as math.
scala> val bigdata=sc.textFile("test1/result_bigdata.txt")
scala> val bigdata_map=bigdata.map{x=>val line=x.split("\t");(line(0),line(1),line(2).toInt)}
scala> bigdata_map.collect
res2: Array[(String, String, Int)] = Array((1001,hadoop,90), (1002,hadoop,94), (1003,hadoop,100), (1004,hadoop,99), (1005,hadoop,90), (1006,hadoop,94), (1007,hadoop,100), (1008,hadoop,93), (1009,hadoop,89), (1010,hadoop,78), (1011,hadoop,91), (1012,hadoop,84))
Merge math_sort with bigdata_map
scala> val grage=math_sort.union(bigdata_map)
scala> grage.collect
res4: Array[(String, String, Int)] = Array((1003,math,100), (1004,math,100), (1001,math,96), (1008,math,94), (1002,math,94), (1005,math,94), (1012,math,91), (1007,math,90), (1010,math,86), (1009,math,84), (1006,math,80), (1011,math,79), (1001,hadoop,90), (1002,hadoop,94), (1003,hadoop,100), (1004,hadoop,99), (1005,hadoop,90), (1006,hadoop,94), (1007,hadoop,100), (1008,hadoop,93), (1009,hadoop,89), (1010,hadoop,78), (1011,hadoop,91), (1012,hadoop,84))
5. filter: keeps only the elements that satisfy the predicate.
Find the records with a full score (100):
scala> val mangfen=grage.filter(x=>x._3==100)
scala> mangfen.collect
res5: Array[(String, String, Int)] = Array((1003,math,100), (1004,math,100), (1003,hadoop,100), (1007,hadoop,100))
6. distinct: removes duplicate elements.
Take the IDs of the students who scored 100, then deduplicate them:
scala> mangfen.map(x=>x._1).distinct.collect
res9: Array[String] = Array(1003, 1004, 1007)
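The same filter → map → distinct chain works on plain Scala collections, which makes it easy to check the logic locally. A sketch over the four full-score records found above:

```scala
// filter keeps full scores, map projects the ID, distinct deduplicates.
val grades = List(("1003", "math", 100), ("1004", "math", 100),
                  ("1003", "hadoop", 100), ("1007", "hadoop", 100))
val ids = grades.filter(_._3 == 100).map(_._1).distinct
println(ids)
```

On an RDD, distinct additionally triggers a shuffle, since duplicates may live in different partitions.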
7. subtract: the difference of two RDDs.
scala> val rdd1=sc.parallelize(List(1,2,3))
scala> val rdd2=sc.parallelize(List(2,3,4))
scala> rdd2.subtract(rdd1).collect
res10: Array[Int] = Array(4)
scala> rdd1.subtract(rdd2).collect
res11: Array[Int] = Array(1)
8. intersection: the intersection of two RDDs.
scala> rdd1.intersection(rdd2).collect
res13: Array[Int] = Array(3, 2)
9. cartesian: the Cartesian product of two RDDs.
scala> rdd1.cartesian(rdd2).collect
res14: Array[(Int, Int)] = Array((1,2), (1,3), (1,4), (2,2), (2,3), (2,4), (3,2), (3,3), (3,4))
10. flatMap: applies a function to each element, then flattens the results into one collection.
scala> math.flatMap(x=>x.split("\t")).collect
res17: Array[String] = Array(1001, math, 96, 1002, math, 94, 1003, math, 100, 1004, math, 100, 1005, math, 94, 1006, math, 80, 1007, math, 90, 1008, math, 94, 1009, math, 84, 1010, math, 86, 1011, math, 79, 1012, math, 91)
For comparison, the same operation without flattening:
scala> math.map(x=>x.split("\t")).collect
res19: Array[Array[String]] = Array(Array(1001, math, 96), Array(1002, math, 94), Array(1003, math, 100), Array(1004, math, 100), Array(1005, math, 94), Array(1006, math, 80), Array(1007, math, 90), Array(1008, math, 94), Array(1009, math, 84), Array(1010, math, 86), Array(1011, math, 79), Array(1012, math, 91))
11. join: for pair RDDs, pairs up the values of matching keys.
scala> val math=sc.textFile("test1/result_math.txt").map{x=>val line=x.split("\t");(line(0),line(2).toInt)}
scala> val bigdata=sc.textFile("test1/result_bigdata.txt").map{x=>val line=x.split("\t");(line(0),line(2).toInt)}
scala> math.join(bigdata).collect
res1: Array[(String, (Int, Int))] = Array((1005,(94,90)), (1012,(91,84)), (1001,(96,90)), (1009,(84,89)), (1010,(86,78)), (1003,(100,100)), (1007,(90,100)), (1008,(94,93)), (1002,(94,94)), (1011,(79,91)), (1004,(100,99)), (1006,(80,94)))
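The semantics of the inner join above can be sketched over plain Scala sequences. The generic `join` below is an illustrative helper, not a Spark API: it keeps only keys present on both sides and pairs their values, which is what RDD join does (Spark just does it with a shuffle by key).

```scala
// Minimal inner-join sketch: for each left pair, find right pairs
// with the same key and emit (key, (leftValue, rightValue)).
def join[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] =
  for {
    (k, a)  <- left
    (k2, b) <- right
    if k == k2
  } yield (k, (a, b))

val m = Seq(("1001", 96), ("1003", 100))
val b = Seq(("1001", 90), ("1003", 100), ("1004", 99))
val joined = join(m, b)
println(joined)
```

Note that "1004" is dropped because it appears only on the right; an RDD join behaves the same way (leftOuterJoin or rightOuterJoin would keep it).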
12. mapValues: applies a map function to the values of a pair RDD, leaving the keys untouched.
Sum the two scores of each student:
scala> math.join(bigdata).mapValues(x=>x._1+x._2).collect
res2: Array[(String, Int)] = Array((1005,184), (1012,175), (1001,186), (1009,173), (1010,164), (1003,200), (1007,190), (1008,187), (1002,188), (1011,170), (1004,199), (1006,174))
13. reduceByKey: merges the values of matching keys with the given function; the keys themselves are untouched.
Union the math and big-data scores, pair each score with a 1 as a course count, then sum scores and counts per student ID, giving each student's total score and total number of courses.
scala> math.union(bigdata).mapValues(x=>(x,1)).reduceByKey((x,y)=>(x._1+y._1,x._2+y._2)).collect
res5: Array[(String, (Int, Int))] = Array((1005,(184,2)), (1012,(175,2)), (1001,(186,2)), (1009,(173,2)), (1002,(188,2)), (1006,(174,2)), (1010,(164,2)), (1003,(200,2)), (1007,(190,2)), (1008,(187,2)), (1011,(170,2)), (1004,(199,2)))
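The (score, 1) trick can be checked with plain Scala collections: group by key, then reduce the (sum, count) pairs exactly as the reduceByKey function above does. A sketch over two students' scores:

```scala
// Pair each score with a course count of 1, then sum both per key.
val all = Seq(("1001", 96), ("1001", 90), ("1003", 100), ("1003", 100))
val totals = all
  .map { case (id, score) => (id, (score, 1)) }
  .groupBy(_._1)
  .map { case (id, rows) =>
    val (sum, count) = rows.map(_._2).reduce((x, y) => (x._1 + y._1, x._2 + y._2))
    (id, (sum, count))
  }
println(totals)
```

In Spark the reduce function must be associative and commutative for exactly this reason: values for one key are combined in whatever order the partitions deliver them.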
14. Saving RDD data to HDFS
1. Repartition to 1 and save; without repartitioning, Spark writes one output file per partition.
combineByKey takes three functions: one that turns a key's first value into a (sum, count) combiner, one that merges a further value into a combiner, and one that merges combiners from different partitions. Note there must be no collect here: collect would turn a into a local Array, which has neither mapValues nor saveAsTextFile.
val a=math.union(bigdata).combineByKey(x=>(x,1),(x:(Int,Int),y:Int)=>(x._1+y,x._2+1),(x:(Int,Int),y:(Int,Int))=>(x._1+y._1,x._2+y._2))
a.mapValues(x=>x._1/x._2).repartition(1).saveAsTextFile("test1/out")
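How the three combineByKey functions cooperate can be sketched locally with foldLeft. This is a plain-collections analogue, not Spark: within one "partition" the first value goes through createCombiner and the rest through mergeValue; mergeCombiners is defined here only to show its shape, since with a single local sequence there are no per-partition combiners to merge.

```scala
// The three combineByKey functions, with the same types as above.
val createCombiner: Int => (Int, Int) = s => (s, 1)
val mergeValue: ((Int, Int), Int) => (Int, Int) = (acc, s) => (acc._1 + s, acc._2 + 1)
val mergeCombiners: ((Int, Int), (Int, Int)) => (Int, Int) =
  (a, b) => (a._1 + b._1, a._2 + b._2)

val scoresByKey = Seq(("1001", 96), ("1001", 90), ("1003", 100), ("1003", 100))
val averages = scoresByKey
  .groupBy(_._1)
  .map { case (id, rows) =>
    val scores = rows.map(_._2)
    // First value seeds the combiner; the rest are folded in with mergeValue.
    val combined = scores.tail.foldLeft(createCombiner(scores.head))(mergeValue)
    (id, combined._1 / combined._2) // integer average, as in mapValues above
  }
println(averages)
```

Note the division is integer division, so the saved averages are truncated; use toDouble on the sum if fractional averages are wanted.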
2. Save the result in the sequence file format
import org.apache.hadoop.io.{IntWritable,Text}
a.mapValues(x=>x._1/x._2).saveAsSequenceFile("test1/outse")
Read the sequence file back:
val outse=sc.sequenceFile("test1/outse",classOf[Text],classOf[IntWritable]).map{case (x,y)=>(x.toString,y.get())}