1. aggregateByKey
val rdd1 = sc.parallelize(List(("tom",20),("tom",12),("tom1",18),("tom1",19),("tom2",22),("tom2",10),("tom2",30),("tom1",19)),2)
rdd1.aggregateByKey(0)(math.max(_,_),_+_).collect
Result:
res38: Array[(String, Int)] = Array((tom2,30), (tom,20), (tom1,38))
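Why tom1 comes out as 38: the zero value 0 and math.max run inside each partition first, and _+_ then sums the per-partition maxima across partitions. glom, which gathers each partition into an array, makes the split visible:
rdd1.glom.collect
// e.g. Array(Array((tom,20), (tom,12), (tom1,18), (tom1,19)),
//            Array((tom2,22), (tom2,10), (tom2,30), (tom1,19)))
// the per-partition max for tom1 is 19 in each partition, and 19 + 19 = 38 after the merge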
2. combineByKey
(createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C)
First parameter: called on the first value seen for each key in a partition, turning it into the combiner type C
Second parameter: partition-local aggregation, merging each further value of a key into its combiner
Third parameter: global aggregation, merging the combiners produced by different partitions
2.1 Counting words
val rdd1 = sc.textFile("hdfs://hlm1:9000/user/hdfs/emp.txt").flatMap(_.split(",")).map((_,1))
val rdd2 = rdd1.combineByKey(x => 10 + x, (a: Int, b: Int) => a + b, (a: Int, b: Int) => a + b)
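Since the contents of emp.txt are not shown here, the following is a self-contained sketch of the same pattern on in-memory data. Note the effect of createCombiner x => 10 + x: every key's count picks up an extra 10 for each partition in which the key appears.
val words = sc.parallelize(List("a","b","a","c","a","b"), 2).map((_, 1))
// partition 0: (a,1),(b,1),(a,1)   partition 1: (c,1),(a,1),(b,1)
words.combineByKey((x: Int) => 10 + x, (a: Int, b: Int) => a + b, (a: Int, b: Int) => a + b).collect
// a: 3 occurrences + 10 per partition (in 2 partitions) = 23; b: 2 + 20 = 22; c: 1 + 10 = 11
// e.g. Array((a,23), (b,22), (c,11)), order may vary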
2.2 Computing average scores
val score = Array(("tom",80),("tom",90),("tom",70),("tom1",80),("tom1",88),("tom1",87))
val rdd = sc.parallelize(score)
rdd.combineByKey(score => (1, score), (x: (Int, Int), y: Int) => (x._1 + 1, x._2 + y), (m: (Int, Int), n: (Int, Int)) => (m._1 + n._1, m._2 + n._2))
(produces each key's count and total score)
Result 1 (count and total):
res41: Array[(String, (Int, Int))] = Array((tom,(3,240)), (tom1,(3,255)))
rdd.combineByKey(score => (1, score), (x: (Int, Int), y: Int) => (x._1 + 1, x._2 + y), (m: (Int, Int), n: (Int, Int)) => (m._1 + n._1, m._2 + n._2))
  .map { case (name, (num, scores)) => (name, scores / num) }.collect
(divides each key's total by its count to get the average)
Result 2 (average):
res42: Array[(String, Int)] = Array((tom,80), (tom1,85))
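For comparison, the same average can be computed without combineByKey by pairing each score with a count and reducing; a sketch:
rdd.mapValues(s => (1, s))
   .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
   .mapValues { case (n, total) => total / n }
   .collect
// e.g. Array((tom,80), (tom1,85)), the same result as above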
3. Action operators
val rdd = sc.parallelize(List(2,1,3,6,5),2)
rdd.reduce(_+_) res43: Int = 17
rdd.count res44: Long = 5
rdd.top(3) res45: Array[Int] = Array(6, 5, 3)
rdd.take(3) res47: Array[Int] = Array(2, 1, 3)
rdd.takeOrdered(3) res48: Array[Int] = Array(1, 2, 3)
rdd.first res49: Int = 2
countByKey
val rdd1 = sc.parallelize(List(("tom",80),("tom",90),("tom",70),("tom1",80),("tom1",88),("tom1",87)),2)
rdd1.countByKey
Result:
res51: scala.collection.Map[String,Long] = Map(tom -> 3, tom1 -> 3)
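countByKey brings the whole map back to the Driver, so it only suits a small number of distinct keys. A distributed alternative that keeps the counts on the cluster, as a sketch:
rdd1.mapValues(_ => 1L).reduceByKey(_ + _)
// an RDD[(String, Long)] that can be processed further without collecting to the Driver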
countByValue
rdd1.countByValue
Result:
res52: scala.collection.Map[(String, Int),Long] = Map((tom1,87) -> 1, (tom1,88) -> 1, (tom,80) -> 1, (tom,90) -> 1, (tom1,80) -> 1, (tom,70) -> 1)
(counts how many times each distinct (key, value) pair occurs)
Difference between foreach and foreachPartition
Neither returns a value. foreach applies func to every element individually,
while foreachPartition applies func once per partition, to the partition's iterator of elements.
Typical use case: writing results out to external storage.
If the result set is small, foreach is fine for storing it.
If the data volume is large, foreach would open a connection per element and could bring the database down; use foreachPartition instead, with one connection per partition (see the JDBC sketch after the example below).
val rdd = sc.parallelize(List(2,1,3,6,5,3,5,6,8,33),2)
rdd.foreachPartition(x=>println(x.reduce(_+_)))
17
55
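A minimal sketch of the one-connection-per-partition pattern described above, assuming a JDBC driver is on the classpath; the URL, credentials, and table name are hypothetical placeholders:
import java.sql.DriverManager

rdd.foreachPartition { iter =>
  // one connection per partition, not one per element
  val conn = DriverManager.getConnection("jdbc:mysql://host:3306/db", "user", "pass") // hypothetical URL
  val stmt = conn.prepareStatement("INSERT INTO nums (n) VALUES (?)") // hypothetical table
  iter.foreach { n =>
    stmt.setInt(1, n)
    stmt.executeUpdate()
  }
  stmt.close()
  conn.close()
}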
filterByRange
val rdd1 = sc.parallelize(List(("tom",80),("tom",90),("tom",70),("tom1",80),("tom1",88),("tom1",87)),2)
rdd1.filterByRange("tom","tom").collect
tom to tom
res60: Array[(String, Int)] = Array((tom,80), (tom,90), (tom,70))
Range "tom1" to "tom" (empty: the lower bound sorts after the upper bound):
rdd1.filterByRange("tom1","tom").collect
Result:
res62: Array[(String, Int)] = Array()
Range "tom" to "tom1":
rdd1.filterByRange("tom","tom1").collect
Result:
res64: Array[(String, Int)] = Array((tom,80), (tom,90), (tom,70), (tom1,80), (tom1,88), (tom1,87))
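filterByRange can skip entire partitions when the RDD is range-partitioned, e.g. after sortByKey; on an unsorted RDD it simply tests every element. A sketch:
rdd1.sortByKey().filterByRange("tom", "tom1").collect
// with a RangePartitioner in place, partitions that lie entirely outside the range are pruned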
flatMapValues
val rdd1 = sc.parallelize(List(("tom","80 90"),("tom1","88 90"),("tom2","87 94")),2)
rdd1.flatMapValues(_.split(" ")).collect
Result:
res65: Array[(String, String)] = Array((tom,80), (tom,90), (tom1,88), (tom1,90), (tom2,87), (tom2,94))
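flatMapValues keeps each key and pairs it with every element the function produces; note the values above are still Strings. A follow-up step to get numeric scores, as a sketch:
rdd1.flatMapValues(_.split(" ")).mapValues(_.toInt).collect
// e.g. Array((tom,80), (tom,90), (tom1,88), (tom1,90), (tom2,87), (tom2,94)) as (String, Int)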
foldByKey
val rdd2 = sc.parallelize(Array((12,"uzi1"),(2,"uzi2"),(3,"uzi3"),(3,"uzi4")))
rdd2.foldByKey("ADC")(_+_).collect
Result:
res69: Array[(Int, String)] = Array((12,ADCuzi1), (2,ADCuzi2), (3,ADCuzi3uzi4))
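Caveat: the zero value ("ADC" here) is applied once per key per partition, so under a different partitioning the prefix could appear more than once in a merged value. With a zero value that is neutral for the fold function this cannot happen, as in the numeric sketch below:
val nums = sc.parallelize(List(("a", 1), ("a", 2), ("b", 3)), 2)
nums.foldByKey(0)(_ + _).collect
// e.g. Array((a,3), (b,3)); 0 is neutral for +, so partitioning does not affect the result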
keyBy
val rdd = sc.parallelize(List("jxlg","hdjd","shjps","javaweb","hadoop"))
rdd.keyBy(_.length).collect
Result:
res71: Array[(Int, String)] = Array((4,jxlg), (4,hdjd), (5,shjps), (7,javaweb), (6,hadoop))
keys
val rdd = sc.parallelize(List("jxlg","hdjd","shjps","javaweb","hadoop")).map(x=>(x.length,x))
rdd.keys.collect
Result:
res73: Array[Int] = Array(4, 4, 5, 7, 6)
values
rdd.values.collect
res74: Array[String] = Array(jxlg, hdjd, shjps, javaweb, hadoop)
collect:
An action operator: it gathers the results computed on each Executor back to the Driver and packs them into an Array.
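Collecting a very large RDD can exhaust the Driver's memory, so reserve collect for small results. A trivial example:
sc.parallelize(1 to 5, 2).collect
// Array(1, 2, 3, 4, 5), assembled on the Driver from both partitions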
collectAsMap
val rdd = sc.parallelize(List(("a",1),("b",2)))
rdd.collectAsMap
Result:
res75: scala.collection.Map[String,Int] = Map(b -> 2, a -> 1)
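Unlike collect, collectAsMap keeps only one value per key when duplicate keys exist (which value survives depends on processing order). A sketch:
val dup = sc.parallelize(List(("a", 1), ("a", 2)))
dup.collectAsMap
// e.g. Map(a -> 2); only one of the duplicate values for "a" is retained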