Problem: find the entries whose key appears exactly once in the data. This is easy to do with groupByKey or reduceByKey; here we solve it with aggregateByKey instead.
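For reference, the groupByKey/reduceByKey route is simply "count occurrences per key, keep the keys with count 1". A minimal sketch of that logic on a plain Scala collection (the hard-coded pairs mirror the input data below; no SparkContext needed):

```scala
// Sample (key, value) pairs matching the input data below
val data = Seq(
  ("asdfgh", 546346L), ("retr", 4567L), ("asdfgh", 7685678L),
  ("ghj", 2345L), ("asd", 234L), ("hadoop", 435L),
  ("ghj", 23454L), ("asdfgh", 54675L), ("asdfgh", 546759878L),
  ("asd", 234L), ("asdfgh", 5467598782L)
)

// Group by key and keep only keys with exactly one occurrence;
// on an RDD the same idea would be groupByKey().filter(_._2.size == 1)
val onceOnly = data
  .groupBy(_._1)
  .collect { case (k, vs) if vs.size == 1 => (k, vs.head._2) }

onceOnly.foreach(println) // prints (retr,4567) and (hadoop,435), in some order
```

This is the baseline the aggregateByKey version below is measured against.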
Input data:
asdfgh 546346
retr 4567
asdfgh 7685678
ghj 2345
asd 234
hadoop 435
ghj 23454
asdfgh 54675
asdfgh 546759878
asd 234
asdfgh 5467598782
Code:
package scala

import org.apache.spark.{SparkConf, SparkContext}

object AaidTest {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("AaidTest").setMaster("local")
    val sc = new SparkContext(conf)

    sc.textFile("D://sparkmllibData/sparkml/mllibdata/arrregation.txt")
      .map(line => {
        val fields = line.split("\t")
        (fields(0), fields(1).toLong)
      })
      // seqOp doubles as combOp here: merging two non-zero partial results
      // also yields the -1L "duplicate" marker
      .aggregateByKey(0L)(seqOp, seqOp)
      .filter(_._2 != -1L) // drop keys marked as appearing more than once
      .collect()
      .foreach(println)
  }

  // Accumulator convention: 0L = no value seen yet, -1L = key seen more than
  // once, anything else = the single value seen so far. This relies on real
  // values never being 0 or -1.
  def seqOp(U: Long, v: Long): Long = {
    println("seqOp")
    println("U=" + U)
    println("v=" + v)
    var count: Int = 0
    if (U != 0L) count += 1
    if (v != 0L) count += 1
    if (count > 1) -1L else v
  }
}
Output:
seqOp
U=0
v=546346
seqOp
U=0
v=4567
seqOp
U=546346
v=7685678
seqOp
U=0
v=2345
seqOp
U=0
v=234
seqOp
U=0
v=435
seqOp
U=2345
v=23454
seqOp
U=-1
v=54675
seqOp
U=-1
v=546759878
seqOp
U=234
v=234
seqOp
U=-1
v=5467598782
(hadoop,435)
(retr,4567)
Clearly, this produces the required result.
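The count bookkeeping in seqOp condenses to a small sentinel fold, which can be checked without Spark: 0L means "no value yet", -1L means "duplicate", and the whole trick assumes real values are never 0 or -1. Folding each key's values with this function reproduces the trace above (the grouped data is hard-coded to match the input):

```scala
// Condensed form of seqOp: the first value is kept; any further value
// (or an already-set -1L marker) flips the accumulator to -1L
def mark(acc: Long, v: Long): Long =
  if (acc != 0L) -1L else v

// Per-key value lists, as aggregateByKey would see them
val grouped = Map(
  "asdfgh" -> Seq(546346L, 7685678L, 54675L, 546759878L, 5467598782L),
  "retr"   -> Seq(4567L),
  "ghj"    -> Seq(2345L, 23454L),
  "asd"    -> Seq(234L, 234L),
  "hadoop" -> Seq(435L)
)

// foldLeft(0L)(mark) mirrors the per-partition seqOp pass; filtering out
// -1L leaves exactly the keys that occurred once
val result = grouped.map { case (k, vs) => (k, vs.foldLeft(0L)(mark)) }
                    .filter(_._2 != -1L)
```

Note that "asd" is eliminated even though both of its values are equal (234), because the sentinel tracks occurrence count, not distinct values.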