Demos of some common Spark operators. To make them easier to follow, the expected results are discussed along the way; the goal is to build fluency with these operators and a deeper understanding of how they behave.
Transformation operators:
1.map
/**
 * The map operator passes each element of the RDD through a user-defined
 * function and builds a new RDD from the returned elements.
 */
val conf = new SparkConf().setAppName("mapDemo").setMaster("local")
val sc = new SparkContext(conf)
val numRDD = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))
numRDD.map(num=>num*2).foreach(println)
sc.stop()
2.reduceByKey
// Aggregates the values of each key with the supplied function.
val conf = new SparkConf().setMaster("local").setAppName("reduceByKey")
val sc = new SparkContext(conf)
val datasRDD = sc.parallelize(Array(Tuple2("class1",98),
Tuple2("class2",96),
Tuple2("class1",90),
Tuple2("class2",100),
Tuple2("class1",94)))
val result = datasRDD.reduceByKey((x, y) => x + y)
result.foreach(i => println("total score of " + i._1 + ": " + i._2))
3.groupByKey
// The groupByKey operator groups records by key; each key maps to an Iterable of its values.
val conf = new SparkConf().setAppName("groupByKey").setMaster("local")
val sc = new SparkContext(conf)
val datas = sc.parallelize(Array(
Tuple2("class1",98),
Tuple2("class2",96),
Tuple2("class1",90),
Tuple2("class2",100),
Tuple2("class1",94)))
datas.groupByKey().foreach(
i=>{
println("班级:"+i._1)
i._2.foreach(item=>println(item))
println("----------------------")
}
)
4.sortByKey
// The sortByKey operator sorts the RDD by key; passing false sorts in descending order.
val conf = new SparkConf().setAppName("sortByKey").setMaster("local")
val sc = new SparkContext(conf)
val datasRDD = sc.parallelize(Array(
Tuple2(98,"张三"),
Tuple2(100,"李四"),
Tuple2(92,"Jack"),
Tuple2(96,"tom")
))
datasRDD.sortByKey(false).foreach(i => println(i._2 + " scored " + i._1))
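If the sort field is not already the key, RDD.sortBy takes a key-extractor directly and avoids mapping the field into key position first. A minimal sketch on the same data, sorting by name instead of score:
// sortBy sorts by whatever the extractor returns; ascending by default
datasRDD.sortBy(_._2).foreach(i => println(i._2 + " scored " + i._1))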
5.cogroup
// cogroup groups both RDDs by key: for every key that appears in either RDD it returns a
// pair of Iterables, so keys without a match still show up, with an empty Iterable on one side.
val conf = new SparkConf().setAppName("cogroup").setMaster("local")
val sc = new SparkContext(conf)
val infoRDD = sc.parallelize(Array(
Tuple2(1,"张三"),
Tuple2(2,"李四"),
Tuple2(3,"jack"),
Tuple2(4,"tom")
))
val scoresRDD=sc.parallelize(Array(
Tuple2(1,98),
Tuple2(2,96),
Tuple2(3,92),
Tuple2(5,99)
))
val result=infoRDD.cogroup(scoresRDD)
result.foreach(m=> {
println(m)
println(m._1)
println(m._2._1.toString())
println(m._2._2.toString())
println("================================")
})
6.join
/**
 * The join operator joins two RDDs of <key, value> pairs; for each key present
 * in both RDDs, every matching pair of values is combined into the result.
 * It behaves like an inner join in Hive: only keys that match in both datasets are kept.
 */
val conf = new SparkConf().setMaster("local").setAppName("join")
val sc = new SparkContext(conf)
val infoRDD = sc.parallelize(Array(
Tuple2(1, "张三"),
Tuple2(2, "李四"),
Tuple2(3, "jack"),
Tuple2(4, "tom")
))
val scoresRDD = sc.parallelize(Array(
Tuple2(1, 98),
Tuple2(2, 96),
Tuple2(3, 92),
Tuple2(5, 99)
))
val result = infoRDD.join(scoresRDD)
result.foreach(i => {
println(i)
println("学号:" + i._1)
println("姓名:" + i._2._1)
println("成绩:" + i._2._2)
println("----------------")
}
)
7.filter
/**
 * The filter operator evaluates a predicate on each element of the RDD:
 * elements for which it returns true are kept, the rest are dropped.
 * It works like a sieve over the data.
 */
val conf = new SparkConf().setMaster("local[2]").setAppName("filter")
val sc = new SparkContext(conf)
val numRDD = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
numRDD.filter(num => num % 2 == 0).foreach(println)
Summary:
1) reduceByKey vs. groupByKey: reduceByKey pre-aggregates within each partition before the shuffle and then aggregates globally, while groupByKey shuffles every value and only then groups them, so reduceByKey is usually more efficient. reduceByKey returns (key, aggregatedValue) pairs; groupByKey returns (key, Iterable[value]) pairs. See the sketch after this summary.
2) cogroup vs. join: cogroup returns every key from either RDD, matched or not (an unmatched side is just an empty Iterable, as the output above shows), whereas join returns only the keys that match in both RDDs.
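A minimal sketch of both points (assumes a live SparkContext sc, as in the snippets above):
// same aggregation two ways: reduceByKey pre-aggregates per partition,
// groupByKey ships every value across the network first
val pairs = sc.parallelize(Array(("class1", 98), ("class2", 96), ("class1", 90)))
pairs.reduceByKey(_ + _).foreach(println)
pairs.groupByKey().mapValues(_.sum).foreach(println)
// join expressed via cogroup: keep only keys with values on both sides
val namesRDD = sc.parallelize(Array((1, "leo"), (2, "jack")))
val marksRDD = sc.parallelize(Array((1, 98), (3, 92)))
namesRDD.cogroup(marksRDD).flatMap { case (k, (names, marks)) =>
  for (n <- names; m <- marks) yield (k, (n, m)) // an empty side yields nothing, just like join
}.foreach(println)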
Action operators:
1.count
/**
 * The count operator returns the total number of elements in the RDD.
 */
val conf=new SparkConf().setAppName("countDemo").setMaster("local")
val sc=new SparkContext(conf)
//模拟数据
val datas=sc.parallelize(Array(
Tuple2("class1","leo"),
Tuple2("class2","jack"),
Tuple2("class1","jen"),
Tuple2("class2","tom"),
Tuple2("class1","marray")
))
val result = datas.map(m=>(m._1,1)).reduceByKey(_+_).count()
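// after reduceByKey there is one record per distinct class, so this prints 2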
println(result)
2.countByKey
// The countByKey operator counts how many records each key has.
val conf=new SparkConf().setAppName("countByKeyDemo").setMaster("local")
val sc=new SparkContext(conf)
//模拟数据
val datas=sc.parallelize(Array(
Tuple2("class1","leo"),
Tuple2("class2","jack"),
Tuple2("class1","jen"),
Tuple2("class2","tom"),
Tuple2("class1","marray")
))
datas.countByKey().foreach(m => println(m._1 + " has " + m._2 + " students"))
3.collect & take
/**
 * The collect operator pulls every element of the RDD back to the driver, so the
 * whole dataset must fit in driver memory; take(n) fetches only the first n elements.
 */
val conf = new SparkConf().setAppName("collectAndTake").setMaster("local")
val sc = new SparkContext(conf)
val datas = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))
val collectResult = datas.map(m=>2*m).collect()
val takeResult = datas.map(m=>2*m).take(3)
for (elem <- collectResult) {
println(elem)
}
println("----------------------------")
for (elem <- takeResult) {
println(elem)
}
4.reduce
/**
 * The reduce operator aggregates all elements of the RDD with the given function:
 * conceptually the first two elements are combined, the result is combined with the
 * next element, and so on. Each partition is reduced independently and the partial
 * results are then merged, so the function must be associative and commutative.
 */
val conf = new SparkConf().setMaster("local").setAppName("reduce")
val sc = new SparkContext(conf)
val datas = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))
val result = datas.reduce((m,n) => m+n)
println(result)
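A minimal illustration (not part of the original demo) of why the function must be associative and commutative: with subtraction, the answer depends on how the data is partitioned.
// one partition: a strict left-to-right fold
println(sc.parallelize(Array(1, 2, 3, 4), 1).reduce(_ - _))
// two partitions: each is reduced separately, then the partials are merged,
// so a non-associative function no longer has a well-defined result
println(sc.parallelize(Array(1, 2, 3, 4), 2).reduce(_ - _))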
5.foreach
val conf = new SparkConf().setMaster("local").setAppName("foreach")
val sc = new SparkContext(conf)
val datas = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))
datas.foreach(println)
Advanced operators:
1.cartesian
/**
 * cartesian computes the Cartesian product: every element of one RDD is paired
 * with every element of the other. Two RDDs of 10 elements each therefore
 * produce 10 × 10 = 100 pairs.
 */
val conf = new SparkConf().setAppName("cartesian").setMaster("local")
val sc = new SparkContext(conf)
val clothRDD=sc.parallelize(Array("jacket","trench coat","shell jacket","down jacket"))
val kuziRDD=sc.parallelize(Array("jeans","leather pants","dress pants","track pants"))
val resultRDD = clothRDD.cartesian(kuziRDD)
resultRDD.foreach(m => println(m._1 + " with " + m._2))
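// 4 tops × 4 pants, so count() prints 16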
println(resultRDD.count())
2.distinct
// Deduplication: distinct removes duplicate elements.
val conf = new SparkConf().setMaster("local").setAppName("distinct")
val sc = new SparkContext(conf)
val logsRDD = sc.parallelize(Array(
"user1 2016-01-01 23:58:42",
"user1 2016-01-01 23:58:43",
"user1 2016-01-01 23:58:44",
"user2 2016-01-01 12:58:42",
"user2 2016-01-01 12:58:46",
"user3 2016-01-01 12:58:42",
"user4 2016-01-01 12:58:42",
"user5 2016-01-01 12:58:42",
"user6 2016-01-01 12:58:42",
"user6 2016-01-01 12:58:45"
))
val resultRDD = logsRDD.map(m=>m.split(" ")(0)).distinct()
resultRDD.foreach(println)
println(resultRDD.count())
3.coalesce & repartition
/**
 * The repartition operator changes the number of partitions up or down and always
 * shuffles (it is simply coalesce with shuffle = true). coalesce with the default
 * shuffle = false only merges existing partitions, so it can reduce the partition
 * count but cannot increase it.
 */
val conf = new SparkConf().setAppName("coalesce").setMaster("local")
val sc = new SparkContext(conf)
val datasRDD=sc.parallelize(Array("tom","jack","leo","张三","李四","王五"),3)
val resultRDD1=datasRDD.repartition(2)
val resultRDD2 = datasRDD.coalesce(2)
resultRDD1.mapPartitionsWithIndex((index, iter) => {
  val arrayBuffer = ArrayBuffer[String]()
  while (iter.hasNext) {
    arrayBuffer += "partition " + (index + 1) + " record: " + iter.next()
  }
  arrayBuffer.iterator
}).foreach(println)
println("-----------------------------------------")
resultRDD2.mapPartitionsWithIndex((index, iter) => {
  val arrayBuffer = ArrayBuffer[String]()
  while (iter.hasNext) {
    arrayBuffer += "partition " + (index + 1) + " record: " + iter.next()
  }
  arrayBuffer.iterator
}).foreach(println)
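A quick way to confirm the difference on the 3-partition RDD above:
// coalesce cannot grow the partition count without a shuffle; repartition can
println(datasRDD.coalesce(5).getNumPartitions)    // still 3
println(datasRDD.repartition(5).getNumPartitions) // 5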
4.intersection
val conf = new SparkConf().setAppName("intersection").setMaster("local")
val sc = new SparkContext(conf)
val stus1 = sc.parallelize(Array("leo","jack","tom","marry"))
val stus2 = sc.parallelize(Array("leo","jack","devid","honny"))
val resultRDD = stus1.intersection(stus2)
resultRDD.foreach(println)
5.mapPartitions
/**
 * The function passed to mapPartitions runs once per partition and receives the
 * partition's contents as a whole (an iterator) rather than one element at a time.
 */
import scala.collection.mutable.ArrayBuffer

val conf = new SparkConf().setMaster("local").setAppName("mapPartitions")
val sc = new SparkContext(conf)
val stusRDD = sc.parallelize(Array("leo", "jack", "tom", "marry"), 2)
val scores = Map(("leo", 600), ("jack", 620), ("tom", 650), ("marry", 500), ("jen", 550))
val resultRDD = stusRDD.mapPartitions(m => {
val result = ArrayBuffer[Int]()
var score = 0
while (m.hasNext) {
score = scores(m.next())
result += score
}
result.iterator
})
resultRDD.foreach(println)
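The usual reason to prefer mapPartitions over map is per-partition setup cost; a minimal sketch on the same data (the expensive setup here is only imagined):
val resultRDD2 = stusRDD.mapPartitions { iter =>
  // imagine opening a DB connection or building a heavy parser here,
  // once per partition instead of once per element
  val lookup = scores
  iter.map(name => lookup.getOrElse(name, 0))
}
resultRDD2.foreach(println)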
6.mapPartitionsWithIndex
val conf = new SparkConf().setAppName("mapPartitionsWithIndex").setMaster("local")
val sc = new SparkContext(conf)
val stusRDD = sc.parallelize(Array("leo","jack","tom","marry","jenny"),2)
val resultRDD = stusRDD.mapPartitionsWithIndex((m,n)=>{
val result=ArrayBuffer[String]()
while(n.hasNext){
val stuName=n.next()
val info="学生"+stuName+"在"+(m+1)+"班"
result+=info
}
result.iterator
})
resultRDD.foreach(println)
7.mapValues
val conf = new SparkConf().setMaster("local").setAppName("mapValues")
val sc = new SparkContext(conf)
val values = sc.parallelize(Array(
Tuple2("class2",96),
Tuple2("class1",90),
Tuple2("class2",100),
Tuple2("class1",94)
))
val result = values.mapValues(x=>x*x).collect()
result.foreach(println)
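One detail worth knowing: because mapValues cannot change the key, Spark preserves the RDD's partitioner across it, whereas an equivalent map does not. A minimal check (the two-partition hash partitioning is just for illustration):
import org.apache.spark.HashPartitioner
val partitioned = values.partitionBy(new HashPartitioner(2))
println(partitioned.mapValues(x => x * x).partitioner)             // Some(...): partitioner kept
println(partitioned.map { case (k, v) => (k, v * v) }.partitioner) // None: partitioner lost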
8.subtract
// Set difference: intRDD1 is List(1,2,3,5,5); every element that also appears in
// intRDD2 List(5,6) is removed (both 5s), so the result is (1,2,3).
val conf = new SparkConf().setMaster("local").setAppName("subtract")
val sc = new SparkContext(conf)
val intRDD1 = sc.parallelize(List(1,2,3,5,5))
val intRDD2 = sc.parallelize(List(5,6))
val resultRDD = intRDD1.subtract(intRDD2)
resultRDD.foreach(println)
9.union
// The union operator (equivalent to ++) concatenates two RDDs; duplicates are kept,
// unlike SQL's UNION. No shuffle occurs: the output simply keeps the partitions of both inputs.
val conf = new SparkConf().setAppName("union").setMaster("local")
val sc = new SparkContext(conf)
val stusRDD1 = sc.parallelize(Array("leo","jack","tom"))
val stusRDD2 = sc.parallelize(Array("tom","marry","jenny"))
val resultRDD = stusRDD1.union(stusRDD2)
resultRDD.foreach(println)
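To get SQL-style UNION semantics (duplicates removed), chain distinct after union; here "tom" appears in both inputs and would then be printed once:
// union keeps both "tom" records; distinct collapses them
resultRDD.distinct().foreach(println)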