Overview
Transformation operators produce a new RDD and are executed lazily.
All transformations are only executed once an action is encountered.
Action operators trigger job execution immediately and do not produce a new RDD;
they write the data out to some storage/sink or collect the result back to the driver for display.
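A minimal sketch of the difference (the variable names are illustrative; sc is the SparkContext used throughout these notes):
val words = sc.makeRDD(List("a", "b", "a"))
val pairs = words.map(w => (w, 1))  // transformation: only recorded, no job runs yet
pairs.collect()                     // action: submits a job and returns the result to the driver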
Transformation
map
A one-to-one mapping: calling map on an RDD applies the supplied function to every element.
The element type of the resulting RDD is determined by the return type of the function you pass in.
The function is invoked once per element: an RDD with n elements is iterated n times.
scala> val rdd1 = sc.makeRDD(List(1,4,2,5,7,8),3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[18] at makeRDD at <console>:24
scala> val rdd2 = rdd1.map(_>5)
rdd2: org.apache.spark.rdd.RDD[Boolean] = MapPartitionsRDD[19] at map at <console>:26
scala> rdd2.collect
res8: Array[Boolean] = Array(false, false, false, false, true, true)
scala> rdd1.partitions.size
res9: Int = 3
scala> rdd2.partitions.size
res10: Int = 3
map summary:
1 map is one-to-one: the function is called once per element in the RDD
2 the element type of the result is determined by the function's return type
3 the number of partitions of the RDD stays the same
mapValues
In Scala, mapValues is defined on Map collections.
In Spark, mapValues works on an RDD[(K, V)]: the function is applied to each value while the key stays unchanged.
scala> val rdd = sc.makeRDD(List(("reba",100),("fengjie",80)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[20] at makeRDD at <console>:24
scala> rdd.mapValues(_*100)
res11: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[21] at mapValues at <console>:27
scala> val rdd3 = rdd.mapValues(((t:Int)=>t*100))
rdd3: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[22] at mapValues at <console>:26
scala> rdd3.collect
res12: Array[(String, Int)] = Array((reba,10000), (fengjie,8000))
mapValues summary:
The resulting RDD keeps the same number of partitions.
It behaves like map but is applied only to the value of an RDD[(K, V)]; the key stays unchanged.
mapPartitions
Works on each RDD partition as a whole: the supplied function receives the partition as an iterator and must return an iterator.
The function is invoked once per partition: an RDD with n partitions is iterated n times.
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName(this.getClass.getSimpleName)
val sc = new SparkContext(conf)
val rdd1 = sc.makeRDD(List(1,4,2,5,7,8),3)
// zip pairs the elements of two RDDs; zipWithIndex pairs each element with its index, starting from 0
val index: RDD[(Int, Long)] = rdd1.zipWithIndex()
println(index.collect().toBuffer)
// mapPartitions: the function is called once per partition and receives the partition as an iterator
val partitions: RDD[Int] = rdd1.mapPartitions(it => it.map(_ * 10))
println(partitions.collect().toBuffer)
mapPartitionsWithIndex
An operator that also passes in the partition index; partition indexes start at 0.
val rdd1 = sc.makeRDD(List(1,4,2,5,7,8),3)
// mapPartitions
val rdd2: RDD[Int] = rdd1.mapPartitions(t => {
t.map(_ * 10)
})
// Define a function that returns each element together with its partition index
val f=(i:Int,it:Iterator[Int])=> {
it.map(t=> s"p=$i,v=$t")
}
val rdd3: RDD[String] = rdd1.mapPartitionsWithIndex(f)
println(rdd3.collect().toBuffer)
flatMap
flatMap = map + flatten
The resulting RDD keeps the same number of partitions.
scala> val rdd1 = sc.makeRDD(List("hello spark","hello world"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[23] at makeRDD at <console>:24
scala> val rdd2 = rdd1.flatMap(_.split(" "))
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[24] at flatMap at <console>:26
scala> rdd2.partitions.size
res15: Int = 2
scala> rdd1.partitions.size
res16: Int = 2
filter
Keeps all elements that satisfy the predicate.
The number of partitions stays the same; a partition that ends up with no data still exists.
scala> val rdd1 = sc.makeRDD(List(1,3,4,5,6,7),3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[25] at makeRDD at <console>:24
scala> val rdd2 = rdd1.filter(_>5)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[26] at filter at <console>:26
scala> rdd2.partitions.size
res17: Int = 3
scala> rdd1.partitions.size
res18: Int = 3
scala> rdd2.collect
res19: Array[Int] = Array(6, 7)
groupBy, groupByKey, reduceByKey
def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])]
val sc: SparkContext = MySpark(this.getClass.getSimpleName)
val rdd1: RDD[Int] = sc.makeRDD(List(1, 4, 5, 6))
// groupBy can be used on any RDD, both RDD[T] and RDD[K, V]
val groupedRdd: RDD[(Int, Iterable[Int])] = rdd1.groupBy(t => t)
val groupedRdd2: RDD[(String, Iterable[Int])] = rdd1.groupBy(t => t.toString)
// groupBy returns RDD[(K, Iterable[T])]: K is the return type of the supplied function, T is the element type of the RDD
val rdd2: RDD[(String, Int)] = sc.makeRDD(List(("rb", 1000), ("baby", 990),
("yangmi", 980), ("bingbing", 5000), ("bingbing", 1000), ("baby", 2000)), 3)
// result type when grouping a pair RDD by its key field
val rdd3: RDD[(String, Iterable[(String, Int)])] = rdd2.groupBy(_._1)
val result1: RDD[(String, Int)] = rdd3.mapValues(_.map(_._2).sum)
// groupByKey works on RDD[K, V]; all values for a key are collected into an Iterable, e.g. baby -> Iterable(990, 2000)
val rdd4: RDD[(String, Iterable[Int])] = rdd2.groupByKey()
println(s"rdd4 part = ${rdd4.partitions.size}")
val result3: RDD[(String, Int)] = rdd4.mapValues(_.sum)
println(rdd2.groupByKey(10).partitions.size)
// reduceByKey works on RDD[K, V] and aggregates the values per key
val rdd6: RDD[(String, Int)] = rdd2.reduceByKey(_ + _)
// the number of partitions of the resulting RDD can be specified
val rdd7: RDD[(String, Int)] = rdd2.reduceByKey(_ + _, 10)
val rdd5: RDD[(String, List[Int])] = sc.makeRDD(List(("a", List(1, 3)), ("b", List(2, 4))))
rdd5.reduceByKey(_ ++ _)
Key comparison: reduceByKey vs. groupByKey (see the sketch after this list)
1 both work on RDD[K, V]
2 both group data by key for aggregation
3 by default both keep the number of partitions unchanged, but both accept a parameter to set it
4 the key difference: reduceByKey pre-aggregates within each partition (map-side combine) before the shuffle, so it usually moves far less data across the network than groupByKey followed by an aggregation
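A minimal sketch of the two approaches computing the same per-key sum (names are illustrative; sc as in the examples above):
val pairs = sc.makeRDD(List(("a", 1), ("b", 1), ("a", 1)), 2)
// groupByKey: shuffles every (k, v) pair, then sums on the reduce side
val viaGroup = pairs.groupByKey().mapValues(_.sum)
// reduceByKey: pre-aggregates within each partition, then shuffles only the partial sums
val viaReduce = pairs.reduceByKey(_ + _)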
sortBy and sortByKey
sortBy sorts by the sort key produced by the supplied function; the second argument selects ascending (true, the default) or descending (false).
sortByKey sorts an RDD[(K, V)] by key.
scala> val rdd1 = sc.makeRDD(List(("a", 1), ("b", 11), ("c", 123)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[27] at makeRDD at <console>:24
scala> val result1 = rdd1.sortBy(_._2, false)
result1: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[32] at sortBy at <console>:26
scala> rdd1.sortByKey(false).collect().foreach(println)
(c,123)
(b,11)
(a,1)
scala>
Union, intersection, and difference of RDDs
union: union of the two RDDs
intersection: elements present in both
subtract: elements in the first RDD but not in the second
The number of partitions of a union = the sum of the partition counts of the RDDs being unioned.
scala> val rdd1 = sc.makeRDD(List(1,3,2,4,6,7),3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[36] at makeRDD at <console>:24
scala> val rdd2 = sc.makeRDD(List(1,11),3)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[37] at makeRDD at <console>:24
scala> val rdd3 = rdd1 union rdd2
rdd3: org.apache.spark.rdd.RDD[Int] = UnionRDD[38] at union at <console>:28
scala> rdd3.collect
res21: Array[Int] = Array(1, 3, 2, 4, 6, 7, 1, 11)
scala> rdd3.partitions.size
res22: Int = 6
scala> rdd1.intersection(rdd2).collect
res23: Array[Int] = Array(1)
scala> rdd1.subtract(rdd2).collect
res24: Array[Int] = Array(3, 6, 4, 7, 2)
distinct
An operator that removes duplicate elements. By default the number of partitions is unchanged, but a partition count can be passed as an argument.
For an rdd1: RDD[Int], distinct is implemented internally as map(x => (x, null)).reduceByKey((x, _) => x).map(_._1),
i.e. it calls reduceByKey under the hood.
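A quick sketch of distinct and its optional partition-count argument (assuming the sc from the examples above):
val dup = sc.makeRDD(List(1, 1, 2, 2, 3), 3)
println(dup.distinct().partitions.size)   // 3: partition count unchanged by default
println(dup.distinct(2).partitions.size)  // 2: partition count passed explicitly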
Action
foreach
Applies the supplied function to every element of the RDD, one call per element.
How do foreach and map differ?
foreach returns Unit and is typically used to print or otherwise side-effect on each element.
map has a return value and produces a new RDD.
map is a transformation; foreach is an action.
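A minimal contrast (assuming sc from the examples above; in cluster mode the println output appears in the executor logs, not on the driver):
val nums = sc.makeRDD(List(1, 2, 3))
val doubled = nums.map(_ * 2)      // transformation: returns a new RDD, nothing runs yet
nums.foreach(n => println(n * 2))  // action: runs immediately, returns Unit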
foreachPartition
Iterates one partition at a time: the function receives each whole partition as an iterator.
scala> val rdd2 = sc.makeRDD(Array(1,3,4,5,6,7),3)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[49] at makeRDD at <console>:24
scala> rdd2.foreach(println)
4
5
1
3
6
7
scala> rdd2.foreachPartition(println)
non-empty iterator
non-empty iterator
non-empty iterator
scala> rdd2.foreachPartition(it => println(it.mkString("-")))
4-5
1-3
6-7
Summary: in cluster mode, the output printed by foreach and foreachPartition appears on the executors, not on the driver.
Which operator is the best fit for writing analysed data into MySQL?
Calling collect and then writing from the driver is the worst option (lowest efficiency).
No return value is needed, and if you used mapPartitions you would still have to call an action afterwards to trigger it.
With foreach(), every single record acquires its own MySQL connection.
With foreachPartition, all records in a partition share one connection, so it is the usual choice; a sketch follows.
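A hedged sketch of the foreachPartition write pattern (the table name, columns and JDBC settings are illustrative, not from the original notes):
import java.sql.DriverManager

val result = sc.makeRDD(List(("spark", 10), ("hadoop", 5)))
result.foreachPartition(it => {
  // one connection per partition, created on the executor side
  val conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "password")
  val ps = conn.prepareStatement("INSERT INTO wordcount(word, cnt) VALUES (?, ?)")
  it.foreach { case (word, cnt) =>
    ps.setString(1, word)
    ps.setInt(2, cnt)
    ps.executeUpdate()
  }
  ps.close()
  conn.close()
})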
Commonly used action operators
reduce: folds the elements together; the order in which elements are combined is not deterministic,
because the data is spread across executors and the order in which partial results come back is not fixed.
scala> val rdd2 = sc.makeRDD(List("a","b","c"))
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[50] at makeRDD at <console>:24
scala> rdd2.reduce(_++_)
res29: String = bca
scala> rdd2.reduce(_++_)
res30: String = abc
scala> rdd2.reduce(_++_)
res31: String = bca
scala> rdd2.partitions.size
res32: Int = 2
scala> val rdd1 = sc.makeRDD(List(1,3,4,5))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at makeRDD at <console>:24
scala> rdd1.reduce(_+_)
res33: Int = 13
scala> rdd1.count
res34: Long = 4
scala> val rdd1 = sc.makeRDD(List(11,13,14,5,1,6))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[52] at makeRDD at <console>:24
scala> rdd1.first()
res35: Int = 11
scala> rdd1.take(3)
res36: Array[Int] = Array(11, 13, 14)
scala> rdd1.top(3)
res37: Array[Int] = Array(14, 13, 11)
scala> rdd1.takeOrdered(3)
res38: Array[Int] = Array(1, 5, 6)
scala> rdd1.takeOrdered(2)
res39: Array[Int] = Array(1, 5)
scala> val rdd2 = rdd1.zipWithIndex
rdd2: org.apache.spark.rdd.RDD[(Int, Long)] = ZippedWithIndexRDD[56] at zipWithIndex at <console>:26
scala> rdd2.countByKey()
res40: scala.collection.Map[Int,Long] = Map(5 -> 1, 14 -> 1, 1 -> 1, 6 -> 1, 13 -> 1, 11 -> 1)
scala>
Does every action operator trigger job execution?
spark-submit / spark-shell → one Application.
Normally, each call to an action produces one job.
A few operators are special and can trigger extra jobs even though they are (or look like) transformations; a hedged sketch follows the list:
take
sortBy
zipWithIndex
checkpoint
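A hedged sketch of two of these special cases (assuming the sc from the examples above):
val rdd = sc.makeRDD(List(3, 1, 2), 2)
val sorted = rdd.sortBy(x => x)    // a transformation, yet sampling for the RangePartitioner already submits a job
val indexed = rdd.zipWithIndex()   // also submits a job first, to count the elements in each partition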
collect, collectAsMap
collect: pulls the data from the executors back to the driver; the return type is an Array.
collectAsMap: the return type is a Map; it can only be applied to an RDD[(K, V)].
scala> val rdd1 = sc.makeRDD(List(("a",1),("b",2)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[15] at makeRDD at <console>:24
scala> rdd1.collect
res21: Array[(String, Int)] = Array((a,1), (b,2))
scala> rdd1.collectAsMap
res24: scala.collection.Map[String,Int] = Map(b -> 2, a -> 1)
join
package cn.huge.spark33.day03
import cn.huge.spark33.utils.MySpark
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
/**
* ZhangJunJie
* 2018/9/27 15:33
**/
object JoinDemo2 {
def main(args: Array[String]): Unit = {
val sc : SparkContext = MySpark(this.getClass.getSimpleName)
//k - v
val rdd1: RDD[(String, Double)] = sc.makeRDD(List(("reba",9000.0),("naza",8000.0),("ruhua",10000.0)))
//k - v
val rdd2: RDD[(String, Int)] = sc.makeRDD(List(("reba",7),("naza",8),("yangmi",3)))
//by default the number of partitions is unchanged
val join: RDD[(String, (Double, Int))] = rdd1.join(rdd2)
//the number of partitions can also be set explicitly
//val join2: RDD[(String, (Double, Int))] = rdd1.join(rdd2,5)
join.foreach(println)
println("===========右边可能关联不上=============")
//RDD[(K,(V,Option[W]))] 右边可能关联不上
val result1: RDD[(String, (Double, Option[Int]))] = rdd1.leftOuterJoin(rdd2)
//total appearance fee for the month = fee per appearance * number of appearances
val resultMoney: RDD[(String, Double)] = result1.mapValues(tp => tp._2.getOrElse(0) * tp._1)
resultMoney.foreach(println)
println("===========边可能关联不上=============")
//RDD[(K,(Option[V],W))] 左边可能关联不上
val result2: RDD[(String, (Option[Double], Int))] = rdd1.rightOuterJoin(rdd2)
result2.foreach(println)
println("===========cogroup=============")
val cogroup: RDD[(String, (Iterable[Double], Iterable[Int]))] = rdd1.cogroup(rdd2)
cogroup.foreach(println)
val cogroupResult = cogroup.mapValues(tp => {tp._1.sum * tp._2.sum})
cogroupResult.foreach(println)
sc.stop()
}
}
Results:
join corresponds to an inner join in a database.
leftOuterJoin corresponds to a left outer join (used above to compute the month's total appearance fee).
rightOuterJoin corresponds to a right outer join.
Cartesian product
cartesian pairs every element of the first RDD with every element of the second (all two-element combinations).
scala> val rdd1 = sc.makeRDD(List("tom","cat","jim"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[59] at makeRDD at <console>:24
scala> val rdd2 = sc.makeRDD(List(1,3))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[60] at makeRDD at <console>:24
scala> val rdd3 = rdd1.cartesian(rdd2)
rdd3: org.apache.spark.rdd.RDD[(String, Int)] = CartesianRDD[61] at cartesian at <console>:28
scala> rdd3.collect
res41: Array[(String, Int)] = Array((tom,1), (tom,3), (cat,1), (jim,1), (cat,3), (jim,3))
Operators that change the number of partitions
repartition(numPartitions)
coalesce(numPartitions)
Source code: repartition simply calls coalesce with shuffle = true.
repartition redistributes the data (a shuffle): records from one partition are sent to different partitions.
coalesce does not shuffle by default, so using coalesce to increase the number of partitions fails: the partition count stays the same.
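For reference, the corresponding definition in the Spark source (RDD.scala) is essentially:
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}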
package cn.huge.spark33.day03
import cn.huge.spark33.utils.MySpark
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
/**
* ZhangJunJie
* 2018/9/27 16:38
**/
object RepartitionDemo2 {
def main(args: Array[String]): Unit = {
val sc: SparkContext = MySpark(this.getClass.getSimpleName)
val rdd1: RDD[Int] = sc.makeRDD(List(1,3,5,6,7),3)
//repartition redistributes the data (shuffle): records from one partition are sent to different partitions
println(rdd1.repartition(5).partitions.size)  // 5
//coalesce does not shuffle by default, so using it to increase the partition count fails: still 3
println(rdd1.coalesce(5).partitions.size)  // 3
sc.stop()
}
}
scala> val rdd1 = sc.makeRDD(List(1,3,5,6,7,8),3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at makeRDD at <console>:24
scala> rdd1.repartition(1)
res1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[7] at repartition at <console>:27
scala> rdd1.repartition(2)
res2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[11] at repartition at <console>:27
scala> rdd1.coalesce(2)
res3: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[12] at coalesce at <console>:27
scala> res2.partitions.size
res4: Int = 2
scala> res3.partitions.size
res5: Int = 2
scala> val f =(i:Int,it:Iterator[Int])=>
| it.map(t=> s"p=$i,v=$t")
f: (Int, Iterator[Int]) => Iterator[String] = <function2>
scala> res2.mapPartitionsWithIndex(f).collect
res6: Array[String] = Array(p=0,v=5, p=0,v=1, p=0,v=7, p=1,v=3, p=1,v=8, p=1,v=6)
scala> res3.mapPartitionsWithIndex(f).collect
res7: Array[String] = Array(p=0,v=1, p=0,v=3, p=1,v=5, p=1,v=6, p=1,v=7, p=1,v=8)
scala> rdd1.repartition(10)
res8: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[18] at repartition at <console>:27
scala> rdd1.coalesce(10)
res9: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[19] at coalesce at <console>:27
scala> res8.partitions.size
res10: Int = 10
scala> res9.partitions.size
res11: Int = 3
Summary:
repartition(10) = rdd1.coalesce(10, shuffle = true)
repartition always reshuffles the data; coalesce is mainly used to merge partitions and by default does not shuffle.
In practice:
If the data needs to be reshuffled, choose repartition.
repartition is typically used to increase the number of partitions and raise the parallelism of a job.
coalesce is typically used to merge partitions (reduce the partition count); it cannot increase the count unless shuffle is set to true (a quick check follows).
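A quick check that coalesce can only grow the partition count when shuffle is enabled (assuming the 3-partition rdd1 from above):
println(rdd1.coalesce(10).partitions.size)                  // 3: without a shuffle the count cannot grow
println(rdd1.coalesce(10, shuffle = true).partitions.size)  // 10: same effect as repartition(10)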
aggregate and aggregateByKey
aggregate is an action; aggregateByKey is a transformation.
aggregate:
The first parameter is the initial (zero) value; it participates in the per-partition aggregation and again in the global aggregation.
The second parameter list takes two functions: the first aggregates within a partition, the second aggregates the per-partition results globally.
scala> val rdd1 = sc.makeRDD(List(1,3,4,5),2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[24] at makeRDD at <console>:24
scala> rdd1.aggregate(0)(_+_,_+_)
res14: Int = 13
scala> rdd1.aggregate(0)(_+_,_+_)
res15: Int = 13
scala> rdd1.aggregate(10)(_+_,_+_)
res16: Int = 43
scala> rdd1.aggregate(10)((a,b)=>math.max(a,b),_+_)
res17: Int = 30
scala> val rdd2 = sc.parallelize(List("a","b","c","d","e","f"),2)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[25] at parallelize at <console>:24
scala> rdd2.aggregate("")(_ ++ _, _ ++ _)
res18: String = defabc
scala> rdd2.aggregate("")(_ ++ _, _ ++ _)
res19: String = abcdef
aggregateByKey:
A transformation.
The first parameter is the initial value; it participates only in the per-partition aggregation.
The second parameter list takes two functions: the first aggregates within a partition, the second aggregates the per-partition results globally.
scala> val pairRDD = sc.parallelize(List( ("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12), ("mouse", 2)), 2)
pairRDD: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[26] at parallelize at <console>:24
scala> pairRDD.aggregate<TAB>
aggregate   aggregateByKey
scala> pairRDD.aggregateByKey(0)(_+_,_+_)
res20: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[27] at aggregateByKey at <console>:27
scala> res20.collect
res21: Array[(String, Int)] = Array((dog,12), (cat,19), (mouse,6))
scala> pairRDD.aggregateByKey(10)(_+_,_+_).collect
res22: Array[(String, Int)] = Array((dog,22), (cat,39), (mouse,26))
Per-partition intermediates for aggregateByKey(10): partition 0 gives cat 17 and mouse 14, partition 1 gives cat 22, dog 22 and mouse 12; summing per key gives cat 39, mouse 26, dog 22, matching the result above.
Operator summary
Transformation operators define the dependencies between RDDs.
Narrow operators: map, filter, flatMap;
the data flows one-to-one, each input partition feeding a single output partition.
Shuffle operators:
the data within a partition is redistributed across partitions,
e.g. reduceByKey, join, distinct.
Action operators trigger job execution.
Some operators work on any RDD (e.g. map, filter, groupBy, sortBy);
others must be applied to an RDD[K, V] (e.g. mapValues, groupByKey, reduceByKey, sortByKey, join).