SparkCore
1.RDD基本概念
RDD(Resilient Distributed Dataset)叫做弹性分布式数据集,是Spark中最基本的数据抽象,它代表一个不可变、可分区、里面的元素可并行计算的集合。
什么弹性:
数据呈现的方式是先装在内存中处理,如果内存不够,就将数据存入磁盘上处理
内存+磁盘
2.RDD五大属性
源码
* Internally, each RDD is characterized by five main properties:
*
* - A list of partitions 分区的列表
* - A function for computing each split 作用在每一个文件切片上的函数
* - A list of dependencies on other RDDs 依赖于其他的一些RDD的列表
* - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)对于key-value类型的RDD有分区函数
* - Optionally, a list of preferred locations to compute each split on (e.g. block locations for 位置优先性,移动计算不移动数据,尽可能将计算任何和数据放在同一个位置
* an HDFS file)
RDD弹性:
1-自动进行内存以及磁盘交换
2-基于血统linage的高效容错机制,linage记录了rdd的依赖关系
3-task执行失败会自动重试
4-stages失败会自动重试
5-checkpoint实现数据的持久化,可以进行RDD的依赖关系的重建
RDD的特性:
1-分区--一个RDD有多个分区
2-只读-rdd当中数据是只读的,不能更改,如果想更改需要重新建立rdd
3-依赖-rdd之间有依赖关系,血统
4-缓存:常用的rdd,多次使用rdd可以缓存起来
5-checkpoint:持久化,对于一些比较珍贵的rdd可以使用检查点机制,重建RDD
2.RDD的创建以及操作方式
2.1创建RDD的三种方式
第一种方式创建RDD:由一个已经存在的集合创建
val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7,8))
指定分区数
val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7,8),2)
第二种方式创建RDD:由外部存储文件创建
包括本地的文件系统,还有所有Hadoop支持的数据集,比如HDFS、Cassandra、HBase等。
val rdd2 = sc.textFile("/words.txt")
第三种方式创建RDD:由已有的RDD经过算子转换,生成新的RDD
val rdd3=rdd2.flatMap(_.split(" "))
Or:
val rdd3 = sc.parallelize(Array(1,2,3,4,5,6,7,8))
或者
//makeRDD()的底层是parallelize()
val rdd4 = sc.makeRDD(List(1,2,3,4,5,6,7,8))
2.2RDD的算子分类
transformation算子
RDD之间的转换
action算子
RDD的结果的执行
RDD是懒执行
只有当触发了Action的RDD,前面所有的tranformation计算才会执行
之所以使用惰性求值/延迟执行,是因为这样可以在Action时对RDD操作形成DAG有向无环图进行Stage的划分和并行优化,这种设计让Spark更加有效率地运行。
2.2.2Transformation转换算子
查看分区数
sc.parallelize(List(5,6,4,7,3,8,2,9,1,10)).partitions.length
//没有指定分区数,默认值是2
sc.parallelize(List(5,6,4,7,3,8,2,9,1,10),3).partitions.length
//指定了分区数为3
sc.textFile("hdfs://node01:8020/wordcount/input/words.txt").partitions.length
//2
●RDD分区的数据取决于哪些因素?
RDD分区的原则是使得分区的个数尽量等于集群中的CPU核心(core)数目,这样可以充分利用CPU的计算资源
●扩展:
1.启动的时候指定的CPU核数确定了一个参数值:
spark.default.parallelism=指定的CPU核数(集群模式最小2)
2.对于Scala集合调用parallelize(集合,分区数)方法,
如果没有指定分区数,就使用spark.default.parallelism,
如果指定了就使用指定的分区数(不要指定大于spark.default.parallelism)
3.对于textFile(文件,分区数)
defaultMinPartitions
如果没有指定分区数sc.defaultMinPartitions=min(defaultParallelism,2)
如果指定了就使用指定的分区数sc.defaultMinPartitions=指定的分区数
rdd的分区数
对于本地文件:
rdd的分区数 = max(本地file的分片数, sc.defaultMinPartitions)
对于HDFS文件:
rdd的分区数 = max(hdfs文件的block数目, sc.defaultMinPartitions)
所以如果分配的核数为多个,且从文件中读取数据创建RDD,即使hdfs文件只有1个切片,最后的Spark的RDD的partition数也有可能是2
map()
需求:创建一个1-10数组的RDD,将所有元素*2形成新的RDD
scala> var source = sc.parallelize(1 to 10)
source: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:24
scala> source.collect()
res7: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> val mapadd = source.map(_ * 2)
mapadd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[9] at map at <console>:26
scala> mapadd.collect()
res8: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
filter()
需求:创建一个RDD(由字符串组成),过滤出一个新RDD(包含”xiao”子串)
scala> var sourceFilter = sc.parallelize(Array("xiaoming","xiaojiang","xiaohe","dazhi"))
sourceFilter: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[10] at parallelize at <console>:24
scala> val filter = sourceFilter.filter(_.contains("xiao"))
filter: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at filter at <console>:26
scala> sourceFilter.collect()
res9: Array[String] = Array(xiaoming, xiaojiang, xiaohe, dazhi)
scala> filter.collect()
res10: Array[String] = Array(xiaoming, xiaojiang, xiaohe)
flatMap()
需求:创建一个元素为1-5的RDD,运用flatMap创建一个新的RDD,新的RDD为原RDD的每个元素的2倍(2,4,6,8,10)
scala> val sourceFlat = sc.parallelize(1 to 5)
sourceFlat: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at parallelize at <console>:24
scala> sourceFlat.collect()
res11: Array[Int] = Array(1, 2, 3, 4, 5)
scala> val flatMap = sourceFlat.flatMap(1 to _)
flatMap: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[13] at flatMap at <console>:26
scala> flatMap.collect()
res12: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5)
sample()
需求:创建一个RDD(1-10),从中选择放回和不放回抽样
scala> val rdd = sc.parallelize(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:24
scala> rdd.collect()
res15: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> var sample1 = rdd.sample(true,0.4,2)
sample1: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[21] at sample at <console>:26
scala> sample1.collect()
res16: Array[Int] = Array(1, 2, 2, 7, 7, 8, 9)
scala> var sample2 = rdd.sample(false,0.2,3)
sample2: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[22] at sample at <console>:26
scala> sample2.collect()
res17: Array[Int] = Array(1, 9)
takeSample()
glom()
获取每个区分的数据
sortBy()
scala> val rdd = sc.parallelize(List(1,2,3,4))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[21] at parallelize at <console>:24
scala> rdd.sortBy(x => x).collect()
res11: Array[Int] = Array(1, 2, 3, 4)
scala> rdd.sortBy(x => x%3).collect()
res12: Array[Int] = Array(3, 4, 1, 2)
coalesce()
如果缩减分区不走shuffle,如果扩大分区需要走shuffle ,默认为false
scala> val rdd = sc.parallelize(1 to 16,4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[54] at parallelize at <console>:24
scala> rdd.partitions.size
res20: Int = 4
scala> val coalesceRDD = rdd.coalesce(3)
coalesceRDD: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[55] at coalesce at <console>:26
scala> coalesceRDD.partitions.size
res21: Int = 3
scala> rdd.coalesce(8,true)
res13: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[23] at coalesce at <console>:27
scala> res13.glom().collect
res14: Array[Array[Int]] = Array(Array(3, 7, 11, 15), Array(4, 8, 12, 16), Array(), Array(), Array(), Array(), Array(1, 5, 9, 13), Array(2, 6, 10, 14))
repartition()
scala> val rdd = sc.parallelize(1 to 16,4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[56] at parallelize at <console>:24
scala> rdd.partitions.size
res22: Int = 4
scala> val rerdd = rdd.repartition(2)
rerdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[60] at repartition at <console>:26
scala> rerdd.partitions.size
res23: Int = 2
scala> val rerdd = rdd.repartition(4)
rerdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[64] at repartition at <console>:26
scala> rerdd.partitions.size
res24: Int = 4
union()
需求:创建两个RDD,求并集
scala> val rdd1 = sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[23] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(5 to 10)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[24] at parallelize at <console>:24
scala> val rdd3 = rdd1.union(rdd2)
rdd3: org.apache.spark.rdd.RDD[Int] = UnionRDD[25] at union at <console>:28
scala> rdd3.collect()
res18: Array[Int] = Array(1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10)
intersection()
需求:创建两个RDD,求两个RDD的交集
scala> val rdd1 = sc.parallelize(1 to 7)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[26] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(5 to 10)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[27] at parallelize at <console>:24
scala> val rdd3 = rdd1.intersection(rdd2)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[33] at intersection at <console>:28
scala> rdd3.collect()
[Stage 15:=============================> (2 + 2) res19: Array[Int] = Array(5, 6, 7)
distinct()
去重
scala> val distinctRdd = sc.parallelize(List(1,2,1,5,2,9,6,1))
distinctRdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[34] at parallelize at <console>:24
scala> val unionRDD = distinctRdd.distinct()
unionRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[37] at distinct at <console>:26
scala> unionRDD.collect()
[Stage 16:> (0 + 4) [Stage 16:=============================> (2 + 2) res20: Array[Int] = Array(1, 9, 5, 6, 2)
重新指定去重后的分区数
scala> val unionRDD = distinctRdd.distinct(2)
unionRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[40] at distinct at <console>:26
scala> unionRDD.collect()
res21: Array[Int] = Array(6, 2, 1, 9, 5)
subtract()
差集
scala> val rdd = sc.parallelize(3 to 8)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[70] at parallelize at <console>:24
scala> val rdd1 = sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[71] at parallelize at <console>:24
scala> rdd.subtract(rdd1).collect()
res27: Array[Int] = Array(8, 6, 7)
zip()
需求:创建两个RDD,并将两个RDD组合到一起形成一个(k,v)RDD
两个RDD分区数必须相同
(1)创建第一个RDD
scala> val rdd1 = sc.parallelize(Array(1,2,3),3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24
(2)创建第二个RDD(与1分区数相同)
scala> val rdd2 = sc.parallelize(Array("a","b","c"),3)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[2] at parallelize at <console>:24
(3)第一个RDD组合第二个RDD并打印
scala> rdd1.zip(rdd2).collect
res1: Array[(Int, String)] = Array((1,a), (2,b), (3,c))
(4)第二个RDD组合第一个RDD并打印
scala> rdd2.zip(rdd1).collect
res2: Array[(String, Int)] = Array((a,1), (b,2), (c,3))
(5)创建第三个RDD(与1,2分区数不同)
scala> val rdd3 = sc.parallelize(Array("a","b","c"),2)
rdd3: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[5] at parallelize at <console>:24
(6)第一个RDD组合第三个RDD并打印
scala> rdd1.zip(rdd3).collect
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions: List(3, 2)
at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1965)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
... 48 elided
partionBy()
对RDD进行分区操作,如果原有的partionRDD和现有的partionRDD是一致的话就不进行分区, 否则会生成ShuffleRDD.
scala> val rdd = sc.parallelize(Array((1,"aaa"),(2,"bbb"),(3,"ccc"),(4,"ddd")),4)
rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[44] at parallelize at <console>:24
scala> rdd.partitions.size
res24: Int = 4
scala> var rdd2 = rdd.partitionBy(new org.apache.spark.HashPartitioner(2))
rdd2: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[45] at partitionBy at <console>:26
scala> rdd2.partitions.size
res25: Int = 2
reduceByKey()
scala> val rdd = sc.parallelize(List(("female",1),("male",5),("female",5),("male",2)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[46] at parallelize at <console>:24
scala> val reduce = rdd.reduceByKey((x,y) => x+y)
reduce: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[47] at reduceByKey at <console>:26
scala> reduce.collect()
res29: Array[(String, Int)] = Array((female,6), (male,7))
groupByKey()
scala> val words = Array("one", "two", "two", "three", "three", "three")
words: Array[String] = Array(one, two, two, three, three, three)
scala> val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))
wordPairsRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[4] at map at <console>:26
scala> val group = wordPairsRDD.groupByKey()
group: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[5] at groupByKey at <console>:28
scala> group.collect()
res1: Array[(String, Iterable[Int])] = Array((two,CompactBuffer(1, 1)), (one,CompactBuffer(1)), (three,CompactBuffer(1, 1, 1)))
scala> group.map(t => (t._1, t._2.sum))
res2: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[6] at map at <console>:31
scala> res2.collect()
res3: Array[(String, Int)] = Array((two,2), (one,1), (three,3))
scala> val map = group.map(t => (t._1, t._2.sum))
map: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[7] at map at <console>:30
scala> map.collect()
res4: Array[(String, Int)] = Array((two,2), (one,1), (three,3))
reduceByKey和groupByKey的区别
1. reduceByKey:按照key进行聚合,在shuffle之前有combine(预聚合)操作,返回结果是RDD[k,v].
2. groupByKey:按照key进行分组,直接进行shuffle。
3. 开发指导:reduceByKey比groupByKey,建议使用。但是需要注意是否会影响业务逻辑。
sortByKey()
scala> val rdd = sc.parallelize(Array((3,"aa"),(6,"cc"),(2,"bb"),(1,"dd")))
rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[14] at parallelize at <console>:24
scala> rdd.sortByKey(true).collect()
res9: Array[(Int, String)] = Array((1,dd), (2,bb), (3,aa), (6,cc))
scala> rdd.sortByKey(false).collect()
res10: Array[(Int, String)] = Array((6,cc), (3,aa), (2,bb), (1,dd))
sortBy()底层调用了sortByKey()方法
join()
scala> val rdd = sc.parallelize(Array((1,"a"),(2,"b"),(3,"c")))
rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[0] at parallelize at <console>:24scala> val rdd1 = sc.parallelize(Array((1,4),(2,5),(4,6)))
rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[1] at parallelize at <console>:24scala> rdd.join(rdd1).collect
res0: Array[(Int, (String, Int))] = Array((1,(a,4)), (2,(b,5)))
cartesian()
scala> val rdd1 = sc.parallelize(1 to 3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[47] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(2 to 5)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[48] at parallelize at <console>:24
scala> rdd1.cartesian(rdd2).collect()
res17: Array[(Int, Int)] = Array((1,2), (1,3), (1,4), (1,5), (2,2), (2,3), (2,4), (2,5), (3,2), (3,3), (3,4), (3,5))
mapValues ()
只对key-value类型的value进行操作
scala> val rdd3 = sc.parallelize(Array((1,"a"),(1,"d"),(2,"b"),(3,"c")))
rdd3: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[67] at parallelize at <console>:24
scala> rdd3.mapValues(_+"|||").collect()
res26: Array[(Int, String)] = Array((1,a|||), (1,d|||), (2,b|||), (3,c|||))
2.2.3Action算子
reduce()
聚合rdd中的所有元素
需求:创建一个RDD,将所有元素聚合得到结果。
scala> val rdd1 = sc.makeRDD(1 to 10,2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[85] at makeRDD at <console>:24
scala> rdd1.reduce(_+_)
res50: Int = 55
scala> val rdd2 = sc.makeRDD(Array(("a",1),("a",3),("c",3),("d",5)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[86] at makeRDD at <console>:24
scala> rdd2.reduce((x,y)=>(x._1 + y._1,x._2 + y._2))
res51: (String, Int) = (aacd,12)
count()
需求:创建一个RDD,并将RDD内容收集到Driver端打印
scala> var rdd = sc.makeRDD(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[59] at makeRDD at <console>:24
scala> rdd.count()
res31: Long = 10
first()
需求:创建一个RDD,返回该RDD中的第一个元素
scala> var rdd = sc.makeRDD(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[60] at makeRDD at <console>:24
scala> rdd.first()
res32: Int = 1
take(N)
2. 需求:创建一个RDD,统计该RDD的条数
scala> var rdd = sc.makeRDD(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[61] at makeRDD at <console>:24
scala> rdd.take(5)
res33: Array[Int] = Array(1, 2, 3, 4, 5)
takeSample()
返回一个数组,该数组由从数据集中随机采样的num个元素组成,可以选择是否用随机数替换不足的部分,seed用于指定随机数生成器种子
scala> var rdd =sc.parallelize(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[62] at parallelize at <console>:24
scala> rdd.collect()
res34: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> rdd.takeSample(true,5,3)
res35: Array[Int] = Array(3, 5, 5, 9, 7)
takeOrdered()
返回排序后的前几个值
scala> var rdd1 = sc.makeRDD(Seq(10,4,2,12,3))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[64] at makeRDD at <console>:24
scala> rdd1.top(2)
res36: Array[Int] = Array(12, 10)
scala> rdd1.takeOrdered(2)
res37: Array[Int] = Array(2, 3)
savASTextFile
将数据集的元素以textfile的形式保存到HDFS文件系统或者其他支持的文件系统,对于每个元素,Spark将会调用toString方法,将它装换为文件中的文本
countByKey()
需求:创建一个PairRDD,统计每种key的个数
scala> val rdd = sc.parallelize(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)),3)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[95] at parallelize at <console>:24
scala> rdd.countByKey()
res63: scala.collection.Map[Int,Long] = Map(3 -> 2, 1 -> 3, 2 -> 1)
foreach(func)
在数据集的每一个元素上,运行函数func进行更新。
需求:创建一个RDD,对每个元素进行打印
scala> var rdd = sc.makeRDD(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[107] at makeRDD at <console>:24
scala> var sum = sc.accumulator(0)
warning: there were two deprecation warnings; re-run with -deprecation for details
sum: org.apache.spark.Accumulator[Int] = 0
scala> rdd.foreach(sum+=_)
scala> sum.value
res68: Int = 55
scala> rdd.collect().foreach(println)
3.复杂算子
aggregateByKey()
如果分三个分区,前两个是一个分区,中间两个是一个分区,最后两个是一个分区,第一个分区的最终结果为(1,3),第二个分区为(1,4)(2,3),最后一个分区为(3,8),combine后为 (3,8), (1,7), (2,3)
scala> val rdd = sc.parallelize(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)),3)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[12] at parallelize at <console>:24
rdd.glom.collect各个元素的分区的情况
scala> val agg = rdd.aggregateByKey(0)(math.max(_,_),_+_)
agg: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[13] at aggregateByKey at <console>:26
scala> agg.collect()
res7: Array[(Int, Int)] = Array((3,8), (1,7), (2,3))
scala> agg.partitions.size
res8: Int = 3
scala> val rdd = sc.parallelize(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)),1)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[10] at parallelize at <console>:24
scala> val agg = rdd.aggregateByKey(0)(math.max(_,_),_+_).collect()
agg: Array[(Int, Int)] = Array((1,4), (3,8), (2,3))
foldByKey ()
分区内和分区间执行的运算相同
scala> val rdd = sc.parallelize(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)),3)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[91] at parallelize at <console>:24
scala> val agg = rdd.foldByKey(0)(_+_)
agg: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[92] at foldByKey at <console>:26
scala> agg.collect()
res61: Array[(Int, Int)] = Array((3,14), (1,9), (2,3))
cogroup()
在类型为(K,V)和(K,W)的RDD上调用,返回一个(K,(Iterable<V>,Iterable<W>))类型的RDD
aggerate ()
fold()
总结:
1-combineByKey将相同key的value进行组合
2-aggregateByKey是根据初始值和seqOp分区内的函数进行操作,在将结果和分区间的函数进行操作(aggregateByKey在分区间是不进行初始值的操作)
3-foldByKey是aggregateByKey的简化的版本,一般分区内的函数和分区间的函数相同的
4-cogroup--相同key的value进行组合
5-Action类型的aggregate 是将初始值和分区内的函数进行操作,分区间在和初始值合并操作(aggregate在分区间是进行初始值的操作)
6-Action类型的fold是aggregate的简化版本,分区内的函数和分区间的函数合并在一起