Spark Core Operators


1. RDD Basics

RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. It represents an immutable, partitionable collection whose elements can be computed in parallel.

What does "resilient" mean?

  • Data is processed in memory first; if memory is insufficient, the data spills to disk and is processed there.

  • Memory + disk (see the sketch below).
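A minimal sketch of the memory-plus-disk idea in spark-shell; the data size and the choice of storage level here are illustrative, not taken from the original text:

import org.apache.spark.storage.StorageLevel

val bigRdd = sc.parallelize(1 to 1000000)
// Keep partitions in memory and spill those that do not fit to disk.
bigRdd.persist(StorageLevel.MEMORY_AND_DISK)
bigRdd.count()   // the first action materializes and caches the data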

2. The Five Main Properties of an RDD

From the RDD source code:

* Internally, each RDD is characterized by five main properties:
*
*  - A list of partitions
*  - A function for computing each split
*  - A list of dependencies on other RDDs
*  - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
*  - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
*    an HDFS file)

The last property expresses locality preference: move the computation to the data rather than the data to the computation, keeping the two on the same node whenever possible.
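These five properties can be inspected directly on an RDD; a minimal sketch in spark-shell (the partitioner and preferred locations you see depend on your environment):

val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (1, "c")), 4).reduceByKey(_ + _)
pairs.partitions.length                          // 1. the list of partitions
pairs.partitioner                                // 4. Some(HashPartitioner(4)) for this key-value RDD
pairs.dependencies                               // 3. dependencies on the parent RDD
pairs.preferredLocations(pairs.partitions(0))    // 5. preferred locations (usually empty for in-memory data)
// 2. the compute function is what reduceByKey runs on each partition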

 

  • RDD resilience:

    • 1 - Automatic switching between memory and disk storage

    • 2 - Efficient fault tolerance based on lineage; the lineage records the dependencies between RDDs

    • 3 - Failed tasks are automatically retried

    • 4 - Failed stages are automatically retried

    • 5 - Checkpointing persists data to reliable storage, so an RDD can be recovered without replaying its full dependency chain

  • RDD characteristics:

    • 1 - Partitioned: an RDD consists of multiple partitions

    • 2 - Read-only: the data in an RDD cannot be modified; to change it you must build a new RDD

    • 3 - Dependencies: RDDs depend on one another, forming a lineage

    • 4 - Caching: an RDD that is used multiple times can be cached

    • 5 - Checkpointing: valuable RDDs can be persisted via the checkpoint mechanism and rebuilt from there (see the sketch after this list)
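A minimal sketch of caching and checkpointing in spark-shell; the checkpoint directory and input path are placeholders:

sc.setCheckpointDir("/tmp/spark-checkpoints")        // placeholder directory

val words = sc.textFile("/words.txt").flatMap(_.split(" "))
words.cache()        // keep a frequently reused RDD in memory
words.checkpoint()   // persist it so it can be rebuilt without replaying the full lineage
words.count()        // the first action triggers both the cache and the checkpoint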

3. Creating and Operating on RDDs

3.1 Three Ways to Create an RDD

Way 1: create an RDD from an existing collection

val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7,8))

Specifying the number of partitions:

val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7,8),2)

Way 2: create an RDD from external storage

This includes the local file system and any data source supported by Hadoop, such as HDFS, Cassandra, and HBase.

val rdd2 = sc.textFile("/words.txt")

Way 3: create a new RDD by applying a transformation to an existing RDD

val rdd3=rdd2.flatMap(_.split(" "))

Or:

val rdd3 = sc.parallelize(Array(1,2,3,4,5,6,7,8))

Or, using makeRDD:

// makeRDD() delegates to parallelize() under the hood

val rdd4 = sc.makeRDD(List(1,2,3,4,5,6,7,8))

3.2 RDD Operator Categories

  • Transformation operators

    • Transform one RDD into another

  • Action operators

    • Trigger execution and produce a result

  • RDDs are lazily evaluated

    • Only when an action is triggered are all the preceding transformations actually computed

    • Lazy evaluation is used because, at the moment an action runs, Spark can turn the chain of RDD operations into a DAG (directed acyclic graph), split it into stages, and optimize for parallel execution; this design makes Spark run more efficiently (see the sketch below).
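A minimal sketch of lazy evaluation in spark-shell; the file path is the same placeholder used above:

val lines = sc.textFile("/words.txt")       // nothing is read yet
val tokens = lines.flatMap(_.split(" "))    // transformations only record the DAG
val pairs = tokens.map((_, 1))              // still nothing has executed
val counts = pairs.reduceByKey(_ + _)       // still lazy
counts.collect()                            // the action triggers the whole pipeline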

3.2.1 Transformation Operators

Checking the number of partitions:

sc.parallelize(List(5,6,4,7,3,8,2,9,1,10)).partitions.length

// no partition count specified; the default here is 2

sc.parallelize(List(5,6,4,7,3,8,2,9,1,10),3).partitions.length

// partition count explicitly set to 3

sc.textFile("hdfs://node01:8020/wordcount/input/words.txt").partitions.length

//2

 

● What determines the number of partitions of an RDD?

The guiding principle is to make the number of partitions roughly equal to the number of CPU cores in the cluster, so that the CPU's computing resources are fully used.

● Details:

1. The number of CPU cores specified at startup determines a configuration value:

spark.default.parallelism = the specified number of cores (at least 2 in cluster mode)

2. For parallelize(collection, numPartitions) on a Scala collection: if no partition count is specified, spark.default.parallelism is used; if one is specified, that value is used (do not specify more than spark.default.parallelism).

3. For textFile(file, numPartitions) the relevant value is defaultMinPartitions:

if no partition count is specified, sc.defaultMinPartitions = min(defaultParallelism, 2)

if one is specified, sc.defaultMinPartitions = the specified partition count

The resulting number of RDD partitions:

For a local file:

number of partitions = max(number of splits of the local file, sc.defaultMinPartitions)

For an HDFS file:

number of partitions = max(number of HDFS blocks, sc.defaultMinPartitions)

So if multiple cores are allocated and the RDD is created by reading a file, the resulting RDD may still end up with 2 partitions even when the HDFS file has only one split (see the sketch below).
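A minimal sketch for checking these values in spark-shell; the HDFS path is the one used elsewhere in this article and is a placeholder for your own file:

sc.defaultParallelism                          // driven by the allocated cores
sc.defaultMinPartitions                        // min(defaultParallelism, 2)
sc.parallelize(1 to 100).partitions.length     // with no argument, parallelize uses defaultParallelism
sc.textFile("hdfs://node01:8020/wordcount/input/words.txt").partitions.length
// = max(number of HDFS blocks, sc.defaultMinPartitions) per the rule above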

map() 

Task: create an RDD from the numbers 1-10 and multiply every element by 2 to form a new RDD.

scala> var source  = sc.parallelize(1 to 10)

source: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:24

 

scala> source.collect()

res7: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

 

scala> val mapadd = source.map(_ * 2)

mapadd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[9] at map at <console>:26

 

scala> mapadd.collect()

res8: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)

 filter()

Task: create an RDD of strings and filter it into a new RDD containing only elements with the substring "xiao".

scala> var sourceFilter = sc.parallelize(Array("xiaoming","xiaojiang","xiaohe","dazhi"))

sourceFilter: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[10] at parallelize at <console>:24

scala> val filter = sourceFilter.filter(_.contains("xiao"))

filter: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at filter at <console>:26

scala> sourceFilter.collect()

res9: Array[String] = Array(xiaoming, xiaojiang, xiaohe, dazhi)

scala> filter.collect()

res10: Array[String] = Array(xiaoming, xiaojiang, xiaohe)

flatMap() 

Task: create an RDD of the elements 1-5 and use flatMap to expand each element x into the sequence 1 to x.

 

scala> val sourceFlat = sc.parallelize(1 to 5)

sourceFlat: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at parallelize at <console>:24

 

scala> sourceFlat.collect()

res11: Array[Int] = Array(1, 2, 3, 4, 5)

 

scala> val flatMap = sourceFlat.flatMap(1 to _)

flatMap: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[13] at flatMap at <console>:26

 

scala> flatMap.collect()

res12: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5)

sample()

Task: create an RDD of 1-10 and sample from it both with and without replacement.

scala> val rdd = sc.parallelize(1 to 10)

rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:24

 

scala> rdd.collect()

res15: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

 

scala> var sample1 = rdd.sample(true,0.4,2)

sample1: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[21] at sample at <console>:26

 

scala> sample1.collect()

res16: Array[Int] = Array(1, 2, 2, 7, 7, 8, 9)

 

scala> var sample2 = rdd.sample(false,0.2,3)

sample2: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[22] at sample at <console>:26

 

scala> sample2.collect()

res17: Array[Int] = Array(1, 9)

takeSample() is an action rather than a transformation; see the Action operators section below.

 glom()

Returns the data of each partition as an array.
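A minimal sketch of glom() in spark-shell; exactly which elements land in which partition depends on how the collection is sliced, but a simple range is split evenly:

val rdd = sc.parallelize(1 to 8, 2)
rdd.glom().collect()   // typically Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8))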

sortBy()

 

scala> val rdd = sc.parallelize(List(1,2,3,4))

rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[21] at parallelize at <console>:24

 

scala> rdd.sortBy(x => x).collect()

res11: Array[Int] = Array(1, 2, 3, 4)

 

scala> rdd.sortBy(x => x%3).collect()

res12: Array[Int] = Array(3, 4, 1, 2)

coalesce()

Reducing the number of partitions does not require a shuffle; increasing it does. The shuffle flag defaults to false.

scala> val rdd = sc.parallelize(1 to 16,4)

rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[54] at parallelize at <console>:24

 

scala> rdd.partitions.size

res20: Int = 4

 

scala> val coalesceRDD = rdd.coalesce(3)

coalesceRDD: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[55] at coalesce at <console>:26

 

scala> coalesceRDD.partitions.size

res21: Int = 3

With shuffle = true, coalesce can also increase the number of partitions:

scala> rdd.coalesce(8,true)
res13: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[23] at coalesce at <console>:27

scala> res13.glom().collect
res14: Array[Array[Int]] = Array(Array(3, 7, 11, 15), Array(4, 8, 12, 16), Array(), Array(), Array(), Array(), Array(1, 5, 9, 13), Array(2, 6, 10, 14))

repartition()

scala> val rdd = sc.parallelize(1 to 16,4)

rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[56] at parallelize at <console>:24

 

scala> rdd.partitions.size

res22: Int = 4

 

scala> val rerdd = rdd.repartition(2)

rerdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[60] at repartition at <console>:26

 

scala> rerdd.partitions.size

res23: Int = 2

 

scala> val rerdd = rdd.repartition(4)

rerdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[64] at repartition at <console>:26

 

scala> rerdd.partitions.size

res24: Int = 4

 

 union()

Task: create two RDDs and compute their union.

scala> val rdd1 = sc.parallelize(1 to 5)

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[23] at parallelize at <console>:24

 

scala> val rdd2 = sc.parallelize(5 to 10)

rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[24] at parallelize at <console>:24

 

scala> val rdd3 = rdd1.union(rdd2)

rdd3: org.apache.spark.rdd.RDD[Int] = UnionRDD[25] at union at <console>:28

 

scala> rdd3.collect()

res18: Array[Int] = Array(1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10)

intersection()

Task: create two RDDs and compute their intersection.

scala> val rdd1 = sc.parallelize(1 to 7)

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[26] at parallelize at <console>:24

 

scala> val rdd2 = sc.parallelize(5 to 10)

rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[27] at parallelize at <console>:24

 

scala> val rdd3 = rdd1.intersection(rdd2)

rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[33] at intersection at <console>:28

 

scala> rdd3.collect()

res19: Array[Int] = Array(5, 6, 7)

distinct()

Removes duplicate elements.

scala> val distinctRdd = sc.parallelize(List(1,2,1,5,2,9,6,1))

distinctRdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[34] at parallelize at <console>:24

 

scala> val unionRDD = distinctRdd.distinct()

unionRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[37] at distinct at <console>:26

 

scala> unionRDD.collect()

res20: Array[Int] = Array(1, 9, 5, 6, 2)

Specifying the number of partitions after deduplication:

scala> val unionRDD = distinctRdd.distinct(2)

unionRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[40] at distinct at <console>:26

 

scala> unionRDD.collect()

res21: Array[Int] = Array(6, 2, 1, 9, 5)

subtract()

Set difference: the elements of this RDD that are not in the other RDD.

scala> val rdd = sc.parallelize(3 to 8)

rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[70] at parallelize at <console>:24

 

scala> val rdd1 = sc.parallelize(1 to 5)

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[71] at parallelize at <console>:24

 

scala> rdd.subtract(rdd1).collect()

res27: Array[Int] = Array(8, 6, 7)

zip()

Task: create two RDDs and zip them together into a (k, v) RDD.

The two RDDs must have the same number of partitions (and the same number of elements in each partition).

(1) Create the first RDD

scala> val rdd1 = sc.parallelize(Array(1,2,3),3)

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24

(2) Create the second RDD (same number of partitions as the first)

scala> val rdd2 = sc.parallelize(Array("a","b","c"),3)

rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[2] at parallelize at <console>:24

(3) Zip the first RDD with the second and print the result

scala> rdd1.zip(rdd2).collect

res1: Array[(Int, String)] = Array((1,a), (2,b), (3,c))

(4) Zip the second RDD with the first and print the result

scala> rdd2.zip(rdd1).collect

res2: Array[(String, Int)] = Array((a,1), (b,2), (c,3))

(5) Create a third RDD (with a different number of partitions from the first two)

scala> val rdd3 = sc.parallelize(Array("a","b","c"),2)

rdd3: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[5] at parallelize at <console>:24

(6) Zip the first RDD with the third; this fails because the partition counts differ

scala> rdd1.zip(rdd3).collect

java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions: List(3, 2)

  at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)

  at scala.Option.getOrElse(Option.scala:121)

  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)

  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1965)

  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)

  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)

  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)

  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)

  at org.apache.spark.rdd.RDD.collect(RDD.scala:935)

  ... 48 elided

 

partitionBy()

Repartitions a key-value RDD using the given partitioner. If the RDD already has the same partitioner, no repartitioning happens; otherwise a ShuffledRDD is produced.

scala> val rdd = sc.parallelize(Array((1,"aaa"),(2,"bbb"),(3,"ccc"),(4,"ddd")),4)

rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[44] at parallelize at <console>:24

 

scala> rdd.partitions.size

res24: Int = 4

 

scala> var rdd2 = rdd.partitionBy(new org.apache.spark.HashPartitioner(2))

rdd2: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[45] at partitionBy at <console>:26

 

scala> rdd2.partitions.size

res25: Int = 2

 

reduceByKey()

scala> val rdd = sc.parallelize(List(("female",1),("male",5),("female",5),("male",2)))

rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[46] at parallelize at <console>:24

 

scala> val reduce = rdd.reduceByKey((x,y) => x+y)

reduce: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[47] at reduceByKey at <console>:26

 

scala> reduce.collect()

res29: Array[(String, Int)] = Array((female,6), (male,7))

 

groupByKey()

scala> val words = Array("one", "two", "two", "three", "three", "three")

words: Array[String] = Array(one, two, two, three, three, three)

 

scala> val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))

wordPairsRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[4] at map at <console>:26

 

scala> val group = wordPairsRDD.groupByKey()

group: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[5] at groupByKey at <console>:28

 

scala> group.collect()

res1: Array[(String, Iterable[Int])] = Array((two,CompactBuffer(1, 1)), (one,CompactBuffer(1)), (three,CompactBuffer(1, 1, 1)))

 

scala> group.map(t => (t._1, t._2.sum))

res2: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[6] at map at <console>:31

 

scala> res2.collect()

res3: Array[(String, Int)] = Array((two,2), (one,1), (three,3))

 

scala> val map = group.map(t => (t._1, t._2.sum))

map: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[7] at map at <console>:30

 

scala> map.collect()

res4: Array[(String, Int)] = Array((two,2), (one,1), (three,3))

 

Differences between reduceByKey and groupByKey

1. reduceByKey: aggregates by key and performs a combine (map-side pre-aggregation) step before the shuffle; the result is an RDD[(K, V)].

2. groupByKey: groups by key and shuffles all values directly, with no pre-aggregation.

3. Guidance: prefer reduceByKey over groupByKey, but check that the pre-aggregation does not change your business logic (see the sketch below).
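A minimal sketch showing the same word count done both ways, reusing the data from the groupByKey example above:

val pairs = sc.parallelize(Array("one", "two", "two", "three", "three", "three")).map((_, 1))

// reduceByKey: values are pre-aggregated inside each partition before the shuffle.
pairs.reduceByKey(_ + _).collect()             // Array((two,2), (one,1), (three,3)), order may vary

// groupByKey: every (word, 1) pair is shuffled, then summed afterwards.
pairs.groupByKey().mapValues(_.sum).collect()  // same result, but more data crosses the network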

 

sortByKey()

scala> val rdd = sc.parallelize(Array((3,"aa"),(6,"cc"),(2,"bb"),(1,"dd")))

rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[14] at parallelize at <console>:24

 

scala> rdd.sortByKey(true).collect()

res9: Array[(Int, String)] = Array((1,dd), (2,bb), (3,aa), (6,cc))

 

scala> rdd.sortByKey(false).collect()

res10: Array[(Int, String)] = Array((6,cc), (3,aa), (2,bb), (1,dd))

sortBy() is implemented on top of sortByKey().

join()

scala> val rdd = sc.parallelize(Array((1,"a"),(2,"b"),(3,"c")))
rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala>  val rdd1 = sc.parallelize(Array((1,4),(2,5),(4,6)))
rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala> rdd.join(rdd1).collect
res0: Array[(Int, (String, Int))] = Array((1,(a,4)), (2,(b,5)))

cartesian()

scala> val rdd1 = sc.parallelize(1 to 3)

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[47] at parallelize at <console>:24

 

scala> val rdd2 = sc.parallelize(2 to 5)

rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[48] at parallelize at <console>:24

 

scala> rdd1.cartesian(rdd2).collect()

res17: Array[(Int, Int)] = Array((1,2), (1,3), (1,4), (1,5), (2,2), (2,3), (2,4), (2,5), (3,2), (3,3), (3,4), (3,5))

 

mapValues()

Operates only on the values of a key-value RDD; the keys are left unchanged.

scala> val rdd3 = sc.parallelize(Array((1,"a"),(1,"d"),(2,"b"),(3,"c")))

rdd3: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[67] at parallelize at <console>:24

 

scala> rdd3.mapValues(_+"|||").collect()

res26: Array[(Int, String)] = Array((1,a|||), (1,d|||), (2,b|||), (3,c|||))

3.2.2 Action Operators

reduce()

Aggregates all elements of the RDD.

Task: create an RDD and aggregate all of its elements into a single result.

 

scala> val rdd1 = sc.makeRDD(1 to 10,2)

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[85] at makeRDD at <console>:24

 

scala> rdd1.reduce(_+_)

res50: Int = 55

 

scala> val rdd2 = sc.makeRDD(Array(("a",1),("a",3),("c",3),("d",5)))

rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[86] at makeRDD at <console>:24

 

scala> rdd2.reduce((x,y)=>(x._1 + y._1,x._2 + y._2))

res51: (String, Int) = (aacd,12)

count()

Task: create an RDD and count the number of elements in it.

 

scala> var rdd = sc.makeRDD(1 to 10,2)

rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[59] at makeRDD at <console>:24

scala> rdd.count()

res31: Long = 10

first()

Task: create an RDD and return its first element.

 

scala> var rdd = sc.makeRDD(1 to 10,2)

rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[60] at makeRDD at <console>:24

scala> rdd.first()

res32: Int = 1

take(N)

Task: create an RDD and return its first N elements.

scala> var rdd = sc.makeRDD(1 to 10,2)

rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[61] at makeRDD at <console>:24

scala> rdd.take(5)

res33: Array[Int] = Array(1, 2, 3, 4, 5)

takeSample()

Returns an array of num elements randomly sampled from the dataset; withReplacement controls whether sampling is done with or without replacement, and seed specifies the random number generator seed.

scala> var rdd =sc.parallelize(1 to 10,2)

rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[62] at parallelize at <console>:24

 

scala> rdd.collect()

res34: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

 

scala> rdd.takeSample(true,5,3)

res35: Array[Int] = Array(3, 5, 5, 9, 7)

takeOrdered()

Returns the first n elements in ascending order; compare with top(n), which returns the n largest elements (both appear below).

scala> var rdd1 = sc.makeRDD(Seq(10,4,2,12,3))

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[64] at makeRDD at <console>:24

scala> rdd1.top(2)

res36: Array[Int] = Array(12, 10)

 

scala> rdd1.takeOrdered(2)

res37: Array[Int] = Array(2, 3)

saveAsTextFile()

Saves the elements of the dataset as text files in HDFS or any other supported file system; for each element, Spark calls toString to convert it into a line of text (see the sketch below).
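A minimal sketch with a placeholder output path; one part file is written per partition:

val rdd = sc.parallelize(1 to 10, 2)
rdd.saveAsTextFile("hdfs://node01:8020/output/numbers")   // placeholder path; writes part-00000 and part-00001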

countByKey()

Task: create a pair RDD and count the number of occurrences of each key.

scala> val rdd = sc.parallelize(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)),3)

rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[95] at parallelize at <console>:24

 

scala> rdd.countByKey()

res63: scala.collection.Map[Int,Long] = Map(3 -> 2, 1 -> 3, 2 -> 1)

foreach(func)

Runs the function func on each element of the dataset, for example to update an accumulator.

Task: create an RDD and apply a function to every element (here, summing into an accumulator and then printing the elements).

scala> var rdd = sc.makeRDD(1 to 10,2)

rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[107] at makeRDD at <console>:24

 

scala> var sum = sc.accumulator(0)

warning: there were two deprecation warnings; re-run with -deprecation for details

sum: org.apache.spark.Accumulator[Int] = 0

 

scala> rdd.foreach(sum+=_)

 

scala> sum.value

res68: Int = 55

 

scala> rdd.collect().foreach(println)
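sc.accumulator is deprecated (hence the warning above); a minimal sketch of the same sum with the newer API, assuming Spark 2.x or later:

val acc = sc.longAccumulator("sum")   // replacement for the deprecated sc.accumulator(0)
rdd.foreach(x => acc.add(x))
acc.value                             // 55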

4. More Complex Operators

aggregateByKey()

With three partitions, the first two pairs go into one partition, the middle two into another, and the last two into the third. Applying max within each partition gives (1,3) for the first partition, (1,4) and (2,3) for the second, and (3,8) for the third; combining across partitions with + yields (3,8), (1,7), (2,3).

scala> val rdd = sc.parallelize(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)),3)

rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[12] at parallelize at <console>:24

rdd.glom.collect shows which elements fall into which partition.

scala> val agg = rdd.aggregateByKey(0)(math.max(_,_),_+_)

agg: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[13] at aggregateByKey at <console>:26

 

scala> agg.collect()

res7: Array[(Int, Int)] = Array((3,8), (1,7), (2,3))

 

scala> agg.partitions.size

res8: Int = 3

 

scala> val rdd = sc.parallelize(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)),1)

rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[10] at parallelize at <console>:24

 

scala> val agg = rdd.aggregateByKey(0)(math.max(_,_),_+_).collect()

agg: Array[(Int, Int)] = Array((1,4), (3,8), (2,3))

foldByKey()

The same function is applied both within and across partitions.

scala> val rdd = sc.parallelize(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)),3)

rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[91] at parallelize at <console>:24

 

scala> val agg = rdd.foldByKey(0)(_+_)

agg: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[92] at foldByKey at <console>:26

 

scala> agg.collect()

res61: Array[(Int, Int)] = Array((3,14), (1,9), (2,3))

cogroup()

Called on RDDs of type (K, V) and (K, W); returns an RDD of type (K, (Iterable<V>, Iterable<W>)) (see the sketch below).
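A minimal sketch of cogroup(), reusing the data from the join() example above; the order of keys in the output may vary:

val rdd1 = sc.parallelize(Array((1,"a"),(2,"b"),(3,"c")))
val rdd2 = sc.parallelize(Array((1,4),(2,5),(4,6)))
rdd1.cogroup(rdd2).collect()
// Unlike join, every key appears; keys missing on one side get an empty Iterable, e.g.
// (3,(CompactBuffer(c),CompactBuffer())) and (4,(CompactBuffer(),CompactBuffer(6)))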

aggregate()

fold()

Summary:

  • 1 - combineByKey combines the values that share the same key

  • 2 - aggregateByKey applies the initial value together with the within-partition function seqOp, then merges the per-partition results with the cross-partition function (the initial value is not applied again across partitions)

  • 3 - foldByKey is a simplified aggregateByKey in which the within-partition and cross-partition functions are the same

  • 4 - cogroup groups the values of the same key from multiple RDDs together

  • 5 - the action aggregate applies the initial value with the within-partition function, and the initial value is applied once more when the per-partition results are merged

  • 6 - the action fold is a simplified aggregate in which the within-partition and cross-partition functions are the same (see the sketch after this list)
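A minimal sketch of the two action operators summarized in points 5 and 6, assuming a spark-shell session:

val nums = sc.parallelize(1 to 10, 2)

// aggregate: the zero value is applied once per partition (seqOp) and once more
// when the per-partition results are merged at the driver (combOp).
nums.aggregate(0)(_ + _, _ + _)    // 55
nums.aggregate(10)(_ + _, _ + _)   // 85: the zero value 10 is added 2 + 1 = 3 times

// fold: a simplified aggregate in which seqOp and combOp are the same function.
nums.fold(0)(_ + _)                // 55
nums.fold(10)(_ + _)               // 85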

 
