0.aggregate
val z = sc.parallelize(List(1,2,3,4,5,6), 2)
/* The second argument to parallelize is the number of partitions (slices) the sequence is split into when the RDD is created. Here it is 2. */
def myfunc(index: Int, iter: Iterator[(Int)]) : Iterator[String] = {
iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}
z.mapPartitionsWithIndex(myfunc).collect
res28: Array[String] = Array([partID:0, val: 1], [partID:0, val: 2], [partID:0, val: 3], [partID:1, val: 4], [partID:1, val: 5], [partID:1, val: 6])
z.aggregate(0)(math.max(_, _), _ + _)
res40: Int = 9
// This example returns 16 since the initial value is 5
// reduce of partition 0 will be max(5, 1, 2, 3) = 5
// reduce of partition 1 will be max(5, 4, 5, 6) = 6
// final reduce across partitions will be 5 + 5 + 6 = 16
// note that the final reduce includes the initial value
z.aggregate(5)(math.max(_, _), _ + _)
res29: Int = 16
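The arithmetic above can be reproduced without a cluster. The following is a minimal plain-Scala sketch, not Spark's implementation: the helper `aggregate` and the object name `AggregateSketch` are hypothetical, and the nested `Seq` simply stands in for the two partitions printed earlier.

```scala
object AggregateSketch {
  // Hypothetical helper that models RDD.aggregate on local collections.
  // `partitions` stands in for the partitions of the RDD.
  def aggregate[T, U](partitions: Seq[Seq[T]], zero: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U = {
    // Step 1: fold every partition on its own, starting from the zero value.
    val perPartition = partitions.map(p => p.foldLeft(zero)(seqOp))
    // Step 2: combine the partition results, starting from the zero value once more.
    perPartition.foldLeft(zero)(combOp)
  }

  // Same layout as the mapPartitionsWithIndex output: (1,2,3) and (4,5,6).
  val parts = Seq(Seq(1, 2, 3), Seq(4, 5, 6))

  val withZero0 = aggregate(parts, 0)(math.max(_, _), _ + _) // 0 + 3 + 6
  val withZero5 = aggregate(parts, 5)(math.max(_, _), _ + _) // 5 + 5 + 6
}
```

With a zero value of 5 and two partitions, the zero value takes part in three folds (two partitions plus the final combine), which is why it only behaves neutrally when it really is the identity of both functions.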
val z = sc.parallelize(List("a","b","c","d","e","f"),2)
def myfunc(index: Int, iter: Iterator[(String)]) : Iterator[String] = {
iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}
z.mapPartitionsWithIndex(myfunc).collect
res31: Array[String] = Array([partID:0, val: a], [partID:0, val: b], [partID:0, val: c], [partID:1, val: d], [partID:1, val: e], [partID:1, val: f])
z.aggregate("")(_ + _, _+_)
res115: String = abcdef
// See here how the initial value "x" is applied three times.
// - once for each partition
// - once when combining all the partitions in the second reduce function.
z.aggregate("x")(_ + _, _+_)
res116: String = xxdefxabc
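The result above starts with the second partition ("xxdefxabc" rather than "xxabcxdef") because the order in which partition results reach the combine function is not fixed. A small plain-Scala sketch of that effect, assuming the two partitions shown above; `MergeOrderSketch` and its members are hypothetical names, and no Spark is involved:

```scala
object MergeOrderSketch {
  val zero = "x"

  // Stand-ins for the two partitions: (a,b,c) and (d,e,f).
  val partitions = Seq(Seq("a", "b", "c"), Seq("d", "e", "f"))

  // Per-partition fold: each partition starts from the zero value.
  val partitionResults: Seq[String] =
    partitions.map(p => p.foldLeft(zero)((acc, s) => acc + s))

  // Final combine, tried for every possible arrival order of the partition results.
  val possibleResults: Set[String] =
    partitionResults.permutations
      .map(order => order.foldLeft(zero)((acc, s) => acc + s))
      .toSet
}
```

Both orderings are legitimate outcomes, so a combine function that is not commutative (such as string concatenation) can give different results from run to run.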
val z = sc.parallelize(List("12","23","345","4567"),2)
z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
res141: String = 24
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res142: String = 11
val z = sc.parallelize(List("12","23","345",""),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res143: String = 10
val z = sc.parallelize(List("12","23","","345"),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res144: String = 11
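The difference between the last two results comes from where the empty string sits inside its partition. The sketch below traces the per-partition fold step by step in plain Scala; `SeqOpTrace` and its members are hypothetical names, and the partition contents are assumed to be the two-element halves that `parallelize(..., 2)` produces for these four-element lists.

```scala
object SeqOpTrace {
  // The same function that is passed to aggregate as its first argument.
  val seqOp: (String, String) => String =
    (x, y) => math.min(x.length, y.length).toString

  // scanLeft keeps every intermediate accumulator, starting with the zero value "".
  def trace(partition: Seq[String]): Seq[String] = partition.scanLeft("")(seqOp)

  val first      = trace(Seq("12", "23")) // "" -> "0" -> "1"
  val emptyLast  = trace(Seq("345", ""))  // "" -> "0" -> "0"
  val emptyFirst = trace(Seq("", "345"))  // "" -> "0" -> "1"
}
```

After the first step the accumulator is the one-character string "0", so a later non-empty element yields min(1, n) = 1, while a later empty element yields min(1, 0) = 0. Concatenating the last accumulator of each partition gives "10" in one case and "11" in the other.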
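aggregateByKey aggregates the values that share a key in a pair RDD and, like aggregate, starts from a neutral initial (zero) value. As with aggregate, the result type does not have to match the type of the RDD's values. Because it aggregates per key, aggregateByKey returns another pair RDD whose elements are each key paired with its aggregated value, whereas aggregate returns a plain (non-RDD) result directly; keep this difference in mind. The implementation defines three aggregateByKey overloads, but they all end up calling the same one.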
val pairRDD = sc.parallelize(List( ("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12), ("mouse", 2)), 2)
// let's have a look at what is in the partitions
def myfunc(index: Int, iter: Iterator[(String, Int)]) : Iterator[String] = {
iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}
pairRDD.mapPartitionsWithIndex(myfunc).collect
res2: Array[String] = Array([partID:0, val: (cat,2)], [partID:0, val: (cat,5)], [partID:0, val: (mouse,4)], [partID:1, val: (cat,12)], [partID:1, val: (dog,12)], [partID:1, val: (mouse,2)])
pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect
res3: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))
pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect
res4: Array[(String, Int)] = Array((dog,100), (cat,200), (mouse,200))
scala> pairRDD.aggregateByKey(100)(math.min(_, _), _ + _).collect
res21: Array[(String, Int)] = Array((dog,12), (cat,14), (mouse,6))
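The three results above are consistent with the zero value being used once per key per partition and not again when partition results are merged (dog appears in one partition only and comes out as 100, not 200). A minimal plain-Scala model of that behaviour follows; it is a sketch under that assumption, not Spark's implementation, and `AggregateByKeySketch` and its helper are hypothetical names.

```scala
object AggregateByKeySketch {
  // Hypothetical helper that models aggregateByKey on local collections.
  def aggregateByKey[K, V, U](partitions: Seq[Seq[(K, V)]], zero: U)(
      seqOp: (U, V) => U, combOp: (U, U) => U): Map[K, U] = {
    // Inside each partition: fold the values of every key, starting from `zero`.
    val perPartition: Seq[Map[K, U]] = partitions.map { p =>
      p.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zero), v))
      }
    }
    // Across partitions: merge per-key results with combOp; `zero` is not used here.
    perPartition.flatMap(_.toSeq).foldLeft(Map.empty[K, U]) { case (acc, (k, u)) =>
      acc.updated(k, acc.get(k).map(prev => combOp(prev, u)).getOrElse(u))
    }
  }

  // Same layout as the mapPartitionsWithIndex output above.
  val parts = Seq(
    Seq(("cat", 2), ("cat", 5), ("mouse", 4)),
    Seq(("cat", 12), ("dog", 12), ("mouse", 2)))

  val max0   = aggregateByKey(parts, 0)(math.max(_, _), _ + _)
  val max100 = aggregateByKey(parts, 100)(math.max(_, _), _ + _)
  val min100 = aggregateByKey(parts, 100)(math.min(_, _), _ + _)
}
```

For cat with zero value 100 and max, each partition yields 100, and the merge adds them to 200; with min, the partitions yield 2 and 12, which add up to 14.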
1.cartesian
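/* Computes the Cartesian product of two RDDs: every element of the first is paired with every element of the second, and the pairs are returned as a new RDD. The two RDDs do not need to have the same length (that restriction applies to zip, not cartesian). The result has x.count * y.count elements, so it grows quickly. */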
scala> val x = sc.parallelize(List(1,2,3,4,5))
scala> val y = sc.parallelize(List(1,2,3,4,5))
scala> x.cartesian(y).collect
res25: Array[(Int, Int)] = Array((1,1), (1,2), (1,3), (1,4), (1,5), (2,1), (2,2), (2,3), (2,4), (2,5), (3,1), (3,2), (3,3), (3,4), (3,5), (4,1), (4,2), (4,3), (4,4), (4,5), (5,1), (5,2), (5,3), (5,4), (5,5))
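The pairing itself is the ordinary Cartesian product, so it can be illustrated on local collections. The sketch below uses plain Scala with inputs of different lengths and element types; `CartesianSketch` and its helper are hypothetical names and nothing here touches Spark.

```scala
object CartesianSketch {
  // Every element of xs paired with every element of ys, in xs-major order.
  def cartesian[A, B](xs: Seq[A], ys: Seq[B]): Seq[(A, B)] =
    for (x <- xs; y <- ys) yield (x, y)

  // Three elements on the left, two on the right: 3 * 2 = 6 pairs.
  val pairs = cartesian(Seq(1, 2, 3), Seq("a", "b"))
}
```

The output size is always the product of the two input sizes, which is the main cost to keep in mind before calling cartesian on large RDDs.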
2.checkpoint
Marks the RDD for checkpointing; the checkpoint is created the next time the RDD is computed. Checkpointed RDDs are stored as binary files in the checkpoint directory, which is set through the Spark context. (Warning: Spark evaluates lazily, so nothing is written until an action is invoked.)
Important note: the directory "my_directory_name" must exist on all worker nodes. Alternatively, you can use an HDFS directory URL.
scala> sc.setCheckpointDir("my_directory_name")
scala> val a = sc.parallelize(1 to 4)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[26] at parallelize at <console>:24
scala> a.checkpoint
scala> a.count
res29: Long = 4
3.coalesce
def coalesce(numPartitions: Int, shuffle: Boolean = false): RDD[T]
def repartition(numPartitions: Int): RDD[T]
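/* coalesce redistributes the data of an existing RDD into a new set of partitions.
 * The first argument is the target number of partitions. The boolean argument (shuffle) controls
 * whether the data is shuffled: with shuffle = false the partition count can only be reduced,
 * and splitting the data into more partitions requires shuffle = true (which is what repartition does).
 */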
val y = sc.parallelize(1 to 10, 10)
val z = y.coalesce(2, false)
z.partitions.length
res9: Int = 2
scala> val z = y.coalesce(4, false)
scala> z.partitions.length
res36: Int = 4
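Without a shuffle, coalescing amounts to grouping existing partitions together, which is why the count can go down but not up. The sketch below is a deliberately simplified plain-Scala model of that idea: `coalesceNoShuffle` is a hypothetical helper, it groups neighbouring partitions by index range, and it ignores the data-locality considerations that Spark's real coalescer takes into account.

```scala
object CoalesceSketch {
  // Simplified model of coalesce(numPartitions, shuffle = false).
  def coalesceNoShuffle[T](partitions: Seq[Seq[T]], numPartitions: Int): Seq[Seq[T]] = {
    require(numPartitions > 0, "numPartitions must be positive")
    val len = partitions.length
    if (numPartitions >= len) partitions // cannot create more partitions without a shuffle
    else
      (0 until numPartitions).map { i =>
        // Each new partition takes a contiguous range of parent partitions.
        val start = (i.toLong * len / numPartitions).toInt
        val end   = ((i.toLong + 1) * len / numPartitions).toInt
        partitions.slice(start, end).flatten
      }
  }

  // Ten single-element partitions, like sc.parallelize(1 to 10, 10).
  val y = (1 to 10).map(n => Seq(n))
}
```

In this model, asking for 2 or 4 partitions merges the ten parents into 2 or 4 groups, while asking for 20 leaves the ten partitions unchanged.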
4.collectAsMap [Pair]
scala> val a = sc.parallelize(List(1, 2, 1, 3), 1)
scala> val b = a.zip(a).collectAsMap
b: scala.collection.Map[Int,Int] = Map(2 -> 2, 1 -> 1, 3 -> 3)
scala> val b = a.zip(a).collect
b: Array[(Int, Int)] = Array((1,1), (2,2), (1,1), (3,3))
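In the example above the duplicate key 1 carries the same value both times, so the collapse is easy to miss. The plain-Scala sketch below uses distinct values to make it visible; `CollectAsMapSketch` is a hypothetical name. It relies on the local `toMap`, where a later pair overrides an earlier one with the same key. For collectAsMap itself, only assume that one value per key survives.

```scala
object CollectAsMapSketch {
  // Key 1 appears twice with different values.
  val pairs = Seq((1, "a"), (2, "b"), (1, "c"), (3, "d"))

  // Converting to a Map keeps a single entry per key.
  val asMap: Map[Int, String] = pairs.toMap
}
```

Four pairs go in and three entries come out, so collectAsMap is only a faithful view of the data when keys are unique.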
5.context
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.context
res8: org.apache.spark.SparkContext = org.apache.spark.SparkContext@58c1c2f1
6.countApproxDistinct
def countApproxDistinct(relativeSD: Double = 0.05): Long
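- countApproxDistinct: an RDD method that counts the number of distinct elements in the RDD.
- The count is approximate; the relativeSD parameter (relative standard deviation) controls its accuracy.
- The smaller relativeSD is, the more accurate the result, at the cost of more memory for the underlying counter.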
scala> val a = sc.parallelize(1 to 5000 ,20)
scala> val b = a++a++a++a++a
scala> b.countApproxDistinct(0.01)
res42: Long = 4974
scala> b.countApproxDistinct(0.05)
res43: Long = 5082
scala> b.countApproxDistinct(0.001)
res44: Long = 4999
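The accuracy/memory trade-off can be made concrete by looking at how relativeSD might translate into the size of the underlying HyperLogLog-style counter. The formula in the sketch below is reproduced from memory of Spark's RDD source and should be treated as an assumption to verify against the Spark version in use; `PrecisionSketch` and its members are hypothetical names.

```scala
object PrecisionSketch {
  // Assumed mapping from relativeSD to the precision p (number of index bits),
  // with a lower bound of 4.
  def precisionFor(relativeSD: Double): Int = {
    val p = math.ceil(2.0 * math.log(1.054 / relativeSD) / math.log(2)).toInt
    math.max(p, 4)
  }

  // A counter with precision p uses 2^p registers.
  def registers(relativeSD: Double): Int = 1 << precisionFor(relativeSD)
}
```

Under this assumption, tightening relativeSD from 0.05 to 0.001 raises the register count from 512 to about two million, which is why very small values should be used sparingly.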