Spark RDD API

Article link: https://blog.csdn.net/u012123511/article/details/77646495

0.aggregate

val z = sc.parallelize(List(1,2,3,4,5,6), 2)
// The second argument of parallelize is the number of slices (partitions) the sequence is split into when the RDD is created; here it is 2.
def myfunc(index: Int, iter: Iterator[(Int)]) : Iterator[String] = {
  iter.toList.map(x => "[partID:" +  index + ", val: " + x + "]").iterator
}

z.mapPartitionsWithIndex(myfunc).collect
res28: Array[String] = Array([partID:0, val: 1], [partID:0, val: 2], [partID:0, val: 3], [partID:1, val: 4], [partID:1, val: 5], [partID:1, val: 6])

z.aggregate(0)(math.max(_, _), _ + _)
res40: Int = 9

// This example returns 16 since the initial value is 5
// reduce of partition 0 will be max(5, 1, 2, 3) = 5
// reduce of partition 1 will be max(5, 4, 5, 6) = 6
// final reduce across partitions will be 5 + 5 + 6 = 16
// note that the final reduce also includes the initial value
z.aggregate(5)(math.max(_, _), _ + _)
res29: Int = 16

val z = sc.parallelize(List("a","b","c","d","e","f"),2)
def myfunc(index: Int, iter: Iterator[(String)]) : Iterator[String] = {
  iter.toList.map(x => "[partID:" +  index + ", val: " + x + "]").iterator
}

z.mapPartitionsWithIndex(myfunc).collect
res31: Array[String] = Array([partID:0, val: a], [partID:0, val: b], [partID:0, val: c], [partID:1, val: d], [partID:1, val: e], [partID:1, val: f])

z.aggregate("")(_ + _, _+_)
res115: String = abcdef

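// See here how the initial value "x" is applied three times:
//  - once for each partition
//  - once when combining all the partitions in the second reduce function.
// The partition results are combined in the order the tasks finish, so "abc" and "def" may swap places between runs.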
z.aggregate("x")(_ + _, _+_)
res116: String = xxdefxabc

val z = sc.parallelize(List("12","23","345","4567"),2)
z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
res141: String = 24

z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res142: String = 11

val z = sc.parallelize(List("12","23","345",""),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res143: String = 10

val z = sc.parallelize(List("12","23","","345"),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res144: String = 11
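
The results above follow from how aggregate works: the zero value seeds the fold inside every partition and is used once more when the partition results are combined. The following is a minimal sketch in plain Scala collections (no Spark required), assuming the partition layout printed by mapPartitionsWithIndex. AggregateModel and its aggregate helper are hypothetical names used for illustration only; they are not part of the Spark API.

```scala
object AggregateModel {
  // Hypothetical stand-in for RDD.aggregate: each inner Seq plays the role of one partition.
  def aggregate[T, U](partitions: Seq[Seq[T]], zero: U)(
      seqOp: (U, T) => U,
      combOp: (U, U) => U): U = {
    // Step 1: fold every partition on its own, starting from the zero value.
    val perPartition = partitions.map(_.foldLeft(zero)(seqOp))
    // Step 2: combine the partition results, again starting from the zero value.
    perPartition.foldLeft(zero)(combOp)
  }

  def main(args: Array[String]): Unit = {
    val ints = Seq(Seq(1, 2, 3), Seq(4, 5, 6))
    println(aggregate(ints, 0)(math.max(_, _), _ + _)) // 0 + 3 + 6 = 9
    println(aggregate(ints, 5)(math.max(_, _), _ + _)) // 5 + 5 + 6 = 16

    // The accumulator is a String holding a digit, so from the second element on
    // x.length is the length of that digit string ("0" has length 1), not of the data.
    val minLen = (x: String, y: String) => math.min(x.length, y.length).toString
    val concat = (x: String, y: String) => x + y
    println(aggregate(Seq(Seq("12", "23"), Seq("345", "")), "")(minLen, concat)) // 10
    println(aggregate(Seq(Seq("12", "23"), Seq("", "345")), "")(minLen, concat)) // 11
  }
}
```

This model always combines partitions in index order, whereas Spark combines them in the order the tasks finish. That is why the model cannot reproduce an order-dependent output such as xxdefxabc, and why the string examples are a reminder that the two functions passed to aggregate should be associative and commutative.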
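
The aggregateByKey function aggregates the values that share the same key in a pair RDD, and it also uses a neutral initial value during the aggregation. As with aggregate, the result type of aggregateByKey does not have to match the type of the values in the RDD. Because aggregateByKey works per key, it still returns a pair RDD whose elements are the keys and their aggregated values, whereas aggregate returns a plain (non-RDD) result; keep this difference in mind. The implementation defines three aggregateByKey overloads, but they all end up calling the same aggregateByKey function.
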
val pairRDD = sc.parallelize(List( ("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12), ("mouse", 2)), 2)

// let's have a look at what is in the partitions
def myfunc(index: Int, iter: Iterator[(String, Int)]) : Iterator[String] = {
  iter.toList.map(x => "[partID:" +  index + ", val: " + x + "]").iterator
}
pairRDD.mapPartitionsWithIndex(myfunc).collect

res2: Array[String] = Array([partID:0, val: (cat,2)], [partID:0, val: (cat,5)], [partID:0, val: (mouse,4)], [partID:1, val: (cat,12)], [partID:1, val: (dog,12)], [partID:1, val: (mouse,2)])

pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect
res3: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))

pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect
res4: Array[(String, Int)] = Array((dog,100), (cat,200), (mouse,200))

scala> pairRDD.aggregateByKey(100)(math.min(_, _), _ + _).collect
res21: Array[(String, Int)] = Array((dog,12), (cat,14), (mouse,6))
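
The three results above can be traced by hand with a small model. This is a sketch in plain Scala collections, assuming the partition layout printed by mapPartitionsWithIndex; AggregateByKeyModel and its aggregateByKey helper are hypothetical names for illustration, not the Spark API. It shows that the zero value is applied once per key in each partition where that key occurs, and is not applied again when the partitions are merged.

```scala
object AggregateByKeyModel {
  // Hypothetical stand-in for aggregateByKey: each inner Seq plays the role of one partition.
  def aggregateByKey[K, V, U](partitions: Seq[Seq[(K, V)]], zero: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): Map[K, U] = {
    // Within a partition, the zero value seeds the fold for every key seen there.
    val perPartition: Seq[Map[K, U]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zero), v))
      }
    }
    // Across partitions, per-key results are merged with combOp only (no zero value).
    perPartition.flatMap(_.toSeq).foldLeft(Map.empty[K, U]) { case (acc, (k, u)) =>
      acc.updated(k, acc.get(k).map(prev => combOp(prev, u)).getOrElse(u))
    }
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(
      Seq(("cat", 2), ("cat", 5), ("mouse", 4)),
      Seq(("cat", 12), ("dog", 12), ("mouse", 2)))
    println(aggregateByKey(parts, 0)(math.max(_, _), _ + _))   // cat 17, mouse 6, dog 12
    println(aggregateByKey(parts, 100)(math.max(_, _), _ + _)) // cat 200, mouse 200, dog 100
    println(aggregateByKey(parts, 100)(math.min(_, _), _ + _)) // cat 14, mouse 6, dog 12
  }
}
```

The key dog occurs in only one partition, so with a zero value of 100 and math.max its result is 100 rather than 200.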

1.cartesian

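/* cartesian computes the Cartesian product of two datasets: every element of the first is paired with every element of the second, and the pairs are returned as a new dataset. The two datasets do not need to have the same length. */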

scala> val x = sc.parallelize(List(1,2,3,4,5))
scala> val y = sc.parallelize(List(1,2,3,4,5))
scala> x.cartesian(y).collect
res25: Array[(Int, Int)] = Array((1,1), (1,2), (1,3), (1,4), (1,5), (2,1), (2,2), (2,3), (2,4), (2,5), (3,1), (3,2), (3,3), (3,4), (3,5), (4,1), (4,2), (4,3), (4,4), (4,5), (5,1), (5,2), (5,3), (5,4), (5,5))
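
The pairing rule is easy to check on local collections. The sketch below uses plain Scala rather than Spark, and the CartesianModel object and its cartesian helper are hypothetical names for illustration. It uses inputs of different sizes and element types, which RDD.cartesian also accepts, so the result has 3 * 2 = 6 pairs.

```scala
object CartesianModel {
  // Hypothetical helper mirroring what cartesian computes, on local collections.
  def cartesian[A, B](xs: Seq[A], ys: Seq[B]): Seq[(A, B)] =
    for (x <- xs; y <- ys) yield (x, y)

  def main(args: Array[String]): Unit = {
    val pairs = cartesian(Seq(1, 2, 3), Seq("a", "b"))
    println(pairs)      // List((1,a), (1,b), (2,a), (2,b), (3,a), (3,b))
    println(pairs.size) // 6
  }
}
```

Because the result size is the product of the two input sizes, cartesian on large RDDs grows quickly and should be used with care.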

2.checkpoint
Will create a checkpoint when the RDD is computed next. Checkpointed RDDs are stored as a binary file within the checkpoint directory which can be specified using the Spark context. (Warning: Spark applies lazy evaluation. Checkpointing will not occur until an action is invoked.)

Important note: the directory “my_directory_name” must exist on all worker nodes. Alternatively, you can use an HDFS directory URL.

scala> sc.setCheckpointDir("my_directory_name")

scala> val a = sc.parallelize(1 to 4)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[26] at parallelize at <console>:24

scala> a.checkpoint

scala> a.count
res29: Long = 4
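
The lazy behaviour described in the warning can be observed with isCheckpointed. The following is a minimal sketch, assuming Spark is on the classpath and runs in local mode, so a local temporary directory is enough as the checkpoint directory. The CheckpointSketch object, its run method, and the application name are hypothetical names for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  // Returns (isCheckpointed before the action, count, isCheckpointed after the action).
  def run(sc: SparkContext): (Boolean, Long, Boolean) = {
    // Assumption: local mode, so a temporary local directory is visible to every task.
    sc.setCheckpointDir(Files.createTempDirectory("spark-checkpoint").toString)
    val a = sc.parallelize(1 to 4)
    a.checkpoint()                  // only marks the RDD for checkpointing
    val before = a.isCheckpointed   // no action has run yet
    val n = a.count()               // the action runs the job, then the checkpoint is written
    val after = a.isCheckpointed
    (before, n, after)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("checkpoint-sketch"))
    try println(run(sc)) // (false,4,true)
    finally sc.stop()
  }
}
```

On a real cluster the temporary directory would be replaced by a location every node can reach, such as an HDFS path.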

3.coalesce

def coalesce(numPartitions: Int, shuffle: Boolean = false): RDD[T]
def repartition(numPartitions: Int): RDD[T]

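/*  coalesce redistributes the data of an existing RDD into a new number of partitions.
    * The first parameter is the target number of partitions. The boolean parameter (shuffle)
    * must be set to true when splitting the data into more, smaller partitions.
    */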
val y = sc.parallelize(1 to 10, 10)
val z = y.coalesce(2, false)
z.partitions.length
res9: Int = 2

scala> val z = y.coalesce(4, false)
scala> z.partitions.length
res36: Int = 4
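
Both calls above reduce the number of partitions, which works without a shuffle. The sketch below, assuming Spark is on the classpath and runs in local mode, also tries to increase the partition count to show where the shuffle flag matters. The CoalesceSketch object, its partitionCounts method, and the application name are hypothetical names for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceSketch {
  // Partition counts after four ways of resizing an RDD that starts with 10 partitions.
  def partitionCounts(sc: SparkContext): Seq[Int] = {
    val y = sc.parallelize(1 to 10, 10)
    Seq(
      y.coalesce(2).partitions.length,                  // shrinking needs no shuffle
      y.coalesce(20).partitions.length,                 // cannot grow without a shuffle
      y.coalesce(20, shuffle = true).partitions.length, // growing with a shuffle
      y.repartition(20).partitions.length               // shorthand for the previous line
    )
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("coalesce-sketch"))
    try println(partitionCounts(sc)) // List(2, 10, 20, 20)
    finally sc.stop()
  }
}
```

Without a shuffle, a request for more partitions leaves the RDD at its current partition count, which is why the second value stays at 10.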

4.collectAsMap [Pair]

scala> val a = sc.parallelize(List(1, 2, 1, 3), 1)
scala> val b = a.zip(a).collectAsMap
b: scala.collection.Map[Int,Int] = Map(2 -> 2, 1 -> 1, 3 -> 3)
scala> val b = a.zip(a).collect
b: Array[(Int, Int)] = Array((1,1), (2,2), (1,1), (3,3))

5.context

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.context
res8: org.apache.spark.SparkContext = org.apache.spark.SparkContext@58c1c2f1

6.countApproxDistinct

def countApproxDistinct(relativeSD: Double = 0.05): Long

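countApproxDistinct: an RDD method that counts the distinct elements in the RDD.
The count is approximate; the relativeSD parameter controls its accuracy.
The smaller relativeSD is, the more accurate the result.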

scala> val  a = sc.parallelize(1 to 5000 ,20)
scala> val b = a++a++a++a++a
scala> b.countApproxDistinct(0.01)
res42: Long = 4974
scala> b.countApproxDistinct(0.05)
res43: Long = 5082
scala> b.countApproxDistinct(0.001)
res44: Long = 4999
