Spark RDD

The RDD Data Abstraction

Definition: an RDD is a read-only, partitioned collection of records.
Features: support for working sets (reuse of intermediate results), automatic fault tolerance, location awareness, and scalability.
An RDD has five properties (illustrated by the sketch after this list):
1. A set of partitions: each partition is logically mapped to a block and is processed by one task. By default, the number of partitions equals the number of CPU cores allocated to the application.
2. A function for computing each partition: every RDD implements a compute function.
3. Dependencies on other RDDs: transformations chain RDDs into parent-child dependencies, and when some partitions are lost, Spark recomputes them by following these dependencies.
4. A partitioning function (for key-value RDDs only): currently HashPartitioner and RangePartitioner. The Partitioner determines the number of partitions of the RDD itself, as well as the number of partitions of the parent RDD's shuffle output.
5. A list storing the preferred location(s) of each Partition.
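
Most of these properties can be inspected through the public RDD API. A minimal sketch for spark-shell (the data and partition count are arbitrary):

// Build a shuffled pair RDD and look at the five properties.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
val reduced = pairs.reduceByKey(_ + _)

reduced.partitions.length                          // 1. the set of partitions
reduced.dependencies                               // 3. a ShuffleDependency on pairs
reduced.partitioner                                // 4. Some(HashPartitioner(2))
reduced.preferredLocations(reduced.partitions(0))  // 5. locality hints (often empty locally)
// 2. compute exists on every RDD but takes a TaskContext, so it is not called directly here.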

RDD Dependencies

Dependencies between an RDD and its parent RDDs fall into two types: narrow and wide. In a narrow dependency, each partition of the parent RDD is used by at most one partition of the child RDD; in a wide dependency, each partition of the parent RDD is depended on by multiple partitions of the child RDD.
Narrow dependencies (NarrowDependency) have two concrete implementations: OneToOneDependency and RangeDependency, the latter used only by UnionRDD. Wide dependencies have a single concrete implementation, ShuffleDependency, which supports two shuffle managers (HashShuffleManager and SortShuffleManager).
The initial RDDs are transformed step by step into a DAG; this chain of dependencies is also called the lineage. Spark splits the DAG into stages at wide dependencies. Within a stage, each Partition is assigned one computation Task, and the tasks of all partitions run in parallel. Across stages, a child stage can start only after its parent stages have finished. The last stage of the DAG generates one ResultTask per result partition, while every other stage generates ShuffleMapTasks; the generated tasks are submitted to executors for execution.
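
The lineage, including the stage boundary introduced by a wide dependency, can be printed with toDebugString. A minimal sketch (the exact RDD names and numbers in the output vary by Spark version):

val counts = sc.parallelize(Seq("a b", "b c"), 2)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

// Each indentation level in the output is a stage; the shuffle at
// reduceByKey is where the DAG is split.
println(counts.toDebugString)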

RDD Checkpoints

An RDD checkpoint (checkpoint()) exists to avoid recomputation when cached (cache()) data is lost. The difference: caching writes the computed result to a medium at the specified storage level, whereas checkpointing launches a new job after the computation finishes and writes the RDD to reliable storage, using the RDD cache (if present) to complete that job quickly.
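
A minimal checkpointing sketch (the checkpoint directory is an assumed path; on a cluster it would normally be on HDFS):

sc.setCheckpointDir("/tmp/spark-checkpoints")   // assumed path

val rdd = sc.parallelize(1 to 1000, 4).map(x => x * x)
rdd.cache()        // so the checkpoint job reads cached data instead of recomputing
rdd.checkpoint()   // only marks the RDD; the write happens on the next action
rdd.count()        // runs the job, then a second job writes the checkpoint
println(rdd.toDebugString)   // the lineage is now truncated at the checkpoint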

RDD Examples

map Example

API: map[U](f: (T) ⇒ U): RDD[U]

// Basic map example in scala
scala> val x = sc.parallelize(List("spark", "rdd", "example",  "sample", "example"), 3)
scala> val y = x.map(x => (x, 1))
scala> y.collect
res0: Array[(String, Int)] = Array((spark,1), (rdd,1), (example,1), (sample,1), (example,1))

// rdd y can be rewritten with shorter syntax in scala as
scala> val y = x.map((_, 1))
scala> y.collect
res0: Array[(String, Int)] = Array((spark,1), (rdd,1), (example,1), (sample,1), (example,1))

// Another example, making a tuple of each string and its length
scala> val y = x.map(x => (x, x.length))
scala> y.collect
res0: Array[(String, Int)] = Array((spark,5), (rdd,3), (example,7), (sample,6), (example,7))

flatMap Example

API: flatMap[U](f: (T) ⇒ TraversableOnce[U]): RDD[U]

scala> val x = sc.parallelize(List("spark rdd example",  "sample example"), 2)

// map will return an Array of Arrays in the following case: check the type of res0
scala> val y = x.map(x => x.split(" ")) // split(" ") returns an array of words
scala> y.collect
res0: Array[Array[String]] = Array(Array(spark, rdd, example), Array(sample, example))

// flatMap will return an Array of words in the following case: check the type of res1
scala> val y = x.flatMap(x => x.split(" "))
scala> y.collect
res1: Array[String] = Array(spark, rdd, example, sample, example)

// rdd y can be rewritten with shorter syntax in scala as
scala> val y = x.flatMap(_.split(" "))
scala> y.collect
res2: Array[String] = Array(spark, rdd, example, sample, example)

Because flatMap is a one-to-many mapping, the resulting RDD can grow considerably, so a downstream reduceByKey may become more expensive and slower. The counts below illustrate the growth.
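
A quick way to see the expansion, reusing the rdd x from the flatMap example above:

x.count                           // 2 lines before flatMap
x.flatMap(_.split(" ")).count     // 5 words after flatMap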

filter Example

API: filter(f: (T) ⇒ Boolean): RDD[T]

scala> val x = sc.parallelize(1 to 10, 2)

// filter operation 
scala> val y = x.filter(e => e%2==0) 
scala> y.collect
res0: Array[Int] = Array(2, 4, 6, 8, 10)

// rdd y can be rewritten with shorter syntax in scala as
scala> val y = x.filter(_ % 2 == 0)
scala> y.collect
res1: Array[Int] = Array(2, 4, 6, 8, 10)

mapPartitions Example

API: mapPartitions[U](f: (Iterator[T]) ⇒ Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]

scala> val parallel = sc.parallelize(1 to 9, 3)

scala> parallel.mapPartitions( x => List(x.next).iterator).collect
res0: Array[Int] = Array(1, 4, 7)

// the same, but with the default number of partitions (8 in this shell, one per core)
scala> val parallel = sc.parallelize(1 to 9)

scala> parallel.mapPartitions( x => List(x.next).iterator).collect
res1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8)

mapPartitions receives an Iterator over all elements of a partition and returns an Iterator over the transformed results. It is mainly used to compute a partial result per partition, after which the partial results are combined into the final answer. A while loop is the usual way to drive the iteration:

parallelRDD.mapPartitions { iter =>
    var result = List[Int]()      // accumulate the partition's partial result
    while (iter.hasNext) {
        val value = iter.next()
        // ... update result using value ...
    }
    result.iterator               // the body must return an Iterator
}

Also, do not store a partition's data in an array inside mapPartitions: that holds the entire partition in memory at once. A concrete instance of the pattern follows.
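
A concrete version of the sketch above, computing one partial sum per partition and then combining the partial results (the data mirrors the earlier example):

val nums = sc.parallelize(1 to 9, 3)
val partialSums = nums.mapPartitions { iter =>
  var sum = 0
  while (iter.hasNext) { sum += iter.next() }
  Iterator(sum)              // one partial result per partition
}
partialSums.collect          // Array(6, 15, 24)
partialSums.reduce(_ + _)    // 45, the combined final result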

mapPartitionsWithIndex Example

API: mapPartitionsWithIndex[U](f: (Int, Iterator[T]) ⇒ Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

The index argument identifies the partition being processed.

scala> val parallel = sc.parallelize(1 to 9, 3)

scala> parallel.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => it.toList.map(x => index + ", "+x).iterator).collect
res0: Array[String] = Array(0, 1, 0, 2, 0, 3, 1, 4, 1, 5, 1, 6, 2, 7, 2, 8, 2, 9)
// element distribution across partitions, and the order in which they are visited
scala> parallel.mapPartitionsWithIndex { (x, iter) =>
     |   var result = List[String]()
     |   while (iter.hasNext) {
     |     result ::= ("part_" + x + "|" + iter.next())
     |   }
     |   result.iterator
     | }.collect
res1: Array[String] = Array(part_0|3, part_0|2, part_0|1, part_1|6, part_1|5, part_1|4, part_2|9, part_2|8, part_2|7)

zipWithIndex Example

scala> var rdd = sc.makeRDD(Seq("A","B","R","D","F"),3)
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[9] at makeRDD at <console>:27
scala> rdd.zipWithIndex.collect
res7: Array[(String, Long)] = Array((A,0), (B,1), (R,2), (D,3), (F,4))

zip & zipPartitions Example

API: def zipPartitions[B, V](rdd2: RDD[B])(f: (Iterator[T], Iterator[B]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[V]): RDD[V]

scala> val rdd1 = sc.makeRDD(1 to 5,2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at <console>:27

scala> val rdd2 = sc.makeRDD(Seq("A","B","C","D","E"),2)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at makeRDD at <console>:27

scala> rdd1.zip(rdd2).collect
res0: Array[(Int, String)] = Array((1,A), (2,B), (3,C), (4,D), (5,E))

scala> rdd1.zipPartitions(rdd2){
     |     (rdd1Iter,rdd2Iter) => {
     |       var result = List[String]()
     |       while(rdd1Iter.hasNext && rdd2Iter.hasNext) {
     |         result::=(rdd1Iter.next() + "_" + rdd2Iter.next())
     |       }
     |       result.iterator
     |     }
     |   }.collect
res3: Array[String] = Array(2_B, 1_A, 5_E, 4_D, 3_C)

Note that result ::= prepends each element, so the pairs within a partition come out in reverse order.

groupBy Example

// Basic groupBy example in scala
scala> val x = sc.parallelize(Array("Joseph", "Jimmy", "Tina",
     | "Thomas", "James", "Cory",
     | "Christine", "Jackeline", "Juan"), 3)
x: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[16] at parallelize at <console>:21

// group words by their first character
scala> val y = x.groupBy(word => word.charAt(0))
y: org.apache.spark.rdd.RDD[(Char, Iterable[String])] = ShuffledRDD[18] at groupBy at <console>:23

scala> y.collect
res0: Array[(Char, Iterable[String])] = Array((T,CompactBuffer(Tina, Thomas)), (C,CompactBuffer(Cory, Christine)), (J,CompactBuffer(Joseph, Jimmy, James, Jackeline, Juan)))

// Another short syntax
scala> val y = x.groupBy(_.charAt(0))
y: org.apache.spark.rdd.RDD[(Char, Iterable[String])] = ShuffledRDD[3] at groupBy at <console>:23

scala> y.collect
res1: Array[(Char, Iterable[String])] = Array((T,CompactBuffer(Tina, Thomas)), (C,CompactBuffer(Cory, Christine)), (J,CompactBuffer(Joseph, Jimmy, James, Jackeline, Juan)))

reduceByKey Example

Merges the values under each key.

// Basic reduceByKey example in scala
// Creating PairRDD x with key value pairs
scala> val x = sc.parallelize(Array(("a", 1), ("b", 1), ("a", 1),
     | ("a", 1), ("b", 1), ("b", 1),
     | ("b", 1), ("b", 1)), 3)
x: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[1] at parallelize at <console>:21

// Applying reduceByKey operation on x
scala> val y = x.reduceByKey((accum, n) => (accum + n))
y: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[2] at reduceByKey at <console>:23

scala> y.collect
res0: Array[(String, Int)] = Array((a,3), (b,5))

// Another way of applying associative function
scala> val y = x.reduceByKey(_ + _)
y: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[3] at reduceByKey at <console>:23

scala> y.collect
res1: Array[(String, Int)] = Array((a,3), (b,5))

// Define associative function separately
scala> def sumFunc(accum:Int, n:Int) =  accum + n
sumFunc: (accum: Int, n: Int)Int

scala> val y = x.reduceByKey(sumFunc)
y: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:25

scala> y.collect
res2: Array[(String, Int)] = Array((a,3), (b,5))

cartesian Example

API: def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]

scala> val x = sc.parallelize(List(1, 2, 3, 4, 5))
scala> val y = sc.parallelize(List(6, 7, 8, 9, 10))
scala> x.cartesian(y).collect
res0: Array[(Int, Int)] = Array((1,6), (1,7), (1,8), (1,9), (1,10), (2,6), (2,7), (2,8), (2,9), (2,10), (3,6), (3,7), (3,8), (3,9), (3,10), (4,6), (5,6), (4,7), (5,7), (4,8), (5,8), (4,9), (4,10), (5,9), (5,10))

join Example

scala> val names1 = sc.parallelize(List("abe", "abby", "apple")).map(a => (a, 1))

scala> val names2 = sc.parallelize(List("apple", "beatty", "beatrice")).map(a => (a, 1))

scala> names1.join(names2).collect
res0: Array[(String, (Int, Int))] = Array((apple,(1,1)))

scala> names1.leftOuterJoin(names2).collect
res1: Array[(String, (Int, Option[Int]))] = Array((abby,(1,None)), (apple,(1,Some(1))), (abe,(1,None)))

scala> names1.rightOuterJoin(names2).collect
res2: Array[(String, (Option[Int], Int))] = Array((apple,(Some(1),1)), (beatty,(None,1)), (beatrice,(None,1)))

reduce Example

API: reduce(f: (T, T) ⇒ T): T

Merges the elements of the RDD using the given function.

// reduce numbers 1 to 10 by adding them up
scala> val x = sc.parallelize(1 to 10, 2)
scala> val y = x.reduce((accum,n) => (accum + n)) 
y: Int = 55

// shorter syntax
scala> val y = x.reduce(_ + _) 
y: Int = 55

// same thing for multiplication
scala> val y = x.reduce(_ * _) 
y: Int = 3628800

count Example

scala> val names2 = sc.parallelize(List("apple", "beatty", "beatrice"))
scala> names2.count
res0: Long = 3

take Example

scala> val names2 = sc.parallelize(List("apple", "beatty", "beatrice"))

scala> names2.take(2)
res0: Array[String] = Array(apple, beatty)

countByKey Example

scala> val hockeyTeams = sc.parallelize(List("wild", "blackhawks", "red wings", "wild", "oilers", "whalers", "jets", "wild"))

scala> hockeyTeams.map(k => (k,1)).countByKey
res0: scala.collection.Map[String,Long] = Map(jets -> 1, blackhawks -> 1, red wings -> 1, oilers -> 1, whalers -> 1, wild -> 3)

aggregate Example

API: def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U

scala> val z = sc.parallelize(List(1, 2, 3, 4, 5, 6), 2)
// seqOp takes the max within each partition (3 and 6); combOp adds them: 3 + 6 = 9
scala> z.aggregate(0)(math.max(_, _), _ + _)
res0: Int = 9

scala> val z = sc.parallelize(List("a", "b", "c", "d", "e", "f"), 2)
// zeroValue "x" is applied once per partition and once more in the final combine:
// "x" + ("x" + "abc") + ("x" + "def")
scala> z.aggregate("x")(_ + _, _ + _)
res1: String = xxabcxdef

scala> val z = sc.parallelize(List("12", "23", "345", ""), 2)
// partition 1: min(0,2) = "0", then min(1,2) = "1"; partition 2: min(0,3) = "0", then min(1,0) = "0";
// combOp concatenates the two partial results in nondeterministic order ("01" or "10")
scala> z.aggregate("")((x, y) => math.min(x.length, y.length).toString, (x, y) => x + y)
res3: String = 01

See 过往记忆's detailed explanation of aggregate for a fuller walkthrough.

treeAggregate Example

API: def treeAggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U, depth: Int = 2)(implicit arg0: ClassTag[U]): U

scala> val z = sc.parallelize(List(1,2,3,4,5,6), 2)
z: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at parallelize at <console>:27

scala> def myfunc(index: Int, iter: Iterator[(Int)]) : Iterator[String] = {
     |   iter.toList.map(x => "[partID:" +  index + ", val: " + x + "]").iterator
     | }
myfunc: (index: Int, iter: Iterator[Int])Iterator[String]

scala> z.mapPartitionsWithIndex(myfunc).collect
res11: Array[String] = Array([partID:0, val: 1], [partID:0, val: 2], [partID:0, val: 3], [partID:1, val: 4], [partID:1, val: 5], [partID:1, val: 6])

scala> z.treeAggregate(0)(math.max(_, _), _ + _)
res12: Int = 9

scala> z.treeAggregate(5)(math.max(_, _), _ + _)
res13: Int = 11
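
With zeroValue 5, the per-partition maxima become 5 and 6. Unlike aggregate, treeAggregate does not apply the zero value again in the final combine, hence 5 + 6 = 11 (aggregate(5) with the same functions would return 16).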

fold Example

API: def fold(zeroValue: T)(op: (T, T) => T): T

// like reduce, but with a zero value applied in every partition and in the final combine
scala> val a = sc.parallelize(List(1, 2, 3), 3)
scala> a.fold(0)(_ + _)
res0: Int = 6

References

The RDD API By Example
