1. Basic RDD operations
val rdd1 = sc.parallelize(List(1, 2, 3, 4, 4))
Output: rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> rdd1.collect
res0: Array[Int] = Array(1, 2, 3, 4, 4)
scala> val rdd2 = sc.parallelize(List("apple", "orange", "banana", "Grape"))
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> rdd2.collect
res1: Array[String] = Array(apple, orange, banana, Grape)
scala> rdd1.map(_ + 1).collect
res2: Array[Int] = Array(2, 3, 4, 5, 5)
scala> rdd2.map(x => "fruit:" + x).collect
res5: Array[String] = Array(fruit:apple, fruit:orange, fruit:banana, fruit:Grape)
scala> rdd1.filter(_ < 3).collect
res6: Array[Int] = Array(1, 2)
scala> rdd1.filter(x => x < 3).collect
scala> rdd2.filter(x => x.contains("ra")).collect
res8: Array[String] = Array(orange, Grape)
scala> rdd1.distinct.collect
res9: Array[Int] = Array(4, 2, 1, 3)
scala> val sRDD = rdd1.randomSplit(Array(0.4, 0.6))
sRDD: Array[org.apache.spark.rdd.RDD[Int]] = Array(MapPartitionsRDD[12] at randomSplit at <console>:25, MapPartitionsRDD[13] at randomSplit at <console>:25)
scala> sRDD(0).collect
res3: Array[Int] = Array(1, 2, 4, 4)
scala> sRDD(1).collect
res4: Array[Int] = Array(3)
scala> val gRDD = rdd1.groupBy(x => if (x % 2 == 0) "even" else "odd").collect
gRDD: Array[(String, Iterable[Int])] = Array((even,CompactBuffer(2, 4, 4)), (odd,CompactBuffer(3, 1)))
scala> gRDD(0)
res5: (String, Iterable[Int]) = (even,CompactBuffer(2, 4, 4))
scala> gRDD(1)
res6: (String, Iterable[Int]) = (odd,CompactBuffer(3, 1))
2. Transformations across multiple RDDs; RDDs support operations that combine several RDDs
scala> val rdd1 = sc.parallelize(List(3, 1, 2, 5, 5))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(List(5, 6))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:24
scala> val rdd3 = sc.parallelize(List(2, 7))
rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:24
scala> rdd1.union(rdd2).union(rdd3).collect
res7: Array[Int] = Array(3, 1, 2, 5, 5, 5, 6, 2, 7)
scala> (rdd1 ++ rdd2 ++ rdd3).collect
res8: Array[Int] = Array(3, 1, 2, 5, 5, 5, 6, 2, 7)
scala> rdd1.intersection(rdd2).collect
res9: Array[Int] = Array(5)
scala> rdd1.subtract(rdd2).collect
res10: Array[Int] = Array(2, 1, 3)
scala> rdd1.cartesian(rdd2).collect
res11: Array[(Int, Int)] = Array((3,5), (1,5), (3,6), (1,6), (2,5), (5,5), (5,5), (2,6), (5,6), (5,6))
3. Basic "action" operations (these are actions, so the result is computed immediately)
scala> rdd1.first
res12: Int = 3
scala> rdd1.take(2)
res13: Array[Int] = Array(3, 1)
scala> rdd1.takeOrdered(4)(Ordering[Int].reverse)
res15: Array[Int] = Array(5, 5, 3, 2)
scala> rdd1.stats
res16: org.apache.spark.util.StatCounter = (count: 5, mean: 3.200000, stdev: 1.600000, max: 5.000000, min: 1.000000)
scala> rdd1.min
res17: Int = 1
scala> rdd1.max
res18: Int = 5
scala> rdd1.stdev
res19: Double = 1.6
scala> rdd1.count
res20: Long = 5
scala> rdd1.sum
res21: Double = 16.0
scala> rdd1.mean
res22: Double = 3.2
4. Basic RDD key-value "transformation" operations (the foundation of map-reduce)
scala> val kv = sc.parallelize(List((3,4), (3,6), (5,6), (1,2)))
kv: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[34] at parallelize at <console>:24
scala> kv.collect
res23: Array[(Int, Int)] = Array((3,4), (3,6), (5,6), (1,2))
scala> kv.keys.collect
res24: Array[Int] = Array(3, 3, 5, 1)
scala> kv.values.collect
res25: Array[Int] = Array(4, 6, 6, 2)
scala> kv.filter{ case (key, value) => key < 5 }.collect
res28: Array[(Int, Int)] = Array((3,4), (3,6), (1,2))
scala> kv.mapValues(x => x * x).collect
res29: Array[(Int, Int)] = Array((3,16), (3,36), (5,36), (1,4))
scala> kv.sortByKey(true).collect
res30: Array[(Int, Int)] = Array((1,2), (3,4), (3,6), (5,6))
scala> kv.sortByKey(false).collect
res32: Array[(Int, Int)] = Array((5,6), (3,4), (3,6), (1,2))
1). For example, in Array((1,2), (3,4), (3,6), (5,6)), the first element of each pair is the key and the second is the value
2). reduceByKey merges pairs that share the same key; here key 3 appears in (3,4) and (3,6)
3). After merging, the result is (3, 4+6)
4). The remaining pairs (1,2) and (5,6) stay unchanged because no other pair shares their key
kv.reduceByKey((x, y) => x + y).collect
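A sketch of the expected result, in the same REPL style (the res number and the ordering of the pairs may vary from run to run):
res33: Array[(Int, Int)] = Array((1,2), (3,10), (5,6))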
5. Key-value "transformation" operations across multiple RDDs
val rdd1 = sc.parallelize(List((3,4), (3,6), (5,6), (1,2)))
val rdd2 = sc.parallelize(List((3,8), (6,8)))
scala> rdd1.join(rdd2)
res4: org.apache.spark.rdd.RDD[(Int, (Int, Int))] = MapPartitionsRDD[8] at join at <console>:28
scala> rdd1.join(rdd2).collect
res5: Array[(Int, (Int, Int))] = Array((3,(4,8)), (3,(6,8)))
scala> rdd1.join(rdd2).collect.foreach(println)
(3,(4,8))
(3,(6,8))
1). leftOuterJoin joins from the left collection (rdd1) to the right collection (rdd2) and keeps every element of the left collection (rdd1)
2). Where an rdd1 key has a match in rdd2, the matching pairs appear, e.g. (3,(4,Some(8))) and (3,(6,Some(8)))
3). Where an rdd1 key has no match in rdd2, the rdd2 side is None, e.g. (5,(6,None)) and (1,(2,None))
scala> rdd1.leftOuterJoin(rdd2).collect
res7: Array[(Int, (Int, Option[Int]))] = Array((1,(2,None)), (3,(4,Some(8))), (3,(6,Some(8))), (5,(6,None)))
scala> rdd1.leftOuterJoin(rdd2).collect.foreach(println)
(1,(2,None))
(3,(4,Some(8)))
(3,(6,Some(8)))
(5,(6,None))
1). rightOuterJoin joins from the right collection (rdd2) to the left collection (rdd1) and keeps every element of the right collection (rdd2)
2). Where an rdd2 key has a match in rdd1, the matching pairs appear, e.g. (3,(Some(4),8)) and (3,(Some(6),8))
3). Where an rdd2 key has no match in rdd1, the rdd1 side is None, e.g. (6,(None,8))
scala> rdd1.rightOuterJoin(rdd2).collect
res9: Array[(Int, (Option[Int], Int))] = Array((6,(None,8)), (3,(Some(4),8)), (3,(Some(6),8)))
scala> rdd1.rightOuterJoin(rdd2).collect.foreach(println)
(6,(None,8))
(3,(Some(4),8))
(3,(Some(6),8))
scala> rdd1.subtractByKey(rdd2).collect.foreach(println)
(1,2)
(5,6)
6. Key-value RDD "action" operations
scala> rdd1.first
res4: (Int, Int) = (3,4)
scala> rdd1.take(2)
res5: Array[(Int, Int)] = Array((3,4), (3,6))
scala> rdd1.first._1
res6: Int = 3
scala> rdd1.first._2
res7: Int = 4
scala> rdd1.countByKey
res8: scala.collection.Map[Int,Long] = Map(1 -> 1, 3 -> 2, 5 -> 1)
scala> rdd1.collectAsMap
res9: scala.collection.Map[Int,Int] = Map(5 -> 6, 1 -> 2, 3 -> 6)
scala> res9(5)
res10: Int = 6
scala> rdd1.lookup(3)
res12: Seq[Int] = WrappedArray(4, 6)
scala> rdd1.lookup(5)
res13: Seq[Int] = WrappedArray(6)
7. Broadcast variables
scala> val kv = sc.parallelize(List((1,"apple"), (2,"orange"), (3,"banana"), (4,"grape")))
kv: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[16] at parallelize at <console>:24
scala> val fruitMap = kv.collectAsMap
fruitMap: scala.collection.Map[Int,String] = Map(2 -> orange, 4 -> grape, 1 -> apple, 3 -> banana)
val fruitIds = kv.keys.collect
val fruitIds = kv.keys.collect.toList
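The lines above only build the lookup map and the id list on the driver; the broadcast step itself is not shown. A minimal sketch of how fruitMap might be broadcast and then used inside a transformation (the bcFruitMap and names identifiers are illustrative, not from the original):
val bcFruitMap = sc.broadcast(fruitMap)   // ship the lookup map to every executor once
val names = sc.parallelize(fruitIds).map(id => bcFruitMap.value(id)).collect
// names should come back as Array(apple, orange, banana, grape)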
8. Accumulators
Spark provides accumulators as a kind of shared variable. The usage rules are as follows:
1). An accumulator can be created with SparkContext.accumulator([initial value])
2). Use "+=" to add to it
3). Inside a task, for example inside a foreach loop, the accumulator's value cannot be read
4). Only the driver program, i.e. code outside the loop, can read the accumulator's value via .value
scala> val ls = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7))
ls: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[21] at parallelize at <console>:24
scala> val total = sc.accumulator(0.0)
warning: there were two deprecation warnings; re-run with -deprecation for details
total: org.apache.spark.Accumulator[Double] = 0.0
scala> val num = sc.accumulator(0)
warning: there were two deprecation warnings; re-run with -deprecation for details
num: org.apache.spark.Accumulator[Int] = 0
scala> ls.foreach(i => { total += i; num += 1 })
scala> println(total.value)
28.0
scala> println(num.value)
7
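The deprecation warnings appear because SparkContext.accumulator was superseded in Spark 2.x. A sketch of the same sum and count using the newer built-in accumulators (assumes Spark 2.x; the total2/num2 names are illustrative):
val total2 = sc.doubleAccumulator("total")   // named DoubleAccumulator registered with the SparkContext
val num2 = sc.longAccumulator("count")
ls.foreach(i => { total2.add(i); num2.add(1) })
println(total2.value)   // 28.0
println(num2.value)     // 7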
9. RDD persistence
The RDD persistence mechanism keeps an RDD that is needed for repeated computations in memory, which can greatly improve performance.
Spark's RDD persistence API:
RDD.persist(storage level) -- the default storage level is MEMORY_ONLY, i.e. the RDD is kept in memory
Storage levels:
MEMORY_ONLY: the default; the RDD is stored as deserialized Java objects in JVM memory. If the RDD is too large to fit entirely in memory, the excess partitions are not cached and are recomputed when needed
MEMORY_AND_DISK: the RDD is stored as deserialized Java objects in JVM memory. If the RDD is too large to fit entirely in memory, the excess partitions are written to disk and read back when needed
MEMORY_ONLY_SER: like MEMORY_ONLY, but the RDD is stored as serialized Java objects
MEMORY_AND_DISK_SER: like MEMORY_AND_DISK, but the RDD is stored as serialized Java objects
DISK_ONLY: the RDD is stored only on disk
RDD.unpersist() -- removes the persistence
Example:
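The original example is omitted; as a minimal sketch (assuming the rdd1 from section 2 and the default MEMORY_ONLY level; the squared name is illustrative), persist/unpersist could be used like this:
import org.apache.spark.storage.StorageLevel
val squared = rdd1.map(x => x * x)
squared.persist(StorageLevel.MEMORY_ONLY)   // equivalent to squared.persist() or squared.cache()
squared.count       // the first action computes the partitions and caches them
squared.collect     // reuses the cached partitions instead of recomputing
squared.unpersist() // drops the cached data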