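A small hands-on note: testing pipelined RDD computation within a Stage
1 Analysis
Within a Stage, computation runs as a pipeline: each record of a partition passes through all of the operators in turn, from the first to the last, before the next record is processed. (From the data-fetching point of view there is a notion of backtracking, because each iterator pulls from its parent, but in terms of execution the operators are applied in order from front to back.)
The following uses the case where one input record produces one output record as the example (think of it as a map operation; the same logic extends to filter, flatMap, and so on):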
Src_partition_x | | | | | Desc_partition_x
Src_Rec_1 | Opts_1 --------------> | Opts_2 --------------> | …… --------------> | Opts_x --------------> | Desc_Rec_1
Src_Rec_2 | | | | | Desc_Rec_2
Src_Rec_3 | | | | | Desc_Rec_3
… | | | | | …
Src_Rec_n | | | | | Desc_Rec_n
When compute calls next on the Iterator to obtain one record, for example Desc_Rec_1, the backtracking proceeds as follows:
Opts_x(rec_x-1) -> …… -> Opts_x( ……( Opts_2(rec_1) ) ……) -> Opts_x( ……( Opts_2( Opts_1(Src_Rec_1) ) ) ……)
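The backtracking described above can be sketched with a toy model. This is a minimal sketch under stated assumptions, not Spark's actual classes: ToyRDD, Source, Mapped, and PipelineSketch are hypothetical names introduced only for illustration. Each node's compute() wraps the iterator returned by its parent's compute(), so pulling one record from the last node reaches back to the source, and the operators then run front to back on that single record.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical toy model (not Spark's real RDD classes).
object PipelineSketch {
  // Records the order in which operators are applied.
  val trace = ArrayBuffer.empty[String]

  trait ToyRDD[T] { def compute(): Iterator[T] }

  // Leaf node: simply iterates over the partition's records.
  class Source[T](data: Seq[T]) extends ToyRDD[T] {
    def compute(): Iterator[T] = data.iterator
  }

  // One-to-one operator: wraps the parent's iterator lazily.
  class Mapped[T, U](parent: ToyRDD[T], name: String, f: T => U) extends ToyRDD[U] {
    def compute(): Iterator[U] = parent.compute().map { x =>
      trace += s"$name($x)"
      f(x)
    }
  }

  def run(): List[Int] = {
    trace.clear()
    val src   = new Source(Seq(1, 2))
    val opts1 = new Mapped[Int, Int](src, "Opts_1", _ + 1)
    val opts2 = new Mapped[Int, Int](opts1, "Opts_2", _ * 10)
    // Pulling from the last node drives the whole chain, one record at a time.
    opts2.compute().toList
  }
}
```

After PipelineSketch.run(), the trace reads Opts_1(1), Opts_2(2), Opts_1(2), Opts_2(3): the first record goes through both operators before the second record is touched.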
2 Test
scala> val a = sc.parallelize(1 to 4, 1)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21
scala> a.toDebugString
res0: String = (1) ParallelCollectionRDD[0] at parallelize at <console>:21 []
scala> a.filter { x => println("hello"); x > 2 }.map { x => println("world"); x }.count
hello
hello
hello
world
hello
world
res1: Long = 2
scala> a.flatMap { x => println("hello" + x); List(x, x) }.map { x => println("world" + x); x }.collect
hello1
world1
world1
hello2
world2
world2
hello3
world3
world3
hello4
world4
world4
res8: Array[Int] = Array(1, 1, 2, 2, 3, 3, 4, 4)
Custom functions that simulate how compute operates on an Iterator:
scala> def map[T](iter: Iterator[T]): Iterator[T] = iter.map { x => println("hello"); x }
map: [T](iter: Iterator[T])Iterator[T]
scala> def filter[T](iter: Iterator[T]): Iterator[T] = iter.filter { x => println("world"); true }
filter: [T](iter: Iterator[T])Iterator[T]
scala> map(filter((1 to 5).iterator))
world
res7: Iterator[Int] = non-empty iterator
scala> map(filter((1 to 5).iterator)).foreach(x => x)
world
hello
world
hello
world
hello
world
hello
world
hello
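The simulation above relies on println and on what the REPL happens to print. The same behaviour can be checked with an explicit log. This is a minimal sketch assuming plain Scala iterators and no Spark; LazyIteratorSketch and its fields are hypothetical names. It shows that building the chain runs nothing, that a single hasNext makes filter look ahead once (which is presumably why one world appears when the REPL renders res7), and that consumption interleaves the two operators.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper object; map and filter mirror the functions defined above,
// but append to a log instead of printing.
object LazyIteratorSketch {
  val log = ArrayBuffer.empty[String]

  def map[T](iter: Iterator[T]): Iterator[T] =
    iter.map { x => log += "hello"; x }

  def filter[T](iter: Iterator[T]): Iterator[T] =
    iter.filter { _ => log += "world"; true }

  // Step 1: build the chain. Nothing has been consumed yet.
  val chained: Iterator[Int] = map(filter((1 to 3).iterator))
  val afterBuild: List[String] = log.toList

  // Step 2: ask whether a record exists. filter must test the first element.
  val nonEmpty: Boolean = chained.hasNext
  val afterHasNext: List[String] = log.toList

  // Step 3: consume everything. Each record passes filter, then map.
  chained.foreach(_ => ())
  val afterForeach: List[String] = log.toList
}
```

Here afterBuild is empty, afterHasNext holds a single world, and afterForeach alternates world and hello for each of the three records.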
3 Extension
scala> (1 to 4).filter { x => println("hello" + x); x > 2 }.map { x => println("world" + x); x }
hello1
hello2
hello3
hello4
world3
world4
res9: scala.collection.immutable.IndexedSeq[Int] = Vector(3, 4)
scala> (1 to 4).withFilter { x => println("hello" + x); x > 2 }.map { x => println("world" + x); x }
hello1
hello2
hello3
world3
hello4
world4
res10: scala.collection.immutable.IndexedSeq[Int] = Vector(3, 4)
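The two runs above differ because filter on a Range is strict: it tests every element and builds an intermediate collection before map starts. withFilter defers the predicate until map traverses the elements, which gives the same interleaved order as the RDD pipeline. The sketch below reproduces both orders with an explicit log. It assumes plain Scala collections, uses .iterator rather than withFilter for the lazy variant, and the names StrictVsLazySketch, strictOrder, and lazyOrder are hypothetical.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper object contrasting strict and lazy evaluation order.
object StrictVsLazySketch {
  // Strict: filter finishes over all elements before map begins.
  def strictOrder(): (List[Int], List[String]) = {
    val log = ArrayBuffer.empty[String]
    val result = (1 to 4)
      .filter { x => log += s"hello$x"; x > 2 }
      .map { x => log += s"world$x"; x }
    (result.toList, log.toList)
  }

  // Lazy: each element flows through filter and then map before the next one.
  def lazyOrder(): (List[Int], List[String]) = {
    val log = ArrayBuffer.empty[String]
    val result = (1 to 4).iterator
      .filter { x => log += s"hello$x"; x > 2 }
      .map { x => log += s"world$x"; x }
      .toList
    (result, log.toList)
  }
}
```

Both methods return List(3, 4) as the result; only the logged order differs, matching the two transcripts in this section.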