[Notes] How an RDD Runs - 1

While debugging, I noticed something interesting. A task processes its partition one element at a time, and each element is pushed through every operation we applied to the RDD before the next element is read. Partitions execute in parallel with each other, but within a partition the elements are of course processed sequentially (no parallel computation there).

For example, say I have this job:

rdd.map(...).map(...).map(...).reduce(...)

The data source has 5 rows of data, split across 2 partitions. The work breaks down roughly like this (see the sketch after the list):

  1. task-0 - partition 0:

     row 1: map(...).map(...).map(...)
     row 2: map(...).map(...).map(...)
     reduce(...)
     ...continues with the rest of partition 0...
  2. task-1 - partition 1:

     row 4: map(...).map(...).map(...)
     row 5: map(...).map(...).map(...)
     reduce(...)
     ...continues with the rest of partition 1...
  3. task-2 - DAG scheduler:

     reduce over the results of task-0 and task-1
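
A rough model of that breakdown in plain Scala collections (hypothetical, not Spark code): each "task" reduces its own partition to a partial result, and a final merge combines the partials.

// Rough model of reduce: per-partition partial reductions (task-0, task-1)
// plus a final merge (the "task-2" step above). The partition layout here is
// hand-written to match what the run below happens to produce.
object ReduceSketch {
    def main(args: Array[String]): Unit = {
        val f = (x: Double, y: Double) => x + y
        val partitions = Seq(
            Seq(9.99, 5.49, 9.99),   // partition 0: John, John, Jack
            Seq(8.95, 5.49)          // partition 1: Jill, Bob
        )
        val partials = partitions.map(_.reduce(f))  // what task-0 and task-1 produce
        println(partials)                           // List(25.47, 14.44)
        println(partials.reduce(f))                 // the final merge: 39.91
    }
}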

task-0 and task-1 belong to the same stage and can run in parallel.

"task-2", on the other hand, is not a real task, and there is no shuffle here: reduce produces a single stage, and each task ships its partial result straight back to the driver.

The driver folds each partial result into the running total as soon as the task that produced it completes, so the final merge does overlap with still-running tasks, which is exactly the kind of parallelism I was wondering about. It just isn't the TaskScheduler that implements it: the merge runs in the driver's dag-scheduler-event-loop thread, as the output below shows.
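
As far as I can tell, RDD.reduce gets this behavior by handing sc.runJob a per-partition reduce function plus a driver-side merge callback. Below is a plain-Scala sketch of that merge-as-results-arrive pattern (the names and the futures machinery are made up for illustration; this is not Spark's actual code):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Sketch of the driver-side incremental merge: each "task" reduces its
// partition on its own thread, and the "driver" folds each partial result
// in as soon as that task finishes, instead of waiting for all of them.
object DriverMergeSketch {
    def main(args: Array[String]): Unit = {
        val f = (x: Double, y: Double) => x + y
        var acc: Option[Double] = None               // driver-side running result
        val lock = new Object
        val tasks = Seq(Seq(9.99, 5.49, 9.99), Seq(8.95, 5.49)).map { part =>
            Future(part.reduce(f)).map { partial => // a "task" finishes here
                lock.synchronized {                  // merge immediately on arrival
                    acc = Some(acc.fold(partial)(f(_, partial)))
                }
            }
        }
        Await.result(Future.sequence(tasks), 10.seconds)
        println(acc.get)                             // 39.91 (arrival order may vary)
    }
}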

Test data: sale.svc

John,iphone Cover,9.99
John,Headphones,5.49
Jack,iphone Cover,9.99
Jill,Samsung Galaxy Cover,8.95
Bob,iPad Cover,5.49

Test code:
import org.apache.spark.{SparkConf, SparkContext}

object SaleStat {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf(true)
            .setAppName("SaleStat")
            .setMaster("local[4]")
            .set("spark.executor.heartbeatInterval", 100.toString)
        val sc = new SparkContext(conf)
        // textFile's default minPartitions is min(defaultParallelism, 2), so this
        // tiny file is read as 2 partitions even under local[4]
        val saleData = sc.textFile("PATH_TO_TEST_DATA/sale.svc")
            .map { line =>
                val words = line.split(",")
                println("map1: " + Thread.currentThread().getName + "," + line)
                words
            }
            .map { words =>
                val (user, product, price) = (words(0), words(1), words(2))
                println("map2: " + (Thread.currentThread().getName, user, product, price))
                (user, product, price)
            }
            .map { x =>
                val price: Double = x._3.toDouble
                println("map3: " + Thread.currentThread().getName + "," + x)
                (x._1, price)
            }
            .reduce { (x, y) =>
                val total: Double = x._2 + y._2
                // NB: this is a plain reduce, not reduceByKey -- the log tag is kept
                // only to match the output below. It keeps x's key and sums all the
                // prices, so the final key is arbitrary and the value is the grand total.
                println("reduceByKey: " + (Thread.currentThread().getName, x, y, total))
                (x._1, total)
            }

        // reduce is an action, so the job already ran above; this just prints the result
        println("saleData: " + saleData)
        sc.stop()
    }
}
Output:
map1: Executor task launch worker-0,John,iphone Cover,9.99
map1: Executor task launch worker-1,Jill,Samsung Galaxy Cover,8.95
map2: (Executor task launch worker-0,John,iphone Cover,9.99)
map2: (Executor task launch worker-1,Jill,Samsung Galaxy Cover,8.95)
map3: Executor task launch worker-1,(Jill,Samsung Galaxy Cover,8.95)
map3: Executor task launch worker-0,(John,iphone Cover,9.99)
map1: Executor task launch worker-1,Bob,iPad Cover,5.49
map1: Executor task launch worker-0,John,Headphones,5.49
map2: (Executor task launch worker-1,Bob,iPad Cover,5.49)
map2: (Executor task launch worker-0,John,Headphones,5.49)
map3: Executor task launch worker-1,(Bob,iPad Cover,5.49)
map3: Executor task launch worker-0,(John,Headphones,5.49)
reduceByKey: (Executor task launch worker-1,(Jill,8.95),(Bob,5.49),14.44)
reduceByKey: (Executor task launch worker-0,(John,9.99),(John,5.49),15.48)
map1: Executor task launch worker-0,Jack,iphone Cover,9.99
map2: (Executor task launch worker-0,Jack,iphone Cover,9.99)
map3: Executor task launch worker-0,(Jack,iphone Cover,9.99)
reduceByKey: (Executor task launch worker-0,(John,15.48),(Jack,9.99),25.47)
reduceByKey: (dag-scheduler-event-loop,(Jill,14.44),(John,25.47),39.91)
saleData: (Jill,39.91)

worker-0 is task-0
worker-1 is task-1
dag-scheduler-event-loop is the driver-side merge ("task-2" above)

This is probably part of the appeal of the DAG. RDD execution is lazy, and the DAG can fuse the whole chain of operations so that, for each element, all of them run back-to-back on a single thread, avoiding intermediate disk and network IO.
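
That interleaving in the log is exactly what lazy iterators give you. A minimal plain-Scala sketch of the same pattern (no Spark involved; the names are made up): chained lazy maps push each element through the whole chain before pulling the next, just like the map1/map2/map3 lines above.

// Plain-Scala sketch (no Spark) of pipelined, element-at-a-time execution.
// Iterator.map is lazy: nothing runs until the iterator is consumed, and each
// element flows through the whole chain before the next one is read.
object PipelineSketch {
    def main(args: Array[String]): Unit = {
        val partition = Iterator("row1", "row2")    // stand-in for one partition
        val pipelined = partition
            .map { x => println("map1: " + x); x }
            .map { x => println("map2: " + x); x }
        pipelined.foreach(_ => ())
        // prints: map1: row1, map2: row1, map1: row2, map2: row2
    }
}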

Reposted from: https://www.cnblogs.com/ivanny/p/spark_rdd_run_DAGScheduler.html
