[Notes] How an RDD Runs - 1

While debugging, I noticed something interesting. A task processes its partition one element at a time, and each element is pushed through every operation we applied to the RDD before the next element is read. Partitions execute in parallel with each other, but within a partition the elements are of course processed sequentially (no parallel computation there).

For example, say I have this job:

rdd.map(...).map(...).map(...).reduce(...)

The data source has 5 rows of data, split across 2 partitions. The work breaks down roughly like this (see the sketch after the list):

  1. task-0 - partition 0:

     row 1: map(...).map(...).map(...)
     row 2: map(...).map(...).map(...)
     reduce(...)
     ...continues with the rest of partition 0...
  2. task-1 - partition 1:

     row 4: map(...).map(...).map(...)
     row 5: map(...).map(...).map(...)
     reduce(...)
     ...continues with the rest of partition 1...
  3. task-2 - DAG scheduler:

     reduce over the results of task-0 and task-1
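
A rough model of that breakdown in plain Scala collections (hypothetical, not Spark code): each "task" reduces its own partition to a partial result, and a final merge combines the partials.

// Rough model of reduce: per-partition partial reductions (task-0, task-1)
// plus a final merge (the "task-2" step above). The partition layout here is
// hand-written to match what the run below happens to produce.
object ReduceSketch {
    def main(args: Array[String]): Unit = {
        val f = (x: Double, y: Double) => x + y
        val partitions = Seq(
            Seq(9.99, 5.49, 9.99),   // partition 0: John, John, Jack
            Seq(8.95, 5.49)          // partition 1: Jill, Bob
        )
        val partials = partitions.map(_.reduce(f))  // what task-0 and task-1 produce
        println(partials)                           // List(25.47, 14.44)
        println(partials.reduce(f))                 // the final merge: 39.91
    }
}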

task-0 and task-1 belong to the same stage and can run in parallel.

"task-2", on the other hand, is not a real task, and there is no shuffle here: reduce produces a single stage, and each task ships its partial result straight back to the driver.

The driver folds each partial result into the running total as soon as the task that produced it completes, so the final merge does overlap with still-running tasks, which is exactly the kind of parallelism I was wondering about. It just isn't the TaskScheduler that implements it: the merge runs in the driver's dag-scheduler-event-loop thread, as the output below shows.
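
As far as I can tell, RDD.reduce gets this behavior by handing sc.runJob a per-partition reduce function plus a driver-side merge callback. Below is a plain-Scala sketch of that merge-as-results-arrive pattern (the names and the futures machinery are made up for illustration; this is not Spark's actual code):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Sketch of the driver-side incremental merge: each "task" reduces its
// partition on its own thread, and the "driver" folds each partial result
// in as soon as that task finishes, instead of waiting for all of them.
object DriverMergeSketch {
    def main(args: Array[String]): Unit = {
        val f = (x: Double, y: Double) => x + y
        var acc: Option[Double] = None               // driver-side running result
        val lock = new Object
        val tasks = Seq(Seq(9.99, 5.49, 9.99), Seq(8.95, 5.49)).map { part =>
            Future(part.reduce(f)).map { partial => // a "task" finishes here
                lock.synchronized {                  // merge immediately on arrival
                    acc = Some(acc.fold(partial)(f(_, partial)))
                }
            }
        }
        Await.result(Future.sequence(tasks), 10.seconds)
        println(acc.get)                             // 39.91 (arrival order may vary)
    }
}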

Test data: sale.svc

John,iphone Cover,9.99
John,Headphones,5.49
Jack,iphone Cover,9.99
Jill,Samsung Galaxy Cover,8.95
Bob,iPad Cover,5.49

Test code:
import org.apache.spark.{SparkConf, SparkContext}

object SaleStat {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf(true)
            .setAppName("SaleStat")
            .setMaster("local[4]")
            .set("spark.executor.heartbeatInterval", 100.toString)
        val sc = new SparkContext(conf)
        // textFile's default minPartitions is min(defaultParallelism, 2), so this
        // tiny file is read as 2 partitions even under local[4]
        val saleData = sc.textFile("PATH_TO_TEST_DATA/sale.svc")
            .map { line =>
                val words = line.split(",")
                println("map1: " + Thread.currentThread().getName + "," + line)
                words
            }
            .map { words =>
                val (user, product, price) = (words(0), words(1), words(2))
                println("map2: " + (Thread.currentThread().getName, user, product, price))
                (user, product, price)
            }
            .map { x =>
                val price: Double = x._3.toDouble
                println("map3: " + Thread.currentThread().getName + "," + x)
                (x._1, price)
            }
            .reduce { (x, y) =>
                val total: Double = x._2 + y._2
                // NB: this is a plain reduce, not reduceByKey -- the log tag is kept
                // only to match the output below. It keeps x's key and sums all the
                // prices, so the final key is arbitrary and the value is the grand total.
                println("reduceByKey: " + (Thread.currentThread().getName, x, y, total))
                (x._1, total)
            }

        // reduce is an action, so the job already ran above; this just prints the result
        println("saleData: " + saleData)
        sc.stop()
    }
}
Output:
map1: Executor task launch worker-0,John,iphone Cover,9.99
map1: Executor task launch worker-1,Jill,Samsung Galaxy Cover,8.95
map2: (Executor task launch worker-0,John,iphone Cover,9.99)
map2: (Executor task launch worker-1,Jill,Samsung Galaxy Cover,8.95)
map3: Executor task launch worker-1,(Jill,Samsung Galaxy Cover,8.95)
map3: Executor task launch worker-0,(John,iphone Cover,9.99)
map1: Executor task launch worker-1,Bob,iPad Cover,5.49
map1: Executor task launch worker-0,John,Headphones,5.49
map2: (Executor task launch worker-1,Bob,iPad Cover,5.49)
map2: (Executor task launch worker-0,John,Headphones,5.49)
map3: Executor task launch worker-1,(Bob,iPad Cover,5.49)
map3: Executor task launch worker-0,(John,Headphones,5.49)
reduceByKey: (Executor task launch worker-1,(Jill,8.95),(Bob,5.49),14.44)
reduceByKey: (Executor task launch worker-0,(John,9.99),(John,5.49),15.48)
map1: Executor task launch worker-0,Jack,iphone Cover,9.99
map2: (Executor task launch worker-0,Jack,iphone Cover,9.99)
map3: Executor task launch worker-0,(Jack,iphone Cover,9.99)
reduceByKey: (Executor task launch worker-0,(John,15.48),(Jack,9.99),25.47)
reduceByKey: (dag-scheduler-event-loop,(Jill,14.44),(John,25.47),39.91)
saleData: (Jill,39.91)

worker-0 is task-0
worker-1 is task-1
dag-scheduler-event-loop is the driver-side merge ("task-2" above)

This is probably part of the appeal of the DAG. RDD execution is lazy, and the DAG can fuse the whole chain of operations so that, for each element, all of them run back-to-back on a single thread, avoiding intermediate disk and network IO.
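
That interleaving in the log is exactly what lazy iterators give you. A minimal plain-Scala sketch of the same pattern (no Spark involved; the names are made up): chained lazy maps push each element through the whole chain before pulling the next, just like the map1/map2/map3 lines above.

// Plain-Scala sketch (no Spark) of pipelined, element-at-a-time execution.
// Iterator.map is lazy: nothing runs until the iterator is consumed, and each
// element flows through the whole chain before the next one is read.
object PipelineSketch {
    def main(args: Array[String]): Unit = {
        val partition = Iterator("row1", "row2")    // stand-in for one partition
        val pipelined = partition
            .map { x => println("map1: " + x); x }
            .map { x => println("map2: " + x); x }
        pipelined.foreach(_ => ())
        // prints: map1: row1, map2: row1, map1: row2, map2: row2
    }
}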

Reposted from: https://www.cnblogs.com/ivanny/p/spark_rdd_run_DAGScheduler.html
