import org.apache.spark.{SparkConf, SparkContext}

// Create the Spark configuration and set the application name
val conf = new SparkConf().setAppName("mytest").setMaster("local[2]")
// Create the entry point for Spark execution
val sc = new SparkContext(conf)
val rdd = sc.parallelize(1 to 4, 3)
val rdd1 = rdd.map(x => {
  println(x + "--rdd1----------" + Thread.currentThread())
  x + 10
})
val rdd2 = rdd1.map(x => {
  println(x + "--rdd2----------" + Thread.currentThread())
})
rdd2.collect()
Number of partitions: local[2] initially implies two worker threads, but sc.parallelize(1 to 4, 3) then explicitly requests 3 partitions, which takes precedence.
Console output after running:
[Stage 0:> (0 + 0) / 3] // 3 is the total number of partitions (and therefore tasks); "0 + 0" is completed tasks plus currently running tasks. Since every dependency here is narrow, all the operations fall into a single stage.
1--rdd1----------Thread[Executor task launch worker for task 0,5,main]
11--rdd2----------Thread[Executor task launch worker for task 0,5,main]
12--rdd2----------Thread[Executor task launch worker for task 1,5,main]
2--rdd1----------Thread[Executor task launch worker for task 1,5,main]
3--rdd1----------Thread[Executor task launch worker for task 2,5,main]
13--rdd2----------Thread[Executor task launch worker for task 2,5,main]
4--rdd1----------Thread[Executor task launch worker for task 2,5,main]
14--rdd2----------Thread[Executor task launch worker for task 2,5,main]
The three partitions map to task threads task 0, task 1, and task 2. Task 0's partition holds a single element (1), task 1's partition also holds a single element (2), and task 2's partition holds two elements (3 and 4).
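The partition boundaries seen above (task 0 gets element 1, task 1 gets element 2, task 2 gets 3 and 4) follow from how parallelize slices the input collection. A minimal sketch of that slicing rule, written as a plain-Scala reimplementation (an assumption about Spark's internal positions logic, not a call into Spark):

```scala
// Simplified sketch: split a sequence of length n into numSlices
// contiguous partitions, with boundaries at floor(i * n / numSlices).
def slice[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
  val n = seq.length
  (0 until numSlices).map { i =>
    val start = (i * n) / numSlices
    val end = ((i + 1) * n) / numSlices
    seq.slice(start, end)
  }
}
```

For `1 to 4` with 3 slices this yields `[1]`, `[2]`, `[3, 4]`, matching the elements printed by task 0, task 1, and task 2 in the log.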
import java.sql.Timestamp
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]) {
  // Create the Spark configuration and set the application name
  val conf = new SparkConf().setAppName("mytest").setMaster("local[2]")
  // Create the entry point for Spark execution
  val sc = new SparkContext(conf)
  val rdd = sc.parallelize('A' to 'D', 3)
  val rdd1 = rdd.map(x => {
    println(getCurrentTime + "--" + x + "--rdd1---" + Thread.currentThread())
    Thread.sleep(5000)
    x
  })
  val rdd2 = rdd1.map(x => {
    println(getCurrentTime + "--" + x + "--rdd2---" + Thread.currentThread())
  })
  rdd2.collect()
  // rdd1.foreach(println)
}

def getCurrentTime = {
  new Timestamp(System.currentTimeMillis())
}
Sleep for 5 seconds inside the first map:
2020-05-12 09:30:45.686--b--rdd1---Thread[Executor task launch worker for task 1,5,main]
2020-05-12 09:30:45.686--a--rdd1---Thread[Executor task launch worker for task 0,5,main]
2020-05-12 09:30:50.689--a--rdd2---Thread[Executor task launch worker for task 0,5,main]
2020-05-12 09:30:50.689--b--rdd2---Thread[Executor task launch worker for task 1,5,main]
2020-05-12 09:30:50.723--c--rdd1---Thread[Executor task launch worker for task 2,5,main]
2020-05-12 09:30:55.723--c--rdd2---Thread[Executor task launch worker for task 2,5,main]
2020-05-12 09:30:55.723--d--rdd1---Thread[Executor task launch worker for task 2,5,main]
2020-05-12 09:31:00.724--d--rdd2---Thread[Executor task launch worker for task 2,5,main]
Since this runs locally in IDEA with local[2], there is a single executor with two task threads. Tasks 0 and 1 (one element each) run in parallel and finish in 5 s; task 2 has to wait for a free thread (note it starts at 09:30:50, after the first two finish) and then takes 10 s, because its partition holds two elements and each sleeps 5 s.
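The timestamps above can be reproduced with a toy FIFO scheduler: each task goes to the first free thread, which is an assumption suggested by the logs, not something taken from Spark's source.

```scala
// Toy simulation: assign tasks (given as durations) to `cores` worker
// threads in FIFO order; returns (taskId, startTime) for each task.
def schedule(taskDurations: Seq[Int], cores: Int): Seq[(Int, Int)] = {
  val freeAt = Array.fill(cores)(0)          // when each thread becomes free
  taskDurations.zipWithIndex.map { case (d, task) =>
    val core = freeAt.indexOf(freeAt.min)    // first thread to become free
    val start = freeAt(core)
    freeAt(core) = start + d
    (task, start)
  }
}
```

With tasks 0 and 1 taking 5 s (one element each) and task 2 taking 10 s (two elements), `schedule(Seq(5, 5, 10), 2)` starts tasks 0 and 1 at t = 0 and task 2 at t = 5, giving the 15 s total seen between 09:30:45 and 09:31:00.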
spark2-shell --master yarn-client --num-executors 1 --executor-cores 1
Spark ends up starting 3 executors; each executor's single core runs one task.
spark2-shell --master yarn-client --num-executors 1 --executor-cores 3
1 executor is started; the three tasks run on its three cores.
spark2-shell --master yarn-client --num-executors 1 --executor-cores 2
Two executors are started. It seems Spark (presumably via dynamic allocation scaling past the requested --num-executors) always ensures executors × executor-cores ≥ number of partitions. The detailed output:
executor1:
2020-05-12 01:15:51.696--A--rdd1---Thread[Executor task launch worker for task 0,5,main]
2020-05-12 01:15:51.696--B--rdd1---Thread[Executor task launch worker for task 1,5,main]
2020-05-12 01:15:56.699--A--rdd2---Thread[Executor task launch worker for task 0,5,main]
2020-05-12 01:15:56.699--B--rdd2---Thread[Executor task launch worker for task 1,5,main]
Its two cores run two tasks in parallel.
executor2:
2020-05-12 01:14:24.82--C--rdd1---Thread[Executor task launch worker for task 2,5,main]
2020-05-12 01:14:29.824--C--rdd2---Thread[Executor task launch worker for task 2,5,main]
2020-05-12 01:14:29.825--D--rdd1---Thread[Executor task launch worker for task 2,5,main]
2020-05-12 01:14:34.825--D--rdd2---Thread[Executor task launch worker for task 2,5,main]
One of its cores picks up the third task; since that task's partition holds two elements and each sleeps 5 s, it takes 10 s to finish.
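Across the three spark2-shell runs, the executor count observed is exactly the minimum needed to cover the partitions with the requested cores per executor, i.e. a ceiling division. A hypothetical helper (not a Spark API) illustrating that arithmetic:

```scala
// Minimum executors so that executors * coresPerExecutor >= partitions
// (ceiling division; hypothetical helper, not part of Spark).
def executorsNeeded(partitions: Int, coresPerExecutor: Int): Int =
  (partitions + coresPerExecutor - 1) / coresPerExecutor

// 3 partitions: 1 core/executor -> 3 executors, 3 cores -> 1, 2 cores -> 2,
// matching the three runs above.
```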
Add a union operation:
val rdd3 = sc.parallelize(List('a', 'b', 'c', 'd'), 3)
val rdd4 = rdd3.map(x => {
  println(getCurrentTime + "--" + x + "--rdd4---" + Thread.currentThread())
  x
})
val rdd5 = rdd2.union(rdd4)
rdd5.collect()
The number of partitions becomes 6 = 3 + 3:
2020-05-12 06:40:12.063--A--rdd1---Thread[Executor task launch worker for task 0,5,main]
2020-05-12 06:40:12.063--B--rdd1---Thread[Executor task launch worker for task 1,5,main]
2020-05-12 06:40:12.066--A--rdd2---Thread[Executor task launch worker for task 0,5,main]
2020-05-12 06:40:12.067--B--rdd2---Thread[Executor task launch worker for task 1,5,main]
2020-05-12 06:40:12.151--C--rdd1---Thread[Executor task launch worker for task 2,5,main]
2020-05-12 06:40:12.151--C--rdd2---Thread[Executor task launch worker for task 2,5,main]
2020-05-12 06:40:12.152--D--rdd1---Thread[Executor task launch worker for task 2,5,main]
2020-05-12 06:40:12.152--D--rdd2---Thread[Executor task launch worker for task 2,5,main]
2020-05-12 06:40:12.171--a--rdd4---Thread[Executor task launch worker for task 3,5,main]
2020-05-12 06:40:12.191--b--rdd4---Thread[Executor task launch worker for task 4,5,main]
2020-05-12 06:40:12.207--c--rdd4---Thread[Executor task launch worker for task 5,5,main]
2020-05-12 06:40:12.208--d--rdd4---Thread[Executor task launch worker for task 5,5,main]
Tasks 3, 4, and 5 were added.
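A union RDD builds its partition list by concatenating its parents' partitions, which is why 6 tasks appear. A simplified plain-Scala sketch of that assembly (an assumption about UnionRDD's behavior when the parents share no partitioner, not Spark code):

```scala
// Each resulting partition remembers which parent it came from and its
// index within that parent; the total is just the sum of the parents'
// partition counts.
case class Part(parent: Int, index: Int)

def unionParts(parentCounts: Seq[Int]): Seq[Part] =
  parentCounts.zipWithIndex.flatMap { case (n, p) =>
    (0 until n).map(i => Part(p, i))
  }
```

`unionParts(Seq(3, 3))` has 6 entries: the first three (tasks 0–2) come from rdd2's partitions, the last three (tasks 3–5) from rdd4's, matching the log above.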