Learning goal
When I first started learning Spark I had no intuitive picture of partitions, and sortBy did not return the results I expected either. The experiments below use a small data set to work out how Spark partitioning and sortBy actually behave.
SparkContext configuration
val conf = new SparkConf().setAppName(getAppName).setMaster("local[4]")
val sc = new SparkContext(conf)
The master is set to local[4]: Spark runs in a single local JVM with 4 worker threads, which is enough to simulate parallel, distributed-style execution for these tests.
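A quick way to see what local[4] gives us is to check the default parallelism and the partition count of a freshly created RDD; sc.defaultParallelism and getNumPartitions are standard Spark APIs, and the expected values in the comments are assumptions based on the local[4] setting:
println(sc.defaultParallelism)                      // expected: 4, taken from local[4]
println(sc.parallelize(1 to 100).getNumPartitions)  // expected: 4, parallelize defaults to sc.defaultParallelism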
Testing partitions
Printing each partition
val rdd = sc.parallelize(1 to 100)
rdd.mapPartitionsWithIndex((idx, iter) => {
  // mkString would exhaust the iterator, so materialize it first
  val data = iter.toList
  println("partitionIndex" + idx + " " + data.mkString(","))
  data.iterator
}).collect()
The output is:
partitionIndex1 26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50
partitionIndex3 76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100
partitionIndex0 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25
partitionIndex2 51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75
The output shows 4 partitions, and the data inside each partition is in order, which matches expectations: parallelize slices the input range into contiguous chunks, one per partition.
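The partition count can also be set explicitly; parallelize takes an optional numSlices argument, and the input range is still sliced into contiguous chunks. A minimal sketch (rdd2 is just a local name for this example):
val rdd2 = sc.parallelize(1 to 100, 2) // ask for 2 partitions instead of the default 4
println(rdd2.getNumPartitions)         // expected: 2
rdd2.mapPartitionsWithIndex((idx, iter) => {
  val data = iter.toList
  println("partitionIndex" + idx + " " + data.mkString(",")) // expected: partition 0 holds 1..50, partition 1 holds 51..100
  data.iterator
}).collect()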
Printing all the data directly
val rdd = sc.parallelize(1 to 100)
rdd.foreach(i => print(i + ","))
Output:
26,76,51,27,1,2,3,4,5,28,52,77,53,29,6,7,8,30,54,78,55,31,9,10,11,32,56,57,79,58,59,33,12,13,14,15,16,17,34,60,80,61,62,63,64,35,18,19,20,21,36,65,81,66,37,22,23,24,38,67,82,68,83,39,40,41,42,43,44,45,46,47,48,25,49,84,69,85,50,86,70,87,71,88,72,89,73,90,74,91,75,92,93,94,95,96,97,98,99,100,
Judging from the output, the data within each partition is still in order, but the overall output is not. As far as I understand it, the reason is that foreach runs on the executors rather than on the driver, and the tasks for the 4 partitions run concurrently, so their prints interleave and the combined output looks unordered.
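To make the interleaving visible, each print can be tagged with the id of the partition the current task is processing; a minimal sketch using TaskContext, assuming the same rdd as above:
import org.apache.spark.TaskContext

rdd.foreach { i =>
  // getPartitionId returns the partition the currently running task is working on
  println("partition " + TaskContext.getPartitionId() + " -> " + i)
}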
Output after adding collect
val rdd = sc.parallelize(1 to 100)
rdd.collect().foreach(i => print(i + ","))
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,
After collect, the partitions are brought back to the driver in partition order and iterated there in a single thread, so the data comes out in order.
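Note that collect pulls the entire RDD into the driver's memory, which is fine for 100 integers but not for real data sets; take and toLocalIterator are safer ways to peek at the data. A small sketch:
rdd.take(10).foreach(i => print(i + ","))     // only the first 10 elements are sent to the driver
rdd.toLocalIterator.take(10).foreach(println) // streams the data to the driver one partition at a time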
Testing sortBy
Disordering the data in each partition and printing it
import scala.util.Random

val rdd = sc.parallelize(1 to 100)
val randomRdd = rdd.map(i => i + Random.nextInt(100)) // add a random offset
randomRdd.mapPartitionsWithIndex((idx, iter) => {
  val data = iter.toList
  println("partitionIndex" + idx + " " + data.mkString(","))
  data.iterator
}).collect()
Output:
partitionIndex3 123,134,152,126,171,105,99,131,172,183,125,148,178,141,174,94,147,103,101,162,153,192,102,101,167
partitionIndex2 127,129,115,77,140,150,94,124,79,124,116,143,70,86,131,74,142,77,71,153,153,155,124,84,146
partitionIndex1 46,119,69,40,95,84,128,71,51,68,76,131,67,50,103,93,121,46,127,115,109,93,124,75,136
partitionIndex0 37,86,63,98,36,30,90,79,69,28,91,95,16,53,27,56,66,41,29,23,76,78,114,84,32
Adding a random offset to every element leaves each element in the partition it came from, but the data inside each partition is no longer in order.
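Because map is a narrow transformation, only the values change; nothing is moved between partitions. A quick sketch to confirm that every partition still holds 25 elements after adding the random offset (reusing randomRdd from the snippet above):
randomRdd.mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
  .collect()
  .foreach { case (idx, count) => println("partition " + idx + " has " + count + " elements") }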
Sorting the data with sortBy
val rdd = sc.parallelize(1 to 100)
val randomRdd = rdd.map(i => i + Random.nextInt(100)) // add a random offset
randomRdd.sortBy(i => i, true).mapPartitionsWithIndex((idx, iter) => {
  val data = iter.toList
  println("partitionIndex" + idx + " " + data.mkString(","))
  data.iterator
}).collect()
Output:
partitionIndex2 98,98,98,101,102,105,106,108,108,108,109,110,110,111,118,119,121,122,124,124,126,126
partitionIndex3 128,129,135,138,139,141,141,143,149,151,153,154,158,158,161,161,161,162,164,168,172,173,175,177,179
partitionIndex1 66,66,69,70,71,72,73,75,75,75,77,78,78,79,80,81,84,84,84,86,87,87,88,90,90,92,93,93,95,95,96,97
partitionIndex0 27,29,29,30,32,33,34,39,42,43,44,44,45,46,47,48,56,59,62,62,64
After sortBy, the data inside each partition is sorted. Notice also that the partitions cover disjoint, increasing value ranges (partition 0 holds the smallest values, partition 3 the largest): sortBy repartitions the data by range before sorting within each partition.
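A hedged sketch to make that range partitioning visible: collect the minimum and maximum of every partition of the sorted RDD and check that the ranges do not overlap (randomRdd as defined above):
val sorted = randomRdd.sortBy(i => i, true)
val bounds = sorted.mapPartitionsWithIndex { (idx, iter) =>
  val data = iter.toList
  if (data.isEmpty) Iterator.empty else Iterator((idx, data.min, data.max))
}.collect()
bounds.sortBy(_._1).foreach { case (idx, lo, hi) =>
  println("partition " + idx + ": [" + lo + ", " + hi + "]")
}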
Printing all the data directly
val rdd = sc.parallelize(1 to 100)
val randomRdd = rdd.map(i => i + Random.nextInt(100)) // add a random offset
randomRdd.sortBy(i => i, true).foreach(i => print(i + ","))
Output:
93,127,97,95,98,103,106,97,69,58,152,123,148,119,53,72,103,56,57,86,32,92,82,41,10,70,161,181,132,68,150,70,100,110,102,182,120,152,114,72,104,65,40,48,56,60,84,102,71,183,123,65,68,129,193,85,63,75,55,82,116,117,106,99,145,135,56,142,110,79,69,20,72,87,110,34,16,59,70,76,20,70,87,25,39,120,149,187,108,158,73,142,167,195,140,180,84,89,132,78,
The overall output is again unordered, for the reason explained above: foreach runs concurrently on the executor threads.
Output after adding collect
val rdd = sc.parallelize(1 to 100)
val randomRdd = rdd.map(i => i + Random.nextInt(100)) // add a random offset
randomRdd.sortBy(i => i, true).collect().foreach(i => print(i + ","))
Output:
2,8,13,13,25,29,32,33,34,37,39,43,46,51,52,53,54,59,59,60,60,62,63,64,64,68,70,70,73,74,77,79,80,84,84,86,87,87,89,90,91,92,92,94,95,96,97,97,98,99,100,100,104,105,105,105,105,108,109,110,111,112,113,113,115,116,116,117,118,120,121,122,125,129,130,132,133,134,135,138,138,144,147,148,149,152,154,154,155,159,161,164,170,171,177,183,184,185,186,192,
This gives the fully ordered list.
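For completeness, the second argument of sortBy is the ascending flag, so a descending sort is just a flag change; a small variation on the snippet above:
randomRdd.sortBy(i => i, false).collect().foreach(i => print(i + ",")) // largest values first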
Takeaways
- When learning and testing, collect matters: without it, the output you see may not match what you expect (for an alternative that avoids a full collect, see the sketch after this list).
- It is much easier to verify how things behave on a small data set.
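As a side note on the first point: when only a sorted preview is needed, takeOrdered returns the smallest n elements already sorted on the driver, without a full sortBy plus collect; a minimal sketch reusing randomRdd from earlier:
randomRdd.takeOrdered(10).foreach(i => print(i + ",")) // the 10 smallest elements, in ascending order, on the driver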