I'll keep updating this if I find anything else interesting.
Observation 1: the physical plan differs between Scala and Python
union physical plan pyspark
:- Exchange RoundRobinPartitioning(10), [id=#1318]
:  +- *(1) Scan ExistingRDD[value#148]
+- Exchange RoundRobinPartitioning(10), [id=#1320]
   +- *(2) Scan ExistingRDD[value#154]
== Physical Plan scala ==
Union
:- Exchange RoundRobinPartitioning(10), [id=#1012]
:  +- LocalTableScan [value#122]
+- ReusedExchange [value#131], Exchange RoundRobinPartitioning(10), [id=#1012]
scala Range (1 to 10 by 2) == Physical Plan ==
val df2 = (1 to 10 by 2).toDF.repartition(10)
Union
:- Exchange RoundRobinPartitioning(10), [id=#1644]
:  +- LocalTableScan [value#184]
+- Exchange RoundRobinPartitioning(10), [id=#1646]
   +- LocalTableScan [value#193]
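Observation 2: union in Spark essentially does not trigger a shuffle, which makes it a very efficient operation. I believe it is the explicit repartition of df1 and df2 that changes the number of partitions of the unioned df3; the Exchange nodes in the plans above come from repartition(10), not from the union itself. If you do not explicitly repartition the input DataFrames, you end up with a union DataFrame whose partition count equals the sum of the partition counts of df1 and df2. I tried a few permutations on the same data, and the results are below.
Case 1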
from pyspark.sql.types import IntegerType
df1 = spark.createDataFrame(range(100000), IntegerType())
print("df1 partitions: %d" %df1.rdd.getNumPartitions())
print("df1 partitioner: %s" %df1.rdd.partitioner)
df2 = spark.createDataFrame(range(100000), IntegerType())
print("df2 partitions: %d" %df2.rdd.getNumPartitions())
print("df2 partitioner: %s" %df2.rdd.partitioner)
df3 = df1.union(df2)
print("df3 partitions: %d" %df3.rdd.getNumPartitions())
print("df3 partitioner: %s" %df3.rdd.partitioner)
******Output*******
df1 partitions: 8
df1 partitioner: None
df2 partitions: 8
df2 partitioner: None
df3 partitions: 16
df3 partitioner: None
Case 2
val df1 = (1 to 100000).toDF
println(s"df1 partitions: ${df1.rdd.getNumPartitions}")
println(s"df1 partitioner: ${df1.rdd.partitioner}")
val df2 = (1 to 100000).toDF
println(s"df2 partitions: ${df2.rdd.getNumPartitions}")
println(s"df2 partitioner: ${df2.rdd.partitioner}")
df1.union(df2).explain()
val df3 = df1.union(df2)
println(s"df3 partitions: ${df3.rdd.getNumPartitions}")
println(s"df3 partitioner: ${df3.rdd.partitioner}")
******Output*******
df1 partitions: 8
df1 partitioner: None
df2 partitions: 8
df2 partitioner: None
df3 partitions: 16
df3 partitioner: None
Case 3
val df1 = (1 to 100000).toDF
println(s"df1 partitions: ${df1.rdd.getNumPartitions}")
println(s"df1 partitioner: ${df1.rdd.partitioner}")
val df2 = (1 to 100000 by 2).toDF
println(s"df2 partitions: ${df2.rdd.getNumPartitions}")
println(s"df2 partitioner: ${df2.rdd.partitioner}")
val df3 = df1.union(df2)
println(s"df3 partitions: ${df3.rdd.getNumPartitions}")
println(s"df3 partitioner: ${df3.rdd.partitioner}")
****Output****
df1 partitions: 8
df1 partitioner: None
df2 partitions: 8
df2 partitioner: None
df3 partitions: 16
df3 partitioner: None
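The partition arithmetic in the three cases can be reproduced without a cluster. What follows is only a plain-Python model of a shuffle-free union, not Spark's implementation: `make_partitions` and `union` are hypothetical helpers invented for this sketch, and the contiguous chunking is an assumption about how rows are laid out. It is meant to show why the count comes out as 8 + 8 = 16 when no row moves between partitions.

```python
from typing import Iterable, List

Partition = List[int]


def make_partitions(rows: Iterable[int], num_partitions: int) -> List[Partition]:
    """Hypothetical helper: split rows into contiguous chunks, one per partition."""
    rows = list(rows)
    size, extra = divmod(len(rows), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional row each.
        end = start + size + (1 if i < extra else 0)
        parts.append(rows[start:end])
        start = end
    return parts


def union(left: List[Partition], right: List[Partition]) -> List[Partition]:
    """Model of a union without a shuffle: partitions are placed side by side."""
    return left + right


# Mirrors Case 3: df1 holds 100000 rows, df2 holds every second value.
df1_parts = make_partitions(range(100000), 8)
df2_parts = make_partitions(range(0, 100000, 2), 8)
df3_parts = union(df1_parts, df2_parts)

print("df1 partitions: %d" % len(df1_parts))
print("df2 partitions: %d" % len(df2_parts))
print("df3 partitions: %d" % len(df3_parts))
```

In this model the first 8 partitions of the result are exactly df1's partitions and the last 8 are df2's, so nothing is redistributed. Under the same model, inputs that were each repartitioned to 10 would give 20 partitions after the union.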