Different Partition Counts When Unioning Spark DataFrames with the Scala and Python APIs

I will keep updating this post if I find anything else interesting.

Observation 1 — the physical plan differs between Scala and Python

== Physical Plan (PySpark) ==
Union
:- Exchange RoundRobinPartitioning(10), [id=#1318]
:  +- *(1) Scan ExistingRDD[value#148]
+- Exchange RoundRobinPartitioning(10), [id=#1320]
   +- *(2) Scan ExistingRDD[value#154]

== Physical Plan (Scala) ==
Union
:- Exchange RoundRobinPartitioning(10), [id=#1012]
:  +- LocalTableScan [value#122]
+- ReusedExchange [value#131], Exchange RoundRobinPartitioning(10), [id=#1012]
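For reference, a minimal Scala sketch that should reproduce the plan above (assuming a spark-shell session with spark.implicits._ in scope; the input range 1 to 10 and the partition count 10 are my reconstruction from the plans):

// Assumed spark-shell session, spark.implicits._ already in scope.
val df1 = (1 to 10).toDF.repartition(10)
val df2 = (1 to 10).toDF.repartition(10)
// Both branches are identical local relations repartitioned the same way,
// so Catalyst plans one Exchange and points the second branch at it
// via ReusedExchange.
df1.union(df2).explain()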

In Scala, if df2 is built from a different range, the exchange is no longer reused and each branch gets its own Exchange:

val df2 = (1 to 10 by 2).toDF.repartition(10)

== Physical Plan ==
Union
:- Exchange RoundRobinPartitioning(10), [id=#1644]
:  +- LocalTableScan [value#184]
+- Exchange RoundRobinPartitioning(10), [id=#1646]
   +- LocalTableScan [value#193]
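A matching sketch for this variant (df1 is assumed to be the same (1 to 10).toDF.repartition(10) as in the sketch above):

val df1 = (1 to 10).toDF.repartition(10)
val df2 = (1 to 10 by 2).toDF.repartition(10)
// The inputs now differ, so the shuffle output cannot be shared and the
// plan contains two separate Exchange nodes.
df1.union(df2).explain()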

Observation 2 — union in Spark does not trigger a shuffle; it is a very cheap operation. I believe it is the explicit repartition of df1 and df2 that changes the partition count of the unioned df3. If you do not repartition the input DataFrames explicitly, the unioned DataFrame ends up with a partition count equal to the sum of df1's and df2's. I tried a few permutations on the same data, and the results are below.

Case 1 (PySpark)

from pyspark.sql.types import IntegerType

df1 = spark.createDataFrame(range(100000), IntegerType())
print("df1 partitions: %d" % df1.rdd.getNumPartitions())
print("df1 partitioner: %s" % df1.rdd.partitioner)

df2 = spark.createDataFrame(range(100000), IntegerType())
print("df2 partitions: %d" % df2.rdd.getNumPartitions())
print("df2 partitioner: %s" % df2.rdd.partitioner)

df3 = df1.union(df2)
print("df3 partitions: %d" % df3.rdd.getNumPartitions())
print("df3 partitioner: %s" % df3.rdd.partitioner)

****** Output ******

df1 partitions: 8
df1 partitioner: None
df2 partitions: 8
df2 partitioner: None
df3 partitions: 16
df3 partitioner: None

Case 2 (Scala, identical input data)

val df1 = (1 to 100000).toDF
println(s"df1 partitions: ${df1.rdd.getNumPartitions}")
println(s"df1 partitioner: ${df1.rdd.partitioner}")

val df2 = (1 to 100000).toDF
println(s"df2 partitions: ${df2.rdd.getNumPartitions}")
println(s"df2 partitioner: ${df2.rdd.partitioner}")

df1.union(df2).explain()

val df3 = df1.union(df2)
println(s"df3 partitions: ${df3.rdd.getNumPartitions}")
println(s"df3 partitioner: ${df3.rdd.partitioner}")

****** Output ******

df1 partitions: 8
df1 partitioner: None
df2 partitions: 8
df2 partitioner: None
df3 partitions: 16
df3 partitioner: None

Case 3 (Scala, different input data for df2)

val df1 = (1 to 100000).toDF
println(s"df1 partitions: ${df1.rdd.getNumPartitions}")
println(s"df1 partitioner: ${df1.rdd.partitioner}")

val df2 = (1 to 100000 by 2).toDF
println(s"df2 partitions: ${df2.rdd.getNumPartitions}")
println(s"df2 partitioner: ${df2.rdd.partitioner}")

val df3 = df1.union(df2)
println(s"df3 partitions: ${df3.rdd.getNumPartitions}")
println(s"df3 partitioner: ${df3.rdd.partitioner}")

****** Output ******

df1 partitions: 8
df1 partitioner: None
df2 partitions: 8
df2 partitioner: None
df3 partitions: 16
df3 partitioner: None
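All three cases are consistent: without an explicit repartition, union simply concatenates the input partitions (8 + 8 = 16), keeps no partitioner, and involves no shuffle. The same behavior is visible at the RDD level; a minimal self-contained sketch (local[8] is my assumption, chosen to match the 8 default partitions above):

import org.apache.spark.sql.SparkSession

// Assumed local session; local[8] mirrors the 8 default partitions above.
val spark = SparkSession.builder().master("local[8]").appName("union-demo").getOrCreate()

val rdd1 = spark.sparkContext.parallelize(1 to 100000, 8)
val rdd2 = spark.sparkContext.parallelize(1 to 100000, 8)

// union builds a UnionRDD that concatenates the parents' partitions.
println(rdd1.union(rdd2).getNumPartitions) // 16 = 8 + 8
println(rdd1.union(rdd2).partitioner)      // None: union does not define a partitioner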
