09-Partitioning

Get partitions and cores

Use the DataFrame's underlying RDD and its getNumPartitions method to get the number of partitions

df = spark.read.parquet(eventsPath)
df.rdd.getNumPartitions()
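
Beyond the count, it can help to see how rows are distributed across partitions. A minimal diagnostic sketch using spark_partition_id from pyspark.sql.functions (an addition, not part of the original lesson):

from pyspark.sql.functions import spark_partition_id

# Count how many rows land in each partition (diagnostic only)
df.withColumn("partition_id", spark_partition_id()).groupBy("partition_id").count().show()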


Access SparkContext through SparkSession to get the number of cores or slots

SparkContext is also provided in Databricks notebooks as the variable sc

print(spark.sparkContext.defaultParallelism)
# print(sc.defaultParallelism)
# returns 8 on this cluster (one slot per core)

Repartition DataFrame

repartition

Returns a new DataFrame that has exactly n partitions; this always performs a full shuffle of the data.

repartitionedDF = df.repartition(8)
repartitionedDF.rdd.getNumPartitions()
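
repartition can also hash-partition by one or more columns. A minimal sketch, assuming the DataFrame has an event_name column (a hypothetical column name here):

repartitionedByColDF = df.repartition(8, "event_name")  # full shuffle, rows hashed by event_name
repartitionedByColDF.rdd.getNumPartitions()  # 8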


coalesce

Returns a new DataFrame that has exactly n partitions when fewer partitions are requested.

If a larger number of partitions is requested, the DataFrame stays at its current number of partitions; coalesce only merges existing partitions, avoiding a full shuffle, and can never increase the partition count (see the sketch after the code below).

coalesceDF = df.coalesce(8)
coalesceDF.rdd.getNumPartitions()
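
A minimal sketch of the "never grows" behavior, assuming df currently has more than 2 partitions:

smallerDF = df.coalesce(2)                     # merges down to 2 partitions without a full shuffle
smallerDF.rdd.getNumPartitions()               # 2
smallerDF.coalesce(16).rdd.getNumPartitions()  # still 2 -- coalesce cannot increase partitions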


Configure default shuffle partitions

Use the SparkSession's conf to read the Spark SQL configuration parameter for default shuffle partitions (spark.sql.shuffle.partitions)

spark.conf.get("spark.sql.shuffle.partitions")


Configure default shuffle partitions to match the number of cores

spark.conf.set("spark.sql.shuffle.partitions", "8")
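
Any wide transformation that follows will now shuffle into that many partitions. A minimal sketch, assuming AQE is disabled and the DataFrame has an event_name column (hypothetical):

countsDF = df.groupBy("event_name").count()  # groupBy triggers a shuffle
countsDF.rdd.getNumPartitions()              # 8, matching spark.sql.shuffle.partitions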

Adaptive Query Execution

https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html

Spark SQL uses spark.sql.adaptive.enabled as an umbrella configuration to control whether AQE is turned on or off (disabled by default in Spark 3.0 and 3.1, enabled by default since Spark 3.2)

spark.conf.get("spark.sql.adaptive.enabled")
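
To experiment with AQE, it can be enabled at runtime; a minimal sketch using standard Spark 3.x configuration keys:

spark.conf.set("spark.sql.adaptive.enabled", "true")
# With AQE on, Spark can coalesce small shuffle partitions at runtime:
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")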