Get partitions and cores
Use the rdd method to get the number of DataFrame partitions
df = spark.read.parquet(eventsPath)
df.rdd.getNumPartitions()
Access SparkContext through SparkSession to get the number of cores or slots
SparkContext is also provided in Databricks notebooks as the variable sc
print(spark.sparkContext.defaultParallelism)
# print(sc.defaultParallelism)
# e.g. returns 8 on a cluster with 8 cores
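To make the relationship between rows, partitions, and cores concrete, here is a Spark-free sketch in plain Python. The helper name split_into_partitions is hypothetical; it only models how rows are distributed across a fixed number of partitions, the way df.repartition(n) distributes a DataFrame.

```python
import multiprocessing

def split_into_partitions(rows, num_partitions):
    """Round-robin rows into num_partitions chunks (illustrative model only)."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions

# multiprocessing.cpu_count() stands in for sc.defaultParallelism here
cores = multiprocessing.cpu_count()
parts = split_into_partitions(list(range(100)), 8)
print(len(parts))        # 8 partitions
print(len(parts[0]))     # first partition holds 13 of the 100 rows
```

In real Spark the distribution strategy and row placement are handled by the engine; this only illustrates the counting.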
Repartition DataFrame
repartition
Returns a new DataFrame that has exactly n partitions.
repartitionedDF = df.repartition(8)
repartitionedDF.rdd.getNumPartitions()
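Conceptually, repartition performs a full shuffle: every row is reassigned to a target partition (typically by hash), regardless of where it currently lives. A hedged pure-Python sketch of that idea, with a hypothetical repartition function standing in for Spark's distributed implementation:

```python
def repartition(partitions, n):
    """Model a full shuffle: every row may move to a new partition."""
    new_parts = [[] for _ in range(n)]
    for part in partitions:
        for row in part:
            new_parts[hash(row) % n].append(row)  # hash decides the target
    return new_parts

old_parts = [[1, 2, 3], [4, 5, 6, 7]]
shuffled = repartition(old_parts, 8)
print(len(shuffled))  # exactly 8 partitions, as requested
```

Unlike coalesce below, this can both increase and decrease the partition count, at the cost of moving data.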
coalesce
Returns a new DataFrame that has exactly n partitions when fewer partitions are requested.
If a larger number of partitions is requested, the DataFrame stays at its current number of partitions; coalesce only combines existing partitions and avoids a full shuffle.
coalesceDF = df.coalesce(8)
coalesceDF.rdd.getNumPartitions()
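The contrast with repartition can be sketched in plain Python: coalesce merges whole existing partitions rather than reassigning individual rows, and it never increases the partition count. The coalesce function here is a hypothetical model, not Spark's implementation.

```python
def coalesce(partitions, n):
    """Model coalesce: merge whole partitions, never split or shuffle rows."""
    if n >= len(partitions):
        return partitions  # larger request: stays at the current count
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)  # whole partitions are combined intact
    return merged

parts = [[1], [2], [3], [4]]
print(len(coalesce(parts, 2)))  # 2, as requested
print(len(coalesce(parts, 8)))  # 4, unchanged: coalesce cannot grow the count
```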
Configure default shuffle partitions
Use the SparkSession's conf attribute (its runtime configuration) to access the configuration parameter for the default number of shuffle partitions
spark.conf.get("spark.sql.shuffle.partitions")
Configure default shuffle partitions to match the number of cores
spark.conf.set("spark.sql.shuffle.partitions", "8")
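A hedged sketch combining the two ideas above, assuming a live SparkSession named spark: read the core count from the context and set the shuffle-partition count to match, rather than hard-coding 8.

```python
# Illustrative config fragment only; requires a running SparkSession `spark`.
cores = spark.sparkContext.defaultParallelism
spark.conf.set("spark.sql.shuffle.partitions", str(cores))
print(spark.conf.get("spark.sql.shuffle.partitions"))
```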
Adaptive Query Execution
Spark SQL uses spark.sql.adaptive.enabled
to control whether AQE is turned on/off (disabled by default in Spark 3.0 and 3.1, enabled by default since Spark 3.2)
spark.conf.get("spark.sql.adaptive.enabled")
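Since the setting is a runtime configuration, it can be toggled the same way as the shuffle-partition count. A minimal config fragment, assuming a live SparkSession named spark:

```python
# Illustrative config fragment only; requires a running SparkSession `spark`.
spark.conf.set("spark.sql.adaptive.enabled", "true")
print(spark.conf.get("spark.sql.adaptive.enabled"))
```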