Apache Spark Performance Tuning – Degree of Parallelism (Spark Partition Tuning)

In general, the amount of work Spark can run in parallel is determined by the number of partitions.

Spark Partition Principles

The general principles to follow when tuning partitions for a Spark application are as follows:

  • Too few partitions – Cannot utilize all cores available in the cluster.
  • Too many partitions – Excessive overhead in managing many small tasks.
  • Reasonable partitions – Utilizes all the cores available in the cluster while avoiding excessive overhead from managing many small tasks.
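The principles above can be turned into a starting point. A commonly cited rule of thumb (an assumption here, not from the article) is to aim for roughly 2–3 tasks per available core, so every core stays busy without drowning the scheduler in tiny tasks. A minimal sketch:

```python
def suggest_partitions(total_cores, tasks_per_core=2):
    """Suggest a partition count from available cores.

    Rule of thumb (hedged): 2-3 tasks per core keeps all cores
    busy without excessive overhead from managing small tasks.
    """
    return total_cores * tasks_per_core

# e.g. a hypothetical cluster with 4 executors x 3 cores each
print(suggest_partitions(12))  # -> 24
```

The input-size-based calculation later in this article is the other common starting point; in practice you would take the larger of the two estimates and refine from the Spark UI.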

Source: https://dzone.com/articles/apache-spark-performance-tuning-degree-of-parallel

Understanding Spark Data Partitions

Two parameters in the Spark configuration control the number of partitions: spark.default.parallelism and spark.sql.shuffle.partitions.
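These two settings can be supplied in `spark-defaults.conf` (or via `--conf` on `spark-submit`). The values below are hypothetical, matching the ~12-partition estimate derived later in this article:

```properties
# spark-defaults.conf — illustrative values, not defaults
# Parallelism for RDD operations without an explicit partition count
spark.default.parallelism      12
# Number of partitions used when shuffling data for joins/aggregations
# in Spark SQL (the default is 200)
spark.sql.shuffle.partitions   12
```

Note that `spark.default.parallelism` applies to RDD operations, while `spark.sql.shuffle.partitions` governs DataFrame/SQL shuffles; the 200 shuffled tasks discussed below come from the latter's default.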

Examining the event timeline for those 200 shuffled-partition tasks shows tasks with more scheduler delay than computation time. This indicates that 200 tasks are unnecessary here, and the shuffle partition count can be tuned down to reduce the scheduler burden.

[Figure: FireServiceCallAnalysisDataFrameTest1StagesStats – stage statistics showing the high number of shuffle tasks]

The Stages view in Spark UI indicates that most of the tasks are simply launched and terminated without any computation, as shown in the below diagram:

[Figure: NumberOfTasksProblem – Stages view with many short-lived tasks]

Spark Partition Tuning

Let us first decide the number of partitions based on the input dataset size. The rule of thumb for partition size when working with HDFS is 128 MB (the default HDFS block size).

As our input dataset size is about 1.5 GB (1500 MB) and going with 128 MB per partition, the number of partitions will be:

Total input dataset size / partition size => 1500 / 128 = 11.71 = ~12 partitions.

This is equal to the Spark default parallelism (spark.default.parallelism) value. The metrics based on default parallelism are shown in the above section.

Now, let us perform a test by reducing the partition size and increasing the number of partitions.

Consider partition size as 64 MB.

Number of partitions = Total input dataset size / partition size => 1500 / 64 = 23.43 = ~23 partitions.
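The two calculations above can be sketched as a small helper. Rounding to the nearest whole partition here matches the article's back-of-the-envelope arithmetic (~12 and ~23):

```python
def partitions_for(dataset_mb, partition_mb):
    """Estimate partition count from dataset size and target
    partition size, rounded to the nearest whole partition."""
    return round(dataset_mb / partition_mb)

print(partitions_for(1500, 128))  # 1500 / 128 = 11.71 -> 12 partitions
print(partitions_for(1500, 64))   # 1500 / 64  = 23.43 -> 23 partitions
```

In practice, rounding up (`math.ceil`) is the safer choice so that no partition exceeds the target size, but the nearest-integer figure is what the article uses.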

