Spark: Making Full Use of All CPU Cores

This article looks at how to configure resources in Apache Spark so that jobs run efficiently. It focuses on why partitions matter, how the number of partitions affects a Spark job's parallelism, and what partition count to aim for.


Using the parameters to spark-shell or spark-submit, we can ensure that memory and CPUs are available on the cluster for our application. But that doesn't guarantee that all the available memory or CPUs will be used.

In other words, we can make the cluster's memory and CPU resources available to our application through command-line parameters to spark-shell and spark-submit, but that alone does not guarantee those resources will actually be used.
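As a sketch, resource availability is typically requested with flags like the following (the executor counts and sizes here are illustrative values, not recommendations):

```shell
# Request resources for the application. Whether they are fully
# used still depends on how many partitions each stage has.
spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-cores 4 \
  --executor-memory 8g \
  my_job.py
```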

As you've seen, Spark processes a stage by processing each partition separately. In fact, only one executor can work on a single partition, so if the number of partitions is less than the number of executors, the stage won't take advantage of the full resources available.

In Spark, a single partition is processed by only one executor, so if the number of partitions is smaller than the number of executors, some executors sit idle and we cannot fully use the available resources.
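A minimal sketch of that constraint: the partition count caps how many cores a stage can keep busy at once (the helper name below is made up for illustration):

```python
def max_concurrent_tasks(num_partitions: int, total_executor_cores: int) -> int:
    """Each partition is processed by one task at a time, so a stage
    can never keep more cores busy than it has partitions."""
    return min(num_partitions, total_executor_cores)

# With 16 cores available but only 4 partitions, 12 cores sit idle:
print(max_concurrent_tasks(4, 16))   # 4
print(max_concurrent_tasks(32, 16))  # 16
```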

What determines the number of partitions? You've seen that RDDs are built into a chain of processing by transformations; the number of partitions for an RDD is based on the number of partitions in its parent RDD.

So what determines the number of partitions? An RDD's partition count is primarily inherited from its parent RDD.
Eventually we reach an RDD without a parent. These are typically RDDs created from file or database storage. In the case of reading from HDFS, the number of partitions will be determined by the size of each HDFS block.

In some cases, such as RDDs created from files or databases, there is no parent RDD. For an RDD created by reading from HDFS, for example, the number of partitions is determined by the HDFS block size.
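As a rough illustration (the numbers below are hypothetical), the input partition count for an HDFS-backed RDD is approximately the file size divided by the block size, rounded up:

```python
import math

def hdfs_input_partitions(file_size_bytes: int,
                          block_size_bytes: int = 128 * 1024 * 1024) -> int:
    """Approximate partition count for a file read from HDFS:
    one partition per block (128 MB is a common default block size)."""
    return math.ceil(file_size_bytes / block_size_bytes)

# A 1 GB file with 128 MB blocks yields about 8 input partitions:
print(hdfs_input_partitions(1024 * 1024 * 1024))  # 8
```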
As a general rule, you want to ensure that you have at least as many partitions as cores. In fact, having two or three times as many partitions as cores is usually fine, due to Spark's low scheduling latency compared to Hadoop.

In short, because Spark's scheduling latency is much lower than Hadoop's, setting the partition count to roughly 2–3 times the number of cores is usually a good choice.
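That rule of thumb reduces to simple arithmetic (the factor of 3 below is one reasonable choice within the suggested 2–3× range):

```python
def recommended_partitions(total_cores: int, factor: int = 3) -> int:
    """Rule of thumb: 2-3x as many partitions as cores keeps every
    core busy and evens out skew between tasks."""
    return total_cores * factor

# A cluster with 4 executors x 4 cores each = 16 cores:
print(recommended_partitions(16))     # 48
print(recommended_partitions(16, 2))  # 32
```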
