Utilizing all CPU cores in Spark

Using the parameters to spark-shell or spark-submit, we can ensure that memory and CPUs are available on the cluster for our application, but that doesn't guarantee that all of the available memory or CPUs will actually be used.
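
For example, a minimal sketch of requesting resources on a YARN cluster; the jar name and the specific executor counts and sizes here are illustrative assumptions, not values from the text:

    # Ask YARN for 4 executors, each with 4 cores and 8 GB of heap (placeholder values)
    spark-submit \
      --master yarn \
      --num-executors 4 \
      --executor-cores 4 \
      --executor-memory 8g \
      my-app.jar

Even with flags like these, a stage can still leave some of the requested cores idle, which is the point developed below.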

As you've seen, Spark processes a stage by processing each partition separately. Only one executor can work on a single partition, so if the number of partitions is less than the number of executors, the stage won't take advantage of the full resources available.
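
A quick way to check this from spark-shell (a sketch; the data and numbers are arbitrary):

    val rdd = sc.parallelize(1 to 1000000)  // partition count defaults to spark.default.parallelism
    println(rdd.getNumPartitions)           // if this is lower than the cores you requested,
                                            // part of the cluster sits idle during this stage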

What determines the number of partitions? RDDs are built into a chain of processing by transformations, and the number of partitions of an RDD is based on the number of partitions in its parent RDD.
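
For instance, narrow transformations such as map simply inherit the parent's partitioning (a sketch):

    val parent = sc.parallelize(1 to 100, 8)  // explicitly request 8 partitions
    val child  = parent.map(_ * 2)            // map keeps the parent's partitioning
    println(child.getNumPartitions)           // 8, inherited from the parent
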
Eventually we reach an RDD without a parent. These are typically RDDs created from file or database storage. When reading from HDFS, for example, the number of partitions is determined by the size of each HDFS block.
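
A sketch of reading a file from HDFS; the path is a placeholder. By default, textFile creates roughly one partition per HDFS block (typically 128 MB), and the optional minPartitions argument can request more:

    val logs = sc.textFile("hdfs:///data/events.log", minPartitions = 16)
    println(logs.getNumPartitions)  // usually at least 16 for a splittable file,
                                    // and more if the file spans more blocks
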
As a general rule, you want at least as many partitions as cores. In fact, because Spark's scheduling latency is low compared to Hadoop's, having two or three times as many partitions as cores is usually fine.
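
A sketch of applying that rule of thumb, assuming a cluster with 16 total executor cores (an assumed figure) and the same placeholder input path as above:

    val totalCores = 16
    val wide = sc.textFile("hdfs:///data/events.log")
      .repartition(totalCores * 3)  // roughly 3x the cores: 48 partitions
    // The default parallelism for shuffles and parallelize can also be set cluster-wide,
    // e.g. --conf spark.default.parallelism=48 on spark-submit.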
