Getting the number of visible nodes in PySpark

I'm running some operations in PySpark, and recently increased the number of nodes in my configuration (which is on Amazon EMR). However, even though I tripled the number of nodes (from 4 to 12), performance seems not to have changed. As such, I'd like to see if the new nodes are visible to Spark.

I'm calling the following function:

>>> sc.defaultParallelism
2

But I think this is telling me the number of tasks distributed to each node, not the total number of cores that Spark can see.

How do I go about seeing the amount of nodes that PySpark is using in my cluster?

Solution

sc.defaultParallelism is just a hint. Depending on the configuration it may have no relation to the number of nodes. It is the number of partitions used when an operation takes a partition-count argument and you don't provide one. For example, sc.parallelize creates a new RDD from a list; you can tell it how many partitions to create with its second argument, and the default value for that argument is sc.defaultParallelism.

You can get the number of executors with sc.getExecutorMemoryStatus in the Scala API, but this is not exposed in the Python API.
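You can still reach that Scala method from PySpark through the py4j gateway. This is a sketch that relies on the private `_jsc` attribute, i.e. an internal API rather than a stable public interface:

```python
def get_executor_count(sc):
    # Reach through py4j to the underlying Scala SparkContext.
    # getExecutorMemoryStatus returns one entry per block manager,
    # which includes the driver, so subtract one to count executors only.
    return sc._jsc.sc().getExecutorMemoryStatus().size() - 1
```

Because it depends on internals, this can break between Spark versions; treat it as a diagnostic, not production code.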

In general the recommendation is to have around four times as many partitions in an RDD as you have executors. This is a good tip, because if task durations vary, the extra partitions even the load out: some executors process five fast tasks while others process three slow ones, for example.
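The rule of thumb above is easy to encode; suggest_partitions here is a hypothetical helper for illustration, not part of the Spark API:

```python
def suggest_partitions(num_executors, factor=4):
    # ~4 partitions per executor evens out variance in task durations:
    # faster executors pick up extra tasks instead of sitting idle.
    return num_executors * factor
```

For instance, suggest_partitions(125) gives the 500 partitions mentioned below for a cluster with under 200 CPUs.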

You don't need to be very accurate with this; a rough estimate is fine. For example, if you know you have fewer than 200 CPUs, 500 partitions will be fine.

So try to create RDDs with this number of partitions:

rdd = sc.parallelize(data, 500) # If distributing local data.

rdd = sc.textFile('file.csv', 500) # If loading data from a file.

Or repartition the RDD before the computation if you don't control the creation of the RDD:

rdd = rdd.repartition(500)

You can check the number of partitions in an RDD with rdd.getNumPartitions().
