Efficiency of writing Spark in Python: why is my Spark job slower than pure Python? A performance comparison

Spark newbie here. I tried to perform some pandas-style operations on my data frame using Spark, and surprisingly it's slower than pure Python (i.e. using the pandas package in Python). Here's what I did:

1)

In Spark:

train_df.filter(train_df.gender == '-unknown-').count()

It takes about 30 seconds to get results back. But using Python it takes about 1 second.

2) In Spark:

sqlContext.sql("SELECT gender, count(*) FROM train GROUP BY gender").show()

Same thing, takes about 30 sec in Spark, 1 sec in Python.
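For reference, the pure-Python side of the comparison looks roughly like this in pandas (a sketch: the file name train.csv and the variable train_pdf are assumptions for illustration, while the gender column and the '-unknown-' value come from the queries above):

import pandas as pd

# Load the same ~220,000-row data set into an in-memory pandas DataFrame.
train_pdf = pd.read_csv("train.csv")  # file name is illustrative

# 1) Count rows where gender == '-unknown-'
unknown_count = (train_pdf["gender"] == "-unknown-").sum()

# 2) count(*) grouped by gender
gender_counts = train_pdf.groupby("gender").size()

print(unknown_count)
print(gender_counts)

Both finish in well under a second because everything runs in a single process on data that already fits in memory, with none of Spark's job-scheduling or shuffle overhead.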

Several possible reasons my Spark is much slower than pure Python:

1) My dataset is about 220,000 records, 24 MB, and that's not a big enough dataset to show the scaling advantages of Spark.

2) My Spark is running locally and I should run it on something like Amazon EC2 instead.

3) Running locally is okay, but my computing capacity just doesn't cut it. It's an 8 GB RAM 2015 MacBook.

4) Spark is slow because I'm running Python. If I were using Scala it would be much better. (Counter-argument: I've heard lots of people are using PySpark just fine.)

Which one of these is most likely the reason, or the most credible explanation? I would love to hear from some Spark experts. Thank you very much!!

Solution

Pure Python (pandas) will definitely perform better than PySpark on smaller data sets. You will see the difference when you are dealing with larger data sets.

By default, when you run Spark SQL through a SQLContext or HiveContext, it uses 200 shuffle partitions. You need to change it to 10, or whatever value suits your data, by using sqlContext.sql("set spark.sql.shuffle.partitions=10"). That will definitely be faster than the default.
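A minimal sketch of what that looks like in the PySpark shell (assuming the Spark 1.x SQLContext API used in the question; the setConf call is an equivalent alternative):

# Reduce shuffle partitions from the default 200 to something small
# before running aggregations on a ~24 MB local data set.
sqlContext.sql("set spark.sql.shuffle.partitions=10")
# equivalently:
sqlContext.setConf("spark.sql.shuffle.partitions", "10")

# Re-run the aggregation; with only 10 shuffle partitions the GROUP BY
# schedules far fewer tasks and returns noticeably faster in local mode.
sqlContext.sql("SELECT gender, count(*) FROM train GROUP BY gender").show()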

1) My dataset is about 220,000 records, 24 MB, and that's not a big enough dataset to show the scaling advantages of Spark.

You are right, you will not see much difference at lower volumes. Spark can be slower as well.

2) My Spark is running locally and I should run it on something like Amazon EC2 instead.

For your volume it might not help much.

3) Running locally is okay, but my computing capacity just doesn't cut it. It's an 8 GB RAM 2015 MacBook.

Again, it does not matter for a ~24 MB data set.

4) Spark is slow because I'm running Python. If I were using Scala it would be much better. (Counter-argument: I've heard lots of people are using PySpark just fine.)

In standalone/local mode there will be a difference: Python has more runtime overhead than Scala. But on a larger cluster with distributed execution, it need not matter.
