Plain Python will definitely outperform PySpark on smaller datasets. You will only see the difference once you process much larger datasets.
By default, Spark uses 200 shuffle partitions when running with a SQLContext or HiveContext. You should change it to 10, or whatever suits your data, with sqlContext.sql("set spark.sql.shuffle.partitions=10"). For a small dataset this will definitely be faster than the default.
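As a minimal sketch (assuming a local PySpark installation and a session named `spark`; the app name is made up), here are the two common ways to lower the shuffle-partition count, both at session creation and on an existing session:

```python
from pyspark.sql import SparkSession

# Set the shuffle-partition count when the session is built
# (the default is 200, which is far too many for a ~24 MB dataset).
spark = (
    SparkSession.builder
    .appName("small-dataset-demo")  # hypothetical app name
    .config("spark.sql.shuffle.partitions", "10")
    .getOrCreate()
)

# Or change it on an already-running session:
spark.conf.set("spark.sql.shuffle.partitions", "10")

# Equivalent to the SQL form used above:
spark.sql("set spark.sql.shuffle.partitions=10")
```

This only affects stages that shuffle (joins, aggregations, `ORDER BY`); with 200 partitions on a tiny dataset, most tasks carry scheduling overhead but almost no data.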
1) My dataset is about 220,000 records, 24 MB, and that’s not a big
enough dataset to show the scaling advantages of Spark.
You are right: at low data volumes you won't see much difference, and Spark can even be slower.
2) My spark is running locally and I should run it in something like
Amazon EC2 instead.
At your data volume, it probably won't help much.
3) Running locally is okay, but my computing capacity just doesn’t cut
it. It’s an 8 GB RAM 2015 MacBook.
Again, for a ~20 MB dataset it doesn't matter.
4) Spark is slow because I’m running Python. If I’m using Scala it
would be much better. (Con argument: I heard lots of people are using
PySpark just fine.)
Taken on its own, the language does make a difference: Python has more runtime overhead than Scala. But on a larger cluster doing genuinely distributed work, that overhead stops mattering much.
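One reason PySpark users do fine is that DataFrame operations built from Spark's own functions are compiled to JVM execution plans, so no rows ever cross into the Python interpreter; the Python overhead mainly appears when you force that crossing with Python UDFs or RDD lambdas. A rough sketch of the two paths (requires a running Spark installation; the session and counts are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("udf-vs-builtin").getOrCreate()
df = spark.range(1_000_000)

# Slow path: a Python UDF. Every value is serialized out to a Python
# worker process, doubled there, and serialized back to the JVM.
double_py = F.udf(lambda x: x * 2, "long")
df.select(double_py("id")).count()

# Fast path: a built-in column expression. This executes entirely
# inside the JVM, so PySpark and Scala perform about the same here.
df.select(F.col("id") * 2).count()
```

If you stay on the built-in DataFrame/SQL API and avoid Python UDFs, the Scala-vs-Python gap largely disappears.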