Spark Standalone Mode:Differences between client and cluster deploy modes

专注_每天进步一点点

Article link: https://blog.csdn.net/qq_32649581/article/details/102701627

1. A Stack Overflow user first asks a question

We have a Spark Standalone cluster with three machines, all of them with Spark 1.6.1:

A master machine, which is also where our application is run using spark-submit
Two identical worker machines

From the Spark Documentation, I read:

(...) For standalone clusters, Spark currently supports two deploy modes. In client mode, the driver is launched in the same process as the client that submits the application. In cluster mode, however, the driver is launched from one of the Worker processes inside the cluster, and the client process exits as soon as it fulfills its responsibility of submitting the application without waiting for the application to finish.

However, I don't really understand the practical differences from reading this, and I don't see what the advantages and disadvantages of the different deploy modes are.

Additionally, when I start my application using spark-submit, even if I set the property spark.submit.deployMode to "cluster", the application does not appear to run in cluster mode.

So I am not able to test both modes to see the practical differences. That being said, my questions are:

1) What are the practical differences between Spark Standalone client deploy mode and cluster deploy mode? What are the pros and cons of using each one?

2) How do I choose which one my application is going to be running on, using spark-submit?

2. Next is a fairly good answer

What are the practical differences between Spark Standalone client deploy mode and cluster deploy mode? What are the pros and cons of using each one?

Let's try to look at the differences between client and cluster mode.

Client:

Driver runs on the machine from which spark-submit is invoked (in this setup, the Master node) inside a dedicated process. This means it has all of that machine's available resources at its disposal to execute work.
Driver opens up a dedicated Netty HTTP server and distributes the specified JAR files to all Worker nodes (big advantage).
Because the Master node has dedicated resources of its own, you don't need to "spend" worker resources on the Driver program.
If the driver process dies, you need an external monitoring system to restart it.
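
To make the client-mode points concrete, here is a minimal sketch of a client-mode submission. The master URL, class name, and JAR path are placeholders assumed for illustration; they do not come from the original question. The script only prints the command, so it can be reviewed without a Spark installation.

```shell
# Hypothetical values, assumed for illustration only.
MASTER_URL="spark://master-host:7077"   # standalone master URL (placeholder)
APP_CLASS="com.example.MyApp"           # main class of the application (placeholder)
APP_JAR="/path/to/my-app.jar"           # JAR on the submitting machine (placeholder)

# Compose a client-mode submission: the driver starts on the machine
# where this command is run, so the JAR only needs to exist there.
client_submit_cmd() {
  printf '%s\n' "./bin/spark-submit --class $APP_CLASS --master $MASTER_URL --deploy-mode client $APP_JAR"
}

# Print the command instead of running it.
client_submit_cmd
```

On a machine with Spark installed, the printed line can be run from the Spark home directory. Since nothing supervises the driver in this mode, restarting it after a crash is left to whatever external tooling you use.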

Cluster:

Driver runs on one of the cluster's Worker nodes. The worker is chosen by the leader Master.
Driver runs as a dedicated, standalone process inside the Worker.
Driver program takes up at least 1 core and a dedicated amount of memory from one of the workers (this can be configured).
Driver program can be supervised by passing the --supervise flag, so that it is restarted automatically in case it dies.
When working in Cluster mode, all JARs related to the execution of your application need to be accessible to all the workers. This means you can either place them manually in a shared location or copy them to the same folder on each of the workers.
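
The cluster-mode points above can be sketched as a single submission command. This is an illustration under assumptions: the master URL, class name, HDFS path, and driver resource values are hypothetical placeholders. The flags themselves (--supervise, --driver-cores, --driver-memory) are standard spark-submit options. The script only prints the command.

```shell
# Hypothetical values, assumed for illustration only.
MASTER_URL="spark://master-host:7077"            # standalone master URL (placeholder)
APP_CLASS="com.example.MyApp"                    # main class (placeholder)
APP_JAR="hdfs://namenode:8020/apps/my-app.jar"   # JAR at a location every worker can read (placeholder)
DRIVER_CORES="1"                                 # cores taken from the chosen worker for the driver
DRIVER_MEMORY="1g"                               # memory taken from the chosen worker for the driver

# Compose a cluster-mode submission: the driver is launched on a worker,
# and --supervise requests an automatic restart if the driver fails.
cluster_submit_cmd() {
  printf '%s\n' "./bin/spark-submit --class $APP_CLASS --master $MASTER_URL --deploy-mode cluster --supervise --driver-cores $DRIVER_CORES --driver-memory $DRIVER_MEMORY $APP_JAR"
}

# Print the command instead of running it.
cluster_submit_cmd
```

The JAR path matters more here than in client mode: the driver may start on any worker, so a path that exists only on the submitting machine would not be found.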

Which one is better? That depends on your use case, and you will need to experiment to decide. Neither mode is strictly better: each has trade-offs, so it is up to you to see which one works better for your application.

How do I choose which one my application is going to be running on, using spark-submit?

You choose the mode with the --deploy-mode flag. From the Spark "Submitting Applications" documentation:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
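
As a small sketch of filling in that template, the helper below takes the deploy mode as an argument and falls back to client, which is also the default spark-submit uses when --deploy-mode is omitted. The function name build_submit_cmd and all values are hypothetical, chosen only for this example. The script prints the commands and does not run them.

```shell
# Hypothetical values, assumed for illustration only.
MASTER_URL="spark://master-host:7077"   # standalone master URL (placeholder)
APP_CLASS="com.example.MyApp"           # main class (placeholder)
APP_JAR="/path/to/my-app.jar"           # application JAR (placeholder)

# Build a spark-submit command line for the requested deploy mode.
# Accepts "client" or "cluster"; defaults to "client" when no argument is given.
build_submit_cmd() {
  mode="${1:-client}"
  case "$mode" in
    client|cluster) ;;
    *)
      echo "deploy mode must be 'client' or 'cluster', got: $mode" >&2
      return 1
      ;;
  esac
  printf '%s\n' "./bin/spark-submit --class $APP_CLASS --master $MASTER_URL --deploy-mode $mode $APP_JAR"
}

# Print one command per mode.
build_submit_cmd client
build_submit_cmd cluster
```

Validating the mode up front keeps a typo such as "clustre" from reaching spark-submit at all.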

https://stackoverflow.com/questions/37027732/apache-spark-differences-between-client-and-cluster-deploy-modes
