Big Data Work Notes – Spark
Spark Terms
- Driver: the process running the main() function of the application and creating the SparkContext (which in turn creates the DAGScheduler and TaskScheduler). This is an application-level process.
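To make the driver concrete, here is a minimal sketch of a standalone-mode driver program. The app name, master URL, input path, and object name are illustrative assumptions, not values from a real deployment; the key point is that constructing the SparkContext is what starts the scheduling machinery inside this driver process.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical word-count driver. Creating the SparkContext below is
// what spins up the DAGScheduler and TaskScheduler in this process.
object WordCountApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WordCountApp")             // shown in the Master web UI
      .setMaster("spark://japila.local:7077") // standalone master URL (illustrative)
      .set("spark.cores.max", "4")            // cap cores so other apps can run too
    val sc = new SparkContext(conf)
    try {
      val counts = sc.textFile("hdfs:///tmp/input.txt") // hypothetical input path
        .flatMap(_.split("\\s+"))
        .map((_, 1))
        .reduceByKey(_ + _)
      counts.take(10).foreach(println)
    } finally {
      sc.stop() // releases the application's cluster resources
    }
  }
}
```

When main() returns (or sc.stop() is called), the driver process exits and the application's executors are released.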
- YARN ApplicationMaster (application-level service): a lightweight process that coordinates the execution of an application's tasks and asks the ResourceManager for resource containers for those tasks. It monitors tasks, restarts failed ones, etc. It can run any type of task, whether MapReduce tasks or Spark tasks.
- Master: the cluster manager of a Spark standalone cluster. It is an external, long-running, cluster-level service for acquiring resources on the cluster (Spark can also run under other cluster managers, e.g. Mesos or YARN). The Master web UI is the web UI server for the standalone master:
  INFO Master: Starting Spark master at spark://japila.local:7077
  INFO Master: Running Spark version 1.6.0-SNAPSHOT
Spark Deploy Modes
- Standalone (cluster) – Spark manages everything in the cluster by itself (the cluster manager (Master) and the worker nodes). By default, this mode runs an executor for each application on every node in the cluster.
  - An application will acquire all cores in the cluster by default, which only makes sense if you run one application at a time. You can limit the number of cores by setting spark.cores.max in your SparkConf.
  - Alternatively, add the following to conf/spark-env.sh on the machine running the cluster master process to change the default for applications that don't set spark.cores.max:
    export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=<value>"
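The per-application cap can also be passed at submit time instead of in SparkConf. A sketch, assuming a standalone master at spark://japila.local:7077 (the host from the log above) and a hypothetical application jar and main class:

```shell
# Cap this application at 4 cores instead of grabbing the whole cluster.
# com.example.WordCountApp and myapp.jar are illustrative names.
spark-submit \
  --master spark://japila.local:7077 \
  --conf spark.cores.max=4 \
  --class com.example.WordCountApp \
  myapp.jar
```

Settings passed with --conf take effect for this submission only, which makes them convenient for experimenting before committing a value to spark-env.sh.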
- YARN – utilizes the YARN ResourceManager.
- Client – use when your application is submitted from a machine that is physically co-located with your worker machines. The driver is launched directly within the spark-submit process, which acts as a client to the cluster.
- Cluster – when your application is submitted from far away from the worker machines (e.g. locally from a laptop), it is common to use cluster mode to minimize network latency between the driver and the executors.