【Spark】【Standalone】【Standalone Mode】

This article covers Spark Standalone mode, including security configuration, cluster installation, manual startup, resource allocation, client properties, and resource scheduling. On the security side, it stresses securing any cluster that is exposed to the internet or an untrusted network. Spark Standalone mode can be deployed easily with the provided launch scripts, and it supports resource allocation and configuration, making it straightforward to connect applications to the cluster and schedule their resources.

Contents

Spark Standalone Mode
Security
Installing Spark Standalone to a Cluster
Starting a Cluster Manually
Cluster Launch Scripts
Resource Allocation and Configuration Overview
Connecting an Application to the Cluster
Client Properties
Launching Spark Applications
Resource Scheduling
Executors Scheduling
Stage Level Scheduling Overview
Caveats
Monitoring and Logging
Running Alongside Hadoop


Spark Standalone Mode

In addition to running on the Mesos or YARN cluster managers, Spark also provides a simple standalone deploy mode. You can launch a standalone cluster either manually, by starting a master and workers by hand, or use our provided launch scripts. It is also possible to run these daemons on a single machine for testing.
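
For example, a single-machine test cluster can be brought up with just two commands (this assumes the master binds to localhost and uses the default port 7077; adjust the URL if yours differs):

./sbin/start-master.sh
./sbin/start-worker.sh spark://localhost:7077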

Security

Security features like authentication are not enabled by default. When deploying a cluster that is open to the internet or an untrusted network, it’s important to secure access to the cluster to prevent unauthorized applications from running on the cluster. Please see Spark Security and the specific security sections in this doc before running Spark.
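
As a minimal sketch of one such measure, shared-secret authentication can be turned on in conf/spark-defaults.conf before the daemons are started; spark.authenticate and spark.authenticate.secret are standard Spark properties, but the secret value below is only a placeholder, and the same file must be distributed to every node (run from the Spark directory):

# append authentication settings to the defaults file (placeholder secret)
cat >> conf/spark-defaults.conf <<'EOF'
spark.authenticate        true
spark.authenticate.secret replace-with-a-long-random-secret
EOF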

Installing Spark Standalone to a Cluster

To install Spark Standalone mode, you simply place a compiled version of Spark on each node on the cluster. You can obtain pre-built versions of Spark with each release or build it yourself.
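
A minimal sketch of doing this by hand, assuming a pre-built tarball has already been downloaded and that hosts.txt is a hypothetical file listing one node hostname per line (the release filename is a placeholder for whichever version you actually use):

# copy and unpack the same build onto every node
for host in $(cat hosts.txt); do
  scp spark-3.x.y-bin-hadoop3.tgz "$host":/opt/
  ssh "$host" 'tar -xzf /opt/spark-3.x.y-bin-hadoop3.tgz -C /opt/'
done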

Starting a Cluster Manually

You can start a standalone master server by executing:

./sbin/start-master.sh

Once started, the master will print out a spark://HOST:PORT URL for itself, which you can use to connect workers to it, or pass as the “master” argument to SparkContext. You can also find this URL on the master’s web UI, which is http://localhost:8080 by default.
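
For example, an interactive shell can be pointed at the master you just started (the URL below assumes the master runs on localhost with the default port 7077; use the URL printed by your master instead):

./bin/spark-shell --master spark://localhost:7077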

Similarly, you can start one or more workers and connect them to the master via:

./sbin/start-worker.sh <master-spark-URL>

Once you have started a worker, look at the master’s web UI (http://localhost:8080 by default). You should see the new node listed there, along with its number of CPUs and memory (minus one gigabyte left for the OS).

Finally, the following configuration options can be passed to the master and worker:

  • -h HOST, --host HOST - Hostname to listen on
  • -i HOST, --ip HOST - Hostname to listen on (deprecated, use -h or --host)
  • -p PORT, --port PORT - Port for service to listen on (default: 7077 for master, random for worker)
  • --webui-port PORT - Port for web UI (default: 8080 for master, 8081 for worker)
  • -c CORES, --cores CORES - Total CPU cores to allow Spark applications to use on the machine (default: all available); only on worker
  • -m MEM, --memory MEM - Total amount of memory to allow Spark applications to use on the machine, in a format like 1000M or 2G (default: your machine's total RAM minus 1 GiB); only on worker
  • -d DIR, --work-dir DIR - Directory to use for scratch space and job output logs (default: SPARK_HOME/work); only on worker
  • --properties-file FILE - Path to a custom Spark properties file to load (default: conf/spark-defaults.conf)
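
For instance, a worker can be capped to part of a machine's resources by passing these options through the start script (the master URL, core count, and memory size below are illustrative):

./sbin/start-worker.sh spark://localhost:7077 --cores 4 --memory 8G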

Cluster Launch Scripts

To launch a Spark standalone cluster with the launch scripts, you should create a file called conf/workers in your Spark directory, which must contain the hostnames of all the machines where you intend to start Spark workers, one per line. If conf/workers does not exist, the launch scripts defaults to a single machine (localhost), which is useful for testing. Note, the master machine accesses each of the worker machines via ssh. By default, ssh is run in parallel and requires password-less (using a private key) access to be setup. If you do not have a password-less setup, you can set the environment variable SPARK_SSH_FOREGROUND and serially provide a password for each worker.
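
A minimal sketch of this setup, run from your Spark directory on the master machine, assuming two workers with the hypothetical hostnames worker1 and worker2 and an existing SSH key pair on the master:

# list the worker hosts, one per line
printf 'worker1\nworker2\n' > conf/workers
# install the master's public key on each worker for password-less ssh
ssh-copy-id worker1
ssh-copy-id worker2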

Once you’ve set up this file, you can launch or stop your cluster with the following shell scripts, based on Hadoop’s deploy scripts, and available in SPARK_HOME/sbin:

  • sbin/start-master.sh - Starts a master instance on the machine the script is executed on.
  • sbin/start-workers.sh - Starts a worker instance on each machine specified in the conf/workers file.
  • sbin/start-worker.sh - Starts a worker instance on the machine the script is executed on.
  • sbin/start-connect-server.sh - Starts a Spark Connect server on the machine the script is executed on.
  • sbin/start-all.sh - Starts both a master and a number of workers as described above.
  • sbin/stop-master.sh - Stops the master that was started via the sbin/start-master.sh script.
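
As a quick illustration of a typical session, run from the Spark directory on the master machine (assuming conf/workers is already populated as described above):

./sbin/start-all.sh      # start the master here and a worker on every host in conf/workers
# the master web UI (http://localhost:8080 by default) should now list every worker
./sbin/stop-master.sh    # later, stop the master started above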