Spark Installation Notes
Official Spark download page: Spark downloads
Note: choose the Spark build that matches your Hadoop version. Hadoop here is 3.3.3, and the matching package is spark-3.2.1-bin-hadoop3.2.tgz.
*Yarn mode requires an existing Hadoop cluster; for the installation procedure see:
Apache-Hadoop3.3.3 cluster installation
1. Local Mode
Local mode runs Spark code on a single machine without any external cluster resources; it is typically used for teaching, debugging, and demos.
1.1 Unpack the Archive
Upload the Spark package to the Linux host, extract it, and place it in the target directory.
tar -zxvf spark-3.2.1-bin-hadoop3.2.tgz -C /opt/soft
cd /opt/soft
mv spark-3.2.1-bin-hadoop3.2/ spark-3.2.1-local
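The unpack-and-rename steps above can be wrapped in a small reusable function, which is handy since this article repeats them for each deploy mode. The directory layout (/opt/soft, mode-specific names) follows this article; the function itself is just a sketch:

```shell
# Sketch: extract a Spark tarball and rename the top-level directory in one step.
install_spark() {
  local tgz="$1" dest="$2" name="$3"
  tar -zxf "$tgz" -C "$dest"                        # extract into the target dir
  mv "$dest/$(basename "$tgz" .tgz)" "$dest/$name"  # rename to a mode-specific name
}
# Example: install_spark spark-3.2.1-bin-hadoop3.2.tgz /opt/soft spark-3.2.1-local
```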
1.2 Start the Local Environment
(1) Enter the extracted directory and run:
[root@node01 spark-3.2.1-local]# bin/spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/06/06 12:53:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://node01:4040
Spark context available as 'sc' (master = local[*], app id = local-1654491218662).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.2.1
/_/
Using Scala version 2.12.15 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_261)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
- After startup succeeds, the Web UI monitoring page is available at:
http://node01:4040
![Spark Web UI](https://img-blog.csdnimg.cn/img_convert/8b20f7debdd861c4acfbc00c844ecf3d.png)
1.3 Command-Line Tool
Create a word.txt file under the data directory of the extracted folder, then run the word-count snippet below in the spark-shell.
[root@node01 data]# touch word.txt
[root@node01 data]# echo "hello spark" > word.txt
scala> sc.textFile("data/word.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect
res0: Array[(String, Int)] = Array((hello,1), (spark,1))
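As a quick sanity check, the same counts can be reproduced with plain coreutils on the file created above (run from the data directory):

```shell
# Cross-check the spark-shell word count with coreutils:
# split on spaces, sort, and count duplicates.
printf 'hello spark\n' > word.txt
tr ' ' '\n' < word.txt | sort | uniq -c
# Each word appears once, matching Array((hello,1), (spark,1)) from the shell.
```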
1.4 Submit an Application
[root@node01 spark-3.2.1-local]# bin/spark-submit \
> --class org.apache.spark.examples.SparkPi \
> --master local[2] \
> ./examples/jars/spark-examples_2.12-3.2.1.jar \
> 10
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/06/06 13:04:23 INFO SparkContext: Running Spark version 3.2.1
...
22/06/06 13:04:23 INFO Utils: Successfully started service 'SparkUI' on port 4040.
22/06/06 13:04:23 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://node01:4040
22/06/06 13:04:23 INFO SparkContext: Added JAR file:/opt/soft/spark-3.2.1-local/examples/jars/spark-examples_2.12-3.2.1.jar at spark://node01:39994/jars/spark-examples_2.12-3.2.1.jar with timestamp 1654491863010
22/06/06 13:04:24 INFO Executor: Starting executor ID driver on host node01
....
22/06/06 13:04:25 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 0.637438 s
Pi is roughly 3.142031142031142
....
Parameter notes:
(1) --class specifies the main class of the application to run;
(2) --master local[2] sets the deploy mode; the default is local, and the number in brackets is how many virtual CPU cores to allocate;
(3) spark-examples_2.12-3.2.1.jar is the jar that contains the application class;
(4) the trailing 10 is the program's input argument, here the number of tasks to split the job into.
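SparkPi estimates π by Monte Carlo sampling spread over the given number of tasks: throw random points into the unit square and count how many fall inside the quarter circle. The same idea in miniature, as a rough awk sketch (single process, arbitrary sample count, not the actual SparkPi code):

```shell
# Monte Carlo estimate of pi, like SparkPi but on one machine:
# pi ~= 4 * (points inside the quarter circle) / (total points).
awk 'BEGIN {
  srand(42);                       # fixed seed so the run is repeatable
  n = 200000;
  for (i = 0; i < n; i++) {
    x = rand(); y = rand();
    if (x*x + y*y <= 1) inside++;  # point landed inside the quarter circle
  }
  printf "Pi is roughly %f\n", 4 * inside / n;
}'
```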
2. Yarn Mode
In Standalone mode Spark supplies its own compute resources, with no other framework involved, which lowers coupling to third-party resource frameworks and makes Spark highly independent. However, Spark is primarily a compute framework, not a resource scheduler, so resource scheduling is not its strength; integrating with a dedicated resource scheduler is more reliable. Since YARN is very widely used in production in China, this section builds the Spark runtime on YARN.
2.1 Unpack the Archive
Upload spark-3.2.1-bin-hadoop3.2.tgz to the Linux host, extract it, and place it in the target directory.
tar -zxvf spark-3.2.1-bin-hadoop3.2.tgz -C /opt/soft
cd /opt/soft
mv spark-3.2.1-bin-hadoop3.2/ spark-3.2.1-yarn/
2.2 Modify Configuration Files
(1) Modify the Hadoop configuration file and distribute it to all nodes
vim /opt/soft/hadoop-3.3.3/etc/hadoop/yarn-site.xml
<!-- Whether to run a thread that checks each task's physical memory usage and kills any task exceeding its allocation; default is true -->
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<!-- Whether to run a thread that checks each task's virtual memory usage and kills any task exceeding its allocation; default is true -->
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
ssh_do_scp.sh ~/bin/node.list /opt/soft/hadoop-3.3.3/etc/hadoop/yarn-site.xml /opt/soft/hadoop-3.3.3/etc/hadoop/
(2) Modify conf/spark-env.sh to add the JAVA_HOME and YARN_CONF_DIR settings
cp spark-env.sh.template spark-env.sh
vim spark-env.sh
export JAVA_HOME=/opt/soft/jdk1.8.0_261
export YARN_CONF_DIR=/opt/soft/hadoop-3.3.3/etc/hadoop
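Before submitting to YARN it's worth confirming that the directory YARN_CONF_DIR points at actually contains yarn-site.xml, since a wrong path makes Spark fall back to default (empty) YARN settings. A tiny check function; the path in the example follows this article's layout:

```shell
# Sanity check: does the Hadoop conf dir Spark will read actually hold yarn-site.xml?
check_yarn_conf() {
  local dir="$1"
  if [ -f "$dir/yarn-site.xml" ]; then
    echo "OK: $dir"
  else
    echo "MISSING yarn-site.xml in $dir"
  fi
}
# Example: check_yarn_conf /opt/soft/hadoop-3.3.3/etc/hadoop
```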
2.3 Submit an Application
Start the HDFS and YARN clusters, then submit the application to YARN.
[root@node01 spark-3.2.1-yarn]# bin/spark-submit \
> --class org.apache.spark.examples.SparkPi \
> --master yarn \
> --deploy-mode cluster \
> ./examples/jars/spark-examples_2.12-3.2.1.jar \
> 10
2022-06-06 13:38:37,992 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2022-06-06 13:38:38,080 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at node02/192.168.31.102:8032
...
2022-06-06 13:38:48,322 INFO yarn.Client: Submitting application application_1654493828584_0001 to ResourceManager
2022-06-06 13:38:48,711 INFO impl.YarnClientImpl: Submitted application application_1654493828584_0001
2022-06-06 13:38:49,718 INFO yarn.Client: Application report for application_1654493828584_0001 (state: ACCEPTED)
2022-06-06 13:38:49,722 INFO yarn.Client:
client token: N/A
diagnostics: AM container is launched, waiting for AM container to Register with RM
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1654493928432
final status: UNDEFINED
tracking URL: http://node02:8088/proxy/application_1654493828584_0001/
user: root
2022-06-06 13:38:50,726 INFO yarn.Client: Application report for application_1654493828584_0001 (state: ACCEPTED)
2022-06-06 13:38:51,729 INFO yarn.Client: Application report for application_1654493828584_0001 (state: RUNNING)
2022-06-06 13:38:55,745 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: node01
ApplicationMaster RPC port: 37882
queue: default
start time: 1654493928432
final status: UNDEFINED
tracking URL: http://node02:8088/proxy/application_1654493828584_0001/
user: root
report for application_1654493828584_0001 (state: FINISHED)
2022-06-06 13:39:07,807 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: node01
ApplicationMaster RPC port: 37882
queue: default
start time: 1654493928432
final status: SUCCEEDED
tracking URL: http://node02:8088/proxy/application_1654493828584_0001/
user: root
2022-06-06 13:39:07,818 INFO util.ShutdownHookManager: Shutdown hook called
2022-06-06 13:39:07,819 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-3515b475-3558-42e3-bbfc-89c56a10bc6f
2022-06-06 13:39:07,821 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-1f7e03ac-815f-4a16-904e-c6fb64b8d683
Open http://node02:8088, then click History to view the history page.
2.4 Configure the History Server
cd /opt/soft/spark-3.2.1-yarn/conf
(1) Rename spark-defaults.conf.template to spark-defaults.conf
mv spark-defaults.conf.template spark-defaults.conf
Configure the event log storage path:
vim spark-defaults.conf
spark.eventLog.enabled true
spark.eventLog.dir hdfs://node01:8020/sparkHistory
spark.yarn.historyServer.address=node01:18080
spark.history.ui.port=18080
(2) Create the log directory on HDFS. Note: the Hadoop cluster must be running, and the directory must exist on HDFS before applications are submitted.
[root@node01 conf]# hadoop fs -mkdir /sparkHistory
[root@node01 conf]# hadoop fs -ls /
Found 3 items
drwxr-xr-x - root supergroup 0 2022-06-06 13:47 /sparkHistory
drwx------ - root supergroup 0 2022-06-02 11:46 /tmp
drwxr-xr-x - root supergroup 0 2022-06-06 13:38 /user
(3) Modify spark-env.sh to add the history log configuration
vim spark-env.sh
export SPARK_HISTORY_OPTS="
-Dspark.history.ui.port=18080
-Dspark.history.fs.logDirectory=hdfs://node01:8020/sparkHistory
-Dspark.history.retainedApplications=30"
Parameter 1: the history server Web UI listens on port 18080.
Parameter 2: the HDFS path the history server reads event logs from.
Parameter 3: the number of application histories to retain; beyond this limit, the oldest application info is evicted. This counts applications held in memory, not the number shown on the page.
(4) Start the history server
[root@node01 spark-3.2.1-yarn]# sbin/start-history-server.sh
starting org.apache.spark.deploy.history.HistoryServer, logging to /opt/soft/spark-3.2.1-yarn/logs/spark-root-org.apache.spark.deploy.history.HistoryServer-1-node01.out
(5) Resubmit the application
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
./examples/jars/spark-examples_2.12-3.2.1.jar \
10
(6) View the logs in the web UI: http://node02:8088
3. Summary
(1) Deployment mode comparison:

| Mode | Machines | Processes started | Resource owner |
| --- | --- | --- | --- |
| Local | 1 | none (runs inside a single JVM) | Spark |
| Standalone | cluster | Master and Worker | Spark |
| Yarn | cluster | Yarn and HDFS | Hadoop |
(2) Port reference:
Spark shell job monitoring (running tasks): 4040
Spark history server: 18080
Hadoop YARN job monitoring: 8088