CSDN Topic Challenge, Round 2
Entry topic: Big Data Learning and Growth Records
Spark Cluster Deployment
The project uses a Spark cluster: the analysis requirements are implemented with Spark Core and Spark SQL, so Spark is deployed on top of the existing highly available Hadoop cluster.
Download the Spark and Scala packages
Spark package: https://archive.apache.org/dist/spark/spark-2.4.8/
Download spark-2.4.8-bin-hadoop2.7.tgz
Scala package: https://scala-lang.org/download/2.11.8.html
Download scala-2.11.8.tgz
Cluster Plan
No. | IP | Hostname | Roles | Clusters
---|---|---|---|---
1 | 192.168.137.110 | node1 | NameNode (Active), DFSZKFailoverController (ZKFC), ResourceManager, mysql, RunJar (Hive metastore), RunJar (Hive hiveserver2), Master | Hadoop, Spark
2 | 192.168.137.111 | node2 | DataNode, JournalNode, QuorumPeerMain, NodeManager, RunJar (Hive client, while running), Worker | Zookeeper, Hadoop, Spark
3 | 192.168.137.112 | node3 | DataNode, JournalNode, QuorumPeerMain, NodeManager, RunJar (Hive client, while running), Worker | Zookeeper, Hadoop, Spark
4 | 192.168.137.113 | node4 | DataNode, JournalNode, QuorumPeerMain, NodeManager, RunJar (Hive client, while running), Worker | Zookeeper, Hadoop, Spark
5 | 192.168.137.114 | node5 | NameNode (Standby), DFSZKFailoverController (ZKFC), ResourceManager, JobHistoryServer, RunJar (Hive client, while running), Worker | Hadoop, Spark
Deploy Scala
Spark is written in Scala and many applications use Scala code, so Scala is deployed on every node of the cluster.
Upload the package to node1 and extract it
tar -zxvf scala-2.11.8.tgz -C /opt/soft_installed/
Configure environment variables
# Edit the environment variables
vim /etc/profile
# Append the Scala settings
SCALA_HOME=/opt/soft_installed/scala-2.11.8
PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$ZOOKEEPER_HOME/bin:$HIVE_HOME/bin:$SCALA_HOME/bin
export PATH JAVA_HOME JRE_HOME CLASSPATH HADOOP_HOME HADOOP_LOG_DIR YARN_LOG_DIR HADOOP_CONF_DIR HADOOP_HDFS_HOME HADOOP_YARN_HOME ZOOKEEPER_HOME HIVE_HOME SCALA_HOME
source /etc/profile
Verify the Scala installation
[root@master scala-2.11.8]# scala
Welcome to Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171).
Type in expressions for evaluation. Or try :help.
scala> 2022 + 1991
res0: Int = 4013
Distribute Scala to the other nodes
Apply the same setup on the remaining nodes (node2~node5):
scp -r /opt/soft_installed/scala-2.11.8 node2:/opt/soft_installed
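Instead of repeating the scp command for each node, the distribution and the remote check can be wrapped in one loop; a minimal sketch, assuming passwordless ssh from node1 to node2~node5 (the remote `scala -version` check is an addition, not part of the original steps):

```shell
# Copy Scala and the updated /etc/profile to every other node,
# then verify the Scala version remotely.
for host in node2 node3 node4 node5; do
  scp -r /opt/soft_installed/scala-2.11.8 "$host":/opt/soft_installed/
  scp /etc/profile "$host":/etc/profile
  ssh "$host" 'source /etc/profile && scala -version'
done
```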
Upload and extract the Spark package
Upload the Spark package to node1 with MobaXterm
# Extract
tar -zxvf spark-2.4.8-bin-hadoop2.7.tgz -C /opt/soft_installed/
# Remove the archive
rm -rf spark-2.4.8-bin-hadoop2.7.tgz
Modify the Spark configuration
Configure spark-env.sh
cd /opt/soft_installed/spark-2.4.8-bin-hadoop2.7/conf
cp spark-env.sh.template spark-env.sh
# Append the following at the end of spark-env.sh
vim spark-env.sh
# JDK and Hadoop
export JAVA_HOME=/opt/soft_installed/jdk1.8.0_171
export HADOOP_HOME=/opt/soft_installed/hadoop-2.7.3
export HADOOP_CONF_DIR=/opt/soft_installed/hadoop-2.7.3/etc/hadoop
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native
export SCALA_HOME=/opt/soft_installed/scala-2.11.8
export YARN_CONF_DIR=/opt/soft_installed/hadoop-2.7.3/etc/hadoop
# SPARK
export SPARK_MASTER_IP=node1
export SPARK_MASTER_HOST=node1
export SPARK_MASTER_PORT=7077
export SPARK_LOG_DIR=/opt/soft_installed/spark-2.4.8-bin-hadoop2.7/logs
export SPARK_HOME=/opt/soft_installed/spark-2.4.8-bin-hadoop2.7
export SPARK_PID_DIR=${SPARK_HOME}/pids
export SPARK_MASTER_WEBUI_PORT=8099
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=19022 -Dspark.history.retainedApplications=3 -Dspark.history.fs.logDirectory=hdfs://lh1/spark/history"
# SPARK WORKER
export SPARK_WORKER_MEMORY=512m
export SPARK_WORKER_CORES=1
Configure spark-defaults.conf
cd /opt/soft_installed/spark-2.4.8-bin-hadoop2.7/conf
cp spark-defaults.conf.template spark-defaults.conf
# Append the following at the end
vim spark-defaults.conf
spark.master spark://node1:7077
spark.eventLog.enabled true
spark.eventLog.dir hdfs://lh1/spark/history
spark.eventLog.compress true
spark.yarn.historyServer.address node5:19022
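Since `spark.eventLog.dir` points at HDFS, that directory should exist before the first job runs, otherwise event logging fails at submit time. A minimal sketch (run once on any node with the HDFS client configured; `lh1` is the HA nameservice from the existing Hadoop setup):

```shell
# Create the event-log directory referenced by spark.eventLog.dir
hdfs dfs -mkdir -p /spark/history
hdfs dfs -ls /spark
```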
Configure slaves
cd /opt/soft_installed/spark-2.4.8-bin-hadoop2.7/conf
cp slaves.template slaves
# Append the following at the end of slaves
vim slaves
# A Spark Worker will be started on each of the machines listed below.
node2
node3
node4
node5
Configure yarn-site.xml
cd $HADOOP_HOME/etc/hadoop
vim yarn-site.xml
<!-- Allowed ratio of virtual memory to physical memory per container -->
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
</property>
<!-- The test VMs have very little memory; disable the memory checks so tasks are not killed unexpectedly -->
<!-- Whether to run a thread that checks each task's physical memory usage and kills tasks that exceed their allocation; default is true -->
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<!-- Whether to run a thread that checks each task's virtual memory usage and kills tasks that exceed their allocation; default is true -->
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<!-- Retain aggregated logs for 7 days -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
<!-- Log server -->
<property>
<name>yarn.log.server.url</name>
<value>http://node5:19888/jobhistory/logs</value>
</property>
Distribute the packages
Distribute the configured Spark package from node1 to the other nodes
cd /opt/soft_installed/
scp -r /opt/soft_installed/spark-2.4.8-bin-hadoop2.7 node2:`pwd`
scp -r /opt/soft_installed/spark-2.4.8-bin-hadoop2.7 node3:`pwd`
scp -r /opt/soft_installed/spark-2.4.8-bin-hadoop2.7 node4:`pwd`
scp -r /opt/soft_installed/spark-2.4.8-bin-hadoop2.7 node5:`pwd`
cd $HADOOP_HOME/etc/hadoop
scp yarn-site.xml node2:`pwd`
scp yarn-site.xml node3:`pwd`
scp yarn-site.xml node4:`pwd`
scp yarn-site.xml node5:`pwd`
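The eight scp commands above can be collapsed into a single pass; a sketch, assuming passwordless ssh and that `HADOOP_HOME` is set as in /etc/profile:

```shell
# Distribute the configured Spark package and the updated yarn-site.xml
for host in node2 node3 node4 node5; do
  scp -r /opt/soft_installed/spark-2.4.8-bin-hadoop2.7 "$host":/opt/soft_installed/
  scp "$HADOOP_HOME/etc/hadoop/yarn-site.xml" "$host":"$HADOOP_HOME/etc/hadoop/"
done
```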
Start the cluster
Before starting the Spark cluster, first start:
- the Hadoop cluster
- Hive (if Hive is needed)
/opt/soft_installed/spark-2.4.8-bin-hadoop2.7/sbin/start-all.sh
Check jps on node1 and node5 after startup
Verify the cluster
Check the cluster status in the web UI
Open http://node1:8099/
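The standalone Master can also be polled from a shell: its web UI serves a JSON summary at /json on the same port (8099, as set by SPARK_MASTER_WEBUI_PORT). A sketch; the exact field layout of the JSON is an assumption, so the grep is kept loose:

```shell
# Count occurrences of ALIVE in the Master's JSON status;
# with all four Workers registered this should be at least 4.
curl -s http://node1:8099/json/ | grep -c ALIVE
```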
Submit a jar job
# In yarn-cluster mode the result is not shown on the client;
# check it via yarn application / yarn logs
/opt/soft_installed/spark-2.4.8-bin-hadoop2.7/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 2 \
--queue default \
/opt/soft_installed/spark-2.4.8-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.4.8.jar 1000
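In cluster mode the driver runs inside a YARN container, so "Pi is roughly ..." is written to the driver container's stdout rather than the submitting terminal. Since log aggregation was enabled in yarn-site.xml, it can be pulled back afterwards; the application id below is a placeholder, copy the real one from the spark-submit output or from `yarn application -list`:

```shell
# Placeholder id; substitute the real application id
app_id="application_XXXXXXXXXXXXX_XXXX"
yarn logs -applicationId "$app_id" | grep "Pi is roughly"
```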
# In client mode the result is printed directly on the console
/opt/soft_installed/spark-2.4.8-bin-hadoop2.7/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 2 \
--queue default \
/opt/soft_installed/spark-2.4.8-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.4.8.jar 1000
Pi is roughly 3.1450157250786255
Test with spark-shell
[root@yarnserver hadooptest]# /opt/soft_installed/spark-2.4.8-bin-hadoop2.7/bin/spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/10 00:39:59 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Spark context Web UI available at http://yarnserver.phlh123.cn:4040
Spark context available as 'sc' (master = yarn, app id = application_1665297503457_0018).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.8
/_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.
scala> sc.textFile("hdfs://lh1/wordcount/input/class19_3.txt").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_).collect()
res0: Array[(String, Int)] = Array((大数据专业,3), ("",1), (贵州师范大学,3), (19级,3), (20220908,1))
scala>
Configure a one-key start/stop script
[root@master scripts]# cat onekeyspark.sh
#! /bin/bash
# One-key start/stop for the Spark cluster
case $1 in
"start")
echo "========== now start spark cluster =========="
/opt/soft_installed/spark-2.4.8-bin-hadoop2.7/sbin/start-all.sh
ssh node5 "source /etc/profile; /opt/soft_installed/spark-2.4.8-bin-hadoop2.7/sbin/start-history-server.sh"
ssh node5 "source /etc/profile ; /opt/soft_installed/spark-2.4.8-bin-hadoop2.7/sbin/start-master.sh";;
"stop")
echo "========== now stop spark cluster =========="
/opt/soft_installed/spark-2.4.8-bin-hadoop2.7/sbin/stop-all.sh
ssh node5 "source /etc/profile;/opt/soft_installed/spark-2.4.8-bin-hadoop2.7/sbin/stop-history-server.sh"
ssh node5 "source /etc/profile ; /opt/soft_installed/spark-2.4.8-bin-hadoop2.7/sbin/stop-master.sh";;
*)
echo "Invalid args!"
echo "Usage: $(basename $0) start|stop";;
esac
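Typical usage of the script on node1 (the scripts directory and file name follow the listing above):

```shell
# Make the script executable, then start or stop the whole Spark cluster
chmod +x onekeyspark.sh
./onekeyspark.sh start   # Master + Workers, history server and standby Master on node5
./onekeyspark.sh stop
```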