Big Data Applications: Spark 3.2.1 Cluster Installation and Deployment
Installation package: spark-3.2.1-bin-hadoop3.2-scala2.13.tgz
Cluster planning
Node | Role |
---|---|
lsyk01 | master |
lsyk02 | worker |
lsyk03 | worker |
lsyk04 | worker |
Extract and install
tar -zxvf spark-3.2.1-bin-hadoop3.2-scala2.13.tgz -C /opt/
mv /opt/spark-3.2.1-bin-hadoop3.2-scala2.13 /opt/spark-3.2.1
vi /etc/profile
# Add the following
export SPARK_HOME=/opt/spark-3.2.1
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
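After editing /etc/profile, reload it and verify that the Spark binaries are on the PATH; the version output should report 3.2.1:
source /etc/profile
spark-submit --version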
cd /opt/spark-3.2.1/conf
## Configure the log level
cp log4j.properties.template log4j.properties
vi log4j.properties
# Change the log level
log4j.rootCategory=WARN, console
# Run
spark-shell
Test:
val r = sc.parallelize(Array(1, 2, 3, 4))
r.map(_ * 10).collect()
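If the shell is healthy, collect() returns the elements multiplied by 10; under Scala 2.13 the REPL prints something like the following (the result variable name may differ):
val res0: Array[Int] = Array(10, 20, 30, 40)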
Configure the Standalone cluster
# Configure spark-env.sh
cp spark-env.sh.template spark-env.sh
vi spark-env.sh
# Add the following
export JAVA_HOME=/usr/java/jdk1.8.0_333
export SPARK_HOME=/opt/spark-3.2.1
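Optionally, the master host can also be pinned in spark-env.sh; this is a common addition (assuming lsyk01 is the master, per the plan above), though not strictly required here:
# optional: bind the standalone master to lsyk01 explicitly
export SPARK_MASTER_HOST=lsyk01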
# Edit the workers file
cp workers.template workers
vi workers
lsyk02
lsyk03
lsyk04
# Distribute to the worker nodes
scp -r /opt/spark-3.2.1 lsyk02:/opt
scp -r /opt/spark-3.2.1 lsyk03:/opt
scp -r /opt/spark-3.2.1 lsyk04:/opt
scp /etc/profile lsyk02:/etc
scp /etc/profile lsyk03:/etc
scp /etc/profile lsyk04:/etc
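Equivalently, the distribution can be scripted as a small loop (a sketch assuming passwordless SSH from lsyk01 to the other nodes):
for host in lsyk02 lsyk03 lsyk04; do
  scp -r /opt/spark-3.2.1 ${host}:/opt
  scp /etc/profile ${host}:/etc
done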
Start the cluster
Start from lsyk01.
Because Hadoop comes first in PATH, a bare start-all.sh resolves to Hadoop's script, so the full path is required:
/opt/spark-3.2.1/sbin/start-all.sh
Open the Spark master web UI at http://lsyk01:8080/
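To confirm the daemons are up, jps should show a Master process on lsyk01 and a Worker process on lsyk02-04 (alongside any Hadoop daemons):
# run on each node
jps | grep -E 'Master|Worker'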
Test the cluster
spark-submit --class org.apache.spark.examples.SparkPi --master spark://lsyk01:7077 --num-executors 1 /opt/spark-3.2.1/examples/jars/spark-examples_2.13-3.2.1.jar
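This submission runs in client deploy mode by default, so the result is printed to the console of the submitting shell; a successful run ends with a line like the following (the exact digits vary between runs):
Pi is roughly 3.14...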
Stop the cluster
/opt/spark-3.2.1/sbin/stop-all.sh
YARN mode
vi /opt/spark-3.2.1/conf/spark-env.sh
# Add the following
YARN_CONF_DIR=/opt/hadoop-3.3.3/etc/hadoop
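Spark on YARN only needs the client-side Hadoop configuration; the value above must point at the directory containing core-site.xml and yarn-site.xml (HADOOP_CONF_DIR works as well). A quick check that the path is right:
ls /opt/hadoop-3.3.3/etc/hadoop/yarn-site.xml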
# Distribute to the other nodes (run from /opt/spark-3.2.1/conf so $PWD resolves to the conf directory)
scp spark-env.sh lsyk02:$PWD
scp spark-env.sh lsyk03:$PWD
scp spark-env.sh lsyk04:$PWD
Start Hadoop and Spark
$HADOOP_HOME/sbin/start-all.sh
$SPARK_HOME/sbin/start-all.sh
Test
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster /opt/spark-3.2.1/examples/jars/spark-examples_2.13-3.2.1.jar
Check the YARN UI at http://lsyk01:8088
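Because the job was submitted with --deploy-mode cluster, the Pi result is written to the driver's container log rather than the local console. It can be read through the YARN UI or with the yarn CLI (substitute the real application ID printed by spark-submit):
# replace <application_id> with the applicationId reported by spark-submit
yarn logs -applicationId <application_id> | grep "Pi is roughly"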
Configure the history server
cd /opt/spark-3.2.1/conf
cp spark-defaults.conf.template spark-defaults.conf
vi spark-defaults.conf
# Add or uncomment and modify the following
# hdfs://lsyk01:9000/user/sparklog must be created in advance
spark.eventLog.enabled true
spark.eventLog.dir hdfs://lsyk01:9000/user/sparklog
spark.yarn.historyServer.address lsyk01:18080
spark.history.ui.port 18080
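Create the event log directory on HDFS before submitting any job with event logging enabled (using the NameNode address configured above):
hdfs dfs -mkdir -p hdfs://lsyk01:9000/user/sparklog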
# Edit spark-env.sh
vi spark-env.sh
# Add the history server configuration
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.fs.logDirectory=hdfs://lsyk01:9000/user/sparklog -Dspark.history.retainedApplications=30"
# Sync the files above to the other nodes
scp spark-defaults.conf lsyk02:$PWD
scp spark-defaults.conf lsyk03:$PWD
scp spark-defaults.conf lsyk04:$PWD
scp spark-env.sh lsyk02:$PWD
scp spark-env.sh lsyk03:$PWD
scp spark-env.sh lsyk04:$PWD
Start/stop the history server
$SPARK_HOME/sbin/start-history-server.sh
$SPARK_HOME/sbin/stop-history-server.sh
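Once started, jps on lsyk01 should show a HistoryServer process, and its web UI is served on the port configured above:
jps | grep HistoryServer
# history server UI (per spark.history.ui.port): http://lsyk01:18080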
Resubmit a job to test
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster /opt/spark-3.2.1/examples/jars/spark-examples_2.13-3.2.1.jar
In the YARN UI, click the History link of the completed application to view its run history.
The link jumps to the Spark history server view.