1. Download the Spark distribution
Official download page: http://spark.apache.org/downloads.html
2. Install Spark
2.1. Upload and extract the archive
[potter@potter2 ~]$ tar -zxvf spark-2.3.0-bin-hadoop2.7.tgz -C apps/
2.2. Edit the configuration files
(1) Change into the configuration directory
/home/potter/apps/spark-2.3.0-bin-hadoop2.7/conf
[potter@potter2 conf]$ ll
total 36
-rw-r--r-- 1 potter potter 996 Feb 23 03:42 docker.properties.template
-rw-r--r-- 1 potter potter 1105 Feb 23 03:42 fairscheduler.xml.template
-rw-r--r-- 1 potter potter 2025 Feb 23 03:42 log4j.properties.template
-rw-r--r-- 1 potter potter 7801 Feb 23 03:42 metrics.properties.template
-rw-r--r-- 1 potter potter 865 Feb 23 03:42 slaves.template
-rw-r--r-- 1 potter potter 1292 Feb 23 03:42 spark-defaults.conf.template
-rwxr-xr-x 1 potter potter 4221 Feb 23 03:42 spark-env.sh.template
(2) Edit spark-env.sh
Copy spark-env.sh.template to spark-env.sh and append the following settings at the end of the file:
[potter@potter2 conf]$ cp spark-env.sh.template spark-env.sh
[potter@potter2 conf]$ vi spark-env.sh
export JAVA_HOME=/usr/local/java/jdk1.8.0_73
#export SCALA_HOME=/usr/share/scala
export HADOOP_HOME=/home/potter/apps/hadoop-2.7.5
export HADOOP_CONF_DIR=/home/potter/apps/hadoop-2.7.5/etc/hadoop
export SPARK_WORKER_MEMORY=500m
export SPARK_WORKER_CORES=1
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=potter2:2181,potter3:2181,potter4:2181,potter5:2181 -Dspark.deploy.zookeeper.dir=/spark"
Note: the line "#export SPARK_MASTER_IP=hadoop1" must stay commented out. The Spark parameters configured here may differ from what you would use on a real cluster; the small values allow for a personal machine, because if the worker memory is set too high the machine runs very slowly.
Explanation:
-Dspark.deploy.recoveryMode=ZOOKEEPER: the entire cluster state is maintained, and recovered, through ZooKeeper. In other words, this is Spark's HA setup: for the standby Master to become the active Master after the active one dies, the standby reads the full cluster state from ZooKeeper and then restores the state of all Workers, all Drivers, and all Applications.
-Dspark.deploy.zookeeper.url=potter2:2181,potter3:2181,potter4:2181,potter5:2181: list every ZooKeeper node, including every machine that might serve as the (active) Master. (Four machines are used here, so four are listed.)
-Dspark.deploy.zookeeper.dir=/spark: how does this dir differ from dataDir in ZooKeeper's zoo.cfg? dataDir is the local-disk directory where the ZooKeeper server stores its own data, whereas spark.deploy.zookeeper.dir is a znode path inside ZooKeeper under which Spark persists its metadata, including job run state. ZooKeeper thus holds all of the Spark cluster's state information: all Workers, all Applications, and all Drivers.
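The three -D options are plain Java system properties joined into one variable, which makes the line easy to mistype. A quick way to sanity-check the value before pasting it into spark-env.sh is to assemble and print it separately (hostnames are the ones used in this walkthrough):

```shell
#!/bin/sh
# Assemble the HA options exactly as spark-env.sh expects them,
# then print the result so typos are easy to spot.
ZK_QUORUM="potter2:2181,potter3:2181,potter4:2181,potter5:2181"
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
-Dspark.deploy.zookeeper.url=${ZK_QUORUM} \
-Dspark.deploy.zookeeper.dir=/spark"
echo "$SPARK_DAEMON_JAVA_OPTS"
```

The backslash line continuations inside the double quotes collapse into single spaces, so the printed value is the one-line string the Spark daemons will see.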
(3) Copy slaves.template to slaves
[potter@potter2 conf]$ cp slaves.template slaves
[potter@potter2 conf]$ vi slaves
Add the following worker hostnames:
potter2
potter3
potter4
potter5
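The slaves file is nothing more than one worker hostname per line, so it can also be generated from a host list. A minimal sketch (writing to a temp file so it is safe to run anywhere; on the real cluster the target would be $SPARK_HOME/conf/slaves):

```shell
#!/bin/sh
# Write one worker hostname per line into a slaves file.
# mktemp is used here so the sketch does not touch a real install.
SLAVES_FILE=$(mktemp)
for host in potter2 potter3 potter4 potter5; do
    echo "$host"
done > "$SLAVES_FILE"
cat "$SLAVES_FILE"
```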
(4) Distribute the installation directory to the other nodes
[potter@potter2 apps]$ scp -r spark-2.3.0-bin-hadoop2.7/ potter3:$PWD
[potter@potter2 apps]$ scp -r spark-2.3.0-bin-hadoop2.7/ potter4:$PWD
[potter@potter2 apps]$ scp -r spark-2.3.0-bin-hadoop2.7/ potter5:$PWD
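Repeating the same scp for each node invites typos; a loop covers all targets in one go. Shown here as a dry run that only prints the commands, since the potterN hosts exist only in this walkthrough; drop the surrounding echo to actually copy once passwordless ssh is set up:

```shell
#!/bin/sh
# Dry-run distribution loop: print one scp command per target node.
# $PWD is kept literal so the output matches the command as typed;
# when run for real, the shell expands it to the current directory.
DIST_CMDS=$(for host in potter3 potter4 potter5; do
    echo "scp -r spark-2.3.0-bin-hadoop2.7/ $host:\$PWD"
done)
echo "$DIST_CMDS"
```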
2.3. Configure environment variables
[potter@potter2 ~]$ vi .bashrc
export SPARK_HOME=/home/potter/apps/spark-2.3.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
Save the file, then source it so the change takes effect immediately
[potter@potter2 ~]$ source .bashrc
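What those two .bashrc lines accomplish can be seen in an isolated shell: SPARK_HOME anchors the install location, and appending its bin/ directory to PATH lets spark-shell and spark-submit resolve from any directory (path as used in this walkthrough):

```shell
#!/bin/sh
# Reproduce the two .bashrc lines and show the resulting PATH.
SPARK_HOME=/home/potter/apps/spark-2.3.0-bin-hadoop2.7
PATH=$PATH:$SPARK_HOME/bin
echo "$PATH"
```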
2.4. Configure spark-defaults.conf
Create spark-defaults.conf from its template
[potter@potter2 conf]$ cp spark-defaults.conf.template spark-defaults.conf
[potter@potter2 conf]$ vi spark-defaults.conf
# This is useful for setting default environmental settings.
# Example:
spark.master spark://potter2:7077,potter3:7077,potter4:7077,potter5:7077
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
[potter@potter2 conf]$ scp -r spark-defaults.conf potter3:$PWD
[potter@potter2 conf]$ scp -r spark-defaults.conf potter4:$PWD
[potter@potter2 conf]$ scp -r spark-defaults.conf potter5:$PWD
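With spark.master listing several host:port pairs, a client tries each candidate master in turn until it reaches the active one. Splitting the URL makes the candidate list explicit (a sketch using the URL configured above):

```shell
#!/bin/sh
# Split a multi-master spark:// URL into its candidate masters.
MASTER_URL="spark://potter2:7077,potter3:7077,potter4:7077,potter5:7077"
hosts=${MASTER_URL#spark://}            # strip the scheme prefix
MASTERS=$(echo "$hosts" | tr ',' '\n')  # one host:port per line
echo "$MASTERS"
```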
3. Start everything
3.1. Start the ZooKeeper ensemble first
Run on every ZooKeeper node:
[potter@potter2 ~]$ zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /home/potter/apps/zookeeper-3.4.10/bin/../conf/zoo.cfg
Starting zookeeper ... already running as process 3703.
[potter@potter2 ~]$ zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /home/potter/apps/zookeeper-3.4.10/bin/../conf/zoo.cfg
Mode: follower
3.2. Start the HDFS cluster
Running this on any one node is enough:
[potter@potter2 ~]$ start-dfs.sh
3.3. Then start the Spark cluster
[potter@potter2 ~]$ cd apps/spark-2.3.0-bin-hadoop2.7/sbin/
[potter@potter2 sbin]$ ./start-all.sh
3.4. Check the processes
[potter@potter2 sbin]$ jps
6464 Master
6528 Worker
6561 Jps
3909 NameNode
3703 QuorumPeerMain
5047 NodeManager
4412 DFSZKFailoverController
4204 JournalNode
4014 DataNode
[potter@potter3 conf]$ jps
4609 Jps
3441 DataNode
3284 QuorumPeerMain
4581 Worker
3879 NodeManager
3576 JournalNode
3372 NameNode
3676 DFSZKFailoverController
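Note that ./start-all.sh starts a Master only on the node where it is run, which is why the jps output shows a Master process on potter2 alone; a standby Master would be started separately on another node with start-master.sh. Once more than one Master is up, each Master's web UI (port 8080 by default) reports whether it is ALIVE (active) or in STANDBY. The commands below are printed as a dry run, since they only make sense against the live cluster:

```shell
#!/bin/sh
# Print one status-check command per candidate master (dry run).
# On the real cluster, run them: the active master's web UI page
# shows "Status: ALIVE", a standby's shows "Status: STANDBY".
CHECKS=$(for host in potter2 potter3 potter4 potter5; do
    echo "curl -s http://$host:8080"
done)
echo "$CHECKS"
```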