1. Purpose
The purpose of setting up Spark here is offline (batch) computation.
2. Prerequisites
Spark is installed on top of an already working Hadoop cluster.
For example, my Hadoop cluster runs under the work user:
/home/work/hadoop-2.9.2 (Hadoop directory)
/home/work/jdk1.8.0_171 (Java directory)
Scala package: scala-2.10.4.tgz
Spark package: spark-2.4.0-bin-hadoop2.7.tgz
(wget http://mirrors.shu.edu.cn/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz)
(Note: Spark 2.4.0 is prebuilt against Scala 2.11, as the bundled spark-examples_2.11 jar shows; the standalone Scala 2.10.4 installed below is only used by the local scala command, not by Spark itself.)
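Before starting, it does no harm to sanity-check the prerequisites; these are standard JDK/Hadoop commands, with paths following the layout above:
[work@hserver1 ~]$ /home/work/jdk1.8.0_171/bin/java -version
[work@hserver1 ~]$ /home/work/hadoop-2.9.2/bin/hadoop version
[work@hserver1 ~]$ /home/work/hadoop-2.9.2/bin/hdfs dfsadmin -report    # all DataNodes should be listed as live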
3. Install Scala
Spark is written in Scala. Installation steps:
Move scala-2.10.4.tgz to /home/work/
Move spark-2.4.0-bin-hadoop2.7.tgz to /home/work/
Then unpack Scala and add it to the environment:
[work@hserver1 ~]$ tar -zxvf scala-2.10.4.tgz
[work@hserver1 ~]$ vim .bashrc
Append the following lines:
export SCALA_HOME=/home/work/scala-2.10.4
export PATH=$PATH:$SCALA_HOME/bin
[work@hserver1 ~]$ source ~/.bashrc
Verify:
[work@hserver1 ~]$ scala -version
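If the installation succeeded, the output should look roughly like the line below (the exact copyright text may vary):
Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL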
4. Install Spark
Unpack Spark:
[work@hserver1 ~]$ tar -zxvf spark-2.4.0-bin-hadoop2.7.tgz
[work@hserver1 ~]$ cp -rf /home/work/spark-2.4.0-bin-hadoop2.7/conf/spark-env.sh.template /home/work/spark-2.4.0-bin-hadoop2.7/conf/spark-env.sh
[work@hserver1 ~]$ vim /home/work/spark-2.4.0-bin-hadoop2.7/conf/spark-env.sh
Add the following settings:
export SCALA_HOME=/home/work/scala-2.10.4
export JAVA_HOME=/home/work/jdk1.8.0_171
export HADOOP_HOME=/home/work/hadoop-2.9.2
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_MASTER_HOST=hserver1    # SPARK_MASTER_IP is deprecated since Spark 2.x
export SPARK_LOCAL_DIRS=/home/work/spark-2.4.0-bin-hadoop2.7
export SPARK_DRIVER_MEMORY=1G
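Optionally, standalone worker resources can be capped in the same file. SPARK_WORKER_CORES and SPARK_WORKER_MEMORY are standard spark-env.sh settings; the values below are only illustrative:
export SPARK_WORKER_CORES=2      # CPU cores each worker may use (example value)
export SPARK_WORKER_MEMORY=2g    # memory each worker may use (example value)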
[work@hserver1 ~]$ cp /home/work/spark-2.4.0-bin-hadoop2.7/conf/slaves.template /home/work/spark-2.4.0-bin-hadoop2.7/conf/slaves
[work@hserver1 ~]$ vim /home/work/spark-2.4.0-bin-hadoop2.7/conf/slaves
List the worker hostnames, one per line:
hserver2
hserver3
5. Distribute the configured Spark and Scala to the other cluster machines
Because passwordless SSH was set up when the Hadoop cluster was built, no password is needed when copying files to the other machines.
[work@hserver1 ~]$ scp -r /home/work/scala-2.10.4 work@hserver2:/home/work
[work@hserver1 ~]$ scp -r /home/work/scala-2.10.4 work@hserver3:/home/work
[work@hserver1 ~]$ scp -r /home/work/spark-2.4.0-bin-hadoop2.7 work@hserver2:/home/work
[work@hserver1 ~]$ scp -r /home/work/spark-2.4.0-bin-hadoop2.7 work@hserver3:/home/work
[work@hserver1 ~]$ scp ~/.bashrc work@hserver3:/home/work
[work@hserver1 ~]$ scp ~/.bashrc work@hserver2:/home/work
No extra step is needed to load the Scala environment variables on the other machines: the copied .bashrc is sourced automatically at each subsequent login. (Running ssh work@hserverN "source ~/.bashrc" would only affect that single remote session, not later ones.)
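If you also want to run Spark's standalone cluster (which the master/slaves settings above configure), the daemons can be started from the master with the stock sbin scripts; the YARN-based verification in the next section does not require them. Port 8080 is the master web UI's default:
[work@hserver1 ~]$ /home/work/spark-2.4.0-bin-hadoop2.7/sbin/start-all.sh
[work@hserver1 ~]$ jps    # a Master process should appear here; hserver2 and hserver3 should each show a Worker
Then browse http://hserver1:8080 to confirm both workers have registered.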
6. Verify
[work@hserver1 ~]$ cd /home/work/spark-2.4.0-bin-hadoop2.7
[work@hserver1 spark-2.4.0-bin-hadoop2.7]$ bin/spark-submit --master yarn --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.4.0.jar 10
Result:
2019-01-22 14:40:24 INFO DAGScheduler:54 - Job 0 finished: reduce at SparkPi.scala:38, took 1.095592 s
Pi is roughly 3.1414551414551415    // the computed value of Pi is about 3.14
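As a further check that matches the stated purpose (offline computation), a small job can also be run interactively in spark-shell. This is only a minimal sketch; the HDFS input path below is hypothetical and assumes you have uploaded a text file there first:
[work@hserver1 spark-2.4.0-bin-hadoop2.7]$ bin/spark-shell --master yarn
scala> // count the lines of a (hypothetical) file already on HDFS
scala> val lines = sc.textFile("hdfs:///user/work/input/sample.txt")
scala> lines.count()
scala> :quit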