Environment
CentOS 7.0
hadoop 2.7.3 (cluster set up following "CentOS 7.0 + hadoop 2.7 cluster setup")
scala 2.12.4
spark 2.2.0
Download and Install
scala
Download page: http://www.scala-lang.org/download/
Baidu Cloud mirror: https://pan.baidu.com/s/1kVFyb3p (password: nffb)
After downloading, upload the archive to /data/software on the cluster, then install:
mkdir /opt/scala
cd /data/software
tar -zxvf scala-2.12.4.tgz -C /opt/scala/
spark
Download page: http://spark.apache.org/downloads.html
Baidu Cloud mirror: https://pan.baidu.com/s/1pL0wEa3 (password: zyvw)
Or fetch it directly with wget:
wget http://mirrors.hust.edu.cn/apache/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
After downloading to /data/software, install:
mkdir /opt/spark
cd /data/software
tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz -C /opt/spark/
cd /opt/spark
mv spark-2.2.0-bin-hadoop2.7/ spark-2.2.0
Configuration
scala
Edit the environment variables
vi /etc/profile
# add
export SCALA_HOME=/opt/scala/scala-2.12.4
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HIVE_HOME/bin:$ZK_HOME/bin:$HBASE_HOME/bin:$FLUME_HOME/bin:$KAFKA_HOME/bin:$SCALA_HOME/bin
# apply
source /etc/profile
# verify
scala -version
# output
Scala code runner version 2.12.4 -- Copyright 2002-2017, LAMP/EPFL and Lightbend, Inc.
spark
Edit the environment variables
vi /etc/profile
# add
export SPARK_HOME=/opt/spark/spark-2.2.0
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HIVE_HOME/bin:$ZK_HOME/bin:$HBASE_HOME/bin:$FLUME_HOME/bin:$KAFKA_HOME/bin:$SCALA_HOME/bin:$SPARK_HOME/bin
# apply
source /etc/profile
Edit spark-env.sh
cd /opt/spark/spark-2.2.0/conf
cp spark-env.sh.template spark-env.sh
vi spark-env.sh
# add
export SCALA_HOME=/opt/scala/scala-2.12.4
export JAVA_HOME=/opt/java/jdk1.8.0_60
export HADOOP_HOME=/opt/hadoop/hadoop-2.7.3
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=/opt/spark/spark-2.2.0
export SPARK_MASTER_IP=master
export SPARK_EXECUTOR_MEMORY=2G
Edit slaves
cp slaves.template slaves
vi slaves
# remove localhost, then add:
slave1
slave2
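In standalone mode, each worker node needs the same Scala and Spark installation at the same paths as the master. A minimal distribution sketch, assuming passwordless SSH and the slave1/slave2 hostnames from the slaves file (adjust hostnames and paths to your cluster):

```shell
# Copy the Scala and Spark installs (including the edited conf/) to every worker.
# Assumes passwordless SSH and the hostnames listed in the slaves file.
for host in slave1 slave2; do
  ssh $host "mkdir -p /opt/scala /opt/spark"
  scp -r /opt/scala/scala-2.12.4 $host:/opt/scala/
  scp -r /opt/spark/spark-2.2.0 $host:/opt/spark/
done
```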
Start
cd /opt/spark/spark-2.2.0/sbin
./start-all.sh
# output
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/spark-2.2.0/logs/spark-root-org.apache.spark.deploy.master.Master-1-slave1.out
slave1: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/spark-2.2.0/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-slave1.out
slave2: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/spark-2.2.0/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-slave2.out
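A quick way to confirm the daemons actually came up is jps (it ships with the JDK) on each node:

```shell
# On the master node, expect a line ending in "Master";
# on slave1/slave2, a line ending in "Worker".
jps
```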
Test
Open 192.168.122.128:8080 in a browser; the Spark status page should be displayed.
Note: if Tomcat was installed previously and its default port was never changed, Tomcat will occupy port 8080 and conflict with Spark. To run Tomcat and Spark at the same time, stop whichever is running, then apply one of the following two fixes (either one is enough):
Change Tomcat's default port
cd /opt/tomcat/apache-tomcat-8.5.24/conf
vi server.xml
# search for port 8080 (in vi: /8080); make sure the element is not commented out (<!-- -->)
# <Connector port="8080" protocol="HTTP/1.1"
#            connectionTimeout="20000"
#            redirectPort="8443" />
# change 8080 to 8081
Start Tomcat and Spark, then open 192.168.122.128:8081 and 192.168.122.128:8080; the Tomcat (port 8081) and Spark (port 8080) status pages should both load.
Change Spark's default port
cd /opt/spark/spark-2.2.0/sbin/
vi start-master.sh
# search for port 8080 (in vi: /8080); find
# if [ "$SPARK_MASTER_WEBUI_PORT" = "" ]; then
#   SPARK_MASTER_WEBUI_PORT=8080
# change 8080 to 8081
source start-master.sh
Start Tomcat and Spark, then open 192.168.122.128:8080 and 192.168.122.128:8081; the Tomcat (port 8080) and Spark (port 8081) status pages should both load.
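A variant worth knowing (not in the original steps): Spark also reads SPARK_MASTER_WEBUI_PORT from conf/spark-env.sh, so the port can be overridden there without patching start-master.sh, which keeps the change out of Spark's own scripts:

```shell
# In /opt/spark/spark-2.2.0/conf/spark-env.sh
export SPARK_MASTER_WEBUI_PORT=8081
```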
Run the Spark Pi Example
cd /opt/spark/spark-2.2.0/
# local mode
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master local ./examples/jars/spark-examples_2.11-2.2.0.jar
# yarn-client
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client ./examples/jars/spark-examples_2.11-2.2.0.jar
# yarn-cluster
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster ./examples/jars/spark-examples_2.11-2.2.0.jar
All of them complete successfully with output similar to Pi is roughly 3.1348556742783713:
...
...
17/12/11 15:11:41 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 4044 ms on slave2 (executor 2) (1/2)
17/12/11 15:11:41 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 227 ms on slave2 (executor 2) (2/2)
17/12/11 15:11:41 INFO scheduler.DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 6.993 s
17/12/11 15:11:41 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/12/11 15:11:41 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 7.857732 s
Pi is roughly 3.1348556742783713
17/12/11 15:11:41 INFO server.AbstractConnector: Stopped Spark@1a15b789{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
17/12/11 15:11:41 INFO ui.SparkUI: Stopped Spark web UI at http://192.168.122.128:4040
17/12/11 15:11:41 INFO cluster.YarnClientSchedulerBackend: Interrupting monitor thread
17/12/11 15:11:41 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors
17/12/11 15:11:41 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
17/12/11 15:11:41 INFO cluster.SchedulerExtensionServices: Stopping SchedulerExtensionServices
...
...
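For intuition: SparkPi estimates Pi by Monte Carlo sampling, throwing random points into the square [-1,1]² and counting how many land inside the unit circle. The same idea in a plain awk sketch (no Spark involved):

```shell
# Monte Carlo estimate of Pi: fraction of random points in [-1,1]^2
# that fall inside the unit circle, times 4.
awk 'BEGIN {
  srand(); n = 1000000; c = 0
  for (i = 0; i < n; i++) {
    x = rand()*2 - 1; y = rand()*2 - 1
    if (x*x + y*y <= 1.0) c++
  }
  printf "Pi is roughly %f\n", 4*c/n
}'
```

SparkPi performs the same count, but splits the n samples across executors and sums the partial counts with a reduce, which is why the tasks above run on slave2's executors.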
Open 192.168.122.128:8088 in a browser and click the Applications tab to see the details of the Spark jobs just submitted.
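The same applications can also be listed from the YARN CLI (run on any node with the Hadoop client configured):

```shell
# List completed YARN applications; the SparkPi runs show up here.
yarn application -list -appStates FINISHED
```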
Errors Encountered
Running in the YARN modes failed with a memory error:
...
...
Warning: Master yarn-cluster is deprecated since 2.0. Please use master "yarn" with specified deploy mode instead.
17/12/11 10:47:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/12/11 10:47:08 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.122.128:8032
17/12/11 10:47:08 INFO yarn.Client: Requesting a new application from cluster with 2 NodeManagers
17/12/11 10:47:09 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (2048 MB per container)
Exception in thread "main" java.lang.IllegalArgumentException: Required executor memory (2048+384 MB) is above the max threshold (2048 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:302)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:166)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1091)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1150)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Per the error message, edit the configuration to increase YARN's yarn.scheduler.maximum-allocation-mb and yarn.nodemanager.resource.memory-mb values.
Steps:
Stop Spark and Hadoop, and make sure all of their processes have fully exited
Edit Hadoop's YARN configuration file
vi /opt/hadoop/hadoop-2.7.3/etc/hadoop/yarn-site.xml
# set yarn.scheduler.maximum-allocation-mb and yarn.nodemanager.resource.memory-mb
# to 4096; pick the exact value according to the error message
# <property>
#   <name>yarn.scheduler.maximum-allocation-mb</name>
#   <value>4096</value>
#   <description>Memory available per node, in MB; default 8192 MB</description>
# </property>
# <property>
#   <name>yarn.nodemanager.vmem-pmem-ratio</name>
#   <value>2.1</value>
# </property>
# <property>
#   <name>yarn.nodemanager.resource.memory-mb</name>
#   <value>4096</value>
# </property>
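yarn-site.xml must be identical on every node, or the NodeManagers will keep enforcing the old limits. A sketch of propagating the change, again assuming passwordless SSH and the slave1/slave2 hostnames used earlier (adjust to your cluster):

```shell
# Push the edited yarn-site.xml to all worker nodes before restarting YARN.
for host in slave1 slave2; do
  scp /opt/hadoop/hadoop-2.7.3/etc/hadoop/yarn-site.xml \
      $host:/opt/hadoop/hadoop-2.7.3/etc/hadoop/yarn-site.xml
done
```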
Start Hadoop
Start Spark
Test
cd /opt/spark/spark-2.2.0
# yarn-client
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client ./examples/jars/spark-examples_2.11-2.2.0.jar
# yarn-cluster
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster ./examples/jars/spark-examples_2.11-2.2.0.jar
Both successfully print the value of Pi.
References:
Installing a Spark cluster on Linux (CentOS 7 + Spark 2.1.1 + Hadoop 2.8.0)
Running the official Pi demo in various modes with Spark 2.1.1
Changing the Spark or Hadoop master web UI port