1. I originally planned to write up a Spark 2.3 installation, but Hadoop was set up with JDK 1.7 and Spark 2.3 only supports JDK 1.8; if Spark and Hadoop are installed against different JDK versions, Spark jobs fail when run on YARN. So this note records a Spark 1.x installation instead.
2. A particular note: do not use the CDH build of Spark here, because some jars are missing from it; use the Apache build instead.
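For reference, downloading and unpacking the Apache build can look like the following (the archive URL and install path below are assumptions, adjust them to your environment):
wget https://archive.apache.org/dist/spark/spark-1.6.3/spark-1.6.3-bin-hadoop2.6.tgz
tar -zxf spark-1.6.3-bin-hadoop2.6.tgz -C /opt/cdh5.15.0/
mv /opt/cdh5.15.0/spark-1.6.3-bin-hadoop2.6 /opt/cdh5.15.0/spark-1.6.3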
3. The extracted Spark directory looks like this:
[zuowei.zhang@master spark-1.6.3]$ ll
total 1380
drwxr-xr-x 2 zuowei.zhang zuowei.zhang 4096 Nov 3 2016 bin
-rw-r--r-- 1 zuowei.zhang zuowei.zhang 1343562 Nov 3 2016 CHANGES.txt
drwxr-xr-x 2 zuowei.zhang zuowei.zhang 212 Dec 24 08:35 conf
drwxr-xr-x 3 zuowei.zhang zuowei.zhang 19 Nov 3 2016 data
drwxr-xr-x 3 zuowei.zhang zuowei.zhang 79 Nov 3 2016 ec2
drwxr-xr-x 3 zuowei.zhang zuowei.zhang 17 Nov 3 2016 examples
drwxr-xr-x 2 zuowei.zhang zuowei.zhang 237 Nov 3 2016 lib
-rw-r--r-- 1 zuowei.zhang zuowei.zhang 17352 Nov 3 2016 LICENSE
drwxr-xr-x 2 zuowei.zhang zuowei.zhang 4096 Nov 3 2016 licenses
-rw-r--r-- 1 zuowei.zhang zuowei.zhang 23529 Nov 3 2016 NOTICE
drwxr-xr-x 6 zuowei.zhang zuowei.zhang 119 Nov 3 2016 python
drwxr-xr-x 3 zuowei.zhang zuowei.zhang 17 Nov 3 2016 R
-rw-r--r-- 1 zuowei.zhang zuowei.zhang 3359 Nov 3 2016 README.md
-rw-r--r-- 1 zuowei.zhang zuowei.zhang 120 Nov 3 2016 RELEASE
drwxr-xr-x 2 zuowei.zhang zuowei.zhang 4096 Nov 3 2016 sbin
4. Configure the slaves file and spark-env.sh under the conf directory.
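Both files can be created from the templates shipped under conf (a minimal sketch, run from the Spark install directory):
cd /opt/cdh5.15.0/spark-1.6.3/conf
cp slaves.template slaves
cp spark-env.sh.template spark-env.sh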
The slaves file lists the worker nodes:
# A Spark Worker will be started on each of the machines listed below.
master.cn
slave1.cn
slave2.cn
spark-env.sh is configured as follows:
export JAVA_HOME=/opt/java/jdk1.7.0_67
export SPARK_MASTER_IP=master.cn
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=1
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_MEMORY=1g
#spark on yarn
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=/opt/cdh5.15.0/spark-1.6.3
export SPARK_JAR=/opt/cdh5.15.0/spark-1.6.3/lib/spark-assembly-1.6.3-hadoop2.6.0.jar
export PATH=$SPARK_HOME/bin:$PATH
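Before distributing anything, the local install can be sanity-checked with a local-mode shell (no cluster needed; exit with :quit):
bin/spark-shell --master local[2]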
5. Distribute the Spark directory to every node:
scp -r /opt/cdh5.15.0/spark-1.6.3/ slave1.cn:/opt/cdh5.15.0/
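Only slave1.cn is shown above; the same copy is needed on every node listed in the slaves file, for example with a simple loop:
for host in slave1.cn slave2.cn; do scp -r /opt/cdh5.15.0/spark-1.6.3/ ${host}:/opt/cdh5.15.0/; done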
6. Examples of running Spark on YARN:
Client mode (the result is printed in the local terminal, e.g. in Xshell):
bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --executor-memory 1G --num-executors 1 lib/spark-examples-1.6.3-hadoop2.6.0.jar 100
Cluster mode (the result is visible in the YARN web UI on port 8088):
bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --executor-memory 1G --num-executors 1 lib/spark-examples-1.6.3-hadoop2.6.0.jar 100
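In cluster mode the driver runs inside YARN, so the result is not printed locally; with YARN log aggregation enabled it can also be fetched from the command line (take the real application id from the 8088 UI):
yarn logs -applicationId <application_id> | grep "Pi is roughly"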
7. Configure the Spark HistoryServer, and distribute the following configuration to every node.
Configure spark-defaults.conf under the conf directory:
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.
# Example:
# spark.master spark://master:7077
spark.eventLog.enabled true
spark.eventLog.dir hdfs://master.cn:8020/sparklog
# spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
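The event log directory must already exist in HDFS before jobs are submitted, otherwise event logging fails. It can be created up front (assuming the NameNode is running at master.cn:8020):
hdfs dfs -mkdir -p /sparklog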
Add the following to spark-env.sh:
#HistoryServer
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.retainedApplications=30 -Dspark.history.fs.logDirectory=hdfs://master.cn:8020/sparklog"
8. Run start-history-server.sh on any node; the history UI can then be viewed on that node at http://master.cn:18080/
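For example, on the chosen node (run from SPARK_HOME; the HistoryServer process should then appear in jps):
sbin/start-history-server.sh
jps | grep HistoryServer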