These days Spark is mostly run on Hadoop 2.x or later, but production environments differ from company to company, and those on Hadoop 2.0+ are still a minority.
Most are still on first-generation Hadoop, so here I deploy a small cluster with Spark 0.9.1 + hadoop 0.20.2-cdh3u5 for testing and learning.
1. Environment Overview
Spark cluster (3 nodes):
web01: slave
web02: master
db01: slave
Hadoop cluster:
hadoop 0.20.2-cdh3u5, 3 nodes
2. Building Spark
I will not go into the build process in detail here; there are already several articles on building Spark.
Step 1: set the Hadoop version you want Spark to work with; supported versions are listed on the Spark website.
Step 2: run sbt/sbt assembly to build the Spark assembly package (see the example command at the end of this section).
I still recommend building with sbt; if you run into problems, see my earlier post on Spark sbt dependency issues.
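As a rough sketch of step 2 (my assumed invocation: for Spark 0.9.x the target Hadoop version is passed to the sbt build through the SPARK_HADOOP_VERSION environment variable, and the CDH3u5 artifacts must be resolvable from your repositories):
# run from the Spark source root
SPARK_HADOOP_VERSION=0.20.2-cdh3u5 sbt/sbt assembly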
3. Checking the Build Output
If the build went fine, an assembly jar matching your Spark and Hadoop versions will be generated under /home/hadoop/shengli/spark/assembly/target/scala-2.10.
total 92896
drwxr-xr-x 3 root root 4096 04-21 14:00 cache
drwxrwxr-x 6 root root 4096 04-21 14:00 ..
-rw-r--r-- 1 root root 95011766 04-21 14:16 spark-assembly-0.9.1-hadoop0.20.2-cdh3u5.jar
drwxrwxr-x 3 root root 4096 04-21 14:20 .
You will also find hadoop-core-0.20.2-cdh3u5.jar under /home/hadoop/shengli/spark/lib_managed/jars.
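To double-check that both artifacts are where the rest of this walkthrough expects them (paths as above):
ls -lh /home/hadoop/shengli/spark/assembly/target/scala-2.10/
ls /home/hadoop/shengli/spark/lib_managed/jars | grep hadoop-core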
4. Spark Configuration
The Spark website documents several ways to launch a cluster. I recommend the official shell scripts: sbin/start-all.sh is simple and quick; if you need to start the Master and slaves individually, use sbin/start-master.sh and sbin/start-slave.sh instead.
4.1 Editing the Spark Environment File (spark-env.sh)
To use these startup scripts, first create the configuration file:
cp spark-env.sh.template spark-env.sh
Set the Master's IP and port (the remaining settings can be configured later):
#!/usr/bin/env bash
# This file contains environment variables required to run Spark. Copy it as
# spark-env.sh and edit that to configure Spark for your site.
#
# The following variables can be set in this file:
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - MESOS_NATIVE_LIBRARY, to point to your libmesos.so if you use Mesos
# - SPARK_JAVA_OPTS, to set node-specific JVM options for Spark. Note that
# we recommend setting app-wide options in the application's driver program.
# Examples of node-specific options : -Dspark.local.dir, GC options
# Examples of app-wide options : -Dspark.serializer
#
# If using the standalone deploy mode, you can also set variables for it here:
# - SPARK_MASTER_IP, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much memory to use (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT
# - SPARK_WORKER_INSTANCES, to set the number of worker processes per node
# - SPARK_WORKER_DIR, to set the working directory of worker processes
export SPARK_MASTER_IP=web02.dw
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=2g
export SPARK_WORKER_INSTANCES=2
# control executor memory
export SPARK_EXECUTOR_MEMORY=1g
export SPARK_JAVA_OPTS="-Dspark.executor.memory=1g"
4.2 Adding the Hadoop Configuration Files to the Classpath
Copy the Hadoop configuration files core-site.xml and hdfs-site.xml into spark/conf.
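For example, assuming a typical CDH3 layout where the Hadoop configuration lives under /usr/lib/hadoop/conf (adjust to your installation):
cp /usr/lib/hadoop/conf/core-site.xml /home/hadoop/shengli/spark/conf/
cp /usr/lib/hadoop/conf/hdfs-site.xml /home/hadoop/shengli/spark/conf/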
4.3 Setting Up the slaves File
vim slaves
# A Spark Worker will be started on each of the machines listed below.
web01.dw
db01.dw
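Note that sbin/start-all.sh starts the Workers by logging into each host listed in slaves over SSH, so passwordless SSH from the Master to each slave is required. If it is not already in place (it usually is for an existing Hadoop cluster), something along these lines on web02.dw sets it up, assuming the cluster runs as root as in the logs below:
ssh-keygen -t rsa          # accept the defaults, empty passphrase
ssh-copy-id root@web01.dw  # repeat for every slave
ssh-copy-id root@db01.dw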
4.4 Distributing Spark to the Slaves
With the basic default configuration done, distribute Spark to each slave node. Note that you should tar it up first, copy the archive, and unpack it on each node, rather than scp-ing the directory tree directly; a sketch follows.
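A minimal sketch, assuming Spark is unpacked to the same path on every node (the startup logs below suggest the actual paths differed slightly per host, so adjust as needed):
cd /home/hadoop/shengli
tar czf spark.tar.gz spark
for host in web01.dw db01.dw; do
  scp spark.tar.gz root@$host:/home/hadoop/shengli/
  ssh root@$host "cd /home/hadoop/shengli && tar xzf spark.tar.gz"
done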
5. Starting Spark
5.1 Starting the cluster:
[root@web02 spark]# sbin/start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /app/home/hadoop/shengli/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-web02.dw.out
web01.dw: starting org.apache.spark.deploy.worker.Worker, logging to /app/home/hadoop/shengli/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-web01.dw.out
db01.dw: starting org.apache.spark.deploy.worker.Worker, logging to /app/hadoop/shengli/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-db01.dw.out
web02 is the Master; web01 and db01 are Workers.
[root@web02 spark]# jps
25293 SecondaryNameNode
25390 JobTracker
18783 Jps
25118 NameNode
18677 Master
[root@web01 conf]# jps
22733 DataNode
5697 Jps
22878 TaskTracker
5625 Worker
4839 jar
[root@db01 assembly]# jps
16242 DataNode
16345 TaskTracker
30603 Worker
30697 Jps
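Before moving on to the web UI, a quick smoke test from the Master confirms that the cluster accepts applications. A minimal sketch (the master URL follows from the spark-env.sh above):
MASTER=spark://web02.dw:7077 bin/spark-shell
Inside the shell, something like sc.textFile(...).count() against a small HDFS file exercises both the Workers and the Hadoop integration.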
5.2 Web Monitoring
From the Master's web UI you can clearly see:
By default the Master uses port 7077 for cluster traffic (giving the spark://web02.dw:7077 URL); this can be changed in the config file, but we leave it unchanged here. The web UI itself is served on port 8080 by default (SPARK_MASTER_WEBUI_PORT).
The two Workers, along with each Worker's resources and the overall cluster configuration.
Master web UI (screenshot).