First, download the Spark source: find the release you need at http://archive.cloudera.com/cdh5/cdh/5/ and build and package it yourself. For how to build and package it, see my earlier post:
http://blog.csdn.net/xiao_jun_0820/article/details/44178169
When the build finishes you should end up with a tarball named something like spark-1.6.0-cdh5.7.1-bin-custom-spark.tgz (the exact name depends on the version you downloaded).
Upload it to a node and extract it under /opt. The extracted directory will be spark-1.6.0-cdh5.7.1-bin-custom-spark; that name is unwieldy, so give it a symlink:
ln -s spark-1.6.0-cdh5.7.1-bin-custom-spark spark
Then cd into /opt/spark/conf and delete the template files there; they are no longer needed. Now comes the key part: in the conf directory, create two symlinks, yarn-conf and log4j.properties, pointing at the corresponding entries in the CDH Spark configuration directory (/etc/spark/conf by default, unless you've changed it):
ln -s /etc/spark/conf/yarn-conf yarn-conf
ln -s /etc/spark/conf/log4j.properties log4j.properties
Then copy the three files classpath.txt, spark-defaults.conf, and spark-env.sh from /etc/spark/conf into your own Spark conf directory (/opt/spark/conf in this example). The final /opt/spark/conf directory contains five entries: the yarn-conf and log4j.properties symlinks plus the three copied files.
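The steps so far can be sketched as a short script. Paths assume the defaults used in this post (/opt/spark for the custom build, /etc/spark/conf for the CM-managed config); adjust them if yours differ:

```shell
#!/bin/sh
# Sketch of the conf-directory setup described above.
CUSTOM_CONF=/opt/spark/conf   # conf dir of the custom Spark build
CDH_CONF=/etc/spark/conf      # CM-managed Spark config (default location)

cd "$CUSTOM_CONF"
rm -f *.template                                     # drop the stock template files
ln -s "$CDH_CONF/yarn-conf" yarn-conf                # symlink: tracks CM changes
ln -s "$CDH_CONF/log4j.properties" log4j.properties  # symlink: tracks CM changes
cp "$CDH_CONF/classpath.txt" "$CDH_CONF/spark-defaults.conf" \
   "$CDH_CONF/spark-env.sh" .                        # copies: edited by hand below
```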
Edit classpath.txt and find the Spark-related jars in it; there should be two:
/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/jars/spark-1.6.0-cdh5.7.1-yarn-shuffle.jar
/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/jars/spark-streaming-flume-sink_2.10-1.6.0-cdh5.7.1.jar
The first is the Spark YARN shuffle jar (used when dynamic resource allocation is enabled). Your own build ships the same jar under SPARK_HOME/lib, so replace the path with your own: /opt/spark/lib/spark-1.6.0-cdh5.7.1-yarn-shuffle.jar. The second is the Spark Streaming Flume sink jar; I don't use it, so I just delete that line. If you do use it, change it to your own jar path instead.
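One way to sketch this classpath.txt edit is with sed. The demo below runs on a scratch copy; on the real node you would point the same sed at /opt/spark/conf/classpath.txt:

```shell
# Build a scratch copy containing the two Spark jar lines from above.
cat > classpath.demo.txt <<'EOF'
/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/jars/spark-1.6.0-cdh5.7.1-yarn-shuffle.jar
/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/jars/spark-streaming-flume-sink_2.10-1.6.0-cdh5.7.1.jar
EOF

# Swap the parcel shuffle jar for our own, and drop the unused flume-sink jar.
sed -i \
  -e 's#^.*/jars/spark-1.6.0-cdh5.7.1-yarn-shuffle.jar$#/opt/spark/lib/spark-1.6.0-cdh5.7.1-yarn-shuffle.jar#' \
  -e '/spark-streaming-flume-sink/d' \
  classpath.demo.txt

cat classpath.demo.txt
# -> /opt/spark/lib/spark-1.6.0-cdh5.7.1-yarn-shuffle.jar
```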
Next, edit spark-defaults.conf. The CDH-provided file should look like this:
spark.authenticate=false
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.executorIdleTimeout=60
spark.dynamicAllocation.minExecutors=0
spark.dynamicAllocation.schedulerBacklogTimeout=1
spark.eventLog.enabled=true
#spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7337
spark.eventLog.dir=hdfs://name84:8020/user/spark/applicationHistory
spark.yarn.historyServer.address=http://name84:18088
spark.yarn.jar=local:/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/lib/spark-assembly.jar
spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/hadoop/lib/native
spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/hadoop/lib/native
spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/hadoop/lib/native
spark.yarn.config.gatewayPath=/opt/cloudera/parcels
spark.yarn.config.replacementPath={{HADOOP_COMMON_HOME}}/../../..
spark.master=yarn-client
Only spark.yarn.jar needs to change; point it at your own assembly jar:
spark.yarn.jar=local:/opt/spark/lib/spark-assembly-1.6.0-cdh5.7.1-hadoop2.6.0-cdh5.7.1.jar
Next, edit spark-env.sh and change export SPARK_HOME=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark to export SPARK_HOME=/opt/spark.
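The same kind of sed one-liner works for spark-env.sh. Again this is a demo on a scratch file; on the node, target /opt/spark/conf/spark-env.sh:

```shell
# Scratch file holding the original CDH SPARK_HOME export.
printf 'export SPARK_HOME=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark\n' \
  > spark-env.demo.sh

# Repoint SPARK_HOME at the custom install.
sed -i 's#^export SPARK_HOME=.*#export SPARK_HOME=/opt/spark#' spark-env.demo.sh

cat spark-env.demo.sh
# -> export SPARK_HOME=/opt/spark
```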
OK, all the changes are done.
Install your custom Spark binary distribution the same way on every node.
Two points worth noting. log4j.properties and the yarn-conf directory are symlinks, so configuration changes you make in Cloudera Manager flow through to your newly installed Spark automatically. spark-defaults.conf and spark-env.sh, however, are copies, not symlinks. spark-env.sh rarely changes, so that is not a big deal, but spark-defaults.conf hard-codes some settings, and changes made in CM will not be synced to it; you have to update those settings by hand. For example, if you move the history server to a different machine, you must edit spark.yarn.historyServer.address yourself.
Making those two files symlinks as well should also work; the only setting we changed is spark.yarn.jar, and that could presumably be set back on each submission with --conf spark.yarn.jar=xxxx on the spark-submit command line. I haven't tried this yet.
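Untested, as noted above, but the per-submission override would look something like this (jar and application paths taken from the examples in this post):

```shell
# Untested sketch: override spark.yarn.jar at submit time, so a symlinked
# spark-defaults.conf (still pointing at the parcel assembly) is overridden.
/opt/spark/bin/spark-submit \
  --master yarn-client \
  --conf spark.yarn.jar=local:/opt/spark/lib/spark-assembly-1.6.0-cdh5.7.1-hadoop2.6.0-cdh5.7.1.jar \
  --class com.kingnet.framework.StreamingRunnerPro \
  /opt/spark/lib/dm-streaming-pro.jar test
```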
Now try submitting a job:
/opt/spark/bin/spark-submit --class com.kingnet.framework.StreamingRunnerPro --master yarn-client --num-executors 2 --driver-memory 1g --executor-memory 1g --executor-cores 1 /opt/spark/lib/dm-streaming-pro.jar test
Or in yarn-cluster mode:
/opt/spark/bin/spark-submit --class com.kingnet.framework.StreamingRunnerPro --master yarn-cluster --num-executors 2 --driver-memory 1g --executor-memory 1g --executor-cores 1 hdfs://name84:8020/install/dm-streaming-pro.jar test
Reference: http://spark.apache.org/docs/1.6.3/hadoop-provided.html
Using Spark's "Hadoop Free" Build
Spark uses Hadoop client libraries for HDFS and YARN. Starting in version Spark 1.4, the project packages “Hadoop free” builds that lets you more easily connect a single Spark binary to any Hadoop version. To use these builds, you need to modify SPARK_DIST_CLASSPATH to include Hadoop’s package jars. The most convenient place to do this is by adding an entry in conf/spark-env.sh.
This page describes how to connect Spark to Hadoop for different types of distributions.
Apache Hadoop
For Apache distributions, you can use Hadoop’s ‘classpath’ command. For instance:
### in conf/spark-env.sh ###
# If 'hadoop' binary is on your PATH
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
# With explicit path to 'hadoop' binary
export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)
# Passing a Hadoop configuration directory
export SPARK_DIST_CLASSPATH=$(hadoop --config /path/to/configs classpath)