Environment:
Software | Version |
---|---|
CentOS | 6.8 |
JDK | 1.8 |
Maven | 3.6.3 |
Scala | 2.11.8 |
Hadoop | 2.7.2 |
Hive | 2.3.6 |
Spark | 2.1.1 |
Building from Source
Hive and Spark version compatibility:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3a+Getting+Started
- 1. Download the Spark source package:
https://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1.tgz
- 2. Extract the Spark source package
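For example, assuming the tarball was downloaded into the current directory:
tar -zxvf spark-2.1.1.tgz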
- 3. Change Spark's built-in Maven path
Add the following to /etc/profile:
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m -XX:MaxPermSize=512m"
vim spark-2.1.1/dev/make-distribution.sh
Change the MVN variable from the bundled build/mvn so it points at your own Maven:
MVN="$MAVEN_HOME/bin/mvn"
- 4. Build the source
bash ./dev/change-scala-version.sh 2.11
bash ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"
Alternatively, to compile without packaging a distribution tarball:
mvn clean package -Pdist -Dmaven.test.skip=true
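A successful make-distribution.sh run leaves spark-2.1.1-bin-hadoop2-without-hive.tgz in the source root; extract it for the configuration steps below:
tar -zxvf spark-2.1.1-bin-hadoop2-without-hive.tgz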
Configuration
- 1. Copy jars into HIVE_HOME/lib
From the compiled and extracted Spark distribution:
cd spark-2.1.1-bin-hadoop2-without-hive
Symlink or copy the spark-* and scala-* jars under SPARK_HOME/jars into HIVE_HOME/lib:
cp jars/spark-* $HIVE_HOME/lib/
cp jars/scala-* $HIVE_HOME/lib/
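If you prefer symlinks (mentioned above as an alternative), a sketch using absolute link targets, run from inside the extracted distribution:
ln -s $(pwd)/jars/spark-*.jar $HIVE_HOME/lib/
ln -s $(pwd)/jars/scala-*.jar $HIVE_HOME/lib/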
- 2. Edit hive-site.xml
<property>
  <name>spark.yarn.jars</name>
  <value>hdfs://namenode:9000/spark-jars/*</value>
</property>
Then upload all jars under SPARK_HOME/jars to hdfs://namenode:9000/spark-jars.
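For example, using the HDFS path configured above:
hadoop fs -mkdir -p /spark-jars
hadoop fs -put $SPARK_HOME/jars/* /spark-jars/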
- 3. Edit spark-env.sh
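The file lives in $SPARK_HOME/conf; if it doesn't exist yet, create it from the bundled template:
cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh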
# Note: $(hadoop classpath) requires the hadoop command to be on the PATH; otherwise use an absolute path, e.g. $(/hadoop-2.7.2/bin/hadoop classpath)
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
- 4. Edit the Hive launch script
vim $HIVE_HOME/bin/hive
Locate the existing classpath loop:
for f in ${HIVE_LIB}/*.jar; do
CLASSPATH=${CLASSPATH}:$f;
done
Immediately after that loop, add the two loops below to put the Spark and Scala jars on Hive's classpath; without them Hive may throw errors at runtime (a classic pitfall). This assumes SPARK_HOME and SCALA_HOME are exported in the environment:
for f in ${SPARK_HOME}/jars/*.jar; do
CLASSPATH=${CLASSPATH}:$f;
done
for f in ${SCALA_HOME}/lib/*.jar; do
CLASSPATH=${CLASSPATH}:$f;
done
- 5. Start Hive and configure the engine at the session level:
set hive.execution.engine=spark;
set spark.master=yarn;
set spark.submit.deployMode=client;
set spark.eventLog.enabled=true;
set spark.eventLog.dir=hdfs://namenode:9000/spark-logs;
set spark.executor.memory=1024m;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;
All of the settings above can instead be configured globally in hive-site.xml; session-level settings are used here for testing.
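For reference, each session setting maps to a property of the same name in hive-site.xml. A minimal global equivalent (only the first two shown; the rest follow the same pattern):
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
<property>
  <name>spark.master</name>
  <value>yarn</value>
</property>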
Testing
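A quick smoke test: run a simple aggregation and watch the console. The table name test_table is a placeholder; any existing table will do.
hive> set hive.execution.engine=spark;
hive> select count(*) from test_table;
If everything is wired up correctly, Hive submits a Spark application to YARN and reports Spark stage progress instead of launching MapReduce jobs.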
All done!!!