Version Compatibility
Before configuring anything, make sure your Hive, Spark, and Hadoop versions are compatible. To check, download the Hive source code: the source root contains a pom.xml file. Search it for spark.version to find the Spark version that Hive release was built against, and likewise search for hadoop.version to find the matching Hadoop version. Note: for compatibility you only need to stay on the same release line, not the exact patch version. For example, hive-3.1.1 was built against Spark 2.3.0 and Hadoop 3.1.0, so installing the stable spark-2.3.3 release is fine, and the same reasoning applies to the Hadoop version.
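The pom.xml lookup described above can be scripted. A minimal sketch, assuming GNU grep; the fragment written below stands in for the real Hive source pom.xml so the commands are runnable as-is (point the grep calls at your actual hive-src/pom.xml instead):

```shell
# Stand-in for the <properties> section of the real Hive source pom.xml;
# replace /tmp/hive-pom-fragment.xml with the path to your hive source tree's pom.xml.
cat > /tmp/hive-pom-fragment.xml <<'EOF'
<properties>
  <spark.version>2.3.0</spark.version>
  <hadoop.version>3.1.0</hadoop.version>
</properties>
EOF
# Extract the version strings (GNU grep -P enables the lookbehind pattern):
grep -oP '(?<=<spark.version>)[^<]+' /tmp/hive-pom-fragment.xml    # 2.3.0
grep -oP '(?<=<hadoop.version>)[^<]+' /tmp/hive-pom-fragment.xml   # 3.1.0
```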
For Hive-on-Spark version compatibility you can also consult:
https://cwiki.apache.org//confluence/display/Hive/Hive+on+Spark:+Getting+Started
Installing Spark (building from source is recommended)
This guide uses a source build as the example.
Building from source (using spark-2.3.3 and hadoop-3.1.0 as examples)
Download the spark-2.3.3 source release, unpack it, change into the source root, and build it with the command below. (Note: build on a Linux host with Internet access and with Maven installed; Maven 3.3.9 or newer is required.)
./dev/make-distribution.sh --name "h310-without-hive" --tgz "-Pyarn,-Phadoop-3.1,-Dhadoop.version=3.1.0,parquet-provided,orc-provided"
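Before kicking off the build above it is worth confirming that Maven meets the 3.3.9 minimum. One self-contained way is a version comparison with GNU `sort -V`; the fallback value below is only for illustration when mvn is not on PATH:

```shell
required=3.3.9
have=$(mvn -version 2>/dev/null | awk 'NR==1{print $3}')
have=${have:-3.6.3}   # illustrative fallback when mvn is absent
# sort -V orders version strings numerically; if the required version sorts
# first (or equal), the installed Maven is new enough.
if [ "$(printf '%s\n%s\n' "$required" "$have" | sort -V | head -n1)" = "$required" ]; then
  echo "Maven $have satisfies the >= $required requirement"
else
  echo "Maven $have is too old; upgrade before building"
fi
```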
Once the build finishes, the distribution tarball “spark-2.3.3-bin-h310-without-hive.tgz” appears in the source root. For details on deploying Spark on YARN, see:
https://blog.csdn.net/wangkai_123456/article/details/87348161#3Spark_25
YARN Configuration
Add the following to ${HADOOP_HOME}/etc/hadoop/yarn-site.xml:
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
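A quick well-formedness check can catch XML typos before you restart YARN. A minimal sketch using python3's standard-library parser; the temp file below holds the fragment just added (in practice, parse your full yarn-site.xml instead):

```shell
# Write the property fragment to a temp file for checking; in a real setup,
# run the parse against ${HADOOP_HOME}/etc/hadoop/yarn-site.xml directly.
cat > /tmp/scheduler-fragment.xml <<'EOF'
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
EOF
python3 -c "import xml.dom.minidom; xml.dom.minidom.parse('/tmp/scheduler-fragment.xml'); print('well-formed')"
```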
Hive Configuration
To add the Spark dependency to Hive
Prior to Hive 2.2.0, link the spark-assembly jar to HIVE_HOME/lib.
Since Hive 2.2.0, Hive on Spark runs with Spark 2.0.0 and above, which doesn’t have an assembly jar.
To run with YARN mode (either yarn-client or yarn-cluster), link the following jars to HIVE_HOME/lib.
scala-library
spark-core
spark-network-common
To run with LOCAL mode (for debugging only), link the following jars in addition to those above to HIVE_HOME/lib.
chill-java chill jackson-module-paranamer jackson-module-scala jersey-container-servlet-core
jersey-server json4s-ast kryo-shaded minlog scala-xml spark-launcher
spark-network-shuffle spark-unsafe xbean-asm5-shaded
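The YARN-mode linking step above can be scripted. The sketch below runs in a throwaway sandbox with made-up jar version numbers so it is runnable as-is; in a real deployment, set SPARK_HOME and HIVE_HOME to your actual installations and drop the mkdir/touch lines:

```shell
# Sandbox stand-ins; replace with your real SPARK_HOME and HIVE_HOME.
SPARK_HOME=/tmp/demo-spark
HIVE_HOME=/tmp/demo-hive
mkdir -p "$SPARK_HOME/jars" "$HIVE_HOME/lib"
# Fake jars with illustrative version numbers; your Spark build will differ.
touch "$SPARK_HOME"/jars/scala-library-2.11.8.jar \
      "$SPARK_HOME"/jars/spark-core_2.11-2.3.3.jar \
      "$SPARK_HOME"/jars/spark-network-common_2.11-2.3.3.jar
# Link each required jar into Hive's lib directory (-s symlink, -nf replace stale links).
for jar in scala-library spark-core spark-network-common; do
  ln -snf "$SPARK_HOME"/jars/${jar}*.jar "$HIVE_HOME/lib/"
done
ls "$HIVE_HOME/lib"
```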
Configure Hive execution engine to use Spark
Add the following to ${HIVE_HOME}/conf/hive-site.xml:
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
Configure Spark-application configs for Hive
Add the following to ${HIVE_HOME}/conf/hive-site.xml:
<property>
  <name>spark.home</name>
  <value>/usr/local/spark</value>
</property>
<property>
  <name>spark.master</name>
  <value>yarn-cluster</value>
</property>
<property>
  <name>hive.spark.client.channel.log.level</name>
  <value>WARN</value>
</property>
<property>
  <name>spark.eventLog.enabled</name>
  <value>true</value>
</property>
<property>
  <name>spark.eventLog.dir</name>
  <value>hdfs://hadoopSvr1:8020/user/hive/tmp/sparkeventlog</value>
</property>
<property>
  <name>spark.executor.memory</name>
  <value>1g</value>
</property>
<property>
  <name>spark.executor.cores</name>
  <value>2</value>
</property>
<property>
  <name>spark.executor.instances</name>
  <value>6</value>
</property>
<property>
  <name>spark.yarn.executor.memoryOverhead</name>
  <value>150m</value>
</property>
<property>
  <name>spark.driver.memory</name>
  <value>4g</value>
</property>
<property>
  <name>spark.yarn.driver.memoryOverhead</name>
  <value>400m</value>
</property>
<property>
  <name>spark.serializer</name>
  <value>org.apache.spark.serializer.KryoSerializer</value>
</property>
Allow YARN to cache the necessary Spark dependency jars on the nodes so that they do not need to be distributed each time an application runs.
Upload all jars under $SPARK_HOME/jars to an HDFS directory (for example hdfs://hadoopSvr1:8020/spark-jars), then add the following to ${HIVE_HOME}/conf/hive-site.xml:
<property>
  <name>spark.yarn.jars</name>
  <value>hdfs://hadoopSvr1:8020/spark-jars/*</value>
</property>
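The upload itself is two commands. A hedged sketch: the /spark-jars path follows the example above and the commands assume a host with HDFS client access and SPARK_HOME set, so adjust both to your environment:

```shell
# Create the target directory on HDFS and upload Spark's jars into it;
# the /spark-jars path must match the spark.yarn.jars value configured above.
hdfs dfs -mkdir -p /spark-jars
hdfs dfs -put "$SPARK_HOME"/jars/*.jar /spark-jars/
```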
Spark Configuration (optional)
See: https://cwiki.apache.org//confluence/display/Hive/Hive+on+Spark:+Getting+Started