Hive on Spark
1. Download and compile the Spark source code
A Maven environment is required; its setup is not covered here.
Download the Spark source from the official site and extract it.
Download link: https://www.apache.org/dyn/closer.lua/spark/spark-2.4.5/spark-2.4.5.tgz
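For example (a sketch; the Apache archive URL below is a direct mirror of the same source release, assuming it is still hosted there):
wget https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5.tgz
tar -zxf spark-2.4.5.tgz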
Enter the Spark source directory and compile:
[wang@bigdata101 ~]$ cd spark-2.4.5
[wang@bigdata101 spark-2.4.5]$ ./dev/make-distribution.sh --name without-hive --tgz -Pyarn -Phadoop-3.1 -Dhadoop.version=3.1.3 -Pparquet-provided -Porc-provided -Phadoop-provided
Wait for the build to finish. The resulting tarball is in the source root directory, named:
spark-2.4.5-bin-without-hive.tgz
2. Extract Spark and configure the environment
[wang@bigdata101 src]$ tar -zxf /opt/software/spark-2.4.5-bin-without-hive.tgz -C /opt/app
[wang@bigdata101 src]$ mv /opt/app/spark-2.4.5-bin-without-hive /opt/app/spark
3. Configure the SPARK_HOME environment variable
sudo vim /etc/profile.d/spark.sh
Add the following:
export SPARK_HOME=/opt/app/spark
export PATH=$PATH:$SPARK_HOME/bin
Source the file to make it take effect:
source /etc/profile.d/spark.sh
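A quick check that the variable is live in the current shell:
echo $SPARK_HOME      # should print /opt/app/spark
which spark-submit    # should resolve to /opt/app/spark/bin/spark-submit
Note that because this is a hadoop-provided build, spark-submit itself may fail to start until SPARK_DIST_CLASSPATH is set in step 4.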
4. Configure the Spark runtime environment:
[wang@bigdata101 src]$ mv /opt/app/spark/conf/spark-env.sh.template /opt/app/spark/conf/spark-env.sh
[wang@bigdata101 src]$ vim /opt/app/spark/conf/spark-env.sh
Add the following:
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
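Because the build used -Phadoop-provided, this line puts the cluster's own Hadoop jars on Spark's classpath at launch. You can inspect what it expands to (assumes the hadoop client is on PATH):
hadoop classpath | tr ':' '\n' | head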
Then create spark-defaults.conf under Hive's conf directory (Hive on Spark reads these Spark settings from Hive's own conf dir, which is why the file lives under /opt/app/hive/conf rather than Spark's):
vim /opt/app/hive/conf/spark-defaults.conf
Add the following:
spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs://bigdata101:8020/spark-history
spark.executor.memory 1g
spark.driver.memory 1g
5. Create the following directory on HDFS (it must match spark.eventLog.dir configured above)
hadoop fs -mkdir /spark-history
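Optionally verify the directory was created:
hadoop fs -ls / | grep spark-history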
6. Upload the Spark dependencies to HDFS (optional; in practice this approach did not work reliably for me, so you can skip straight to step 7)
hadoop fs -mkdir /spark-jars
hadoop fs -put /opt/app/spark/jars/* /spark-jars
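A quick way to confirm the upload (the HDFS listing includes a "Found N items" header line, so the counts will differ by one):
hadoop fs -ls /spark-jars | wc -l
ls /opt/app/spark/jars | wc -l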
7. Link the jar files into Hive, or add them in hive-env.sh (pick one)
Option A: symlink the jars into Hive's lib directory
[wang@bigdata101 src]$ ln -s /opt/app/spark/jars/scala-library-2.11.12.jar /opt/app/hive/lib/scala-library-2.11.12.jar
# If the following two files already exist under Hive's lib, skip these links
[wang@bigdata101 src]$ ln -s /opt/app/spark/jars/spark-core_2.11-2.4.5.jar /opt/app/hive/lib/spark-core_2.11-2.4.5.jar
[wang@bigdata101 src]$ ln -s /opt/app/spark/jars/spark-network-common_2.11-2.4.5.jar /opt/app/hive/lib/spark-network-common_2.11-2.4.5.jar
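You can confirm the links resolve correctly:
ls -l /opt/app/hive/lib | grep -E 'scala-library|spark-core|spark-network-common'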
Option B: add the following to hive-env.sh
export SPARK_HOME=/opt/app/spark
export SPARK_JARS=""
# Collect every jar under $SPARK_HOME/jars into a colon-separated list
for jar in `ls $SPARK_HOME/jars`; do
export SPARK_JARS=$SPARK_JARS:$SPARK_HOME/jars/$jar
done
# SPARK_JARS starts with ":", so it appends cleanly after the LZO jar
export HIVE_AUX_JARS_PATH=/opt/app/hadoop-3.1.3/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar$SPARK_JARS
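To eyeball the resulting jar list (a sketch; adjust the hive-env.sh path if your layout differs, and note this assumes the file sources cleanly on its own):
bash -c 'source /opt/app/hive/conf/hive-env.sh; echo $HIVE_AUX_JARS_PATH | tr ":" "\n" | head'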
8. Modify hive-site.xml
<!-- Location of the Spark dependency jars on HDFS; leave commented out if you skipped step 6 -->
<!--
<property>
<name>spark.yarn.jars</name>
<value>hdfs://bigdata101:8020/spark-jars/*</value>
</property>
-->
<!-- Hive execution engine -->
<property>
<name>hive.execution.engine</name>
<value>spark</value>
</property>
<!-- Timeout for Hive's connection to the Spark client -->
<property>
<name>hive.spark.client.connect.timeout</name>
<value>10000ms</value>
</property>
9. Test Hive on Spark
Create a table:
hive (default)> create table student(
id int,
name string);
Insert data into the table:
hive (default)> insert into student values(1,"zhangsan");
If the insert completes without errors, Hive on Spark is working. Query the data back:
hive (default)> select * from student;
1 zhangsan
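Because spark.master is set to yarn, the first query spins up a long-lived Spark application on YARN; you can confirm it from the shell (optional check):
yarn application -list
Look for a RUNNING application whose name starts with something like "Hive on Spark".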