修改源码这里不讨论,因为我不会,我是下载修改后的源码
这里只讲报错和配置
使用的是hive3.1.2和spark3.0.0
修改hadoop配置
因为要安装spark,需要让hadoop的yarn知道spark,因此添加sprak的配置
进入hadoop的yarn-site.xml
[root@hadoop102 hadoop]# cd /opt/module/hadoop-3.1.3/etc/hadoop
[root@hadoop102 hadoop]# vim yarn-site.xml
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME,SPARK_HOME</value>
</property>
完整yarn-site配置
<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<!-- MR uffle -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop103</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME,SPARK_HOME</value>
</property>
<!-- 开启日志聚集功能 -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<!-- 设置日志聚集服务器地址 -->
<property>
<name>yarn.log.server.url</name>
<value>http://hadoop102:19888/jobhistory/logs</value>
</property>
<!-- 设置日志保留时间为7天 -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
<!-- 是否将虚拟核数当做cpu核数,默认是false,采用物理cpu -->
<property>
<name>yarn.nodemanager.resource.count-logical-processors-as-cores</name>
<value>false</value>
</property>
<!-- NodeManager使用内存数,默认是8g,我本机配置的是3g,因此修改为2g -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2048</value>
</property>
<!-- NodeManager的cpu核数,默认是8核,我本机配置的是2核,因此修改为2核 -->
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>2</value>
</property>
<!-- 容器最小内存 修改为1g -->
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<!-- 容器最大内存默认8g 修改为2g -->
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property>
<!-- 容器最小cpu核数 修改为1 -->
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
<!-- 容器最大cpu核数 默认是4个 修改为1 -->
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>1</value>
</property>
<!-- 虚拟内存检查,默认打开,改为关闭 -->
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<!-- 虚拟内存和物理内存设置比例,默认2.1 -->
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>/opt/module/hadoop-3.1.3/etc/hadoop:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/*:/opt/module/hadoop-3.1.3/share/hadoop/common/*:/opt/module/hadoop-3.1.3/share/hadoop/hdfs:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/*:/opt/module/hadoop-3.1.3/share/hadoop/hdfs/*:/opt/module/hadoop-3.1.3/share/hadoop/mapreduce/lib/*:/opt/module/hadoop-3.1.3/share/hadoop/mapreduce/*:/opt/module/hadoop-3.1.3/share/hadoop/yarn:/opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/*:/opt/module/hadoop-3.1.3/share/hadoop/yarn/*</value>
</property>
</configuration>
spark配置
环境变量配置
下载地址:Index of /dist/spark/spark-3.0.0
解压之后修改名称为spark
[root@hadoop102 software]# tar -zxvf spark-3.0.0-bin-hadoop3.2.tgz -C /opt/module/
[root@hadoop102 software]# cd /opt/module/
[root@hadoop102 software]# mv spark-3.0.0-bin-hadoop3.2 spark
###########配置环境变量
[root@hadoop102 module]# vim /etc/profile.d/my_env.sh
##SPARK_HOME
export SPARK_HOME=/opt/module/spark
export PATH=$PATH:$SPARK_HOME/bin
[root@hadoop102 module]# source /etc/profile.d/my_env.sh
hive配置
hive中创建spark配置文件
[root@hadoop102 module]# vim /opt/module/hive/conf/spark-defaults.conf
spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs://hadoop102:8020/spark-history
spark.executor.memory 1g
spark.driver.memory 1g
HDFS创建如下路径,用于存储历史日志
[root@hadoop102 software]$ hadoop fs -mkdir /spark-history
[root@hadoop102 software]$ tar -zxvf /opt/software/spark-3.0.0-bin-without-hadoop.tgz
[root@hadoop102 software]$ hadoop fs -mkdir /spark-jars
[root@hadoop102 software]$ hadoop fs -put spark-3.0.0-bin-without-hadoop/jars/* /spark-jars
[root@hadoop102 software]$ hadoop fs -mkdir /spark-jars
修改hive-site.xml文件
[root@hadoop102 ~]$ vim /opt/module/hive/conf/hive-site.xml
添加如下内容
<!--Spark依赖位置(注意:端口号8020必须和namenode的端口号一致)-->
<property>
<name>spark.yarn.jars</name>
<value>hdfs://hadoop102:8020/spark-jars/*</value>
</property>
<!--Hive执行引擎-->
<property>
<name>hive.execution.engine</name>
<value>spark</value>
</property>
测试
[root@hadoop102 hive]$ bin/hive
hive (default)> create table student(id int, name string);
hive (default)> insert into table student values(1,'abc');
修改配置防止小文件合并
因为会导致lzo文件和索引合并,无法使用切片
只有在lzo下面才要关闭小文件合并
在执行sql之前执行,只在本窗口有效
默认
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
在hive中运行sql:
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat