Hive on Spark Official Tutorial
Note: in general the Hive version must match the Spark version, and the official site lists the compatible pairs. The Hive, Spark, and Hadoop versions used here do not follow the official recommendation.
- Download the Spark source, using spark-2.4.5 as the example.
- Build the Spark source:
./dev/make-distribution.sh --name "hadoop3-without-hive" --tgz "-Pyarn,hadoop-3.1,scala-2.12,parquet-provided,orc-provided" -Dhadoop.version=3.1.3 -Dscala.version=2.12.12 -Dscala.binary.version=2.12
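Note that the spark-2.4.5 tree builds against Scala 2.11 by default; if you build with the scala-2.12 profile as above, the POMs usually need to be switched first. A sketch, assuming a stock spark-2.4.5 source checkout:

# Run from the root of the Spark source tree, before make-distribution.sh.
./dev/change-scala-version.sh 2.12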
- Install the built package. Copy the tarball above to the machine where Hive is installed and extract it. Configure the spark-env.sh script, adding the following:
SPARK_CONF_DIR=/opt/spark-2.4.5-bin-hadoop3-without-hive/conf
HADOOP_CONF_DIR=/opt/hadoop-3.1.3/etc/hadoop
YARN_CONF_DIR=/opt/hadoop-3.1.3/etc/hadoop
SPARK_EXECUTOR_CORES=3
SPARK_EXECUTOR_MEMORY=4g
SPARK_DRIVER_MEMORY=2g
- Configure spark-defaults.conf (these properties can also be set in hive-site.xml):
spark.yarn.historyServer.address=${hostname}:18080
spark.yarn.historyServer.allowTracking=true
spark.eventLog.dir=hdfs://master/spark/eventlogs
spark.eventLog.enabled=true
spark.history.fs.logDirectory=hdfs://master/spark/hisLogs
spark.yarn.jars=hdfs://master/spark-jars-hive/*
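The event log and history directories referenced above are not created automatically; a minimal sketch of creating them up front (assuming the hdfs client is on the PATH and the paths match the values above):

# Create the directories used by spark.eventLog.dir and spark.history.fs.logDirectory.
hdfs dfs -mkdir -p /spark/eventlogs /spark/hisLogs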
- Create a jars directory on HDFS (the spark.yarn.jars value above points at hdfs://master/spark-jars-hive) and upload every jar from the jars directory of the Spark installation into it, as sketched below.
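For example (a sketch; the directory must match the spark.yarn.jars value configured above):

# Upload every jar from the Spark distribution so YARN containers can fetch them.
hdfs dfs -mkdir -p /spark-jars-hive
hdfs dfs -put /opt/spark-2.4.5-bin-hadoop3-without-hive/jars/* /spark-jars-hive/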
- Configure Hive
- Configure the hive-env.sh script as follows:
export HADOOP_HOME=/opt/hadoop-3.1.3
export HIVE_CONF_DIR=/opt/apache-hive-3.1.2-bin/conf
- Configure hive-site.xml as follows:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Standard Apache license header omitted. -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://${hostname}:3306/hive?createDatabaseIfNotExist=true</value>
    <description>JDBC URL of the Hive metastore database; the host is the machine running MySQL</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
    <description>JDBC driver class for the metastore database; MySQL 8.0+ is used here, so the Connector/J 8.0+ class is needed, unlike the com.mysql.jdbc.Driver class used before 6.0</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>metastore database user name</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive1234</value>
    <description>metastore database password</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>hdfs://master/hive</value>
    <description>Hive warehouse directory on HDFS, used to store table data</description>
  </property>
  <property>
    <name>hive.server2.thrift.port</name>
    <value>10000</value>
  </property>
  <property>
    <name>hive.server2.thrift.bind.host</name>
    <value>${hostname}</value>
    <description>host name HiveServer2 binds to</description>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://${hostname}:9083</value>
    <description>Thrift URI of the metastore, analogous to a JDBC URL; the host is the machine running the metastore thrift service</description>
  </property>
  <property>
    <name>hive.execution.engine</name>
    <value>spark</value>
    <description>switch the Hive execution engine to Spark</description>
  </property>
  <property>
    <name>spark.serializer</name>
    <value>org.apache.spark.serializer.KryoSerializer</value>
    <description>Spark serializer class</description>
  </property>
  <property>
    <name>spark.eventLog.dir</name>
    <value>hdfs://master/spark/eventlogs</value>
  </property>
  <property>
    <name>spark.executor.instances</name>
    <value>3</value>
  </property>
  <property>
    <name>spark.executor.cores</name>
    <value>3</value>
  </property>
  <property>
    <name>spark.yarn.jars</name>
    <value>hdfs://master/spark-jars-hive/*</value>
  </property>
  <property>
    <name>spark.home</name>
    <value>/opt/spark-2.4.5-bin-hadoop3-without-hive</value>
  </property>
  <property>
    <name>spark.master</name>
    <value>yarn</value>
    <description>run Spark on YARN</description>
  </property>
  <property>
    <name>spark.executor.extraClassPath</name>
    <value>/opt/apache-hive-3.1.2-bin/lib</value>
    <description>Hive jars needed on the Spark executor classpath</description>
  </property>
  <property>
    <name>spark.eventLog.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>spark.executor.memory</name>
    <value>4g</value>
  </property>
  <property>
    <name>spark.yarn.executor.memoryOverhead</name>
    <value>2048m</value>
  </property>
  <property>
    <name>spark.driver.memory</name>
    <value>2g</value>
  </property>
  <property>
    <name>spark.yarn.driver.memoryOverhead</name>
    <value>400m</value>
  </property>
  <!-- Optional: dynamic executor allocation -->
  <!--
  <property>
    <name>spark.shuffle.service.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>spark.dynamicAllocation.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>spark.dynamicAllocation.minExecutors</name>
    <value>0</value>
  </property>
  <property>
    <name>spark.dynamicAllocation.maxExecutors</name>
    <value>14</value>
  </property>
  <property>
    <name>spark.dynamicAllocation.initialExecutors</name>
    <value>4</value>
  </property>
  <property>
    <name>spark.dynamicAllocation.executorIdleTimeout</name>
    <value>60000</value>
  </property>
  <property>
    <name>spark.dynamicAllocation.schedulerBacklogTimeout</name>
    <value>1000</value>
  </property>
  -->
</configuration>
- MySQL is used to store the metastore data, so copy the MySQL driver jar into ${HIVE_HOME}/lib.
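For example (the connector jar name below is hypothetical; substitute the MySQL 8.x Connector/J jar you actually downloaded):

# Substitute the real path and version of your MySQL Connector/J jar.
cp mysql-connector-java-8.0.21.jar /opt/apache-hive-3.1.2-bin/lib/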
- Copy the Spark-related jars into the ${HIVE_HOME}/lib directory (symlinks also work):
scala-reflect-2.12.12.jar
scala-library-2.12.12.jar
spark-core_2.12-2.4.5.jar
spark-network-common_2.12-2.4.5.jar
spark-yarn_2.12-2.4.5.jar
spark-unsafe_2.12-2.4.5.jar
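A sketch of linking them in, assuming SPARK_HOME and HIVE_HOME point at the installations used above:

# Symlink the Scala/Spark jars Hive needs at runtime into its lib directory.
for jar in scala-reflect-2.12.12.jar scala-library-2.12.12.jar \
    spark-core_2.12-2.4.5.jar spark-network-common_2.12-2.4.5.jar \
    spark-yarn_2.12-2.4.5.jar spark-unsafe_2.12-2.4.5.jar; do
  ln -s "${SPARK_HOME}/jars/${jar}" "${HIVE_HOME}/lib/${jar}"
done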
- Initialize the Hive metastore schema:
${HIVE_HOME}/bin/schematool -dbType mysql -initSchema
- Start the metastore service (the thrift server):
nohup hive --service metastore &
- Start the HiveServer2 service:
nohup hive --service hiveserver2 &
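To verify the setup, connect with Beeline and run statements that force a Spark job; a smoke-test sketch, assuming the host and port configured in hive-site.xml and a hypothetical table name:

# The INSERT and COUNT below should each appear as Spark applications in the YARN UI.
beeline -u "jdbc:hive2://${hostname}:10000" -n hive \
  -e "CREATE TABLE IF NOT EXISTS smoke_test (id INT);
      INSERT INTO smoke_test VALUES (1), (2);
      SELECT COUNT(*) FROM smoke_test;"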