Hive Notes
Hive Basics
Hive is a data warehouse tool built on top of Hadoop: it maps structured data files to tables and provides SQL-like query capabilities.
In essence, Hive translates HQL into MapReduce jobs:
(1) The data Hive processes is stored on HDFS.
(2) The underlying implementation of Hive's analysis is MapReduce.
(3) The resulting jobs run on YARN.
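As a minimal illustration (emp here is a hypothetical table, not one created in these notes): with default settings, a plain scan is usually answered by a simple fetch task, while an aggregation is compiled into a MapReduce job.
hive> select * from emp;                                 -- fetch task, no MapReduce job launched
hive> select deptno, count(*) from emp group by deptno;  -- compiled into a MapReduce job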
Hive Architecture
1) User interfaces: Client
CLI (command-line interface), JDBC/ODBC (JDBC access to Hive), and WebUI (browser access to Hive)
2) Metadata: Metastore
Metadata includes the table name, the database the table belongs to (default by default), the table owner, column/partition fields, the table type (e.g., whether it is an external table), the directory where the table data lives, and so on.
By default, metadata is stored in the embedded Derby database; using MySQL for the Metastore is recommended.
3) Hadoop
HDFS is used for storage and MapReduce for computation.
4) Driver
(1) SQL Parser: converts the SQL string into an abstract syntax tree (AST); this step is usually handled by a third-party library such as ANTLR. The AST is then checked semantically: do the tables exist, do the columns exist, is the SQL semantically valid.
(2) Compiler: compiles the AST into a logical execution plan.
(3) Query Optimizer: optimizes the logical execution plan.
(4) Execution: converts the logical execution plan into a physical plan that can actually run; for Hive, that means MR/Spark jobs.
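The plan the Driver produces can be inspected with EXPLAIN; run against the test table created later in these notes, its output lists the stage dependencies and each stage's operator tree:
hive> explain select id, count(*) from test group by id;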
How Hive Runs
Through its interactive interfaces Hive receives the user's commands (SQL); it then uses its Driver, together with the metadata in the Metastore, to translate those commands into MapReduce jobs, submits them to Hadoop for execution, and finally returns the results to the user interface.
Hive Installation
1) Hive website:
http://hive.apache.org/
2) Documentation:
https://cwiki.apache.org/confluence/display/Hive/GettingStarted
3) Downloads:
https://downloads.apache.org/hive/
Installing Hive
- Upload apache-hive-3.1.2-bin.tar.gz to the /opt/software directory on the Linux host
- Extract apache-hive-3.1.2-bin.tar.gz into the /opt/module/ directory
[hadoop@hadoop102 software]$ tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /opt/module/
- Edit /etc/profile.d/my_env.sh to add the environment variables
[hadoop@hadoop102 module]$ sudo vi /etc/profile.d/my_env.sh
- Add the following content
#HIVE_HOME
export HIVE_HOME=/opt/module/apache-hive-3.1.2-bin
export PATH=$PATH:$HIVE_HOME/bin
- Resolve the logging jar conflict
# Reload the profile so the new variables take effect
[hadoop@hadoop102 module]$ source /etc/profile
[hadoop@hadoop102 module]$ mv $HIVE_HOME/lib/log4j-slf4j-impl-2.10.0.jar $HIVE_HOME/lib/log4j-slf4j-impl-2.10.0.bak
- Initialize the metastore database
[hadoop@hadoop102 apache-hive-3.1.2-bin]$ bin/schematool -dbType derby -initSchema
A possible error:
[hadoop@hadoop102 apache-hive-3.1.2-bin]$ bin/schematool -dbType derby -initSchema
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
        at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
        at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
        at org.apache.hadoop.mapred.JobConf.setJar(JobConf.java:518)
        at org.apache.hadoop.mapred.JobConf.setJarByClass(JobConf.java:536)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:430)
        at org.apache.hadoop.hive.conf.HiveConf.initialize(HiveConf.java:5141)
        at org.apache.hadoop.hive.conf.HiveConf.<init>(HiveConf.java:5104)
        at org.apache.hive.beeline.HiveSchemaTool.<init>(HiveSchemaTool.java:96)
        at org.apache.hive.beeline.HiveSchemaTool.main(HiveSchemaTool.java:1473)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:318)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:232)
Cause: Hadoop and Hive ship two different versions of guava.jar. Delete the lower-version jar and copy the higher-version one into its place.
The two jars live in the following directories:
/hive/lib/
/hadoop/share/hadoop/common/lib/
[hadoop@hadoop102 lib]$ cp /opt/module/hadoop-3.1.3/share/hadoop/common/lib/guava-27.0-jre.jar /opt/module/apache-hive-3.1.2-bin/lib/
[hadoop@hadoop102 lib]$ rm -rf guava-19.0.jar
[hadoop@hadoop102 apache-hive-3.1.2-bin]$ bin/schematool -dbType derby -initSchema
Initialization script completed
schemaTool completed
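With the environment variables from my_env.sh in effect, Hive should now be invocable from any directory; a quick sanity check:
[hadoop@hadoop102 module]$ echo $HIVE_HOME
/opt/module/apache-hive-3.1.2-bin
[hadoop@hadoop102 module]$ hive --version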
Starting and Using Hive
Start Hadoop
[hadoop@hadoop102 bin]$ myhadoop.sh start
=================== starting the hadoop cluster ===================
--------------- starting hdfs ---------------
Starting namenodes on [hadoop102]
Starting datanodes
Starting secondary namenodes [hadoop104]
--------------- starting yarn ---------------
Starting resourcemanager
Starting nodemanagers
--------------- starting the historyserver ---------------
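You can confirm the daemons came up with jps (which processes appear on hadoop102 depends on how the cluster roles are distributed across the hosts):
[hadoop@hadoop102 bin]$ jps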
Start Hive
[hadoop@hadoop102 bin]$ hive
which: no hbase in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/module/jdk1.8.0_212/bin:/opt/module/hadoop-3.1.3/bin:/opt/module/hadoop-3.1.3/sbin:/home/hadoop/bin:/opt/module/apache-hive-3.1.2-bin/bin:/root/bin:/opt/module/jdk1.8.0_212/bin:/opt/module/hadoop-3.1.3/bin:/opt/module/hadoop-3.1.3/sbin:/home/hadoop/bin:/opt/module/apache-hive-3.1.2-bin/bin)
Hive Session ID = d9065b52-dc14-44c3-a6ff-c2648e1c7ffc
Logging initialized using configuration in jar:file:/opt/module/apache-hive-3.1.2-bin/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> show databases;
hive> show tables;
hive> create table test(id int);
hive> insert into test values(1);
hive> select * from test;
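Since Hive stores its data on HDFS, the row just inserted lands as a file under the warehouse directory (default /user/hive/warehouse); it can be inspected directly, and the cat should print the inserted value 1:
[hadoop@hadoop102 bin]$ hadoop fs -ls /user/hive/warehouse/test
[hadoop@hadoop102 bin]$ hadoop fs -cat /user/hive/warehouse/test/*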
MySQL Installation
Download:
https://downloads.mysql.com/archives/community/
- Check whether MySQL is already installed on the system
[hadoop@hadoop102 software]$ rpm -qa | grep mariadb
mariadb-libs-5.5.56-2.el7.x86_64
# If it is present, remove it with the following command
[hadoop@hadoop102 software]$ sudo rpm -e --nodeps mariadb-libs
- Extract the installation bundle
[hadoop@hadoop102 software]$ tar -xf mysql-5.7.28-1.el7.x86_64.rpm-bundle.tar
- Install the RPMs (in this order)
[hadoop@hadoop102 software]$ sudo rpm -ivh mysql-community-common-5.7.28-1.el7.x86_64.rpm
[hadoop@hadoop102 software]$ sudo rpm -ivh mysql-community-libs-5.7.28-1.el7.x86_64.rpm
[hadoop@hadoop102 software]$ sudo rpm -ivh mysql-community-libs-compat-5.7.28-1.el7.x86_64.rpm
[hadoop@hadoop102 software]$ sudo rpm -ivh mysql-community-client-5.7.28-1.el7.x86_64.rpm
[hadoop@hadoop102 software]$ sudo rpm -ivh mysql-community-server-5.7.28-1.el7.x86_64.rpm
- If the directory that datadir in /etc/my.cnf points to is not empty, delete everything in it:
[hadoop@hadoop102 software]$ cat /etc/my.cnf
datadir=/var/lib/mysql
[hadoop@hadoop102 software]$ cd /var/lib/mysql
[hadoop@hadoop102 mysql]$ sudo rm -rf ./*
- Initialize the database
[hadoop@hadoop102 opt]$ sudo mysqld --initialize --user=mysql
- Look up the temporary password generated for the root user
[hadoop@hadoop102 opt]$ sudo cat /var/log/mysqld.log
2021-08-23T13:09:45.493567Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2021-08-23T13:09:45.665118Z 0 [Warning] InnoDB: New log files created, LSN=45790
2021-08-23T13:09:45.700558Z 0 [Warning] InnoDB: Creating foreign key constraint system tables.
2021-08-23T13:09:45.760729Z 0 [Warning] No existing UUID has been found, so we assume that this is the first time that this server has been started. Generating a new UUID: 633f05c9-0413-11ec-b89b-0800273563af.
2021-08-23T13:09:45.763125Z 0 [Warning] Gtid table is not ready to be used. Table 'mysql.gtid_executed' cannot be opened.
2021-08-23T13:09:46.143574Z 0 [Warning] CA certificate ca.pem is self signed.
2021-08-23T13:09:46.313176Z 1 [Note] A temporary password is generated for root@localhost: 6QF+*aKqC4dp
- Start the mysql service
[hadoop@hadoop102 opt]$ sudo systemctl start mysqld
- Log in to MySQL (enter the temporary password from above)
[hadoop@hadoop102 opt]$ mysql -uroot -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.7.28

Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>
- You must change the root password first; otherwise any other operation will raise an error
mysql> set password = password("123456");
Query OK, 0 rows affected, 1 warning (0.00 sec)
- Update the user table in the mysql database so that root may connect from any host
mysql> update mysql.user set host='%' where user='root';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0
mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)
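Two quick sanity checks, both standard commands: confirm the service is active, and that root now has host '%':
[hadoop@hadoop102 opt]$ sudo systemctl status mysqld
mysql> select user, host from mysql.user;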
Configuring the Hive Metastore in MySQL
- Copy the driver
Copy the MySQL JDBC driver into Hive's lib directory
[hadoop@hadoop102 software]$ cp mysql-connector-java-5.1.37.jar $HIVE_HOME/lib
- Point the Metastore at MySQL
(1) Create a new hive-site.xml file under the $HIVE_HOME/conf directory
[hadoop@hadoop102 lib]$ vi $HIVE_HOME/conf/hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- JDBC connection URL -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop102:3306/metastore?useSSL=false</value>
    </property>
    <!-- JDBC driver -->
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <!-- JDBC username -->
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <!-- JDBC password; must match the MySQL root password set above -->
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
    </property>
    <!-- Metastore schema version verification -->
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>
    <!-- Metastore event DB notification API auth -->
    <property>
        <name>hive.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>
    <!-- Hive's default working directory on HDFS -->
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>
</configuration>
(2) Log in to MySQL
[hadoop@hadoop102 lib]$ mysql -uroot -p123456
(3) Create the Hive metastore database
mysql> create database metastore;
mysql> quit;
(4) Initialize the Hive metastore schema
[hadoop@hadoop102 software]$ schematool -initSchema -dbType mysql -verbose
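After initialization you can peek at what schematool created; the metastore database now contains Hive's internal tables (DBS, TBLS, COLUMNS_V2, and so on):
mysql> use metastore;
mysql> show tables;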
Configuring Hive Services for JDBC Access
Add the following configuration to the hive-site.xml file
[hadoop@hadoop102 conf]$ cd /opt/module/apache-hive-3.1.2-bin/conf/
[hadoop@hadoop102 conf]$ vi hive-site.xml
Add the following content
<!-- Address of the metastore service to connect to -->
<property>
    <name>hive.metastore.uris</name>
    <value>thrift://hadoop102:9083</value>
</property>
<!-- Host that HiveServer2 binds to -->
<property>
    <name>hive.server2.thrift.bind.host</name>
    <value>hadoop102</value>
</property>
<!-- Port that HiveServer2 listens on -->
<property>
    <name>hive.server2.thrift.port</name>
    <value>10000</value>
</property>
Commands to start the Hive services:
[hadoop@hadoop102 hive]$ nohup hive --service metastore 2>&1 &
[hadoop@hadoop102 hive]$ nohup hive --service hiveserver2 2>&1 &
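Once HiveServer2 is up (it can take a while after launch before it accepts connections), any JDBC client can connect; the simplest is Beeline, which ships with Hive:
[hadoop@hadoop102 hive]$ beeline -u jdbc:hive2://hadoop102:10000 -n hadoop
0: jdbc:hive2://hadoop102:10000> show databases;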
To make this easier, you can write a script that manages starting and stopping both services
[hadoop@hadoop102 conf]$ vi ~/bin/hiveservices.sh
#!/bin/bash
HIVE_LOG_DIR=$HIVE_HOME/logs
if [ ! -d $HIVE_LOG_DIR ]
then
mkdir -p $HIVE_LOG_DIR
fi
# Check whether a service is running properly; $1 is a process-name pattern, $2 is the port it should listen on
function check_process()
{
# pid(s) of the process matched by name
pid=$(ps -ef 2>/dev/null | grep -v grep | grep -i $1 | awk '{print $2}')
# pid of whatever is listening on the given port
ppid=$(netstat -nltp 2>/dev/null | grep $2 | awk '{print $7}' | cut -d '/' -f 1)
echo $pid
# healthy only if the port owner is among the pids matched by name
[[ "$pid" =~ "$ppid" ]] && [ "$ppid" ] && return 0 || return 1
}
function hive_start()
{
metapid=$(check_process HiveMetastore 9083)
cmd="nohup hive --service metastore >$HIVE_LOG_DIR/metastore.log 2>&1 &"
[ -z "$metapid" ] && eval $cmd || echo "Metastore service is already running"
server2pid=$(check_process HiveServer2 10000)
cmd="nohup hiveserver2 >$HIVE_LOG_DIR/hiveServer2.log 2>&1 &"
[ -z "$server2pid" ] && eval $cmd || echo "HiveServer2 service is already running"
}
function hive_stop()
{
metapid=$(check_process HiveMetastore 9083)
[ "$metapid" ] && kill $metapid || echo "Metastore service is not running"
server2pid=$(check_process HiveServer2 10000)
[ "$server2pid" ] && kill $server2pid || echo "HiveServer2 service is not running"
}
case $1 in
"start")
hive_start
;;
"stop")
hive_stop
;;
"restart")
hive_stop
sleep 2
hive_start
;;
"status")
check_process HiveMetastore 9083 >/dev/null && echo "Metastore 服务运行正常" || echo "Metastore 服务运行异常"
check_process HiveServer2 10000 >/dev/null && echo "HiveServer2 服务运行正常" || echo "HiveServer2 服务运行异常"
;;
*)
echo Invalid Args!
echo 'Usage: '$(basename $0)' start|stop|restart|status'
;;
esac
Make the script executable
[hadoop@hadoop102 conf]$ chmod +x /home/hadoop/bin/hiveservices.sh
Start the Hive background services
[hadoop@hadoop102 conf]$ hiveservices.sh start
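The other subcommands work the same way; for example, status reports on both services using the messages defined in the script above:
[hadoop@hadoop102 conf]$ hiveservices.sh status
Metastore service is running normally
HiveServer2 service is running normally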