1. Why Hive Emerged
MapReduce's drawbacks are obvious: in real business scenarios, writing MapReduce code is complex and costly.
2. Hive Overview
Hive is one SQL-on-Hadoop solution. (SQL on Hadoop: Hive/Presto/Impala/Spark SQL…)
2.1 Official Definition
The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.
Keywords: data warehouse, reading and writing, large datasets, distributed storage, SQL, command line tool, JDBC driver
2.2 Notes
Hive was open-sourced by Facebook to solve statistical analysis over massive structured log data; it is a data warehouse built on top of Hadoop (HDFS/MapReduce/YARN).
- Data storage: Hive's data lives on HDFS;
- Execution engine: the underlying engine can be MapReduce, Tez, or Spark, switchable with a single parameter (hive.execution.engine);
- Resource scheduling: Hive jobs are submitted to YARN to run;
- Interface: Hive provides HQL, a query language similar to, but not identical to, SQL;
- Use cases: offline / batch processing;
- Data formats: plain text, compressed, columnar storage;
- Aside: so-called big data "cloudification" is a very large project in its own right
3. Hive Architecture
3.1 User Interface Layer
- Command line (CLI)
- JDBC connections
3.2 Driver
- SQL parsing: SQL --> AST (via ANTLR)
- Query optimization: logical/physical execution plans
- UDFs/SerDes: user-defined functions / serialization and deserialization
- Execution
3.3 Metadata (Metastore)
- Table name, owning database, columns (name/type/index), table type, the directory holding the table's data, and other storage information
3.4 Data Storage
- Stored on HDFS
Hive's data consists of two parts: the data itself + the metadata.
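The metadata above ends up in the relational metastore (MySQL in this deployment) and can be inspected there directly. A sketch, assuming the metastore database is named ruozedata_hive as configured in Section 6.4 (TBLS and COLUMNS_V2 are standard metastore tables):

```sql
-- run in the MySQL client, not in Hive
use ruozedata_hive;
select TBL_ID, TBL_NAME, TBL_TYPE from TBLS;                   -- one row per Hive table
select COLUMN_NAME, TYPE_NAME, INTEGER_IDX from COLUMNS_V2;    -- column name / type / index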
4. Hive vs. MapReduce: Pros and Cons
- Pro: compared with MapReduce, Hive processes data on Hadoop with SQL, so business logic is simple and fast to implement
- Con: when the underlying engine is MapReduce, performance is inevitably no better than raw MapReduce
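To make the contrast concrete: the classic word count needs a Mapper, a Reducer, and a driver class in MapReduce, but in Hive it is a single statement. A sketch, assuming a hypothetical table lines(line string) already holding the text:

```sql
select word, count(*) as cnt
from lines
lateral view explode(split(line, ' ')) t as word
group by word;
```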
5. Hive vs. RDBMS
- Similarities
Both support SQL (https://cwiki.apache.org/confluence/display/Hive)
Both support transactions: Hive supports them too, but they are rarely used
Both support insert/update/delete: in Apache Hive these are only supported from Hive 0.14 onwards (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML)
For a Hive data warehouse, writes are rare: data is batch-loaded into Hive and then analyzed
- Differences
Data volume / cluster scale: Hive's is far, far larger
Latency / timeliness: Hive has high latency
6. Hive Deployment
Official docs: https://cwiki.apache.org/confluence/display/Hive/GettingStarted
6.1 Prerequisites
- JDK 8 installed
- Hadoop 2.x installed
- A Linux operating system
- A MySQL database available to serve as the metastore
6.2 Download, upload, and extract; create a symlink; inspect the directory layout
CDH-5.16.2 :http://archive.cloudera.com/cdh5/cdh/5/hive-1.1.0-cdh5.16.2.tar.gz
[ruoze@rzdata001 software]$ tar -zxvf hive-1.1.0-cdh5.16.2.tar.gz -C ~/app
...
[ruoze@rzdata001 software]$ cd ~/app
[ruoze@rzdata001 app]$ ll
total 8
lrwxrwxrwx 1 ruoze ruoze 22 Nov 28 21:24 hadoop -> hadoop-2.6.0-cdh5.16.2
drwxr-xr-x 15 ruoze ruoze 4096 Nov 28 21:46 hadoop-2.6.0-cdh5.16.2
drwxr-xr-x 11 ruoze ruoze 4096 Jun 3 2019 hive-1.1.0-cdh5.16.2
[ruoze@rzdata001 app]$ ln -s hive-1.1.0-cdh5.16.2 hive
[ruoze@rzdata001 app]$ ll
total 8
lrwxrwxrwx 1 ruoze ruoze 22 Nov 28 21:24 hadoop -> hadoop-2.6.0-cdh5.16.2
drwxr-xr-x 15 ruoze ruoze 4096 Nov 28 21:46 hadoop-2.6.0-cdh5.16.2
lrwxrwxrwx 1 ruoze ruoze 20 Dec 15 11:35 hive -> hive-1.1.0-cdh5.16.2
drwxr-xr-x 11 ruoze ruoze 4096 Jun 3 2019 hive-1.1.0-cdh5.16.2
[ruoze@rzdata001 app]$
[ruoze@rzdata001 app]$ cd hive
[ruoze@rzdata001 hive]$ ll
total 420
drwxr-xr-x 2 ruoze ruoze 47 Jun 3 2019 auxlib
drwxr-xr-x 3 ruoze ruoze 112 Jun 3 2019 bin # commands
drwxr-xr-x 2 ruoze ruoze 4096 Jun 3 2019 conf # configuration files
drwxr-xr-x 3 ruoze ruoze 20 Jun 3 2019 data
drwxr-xr-x 6 ruoze ruoze 104 Jun 3 2019 docs
drwxr-xr-x 4 ruoze ruoze 32 Jun 3 2019 examples
drwxr-xr-x 7 ruoze ruoze 63 Jun 3 2019 hcatalog
drwxr-xr-x 4 ruoze ruoze 8192 Jun 3 2019 lib # dependency jars, etc.
-rw-r--r-- 1 ruoze ruoze 24754 Jun 3 2019 LICENSE
-rw-r--r-- 1 ruoze ruoze 397 Jun 3 2019 NOTICE
-rw-r--r-- 1 ruoze ruoze 4048 Jun 3 2019 README.txt
-rw-r--r-- 1 ruoze ruoze 376416 Jun 3 2019 RELEASE_NOTES.txt
drwxr-xr-x 3 ruoze ruoze 22 Jun 3 2019 scripts
[ruoze@rzdata001 hive]$
6.3 Configure environment variables
[ruoze@rzdata001 app]$ vim ~/.bashrc
# Hive env
export HIVE_HOME=/home/ruoze/app/hive
export PATH=$HIVE_HOME/bin:$PATH
[ruoze@rzdata001 app]$
[ruoze@rzdata001 app]$ source ~/.bashrc
[ruoze@rzdata001 app]$ which hive
~/app/hive/bin/hive
[ruoze@rzdata001 app]$
6.4 Configure hive-site.xml
After extracting Hive, there is no default hive-site.xml template under the conf directory:
[ruoze@rzdata001 hive]$ cd conf
[ruoze@rzdata001 conf]$ ll
total 20
-rw-r--r-- 1 ruoze ruoze 1196 Jun 3 2019 beeline-log4j.properties.template
-rw-r--r-- 1 ruoze ruoze 2378 Jun 3 2019 hive-env.sh.template
-rw-r--r-- 1 ruoze ruoze 2662 Jun 3 2019 hive-exec-log4j.properties.template
-rw-r--r-- 1 ruoze ruoze 3505 Jun 3 2019 hive-log4j.properties.template
-rw-r--r-- 1 ruoze ruoze 2060 Jun 3 2019 ivysettings.xml
[ruoze@rzdata001 conf]$
Create and populate it yourself:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://rzdata001:3306/ruozedata_hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>ruozedata</value>
</property>
<property>
<!-- show the current database name in the hive CLI prompt -->
<name>hive.cli.print.current.db</name>
<value>true</value>
</property>
<property>
<!-- print column headers in query results -->
<name>hive.cli.print.header</name>
<value>true</value>
</property>
</configuration>
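The JDBC URL above includes createDatabaseIfNotExist=true, so the ruozedata_hive database is created automatically on first use; the MySQL account only needs adequate privileges. A sketch in MySQL 5.x syntax (the host wildcard and password mirror the config above; tighten them per your security policy):

```sql
-- run as a MySQL administrator
grant all privileges on ruozedata_hive.* to 'root'@'%' identified by 'ruozedata';
flush privileges;
```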
6.5 Download and upload the mysql-connector-java driver
Download the mysql-connector-java jar matching your MySQL version and put it in the lib directory: https://dev.mysql.com/downloads/connector/j/5.1.html
[ruoze@rzdata001 conf]$ cd ../lib
[ruoze@rzdata001 lib]$ ls -l | grep mysql-connector-java
-rw-r--r-- 1 ruoze ruoze 1006959 Jul 11 13:43 mysql-connector-java-5.1.48-bin.jar
[ruoze@rzdata001 lib]$
6.6 Start Hive and try it out
[ruoze@rzdata001 hive]$ which hive
~/app/hive/bin/hive
[ruoze@rzdata001 hive]$ hive
which: no hbase in (/home/ruoze/app/hive/bin:/home/ruoze/app/hadoop/bin:/home/ruoze/app/hadoop/sbin:/home/hadoop/app/hive-1.1.0-cdh5.16.2/bin:/home/ruoze/app/hadoop/bin:/home/ruoze/app/hadoop/sbin:/home/ruoze/app/hadoop/bin:/home/ruoze/app/hadoop/sbin:/usr/java/jdk1.8.0_121/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/ruoze/.local/bin:/home/ruoze/bin)
19/12/15 12:17:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Logging initialized using configuration in jar:file:/home/ruoze/app/hive-1.1.0-cdh5.16.2/lib/hive-common-1.1.0-cdh5.16.2.jar!/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive (default)>
7. Basic Hive Usage
show databases;
List all databases:
hive (default)> show databases;
OK
database_name
default
Time taken: 3.463 seconds, Fetched: 1 row(s)
hive (default)>
use dbname;
Switch databases:
hive (default)> use default;
OK
Time taken: 0.02 seconds
create table stu(id int, name string,age int);
Create a table (a simple example):
hive (default)> create table stu(id int, name string,age int);
OK
Time taken: 2.309 seconds
hive (default)>
show tables;
List all tables in the current database:
hive (default)> show tables;
OK
tab_name
stu
Time taken: 0.016 seconds, Fetched: 1 row(s)
hive (default)>
View the table structure:
desc stu;
desc extended stu;
desc formatted stu;
hive (default)> desc stu;
OK
col_name data_type comment
id int
name string
age int
Time taken: 0.071 seconds, Fetched: 3 row(s)
hive (default)>
hive (default)>
hive (default)> desc extended stu;
OK
col_name data_type comment
id int
name string
age int
Detailed Table Information Table(tableName:stu, dbName:default, owner:ruoze, createTime:1576383831, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:int, comment:null), FieldSchema(name:name, type:string, comment:null), FieldSchema(name:age, type:int, comment:null)], location:hdfs://rzdata001:9000/user/hive/warehouse/stu, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{transient_lastDdlTime=1576383831}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, ownerType:USER)
Time taken: 0.061 seconds, Fetched: 5 row(s)
hive (default)>
hive (default)>
hive (default)> desc formatted stu;
OK
col_name data_type comment
# col_name data_type comment
id int
name string
age int
# Detailed Table Information
Database: default
OwnerType: USER
Owner: ruoze
CreateTime: Sun Dec 15 12:23:51 CST 2019
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://rzdata001:9000/user/hive/warehouse/stu
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1576383831
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.056 seconds, Fetched: 29 row(s)
hive (default)>
insert into stu values(1,'cong',30);
Insert data (this launches a full MapReduce job just to insert a single row, so it is not recommended):
hive (default)> insert into stu values(1,'cong',30);
Query ID = ruoze_20191215123636_f87550af-4e8c-46f8-aa32-23d8180a6576
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1575802210826_0002, Tracking URL = http://rzdata001:38088/proxy/application_1575802210826_0002/
Kill Command = /home/ruoze/app/hadoop/bin/hadoop job -kill job_1575802210826_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-12-15 12:36:47,977 Stage-1 map = 0%, reduce = 0%
2019-12-15 12:36:53,254 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.9 sec
MapReduce Total cumulative CPU time: 1 seconds 900 msec
Ended Job = job_1575802210826_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://rzdata001:9000/user/hive/warehouse/stu/.hive-staging_hive_2019-12-15_12-36-40_659_1751786226686695013-1/-ext-10000
Loading data to table default.stu
Table default.stu stats: [numFiles=1, numRows=1, totalSize=10, rawDataSize=9]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.9 sec HDFS Read: 3893 HDFS Write: 77 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 900 msec
OK
_col0 _col1 _col2
Time taken: 13.888 seconds
hive (default)>
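Because the single-row insert above launches a full MapReduce job, batch loading from files is the idiomatic way to get data into Hive. A sketch, assuming a tab-separated local file /home/ruoze/data/stu.txt (both the path and the delimiter are assumptions):

```sql
-- declare an explicit field delimiter so a text file maps onto the columns
create table stu2(id int, name string, age int)
row format delimited fields terminated by '\t';

-- copies the file into the warehouse directory; no MapReduce job is launched
load data local inpath '/home/ruoze/data/stu.txt' into table stu2;
```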
select * from stu;
Query data:
hive (default)> select * from stu;
OK
stu.id stu.name stu.age
1 cong 30
Time taken: 0.055 seconds, Fetched: 1 row(s)
hive (default)>
exit;
Quit:
hive (default)> exit;
8. Parameter Configuration
Official docs: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
Check the table's backing directory on HDFS:
[ruoze@rzdata001 hive]$ hadoop fs -ls /user/hive/warehouse/stu
19/12/15 12:53:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rwxr-xr-x 1 ruoze supergroup 10 2019-12-15 12:36 /user/hive/warehouse/stu/000000_0
[ruoze@rzdata001 hive]$
The warehouse's default location is controlled by the following parameter:
Parameter | Default value |
---|---|
hive.metastore.warehouse.dir | /user/hive/warehouse |
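If the default location is unsuitable, it can be overridden in hive-site.xml before tables are created (the path below is hypothetical):

```xml
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/ruoze/warehouse</value>
</property>
```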
9. Hive's Full Execution Log
Check the log-path configuration:
[ruoze@rzdata001 conf]$ pwd
/home/ruoze/app/hive/conf
[ruoze@rzdata001 conf]$ cat hive-log4j.properties.template | grep hive.log.
hive.log.threshold=ALL
hive.log.dir=${java.io.tmpdir}/${user.name}
hive.log.file=hive.log
log4j.threshold=${hive.log.threshold}
log4j.appender.DRFA.File=${hive.log.dir}/${hive.log.file}
[ruoze@rzdata001 conf]$
[ruoze@rzdata001 conf]$ ll /tmp/ruoze/hive.log
-rw-rw-r-- 1 ruoze ruoze 41619 Dec 15 12:40 /tmp/ruoze/hive.log
[ruoze@rzdata001 conf]$
The default log path is ${java.io.tmpdir}/${user.name}/hive.log, which in this example resolves to /tmp/ruoze/hive.log.
Because of Linux's OOM handling and the automatic expiry cleanup under /tmp, it is recommended to move the log directory elsewhere.
In this example it is changed to /home/ruoze/log/hive/hive.log:
[ruoze@rzdata001 conf]$ cd
[ruoze@rzdata001 ~]$ ll
total 4
drwxrwxr-x 4 ruoze ruoze 86 Dec 15 11:37 app
drwxrwxr-x 2 ruoze ruoze 25 Dec 1 18:51 data
drwxrwxr-x 2 ruoze ruoze 6 Nov 27 21:34 lib
drwxrwxr-x 2 ruoze ruoze 6 Nov 27 21:34 log
drwxrwxr-x 2 ruoze ruoze 76 Dec 15 11:29 software
drwxrwxr-x 2 ruoze ruoze 6 Nov 27 21:34 sourcecode
drwxrwxrwx 4 ruoze ruoze 4096 Dec 8 18:50 tmp
[ruoze@rzdata001 ~]$ cd log
[ruoze@rzdata001 log]$ ll
total 0
[ruoze@rzdata001 log]$ mkdir hive
[ruoze@rzdata001 log]$ cd hive/
[ruoze@rzdata001 hive]$ pwd
/home/ruoze/log/hive
[ruoze@rzdata001 hive]$ cd $HIVE_HOME/conf
[ruoze@rzdata001 conf]$ pwd
/home/ruoze/app/hive/conf
[ruoze@rzdata001 conf]$
[ruoze@rzdata001 conf]$ cp hive-log4j.properties.template hive-log4j.properties
[ruoze@rzdata001 conf]$ vim hive-log4j.properties
hive.log.dir=/home/ruoze/log/hive
[ruoze@rzdata001 conf]$
10. Hive Configuration Properties
- Global: modify $HIVE_HOME/conf/hive-site.xml
- Current session only: use the set command
  Inspect a property: set key;
  Override a property: set key=value;
- At startup: pass -hiveconf, e.g. hive -hiveconf hive.cli.print.current.db=true
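For example, using the hive.cli.print.header property from Section 6.4 (run inside the hive CLI):

```sql
set hive.cli.print.header;        -- print the current value
set hive.cli.print.header=true;   -- override for this session only
```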
11. Hive Command-Line Options (-e / -f)
- -e: run a SQL statement directly, without entering the hive CLI first
[ruoze@rzdata001 ~]$ hive -e "select * from stu"
which: no hbase in (/home/ruoze/app/hive/bin:/home/ruoze/app/hadoop/bin:/home/ruoze/app/hadoop/sbin:/home/hadoop/app/hive-1.1.0-cdh5.16.2/bin:/home/ruoze/app/hadoop/bin:/home/ruoze/app/hadoop/sbin:/home/ruoze/app/hadoop/bin:/home/ruoze/app/hadoop/sbin:/usr/java/jdk1.8.0_121/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/ruoze/.local/bin:/home/ruoze/bin)
19/12/15 13:23:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Logging initialized using configuration in file:/home/ruoze/app/hive-1.1.0-cdh5.16.2/conf/hive-log4j.properties
OK
stu.id stu.name stu.age
1 cong 30
Time taken: 3.389 seconds, Fetched: 1 row(s)
[ruoze@rzdata001 ~]$
- -f: execute a file whose contents are SQL statements
[ruoze@rzdata001 cong]$ cat stu.hql
select * from stu;
[ruoze@rzdata001 cong]$ hive -f stu.hql
which: no hbase in (/home/ruoze/app/hive/bin:/home/ruoze/app/hadoop/bin:/home/ruoze/app/hadoop/sbin:/home/hadoop/app/hive-1.1.0-cdh5.16.2/bin:/home/ruoze/app/hadoop/bin:/home/ruoze/app/hadoop/sbin:/home/ruoze/app/hadoop/bin:/home/ruoze/app/hadoop/sbin:/usr/java/jdk1.8.0_121/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/ruoze/.local/bin:/home/ruoze/bin)
19/12/15 13:25:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Logging initialized using configuration in file:/home/ruoze/app/hive-1.1.0-cdh5.16.2/conf/hive-log4j.properties
OK
stu.id stu.name stu.age
1 cong 30
Time taken: 3.54 seconds, Fetched: 1 row(s)
[ruoze@rzdata001 cong]$
- Tip: for offline statistics / data warehouse jobs built on Hive, wrap the SQL in a shell script using hive -e "query sql…" and schedule it with crontab.
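A minimal sketch of such a wrapper script (table name, paths, and schedule are hypothetical; the hive invocation is echoed rather than executed so the sketch runs without a cluster):

```shell
#!/bin/bash
# daily_stats.sh -- would be scheduled via crontab, e.g.: 0 2 * * * /home/ruoze/shell/daily_stats.sh
DT=$(date -d "yesterday" +%Y%m%d)        # date suffix for the output file
SQL="select count(*) from stu"           # replace with the real statistics SQL
# in production this line would be: hive -e "${SQL}" > /home/ruoze/log/stats_${DT}.log 2>&1
echo hive -e "${SQL}"
```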
12. HiveServer2 & beeline
- HiveServer2: the server side
Start the service on a specific port: hiveserver2 --hiveconf hive.server2.thrift.port=10086
Or add the port parameter to hive-site.xml and run it in the background (when starting, cd into $HIVE_HOME/bin first, because Spark also ships its own beeline):
[ruoze@rzdata001 conf]$ pwd
/home/ruoze/app/hive/conf
[ruoze@rzdata001 conf]$ vim hive-site.xml
<property>
<name>hive.server2.thrift.port</name>
<value>10086</value>
</property>
<property>
<!-- log level of operation logs sent to beeline clients -->
<name>hive.server2.logging.operation.level</name>
<value>NONE</value>
</property>
[ruoze@rzdata001 conf]$ cd ../bin
[ruoze@rzdata001 bin]$ pwd
/home/ruoze/app/hive/bin
[ruoze@rzdata001 bin]$ nohup /home/ruoze/app/hive/bin/hiveserver2 > /dev/null 2>&1 &
[1] 15804
[ruoze@rzdata001 bin]$ jps
23795 DataNode
24292 ResourceManager
24405 NodeManager
15925 Jps
23659 NameNode
23963 SecondaryNameNode
15804 RunJar # the HiveServer2 server process
[ruoze@rzdata001 bin]$ cd ../conf/
[ruoze@rzdata001 conf]$ cat hive-log4j.properties | grep hive.log # check the log output directory
hive.log.threshold=ALL
hive.log.dir=/home/ruoze/log/hive
hive.log.file=hive.log
log4j.threshold=${hive.log.threshold}
log4j.appender.DRFA.File=${hive.log.dir}/${hive.log.file}
[ruoze@rzdata001 conf]$
- beeline: the client side
Connect the client: beeline -u jdbc:hive2://rzdata001:10086/ruoze_hivedb -n ruoze
Or add an alias in ~/.bashrc:
alias beeline='beeline -u jdbc:hive2://rzdata001:10086/ruoze_hivedb -n ruoze'
After sourcing the file, simply type beeline.
Log in:
[ruoze@rzdata001 ~]$ beeline
which: no hbase in (/home/ruoze/app/hive/bin:/home/ruoze/app/hadoop/bin:/home/ruoze/app/hadoop/sbin:/home/ruoze/app/hive/bin:/home/ruoze/app/hadoop/bin:/home/ruoze/app/hadoop/sbin:/usr/java/jdk1.8.0_121/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/ruoze/.local/bin:/home/ruoze/bin)
scan complete in 1ms
Connecting to jdbc:hive2://rzdata001:10086/ruoze_hivedb
Connected to: Apache Hive (version 1.1.0-cdh5.16.2)
Driver: Hive JDBC (version 1.1.0-cdh5.16.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.1.0-cdh5.16.2 by Apache Hive
0: jdbc:hive2://rzdata001:10086/ruoze_hivedb>