1. Introduction to Hive
- Hive is a data warehouse tool built on top of Hadoop.
- It maps structured data files to database tables and provides a simple SQL-like query language (HQL); HQL statements are translated into MapReduce jobs for execution.
- The learning curve is low: simple MapReduce statistics can be implemented quickly with SQL-like statements, without developing dedicated MapReduce applications.
- It is well suited for data warehouse ETL and statistical analysis.
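As a sketch of the "structured files mapped to tables" idea, here is a minimal HiveQL example; the HDFS path, table name, and columns are illustrative assumptions, not from the original post:

```sql
-- Sketch: map a structured HDFS file onto a table, then query it with HQL.
-- The path /data/web_log/ and the schema below are hypothetical.
CREATE EXTERNAL TABLE web_log (
  ip  STRING,
  ts  STRING,
  url STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/web_log/';

-- This HQL statement is compiled into a MapReduce job by Hive:
SELECT ip, count(*) FROM web_log GROUP BY ip;
```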
2. Features of Hive
- Simple and easy to use
  Based on SQL expression syntax; compatible with most SQL-92 semantics and some SQL-2003 extensions
- Scalable
  Hive is built on Hadoop, so the cluster can be scaled out freely, usually without restarting the service
- Extensible
  Hive supports user-defined functions, so users can implement functions tailored to their own needs
- Fault tolerant
  Thanks to Hadoop's fault tolerance, a SQL query can still complete even when nodes fail
3. Hive Use Cases
- Best suited for
  Batch jobs over large datasets, for example web log analysis
- Not suited for
  Low-latency queries over large datasets: even on a dataset of a few hundred MB, a Hive query typically has minute-level latency
  Online transaction processing (OLTP)
  Row-level data updates (Update is supported starting with version 2.0)
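To make the Update caveat concrete: in Hive 2.0+, row-level updates require an ACID (transactional) table. A minimal sketch, assuming the `test` database exists and the transaction manager is enabled; the table name `emp_acid` is hypothetical:

```sql
-- Sketch only: row-level UPDATE needs a transactional (ACID) table.
-- Assumes hive.support.concurrency=true and
-- hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager are configured.
CREATE TABLE test.emp_acid (
  empno INT,
  ename STRING,
  sal   DOUBLE
)
CLUSTERED BY (empno) INTO 4 BUCKETS  -- bucketing is required for ACID in Hive 2.x
STORED AS ORC                        -- ACID tables must use the ORC format
TBLPROPERTIES ('transactional'='true');

UPDATE test.emp_acid SET sal = sal * 1.1 WHERE empno = 7369;
```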
4. Hive Commands
Commonly used commands inside the Hive CLI:
hive> set mapred.reduce.tasks=32;   -- set a configuration property for the session
hive> set;                          -- print all properties and variables
hive> select a.* from tab1 a;       -- run a query
hive> !ls;                          -- run a shell command from the CLI
hive> dfs -ls;                      -- run an HDFS dfs command from the CLI
Using the Hive CLI
[root@hp1 libext]# hive -H
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
usage: hive
-d,--define <key=value> Variable subsitution to apply to hive
commands. e.g. -d A=B or --define A=B
--database <databasename> Specify the database to use
-e <quoted-query-string> SQL from command line
-f <filename> SQL from files
-H,--help Print help information
--hiveconf <property=value> Use value for given property
--hivevar <key=value> Variable subsitution to apply to hive
commands. e.g. --hivevar A=B
-i <filename> Initialization SQL file
-S,--silent Silent mode in interactive shell
-v,--verbose Verbose mode (echo executed SQL to the
console)
Executing SQL scripts with Hive
hive -e 'select count(*) from test.emp'
hive -f a.sql
Test run:
[root@hp1 ~]# more a.sql
select count(*) from test.emp;
[root@hp1 ~]#
[root@hp1 ~]# hive -f a.sql
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/hive-common-2.1.1-cdh6.3.1.jar!/hive-log4j2.properties Async: false
Query ID = root_20201127172235_292c9382-f5e4-46fd-ab9b-5660eeba4f44
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
20/11/27 17:22:38 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1606444720955_0003, Tracking URL = http://hp3:8088/proxy/application_1606444720955_0003/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606444720955_0003
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2020-11-27 17:22:46,594 Stage-1 map = 0%, reduce = 0%
2020-11-27 17:22:51,815 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 1.99 sec
2020-11-27 17:22:52,850 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.91 sec
2020-11-27 17:22:58,031 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.93 sec
MapReduce Total cumulative CPU time: 5 seconds 930 msec
Ended Job = job_1606444720955_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 5.93 sec HDFS Read: 14178 HDFS Write: 102 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 930 msec
OK
14
Time taken: 23.234 seconds, Fetched: 1 row(s)
5. Hive Parameters
Hive variables and properties fall into four namespaces, covered in the subsections below: hivevar, hiveconf, system, and env.
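Hive resolves variables from several namespaces (hivevar, hiveconf, system, env), each referenced as `${namespace:name}`. A session sketch showing how each is set and read, assuming the `test.emp` table from earlier:

```sql
-- Session sketch: the four variable namespaces and their prefixes.
set hivevar:empno=7369;                   -- user-defined variable
set hiveconf:hive.cli.print.header=true;  -- Hive configuration property
set system:user.name;                     -- Java system property (read)
set env:HOME;                             -- OS environment variable (read-only inside Hive)

-- Substitution uses ${namespace:name}:
select * from test.emp where empno=${hivevar:empno};
```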
5.1 Hivevar example
Code:
hive -d foo=bar
set foo;
set hivevar:empno=7369;
select * from test.emp where empno=${hivevar:empno};
Test run:
[root@hp1 ~]# hive -d foo=bar
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/hive-common-2.1.1-cdh6.3.1.jar!/hive-log4j2.properties Async: false
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> set foo;
foo=bar
> set hivevar:empno=7369;
hive> select * from test.emp where empno=${hivevar:empno};
Query ID = root_20201127173211_9f830148-5427-4b15-b356-32d2ab9b3c74
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
20/11/27 17:32:12 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1606444720955_0004, Tracking URL = http://hp3:8088/proxy/application_1606444720955_0004/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606444720955_0004
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-11-27 17:32:20,448 Stage-1 map = 0%, reduce = 0%
2020-11-27 17:32:26,683 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.61 sec
MapReduce Total cumulative CPU time: 6 seconds 610 msec
Ended Job = job_1606444720955_0004
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Cumulative CPU: 6.61 sec HDFS Read: 11789 HDFS Write: 231 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 610 msec
OK
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
Time taken: 17.64 seconds, Fetched: 1 row(s)
5.2 Hiveconf example
Code:
hive --hiveconf hive.cli.print.header=true
select * from test.emp where empno=7369;
set hive.cli.print.header;
set hive.cli.print.header=false;
set hive.cli.print.header;
select * from test.emp where empno=7369;
Test run:
[root@hp1 ~]# hive --hiveconf hive.cli.print.header=true
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/hive-common-2.1.1-cdh6.3.1.jar!/hive-log4j2.properties Async: false
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive>
> select * from test.emp where empno=7369;
Query ID = root_20201127173938_ca672a76-b8d5-4281-86a1-f8305b202ad8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
20/11/27 17:39:40 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1606444720955_0005, Tracking URL = http://hp3:8088/proxy/application_1606444720955_0005/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606444720955_0005
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-11-27 17:39:48,730 Stage-1 map = 0%, reduce = 0%
2020-11-27 17:39:54,961 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.4 sec
MapReduce Total cumulative CPU time: 6 seconds 400 msec
Ended Job = job_1606444720955_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Cumulative CPU: 6.4 sec HDFS Read: 11789 HDFS Write: 231 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 400 msec
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
Time taken: 17.515 seconds, Fetched: 1 row(s)
hive>
> set hive.cli.print.header;
hive.cli.print.header=true
hive> set hive.cli.print.header=false;
hive> set hive.cli.print.header;
hive.cli.print.header=false
hive> select * from test.emp where empno=7369;
Query ID = root_20201127174246_26ec846e-1d16-4cf2-a9d4-6436632dc587
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
20/11/27 17:42:46 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1606444720955_0006, Tracking URL = http://hp3:8088/proxy/application_1606444720955_0006/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606444720955_0006
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-11-27 17:42:53,878 Stage-1 map = 0%, reduce = 0%
2020-11-27 17:43:01,096 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.51 sec
MapReduce Total cumulative CPU time: 6 seconds 510 msec
Ended Job = job_1606444720955_0006
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Cumulative CPU: 6.51 sec HDFS Read: 11969 HDFS Write: 231 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 510 msec
OK
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
Time taken: 15.72 seconds, Fetched: 1 row(s)
hive>
5.3 System variables
Code:
set system:user.name;
set system:user.name=7369;
set system:user.name;
select * from test.emp where empno=${system:user.name};
Test run:
[root@hp1 ~]# hive
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/hive-common-2.1.1-cdh6.3.1.jar!/hive-log4j2.properties Async: false
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> set system:user.name;
system:user.name=root
hive> set system:user.name=7369;
hive> set system:user.name;
system:user.name=7369
hive> select * from test.emp where empno=${system:user.name};
Query ID = 7369_20201127180045_9f902d3a-070d-4e12-8641-38063197eafc
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
20/11/27 18:00:47 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1606444720955_0010, Tracking URL = http://hp3:8088/proxy/application_1606444720955_0010/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606444720955_0010
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-11-27 18:00:55,955 Stage-1 map = 0%, reduce = 0%
2020-11-27 18:01:02,281 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 3.11 sec
2020-11-27 18:01:03,313 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.29 sec
MapReduce Total cumulative CPU time: 6 seconds 290 msec
Ended Job = job_1606444720955_0010
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Cumulative CPU: 6.29 sec HDFS Read: 11789 HDFS Write: 231 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 290 msec
OK
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
Time taken: 19.033 seconds, Fetched: 1 row(s)
hive>
5.4 Env variables
Code:
export empno=7369
hive
set env:empno;
select * from test.emp where empno=${env:empno};
Test run:
[root@hp1 ~]# export empno=7369
[root@hp1 ~]# hive
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/hive-common-2.1.1-cdh6.3.1.jar!/hive-log4j2.properties Async: false
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> set env:empno;
env:empno=7369
hive> select * from test.emp where empno=${env:empno};
Query ID = root_20201127180306_e2344df7-49f4-4746-9c60-6f62f81d36bb
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
20/11/27 18:03:08 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1606444720955_0011, Tracking URL = http://hp3:8088/proxy/application_1606444720955_0011/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606444720955_0011
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-11-27 18:03:16,404 Stage-1 map = 0%, reduce = 0%
2020-11-27 18:03:22,675 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.41 sec
MapReduce Total cumulative CPU time: 6 seconds 410 msec
Ended Job = job_1606444720955_0011
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Cumulative CPU: 6.41 sec HDFS Read: 11789 HDFS Write: 231 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 410 msec
OK
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
Time taken: 17.48 seconds, Fetched: 1 row(s)
hive>
6. Hive System Architecture
Hive system architecture diagram:
6.1 Hive Interfaces
There are three main user interfaces: the CLI, ThriftServer, and HWI.
The CLI is the most commonly used; when it starts, it also starts a Hive Driver.
ThriftServer wraps Hive as a service over the Thrift protocol, enabling cross-language access from clients such as Python and C++, and implements the JDBC/ODBC protocols.
HWI (Hive Web Interface) provides browser-based access to Hive.
6.2 Metadata Storage
Hive stores its metadata in a relational database such as MySQL, Oracle, or Derby.
Hive's metadata includes table names, columns and partitions along with their attributes, table properties (such as whether a table is external), and the directory where each table's data resides.
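For illustration, this metadata can also be inspected directly in the metastore database. A sketch assuming a MySQL metastore with the default schema (`DBS`, `TBLS`, and `SDS` are standard metastore tables, but names can vary across Hive versions, so verify against your installation):

```sql
-- Sketch: browse Hive metadata stored in a MySQL metastore.
SELECT d.NAME     AS db_name,
       t.TBL_NAME AS table_name,
       t.TBL_TYPE AS table_type,    -- e.g. MANAGED_TABLE / EXTERNAL_TABLE
       s.LOCATION AS hdfs_location  -- directory holding the table's data
FROM   TBLS t
JOIN   DBS  d ON t.DB_ID = d.DB_ID
JOIN   SDS  s ON t.SD_ID = s.SD_ID
WHERE  d.NAME = 'test';
```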
6.3 Driver
- Compiler
  Performs lexical and syntactic analysis, parsing the HQL query into an AST;
  generates a logical execution plan from the AST;
  generates a physical MapReduce execution plan from the logical plan
- Optimizer
  Optimizes the logical execution plan;
  optimizes the physical execution plan
- Executor
  Converts the physical execution plan into MapReduce jobs;
  submits them to Hadoop for execution
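The compiler's output can be inspected with EXPLAIN, which prints the stage plan for a query (the exact stages and operator trees depend on the query, the data, and the Hive version):

```sql
-- Show the execution plan the compiler produces; for this query,
-- Stage-1 is typically a MapReduce stage with map and reduce operator trees.
explain select count(*) from test.emp;
```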