Big Data Development with Hive, Part 1: Introduction to Hive

一. Hive Overview

  1. Hive is a data warehouse tool built on top of Hadoop.
  2. It maps structured data files to database tables and provides a simple SQL-like query language (HQL); HQL statements are translated into MapReduce jobs for execution.
  3. The learning curve is low: simple MapReduce-style aggregations can be expressed quickly in SQL-like statements, without developing dedicated MapReduce applications.
  4. It is well suited to data warehouse ETL and statistical analysis.

二. Hive Features

  1. Simple and easy to use
    SQL-based syntax, compatible with most of SQL-92 semantics and some SQL:2003 extensions
  2. Scalable
    Hive runs on Hadoop, so the cluster can be scaled out freely, usually without restarting the service
  3. Extensible
    Hive supports user-defined functions (UDFs), so users can implement functions tailored to their own needs
  4. Fault-tolerant
    Thanks to Hadoop's fault tolerance, a query can still complete even when a node fails

三. Hive Use Cases

  1. Best fit
    Batch processing over large datasets, e.g. web log analysis
  2. Not suitable for
    Low-latency interactive queries over large datasets: even on a dataset of a few hundred MB, a Hive query typically has minute-level latency.
    Online transaction processing (OLTP)
    Row-level data updates (UPDATE has been supported on ACID tables since Hive 0.14)

四. Hive Commands

Common hive shell commands are shown below:
(figure omitted: overview of hive shell commands)

Test run:

  hive> set mapred.reduce.tasks=32;
  hive> set;
  hive> select a.* from tab1;
  hive> !ls;
  hive> dfs -ls;

Hive CLI Usage

[root@hp1 libext]# hive -H
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
usage: hive
 -d,--define <key=value>          Variable subsitution to apply to hive
                                  commands. e.g. -d A=B or --define A=B
    --database <databasename>     Specify the database to use
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -H,--help                        Print help information
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable subsitution to apply to hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the
                                  console)

Executing SQL Scripts with Hive

hive -e 'select count(*) from test.emp'
hive -f a.sql

Test run:

[root@hp1 ~]# more a.sql 
select count(*) from test.emp;
[root@hp1 ~]# 
[root@hp1 ~]# hive -f a.sql
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/hive-common-2.1.1-cdh6.3.1.jar!/hive-log4j2.properties Async: false
Query ID = root_20201127172235_292c9382-f5e4-46fd-ab9b-5660eeba4f44
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
20/11/27 17:22:38 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1606444720955_0003, Tracking URL = http://hp3:8088/proxy/application_1606444720955_0003/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606444720955_0003
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2020-11-27 17:22:46,594 Stage-1 map = 0%,  reduce = 0%
2020-11-27 17:22:51,815 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 1.99 sec
2020-11-27 17:22:52,850 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.91 sec
2020-11-27 17:22:58,031 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.93 sec
MapReduce Total cumulative CPU time: 5 seconds 930 msec
Ended Job = job_1606444720955_0003
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 5.93 sec   HDFS Read: 14178 HDFS Write: 102 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 930 msec
OK
14
Time taken: 23.234 seconds, Fetched: 1 row(s)

五. Hive Parameters

Overview of Hive parameters:
(figure omitted: Hive parameter namespaces overview)

5.1 hivevar Parameter Example

Code:

hive -d foo=bar
set foo;
set hivevar:empno=7369;
select * from test.emp where empno=${hivevar:empno};

Test run:

[root@hp1 ~]# hive -d foo=bar
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/hive-common-2.1.1-cdh6.3.1.jar!/hive-log4j2.properties Async: false

WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> set foo;
foo=bar
    > set hivevar:empno=7369;
hive> select * from test.emp where empno=${hivevar:empno};
Query ID = root_20201127173211_9f830148-5427-4b15-b356-32d2ab9b3c74
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
20/11/27 17:32:12 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1606444720955_0004, Tracking URL = http://hp3:8088/proxy/application_1606444720955_0004/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606444720955_0004
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-11-27 17:32:20,448 Stage-1 map = 0%,  reduce = 0%
2020-11-27 17:32:26,683 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.61 sec
MapReduce Total cumulative CPU time: 6 seconds 610 msec
Ended Job = job_1606444720955_0004
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2   Cumulative CPU: 6.61 sec   HDFS Read: 11789 HDFS Write: 231 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 610 msec
OK
7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
Time taken: 17.64 seconds, Fetched: 1 row(s)
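The `${hivevar:empno}` reference above is expanded textually by the CLI before the statement is compiled. The sketch below is a minimal Python model of that substitution step, purely illustrative (it is not Hive's actual implementation):

```python
import re

def substitute(sql: str, variables: dict) -> str:
    """Expand ${namespace:name} references using a variable map,
    mimicking (in simplified form) Hive's textual substitution."""
    def repl(match):
        key = match.group(1)
        if key not in variables:
            raise KeyError(f"undefined variable: {key}")
        return str(variables[key])
    return re.sub(r"\$\{([^}]+)\}", repl, sql)

# Equivalent of `set hivevar:empno=7369;` followed by the query above.
hivevars = {"hivevar:empno": 7369}
query = "select * from test.emp where empno=${hivevar:empno}"
print(substitute(query, hivevars))
# select * from test.emp where empno=7369
```

Because the expansion is purely textual, the resulting string is what the compiler actually sees, which is why an undefined variable surfaces as a parse error rather than a lookup failure.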

5.2 hiveconf Parameter Example

Code:
hive --hiveconf hive.cli.print.header=true
select * from test.emp where empno=7369;
set hive.cli.print.header;
set hive.cli.print.header=false;
set hive.cli.print.header;
select * from test.emp where empno=7369;

Test run:

[root@hp1 ~]# hive --hiveconf hive.cli.print.header=true
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/hive-common-2.1.1-cdh6.3.1.jar!/hive-log4j2.properties Async: false

WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> 
    > select * from test.emp where empno=7369;
Query ID = root_20201127173938_ca672a76-b8d5-4281-86a1-f8305b202ad8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
20/11/27 17:39:40 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1606444720955_0005, Tracking URL = http://hp3:8088/proxy/application_1606444720955_0005/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606444720955_0005
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-11-27 17:39:48,730 Stage-1 map = 0%,  reduce = 0%
2020-11-27 17:39:54,961 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.4 sec
MapReduce Total cumulative CPU time: 6 seconds 400 msec
Ended Job = job_1606444720955_0005
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2   Cumulative CPU: 6.4 sec   HDFS Read: 11789 HDFS Write: 231 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 400 msec
OK
emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm        emp.deptno
7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
Time taken: 17.515 seconds, Fetched: 1 row(s)
hive> 
    > set hive.cli.print.header;
hive.cli.print.header=true
hive> set hive.cli.print.header=false;
hive> set hive.cli.print.header;
hive.cli.print.header=false
hive> select * from test.emp where empno=7369;
Query ID = root_20201127174246_26ec846e-1d16-4cf2-a9d4-6436632dc587
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
20/11/27 17:42:46 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1606444720955_0006, Tracking URL = http://hp3:8088/proxy/application_1606444720955_0006/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606444720955_0006
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-11-27 17:42:53,878 Stage-1 map = 0%,  reduce = 0%
2020-11-27 17:43:01,096 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.51 sec
MapReduce Total cumulative CPU time: 6 seconds 510 msec
Ended Job = job_1606444720955_0006
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2   Cumulative CPU: 6.51 sec   HDFS Read: 11969 HDFS Write: 231 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 510 msec
OK
7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
Time taken: 15.72 seconds, Fetched: 1 row(s)
hive> 
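As the transcript shows, `set key` prints a property's current value while `set key=value` updates it for the session. A toy Python model of that read/write behavior (illustrative only, not Hive's real configuration code):

```python
class HiveConf:
    """Toy model of Hive session configuration:
    `set key` reads, `set key=value` writes."""
    def __init__(self, **initial):
        self._props = dict(initial)

    def set(self, expr: str) -> str:
        if "=" in expr:                      # set key=value -> update
            key, value = expr.split("=", 1)
            self._props[key] = value
            return ""
        # set key -> print the current value
        return f"{expr}={self._props.get(expr, 'undefined')}"

# --hiveconf hive.cli.print.header=true seeds the session config.
conf = HiveConf(**{"hive.cli.print.header": "true"})
print(conf.set("hive.cli.print.header"))     # hive.cli.print.header=true
conf.set("hive.cli.print.header=false")
print(conf.set("hive.cli.print.header"))     # hive.cli.print.header=false
```

Note that such changes affect only the current session; the cluster-wide defaults in hive-site.xml are untouched.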

5.3 system Variables

Code:

set system:user.name;
set system:user.name=7369;
set system:user.name;
select * from test.emp where empno=${system:user.name};

Test run:

[root@hp1 ~]# hive
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/hive-common-2.1.1-cdh6.3.1.jar!/hive-log4j2.properties Async: false

WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> set system:user.name;
system:user.name=root
hive> set system:user.name=7369;
hive> set system:user.name;
system:user.name=7369
hive> select * from test.emp where empno=${system:user.name};
Query ID = 7369_20201127180045_9f902d3a-070d-4e12-8641-38063197eafc
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
20/11/27 18:00:47 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1606444720955_0010, Tracking URL = http://hp3:8088/proxy/application_1606444720955_0010/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606444720955_0010
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-11-27 18:00:55,955 Stage-1 map = 0%,  reduce = 0%
2020-11-27 18:01:02,281 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 3.11 sec
2020-11-27 18:01:03,313 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.29 sec
MapReduce Total cumulative CPU time: 6 seconds 290 msec
Ended Job = job_1606444720955_0010
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2   Cumulative CPU: 6.29 sec   HDFS Read: 11789 HDFS Write: 231 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 290 msec
OK
7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
Time taken: 19.033 seconds, Fetched: 1 row(s)
hive> 

5.4 env Variables

Code:

export empno=7369
hive
set env:empno;
select * from test.emp where empno=${env:empno};

Test run:

[root@hp1 ~]# export empno=7369
[root@hp1 ~]# hive
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/hive-common-2.1.1-cdh6.3.1.jar!/hive-log4j2.properties Async: false

WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> set env:empno;
env:empno=7369
hive> select * from test.emp where empno=${env:empno};
Query ID = root_20201127180306_e2344df7-49f4-4746-9c60-6f62f81d36bb
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
20/11/27 18:03:08 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1606444720955_0011, Tracking URL = http://hp3:8088/proxy/application_1606444720955_0011/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606444720955_0011
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-11-27 18:03:16,404 Stage-1 map = 0%,  reduce = 0%
2020-11-27 18:03:22,675 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.41 sec
MapReduce Total cumulative CPU time: 6 seconds 410 msec
Ended Job = job_1606444720955_0011
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2   Cumulative CPU: 6.41 sec   HDFS Read: 11789 HDFS Write: 231 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 410 msec
OK
7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
Time taken: 17.48 seconds, Fetched: 1 row(s)
hive> 
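In the two examples above, the `system:` namespace is backed by the JVM's system properties (writable, as the `set system:user.name=7369` trick shows) while the `env:` namespace is a read-only view of the process environment. A Python analog that resolves `${env:NAME}` references from the process environment (a simplified sketch, not Hive's code):

```python
import os
import re

def resolve_env(sql: str) -> str:
    """Expand ${env:NAME} references from the process environment,
    analogous to Hive's read-only env: namespace."""
    return re.sub(
        r"\$\{env:([^}]+)\}",
        lambda m: os.environ[m.group(1)],
        sql,
    )

os.environ["empno"] = "7369"   # stands in for `export empno=7369`
print(resolve_env("select * from test.emp where empno=${env:empno}"))
# select * from test.emp where empno=7369
```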

六. Hive System Architecture

Hive architecture diagram:
(figure omitted: Hive system architecture)

6.1 Hive Interfaces

There are three main user interfaces: the CLI, ThriftServer, and HWI.
The CLI is the most commonly used; starting the CLI also starts a Hive Driver.
ThriftServer exposes Hive as a service over the Thrift protocol, enabling cross-language access (e.g. Python, C++) and implementing the JDBC/ODBC protocols.
HWI (Hive Web Interface) provides browser-based access to Hive.

6.2 Metadata Storage

Hive stores its metadata in a relational database such as MySQL, Oracle, or Derby.
Hive metadata includes table names, columns, partitions and their properties, table attributes (e.g. whether a table is external), and the directory where each table's data resides.

6.3 Driver

  1. Compiler
    Performs lexical and syntactic analysis, parsing the HQL query into an AST
    Generates a logical execution plan from the AST
    Generates a physical MapReduce execution plan from the logical plan
  2. Optimizer
    Optimizes the logical execution plan
    Optimizes the physical execution plan
  3. Executor
    Converts the physical execution plan into MapReduce jobs
    Submits them to Hadoop for execution
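The compile → optimize → execute flow above can be caricatured as a pipeline of stages. The sketch below is purely illustrative (Hive's real driver is far more involved); the stage names simply mirror the list above:

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    """Toy query plan recording which driver stages it has passed."""
    hql: str
    stages: list = field(default_factory=list)

def compile_query(hql: str) -> Plan:
    # Compiler: HQL -> AST -> logical plan -> physical MR plan
    plan = Plan(hql)
    plan.stages += ["AST", "logical plan", "physical MR plan"]
    return plan

def optimize(plan: Plan) -> Plan:
    # Optimizer: rewrite the plans in place
    plan.stages.append("optimized")
    return plan

def execute(plan: Plan) -> str:
    # Executor: turn the physical plan into MR jobs and submit them
    plan.stages.append("submitted as MR job")
    return " -> ".join(plan.stages)

print(execute(optimize(compile_query("select count(*) from test.emp"))))
# AST -> logical plan -> physical MR plan -> optimized -> submitted as MR job
```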