Note:
Hive version 2.1.1
Preface:
A view is a table derived from one or more base tables. A view is not stored independently in the database; it is a virtual table.
That is, the database stores only the view's definition, not the data the view exposes; that data remains in the base tables from which the view is derived.
Conceptually a view is equivalent to a base table: users can use a view just as they would a base table, and can define further views on top of a view.
The advantages of introducing views include:
(1) Views simplify user operations.
(2) Views let users look at the same data from multiple angles.
(3) Views provide a degree of logical independence when the database is restructured.
1. Introduction to Hive Views
CREATE VIEW creates a view with the given name. An error is thrown if a table or view with the same name already exists; you can use IF NOT EXISTS to skip the error.
If no column names are supplied, the view's column names are derived automatically from the defining SELECT expression. (If the SELECT contains unaliased scalar expressions such as x+y, the resulting view column names are generated in the form _C0, _C1, etc.) When renaming columns, column comments can also be supplied optionally. (Comments are not automatically inherited from the underlying columns.)
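A quick illustration of the auto-generated names (v_sum and v_sum2 are hypothetical views over the emp table used later in this article; in practice Hive emits the generated name in lowercase, e.g. _c0):

create view v_sum as select sal + comm from emp;    -- unaliased expression
describe v_sum;                                     -- one column, auto-named _c0
create view v_sum2 (total COMMENT 'sal plus comm')
as select sal + comm from emp;                      -- explicit column name and comment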
If the view's defining SELECT expression is invalid, the CREATE VIEW statement fails.
Note that a view is a purely logical object with no associated storage. When a query references a view, the view's definition is evaluated to produce a set of rows for further processing by the query. (This is a conceptual description; in fact, as part of query optimization, Hive may combine the view's definition with the query's definition, e.g. pushing filters from the query down into the view.)
A view's schema is frozen at the time the view is created; subsequent changes to the underlying tables (such as adding a column) are not reflected in the view's schema. If an underlying table is dropped or changed in an incompatible way, subsequent attempts to query the now-invalid view fail.
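A minimal sketch of the schema-freeze behavior (the table and view names here are hypothetical):

create table t (a int, b string);
create view v_t as select * from t;   -- view schema frozen as (a, b)
alter table t add columns (c int);
select * from v_t;                    -- still returns only a and b
drop table t;
select * from v_t;                    -- fails: the underlying table is gone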
Views are read-only and may not be used as the target of LOAD/INSERT/ALTER. To change metadata, see ALTER VIEW.
A view may contain ORDER BY and LIMIT clauses. If a referencing query also contains these clauses, the query-level clauses are evaluated after the view's clauses (and after any other operations in the query). For example, if a view specifies LIMIT 5 and the referencing query is executed as (select * from v LIMIT 10), then at most 5 rows will be returned.
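Sketching that LIMIT interaction against the emp table used in the examples below (v_top is a hypothetical view):

create view v_top as select * from emp order by sal desc limit 5;
select * from v_top limit 10;   -- at most 5 rows: the view's LIMIT applies first
select * from v_top limit 3;    -- 3 rows: the query's LIMIT restricts further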
Starting with Hive 0.13.0, a view's SELECT statement can include one or more common table expressions (CTEs), as in the SELECT syntax — for example, a CTE inside a CREATE VIEW statement.
Syntax:
Create a view:
CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT column_comment], ...) ]
[COMMENT view_comment]
[TBLPROPERTIES (property_name = property_value, ...)]
AS SELECT ...;
Drop a view:
DROP VIEW [IF EXISTS] [db_name.]view_name;
Alter a view:
ALTER VIEW [db_name.]view_name SET TBLPROPERTIES table_properties;
table_properties:
: (property_name = property_value, property_name = property_value, ...)
ALTER VIEW [db_name.]view_name AS select_statement;
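Usage sketches for the two ALTER VIEW forms above (v_emp_sal is the view created in section 2.1; the property name and the replacement SELECT are illustrative):

alter view v_emp_sal set tblproperties ('created_by' = 'etl_job');
alter view v_emp_sal as
select deptno, max(sal) sal_dept_max
  from emp
 group by deptno;   -- replaces the view's defining SELECT in place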
2. View Examples
2.1 Views simplify user operations
Requirement: from the employee table, query each department's highest and lowest salary, together with the company-wide highest and lowest salary.
with tmp1 as
(
select d1.dname,
max(e1.sal) sal_dept_max,
min(e1.sal) sal_dept_min
from emp e1
left join dept d1
on e1.deptno = d1.deptno
group by d1.dname
),
tmp2 as
(
select max(e1.sal) sal_com_max,
min(e1.sal) sal_com_min
from emp e1
)
select dname, -- department name
sal_dept_max, -- department max salary
sal_dept_min, -- department min salary
tmp2.sal_com_max, -- company max salary
tmp2.sal_com_min -- company min salary
from tmp1
left join tmp2 -- Hive does not support scalar subqueries in the SELECT list, so cross-join the one-row tmp2
on 1 = 1;
The complex SQL above meets the requirement.
But some time later, if the business needs this data again, I would have to spend time writing it all over again.
If instead I save the complex SQL as a view, next time I can simply query the view:
hive> create view v_emp_sal as
> with tmp1 as
> (
> select d1.dname,
> max(e1.sal) sal_dept_max,
> min(e1.sal) sal_dept_min
> from emp e1
> left join dept d1
> on e1.deptno = d1.deptno
> group by d1.dname
> ),
> tmp2 as
> (
> select max(e1.sal) sal_com_max,
> min(e1.sal) sal_com_min
> from emp e1
> )
> select dname,
> sal_dept_max,
> sal_dept_min,
> tmp2.sal_com_max,
> tmp2.sal_com_min
> from tmp1
> left join tmp2
> on 1 = 1;
OK
dname sal_dept_max sal_dept_min sal_com_max sal_com_min
Time taken: 0.668 seconds
hive> select * from v_emp_sal;
Warning: Map Join MAPJOIN[29][bigTable=?] in task 'Stage-6:MAPRED' is a cross product
Warning: Shuffle Join JOIN[19][tables = [tmp1, tmp2]] in Stage 'Stage-3:MAPRED' is a cross product
Query ID = root_20201211154416_f01650cf-89c4-4ba0-ab12-40d9af14af50
Total jobs = 4
Launching Job 1 out of 4
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1606698967173_0190, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0190/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606698967173_0190
Hadoop job information for Stage-5: number of mappers: 2; number of reducers: 1
2020-12-11 15:44:25,832 Stage-5 map = 0%, reduce = 0%
2020-12-11 15:44:32,066 Stage-5 map = 100%, reduce = 0%, Cumulative CPU 4.32 sec
2020-12-11 15:44:37,230 Stage-5 map = 100%, reduce = 100%, Cumulative CPU 6.3 sec
MapReduce Total cumulative CPU time: 6 seconds 300 msec
Ended Job = job_1606698967173_0190
2020-12-11 15:44:44 Starting to launch local task to process map join; maximum memory = 1908932608
2020-12-11 15:44:45 Dump the side-table for tag: 1 with group count: 4 into file: file:/tmp/root/bf2c5eff-9176-432f-a60a-2c605eae2e4e/hive_2020-12-11_15-44-16_584_6489641943512683012-1/-local-10008/HashTable-Stage-2/MapJoin-mapfile11--.hashtable
Execution completed successfully
MapredLocal task succeeded
Launching Job 2 out of 4
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1606698967173_0191, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0191/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606698967173_0191
Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1
2020-12-11 15:44:53,969 Stage-2 map = 0%, reduce = 0%
2020-12-11 15:45:00,161 Stage-2 map = 50%, reduce = 0%, Cumulative CPU 2.23 sec
2020-12-11 15:45:01,192 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 4.44 sec
2020-12-11 15:45:06,357 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 6.41 sec
MapReduce Total cumulative CPU time: 6 seconds 410 msec
Ended Job = job_1606698967173_0191
Stage-9 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
2020-12-11 15:45:13 Starting to launch local task to process map join; maximum memory = 1908932608
Execution completed successfully
MapredLocal task succeeded
Launching Job 4 out of 4
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0192, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0192/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606698967173_0192
Hadoop job information for Stage-6: number of mappers: 1; number of reducers: 0
2020-12-11 15:45:22,982 Stage-6 map = 0%, reduce = 0%
2020-12-11 15:45:29,182 Stage-6 map = 100%, reduce = 0%, Cumulative CPU 1.87 sec
MapReduce Total cumulative CPU time: 1 seconds 870 msec
Ended Job = job_1606698967173_0192
MapReduce Jobs Launched:
Stage-Stage-5: Map: 2 Reduce: 1 Cumulative CPU: 6.3 sec HDFS Read: 16256 HDFS Write: 121 HDFS EC Read: 0 SUCCESS
Stage-Stage-2: Map: 2 Reduce: 1 Cumulative CPU: 6.41 sec HDFS Read: 20875 HDFS Write: 197 HDFS EC Read: 0 SUCCESS
Stage-Stage-6: Map: 1 Cumulative CPU: 1.87 sec HDFS Read: 5245 HDFS Write: 240 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 14 seconds 580 msec
OK
v_emp_sal.dname v_emp_sal.sal_dept_max v_emp_sal.sal_dept_min v_emp_sal.sal_com_max v_emp_sal.sal_com_min
ACCOUNTING 5000.00 1300.00 5000.00 800.00
RESEARCH 3000.00 800.00 5000.00 800.00
SALES 2850.00 950.00 5000.00 800.00
Time taken: 73.707 seconds, Fetched: 3 row(s)
hive>
2.2 Views let users look at the same data from multiple angles
The employee table stores each employee's salary as a raw amount. Suppose management instead wants to see:
sal < 1000: low salary
1000 <= sal < 2000: medium salary
sal >= 2000: high salary
We certainly cannot modify the underlying data, so we add a layer of logic on top.
Here we can create a view that redefines the emp sal column according to the rules above. (The view below returns the Chinese labels '低工资', '中等工资', '高工资' for low, medium, and high salary respectively.)
hive>
> create view v2 as
> select e.empno,e.deptno,e.hiredate, case when e.sal < 1000 then '低工资'
> when e.sal >= 1000 and e.sal < 2000 then '中等工资'
> when e.sal >= 2000 then '高工资'
> else '' end as sal
> from emp e;
OK
empno deptno hiredate sal
Time taken: 0.107 seconds
hive> select * from v2;
OK
v2.empno v2.deptno v2.hiredate v2.sal
7369 20 1980-12-17 低工资
7499 30 1981-02-20 中等工资
7876 20 1987-06-13 中等工资
7900 30 1981-12-03 低工资
7902 20 1981-12-03 高工资
7934 10 1982-01-23 中等工资
7521 30 1981-02-22 中等工资
7566 20 1981-04-02 高工资
7654 30 1981-09-28 中等工资
7698 30 1981-05-01 高工资
7782 10 1981-06-09 高工资
7788 20 1987-06-13 高工资
7839 10 1981-11-17 高工资
7844 30 1981-09-08 中等工资
Time taken: 0.157 seconds, Fetched: 14 row(s)
hive>
2.3 Views provide a degree of logical independence during database restructuring
Continuing the requirement from 2.2: suppose the database was restructured so that salaries, previously stored in dollars, are now stored in cents, and the table data was migrated accordingly.
The reports presented to management all still use the view v2, so we only need to redefine v2; there is no need to modify the reports one by one.
drop view v2 ;
create table emp_bak as select * from emp;
alter table emp change sal sal decimal(10,2);
truncate table emp;
insert into emp select empno,ename,job,mgr,hiredate,sal*100 sal,comm,deptno from emp_bak;
create view v2 as
select e.empno,e.deptno,e.hiredate, case when e.sal < 100000 then '低工资'
when e.sal >= 100000 and e.sal < 200000 then '中等工资'
when e.sal >= 200000 then '高工资'
else '' end as sal
from emp e;
hive> select * from v2;
OK
v2.empno v2.deptno v2.hiredate v2.sal
7876 20 1987-06-13 中等工资
7900 30 1981-12-03 低工资
7521 30 1981-02-22 中等工资
7566 20 1981-04-02 高工资
7654 30 1981-09-28 中等工资
7698 30 1981-05-01 高工资
7782 10 1981-06-09 高工资
7788 20 1987-06-13 高工资
7839 10 1981-11-17 高工资
7844 30 1981-09-08 中等工资
7369 20 1980-12-17 低工资
7499 30 1981-02-20 中等工资
7902 20 1981-12-03 高工资
7934 10 1982-01-23 中等工资
Time taken: 0.077 seconds, Fetched: 14 row(s)