第11章 Hive:SQL on Hadoop
11.7 HQL:数据查询
本节操作使用默认default数据库。
hive> use default;
OK
Time taken: 0.035 seconds
hive>
11.7.1 limit 子句
hive> select * from emp limit 5;
OK
7782 CLARK MANAGER 7839 1981-06-09 2450.0 0.0 10
7839 KING PRESIDENT 0 1981-11-17 5000.0 0.0 10
7934 MILLER CLERK 7782 1982-01-23 1300.0 0.0 10
7369 SMITH CLERK 7902 1980-12-17 800.0 0.0 20
7566 JONES MANAGER 7839 1981-04-02 2975.0 0.0 20
Time taken: 2.18 seconds, Fetched: 5 row(s)
hive>
hive> select * from emp limit 1,3;
OK
7839 KING PRESIDENT 0 1981-11-17 5000.0 0.0 10
7934 MILLER CLERK 7782 1982-01-23 1300.0 0.0 10
7369 SMITH CLERK 7902 1980-12-17 800.0 0.0 20
Time taken: 0.303 seconds, Fetched: 3 row(s)
hive>
11.7.2 case when then 语句
hive> select eid,ename,
> case
> when did=10 then 'd10'
> when did=20 then 'd20'
> when did=30 then 'd30'
> end
> from emp;
OK
7782 CLARK d10
7839 KING d10
7934 MILLER d10
7369 SMITH d20
7566 JONES d20
7902 FORD d20
7499 ALLEN d30
7521 WARD d30
7654 MARTIN d30
7698 BLAKE d30
7844 TURNER d30
7900 JAMES d30
Time taken: 0.469 seconds, Fetched: 12 row(s)
hive>
11.7.3 where 子句
可以用or
和and
连接多个表达式。
操作符 | 数据类型 | 说明 |
---|---|---|
A=B | 基本类型 | A等于B返回true,否则返回false |
A<>B ,A!=B | 基本类型 | A或B为null,返回null; A和B都不空时,A不等于B返回true,否则返回false |
A<B | 基本类型 | A或B为null,返回null; A小于B返回true,否则返回false |
A<=B | 基本类型 | A或B为null,返回null; A小于等于B返回true,否则返回false |
A>B | 基本类型 | A或B为null,返回null; A大于B返回true,否则返回false |
A>=B | 基本类型 | A或B为null,返回null; A大于等于B返回true,否则返回false |
A Between B And C | 基本类型 | 筛选 A 的值处于 B 和 C 之间 |
A not between B and C | 基本类型 | 筛选 A 的值不处于 B 和 C 之间 |
A is null | 所有类型 | 如果A等于null,返回true,否则返回false。注意,不能使用A=null |
A is not null | 所有类型 | 筛选 A 值不是null的 |
A like B | string | B是一个SQL正则表达式,% 表示一个或者多个字符,_ 表示一个字符 |
A not like B | string | 与正则表达式B相反结果 |
A rlike B | string | 正则匹配 |
hive> select * from emp
> where did=30 and comm<>0.0;
OK
7499 ALLEN SALESMAN 7698 1981-02-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-02-22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981-09-28 1250.0 1400.0 30
Time taken: 0.667 seconds, Fetched: 3 row(s)
hive>
hive> select * from emp
> where hiredate between '1981-01-01' and '1981-12-30';
OK
7782 CLARK MANAGER 7839 1981-06-09 2450.0 0.0 10
7839 KING PRESIDENT 0 1981-11-17 5000.0 0.0 10
7566 JONES MANAGER 7839 1981-04-02 2975.0 0.0 20
7902 FORD ANALYST 7566 1981-12-03 3000.0 0.0 20
7499 ALLEN SALESMAN 7698 1981-02-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-02-22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981-09-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-05-01 2850.0 0.0 30
7844 TURNER SALESMAN 7698 1981-09-08 1500.0 0.0 30
7900 JAMES CLERK 7698 1981-12-03 950.0 0.0 30
Time taken: 0.311 seconds, Fetched: 10 row(s)
hive>
11.7.4 group by 子句
【例】查询每个部门的人数
hive> select count(*) from emp
> group by did;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170814101839_fe4cccbd-62c4-48cb-986b-a70f93c19535
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1502717288397_0003, Tracking URL = http://node1:8088/proxy/application_1502717288397_0003/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job -kill job_1502717288397_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-08-14 10:19:10,549 Stage-1 map = 0%, reduce = 0%
2017-08-14 10:19:27,526 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.86 sec
2017-08-14 10:19:42,984 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.84 sec
MapReduce Total cumulative CPU time: 6 seconds 840 msec
Ended Job = job_1502717288397_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.84 sec HDFS Read: 9667 HDFS Write: 129 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 840 msec
OK
3
3
6
Time taken: 66.588 seconds, Fetched: 3 row(s)
hive>
与Oracle数据库一样,如果使用group by子句,那么查询的字段如果没有出现在group by子句的后面,则必须使用聚合函数。
hive> select ename,avg(sal) from emp group by did;
FAILED: SemanticException [Error 10025]: Line 1:7 Expression not in GROUP BY key 'ename'
hive>
如果想对分组结果进行过滤,可以使用having子句
hive> select avg(sal) from emp
> group by did
> having avg(sal)>2000;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170814103035_2106518a-aab4-4cdd-9b39-7b08fb1a44b6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1502717288397_0007, Tracking URL = http://node1:8088/proxy/application_1502717288397_0007/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job -kill job_1502717288397_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-08-14 10:31:08,601 Stage-1 map = 0%, reduce = 0%
2017-08-14 10:31:25,888 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.88 sec
2017-08-14 10:31:43,122 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8.34 sec
MapReduce Total cumulative CPU time: 8 seconds 340 msec
Ended Job = job_1502717288397_0007
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 8.34 sec HDFS Read: 10807 HDFS Write: 149 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 340 msec
OK
2916.6666666666665
2258.3333333333335
Time taken: 70.202 seconds, Fetched: 2 row(s)
hive>
11.7.5 内连接
(1)准备阶段
部门数据
[root@node3 data]# vi dept
[root@node3 data]# cat dept
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON
[root@node3 data]#
创建部门表dept,并导入数据
hive> create table if not exists dept(
> did int,
> dname string,
> dloc string)
> row format delimited fields terminated by ',';
OK
Time taken: 0.185 seconds
hive> load data local inpath '/root/data/dept' into table dept;
Loading data to table default.dept
OK
Time taken: 0.705 seconds
hive> select * from dept;
OK
10 ACCOUNTING NEW YORK
20 RESEARCH DALLAS
30 SALES CHICAGO
40 OPERATIONS BOSTON
Time taken: 0.155 seconds, Fetched: 4 row(s)
hive>
(2)等值连接
连接的两个表中,只有同时满足连接条件的记录才会放入结果表中。执行查询命令select eid,ename,dname from emp,dept where emp.did=dept.did;
hive> select eid,ename,dname
> from emp,dept
> where emp.did=dept.did;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170815101902_18fb066b-11f2-438d-8a92-9d4490b80ec2
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-08-15 10:19:13 Starting to launch local task to process map join; maximum memory = 518979584
2017-08-15 10:19:17 Dump the side-table for tag: 1 with group count: 4 into file: file:/tmp/root/6e48493b-cca6-4657-a459-7ff36c6137b1/hive_2017-08-15_10-19-02_412_1947825427115470836-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile11--.hashtable
2017-08-15 10:19:17 Uploaded 1 File to: file:/tmp/root/6e48493b-cca6-4657-a459-7ff36c6137b1/hive_2017-08-15_10-19-02_412_1947825427115470836-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile11--.hashtable (373 bytes)
2017-08-15 10:19:17 End of local task; Time Taken: 3.36 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1502805552881_0002, Tracking URL = http://node1:8088/proxy/application_1502805552881_0002/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job -kill job_1502805552881_0002
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2017-08-15 10:19:38,347 Stage-3 map = 0%, reduce = 0%
2017-08-15 10:19:52,530 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 2.69 sec
MapReduce Total cumulative CPU time: 2 seconds 690 msec
Ended Job = job_1502805552881_0002
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1 Cumulative CPU: 2.69 sec HDFS Read: 7746 HDFS Write: 459 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 690 msec
OK
7782 CLARK ACCOUNTING
7839 KING ACCOUNTING
7934 MILLER ACCOUNTING
7369 SMITH RESEARCH
7566 JONES RESEARCH
7902 FORD RESEARCH
7499 ALLEN SALES
7521 WARD SALES
7654 MARTIN SALES
7698 BLAKE SALES
7844 TURNER SALES
7900 JAMES SALES
Time taken: 52.488 seconds, Fetched: 12 row(s)
hive>
内连接另一种形式:select e.eid,e.ename,d.dname from emp e join dept d on e.did=d.did;
其中,e和d分别是表emp和dept的别名。
hive> select e.eid,e.ename,d.dname
> from emp e join dept d
> on e.did=d.did;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170815102454_0ca5cc0f-1802-4b4f-8156-d055482971b7
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-08-15 10:25:06 Starting to launch local task to process map join; maximum memory = 518979584
2017-08-15 10:25:10 Dump the side-table for tag: 1 with group count: 4 into file: file:/tmp/root/6e48493b-cca6-4657-a459-7ff36c6137b1/hive_2017-08-15_10-24-54_978_8533839352082414207-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile21--.hashtable
2017-08-15 10:25:10 Uploaded 1 File to: file:/tmp/root/6e48493b-cca6-4657-a459-7ff36c6137b1/hive_2017-08-15_10-24-54_978_8533839352082414207-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile21--.hashtable (373 bytes)
2017-08-15 10:25:10 End of local task; Time Taken: 3.864 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1502805552881_0003, Tracking URL = http://node1:8088/proxy/application_1502805552881_0003/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job -kill job_1502805552881_0003
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2017-08-15 10:25:27,803 Stage-3 map = 0%, reduce = 0%
2017-08-15 10:25:38,323 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 2.24 sec
MapReduce Total cumulative CPU time: 2 seconds 240 msec
Ended Job = job_1502805552881_0003
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1 Cumulative CPU: 2.24 sec HDFS Read: 7746 HDFS Write: 459 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 240 msec
OK
7782 CLARK ACCOUNTING
7839 KING ACCOUNTING
7934 MILLER ACCOUNTING
7369 SMITH RESEARCH
7566 JONES RESEARCH
7902 FORD RESEARCH
7499 ALLEN SALES
7521 WARD SALES
7654 MARTIN SALES
7698 BLAKE SALES
7844 TURNER SALES
7900 JAMES SALES
Time taken: 44.558 seconds, Fetched: 12 row(s)
hive>
11.7.6 外连接
内连接只显示了俩张表都存在的数据,而外连接则是显示出所有的数据 。外连接又分为左外连接、右外连接以及全外连接,可以简称左连接、右连接和全连接。其中左连接是左边的表输出完整数据。右连接是右边的表输出完整数据,全连接就是两张表的所有行全部输出。
(1)补充数据
为了测试外连接,需要在雇员表中添加一个没有部门的新员工信息
hive> insert into emp(eid,ename,hiredate) values(8888,'HADRON','2016-08-31');
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170816091805_3f28cfe1-278a-486e-9ed4-34ee791f5322
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1502887685471_0005, Tracking URL = http://node1:8088/proxy/application_1502887685471_0005/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job -kill job_1502887685471_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-08-16 09:18:20,468 Stage-1 map = 0%, reduce = 0%
2017-08-16 09:18:32,863 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.48 sec
MapReduce Total cumulative CPU time: 2 seconds 480 msec
Ended Job = job_1502887685471_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://cetc/user/hive/warehouse/emp/.hive-staging_hive_2017-08-16_09-18-05_560_4497324549163312419-1/-ext-10000
Loading data to table default.emp
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 2.48 sec HDFS Read: 4861 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 480 msec
OK
Time taken: 30.102 seconds
hive>
(2)左连接
hive> select e.eid,e.ename,d.dname from emp e left join dept d on e.did=d.did;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170816091950_1483c055-0461-4293-9387-3221eb15b448
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-08-16 09:20:01 Starting to launch local task to process map join; maximum memory = 518979584
2017-08-16 09:20:04 Dump the side-table for tag: 1 with group count: 4 into file: file:/tmp/root/50703eac-08fc-4214-9bd0-488ad09f51e2/hive_2017-08-16_09-19-50_831_2141891134326428737-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile21--.hashtable
2017-08-16 09:20:04 Uploaded 1 File to: file:/tmp/root/50703eac-08fc-4214-9bd0-488ad09f51e2/hive_2017-08-16_09-19-50_831_2141891134326428737-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile21--.hashtable (373 bytes)
2017-08-16 09:20:04 End of local task; Time Taken: 2.606 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1502887685471_0006, Tracking URL = http://node1:8088/proxy/application_1502887685471_0006/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job -kill job_1502887685471_0006
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2017-08-16 09:20:20,596 Stage-3 map = 0%, reduce = 0%
2017-08-16 09:20:30,466 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.82 sec
MapReduce Total cumulative CPU time: 1 seconds 820 msec
Ended Job = job_1502887685471_0006
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1 Cumulative CPU: 1.82 sec HDFS Read: 7158 HDFS Write: 486 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 820 msec
OK
7782 CLARK ACCOUNTING
7839 KING ACCOUNTING
7934 MILLER ACCOUNTING
7369 SMITH RESEARCH
7566 JONES RESEARCH
7902 FORD RESEARCH
7499 ALLEN SALES
7521 WARD SALES
7654 MARTIN SALES
7698 BLAKE SALES
7844 TURNER SALES
7900 JAMES SALES
8888 HADRON NULL
Time taken: 41.877 seconds, Fetched: 13 row(s)
hive>
(3)右连接
hive> select e.eid,e.ename,d.dname from emp e right join dept d on e.did=d.did;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170816092117_f97eea90-4039-45f9-8097-900c457f50f3
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-08-16 09:21:28 Starting to launch local task to process map join; maximum memory = 518979584
2017-08-16 09:21:31 Dump the side-table for tag: 0 with group count: 4 into file: file:/tmp/root/50703eac-08fc-4214-9bd0-488ad09f51e2/hive_2017-08-16_09-21-17_901_4210650042244478897-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile30--.hashtable
2017-08-16 09:21:31 Uploaded 1 File to: file:/tmp/root/50703eac-08fc-4214-9bd0-488ad09f51e2/hive_2017-08-16_09-21-17_901_4210650042244478897-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile30--.hashtable (498 bytes)
2017-08-16 09:21:31 End of local task; Time Taken: 2.599 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1502887685471_0007, Tracking URL = http://node1:8088/proxy/application_1502887685471_0007/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job -kill job_1502887685471_0007
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2017-08-16 09:21:48,072 Stage-3 map = 0%, reduce = 0%
2017-08-16 09:21:58,404 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.95 sec
MapReduce Total cumulative CPU time: 1 seconds 950 msec
Ended Job = job_1502887685471_0007
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1 Cumulative CPU: 1.95 sec HDFS Read: 6169 HDFS Write: 488 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 950 msec
OK
7782 CLARK ACCOUNTING
7839 KING ACCOUNTING
7934 MILLER ACCOUNTING
7369 SMITH RESEARCH
7566 JONES RESEARCH
7902 FORD RESEARCH
7499 ALLEN SALES
7521 WARD SALES
7654 MARTIN SALES
7698 BLAKE SALES
7844 TURNER SALES
7900 JAMES SALES
NULL NULL OPERATIONS
Time taken: 42.768 seconds, Fetched: 13 row(s)
hive>
(4)全连接
hive> select e.eid,e.ename,d.dname from emp e full join dept d on e.did=d.did;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170816092230_3b3f51e8-2753-41a2-9f41-af98770847fa
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1502887685471_0008, Tracking URL = http://node1:8088/proxy/application_1502887685471_0008/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job -kill job_1502887685471_0008
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2017-08-16 09:22:45,280 Stage-1 map = 0%, reduce = 0%
2017-08-16 09:23:06,469 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.28 sec
2017-08-16 09:23:16,137 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8.52 sec
MapReduce Total cumulative CPU time: 8 seconds 520 msec
Ended Job = job_1502887685471_0008
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 8.52 sec HDFS Read: 16627 HDFS Write: 515 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 520 msec
OK
8888 HADRON NULL
7782 CLARK ACCOUNTING
7934 MILLER ACCOUNTING
7839 KING ACCOUNTING
7902 FORD RESEARCH
7566 JONES RESEARCH
7369 SMITH RESEARCH
7654 MARTIN SALES
7521 WARD SALES
7900 JAMES SALES
7499 ALLEN SALES
7844 TURNER SALES
7698 BLAKE SALES
NULL NULL OPERATIONS
Time taken: 47.921 seconds, Fetched: 14 row(s)
hive>
(5)左半连接
hive> select e.eid,e.ename from emp e
> left semi join dept d
> on e.did=d.did;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170816093310_97649892-9cb8-4b26-954a-1a301976184d
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-08-16 09:33:22 Starting to launch local task to process map join; maximum memory = 518979584
2017-08-16 09:33:25 Dump the side-table for tag: 1 with group count: 4 into file: file:/tmp/root/50703eac-08fc-4214-9bd0-488ad09f51e2/hive_2017-08-16_09-33-10_356_3322488615725018370-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile51--.hashtable
2017-08-16 09:33:25 Uploaded 1 File to: file:/tmp/root/50703eac-08fc-4214-9bd0-488ad09f51e2/hive_2017-08-16_09-33-10_356_3322488615725018370-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile51--.hashtable (332 bytes)
2017-08-16 09:33:25 End of local task; Time Taken: 2.642 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1502887685471_0010, Tracking URL = http://node1:8088/proxy/application_1502887685471_0010/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job -kill job_1502887685471_0010
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2017-08-16 09:33:43,441 Stage-3 map = 0%, reduce = 0%
2017-08-16 09:33:55,350 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 2.93 sec
MapReduce Total cumulative CPU time: 2 seconds 930 msec
Ended Job = job_1502887685471_0010
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1 Cumulative CPU: 2.93 sec HDFS Read: 7152 HDFS Write: 363 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 930 msec
OK
7782 CLARK
7839 KING
7934 MILLER
7369 SMITH
7566 JONES
7902 FORD
7499 ALLEN
7521 WARD
7654 MARTIN
7698 BLAKE
7844 TURNER
7900 JAMES
Time taken: 46.179 seconds, Fetched: 12 row(s)
hive>
左半连接和内连接的结果一样,只不过使用IN查询。
hive> select e.eid,e.ename from emp e
> where e.did in(select d.did from dept d);
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170816092716_b479bf9e-fd03-48a1-be72-fb64d90f2efc
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-08-16 09:27:27 Starting to launch local task to process map join; maximum memory = 518979584
2017-08-16 09:27:30 Dump the side-table for tag: 1 with group count: 4 into file: file:/tmp/root/50703eac-08fc-4214-9bd0-488ad09f51e2/hive_2017-08-16_09-27-16_222_8300160340562225012-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile41--.hashtable
2017-08-16 09:27:30 Uploaded 1 File to: file:/tmp/root/50703eac-08fc-4214-9bd0-488ad09f51e2/hive_2017-08-16_09-27-16_222_8300160340562225012-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile41--.hashtable (332 bytes)
2017-08-16 09:27:30 End of local task; Time Taken: 2.987 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1502887685471_0009, Tracking URL = http://node1:8088/proxy/application_1502887685471_0009/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job -kill job_1502887685471_0009
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2017-08-16 09:27:46,861 Stage-3 map = 0%, reduce = 0%
2017-08-16 09:27:59,352 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 2.91 sec
MapReduce Total cumulative CPU time: 2 seconds 910 msec
Ended Job = job_1502887685471_0009
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1 Cumulative CPU: 2.91 sec HDFS Read: 7152 HDFS Write: 363 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 910 msec
OK
7782 CLARK
7839 KING
7934 MILLER
7369 SMITH
7566 JONES
7902 FORD
7499 ALLEN
7521 WARD
7654 MARTIN
7698 BLAKE
7844 TURNER
7900 JAMES
Time taken: 44.323 seconds, Fetched: 12 row(s)
hive>
注意,左半连接通常比内连接要高效,因为对于左表的一条 指定的记录,在右表一旦找到匹配的记录,Hive就会停止扫描。
(4)
第11章 Hive:SQL on Hadoop
11.7 HQL:数据查询
本节操作使用默认default数据库。
hive> use default;
OK
Time taken: 0.035 seconds
hive>
- 1
- 2
- 3
- 4
11.7.1 limit 子句
hive> select * from emp limit 5;
OK
7782 CLARK MANAGER 7839 1981-06-09 2450.0 0.0 10
7839 KING PRESIDENT 0 1981-11-17 5000.0 0.0 10
7934 MILLER CLERK 7782 1982-01-23 1300.0 0.0 10
7369 SMITH CLERK 7902 1980-12-17 800.0 0.0 20
7566 JONES MANAGER 7839 1981-04-02 2975.0 0.0 20
Time taken: 2.18 seconds, Fetched: 5 row(s)
hive>
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
hive> select * from emp limit 1,3;
OK
7839 KING PRESIDENT 0 1981-11-17 5000.0 0.0 10
7934 MILLER CLERK 7782 1982-01-23 1300.0 0.0 10
7369 SMITH CLERK 7902 1980-12-17 800.0 0.0 20
Time taken: 0.303 seconds, Fetched: 3 row(s)
hive>
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
11.7.2 case when then 语句
hive> select eid,ename,
> case
> when did=10 then 'd10'
> when did=20 then 'd20'
> when did=30 then 'd30'
> end
> from emp;
OK
7782 CLARK d10
7839 KING d10
7934 MILLER d10
7369 SMITH d20
7566 JONES d20
7902 FORD d20
7499 ALLEN d30
7521 WARD d30
7654 MARTIN d30
7698 BLAKE d30
7844 TURNER d30
7900 JAMES d30
Time taken: 0.469 seconds, Fetched: 12 row(s)
hive>
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
- 21
- 22
- 23
11.7.3 where 子句
可以用or
和and
连接多个表达式。
操作符 | 数据类型 | 说明 |
---|---|---|
A=B | 基本类型 | A等于B返回true,否则返回false |
A<>B ,A!=B | 基本类型 | A或B为null,返回null; A和B都不空时,A不等于B返回true,否则返回false |
A<B | 基本类型 | A或B为null,返回null; A小于B返回true,否则返回false |
A<=B | 基本类型 | A或B为null,返回null; A小于等于B返回true,否则返回false |
A>B | 基本类型 | A或B为null,返回null; A大于B返回true,否则返回false |
A>=B | 基本类型 | A或B为null,返回null; A大于等于B返回true,否则返回false |
A Between B And C | 基本类型 | 筛选 A 的值处于 B 和 C 之间 |
A not between B and C | 基本类型 | 筛选 A 的值不处于 B 和 C 之间 |
A is null | 所有类型 | 如果A等于null,返回true,否则返回false。注意,不能使用A=null |
A is not null | 所有类型 | 筛选 A 值不是null的 |
A like B | string | B是一个SQL正则表达式,% 表示一个或者多个字符,_ 表示一个字符 |
A not like B | string | 与正则表达式B相反结果 |
A rlike B | string | 正则匹配 |
hive> select * from emp
> where did=30 and comm<>0.0;
OK
7499 ALLEN SALESMAN 7698 1981-02-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-02-22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981-09-28 1250.0 1400.0 30
Time taken: 0.667 seconds, Fetched: 3 row(s)
hive>
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
hive> select * from emp
> where hiredate between '1981-01-01' and '1981-12-30';
OK
7782 CLARK MANAGER 7839 1981-06-09 2450.0 0.0 10
7839 KING PRESIDENT 0 1981-11-17 5000.0 0.0 10
7566 JONES MANAGER 7839 1981-04-02 2975.0 0.0 20
7902 FORD ANALYST 7566 1981-12-03 3000.0 0.0 20
7499 ALLEN SALESMAN 7698 1981-02-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-02-22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981-09-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-05-01 2850.0 0.0 30
7844 TURNER SALESMAN 7698 1981-09-08 1500.0 0.0 30
7900 JAMES CLERK 7698 1981-12-03 950.0 0.0 30
Time taken: 0.311 seconds, Fetched: 10 row(s)
hive>
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
11.7.4 group by 子句
【例】查询每个部门的人数
hive> select count(*) from emp
> group by did;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170814101839_fe4cccbd-62c4-48cb-986b-a70f93c19535
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1502717288397_0003, Tracking URL = http://node1:8088/proxy/application_1502717288397_0003/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job -kill job_1502717288397_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-08-14 10:19:10,549 Stage-1 map = 0%, reduce = 0%
2017-08-14 10:19:27,526 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.86 sec
2017-08-14 10:19:42,984 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.84 sec
MapReduce Total cumulative CPU time: 6 seconds 840 msec
Ended Job = job_1502717288397_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.84 sec HDFS Read: 9667 HDFS Write: 129 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 840 msec
OK
3
3
6
Time taken: 66.588 seconds, Fetched: 3 row(s)
hive>
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
- 21
- 22
- 23
- 24
- 25
- 26
- 27
- 28
- 29
- 30
- 31
与Oracle数据库一样,如果使用group by子句,那么查询的字段如果没有出现在group by子句的后面,则必须使用聚合函数。
hive> select ename,avg(sal) from emp group by did;
FAILED: SemanticException [Error 10025]: Line 1:7 Expression not in GROUP BY key 'ename'
hive>
- 1
- 2
- 3
如果想对分组结果进行过滤,可以使用having子句
hive> select avg(sal) from emp
> group by did
> having avg(sal)>2000;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170814103035_2106518a-aab4-4cdd-9b39-7b08fb1a44b6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1502717288397_0007, Tracking URL = http://node1:8088/proxy/application_1502717288397_0007/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job -kill job_1502717288397_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-08-14 10:31:08,601 Stage-1 map = 0%, reduce = 0%
2017-08-14 10:31:25,888 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.88 sec
2017-08-14 10:31:43,122 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8.34 sec
MapReduce Total cumulative CPU time: 8 seconds 340 msec
Ended Job = job_1502717288397_0007
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 8.34 sec HDFS Read: 10807 HDFS Write: 149 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 340 msec
OK
2916.6666666666665
2258.3333333333335
Time taken: 70.202 seconds, Fetched: 2 row(s)
hive>
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
- 21
- 22
- 23
- 24
- 25
- 26
- 27
- 28
- 29
- 30
11.7.5 内连接
(1)准备阶段
部门数据
[root@node3 data]# vi dept
[root@node3 data]# cat dept
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON
[root@node3 data]#
- 1
- 2
- 3
- 4
- 5
- 6
- 7
创建部门表dept,并导入数据
hive> create table if not exists dept(
> did int,
> dname string,
> dloc string)
> row format delimited fields terminated by ',';
OK
Time taken: 0.185 seconds
hive> load data local inpath '/root/data/dept' into table dept;
Loading data to table default.dept
OK
Time taken: 0.705 seconds
hive> select * from dept;
OK
10 ACCOUNTING NEW YORK
20 RESEARCH DALLAS
30 SALES CHICAGO
40 OPERATIONS BOSTON
Time taken: 0.155 seconds, Fetched: 4 row(s)
hive>
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
(2)等值连接
连接的两个表中,只有同时满足连接条件的记录才会放入结果表中。执行查询命令select eid,ename,dname from emp,dept where emp.did=dept.did;
hive> select eid,ename,dname
> from emp,dept
> where emp.did=dept.did;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170815101902_18fb066b-11f2-438d-8a92-9d4490b80ec2
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-08-15 10:19:13 Starting to launch local task to process map join; maximum memory = 518979584
2017-08-15 10:19:17 Dump the side-table for tag: 1 with group count: 4 into file: file:/tmp/root/6e48493b-cca6-4657-a459-7ff36c6137b1/hive_2017-08-15_10-19-02_412_1947825427115470836-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile11--.hashtable
2017-08-15 10:19:17 Uploaded 1 File to: file:/tmp/root/6e48493b-cca6-4657-a459-7ff36c6137b1/hive_2017-08-15_10-19-02_412_1947825427115470836-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile11--.hashtable (373 bytes)
2017-08-15 10:19:17 End of local task; Time Taken: 3.36 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1502805552881_0002, Tracking URL = http://node1:8088/proxy/application_1502805552881_0002/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job -kill job_1502805552881_0002
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2017-08-15 10:19:38,347 Stage-3 map = 0%, reduce = 0%
2017-08-15 10:19:52,530 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 2.69 sec
MapReduce Total cumulative CPU time: 2 seconds 690 msec
Ended Job = job_1502805552881_0002
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1 Cumulative CPU: 2.69 sec HDFS Read: 7746 HDFS Write: 459 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 690 msec
OK
7782 CLARK ACCOUNTING
7839 KING ACCOUNTING
7934 MILLER ACCOUNTING
7369 SMITH RESEARCH
7566 JONES RESEARCH
7902 FORD RESEARCH
7499 ALLEN SALES
7521 WARD SALES
7654 MARTIN SALES
7698 BLAKE SALES
7844 TURNER SALES
7900 JAMES SALES
Time taken: 52.488 seconds, Fetched: 12 row(s)
hive>
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
- 21
- 22
- 23
- 24
- 25
- 26
- 27
- 28
- 29
- 30
- 31
- 32
- 33
- 34
- 35
- 36
- 37
- 38
- 39
- 40
- 41
- 42
- 43
- 44
- 45
内连接另一种形式:select e.eid,e.ename,d.dname from emp e join dept d on e.did=d.did;
其中,e和d分别是表emp和dept的别名。
hive> select e.eid,e.ename,d.dname
> from emp e join dept d
> on e.did=d.did;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170815102454_0ca5cc0f-1802-4b4f-8156-d055482971b7
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-08-15 10:25:06 Starting to launch local task to process map join; maximum memory = 518979584
2017-08-15 10:25:10 Dump the side-table for tag: 1 with group count: 4 into file: file:/tmp/root/6e48493b-cca6-4657-a459-7ff36c6137b1/hive_2017-08-15_10-24-54_978_8533839352082414207-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile21--.hashtable
2017-08-15 10:25:10 Uploaded 1 File to: file:/tmp/root/6e48493b-cca6-4657-a459-7ff36c6137b1/hive_2017-08-15_10-24-54_978_8533839352082414207-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile21--.hashtable (373 bytes)
2017-08-15 10:25:10 End of local task; Time Taken: 3.864 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1502805552881_0003, Tracking URL = http://node1:8088/proxy/application_1502805552881_0003/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job -kill job_1502805552881_0003
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2017-08-15 10:25:27,803 Stage-3 map = 0%, reduce = 0%
2017-08-15 10:25:38,323 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 2.24 sec
MapReduce Total cumulative CPU time: 2 seconds 240 msec
Ended Job = job_1502805552881_0003
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1 Cumulative CPU: 2.24 sec HDFS Read: 7746 HDFS Write: 459 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 240 msec
OK
7782 CLARK ACCOUNTING
7839 KING ACCOUNTING
7934 MILLER ACCOUNTING
7369 SMITH RESEARCH
7566 JONES RESEARCH
7902 FORD RESEARCH
7499 ALLEN SALES
7521 WARD SALES
7654 MARTIN SALES
7698 BLAKE SALES
7844 TURNER SALES
7900 JAMES SALES
Time taken: 44.558 seconds, Fetched: 12 row(s)
hive>
11.7.6 外连接
内连接只显示了俩张表都存在的数据,而外连接则是显示出所有的数据 。外连接又分为左外连接、右外连接以及全外连接,可以简称左连接、右连接和全连接。其中左连接是左边的表输出完整数据。右连接是右边的表输出完整数据,全连接就是两张表的所有行全部输出。
(1)补充数据
为了测试外连接,需要在雇员表中添加一个没有部门的新员工信息
hive> insert into emp(eid,ename,hiredate) values(8888,'HADRON','2016-08-31');
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170816091805_3f28cfe1-278a-486e-9ed4-34ee791f5322
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1502887685471_0005, Tracking URL = http://node1:8088/proxy/application_1502887685471_0005/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job -kill job_1502887685471_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-08-16 09:18:20,468 Stage-1 map = 0%, reduce = 0%
2017-08-16 09:18:32,863 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.48 sec
MapReduce Total cumulative CPU time: 2 seconds 480 msec
Ended Job = job_1502887685471_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://cetc/user/hive/warehouse/emp/.hive-staging_hive_2017-08-16_09-18-05_560_4497324549163312419-1/-ext-10000
Loading data to table default.emp
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 2.48 sec HDFS Read: 4861 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 480 msec
OK
Time taken: 30.102 seconds
hive>
(2)左连接
hive> select e.eid,e.ename,d.dname from emp e left join dept d on e.did=d.did;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170816091950_1483c055-0461-4293-9387-3221eb15b448
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-08-16 09:20:01 Starting to launch local task to process map join; maximum memory = 518979584
2017-08-16 09:20:04 Dump the side-table for tag: 1 with group count: 4 into file: file:/tmp/root/50703eac-08fc-4214-9bd0-488ad09f51e2/hive_2017-08-16_09-19-50_831_2141891134326428737-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile21--.hashtable
2017-08-16 09:20:04 Uploaded 1 File to: file:/tmp/root/50703eac-08fc-4214-9bd0-488ad09f51e2/hive_2017-08-16_09-19-50_831_2141891134326428737-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile21--.hashtable (373 bytes)
2017-08-16 09:20:04 End of local task; Time Taken: 2.606 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1502887685471_0006, Tracking URL = http://node1:8088/proxy/application_1502887685471_0006/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job -kill job_1502887685471_0006
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2017-08-16 09:20:20,596 Stage-3 map = 0%, reduce = 0%
2017-08-16 09:20:30,466 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.82 sec
MapReduce Total cumulative CPU time: 1 seconds 820 msec
Ended Job = job_1502887685471_0006
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1 Cumulative CPU: 1.82 sec HDFS Read: 7158 HDFS Write: 486 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 820 msec
OK
7782 CLARK ACCOUNTING
7839 KING ACCOUNTING
7934 MILLER ACCOUNTING
7369 SMITH RESEARCH
7566 JONES RESEARCH
7902 FORD RESEARCH
7499 ALLEN SALES
7521 WARD SALES
7654 MARTIN SALES
7698 BLAKE SALES
7844 TURNER SALES
7900 JAMES SALES
8888 HADRON NULL
Time taken: 41.877 seconds, Fetched: 13 row(s)
hive>
(3)右连接
hive> select e.eid,e.ename,d.dname from emp e right join dept d on e.did=d.did;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170816092117_f97eea90-4039-45f9-8097-900c457f50f3
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-08-16 09:21:28 Starting to launch local task to process map join; maximum memory = 518979584
2017-08-16 09:21:31 Dump the side-table for tag: 0 with group count: 4 into file: file:/tmp/root/50703eac-08fc-4214-9bd0-488ad09f51e2/hive_2017-08-16_09-21-17_901_4210650042244478897-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile30--.hashtable
2017-08-16 09:21:31 Uploaded 1 File to: file:/tmp/root/50703eac-08fc-4214-9bd0-488ad09f51e2/hive_2017-08-16_09-21-17_901_4210650042244478897-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile30--.hashtable (498 bytes)
2017-08-16 09:21:31 End of local task; Time Taken: 2.599 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1502887685471_0007, Tracking URL = http://node1:8088/proxy/application_1502887685471_0007/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job -kill job_1502887685471_0007
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2017-08-16 09:21:48,072 Stage-3 map = 0%, reduce = 0%
2017-08-16 09:21:58,404 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.95 sec
MapReduce Total cumulative CPU time: 1 seconds 950 msec
Ended Job = job_1502887685471_0007
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1 Cumulative CPU: 1.95 sec HDFS Read: 6169 HDFS Write: 488 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 950 msec
OK
7782 CLARK ACCOUNTING
7839 KING ACCOUNTING
7934 MILLER ACCOUNTING
7369 SMITH RESEARCH
7566 JONES RESEARCH
7902 FORD RESEARCH
7499 ALLEN SALES
7521 WARD SALES
7654 MARTIN SALES
7698 BLAKE SALES
7844 TURNER SALES
7900 JAMES SALES
NULL NULL OPERATIONS
Time taken: 42.768 seconds, Fetched: 13 row(s)
hive>
(4)全连接
hive> select e.eid,e.ename,d.dname from emp e full join dept d on e.did=d.did;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170816092230_3b3f51e8-2753-41a2-9f41-af98770847fa
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1502887685471_0008, Tracking URL = http://node1:8088/proxy/application_1502887685471_0008/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job -kill job_1502887685471_0008
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2017-08-16 09:22:45,280 Stage-1 map = 0%, reduce = 0%
2017-08-16 09:23:06,469 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.28 sec
2017-08-16 09:23:16,137 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8.52 sec
MapReduce Total cumulative CPU time: 8 seconds 520 msec
Ended Job = job_1502887685471_0008
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 8.52 sec HDFS Read: 16627 HDFS Write: 515 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 520 msec
OK
8888 HADRON NULL
7782 CLARK ACCOUNTING
7934 MILLER ACCOUNTING
7839 KING ACCOUNTING
7902 FORD RESEARCH
7566 JONES RESEARCH
7369 SMITH RESEARCH
7654 MARTIN SALES
7521 WARD SALES
7900 JAMES SALES
7499 ALLEN SALES
7844 TURNER SALES
7698 BLAKE SALES
NULL NULL OPERATIONS
Time taken: 47.921 seconds, Fetched: 14 row(s)
hive>
(5)左半连接
hive> select e.eid,e.ename from emp e
> left semi join dept d
> on e.did=d.did;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170816093310_97649892-9cb8-4b26-954a-1a301976184d
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-08-16 09:33:22 Starting to launch local task to process map join; maximum memory = 518979584
2017-08-16 09:33:25 Dump the side-table for tag: 1 with group count: 4 into file: file:/tmp/root/50703eac-08fc-4214-9bd0-488ad09f51e2/hive_2017-08-16_09-33-10_356_3322488615725018370-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile51--.hashtable
2017-08-16 09:33:25 Uploaded 1 File to: file:/tmp/root/50703eac-08fc-4214-9bd0-488ad09f51e2/hive_2017-08-16_09-33-10_356_3322488615725018370-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile51--.hashtable (332 bytes)
2017-08-16 09:33:25 End of local task; Time Taken: 2.642 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1502887685471_0010, Tracking URL = http://node1:8088/proxy/application_1502887685471_0010/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job -kill job_1502887685471_0010
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2017-08-16 09:33:43,441 Stage-3 map = 0%, reduce = 0%
2017-08-16 09:33:55,350 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 2.93 sec
MapReduce Total cumulative CPU time: 2 seconds 930 msec
Ended Job = job_1502887685471_0010
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1 Cumulative CPU: 2.93 sec HDFS Read: 7152 HDFS Write: 363 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 930 msec
OK
7782 CLARK
7839 KING
7934 MILLER
7369 SMITH
7566 JONES
7902 FORD
7499 ALLEN
7521 WARD
7654 MARTIN
7698 BLAKE
7844 TURNER
7900 JAMES
Time taken: 46.179 seconds, Fetched: 12 row(s)
hive>
左半连接和内连接的结果一样,只不过使用IN查询。
hive> select e.eid,e.ename from emp e
> where e.did in(select d.did from dept d);
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170816092716_b479bf9e-fd03-48a1-be72-fb64d90f2efc
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-08-16 09:27:27 Starting to launch local task to process map join; maximum memory = 518979584
2017-08-16 09:27:30 Dump the side-table for tag: 1 with group count: 4 into file: file:/tmp/root/50703eac-08fc-4214-9bd0-488ad09f51e2/hive_2017-08-16_09-27-16_222_8300160340562225012-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile41--.hashtable
2017-08-16 09:27:30 Uploaded 1 File to: file:/tmp/root/50703eac-08fc-4214-9bd0-488ad09f51e2/hive_2017-08-16_09-27-16_222_8300160340562225012-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile41--.hashtable (332 bytes)
2017-08-16 09:27:30 End of local task; Time Taken: 2.987 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1502887685471_0009, Tracking URL = http://node1:8088/proxy/application_1502887685471_0009/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job -kill job_1502887685471_0009
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2017-08-16 09:27:46,861 Stage-3 map = 0%, reduce = 0%
2017-08-16 09:27:59,352 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 2.91 sec
MapReduce Total cumulative CPU time: 2 seconds 910 msec
Ended Job = job_1502887685471_0009
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1 Cumulative CPU: 2.91 sec HDFS Read: 7152 HDFS Write: 363 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 910 msec
OK
7782 CLARK
7839 KING
7934 MILLER
7369 SMITH
7566 JONES
7902 FORD
7499 ALLEN
7521 WARD
7654 MARTIN
7698 BLAKE
7844 TURNER
7900 JAMES
Time taken: 44.323 seconds, Fetched: 12 row(s)
hive>
注意,左半连接通常比内连接要高效,因为对于左表的一条 指定的记录,在右表一旦找到匹配的记录,Hive就会停止扫描。