Data Warehouse Hive: Queries (Part 2)

IV. Queries

4. Join Statements

4.1 Equi-Join

Hive supports the usual SQL JOIN syntax, but only equi-joins: the join condition must use equality, and non-equi conditions are not supported.
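For example, a non-equi condition such as the following is rejected by Hive's semantic analyzer (a hypothetical illustration of the limitation, reusing the tables from the query below; it is not from the original session):

hive> select st.tdate, us.ttdate from stu1 st join user_interest_partition us on st.tdate <> us.ttdate;

Only equality conditions such as on st.tdate = us.ttdate are accepted.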

hive> select us.ttdate,st.tdate,st.interest,us.interest from stu1 st join user_interest_partition us on st.tdate = us.ttdate;
Query ID = centos01_20190224110959_d83372d5-9f0c-469f-8ac4-5e93bc61dbc9
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/centos01/modules/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/centos01/modules/hbase-1.3.1/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2019-02-24 11:10:16,045 WARN  [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Execution log at: /tmp/centos01/centos01_20190224110959_d83372d5-9f0c-469f-8ac4-5e93bc61dbc9.log
2019-02-24 11:10:18     Starting to launch local task to process map join;      maximum memory = 477626368
2019-02-24 11:10:22     Dump the side-table for tag: 0 with group count: 427 into file: file:/tmp/centos01/b3f41e7c-804e-498a-aa3c-3c12a24a1983/hive_2019-02-24_11-09-59_247_5078018496957470896-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile00--.hashtable
2019-02-24 11:10:23     Uploaded 1 File to: file:/tmp/centos01/b3f41e7c-804e-498a-aa3c-3c12a24a1983/hive_2019-02-24_11-09-59_247_5078018496957470896-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile00--.hashtable (13573 bytes)
2019-02-24 11:10:23     End of local task; Time Taken: 4.564 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-02-24 11:10:27,208 Stage-3 map = 100%,  reduce = 0%
Ended Job = job_local1212854639_0007
MapReduce Jobs Launched: 
Stage-Stage-3:  HDFS Read: 131145 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 27.964 seconds

4.2 Table Aliases

Using aliases simplifies queries.
Prefixing column names with the table alias makes every column reference unambiguous and can improve efficiency.
See the code in 4.1 (st and us are the aliases) and the contrast sketch below.
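For contrast, a minimal sketch of the 4.1 join written without aliases; the full table names then have to be repeated for every column reference:

hive> select user_interest_partition.ttdate, stu1.tdate, stu1.interest, user_interest_partition.interest from stu1 join user_interest_partition on stu1.tdate = user_interest_partition.ttdate;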

4.3 Inner Join

Only rows for which both joined tables contain data matching the join condition are kept.
See the code in 4.1.

4.4 Left Outer Join

A left outer join returns every row from the left table: rows that satisfy the join condition are combined with the matching right-side columns, and left-side rows with no match on the right have the right table's columns filled with NULL.

hive> select us.ttdate,st.tdate,st.interest,us.interest from stu1 st left join user_interest_partition us on st.tdate = us.ttdate;
Query ID = centos01_20190224114227_ecc19278-a081-4615-bf0b-57701d22d406
Total jobs = 2
Stage-5 is selected by condition resolver.
Stage-1 is filtered out by condition resolver.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/centos01/modules/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/centos01/modules/hbase-1.3.1/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2019-02-24 11:42:38,936 WARN  [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Execution log at: /tmp/centos01/centos01_20190224114227_ecc19278-a081-4615-bf0b-57701d22d406.log
2019-02-24 11:42:43     Starting to launch local task to process map join;      maximum memory = 477626368
2019-02-24 11:42:43     Dump the side-table for tag: 1 with group count: 0 into file: file:/tmp/centos01/b3f41e7c-804e-498a-aa3c-3c12a24a1983/hive_2019-02-24_11-42-27_841_580350348549979545-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile11--.hashtable
2019-02-24 11:42:43     Uploaded 1 File to: file:/tmp/centos01/b3f41e7c-804e-498a-aa3c-3c12a24a1983/hive_2019-02-24_11-42-27_841_580350348549979545-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile11--.hashtable (260 bytes)
2019-02-24 11:42:43     End of local task; Time Taken: 0.51 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 2 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-02-24 11:42:46,051 Stage-3 map = 100%,  reduce = 0%
Ended Job = job_local1939198263_0008
MapReduce Jobs Launched: 
Stage-Stage-3:  HDFS Read: 281836 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK


4.5 Right Outer Join

A right outer join is the mirror image: every row from the right table is returned, and right-side rows with no match on the left have the left table's columns filled with NULL.

hive> select us.ttdate,st.tdate,st.interest,us.interest from stu1 st right join user_interest_partition us on st.tdate = us.ttdate;
Query ID = centos01_20190224114409_5b26ed57-9139-415a-8440-4170e7cd707f
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/centos01/modules/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/centos01/modules/hbase-1.3.1/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2019-02-24 11:44:18,946 WARN  [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Execution log at: /tmp/centos01/centos01_20190224114409_5b26ed57-9139-415a-8440-4170e7cd707f.log
2019-02-24 11:44:20     Starting to launch local task to process map join;      maximum memory = 477626368
2019-02-24 11:44:23     Dump the side-table for tag: 0 with group count: 428 into file: file:/tmp/centos01/b3f41e7c-804e-498a-aa3c-3c12a24a1983/hive_2019-02-24_11-44-09_026_6422025048488395719-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile30--.hashtable
2019-02-24 11:44:23     Uploaded 1 File to: file:/tmp/centos01/b3f41e7c-804e-498a-aa3c-3c12a24a1983/hive_2019-02-24_11-44-09_026_6422025048488395719-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile30--.hashtable (13591 bytes)
2019-02-24 11:44:23     End of local task; Time Taken: 2.516 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-02-24 11:44:26,002 Stage-3 map = 100%,  reduce = 0%
Ended Job = job_local522367838_0010
MapReduce Jobs Launched: 
Stage-Stage-3:  HDFS Read: 140918 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK

4.6 Full Outer Join

A full outer join returns all records from both tables; wherever one table has no row matching the other's join key, the missing side's columns are filled with NULL.
In effect it combines the left and right outer joins: a row is output if it exists in either table, and only the columns with no match become NULL.

hive> select us.ttdate,st.tdate,st.interest,us.interest from stu1 st full join user_interest_partition us on st.tdate = us.ttdate;
Query ID = centos01_20190224120019_bdcfadd9-63bc-4fee-a1c1-cf75e897dffa
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-02-24 12:00:21,039 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local98710081_0011
MapReduce Jobs Launched: 
Stage-Stage-1:  HDFS Read: 602764 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK

5. Sorting

5.1 Global Sort (ORDER BY)

ORDER BY sorts the result set globally, so Hive runs the query with a single reducer (note "Number of reduce tasks determined at compile time: 1" in the log below).
ASC: ascending order (the default)
DESC: descending order

hive> select * from stu1 order by interest desc; 
Query ID = centos01_20190224133449_a3eae884-742f-4ef2-b926-0d819793625c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-02-24 13:34:51,129 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local360184143_0013
MapReduce Jobs Launched: 
Stage-Stage-1:  HDFS Read: 510711 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK

5.2 Sorting by a Column Alias

hive> select interest* 2 doubleInterest from stu1 order by doubleInterest desc;
Query ID = centos01_20190224133843_d7ecf983-dbeb-46d3-9a9d-f044b1ea4efd
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-02-24 13:38:45,270 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local1152362567_0014
MapReduce Jobs Launched: 
Stage-Stage-1:  HDFS Read: 540030 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK

5.3 Sorting by Multiple Columns

Simply list additional columns after ORDER BY, separated by commas. Keep in mind that ASC/DESC binds to the individual column it follows, so in the query below only tdate is sorted descending while interest stays ascending (see the note after the log).

hive> select * from stu1 order by interest,tdate desc;
Query ID = centos01_20190224134600_fcbe357f-37dd-45c8-a246-a6e5d6f1b0b5
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-02-24 13:46:01,625 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local755569569_0016
MapReduce Jobs Launched: 
Stage-Stage-1:  HDFS Read: 598668 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
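As noted above, DESC binds to the column it directly follows. To sort both columns in descending order, each needs its own DESC (a sketch, not from the original session):

hive> select * from stu1 order by interest desc, tdate desc;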

5.4 Per-Reducer Sort (SORT BY)

SORT BY sorts within each reducer; from the perspective of the full result set, the output is not globally sorted.
Before setting the reducer count, first check hive.exec.reducers.max in hive-site.xml (in some installs it is hive-default.xml), which is the maximum number of reducers allowed to run; the default maximum is 999. Tuning it takes some trial and error:
If it is set too small for a large data volume, each reducer becomes extremely slow, the job may never finish, and the node may even crash.
If it is set too large, too many small files are produced, merging them is costly, and the NameNode's memory footprint grows.
(1) Set the number of reducers (the property is mapreduce.job.reduces, as the reducer hints in the logs above show):

hive> set mapreduce.job.reduces = 3;

(2) Check the reducer setting:

hive> set mapreduce.job.reduces;
mapreduce.job.reduces=3

(3) Query stu1, sorting by interest in descending order within each reducer:

hive> select * from stu1 sort by interest desc;

(4) Write the query results to a local file (the path given to INSERT OVERWRITE LOCAL DIRECTORY is treated as a directory despite the .csv suffix; Hive writes its result files inside it):

hive> insert overwrite local directory '/home/centos01/modules/apache-hive-1.2.2-bin/iotmp/outprint/test4.csv' select * from stu1 sort by interest desc;
Query ID = centos01_20190224144005_8b73c0c6-8ae5-4d2f-9e88-cf835ea0bb70
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-02-24 14:40:06,721 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local965640329_0019
Copying data to local directory /home/centos01/modules/apache-hive-1.2.2-bin/iotmp/outprint/test4.csv
Copying data to local directory /home/centos01/modules/apache-hive-1.2.2-bin/iotmp/outprint/test4.csv
MapReduce Jobs Launched: 
Stage-Stage-1:  HDFS Read: 686625 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 1.384 seconds

5.5 Partitioned Sort (DISTRIBUTE BY)

DISTRIBUTE BY is similar to the partitioner in MapReduce: it controls which reducer each row is sent to, and it is used together with SORT BY.
Note that Hive requires DISTRIBUTE BY to be written before SORT BY.

hive> select * from stu1 distribute by tdate sort by interest desc;
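DISTRIBUTE BY only has a visible effect when more than one reducer runs; with a single reducer all rows land in the same partition anyway. A sketch (the reducer count of 3 is an arbitrary choice for illustration):

hive> set mapreduce.job.reduces = 3;
hive> select * from stu1 distribute by tdate sort by interest desc;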

5.6 CLUSTER BY

When the DISTRIBUTE BY and SORT BY columns are the same, CLUSTER BY can be used instead. It combines the functionality of DISTRIBUTE BY and SORT BY, but the sort order is always ascending; you cannot specify ASC or DESC.
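For example, the following two queries are equivalent (standard CLUSTER BY semantics; this pair is an illustration, not from the original session):

hive> select * from stu1 cluster by tdate;
hive> select * from stu1 distribute by tdate sort by tdate;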

6. Bucketing and Sampling Queries

6.1 Bucketed Table Data Storage

Partitioning provides a convenient way to isolate data and optimize queries, but not every dataset can be divided into sensible partitions.
Bucketing is a finer-grained way to organize data so that it is easier to manage.

6.1.1 Create a bucketed table first, then load a file directly

Create the bucketed table; this is equivalent to splitting the table's data file into n parts (one per bucket):

hive> create table stu_stuck(name string,age int) clustered by(age) into 4 buckets row format delimited fields terminated by ',';
OK
Time taken: 0.602 seconds

Inspect the table structure (for example with desc formatted stu_stuck;) and you will find this line:

Num Buckets:            4 

Load the data:

hive> load data local inpath '/home/centos01/modules/apache-hive-1.2.2-bin/examples/files/x.txt' into table stu_stuck;
Loading data to table default.stu_stuck
Table default.stu_stuck stats: [numFiles=1, totalSize=80]
OK
Time taken: 1.169 seconds

Finally, check the bucketing result in the table's warehouse directory: the data was not split into 4 buckets. The reason is the loading method: LOAD DATA moves the file as-is without running MapReduce, so the data must instead be loaded through a subquery, as shown in 6.1.2.

6.1.2 Create the bucketed table, then load the data via a subquery

First create an ordinary table, new_stu_stuck:

hive> create table new_stu_stuck(name string,age int);
OK
Time taken: 0.09 seconds

Then load the data into new_stu_stuck the ordinary way:

hive> load data local inpath '/home/centos01/modules/apache-hive-1.2.2-bin/examples/files/x.txt' into table new_stu_stuck;
Loading data to table default.new_stu_stuck
Table default.new_stu_stuck stats: [numFiles=1, totalSize=80]
OK
Time taken: 0.7 seconds

Then empty the original stu_stuck:

hive > truncate table stu_stuck;

Then load the data into stu_stuck via a subquery:

hive> insert into table stu_stuck select name,age from new_stu_stuck;
Query ID = centos01_20190224205710_8e4cafc0-8932-4c7a-a213-163f1ad25e33
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-02-24 20:57:13,046 Stage-1 map = 100%,  reduce = 0%
Ended Job = job_local542417088_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://linux01:8020/user/hive/warehouse/stu_stuck/.hive-staging_hive_2019-02-24_20-57-10_630_1818080774033854070-1/-ext-10000
Loading data to table default.stu_stuck
Table default.stu_stuck stats: [numFiles=1, numRows=11, totalSize=113, rawDataSize=102]
MapReduce Jobs Launched: 
Stage-Stage-1:  HDFS Read: 160 HDFS Write: 267 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 2.754 seconds

The table directory still contains only one bucket file.
Now set two properties: hive.enforce.bucketing = true tells Hive to enforce bucketing on insert, and mapreduce.job.reduces = -1 lets Hive pick the reducer count itself (it will match the table's bucket count):

hive> set hive.enforce.bucketing = true;
hive> set mapreduce.job.reduces = -1;

Then truncate and insert again (a sketch of the re-run follows), and the bucketing completes: the table directory now holds 4 files, one per bucket.
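The re-run, reusing the commands from earlier in this section:

hive> truncate table stu_stuck;
hive> insert into table stu_stuck select name,age from new_stu_stuck;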

6.2 Bucket Sampling Queries

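Bucketed tables can be sampled with the TABLESAMPLE(BUCKET x OUT OF y ON col) clause: Hive divides the table's buckets into groups of y and returns bucket x from each group, so sampling 1 bucket out of 4 on a 4-bucket table returns roughly a quarter of the rows. A sketch against the stu_stuck table built above (the bucket numbers are an arbitrary illustration):

hive> select * from stu_stuck tablesample(bucket 1 out of 4 on age);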

6.3 Block Sampling

Block sampling extracts data by percentage of the table's data size rather than by an exact row count; the following takes (at least) the first 0.1% of the data:

hive> select * from stu_stuck tablesample(0.1 percent);
OK
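Besides a percentage, Hive's block sampling also accepts a fixed row count or a fixed data size (sketches, not from the original session):

hive> select * from stu_stuck tablesample(5 rows);
hive> select * from stu_stuck tablesample(10M);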