III. DML Operations
1. Importing Data
1.1 Loading Data into a Table with Load
Syntax:
[1] load data: load data
[2] local: load from the local filesystem; omit it to load from HDFS
[3] inpath: the path to load from
[4] overwrite: overwrite the existing data
[5] into table: write into the table
[6] shortTable: the target table name
[7] partition: the target partition
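Putting the keywords above together, the general form looks like this (bracketed parts are optional; the path and names are placeholders):

```sql
load data [local] inpath '/path/to/datafile'
[overwrite] into table table_name
[partition (partcol1='val1', partcol2='val2')];
```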
hive> load data local inpath '/home/centos01/modules/apache-hive-1.2.2-bin/examples/files/x.txt' overwrite into table shortTable partition(year='2010',month='04');
Loading data to table default.shorttable partition (year=2010, month=04)
Partition default.shorttable{year=2010, month=04} stats: [numFiles=1, numRows=0, totalSize=13, rawDataSize=0]
OK
Time taken: 0.616 seconds
1.2 Inserting Data into a Table via a Query Statement
First, create a partitioned table:
hive> create table stu1(tdate bigint,interest double,seven_interest double) partitioned by (year string) row format delimited fields terminated by ',';
OK
Time taken: 0.137 seconds
A regular insert of values:
hive> insert into table stu1 partition(year = '2010') values (2010,1.025,9.999);
Query ID = centos01_20190223155025_f28003cd-2808-47a9-82a0-a97a760ee1a0
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-02-23 15:50:28,030 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local120146978_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://linux01:8020/user/hive/warehouse/stu1/year=2010/.hive-staging_hive_2019-02-23_15-50-25_921_6506809753036083134-1/-ext-10000
Loading data to table default.stu1 partition (year=2010)
Partition default.stu1{year=2010} stats: [numFiles=1, numRows=1, totalSize=17, rawDataSize=16]
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 9946 HDFS Write: 49077 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 2.357 seconds
Now insert using a basic-mode insert…select statement:
hive> insert overwrite table stu1 partition(year='2010') select tdate,interest,seven_interest from stu1 where year='2010';
Query ID = centos01_20190223160342_25f36789-e5b1-4b25-9fc2-9d970d51b106
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-02-23 16:03:44,018 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local487572732_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://linux01:8020/user/hive/warehouse/stu1/year=2010/.hive-staging_hive_2019-02-23_16-03-42_190_3152532333837736856-1/-ext-10000
Loading data to table default.stu1 partition (year=2010)
Partition default.stu1{year=2010} stats: [numFiles=1, numRows=1, totalSize=17, rawDataSize=16]
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 10058 HDFS Write: 49172 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 2.198 seconds
Now overwrite by selecting from year='2011'; that source partition is empty, so the stats below show the target partition being overwritten with 0 rows:
hive> insert overwrite table stu1 partition(year='2010') select tdate,interest,seven_interest from stu1 where year='2011';
Query ID = centos01_20190223162020_17a24fcf-1db8-4945-9373-560448678eba
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-02-23 16:20:21,712 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local1592452816_0010
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://linux01:8020/user/hive/warehouse/stu1/year=2010/.hive-staging_hive_2019-02-23_16-20-20_378_2894671489011544195-1/-ext-10000
Loading data to table default.stu1 partition (year=2010)
Partition default.stu1{year=2010} stats: [numFiles=1, numRows=0, totalSize=0, rawDataSize=0]
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 59332 HDFS Write: 88626 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 2.034 seconds
hive> desc stu1;
OK
tdate bigint
interest double
seven_interest double
year string
# Partition Information
# col_name data_type comment
year string
Time taken: 0.03 seconds, Fetched: 9 row(s)
2. Exporting Data
2.1 Exporting with Insert
2.1.1 Exporting query results to the local filesystem
hive> insert overwrite local directory '/home/centos01/modules/apache-hive-1.2.2-bin/iotmp/outprint' select * from stu1;
Query ID = centos01_20190223165232_50f55171-8490-4b3c-92f9-fa327e038867
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-02-23 16:52:33,561 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local643035792_0012
Copying data to local directory /home/centos01/modules/apache-hive-1.2.2-bin/iotmp/outprint
Copying data to local directory /home/centos01/modules/apache-hive-1.2.2-bin/iotmp/outprint
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 157852 HDFS Write: 196798 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 1.308 seconds
Exporting this way can leave the CSV file garbled, because Hive writes its default ^A (Ctrl-A) field delimiter; hence 2.1.2.
2.1.2 Exporting query results to the local filesystem with formatting
hive> insert overwrite local directory '/home/centos01/modules/apache-hive-1.2.2-bin/iotmp/outprint' row format delimited fields terminated by ',' select * from stu1;
Query ID = centos01_20190223165555_eab0d523-271f-41ce-ba3d-66cce2b40bf6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-02-23 16:55:57,185 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local1800033420_0013
Copying data to local directory /home/centos01/modules/apache-hive-1.2.2-bin/iotmp/outprint
Copying data to local directory /home/centos01/modules/apache-hive-1.2.2-bin/iotmp/outprint
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 177398 HDFS Write: 196798 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 1.312 seconds
2.1.3 Exporting formatted query results to HDFS
hive> insert overwrite directory 'user/hive/warehouse/outprint' row format delimited fields terminated by ',' select * from stu1;
Query ID = centos01_20190223170255_f1b5f343-cf17-47c3-a757-92e164404f4a
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-02-23 17:02:57,085 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local229905877_0014
Stage-3 is filtered out by condition resolver.
Stage-2 is selected by condition resolver.
Stage-4 is filtered out by condition resolver.
Launching Job 3 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-02-23 17:02:58,321 Stage-2 map = 100%, reduce = 0%
Ended Job = job_local1230355508_0015
Moving data to: user/hive/warehouse/outprint
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 196944 HDFS Write: 219734 SUCCESS
Stage-Stage-2: HDFS Read: 109940 HDFS Write: 121335 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 2.79 seconds
Then follow this path on HDFS to locate the file.
Success.
2.2 Exporting to the Local Filesystem with a Hadoop Command
hive> dfs -get /user/hive/warehouse/outprint/000000_0 /home/centos01/modules/apache-hive-1.2.2-bin/iotmp/test.csv;
2.3 Exporting with the Hive Shell
[centos01@linux01 apache-hive-1.2.2-bin]$ bin/hive -e 'select * from default.stu1;' > /home/centos01/modules/apache-hive-1.2.2-bin/iotmp/outprint/test2.csv
Logging initialized using configuration in file:/home/centos01/modules/apache-hive-1.2.2-bin/conf/hive-log4j.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/centos01/modules/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/centos01/modules/hbase-1.3.1/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
OK
Time taken: 1.944 seconds, Fetched: 429 row(s)
Export succeeded.
2.4 Exporting to HDFS with Export
hive> export table default.stu1 to '/user/hive/warehouse/outprint/testprint.csv';
Copying data from file:/tmp/centos01/0dceefbc-1723-4afe-8154-449feedfe263/hive_2019-02-23_21-44-30_221_8295514754594839775-1/-local-10000/_metadata
Copying file: file:/tmp/centos01/0dceefbc-1723-4afe-8154-449feedfe263/hive_2019-02-23_21-44-30_221_8295514754594839775-1/-local-10000/_metadata
Copying data from hdfs://linux01:8020/user/hive/warehouse/stu1/year=2010
Copying file: hdfs://linux01:8020/user/hive/warehouse/stu1/year=2010/mfd_day_share_interest.csv
Copying data from hdfs://linux01:8020/user/hive/warehouse/stu1/year=2011
Copying file: hdfs://linux01:8020/user/hive/warehouse/stu1/year=2011/000000_0
OK
Time taken: 1.534 seconds
Export succeeded.
3. Deleting Data from a Table
hive> truncate table stu;
OK
Time taken: 0.106 seconds
Note: truncate can only clear managed (internal) tables; it cannot be used on external tables.
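As a hypothetical sketch of that restriction (the table name and location are invented here), truncating an external table fails at the semantic-analysis stage:

```sql
-- Hypothetical external table; Hive manages only its metadata, not the data files.
create external table ext_stu(tdate bigint) location '/data/ext_stu';
-- Hive rejects this with a SemanticException: cannot truncate a non-managed table.
truncate table ext_stu;
```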
IV. Queries
1. Basic Queries
1.1 Column Aliases
Renaming a column makes calculations easier. This actually came to mind back when learning HBase: in a column-oriented store (though Hive is not one), overly long column names waste memory and cause needless trouble, so aliasing with as is very handy. Here is how it works:
hive> select tdate t,interest as i from stu1;
Writing as when aliasing is optional; it comes down to personal preference.
1.2 Arithmetic Operators
Operator | Description |
---|---|
A+B | Addition |
A-B | Subtraction |
A*B | Multiplication |
A/B | Division |
A%B | A modulo B |
A&B | Bitwise AND of A and B |
A^B | Bitwise XOR of A and B |
~A | Bitwise NOT of A |
A\|B | Bitwise OR of A and B |
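A quick way to try these is selecting literal expressions (Hive 0.13+ supports select without a from clause); the bitwise results below follow from 7 = 111 and 2 = 010 in binary:

```sql
select 7 + 2,   -- 9
       7 - 2,   -- 5
       7 * 2,   -- 14
       7 / 2,   -- 3.5 (the / operator always returns a double)
       7 % 2,   -- 1
       7 & 2,   -- 2  (111 & 010 = 010)
       7 | 2,   -- 7  (111 | 010 = 111)
       7 ^ 2,   -- 5  (111 ^ 010 = 101)
       ~7;      -- -8
```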
1.3 Common Functions
1.3.1 count
hive> select count(*) from stu1;
Query ID = centos01_20190223222341_07b9de4b-07c5-446f-9710-f72e4c3c8233
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-02-23 22:23:43,461 Stage-1 map = 100%, reduce = 0%
2019-02-23 22:23:44,475 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1064054946_0001
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 87957 HDFS Write: 40383 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
429
Time taken: 3.308 seconds, Fetched: 1 row(s)
1.3.2 max
hive> select max(tdate) sumNum from stu1;
Query ID = centos01_20190224085856_03d4ed01-3491-44ae-bd7a-878db173ca80
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-02-24 08:58:57,897 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1109629230_0002
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 58638 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
20140831
Time taken: 1.459 seconds, Fetched: 1 row(s)
1.3.3 min
hive> select min(tdate) sumNum from stu1;
Query ID = centos01_20190224085918_e8284ec5-ad16-4313-b302-b1cb065453e8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-02-24 08:59:19,748 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1173673069_0003
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 87957 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
20130701
Time taken: 1.48 seconds, Fetched: 1 row(s)
1.3.4 sum
hive> select sum(tdate) sumNum from stu1;
Query ID = centos01_20190224085754_367acdef-2f0e-4707-9f5c-7ee23321d623
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-02-24 08:58:04,335 Stage-1 map = 0%, reduce = 0%
2019-02-24 08:58:06,446 Stage-1 map = 100%, reduce = 0%
2019-02-24 08:58:07,473 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local489812945_0001
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 29319 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
8618362131
Time taken: 12.605 seconds, Fetched: 1 row(s)
1.3.5 avg
hive> select avg(tdate) sumNum from stu1;
Query ID = centos01_20190224085943_ef097957-7f9a-4c0c-b911-d79e8a182032
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-02-24 08:59:45,237 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1920915701_0004
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 117276 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
2.013636011915888E7
Time taken: 1.478 seconds, Fetched: 1 row(s)
1.4 The Limit Clause
When a query would return many rows, limit caps how many are returned:
hive> select * from stu1 limit 5;
OK
NULL NULL NULL 2010
20130701 1.5787 6.307 2010
20130701 1.5787 6.307 2010
20130702 1.5461 6.174 2010
20130703 1.467 6.034 2010
Time taken: 0.095 seconds, Fetched: 5 row(s)
2. The Where Clause
Operator | Supported Types | Description |
---|---|---|
A=B | Primitive types | TRUE if A equals B, otherwise FALSE |
A<=>B | Primitive types | TRUE if A and B are both NULL, FALSE if exactly one of them is NULL; otherwise the same as the = operator (it never returns NULL) |
A<>B, A!=B | Primitive types | NULL if A or B is NULL; TRUE if A is not equal to B, otherwise FALSE |
A<B | Primitive types | NULL if A or B is NULL; TRUE if A is less than B, otherwise FALSE |
A<=B | Primitive types | NULL if A or B is NULL; TRUE if A is less than or equal to B, otherwise FALSE |
A>B | Primitive types | NULL if A or B is NULL; TRUE if A is greater than B, otherwise FALSE |
A>=B | Primitive types | NULL if A or B is NULL; TRUE if A is greater than or equal to B, otherwise FALSE |
A [NOT] BETWEEN B AND C | Primitive types | NULL if any of A, B, or C is NULL. TRUE if A is greater than or equal to B and less than or equal to C, otherwise FALSE. NOT inverts the result |
A IS NULL | All types | TRUE if A is NULL, otherwise FALSE |
A IS NOT NULL | All types | TRUE if A is not NULL, otherwise FALSE |
IN(value1, value2) | All types | TRUE if A equals any value in the list |
A [NOT] LIKE B | STRING | B is a simple SQL pattern: 'x%' means A must start with 'x', '%x' means A must end with 'x', and '%x%' means A contains 'x' anywhere (start, end, or middle). TRUE if A matches, otherwise FALSE; NOT inverts the result |
A RLIKE B, A REGEXP B | STRING | B is a regular expression; TRUE if A matches it, otherwise FALSE. Matching uses the JDK regex engine and follows its rules; A matches if any substring of A matches B, so anchor with ^ and $ when you need the whole string to match |
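The A<=>B operator from the table gets no example below; a minimal sketch of how it differs from =, using the NULL row that exists in stu1:

```sql
-- tdate = null matches nothing: the comparison evaluates to NULL, not TRUE.
select * from stu1 where tdate = null;
-- tdate <=> null matches exactly the rows whose tdate is NULL.
select * from stu1 where tdate <=> null;
```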
2.1 Comparison Operators (Between…and…, Null, In)
2.1.1 Between…and…
Selects rows whose value lies between the two bounds (inclusive); used after where:
hive> select * from stu1 where tdate between 20140821 and 20140831;
OK
20140821 1.1168 4.158 2010
20140822 1.104 4.153 2010
20140823 1.0997 4.146 2010
20140824 1.0993 4.139 2010
20140825 1.1122 4.139 2010
20140826 1.1085 4.131 2010
20140827 1.1079 4.123 2010
20140828 1.1043 4.116 2010
20140829 1.1169 4.123 2010
20140830 1.1206 4.134 2010
20140831 1.1204 4.146 2010
2.1.2 is Null
Query for NULL values:
hive> select * from stu1 where tdate is NUll;
OK
NULL NULL NULL 2010
Time taken: 0.081 seconds, Fetched: 1 row(s)
2.1.3 In(n1,n2)
Selects rows whose value is either n1 or n2:
hive> select * from stu1 where tdate in (20140401,20140831);
OK
20140401 1.4368 5.356 2010
20140831 1.1204 4.146 2010
Time taken: 0.086 seconds, Fetched: 2 row(s)
2.2 Like and RLike
The like operator selects values that match a pattern.
The pattern can contain characters or digits (% matches any number of characters; _ matches exactly one character).
RLike is a Hive extension of this feature: it matches conditions with Java regular expressions, a far more powerful language.
Examples:
Find interest rates that start with 1:
hive> select * from stu1 where interest like '1%';
OK
20130701 1.5787 6.307 2010
20130701 1.5787 6.307 2010
20130702 1.5461 6.174 2010
20130703 1.467 6.034 2010
Find interest rates that contain a 2:
hive> select * from stu1 where interest like '%2%';
Find interest rates whose second character is 3:
hive> select * from stu1 where interest like '_3%';
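The RLike variant introduced above has no example of its own; a sketch with assumed patterns (remember RLIKE matches any substring, so anchor with ^ when you mean "starts with"):

```sql
-- Rates containing a 2 anywhere, equivalent to like '%2%':
select * from stu1 where interest rlike '2';
-- Rates starting with 1, equivalent to like '1%' (^ anchors to the start):
select * from stu1 where interest rlike '^1';
```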
2.3 Logical Operators (And, Or, Not)
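This section carries no example in the original; a minimal sketch on stu1 (the thresholds are invented for illustration):

```sql
-- and: both conditions must hold
select * from stu1 where interest > 1.5 and seven_interest > 6.0;
-- or: either condition suffices
select * from stu1 where interest > 1.5 or seven_interest > 6.0;
-- not: negates a condition
select * from stu1 where not tdate between 20140821 and 20140831;
```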
3. Grouping
※※※3.1 Group By
Group by is usually used together with aggregate functions: it groups the results by one or more columns, then applies the aggregation to each group.
hive> select avg(interest),tdate from stu1 group by tdate;
Query ID = centos01_20190224094703_82c686d7-36d1-4fd7-a041-4d2fb2e53f74
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-02-24 09:47:04,787 Stage-1 map = 100%, reduce = 0%
2019-02-24 09:47:05,793 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1477191845_0005
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 364116 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
3.2 Having
where | Having |
---|---|
Filters rows as they are read by the query | Filters the aggregated query results |
Cannot contain aggregate functions | Can contain aggregate functions |
 | Only used in Group By statements |
hive> select avg(interest),tdate from stu1 group by tdate having avg(interest)> 1.5;
Query ID = centos01_20190224095607_6fe9db90-2c79-470c-abf6-d8eee70e0b19
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-02-24 09:56:09,270 Stage-1 map = 100%, reduce = 0%
2019-02-24 09:56:10,277 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1381061535_0006
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 393435 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK