III. DML Operations
1. Importing Data
1.1 Loading Data into a Table with Load
Syntax:
[1] load data: load data
[2] local: load from the local filesystem; omit it to load from HDFS
[3] inpath: the path to load from
[4] overwrite: overwrite the existing data
[5] into table: write into the table
[6] shortTable: the target table name
[7] partition: the target partition
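Putting the keywords above together, the general form looks like this (bracketed parts are optional; the path and names are placeholders):

```sql
load data [local] inpath '/path/to/datafile'
[overwrite] into table table_name
[partition (partcol1='val1', partcol2='val2')];
```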
hive> load data local inpath '/home/centos01/modules/apache-hive-1.2.2-bin/examples/files/x.txt' overwrite into table shortTable partition(year='2010',month='04');
Loading data to table default.shorttable partition (year=2010, month=04)
Partition default.shorttable{year=2010, month=04} stats: [numFiles=1, numRows=0, totalSize=13, rawDataSize=0]
OK
Time taken: 0.616 seconds
1.2 Inserting Data into a Table via a Query Statement
First, create a partitioned table:
hive> create table stu1(tdate bigint,interest double,seven_interest double) partitioned by (year string) row format delimited fields terminated by ',';
OK
Time taken: 0.137 seconds
A regular insert of values:
hive> insert into table stu1 partition(year = '2010') values (2010,1.025,9.999);
Query ID = centos01_20190223155025_f28003cd-2808-47a9-82a0-a97a760ee1a0
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-02-23 15:50:28,030 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local120146978_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://linux01:8020/user/hive/warehouse/stu1/year=2010/.hive-staging_hive_2019-02-23_15-50-25_921_6506809753036083134-1/-ext-10000
Loading data to table default.stu1 partition (year=2010)
Partition default.stu1{year=2010} stats: [numFiles=1, numRows=1, totalSize=17, rawDataSize=16]
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 9946 HDFS Write: 49077 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 2.357 seconds
Now insert using a basic-mode insert…select statement:
hive> insert overwrite table stu1 partition(year='2010') select tdate,interest,seven_interest from stu1 where year='2010';
Query ID = centos01_20190223160342_25f36789-e5b1-4b25-9fc2-9d970d51b106
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-02-23 16:03:44,018 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local487572732_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://linux01:8020/user/hive/warehouse/stu1/year=2010/.hive-staging_hive_2019-02-23_16-03-42_190_3152532333837736856-1/-ext-10000
Loading data to table default.stu1 partition (year=2010)
Partition default.stu1{year=2010} stats: [numFiles=1, numRows=1, totalSize=17, rawDataSize=16]
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 10058 HDFS Write: 49172 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 2.198 seconds
Now overwrite by selecting from year='2011'; that source partition is empty, so the stats below show the target partition being overwritten with 0 rows:
hive> insert overwrite table stu1 partition(year='2010') select tdate,interest,seven_interest from stu1 where year='2011';
Query ID = centos01_20190223162020_17a24fcf-1db8-4945-9373-560448678eba
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-02-23 16:20:21,712 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local1592452816_0010
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://linux01:8020/user/hive/warehouse/stu1/year=2010/.hive-staging_hive_2019-02-23_16-20-20_378_2894671489011544195-1/-ext-10000
Loading data to table default.stu1 partition (year=2010)
Partition default.stu1{year=2010} stats: [numFiles=1, numRows=0, totalSize=0, rawDataSize=0]
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 59332 HDFS Write: 88626 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 2.034 seconds
hive> desc stu1;
OK
tdate bigint
interest double
seven_interest double
year string
# Partition Information
# col_name data_type comment
year string
Time taken: 0.03 seconds, Fetched: 9 row(s)
2. Exporting Data
2.1 Exporting with Insert
2.1.1 Exporting query results to the local filesystem
hive> insert overwrite local directory '/home/centos01/modules/apache-hive-1.2.2-bin/iotmp/outprint' select * from stu1;
Query ID = centos01_20190223165232_50f55171-8490-4b3c-92f9-fa327e038867
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-02-23 16:52:33,561 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local643035792_0012
Copying data to local directory /home/centos01/modules/apache-hive-1.2.2-bin/iotmp/outprint
Copying data to local directory /home/centos01/modules/apache-hive-1.2.2-bin/iotmp/outprint
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 157852 HDFS Write: 196798 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 1.308 seconds
Exporting this way can leave the CSV file garbled, because Hive writes its default ^A (Ctrl-A) field delimiter; hence 2.1.2.
2.1.2 Exporting query results to the local filesystem with formatting
hive> insert overwrite local directory '/home/centos01/modules/apache-hive-1.2.2-bin/iotmp/outprint' row format delimited fields terminated by ',' select * from stu1;
Query ID = centos01_20190223165555_eab0d523-271f-41ce-ba3d-66cce2b40bf6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-02-23 16:55:57,185 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local1800033420_0013
Copying data to local directory /home/centos01/modules/apache-hive-1.2.2-bin/iotmp/outprint
Copying data to local directory /home/centos01/modules/apache-hive-1.2.2-bin/iotmp/outprint
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 177398 HDFS Write: 196798 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 1.312 seconds
2.1.3 Exporting formatted query results to HDFS
hive> insert overwrite directory 'user/hive/warehouse/outprint' row format delimited fields terminated by ',' select * from stu1;
Query ID = centos01_20190223170255_f1b5f343-cf17-47c3-a757-92e164404f4a
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-02-23 17:02:57,085 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local229905877_0014
Stage-3 is filtered out by condition resolver.
Stage-2 is selected by condition resolver.
Stage-4 is filtered out by condition resolver.
Launching Job 3 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-02-23 17:02:58,321 Stage-2 map = 100%, reduce = 0%
Ended Job = job_local1230355508_0015
Moving data to: user/hive/warehouse/outprint
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 196944 HDFS Write: 219734 SUCCESS
Stage-Stage-2: HDFS Read: 109940 HDFS Write: 121335 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 2.79 seconds
Then follow this path on HDFS to locate the file.
Success.
2.2 Exporting to the Local Filesystem with a Hadoop Command
hive> dfs -get /user/hive/warehouse/outprint/000000_0 /home/centos01/modules/apache-hive-1.2.2-bin/iotmp/test.csv;
2.3 Exporting with the Hive Shell
[centos01@linux01 apache-hive-1.2.2-bin]$ bin/hive -e 'select * from default.stu1;' > /home/centos01/modules/apache-hive-1.2.2-bin/iotmp/outprint/test2.csv
Logging initialized using configuration in file:/home/centos01/modules/apache-hive-1.2.2-bin/conf/hive-log4j.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/centos01/modules/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/centos01/modules/hbase-1.3.1/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
OK
Time taken: 1.944 seconds, Fetched: 429 row(s)
Export succeeded.
2.4 Exporting to HDFS with Export
hive> export table default.stu1 to '/user/hive/warehouse/outprint/testprint.csv';
Copying data from file:/tmp/centos01/0dceefbc-1723-4afe-8154-449feedfe263/hive_2019-02-23_21-44-30_221_8295514754594839775-1/-local-10000/_metadata
Copying file: file:/tmp/centos01/0dceefbc-1723-4afe-8154-449feedfe263/hive_2019-02-23_21-44-30_221_8295514754594839775-1/-local-10000/_metadata
Copying data from hdfs://linux01:8020/user/hive/warehouse/stu1/year=2010
Copying file: hdfs://linux01:8020/user/hive/warehouse/stu1/year=2010/mfd_day_share_interest.csv
Copying data from hdfs://linux01:8020/user/hive/warehouse/stu1/year=2011
Copying file: hdfs://linux01:8020/user/hive/warehouse/stu1/year=2011/000000_0
OK
Time taken: 1.534 seconds
Export succeeded.
3. Deleting Data from a Table
hive> truncate table stu;
OK
Time taken: 0.106 seconds
Note: truncate can only clear managed (internal) tables; it cannot be used on external tables.
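As a hypothetical sketch of that restriction (the table name and location are invented here), truncating an external table fails at the semantic-analysis stage:

```sql
-- Hypothetical external table; Hive manages only its metadata, not the data files.
create external table ext_stu(tdate bigint) location '/data/ext_stu';
-- Hive rejects this with a SemanticException: cannot truncate a non-managed table.
truncate table ext_stu;
```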
IV. Queries
1. Basic Queries
1.1 Column Aliases
Renaming a column makes calculations easier. This actually came to mind back when learning HBase: in a column-oriented store (though Hive is not one), overly long column names waste memory and cause needless trouble, so aliasing with as is very handy. Here is how it works:
hive> select tdate t,interest as i from stu1;
Writing as when aliasing is optional; it comes down to personal preference.
1.2 Arithmetic Operators
Operator | Description |
---|---|
A+B | Addition |
A-B | Subtraction |
A*B | Multiplication |
A/B | Division |
A%B | A modulo B |
A&B | Bitwise AND of A and B |
A^B | Bitwise XOR of A and B |
~A | Bitwise NOT of A |
A\|B | Bitwise OR of A and B |
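A quick way to try these is selecting literal expressions (Hive 0.13+ supports select without a from clause); the bitwise results below follow from 7 = 111 and 2 = 010 in binary:

```sql
select 7 + 2,   -- 9
       7 - 2,   -- 5
       7 * 2,   -- 14
       7 / 2,   -- 3.5 (the / operator always returns a double)
       7 % 2,   -- 1
       7 & 2,   -- 2  (111 & 010 = 010)
       7 | 2,   -- 7  (111 | 010 = 111)
       7 ^ 2,   -- 5  (111 ^ 010 = 101)
       ~7;      -- -8
```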
1.3 Common Functions
1.3.1 count
hive> select count(*) from stu1;
Query ID = centos01_20190223222341_07b9de4b-07c5-446f-9710-f72e4c3c8233
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-02-23 22:23:43,461 Stage-1 map = 100%, reduce = 0%
2019-02-23 22:23:44,475 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1064054946_0001
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 87957 HDFS Write: 40383 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
429
Time taken: 3.308 seconds, Fetched: 1 row(s)
1.3.2 max
hive> select max(tdate) sumNum from stu1;
Query ID = centos01_20190224085856_03d4ed01-3491-44ae-bd7a-878db173ca80
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-02-24 08:58:57,897 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1109629230_0002
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 58638 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
20140831
Time taken: 1.459 seconds, Fetched: 1 row(s)
1.3.3 min
hive> select min(tdate) sumNum from stu1;
Query ID = centos01_20190224085918_e8284ec5-ad16-4313-b302-b1cb065453e8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-02-24 08:59:19,748 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1173673069_0003
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 87957 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
20130701
Time taken: 1.48 seconds, Fetched: 1 row(s)
1.3.4 sum
hive> select sum(tdate) sumNum from stu1;
Query ID = centos01_20190224085754_367acdef-2f0e-4707-9f5c-7ee23321d623
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-02-24 08:58:04,335 Stage-1 map = 0%, reduce = 0%
2019-02-24 08:58:06,446 Stage-1 map = 100%, reduce = 0%
2019-02-24 08:58:07,473 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local489812945_0001
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 29319 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
8618362131
Time taken: 12.605 seconds, Fetched: 1 row(s)
1.3.5 avg
hive> select avg(tdate) sumNum from stu1;
Query ID = centos01_20190224085943_ef097957-7f9a-4c0c-b911-d79e8a182032
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-02-24 08:59:45,237 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1920915701_0004
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 117276 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
2.013636011915888E7
Time taken: 1.478 seconds, Fetched: 1 row(s)
1.4 The Limit Clause
When a query would return many rows, limit caps how many are returned:
hive> select * from stu1 limit 5;
OK
NULL NULL NULL 2010
20130701 1.5787 6.307 2010
20130701 1.5787 6.307 2010
20130702 1.5461 6.174 2010
20130703 1.467 6.034 2010
Time taken: 0.095 seconds, Fetched: 5 row(s)
2. The Where Clause
Operator | Supported Types | Description |
---|---|---|
A=B | Primitive types | TRUE if A equals B, otherwise FALSE |
A<=>B | Primitive types | TRUE if A and B are both NULL, FALSE if exactly one of them is NULL; otherwise the same as the = operator (it never returns NULL) |
A<>B, A!=B | Primitive types | NULL if A or B is NULL; TRUE if A is not equal to B, otherwise FALSE |
A<B | Primitive types | NULL if A or B is NULL; TRUE if A is less than B, otherwise FALSE |
A<=B | Primitive types | NULL if A or B is NULL; TRUE if A is less than or equal to B, otherwise FALSE |
A>B | Primitive types | NULL if A or B is NULL; TRUE if A is greater than B, otherwise FALSE |
A>=B | Primitive types | NULL if A or B is NULL; TRUE if A is greater than or equal to B, otherwise FALSE |
A [NOT] BETWEEN B AND C | Primitive types | NULL if any of A, B, or C is NULL. TRUE if A is greater than or equal to B and less than or equal to C, otherwise FALSE. NOT inverts the result |
A IS NULL | All types | TRUE if A is NULL, otherwise FALSE |
A IS NOT NULL | All types | TRUE if A is not NULL, otherwise FALSE |
IN(value1, value2) | All types | TRUE if A equals any value in the list |
A [NOT] LIKE B | STRING | B is a simple SQL pattern: 'x%' means A must start with 'x', '%x' means A must end with 'x', and '%x%' means A contains 'x' anywhere (start, end, or middle). TRUE if A matches, otherwise FALSE; NOT inverts the result |
A RLIKE B, A REGEXP B | STRING | B is a regular expression; TRUE if A matches it, otherwise FALSE. Matching uses the JDK regex engine and follows its rules; A matches if any substring of A matches B, so anchor with ^ and $ when you need the whole string to match |
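The A<=>B operator from the table gets no example below; a minimal sketch of how it differs from =, using the NULL row that exists in stu1:

```sql
-- tdate = null matches nothing: the comparison evaluates to NULL, not TRUE.
select * from stu1 where tdate = null;
-- tdate <=> null matches exactly the rows whose tdate is NULL.
select * from stu1 where tdate <=> null;
```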
2.1 Comparison Operators (Between…and…, Null, In)
2.1.1 Between…and…
Selects rows whose value lies between the two bounds (inclusive); used after where:
hive> select * from stu1 where tdate between 20140821 and 20140831;
OK
20140821 1.1168 4.158 2010
20140822 1.104 4.153 2010
20140823 1.0997 4.146 2010
20140824 1.0993 4.139 2010
20140825 1.1122 4.139 2010
20140826 1.1085 4.131 2010
20140827 1.1079 4.123 2010
20140828 1.1043 4.116 2010
20140829 1.1169 4.123 2010
20140830 1.1206 4.134 2010
20140831 1.1204 4.146 2010
2.1.2 is Null
Query for NULL values:
hive> select * from stu1 where tdate is NUll;
OK
NULL NULL NULL 2010
Time taken: 0.081 seconds, Fetched: 1 row(s)
2.1.3 In(n1,n2)
Selects rows whose value is either n1 or n2:
hive> select * from stu1 where tdate in (20140401,20140831);
OK
20140401 1.4368 5.356 2010
20140831 1.1204 4.146 2010
Time taken: 0.086 seconds, Fetched: 2 row(s)
2.2 Like and RLike
The like operator selects values that match a pattern.
The pattern can contain characters or digits (% matches any number of characters; _ matches exactly one character).
RLike is a Hive extension of this feature: it matches conditions with Java regular expressions, a far more powerful language.
Examples:
Find interest rates that start with 1:
hive> select * from stu1 where interest like '1%';
OK
20130701 1.5787 6.307 2010
20130701 1.5787 6.307 2010
20130702 1.5461 6.174 2010
20130703 1.467 6.034 2010
Find interest rates that contain a 2:
hive> select * from stu1 where interest like '%2%';
Find interest rates whose second character is 3:
hive> select * from stu1 where interest like '_3%';
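The RLike variant introduced above has no example of its own; a sketch with assumed patterns (remember RLIKE matches any substring, so anchor with ^ when you mean "starts with"):

```sql
-- Rates containing a 2 anywhere, equivalent to like '%2%':
select * from stu1 where interest rlike '2';
-- Rates starting with 1, equivalent to like '1%' (^ anchors to the start):
select * from stu1 where interest rlike '^1';
```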
2.3 Logical Operators (And, Or, Not)
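This section carries no example in the original; a minimal sketch on stu1 (the thresholds are invented for illustration):

```sql
-- and: both conditions must hold
select * from stu1 where interest > 1.5 and seven_interest > 6.0;
-- or: either condition suffices
select * from stu1 where interest > 1.5 or seven_interest > 6.0;
-- not: negates a condition
select * from stu1 where not tdate between 20140821 and 20140831;
```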
3. Grouping
※※※3.1 Group By
Group by is usually used together with aggregate functions: it groups the results by one or more columns, then applies the aggregation to each group.
hive> select avg(interest),tdate from stu1 group by tdate;
Query ID = centos01_20190224094703_82c686d7-36d1-4fd7-a041-4d2fb2e53f74
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-02-24 09:47:04,787 Stage-1 map = 100%, reduce = 0%
2019-02-24 09:47:05,793 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1477191845_0005
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 364116 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
3.2 Having
where | Having |
---|---|
Filters rows as they are read by the query | Filters the aggregated query results |
Cannot contain aggregate functions | Can contain aggregate functions |
 | Only used in Group By statements |
hive> select avg(interest),tdate from stu1 group by tdate having avg(interest)> 1.5;
Query ID = centos01_20190224095607_6fe9db90-2c79-470c-abf6-d8eee70e0b19
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-02-24 09:56:09,270 Stage-1 map = 100%, reduce = 0%
2019-02-24 09:56:10,277 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1381061535_0006
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 393435 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK