Basic Hive Usage (3)

I. Queries

Syntax:
SELECT [ALL | DISTINCT] select_expr, select_expr, ... -- which columns to return
FROM table_reference -- which table to read from
[WHERE where_condition] -- row-level filter
[GROUP BY col_list] -- grouping
[HAVING having_condition] -- filter applied after grouping
[ORDER BY col_list] -- global sort
[CLUSTER BY col_list -- distribute and sort by the same columns
|
[DISTRIBUTE BY col_list] -- distribute rows across reducers
[SORT BY col_list] -- sort within each reducer
]
[LIMIT number] -- limit the number of rows returned
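DISTRIBUTE BY, SORT BY, and CLUSTER BY appear in the syntax above but are not demonstrated later in this post, so here is a minimal sketch (assuming the emp table created in the next section; these statements are not run here):

set mapreduce.job.reduces=3; -- use 3 reducers so the distribution is visible
select * from emp distribute by deptno sort by sal desc; -- hash rows to reducers by deptno, then sort by sal within each reducer
--- CLUSTER BY deptno is shorthand for DISTRIBUTE BY deptno SORT BY deptno (ascending order only)
select * from emp cluster by deptno;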

1. Prepare the data and tables
a. Create the department and employee tables:

hive (dyhtest)> create table if not exists dept(
              > deptno int,
              > dname string,
              > loc int
              > )
              > row format delimited fields terminated by '\t';
OK
Time taken: 1.909 seconds
hive (dyhtest)> create table if not exists emp(
              > empno int,
              > ename string,
              > job string,
              > mgr int,
              > hiredate string, 
              > sal double, 
              > comm double,
              > deptno int)
              > row format delimited fields terminated by '\t';
OK
Time taken: 0.117 seconds

b. Prepare the data files

[atdyh@hadoop102 datas]$ vim dept.txt
10	ACCOUNTING	1700
20	RESEARCH	1800
30	SALES	1900
40	OPERATIONS	1700

[atdyh@hadoop102 datas]$ vim emp.txt
7369	SMITH	CLERK	7902	1980-12-17	800.00		20
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.00	300.00	30
7521	WARD	SALESMAN	7698	1981-2-22	1250.00	500.00	30
7566	JONES	MANAGER	7839	1981-4-2	2975.00		20
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.00	1400.00	30
7698	BLAKE	MANAGER	7839	1981-5-1	2850.00		30
7782	CLARK	MANAGER	7839	1981-6-9	2450.00		10
7788	SCOTT	ANALYST	7566	1987-4-19	3000.00		20
7839	KING	PRESIDENT		1981-11-17	5000.00		10
7844	TURNER	SALESMAN	7698	1981-9-8	1500.00	0.00	30
7876	ADAMS	CLERK	7788	1987-5-23	1100.00		20
7900	JAMES	CLERK	7698	1981-12-3	950.00		30
7902	FORD	ANALYST	7566	1981-12-3	3000.00		20
7934	MILLER	CLERK	7782	1982-1-23	1300.00		10

c. Load the data

--- department table
hive (dyhtest)> load data local inpath '/opt/module/hive-3.1.2/datas/dept.txt' into table dept;
Loading data to table dyhtest.dept
OK
Time taken: 1.466 seconds
hive (dyhtest)> select * from dept;
OK
dept.deptno	dept.dname	dept.loc
10	ACCOUNTING	1700
20	RESEARCH	1800
30	SALES	1900
40	OPERATIONS	1700
Time taken: 1.829 seconds, Fetched: 4 row(s)
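
The emp table is loaded the same way. Assuming emp.txt was saved to the same datas directory as dept.txt (as created above), the statement would be:

load data local inpath '/opt/module/hive-3.1.2/datas/emp.txt' into table emp;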

2. With the data in place, the querying can begin.
A few points to note before starting:
(1) SQL is case-insensitive.
(MySQL on Linux, by contrast, is case-sensitive.)
(2) SQL can be written on one line or across multiple lines.
(3) Keywords cannot be abbreviated or split across lines.
(4) Each clause is generally written on its own line.
(5) Use indentation to improve readability.

a. The LIMIT clause

hive (dyhtest)> select * from emp;
OK
emp.empno	emp.ename	emp.job	emp.mgr	emp.hiredate	emp.sal	emp.comm	emp.deptno
NULL	CLERK	7902	NULL	800.00	NULL	20.0	NULL
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.0	300.0	30
7521	WARD	SALESMAN	7698	1981-2-22	1250.0	500.0	30
7566	JONES	MANAGER	7839	1981-4-2	2975.0	NULL	20
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.0	1400.0	30
7698	BLAKE	MANAGER	7839	1981-5-1	2850.0	NULL	30
7782	CLARK	MANAGER	7839	1981-6-9	2450.0	NULL	10
7788	SCOTT	ANALYST	7566	1987-4-19	3000.0	NULL	20
7839	KING	PRESIDENT	NULL	1981-11-17	5000.0	NULL	10
7844	TURNER	SALESMAN	7698	1981-9-8	1500.0	0.0	30
7876	ADAMS	CLERK	7788	1987-5-23	1100.0	NULL	20
7900	JAMES	CLERK	7698	1981-12-3	950.0	NULL	30
7902	FORD	ANALYST	7566	1981-12-3	3000.0	NULL	20
7934	MILLER	CLERK	7782	1982-1-23	1300.0	NULL	10
Time taken: 0.127 seconds, Fetched: 14 row(s)
hive (dyhtest)> select * from emp limit 5;
OK
emp.empno	emp.ename	emp.job	emp.mgr	emp.hiredate	emp.sal	emp.comm	emp.deptno
NULL	CLERK	7902	NULL	800.00	NULL	20.0	NULL
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.0	300.0	30
7521	WARD	SALESMAN	7698	1981-2-22	1250.0	500.0	30
7566	JONES	MANAGER	7839	1981-4-2	2975.0	NULL	20
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.0	1400.0	30
Time taken: 0.159 seconds, Fetched: 5 row(s)
hive (dyhtest)> select * from emp limit 2,3;
OK
emp.empno	emp.ename	emp.job	emp.mgr	emp.hiredate	emp.sal	emp.comm	emp.deptno
7521	WARD	SALESMAN	7698	1981-2-22	1250.0	500.0	30
7566	JONES	MANAGER	7839	1981-4-2	2975.0	NULL	20
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.0	1400.0	30
Time taken: 0.131 seconds, Fetched: 3 row(s)
hive (dyhtest)> 

Note: older versions of Hive support only the single-argument form (e.g. LIMIT 3); newer versions also support the two-argument form LIMIT 2,3 (offset, count), which skips the first 2 rows and returns the next 3, enabling range queries as shown above.

b. RLIKE
The RLIKE clause is a Hive extension of LIKE that lets you specify match conditions using Java regular expressions, a much more powerful matching language.

--- employees whose name contains an A
hive (dyhtest)> select * from  emp where ename Rlike '[A]'; 
OK
emp.empno	emp.ename	emp.job	emp.mgr	emp.hiredate	emp.sal	emp.comm	emp.deptno
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.0	300.0	30
7521	WARD	SALESMAN	7698	1981-2-22	1250.0	500.0	30
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.0	1400.0	30
7698	BLAKE	MANAGER	7839	1981-5-1	2850.0	NULL	30
7782	CLARK	MANAGER	7839	1981-6-9	2450.0	NULL	10
7876	ADAMS	CLERK	7788	1987-5-23	1100.0	NULL	20
7900	JAMES	CLERK	7698	1981-12-3	950.0	NULL	30
Time taken: 2.116 seconds, Fetched: 7 row(s)

Note: regular expressions may be unfamiliar to many readers; a follow-up post will cover them in detail, so for now it is enough to be aware of this usage.
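
A few more illustrative patterns against the same emp table, as a sketch (not run here):

--- names that start with S ('^' anchors the match to the beginning of the string)
select ename from emp where ename rlike '^S';
--- names that end with N ('$' anchors the match to the end)
select ename from emp where ename rlike 'N$';
--- names that contain two consecutive L's
select ename from emp where ename rlike 'LL';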

c. The GROUP BY clause
GROUP BY is usually used together with aggregate functions: it groups the result set by one or more columns, then applies an aggregate to each group.

Note: after grouping, the SELECT list may only contain the grouping columns (group keys) and aggregate (group) functions.
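
As a minimal illustration of this rule (a sketch, not run here; the first statement is commented out because Hive rejects it):

--- invalid: ename is neither a grouping column nor wrapped in an aggregate
--- select deptno, ename, avg(sal) from emp group by deptno;
--- valid: only the group key and an aggregate appear in the SELECT list
select deptno, avg(sal) from emp group by deptno;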

demo1: compute the average salary of each department

hive (dyhtest)> select deptno,avg(sal) as avg_sal from emp group by deptno;
Query ID = atdyh_20220629232415_cfae7dd6-64a7-43b5-ba95-e2d90b0172c8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1656514177847_0001, Tracking URL = http://hadoop103:8088/proxy/application_1656514177847_0001/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job  -kill job_1656514177847_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-06-29 23:24:27,644 Stage-1 map = 0%,  reduce = 0%
2022-06-29 23:24:34,959 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.75 sec
2022-06-29 23:24:40,112 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.55 sec
MapReduce Total cumulative CPU time: 7 seconds 550 msec
Ended Job = job_1656514177847_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 7.55 sec   HDFS Read: 16481 HDFS Write: 196 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 550 msec
OK
deptno	avg_sal
NULL	NULL
10	2916.6666666666665
20	2518.75
30	1566.6666666666667
Time taken: 25.24 seconds, Fetched: 4 row(s)

demo2: compute the highest monthly salary for each job within each department

hive (dyhtest)> select deptno ,job ,max(sal) max_sal from emp group by deptno,job ; 
Query ID = atdyh_20220629235752_e170bdf1-e1fa-4b62-9f34-77ea8e2d7a34
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1656514177847_0002, Tracking URL = http://hadoop103:8088/proxy/application_1656514177847_0002/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job  -kill job_1656514177847_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-06-29 23:58:09,002 Stage-1 map = 0%,  reduce = 0%
2022-06-29 23:58:17,303 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 10.63 sec
2022-06-29 23:58:24,624 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 14.31 sec
MapReduce Total cumulative CPU time: 14 seconds 310 msec
Ended Job = job_1656514177847_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 14.31 sec   HDFS Read: 15064 HDFS Write: 376 SUCCESS
Total MapReduce CPU Time Spent: 14 seconds 310 msec
OK
deptno	job	max_sal
NULL	7902	NULL
20	ANALYST	3000.0
10	CLERK	1300.0
20	CLERK	1100.0
30	CLERK	950.0
10	MANAGER	2450.0
20	MANAGER	2975.0
30	MANAGER	2850.0
10	PRESIDENT	5000.0
30	SALESMAN	1600.0
Time taken: 32.889 seconds, Fetched: 10 row(s)

demo3: find the highest-paid employee in each department of emp
Approach:
1. "The highest-paid person in each department" involves the department (deptno), the salary (sal), and the person (ename); since emp already carries deptno, the emp table alone suffices and no join with dept is needed.
2. With the result columns settled, the logic follows: first group emp by deptno, then select deptno, max(sal).
3. Join that result set back to emp; since the result requires no additional data, a plain JOIN is sufficient.

hive (dyhtest)> select e.deptno,e.sal,e.ename 
              > from emp e
              > join (select deptno,max(sal) as max_sal from emp  group by deptno) a
              > on e.deptno = a.deptno and e.sal = a.max_sal;
Query ID = atdyh_20220702091329_a7d97fbd-918e-4695-a1fb-adca420cfc8a
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1656722309456_0001, Tracking URL = http://hadoop103:8088/proxy/application_1656722309456_0001/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job  -kill job_1656722309456_0001
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2022-07-02 09:13:44,837 Stage-2 map = 0%,  reduce = 0%
2022-07-02 09:13:50,167 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 3.61 sec
2022-07-02 09:14:01,516 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 7.44 sec
MapReduce Total cumulative CPU time: 7 seconds 440 msec
Ended Job = job_1656722309456_0001
2022-07-02 09:14:11	Dump the side-table for tag: 0 with group count: 11 into file: file:/tmp/atdyh/3844baf9-c259-4e3b-ad34-8232f650620b/hive_2022-07-02_09-13-29_967_3191515226531721885-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile00--.hashtable
2022-07-02 09:14:11	Uploaded 1 File to: file:/tmp/atdyh/3844baf9-c259-4e3b-ad34-8232f650620b/hive_2022-07-02_09-13-29_967_3191515226531721885-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile00--.hashtable (648 bytes)
Execution completed successfully
MapredLocal task succeeded
Launching Job 2 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1656722309456_0002, Tracking URL = http://hadoop103:8088/proxy/application_1656722309456_0002/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job  -kill job_1656722309456_0002
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2022-07-02 09:14:19,436 Stage-3 map = 0%,  reduce = 0%
2022-07-02 09:14:25,688 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 3.18 sec
MapReduce Total cumulative CPU time: 3 seconds 180 msec
Ended Job = job_1656722309456_0002
MapReduce Jobs Launched: 
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 7.44 sec   HDFS Read: 13822 HDFS Write: 174 SUCCESS
Stage-Stage-3: Map: 1   Cumulative CPU: 3.18 sec   HDFS Read: 7737 HDFS Write: 197 SUCCESS
Total MapReduce CPU Time Spent: 10 seconds 620 msec
OK
e.deptno	e.sal	e.ename
10	5000.0	KING
20	3000.0	SCOTT
20	3000.0	FORD
30	2850.0	BLAKE
Time taken: 56.836 seconds, Fetched: 4 row(s)
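
On Hive versions with windowing support (0.11 and later), the same result can also be obtained with a window function. This is an alternative sketch, not the method used above, and is not run here:

select deptno, sal, ename
from (
    select deptno, sal, ename,
           rank() over (partition by deptno order by sal desc) as rk -- rank salaries within each department
    from emp
) t
where rk = 1                -- keep only the top salary per department
  and deptno is not null;   -- drop the malformed row whose deptno parsed as NULL

rank() preserves ties, so both SCOTT and FORD (3000.0 in dept 20) would still appear, matching the join result above.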

d. The HAVING keyword

Among the employees in emp whose job is not CLERK, find the departments whose average salary is greater than 2000, and return the department and its average salary.

hive (dyhtest)> select
              >   deptno , avg(sal) avg_sal 
              > from 
              > emp
              > where job != 'CLERK'
              > group by deptno 
              > having avg_sal >2000 ;
Query ID = atdyh_20220704233101_015b6acb-b8f1-43a6-86dd-19529e5cbace
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1656948379750_0001, Tracking URL = http://hadoop103:8088/proxy/application_1656948379750_0001/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job  -kill job_1656948379750_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-07-04 23:31:16,210 Stage-1 map = 0%,  reduce = 0%
2022-07-04 23:31:22,556 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.81 sec
2022-07-04 23:31:28,731 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 10.62 sec
MapReduce Total cumulative CPU time: 10 seconds 620 msec
Ended Job = job_1656948379750_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 10.62 sec   HDFS Read: 16274 HDFS Write: 143 SUCCESS
Total MapReduce CPU Time Spent: 10 seconds 620 msec
OK
deptno	avg_sal
10	3725.0
20	2991.6666666666665
Time taken: 28.323 seconds, Fetched: 2 row(s)

Notes:
a. Aggregate (group) functions cannot be used after WHERE, but they can be used after HAVING.
b. HAVING can only be used in statements that group with GROUP BY.
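
To make the difference concrete, a minimal sketch (not run here; the first statement is commented out because Hive rejects aggregates in WHERE):

--- invalid: an aggregate function is not allowed in WHERE
--- select deptno, avg(sal) from emp where avg(sal) > 2000 group by deptno;
--- valid: filter on the aggregate in HAVING instead
select deptno, avg(sal) avg_sal from emp group by deptno having avg_sal > 2000;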

