I. Queries
Syntax:
SELECT [ALL | DISTINCT] select_expr, select_expr, ... -- which columns to select
FROM table_reference -- which table to read from
[WHERE where_condition] -- row-level filter
[GROUP BY col_list] -- grouping
[HAVING having_condition] -- filter applied after grouping
[ORDER BY col_list] -- global sort
[CLUSTER BY col_list -- distribute and sort by the same column
|
[DISTRIBUTE BY col_list] -- distribute rows across reducers
[SORT BY col_list] -- sort within each reducer
]
[LIMIT number] -- cap the number of rows returned
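As a sketch of how the distribution clauses fit together (a hedged example, not run in this session; it assumes the emp table created below), DISTRIBUTE BY picks the shuffle key and SORT BY orders rows within each reducer, not globally:

```sql
-- Hypothetical sketch against the emp table defined in the next section.
-- DISTRIBUTE BY sends all rows with the same deptno to the same reducer;
-- SORT BY then orders rows within each reducer only.
SELECT deptno, ename, sal
FROM emp
WHERE sal > 1000          -- row filter, applied before the shuffle
DISTRIBUTE BY deptno      -- shuffle key
SORT BY sal DESC          -- per-reducer descending salary
LIMIT 10;                 -- cap the rows returned
```

CLUSTER BY deptno would be shorthand for DISTRIBUTE BY deptno SORT BY deptno, but it only supports ascending order.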
1. Prepare the data and tables
a. Create the department and employee tables:
hive (dyhtest)> create table if not exists dept(
> deptno int,
> dname string,
> loc int
> )
> row format delimited fields terminated by '\t';
OK
Time taken: 1.909 seconds
hive (dyhtest)> create table if not exists emp(
> empno int,
> ename string,
> job string,
> mgr int,
> hiredate string,
> sal double,
> comm double,
> deptno int)
> row format delimited fields terminated by '\t';
OK
Time taken: 0.117 seconds
b. Prepare the data files:
[atdyh@hadoop102 datas]$ vim dept.txt
10 ACCOUNTING 1700
20 RESEARCH 1800
30 SALES 1900
40 OPERATIONS 1700
[atdyh@hadoop102 datas]$ vim emp.txt
7369 SMITH CLERK 7902 1980-12-17 800.00 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.00 300.00 30
7521 WARD SALESMAN 7698 1981-2-22 1250.00 500.00 30
7566 JONES MANAGER 7839 1981-4-2 2975.00 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.00 1400.00 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.00 30
7782 CLARK MANAGER 7839 1981-6-9 2450.00 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.00 20
7839 KING PRESIDENT 1981-11-17 5000.00 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.00 0.00 30
7876 ADAMS CLERK 7788 1987-5-23 1100.00 20
7900 JAMES CLERK 7698 1981-12-3 950.00 30
7902 FORD ANALYST 7566 1981-12-3 3000.00 20
7934 MILLER CLERK 7782 1982-1-23 1300.00 10
c. Load the data
--- department table
hive (dyhtest)> load data local inpath '/opt/module/hive-3.1.2/datas/dept.txt' into table dept;
Loading data to table dyhtest.dept
OK
Time taken: 1.466 seconds
hive (dyhtest)> select * from dept;
OK
dept.deptno dept.dname dept.loc
10 ACCOUNTING 1700
20 RESEARCH 1800
30 SALES 1900
40 OPERATIONS 1700
Time taken: 1.829 seconds, Fetched: 4 row(s)
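The emp table is loaded the same way; assuming emp.txt sits in the same datas directory as dept.txt (the path is an assumption mirroring the load above), the statement would be:

```sql
-- Assumed analogous load for the employee table;
-- the path mirrors the dept.txt load shown above.
load data local inpath '/opt/module/hive-3.1.2/datas/emp.txt' into table emp;
```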
2. With the data ready, we can start querying.
A few points to note before we begin:
Note:
(1) SQL is not case-sensitive.
(MySQL on Linux, by contrast, is case-sensitive.)
(2) SQL can be written on one line or across multiple lines.
(3) Keywords cannot be abbreviated or split across lines.
(4) Each clause is usually written on its own line.
(5) Use indentation to improve readability.
a. The LIMIT clause
hive (dyhtest)> select * from emp;
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
NULL CLERK 7902 NULL 800.00 NULL 20.0 NULL
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
Time taken: 0.127 seconds, Fetched: 14 row(s)
hive (dyhtest)> select * from emp limit 5;
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
NULL CLERK 7902 NULL 800.00 NULL 20.0 NULL
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
Time taken: 0.159 seconds, Fetched: 5 row(s)
hive (dyhtest)> select * from emp limit 2,3;
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
Time taken: 0.131 seconds, Fetched: 3 row(s)
hive (dyhtest)>
Note: older Hive versions support only the single-argument form (limit 3); newer versions also support the two-argument form (limit 2,3), i.e. limit offset,count, for range queries.
b. RLIKE
The RLIKE clause is a Hive extension of this matching functionality: it lets you specify match conditions with Java regular expressions, a far more powerful language than LIKE patterns.
--- employees whose name contains an A
hive (dyhtest)> select * from emp where ename Rlike '[A]';
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
Time taken: 2.116 seconds, Fetched: 7 row(s)
Note: if regular expressions are unfamiliar, a follow-up post will cover them in detail; for now it is enough to be aware of this usage.
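A few more RLIKE patterns, as hedged sketches against the same emp table (not run in this session):

```sql
-- Names starting with A (^ anchors the start of the string)
SELECT ename FROM emp WHERE ename RLIKE '^A';

-- Names ending with S ($ anchors the end of the string)
SELECT ename FROM emp WHERE ename RLIKE 'S$';

-- Names containing two consecutive L's
SELECT ename FROM emp WHERE ename RLIKE 'LL';
```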
c. The GROUP BY clause
GROUP BY is usually used together with aggregate functions: the result set is grouped by one or more columns, then an aggregate operation is applied to each group.
Note: after grouping, the SELECT list may only contain the grouping columns and aggregate functions.
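To illustrate that restriction (a sketch, not run in this session): the commented-out query fails because ename is neither a grouping column nor aggregated, while the second query is valid:

```sql
-- Invalid: ename is not in GROUP BY and is not aggregated
-- SELECT deptno, ename, avg(sal) FROM emp GROUP BY deptno;

-- Valid: only the grouping column and aggregates appear in SELECT
SELECT deptno, count(*) AS cnt, avg(sal) AS avg_sal
FROM emp
GROUP BY deptno;
```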
demo1: compute the average salary of each department
hive (dyhtest)> select deptno,avg(sal) as avg_sal from emp group by deptno;
Query ID = atdyh_20220629232415_cfae7dd6-64a7-43b5-ba95-e2d90b0172c8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1656514177847_0001, Tracking URL = http://hadoop103:8088/proxy/application_1656514177847_0001/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job -kill job_1656514177847_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-06-29 23:24:27,644 Stage-1 map = 0%, reduce = 0%
2022-06-29 23:24:34,959 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.75 sec
2022-06-29 23:24:40,112 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.55 sec
MapReduce Total cumulative CPU time: 7 seconds 550 msec
Ended Job = job_1656514177847_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 7.55 sec HDFS Read: 16481 HDFS Write: 196 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 550 msec
OK
deptno avg_sal
NULL NULL
10 2916.6666666666665
20 2518.75
30 1566.6666666666667
Time taken: 25.24 seconds, Fetched: 4 row(s)
demo2: compute the highest salary for each job within each department
hive (dyhtest)> select deptno ,job ,max(sal) max_sal from emp group by deptno,job ;
Query ID = atdyh_20220629235752_e170bdf1-e1fa-4b62-9f34-77ea8e2d7a34
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1656514177847_0002, Tracking URL = http://hadoop103:8088/proxy/application_1656514177847_0002/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job -kill job_1656514177847_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-06-29 23:58:09,002 Stage-1 map = 0%, reduce = 0%
2022-06-29 23:58:17,303 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 10.63 sec
2022-06-29 23:58:24,624 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 14.31 sec
MapReduce Total cumulative CPU time: 14 seconds 310 msec
Ended Job = job_1656514177847_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 14.31 sec HDFS Read: 15064 HDFS Write: 376 SUCCESS
Total MapReduce CPU Time Spent: 14 seconds 310 msec
OK
deptno job max_sal
NULL 7902 NULL
20 ANALYST 3000.0
10 CLERK 1300.0
20 CLERK 1100.0
30 CLERK 950.0
10 MANAGER 2450.0
20 MANAGER 2975.0
30 MANAGER 2850.0
10 PRESIDENT 5000.0
30 SALESMAN 1600.0
Time taken: 32.889 seconds, Fetched: 10 row(s)
demo3: find, in emp, the highest-paid employee of each department
Approach:
1. We need the highest-paid person per department, which involves the department (deptno), the salary (sal), and the person (ename), all available in the emp table.
2. Once the result columns are fixed, the logic follows: first group emp by deptno and select deptno, max(sal).
3. Join that result set back to emp on department and salary; since nothing else is required from the result set, a plain join is enough.
hive (dyhtest)> select e.deptno,e.sal,e.ename
> from emp e
> join (select deptno,max(sal) as max_sal from emp group by deptno) a
> on e.deptno = a.deptno and e.sal = a.max_sal;
Query ID = atdyh_20220702091329_a7d97fbd-918e-4695-a1fb-adca420cfc8a
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1656722309456_0001, Tracking URL = http://hadoop103:8088/proxy/application_1656722309456_0001/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job -kill job_1656722309456_0001
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2022-07-02 09:13:44,837 Stage-2 map = 0%, reduce = 0%
2022-07-02 09:13:50,167 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 3.61 sec
2022-07-02 09:14:01,516 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 7.44 sec
MapReduce Total cumulative CPU time: 7 seconds 440 msec
Ended Job = job_1656722309456_0001
2022-07-02 09:14:11 Dump the side-table for tag: 0 with group count: 11 into file: file:/tmp/atdyh/3844baf9-c259-4e3b-ad34-8232f650620b/hive_2022-07-02_09-13-29_967_3191515226531721885-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile00--.hashtable
2022-07-02 09:14:11 Uploaded 1 File to: file:/tmp/atdyh/3844baf9-c259-4e3b-ad34-8232f650620b/hive_2022-07-02_09-13-29_967_3191515226531721885-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile00--.hashtable (648 bytes)
Execution completed successfully
MapredLocal task succeeded
Launching Job 2 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1656722309456_0002, Tracking URL = http://hadoop103:8088/proxy/application_1656722309456_0002/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job -kill job_1656722309456_0002
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2022-07-02 09:14:19,436 Stage-3 map = 0%, reduce = 0%
2022-07-02 09:14:25,688 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 3.18 sec
MapReduce Total cumulative CPU time: 3 seconds 180 msec
Ended Job = job_1656722309456_0002
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 7.44 sec HDFS Read: 13822 HDFS Write: 174 SUCCESS
Stage-Stage-3: Map: 1 Cumulative CPU: 3.18 sec HDFS Read: 7737 HDFS Write: 197 SUCCESS
Total MapReduce CPU Time Spent: 10 seconds 620 msec
OK
e.deptno e.sal e.ename
10 5000.0 KING
20 3000.0 SCOTT
20 3000.0 FORD
30 2850.0 BLAKE
Time taken: 56.836 seconds, Fetched: 4 row(s)
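An alternative sketch (not run in this session) replaces the self-join with a window function; rank() naturally lets ties such as SCOTT and FORD both surface:

```sql
-- Hedged alternative: rank salaries within each department, keep rank 1.
SELECT deptno, sal, ename
FROM (
  SELECT deptno, sal, ename,
         rank() OVER (PARTITION BY deptno ORDER BY sal DESC) AS rk
  FROM emp
) t
WHERE rk = 1;
```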
d. The HAVING clause
For employees in emp other than those with the CLERK job, find the departments whose average salary exceeds 2000, along with that average.
hive (dyhtest)> select
> deptno , avg(sal) avg_sal
> from
> emp
> where job != 'CLERK'
> group by deptno
> having avg_sal >2000 ;
Query ID = atdyh_20220704233101_015b6acb-b8f1-43a6-86dd-19529e5cbace
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1656948379750_0001, Tracking URL = http://hadoop103:8088/proxy/application_1656948379750_0001/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job -kill job_1656948379750_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-07-04 23:31:16,210 Stage-1 map = 0%, reduce = 0%
2022-07-04 23:31:22,556 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.81 sec
2022-07-04 23:31:28,731 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 10.62 sec
MapReduce Total cumulative CPU time: 10 seconds 620 msec
Ended Job = job_1656948379750_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 10.62 sec HDFS Read: 16274 HDFS Write: 143 SUCCESS
Total MapReduce CPU Time Spent: 10 seconds 620 msec
OK
deptno avg_sal
10 3725.0
20 2991.6666666666665
Time taken: 28.323 seconds, Fetched: 2 row(s)
Note:
a. Aggregate functions cannot appear after WHERE, but they can appear after HAVING.
b. HAVING can only be used in GROUP BY statements.
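The contrast can be sketched as follows (not run in this session): WHERE filters individual rows before grouping, while HAVING filters the groups after aggregation:

```sql
-- Invalid: an aggregate cannot appear in WHERE
-- SELECT deptno, avg(sal) FROM emp WHERE avg(sal) > 2000 GROUP BY deptno;

-- Valid: WHERE filters rows, HAVING filters the grouped result
SELECT deptno, avg(sal) AS avg_sal
FROM emp
WHERE sal > 0            -- row-level filter, applied before grouping
GROUP BY deptno
HAVING avg_sal > 2000;   -- group-level filter, applied after aggregation
```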