Hive函数—窗口函数概念
- 窗口函数又名开窗函数,属于分析函数的一种。用于解决复杂报表统计需求的功能强大的函数,很多场景都需要用到。
- 窗口函数用于计算基于组的某种聚合值,窗口函数对于每个组返回多行,而聚合函数对于每个组只返回一行。
- 窗口函数指定了分析函数工作的数据窗口大小,这个数据窗口大小可能会随着行的变化而变化。
Hive窗口函数—over关键字
-
使用窗口函数之前一般要要通过over()进行开窗
-
查询emp表中的工资总和:select sum(sal) from emp;,但是如果在查询工资总和的基础上再加入别的字段,会报错:
hive (default)> select ename, sal, sum(sal) salsum from emp; FAILED: SemanticException [Error 10025]: Line 1:7 Expression not in GROUP BY key 'ename'
-
为了解决上述问题,可以使用窗口函数,并且查询员工姓名、薪水、薪水总和
hive (default)> select ename, sal, sum(sal) over() salsum, > concat(round(sal / sum(sal) over()*100, 1) || '%') ratiosal > from emp; Query ID = root_20221215155652_86f66456-32c4-4c1f-bd79-2b113c454ca2 Total jobs = 1 Launching Job 1 out of 1 Status: Running (Executing on YARN cluster with App id application_1671090331623_0003) ---------------------------------------------------------------------------------------------- VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED ---------------------------------------------------------------------------------------------- Map 1 .......... container SUCCEEDED 1 1 0 0 0 0 Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0 ---------------------------------------------------------------------------------------------- VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 39.40 s ---------------------------------------------------------------------------------------------- OK ename sal salsum ratiosal SMITH 800 29025 2.8% ALLEN 1600 29025 5.5% WARD 1250 29025 4.3% JONES 2975 29025 10.2% MARTIN 1250 29025 4.3% BLAKE 2850 29025 9.8% CLARK 2450 29025 8.4% SCOTT 3000 29025 10.3% KING 5000 29025 17.2% TURNER 1500 29025 5.2% ADAMS 1100 29025 3.8% JAMES 950 29025 3.3% FORD 3000 29025 10.3% MILLER 1300 29025 4.5% NULL NULL 29025 NULL Time taken: 46.755 seconds, Fetched: 15 row(s)
-
窗口函数是针对每一行数据的;如果over中没有参数,默认的是全部结果集;
Hive函数—order by子句
-
order by 子句对输入的数据进行排序:增加了order by子句;sum:从分组的第一行到当前行求和
hive (default)> select ename, sal,deptno, sum(sal) over(partition by deptno order by sal) salsum, > concat(round(sal / sum(sal) over() * 100,1) || '%') > from emp; Query ID = root_20221215160217_3d50326a-f52f-468f-b949-cee89841f880 Total jobs = 1 Launching Job 1 out of 1 Status: Running (Executing on YARN cluster with App id application_1671090331623_0003) ---------------------------------------------------------------------------------------------- VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED ---------------------------------------------------------------------------------------------- Map 1 .......... container SUCCEEDED 1 1 0 0 0 0 Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0 Reducer 3 ...... container SUCCEEDED 1 1 0 0 0 0 ---------------------------------------------------------------------------------------------- VERTICES: 03/03 [==========================>>] 100% ELAPSED TIME: 6.79 s ---------------------------------------------------------------------------------------------- OK ename sal deptno salsum _c4 NULL NULL NULL NULL NULL MILLER 1300 10 1300 4.5% CLARK 2450 10 3750 8.4% KING 5000 10 8750 17.2% SMITH 800 20 800 2.8% ADAMS 1100 20 1900 3.8% JONES 2975 20 4875 10.2% FORD 3000 20 10875 10.3% SCOTT 3000 20 10875 10.3% JAMES 950 30 950 3.3% MARTIN 1250 30 3450 4.3% WARD 1250 30 3450 4.3% TURNER 1500 30 4950 5.2% ALLEN 1600 30 6550 5.5% BLAKE 2850 30 9400 9.8% Time taken: 7.563 seconds, Fetched: 15 row(s)
Hive函数—Window子句
-
语法格式:
rows between ... and ...
-
如果要对窗口的结果做更细粒度的划分,使用window子句,有如下的几个选项:
1. unbounded preceding。组内第一行数据 2. n preceding。组内当前行的前n行数据 3. current row。当前行数据 4. n following。组内当前行的后n行数据 5. unbounded following。组内最后一行数据
- 传统方式:按照deptno为分区字段,并以sal为排序字段,查询员工姓名,部门薪水总和:
hive (default)> select ename,sal,deptno,sum(sal) over(partition by deptno order by sal) from emp;
Query ID = root_20221215162616_ca9f4d1f-ece3-42d1-95fc-efcee6757be9
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1671090331623_0004)
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 15.61 s
----------------------------------------------------------------------------------------------
OK
ename sal deptno sum_window_0
NULL NULL NULL NULL
MILLER 1300 10 1300
CLARK 2450 10 3750
KING 5000 10 8750
SMITH 800 20 800
ADAMS 1100 20 1900
JONES 2975 20 4875
FORD 3000 20 10875
SCOTT 3000 20 10875
JAMES 950 30 950
MARTIN 1250 30 3450
WARD 1250 30 3450
TURNER 1500 30 4950
ALLEN 1600 30 6550
BLAKE 2850 30 9400
Time taken: 16.3 seconds, Fetched: 15 row(s)
-
使用 Window子句 子句方式实现按照deptno为分区字段,并以ename 为排序字段,查询员工姓名,部门薪水总和:(注意:sum_window_0,sum_window_0是第一行到当前行的 sal的和)
hive (default)> select ename, sal, deptno,sum(sal) > over(partition by deptno order by ename rows between unbounded preceding and current row) > from emp; Query ID = root_20221215163355_b70b527f-9f1f-470c-9d1a-84debaf8bef2 Total jobs = 1 Launching Job 1 out of 1 Tez session was closed. Reopening... Session re-established. Status: Running (Executing on YARN cluster with App id application_1671090331623_0005) ---------------------------------------------------------------------------------------------- VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED ---------------------------------------------------------------------------------------------- Map 1 .......... container SUCCEEDED 1 1 0 0 0 0 Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0 ---------------------------------------------------------------------------------------------- VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 5.58 s ---------------------------------------------------------------------------------------------- OK ename sal deptno sum_window_0 NULL NULL NULL NULL CLARK 2450 10 2450 KING 5000 10 7450 MILLER 1300 10 8750 ADAMS 1100 20 1100 FORD 3000 20 4100 JONES 2975 20 7075 SCOTT 3000 20 10075 SMITH 800 20 10875 ALLEN 1600 30 1600 BLAKE 2850 30 4450 JAMES 950 30 5400 MARTIN 1250 30 6650 TURNER 1500 30 8150 WARD 1250 30 9400 Time taken: 13.829 seconds, Fetched: 15 row(s)
-
使用window求组内的,第一行到最后一行的和 ,注意看sum_window_0字段
hive (default)> select ename, sal, deptno, > sum(sal) over(partition by deptno order by ename > rows between unbounded preceding and > unbounded following > ) from emp; Query ID = root_20221215164027_27c5f3bd-e0b9-4755-8e80-f3266d8e2e0a Total jobs = 1 Launching Job 1 out of 1 Tez session was closed. Reopening... Session re-established. Status: Running (Executing on YARN cluster with App id application_1671090331623_0006) ---------------------------------------------------------------------------------------------- VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED ---------------------------------------------------------------------------------------------- Map 1 .......... container SUCCEEDED 1 1 0 0 0 0 Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0 ---------------------------------------------------------------------------------------------- VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 5.15 s ---------------------------------------------------------------------------------------------- OK ename sal deptno sum_window_0 NULL NULL NULL NULL CLARK 2450 10 8750 KING 5000 10 8750 MILLER 1300 10 8750 ADAMS 1100 20 10875 FORD 3000 20 10875 JONES 2975 20 10875 SCOTT 3000 20 10875 SMITH 800 20 10875 ALLEN 1600 30 9400 BLAKE 2850 30 9400 JAMES 950 30 9400 MARTIN 1250 30 9400 TURNER 1500 30 9400 WARD 1250 30 9400 Time taken: 10.499 seconds, Fetched: 15 row(s)
-
求组内:前一行,当前行,后一行的和 注意看:sum_window_0字段
hive (default)> select ename, sal, deptno, > sum(sal) over(partition by deptno order by ename > rows between 1 preceding and 1 following) > from emp; Query ID = root_20221215164818_0c08d9de-634f-4e75-a65d-441c29e429c2 Total jobs = 1 Launching Job 1 out of 1 Tez session was closed. Reopening... Session re-established. Status: Running (Executing on YARN cluster with App id application_1671090331623_0007) ---------------------------------------------------------------------------------------------- VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED ---------------------------------------------------------------------------------------------- Map 1 .......... container SUCCEEDED 1 1 0 0 0 0 Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0 ---------------------------------------------------------------------------------------------- VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 6.72 s ---------------------------------------------------------------------------------------------- OK ename sal deptno sum_window_0 NULL NULL NULL NULL CLARK 2450 10 7450 KING 5000 10 8750 MILLER 1300 10 6300 ADAMS 1100 20 4100 FORD 3000 20 7075 JONES 2975 20 8975 SCOTT 3000 20 6775 SMITH 800 20 3800 ALLEN 1600 30 4450 BLAKE 2850 30 5400 JAMES 950 30 5050 MARTIN 1250 30 3700 TURNER 1500 30 4000 WARD 1250 30 2750 Time taken: 13.409 seconds, Fetched: 15 row(s)