Hive函数—(窗口函数概念,over关键字,order by子句,Window子句)

Hive函数—窗口函数概念

  1. 窗口函数又名开窗函数,属于分析函数的一种。用于解决复杂报表统计需求的功能强大的函数,很多场景都需要用到。
  2. 窗口函数用于计算基于组的某种聚合值,窗口函数对于每个组返回多行,而聚合函数对于每个组只返回一行。
  3. 窗口函数指定了分析函数工作的数据窗口大小,这个数据窗口大小可能会随着行的变化而变化。

Hive窗口函数—over关键字

  1. 使用窗口函数之前一般要要通过over()进行开窗

  2. 查询emp表中的工资总和:select sum(sal) from emp;,但是如果在查询工资总和的基础上再加入别的字段,会报错:

    hive (default)> select ename, sal, sum(sal) salsum from emp;
    FAILED: SemanticException [Error 10025]: Line 1:7 Expression not in GROUP BY key 'ename'
    
  3. 为了解决上述问题,可以使用窗口函数,并且查询员工姓名、薪水、薪水总和

    hive (default)> select ename, sal, sum(sal) over() salsum,
                  > concat(round(sal / sum(sal) over()*100, 1) || '%') ratiosal
                  > from emp;
    Query ID = root_20221215155652_86f66456-32c4-4c1f-bd79-2b113c454ca2
    Total jobs = 1
    Launching Job 1 out of 1
    Status: Running (Executing on YARN cluster with App id application_1671090331623_0003)
    
    ----------------------------------------------------------------------------------------------
            VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
    ----------------------------------------------------------------------------------------------
    Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0  
    Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0  
    ----------------------------------------------------------------------------------------------
    VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 39.40 s    
    ----------------------------------------------------------------------------------------------
    OK
    ename	sal	salsum	ratiosal
    SMITH	800	29025	2.8%
    ALLEN	1600	29025	5.5%
    WARD	1250	29025	4.3%
    JONES	2975	29025	10.2%
    MARTIN	1250	29025	4.3%
    BLAKE	2850	29025	9.8%
    CLARK	2450	29025	8.4%
    SCOTT	3000	29025	10.3%
    KING	5000	29025	17.2%
    TURNER	1500	29025	5.2%
    ADAMS	1100	29025	3.8%
    JAMES	950	29025	3.3%
    FORD	3000	29025	10.3%
    MILLER	1300	29025	4.5%
    NULL	NULL	29025	NULL
    Time taken: 46.755 seconds, Fetched: 15 row(s)
    
  4. 窗口函数是针对每一行数据的;如果over中没有参数,默认的是全部结果集;

Hive函数—order by子句

  1. order by 子句对输入的数据进行排序:增加了order by子句;sum:从分组的第一行到当前行求和

    hive (default)> select ename, sal,deptno, sum(sal) over(partition by deptno order by sal) salsum,
                  > concat(round(sal / sum(sal) over() * 100,1) || '%')
                  > from emp;
    Query ID = root_20221215160217_3d50326a-f52f-468f-b949-cee89841f880
    Total jobs = 1
    Launching Job 1 out of 1
    Status: Running (Executing on YARN cluster with App id application_1671090331623_0003)
    
    ----------------------------------------------------------------------------------------------
            VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
    ----------------------------------------------------------------------------------------------
    Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0  
    Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0  
    Reducer 3 ...... container     SUCCEEDED      1          1        0        0       0       0  
    ----------------------------------------------------------------------------------------------
    VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 6.79 s     
    ----------------------------------------------------------------------------------------------
    OK
    ename	sal	deptno	salsum	_c4
    NULL	NULL	NULL	NULL	NULL
    MILLER	1300	10	1300	4.5%
    CLARK	2450	10	3750	8.4%
    KING	5000	10	8750	17.2%
    SMITH	800	20	800	2.8%
    ADAMS	1100	20	1900	3.8%
    JONES	2975	20	4875	10.2%
    FORD	3000	20	10875	10.3%
    SCOTT	3000	20	10875	10.3%
    JAMES	950	30	950	3.3%
    MARTIN	1250	30	3450	4.3%
    WARD	1250	30	3450	4.3%
    TURNER	1500	30	4950	5.2%
    ALLEN	1600	30	6550	5.5%
    BLAKE	2850	30	9400	9.8%
    Time taken: 7.563 seconds, Fetched: 15 row(s)
    

Hive函数—Window子句

  1. 语法格式:

    rows between ... and ...
    
  2. 如果要对窗口的结果做更细粒度的划分,使用window子句,有如下的几个选项:

    1. unbounded preceding。组内第一行数据
    2. n preceding。组内当前行的前n行数据
    3. current row。当前行数据
    4. n following。组内当前行的后n行数据
    5. unbounded following。组内最后一行数据
    

在这里插入图片描述

  1. 传统方式:按照deptno为分区字段,并以sal为排序字段,查询员工姓名,部门薪水总和:
hive (default)> select ename,sal,deptno,sum(sal) over(partition by deptno order by sal) from emp;
Query ID = root_20221215162616_ca9f4d1f-ece3-42d1-95fc-efcee6757be9
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1671090331623_0004)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0  
Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 15.61 s    
----------------------------------------------------------------------------------------------
OK
ename	sal	deptno	sum_window_0
NULL	NULL	NULL	NULL
MILLER	1300	10	1300
CLARK	2450	10	3750
KING	5000	10	8750
SMITH	800	20	800
ADAMS	1100	20	1900
JONES	2975	20	4875
FORD	3000	20	10875
SCOTT	3000	20	10875
JAMES	950	30	950
MARTIN	1250	30	3450
WARD	1250	30	3450
TURNER	1500	30	4950
ALLEN	1600	30	6550
BLAKE	2850	30	9400
Time taken: 16.3 seconds, Fetched: 15 row(s)
  1. 使用 Window子句 子句方式实现按照deptno为分区字段,并以ename 为排序字段,查询员工姓名,部门薪水总和:(注意:sum_window_0,sum_window_0是第一行到当前行的 sal的和)

    hive (default)> select ename, sal, deptno,sum(sal)
                  > over(partition by deptno order by ename rows between unbounded preceding and current row)
                  > from emp;
    Query ID = root_20221215163355_b70b527f-9f1f-470c-9d1a-84debaf8bef2
    Total jobs = 1
    Launching Job 1 out of 1
    Tez session was closed. Reopening...
    Session re-established.
    Status: Running (Executing on YARN cluster with App id application_1671090331623_0005)
    
    ----------------------------------------------------------------------------------------------
            VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
    ----------------------------------------------------------------------------------------------
    Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0  
    Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0  
    ----------------------------------------------------------------------------------------------
    VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 5.58 s     
    ----------------------------------------------------------------------------------------------
    OK
    ename	sal	deptno	sum_window_0
    NULL	NULL	NULL	NULL
    CLARK	2450	10	2450
    KING	5000	10	7450
    MILLER	1300	10	8750
    ADAMS	1100	20	1100
    FORD	3000	20	4100
    JONES	2975	20	7075
    SCOTT	3000	20	10075
    SMITH	800	20	10875
    ALLEN	1600	30	1600
    BLAKE	2850	30	4450
    JAMES	950	30	5400
    MARTIN	1250	30	6650
    TURNER	1500	30	8150
    WARD	1250	30	9400
    Time taken: 13.829 seconds, Fetched: 15 row(s)
    
  2. 使用window求组内的,第一行到最后一行的和 ,注意看sum_window_0字段

    hive (default)> select ename, sal, deptno,
                  > sum(sal) over(partition by deptno order by ename
                  > rows between unbounded preceding and
                  > unbounded following
                  > ) from emp;
    Query ID = root_20221215164027_27c5f3bd-e0b9-4755-8e80-f3266d8e2e0a
    Total jobs = 1
    Launching Job 1 out of 1
    Tez session was closed. Reopening...
    Session re-established.
    Status: Running (Executing on YARN cluster with App id application_1671090331623_0006)
    
    ----------------------------------------------------------------------------------------------
            VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
    ----------------------------------------------------------------------------------------------
    Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0  
    Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0  
    ----------------------------------------------------------------------------------------------
    VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 5.15 s     
    ----------------------------------------------------------------------------------------------
    OK
    ename	sal	deptno	sum_window_0
    NULL	NULL	NULL	NULL
    CLARK	2450	10	8750
    KING	5000	10	8750
    MILLER	1300	10	8750
    ADAMS	1100	20	10875
    FORD	3000	20	10875
    JONES	2975	20	10875
    SCOTT	3000	20	10875
    SMITH	800	20	10875
    ALLEN	1600	30	9400
    BLAKE	2850	30	9400
    JAMES	950	30	9400
    MARTIN	1250	30	9400
    TURNER	1500	30	9400
    WARD	1250	30	9400
    Time taken: 10.499 seconds, Fetched: 15 row(s)
    
  3. 求组内:前一行,当前行,后一行的和 注意看:sum_window_0字段

    hive (default)> select ename, sal, deptno,
                  > sum(sal) over(partition by deptno order by ename
                  > rows between 1 preceding and 1 following)
                  > from emp;
    Query ID = root_20221215164818_0c08d9de-634f-4e75-a65d-441c29e429c2
    Total jobs = 1
    Launching Job 1 out of 1
    Tez session was closed. Reopening...
    Session re-established.
    Status: Running (Executing on YARN cluster with App id application_1671090331623_0007)
    
    ----------------------------------------------------------------------------------------------
            VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
    ----------------------------------------------------------------------------------------------
    Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0  
    Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0  
    ----------------------------------------------------------------------------------------------
    VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 6.72 s     
    ----------------------------------------------------------------------------------------------
    OK
    ename	sal	deptno	sum_window_0
    NULL	NULL	NULL	NULL
    CLARK	2450	10	7450
    KING	5000	10	8750
    MILLER	1300	10	6300
    ADAMS	1100	20	4100
    FORD	3000	20	7075
    JONES	2975	20	8975
    SCOTT	3000	20	6775
    SMITH	800	20	3800
    ALLEN	1600	30	4450
    BLAKE	2850	30	5400
    JAMES	950	30	5050
    MARTIN	1250	30	3700
    TURNER	1500	30	4000
    WARD	1250	30	2750
    Time taken: 13.409 seconds, Fetched: 15 row(s)
    
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值