[Hive]Hive的分析函数

目录

1.Order by 注意点

2.sum函数的作用范围

3.ROWS BETWEEN

4.常用的分析函数

4.1 RANk、DENSE_RANK

4.2 ROW_NUMBER

4.3 SUM

4.4 FIRST_VALUE、LAST_VALUE

4.4 LAG、LEAD

5.GROUPING SETS、GROUPING__ID

6. CUBE

7.ROLLUP


1.Order by 注意点

当ORDER BY后面缺少窗口从句条件,窗口规范默认是 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

当ORDER BY和窗口从句都缺失, 窗口规范默认是 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING。

查询的结果和ORDER BY相关,默认为升序

2.sum函数的作用范围

如果不指定ROWS BETWEEN,默认为从起点到当前行;

如果不指定ORDER BY,则将分组内所有值累加;

3.ROWS BETWEEN

关键是理解ROWS BETWEEN含义,也叫做WINDOW子句:

PRECEDING:往前

FOLLOWING:往后

CURRENT ROW:当前行

UNBOUNDED:无界限(起点或终点)

UNBOUNDED PRECEDING:表示从前面的起点

UNBOUNDED FOLLOWING:表示到后面的终点

4.常用的分析函数

4.1 RANk、DENSE_RANK

RANK()在出现等级相同的元素时预留为空,DENSE_RANK()不会。

某产品类型有两个并列第一

RANK():第一二为1,第三位3

DENSE_RANK():第一二为1,第三位2

4.2 ROW_NUMBER

ROW_NUMBER()从1开始,为每条记录返回一个数字

select a,b,row_number()over(partition by a desc a) as rn from t

 

4.3 SUM

例如,按照部门ID大小排升序,依次统计所有工资部门的累计工资支出。

SELECT deptid,SUM(salary) salary_amount,

SUM(SUM(salary)) OVER( ORDER BY deptid ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_amount

FROM employee

GROUP BY deptid

ORDER BY deptid;

例如,按照部门ID大小排升序,依次统计当前部门和之前1个部门的工资部门的累计工资支出。

SELECT deptid,SUM(salary) salary_amount,SUM(SUM(salary)) OVER( ORDER BY deptid ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS cumulative_amount FROM employee GROUP BY deptid ORDER BY deptid;

 

计算前3个月之间的和

 SUM( SUM(amount)) OVER (ORDER BY month ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS cumulative_amount 

前后一个月之间的和

SUM( SUM(amount)) OVER (ORDER BY month ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS cumulative_amount

 

4.4 FIRST_VALUE、LAST_VALUE

FIRST_VALUE:取分组内排序后,窗口中第一个值

LAST_VALUE: 取分组内排序后,窗口中最后一个值

例如:

  前一行和后一行之间的第一个值

FIRST_VALUE(SUM(amount)) OVER (ORDER BY month ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS xxxx;

前一行和后一行之间的最后一个值

LAST_VALUE(SUM(amount)) OVER (ORDER BY month ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS xxxx;   

4.4 LAG、LEAD

LEAD(col,n,DEFAULT) :用于统计窗口内往下第n行值。第一个参数为列名,第二个参数为往下第n行(可选,默认为1),

第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)

LAG(col,n,DEFAULT) :与lead相反,用于统计窗口内往上第n行值。第一个参数为列名,

第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)

  LAG(column_name1,1) OVER(ORDER BY column_name2)

  LEAD(column_name1,1) OVER(ORDER BY column_name2) 

这样就获得前一条、后一条的数据

select *,LAG(empid,1) over (order by empid asc) from employee;

5.GROUPING SETS、GROUPING__ID

在一个GROUP BY查询中,根据不同的维度组合进行聚合,等价于将不同维度的GROUP BY结果集进行UNION ALL,其中的GROUPING__ID,表示结果属于哪一个分组集合。

select

    user_type,

    sales,

    count(user_id) as pv,

    GROUPING__ID

from

    order_detail

group by

    user_type,sales

GROUPING SETS(user_type,sales)

ORDER BY

    GROUPING__ID;

 

+------------+--------+-----+---------------+--+

| user_type  | sales  | pv  | grouping__id  |

+------------+--------+-----+---------------+--+

| old        | NULL   | 3   | 1             |

| new        | NULL   | 7   | 1             |

| NULL       | 6      | 1   | 2             |

| NULL       | 5      | 2   | 2             |

| NULL       | 3      | 2   | 2             |

| NULL       | 2      | 2   | 2             |

| NULL       | 1      | 3   | 2             |

+------------+--------+-----+---------------+--+

6. CUBE

根据GROUP BY的维度的所有组合进行聚合。

SELECT

month,

day,

COUNT(DISTINCT cookieid) AS uv,

GROUPING__ID

FROM lxw1234

GROUP BY month,day

WITH CUBE

ORDER BY GROUPING__ID;

 

等价于

SELECT NULL,NULL,COUNT(DISTINCT cookieid) AS uv,0 AS GROUPING__ID FROM lxw1234

UNION ALL

SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM lxw1234 GROUP BY month

UNION ALL

SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM lxw1234 GROUP BY day

UNION ALL

SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM lxw1234 GROUP BY month,day

 

 结果:

month                day             uv     GROUPING__ID

--------------------------------------------

NULL            NULL            7       0

2020-03         NULL            5       1

2020-04         NULL            6       1

NULL            2020-04-12      2       2

NULL            2020-04-13      3       2

NULL            2020-04-15      2       2

NULL            2020-04-16      2       2

NULL            2020-03-10      4       2

NULL            2020-03-12      1       2

2020-03         2020-03-10      4       3

2020-03         2020-03-12      1       3

2020-04         2020-04-16      2       3

2020-04         2020-04-12      2       3

2020-04         2020-04-13      3       3

2020-04         2020-04-15      2       3

7.ROLLUP

是CUBE的子集,以最左侧的维度为主,从该维度进行层级聚合。

可以通过ROLLU实现这上钻过程,如:月天的UV->月的UV->总UV。

比如,以month维度进行层级聚合:

SELECT

month,

day,

COUNT(DISTINCT cookieid) AS uv,

GROUPING__ID 

FROM lxw1234

GROUP BY month,day

WITH ROLLUP

ORDER BY GROUPING__ID;

 

等价于

SELECT NULL,NULL,COUNT(DISTINCT cookieid) AS uv,0 AS GROUPING__ID FROM lxw1234

UNION ALL

SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM lxw1234 GROUP BY month

UNION ALL

SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM lxw1234 GROUP BY month,day;

结果:

month                day             uv     GROUPING__ID

---------------------------------------------------

NULL             NULL            7       0

2020-03          NULL            5       1

2020-04          NULL            6       1

2020-03          2020-03-10      4       3

2020-03          2020-03-12      1       3

2020-04          2020-04-12      2       3

2020-04          2020-04-13      3       3

2020-04          2020-04-15      2       3

2020-04          2020-04-16      2       3

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值