Hive学习笔记:04 SQL的窗口函数、OVER的使用


本文将介绍Hive SQL中窗口函数、分析函数以及Over的使用,其实不仅仅是Hive,对于很多数数据库来说同样也适用,比如Mysql8,Oracle,MSSQL等传统的关系型数据库。

测试环境准备

如有一张表stock_hq,表格中的数据如下所示:

TDATESECCODEBLOCKAMOUNTPRICE
20221113000001.szA0121010.10
20221113000002.szA022109.10
20221113000003.szA012108.10
20221114000001.szA0121010.10
20221114000002.szA022109.10
20221114000003.szA012108.10
20221115000001.szA0121010.10
20221115000002.szA022109.10
20221115000003.szA012108.10
20221116000001.szA0121010.10
20221116000002.szA022109.10
20221116000003.szA012108.10
20221117000001.szA0111010.10
20221117000002.szA021109.10
20221117000003.szA011109.10
20221118000001.szA0110010.10
20221118000002.szA0210010.10
20221118000003.szA0110010.10

窗口和Over

窗口其实就是一个数据范围,它指定了我们统计计算分些数据范围。在Spark和Flink中我们知道,窗口有全局窗口和滚动窗口之分,同样在SQL的窗口中也有类似的概念。在SQL中,窗口是通过Over来实现的。

OVER语法

OVER( [PARTITION BY xx] [ORDER BY XX] [Window specifications ])

在Over()中可以由上述3者进行不同的组合,或者3者都可以不指定。

PARTITION BY: 开窗字段

PARTITON BY 选项是可选的,它可以指定一个或者多个字段进行开窗,如果不指定开窗字段,则只有“一个窗口”。

  • 示例1:不指定开窗字段,不指定排序,不定义窗口大小,则默认表中所有的数据都在一个窗口中
SELECT SECCODE,TDATE,AMOUNT,SUM(AMOUNT) over() AS SUM_AMOUNT FROM stock_hq;

输出:

TDATESECCODEAMOUNTSUM_AMOUNT
20221113000001.sz2103150
20221113000002.sz2103150
20221113000003.sz2103150
20221114000001.sz2103150
20221114000002.sz2103150
20221114000003.sz2103150
20221115000001.sz2103150
20221115000002.sz2103150
20221115000003.sz2103150
20221116000001.sz2103150
20221116000002.sz2103150
20221116000003.sz2103150
20221117000001.sz1103150
20221117000002.sz1103150
20221117000003.sz1103150
20221118000001.sz1003150
20221118000002.sz1003150
20221118000003.sz1003150

很明显,上述数据中Sum(Amount)统计的是所有记录的和,也就是说所有数据在同一窗口中。

  • 示例2:指定一个开窗字段
SELECT SECCODE,TDATE,AMOUNT,SUM(AMOUNT) over(PARTITION BY SECCODE) AS SUM_AMOUNT FROM stock_hq;

上述语句中指定了一个开窗字段,没有指定排序和窗口大小,则分成多个窗口,每个窗口中的数据是每个SECCODE的所有数据。

SECCODETDATEAMOUNTSUM_AMOUNT
000001.sz202211132101050
000001.sz202211142101050
000001.sz202211181001050
000001.sz202211152101050
000001.sz202211171101050
000001.sz202211162101050
000002.sz202211132101050
000002.sz202211181001050
000002.sz202211171101050
000002.sz202211162101050
000002.sz202211152101050
000002.sz202211142101050
000003.sz202211152101050
000003.sz202211162101050
000003.sz202211142101050
000003.sz202211171101050
000003.sz202211132101050
000003.sz202211181001050
  • 示例3:指定多个开窗字段
select BLOCK,SECCODE,TDATE,AMOUNT,SUM(AMOUNT) over(PARTITION BY BLOCK,SECCODE) AS SUM_AMOUNT FROM stock_hq;

输出:

BLOCKSECCODETDATEAMOUNTSUM_AMOUNT
A01000001.sz202211132101050
A01000001.sz202211181001050
A01000001.sz202211142101050
A01000001.sz202211171101050
A01000001.sz202211152101050
A01000001.sz202211162101050
A01000003.sz202211181001050
A01000003.sz202211171101050
A01000003.sz202211162101050
A01000003.sz202211152101050
A01000003.sz202211142101050
A01000003.sz202211132101050
A02000002.sz202211162101050
A02000002.sz202211152101050
A02000002.sz202211171101050
A02000002.sz202211142101050
A02000002.sz202211181001050
A02000002.sz202211132101050

开窗字段值相同的数据分在同一个窗口中(类似于group by).

ORDER BY: 排序

指定了Order By后,那么窗口就不再是“静态”的了,而成了一个动态滚动的窗口,废话不多说,上菜:

  • 示例1:不指定分区,只指定Order by
select TDATE,SECCODE, AMOUNT,SUM(AMOUNT) over(order by SECCODE) AS SUM_AMOUNT FROM stock_hq;

输出:

TDATESECCODEAMOUNTSUM_AMOUNT
20221113000001.sz2101050
20221114000001.sz2101050
20221118000001.sz1001050
20221115000001.sz2101050
20221117000001.sz1101050
20221116000001.sz2101050
20221113000002.sz2102100
20221118000002.sz1002100
20221117000002.sz1102100
20221116000002.sz2102100
20221115000002.sz2102100
20221114000002.sz2102100
20221115000003.sz2103150
20221116000003.sz2103150
20221114000003.sz2103150
20221117000003.sz1103150
20221113000003.sz2103150
20221118000003.sz1003150

如果只单独指定了Order By,Order By字段相同的数据会先分成一组做一个统计,然后再到下一个组如000002.sz的数据时会将000002.sz的所有的数据先做个统计,再累加上一个分组000001.sz的统计结果。

  • 示例2:指定分区,指定排序
select TDATE,SECCODE, AMOUNT,SUM(AMOUNT) over(partition by SECCODE order by TDATE) AS SUM_AMOUNT FROM stock_hq;
TDATESECCODEAMOUNTSUM_AMOUNT
20221113000001.sz210210
20221114000001.sz210420
20221115000001.sz210630
20221116000001.sz210840
20221117000001.sz110950
20221118000001.sz1001050
20221113000002.sz210210
20221114000002.sz210420
20221115000002.sz210630
20221116000002.sz210840
20221117000002.sz110950
20221118000002.sz1001050
20221113000003.sz210210
20221114000003.sz210420
20221115000003.sz210630
20221116000003.sz210840
20221117000003.sz110950
20221118000003.sz1001050

在每一个窗口中,每一行的统计结果为上一行的统计结果加上当前行的值。

Window Specifications :窗口定义

窗口的定义主要用于指定窗口的大小,有如下几种语义进行指定:

(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING

语法解析

UNBOUNDED 无边界的
PRECEDING 当前行的前
FOLLOWING 当前行后跟多少行

示例组合:

  • ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING 当前行的前3行到前一行
  • ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING 当前行的前3行+当前行+当前行的后一行
  • ROWS BETWEEN UNBOUNDED PRECEDING AND 2 PRECEDING 第一行到当前行的前2行
  • ROWS BETWEEN CURRENT ROW AND UNBOUND FOLLWING 当前行到末尾行
  • ROWS BETWEEN 3 FOLLOWING AND UNBOUND FOLLOWING 从当前行往后数3行开始到末尾
  • ROWS BETWEEN 3 FOLLOWING AND 10 FOLLOWING 从当前行的后面第3行开始到后面第10行之间的数据。

ROWS和RANGE的区别

由于我们上面的日期是连续的,所以需要删除某一天的数据,让效果看起来更明显。

delete from stock_hq where tdate = 20221115;
  • ROWS是根据数据的物理顺序来指定窗口的大小,可以不指定排序字段(采用数据库的默认顺序)。

如:ROWS BETWEEN 3 PRECEDING AND CURRENT ROW.
在000001.sz窗口中,假如当前行的日期是20221117:到当前行时统计的是当前行前3行的值加当前行的值也就是740。统计的范围是固定的,与当前行的值无关。

示例:

select SECCODE,TDATE,AMOUNT,SUM(AMOUNT) over( partition BY SECCODE order by TDATE ROWS BETWEEN 3 PRECEDING AND CURRENT Row)   AS SUM_AMOUNT FROM stock_hq;

输出:

SECCODETDATEAMOUNTSUM_AMOUNT
000001.sz20221113210210
000001.sz20221114210420
000001.sz20221116210630
000001.sz20221117110740
000001.sz20221118100630
000002.sz20221113210210
000002.sz20221114210420
000002.sz20221116210630
000002.sz20221117110740
000002.sz20221118100630
000003.sz20221113210210
000003.sz20221114210420
000003.sz20221116210630
000003.sz20221117110740
000003.sz20221118100630
  • RANGE 必须指定一个排序字段,且排序字段必须是数字类型或时间类型,窗口的大小和当前行的排序字段的值有关

如:RANGE BETWEEN 3 PRECEDING AND CURRENT ROW.
假如当前行的日期是20221117,到当前行时统计的是日期大于等于20221114到当前行20221117这一范围上的值,如下示例中是530,是一个逻辑上的窗口设定,与当前值有关。

select SECCODE,TDATE,AMOUNT,SUM(AMOUNT) over( partition BY SECCODE order by TDATE RANGE BETWEEN 3 PRECEDING AND CURRENT Row) AS SUM_AMOUNT FROM stock_hq;

输出:

SECCODETDATEAMOUNTSUM_AMOUNT
000001.sz20221113210210
000001.sz20221114210420
000001.sz20221116210630
000001.sz20221117110530
000001.sz20221118100420
000002.sz20221113210210
000002.sz20221114210420
000002.sz20221116210630
000002.sz20221117110530
000002.sz20221118100420
000003.sz20221113210210
000003.sz20221114210420
000003.sz20221116210630
000003.sz20221117110530
000003.sz20221118100420

窗口函数

LEAD

  • 函数原型:LEAD(column,rows,default_value)
  • 用途:将窗口的数据整体上移指定的行,上移后空缺的值使用指定的default_value填充。

示例:每个窗口的数据整体往上移一行,空缺的值默认为NULL

select TDATE,SECCODE, AMOUNT,LEAD(AMOUNT,1) over(partition by SECCODE order by TDATE) AS LEAD_AMOUNT FROM stock_hq;

输出:

TDATESECCODEAMOUNTLEAD_AMOUNT
20221113000001.sz210210
20221114000001.sz210210
20221116000001.sz210110
20221117000001.sz110100
20221118000001.sz100
20221113000002.sz210210
20221114000002.sz210210
20221116000002.sz210110
20221117000002.sz110100
20221118000002.sz100
20221113000003.sz210210
20221114000003.sz210210
20221116000003.sz210110
20221117000003.sz110100
20221118000003.sz100

LAG

  • 函数原型:LAG(column,rows,default_value)
  • 用途:将窗口的数据整体下移指定的行,下移后空缺的值使用指定的default_value填充。
select TDATE,SECCODE, AMOUNT,LAG(AMOUNT,1) over(partition by SECCODE order by TDATE) AS LEAD_AMOUNT FROM stock_hq;

输出:

TDATESECCODEAMOUNTLAG_AMOUNT
20221113000001.sz210
20221114000001.sz210210
20221116000001.sz210210
20221117000001.sz110210
20221118000001.sz100110
20221113000002.sz210
20221114000002.sz210210
20221116000002.sz210210
20221117000002.sz110210
20221118000002.sz100110
20221113000003.sz210
20221114000003.sz210210
20221116000003.sz210210
20221117000003.sz110210
20221118000003.sz100110

FIRST_VALUE

  • 函数原型:FIRST_VALUE(column,Bool)

最多两个参数,第一个参数是列名,第二个参数是一个bool值,默认是false。如果设为true,则会跳过NULL值找第一个不为NULL的值。

Tips:有些关系型数据库中只有一个参数,请注意。

  • 用途:取窗口中的第一个值

LAST_VALUE

  • 函数原型:LAST_VALUE(column,Bool)

最多两个参数,第一个参数是列名,第二个参数是一个bool值,默认是false。如果设为true,则会跳过NULL值找第一个不为NULL的值。

  • 用途:取窗口中的最后一个值。
### 复杂的SQL窗口函数 `OVER` 用法示例及解析 #### 使用场景一:多列分区并排序后的累计求和 当需要基于多个维度进行数据分析时,可以利用 `PARTITION BY` 和 `ORDER BY` 来实现更精细的操作。例如,在销售记录表中按地区和地区内的不同产品类别分别统计销售额,并计算每个类别的累计金额。 ```sql SELECT region, product_category, sale_amount, SUM(sale_amount) OVER (PARTITION BY region, product_category ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sales FROM sales_table; ``` 此查询语句会按照区域和地区内不同的商品种类来划分数据集[^1],并对每一部分中的日期顺序排列后执行累加操作,从而得到截至当前行为止的所有历史交易总额。 #### 场景二:获取前后若干条记录的信息 通过指定 `ROWS/RANGE BETWEEN ... PRECEDING/FOLLOWING` 可以访问相邻几行的数据项。这对于时间序列预测模型构建特别有用,因为能够方便地取得过去几天甚至未来一段时间的趋势变化情况作为特征输入给机器学习算法训练使用。 假设有一个股票价格日志表格,想要知道每一天收盘价相对于前一天的变化幅度: ```sql WITH stock_prices AS ( SELECT '2023-09-01'::date as trade_date, 100.5 as close_price UNION ALL SELECT '2023-09-02', 101.7 UNION ALL ... ) SELECT trade_date, close_price, LAG(close_price, 1) OVER (ORDER BY trade_date ASC) prev_close_price, ROUND((close_price - LAG(close_price, 1) OVER (ORDER BY trade_date ASC)) / LAG(close_price, 1) * 100, 2) change_percent FROM stock_prices; ``` 这里定义了一个CTE表达式模拟了一些简单的股价变动;主查询则应用了 `LAG()` 函数去检索上一日的价格,并据此算出了百分比增减量[^2]。 #### 场景三:动态调整窗口大小 有时可能希望随着遍历过程改变所考虑的历史长度范围。这可以通过设置可变边界条件达成目的。比如下面的例子展示了如何在一个订单明细表里找出最近三个月以来每个月份最高销量的产品型号及其对应的月度总销量: ```sql SELECT DISTINCT ON (month) month, model_id, MAX(total_sold) OVER w max_monthly_sales, total_sold monthly_sales FROM orders_detail WINDOW w AS (DISTRIBUTION BY model_id PARTITION BY EXTRACT(MONTH FROM order_time), EXTRACT(YEAR FROM order_time) ORDER BY total_sold DESC RANGE BETWEEN INTERVAL '3 months' PRECEDING AND CURRENT ROW); ``` 这段代码片段首先创建了一个名为 `w` 的窗口框架,它根据年份月份组合以及具体机型来进行分组聚合运算,同时指定了一个相对固定的滚动周期——即从当前时刻往前推移三个自然月的时间跨度内寻找最大值[^3]。
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值