Hive开窗函数的进阶SQL案例

一、开窗函数基础

1. ​定义与作用

开窗函数(Window Functions)在保留原始行数据的同时,对分组内的行进行聚合或排序分析,常用于累计计算、排名、移动平均等场景。与普通聚合函数(如SUMAVG)的区别在于:

  • 普通聚合函数​:每组返回一行(行数减少)。
  • 开窗函数​:每组返回多行(保留原始行数)。
2. ​核心语法结构
函数名() OVER (
    [PARTITION BY 列1, 列2...]  -- 分组依据
    [ORDER BY 列A [ASC|DESC]]   -- 排序依据
    [ROWS BETWEEN 范围]         -- 窗口范围
)
  • ​**PARTITION BY**​:按列分组,函数在每个分组内独立计算(如按省份分组)。
  • ​**ORDER BY**​:分组内按列排序,影响窗口范围逻辑(如时间升序)。
  • ​**ROWS BETWEEN**​:指定窗口边界,常见选项如下表:
范围关键字含义
UNBOUNDED PRECEDING起点行(分区第一行)
CURRENT ROW当前行
n PRECEDING前n行
n FOLLOWING后n行
UNBOUNDED FOLLOWING终点行(分区最后一行)

默认行为说明​:

  • 仅用 PARTITION BY → 窗口为整个分组(UNBOUNDED PRECEDING TO UNBOUNDED FOLLOWING)。
  • PARTITION BY + ORDER BY → 窗口为起点到当前行(UNBOUNDED PRECEDING TO CURRENT ROW

二、累计计算场景

1. ​年度招聘人数累加
WITH t1 AS (
  SELECT SUBSTR(hiredate, 1, 4) AS year FROM emp
),
t2 AS (
  SELECT year, COUNT(*) AS cnt 
  FROM t1 GROUP BY year
)
SELECT 
  year, 
  cnt,
  SUM(cnt) OVER (ORDER BY year) AS cumulative_cnt  -- 按年份滚动累加
FROM t2;

结果说明​:

yearcntcumulative_cnt
20235050
202480130
202570200

三、排名分析场景

3. ​并列排名处理
SELECT 
  name, class, score,
  ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,  -- 连续唯一序号
  RANK() OVER (ORDER BY score DESC) AS rank,          -- 并列跳号(1,1,3)
  DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank -- 并列不跳号(1,1,2)
FROM Scores;

排名函数对比​:

分数row_numrankdense_rank
95111
95211
90332

四、移动计算场景

4. ​1日移动平均销售额
SELECT 
  shop, day, sales,
  AVG(sales) OVER (
    ORDER BY day 
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW  -- 限定窗口为当前行及前2行
  ) AS avg_3day 
FROM daily_sales;

结果片段​:

daysalesavg_3day
2025-01-01200200.0
2025-01-02300250.0
2025-01-03250250.0
4. 2​相邻行差值计算
SELECT 
  record_date, temperature,
  temperature - LAG(temperature, 1) OVER (ORDER BY record_date) AS diff  -- 今日与昨日温差
FROM weather;

五、连续行为分析

5. ​用户连续登录统计
-- 步骤1:生成连续性标识
WITH t1 AS (
  SELECT 
    user_id, login_date,
    DATE_SUB(login_date, INTERVAL ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) DAY) AS grp
  FROM logins
)
-- 步骤2:聚合连续区间
SELECT 
  user_id,
  MIN(login_date) AS start_date,
  MAX(login_date) AS end_date,
  COUNT(*) AS consecutive_days
FROM t1
GROUP BY user_id, grp;

输出说明​:

user_idstart_dateend_dateconsecutive_days
10012025-01-012025-01-055
10012025-01-072025-01-082

六、高级偏移分析

6.1 ​会话最新未读消息提取
WITH RankedMessages AS (
  SELECT 
    from_id, to_id, msg, isRead, time,
    ROW_NUMBER() OVER (
      PARTITION BY from_id, to_id 
      ORDER BY time DESC
    ) AS rn  -- 按会话分组,时间倒序排
  FROM chat_history
)
SELECT from_id, to_id, msg
FROM RankedMessages
WHERE rn = 1 AND isRead = 0;  -- 取最新且未读的消息
6.2 ​计算登录间隔天数
SELECT 
  user_id, start_date, end_date,
  DATEDIFF(
    start_date, 
    LAG(end_date, 1) OVER (PARTITION BY user_id ORDER BY start_date)
  ) - 1 AS gap_days  -- 本次开始日期与上次结束日期的间隔
FROM login_sessions;

七、分区聚合控制

7. ​动态计算组内极值
SELECT 
  id, val,
  MAX(val) OVER (
    PARTITION BY id 
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS group_max  -- 分组内全局最大值
FROM test;

结果特点​:同一分组的所有行显示相同的最大值。


附:核心函数适用场景总结

场景推荐函数关键子句
累计求和SUM() OVER(ORDER BY)ORDER BY 定义累加顺序
组内排名DENSE_RANK() OVER(PARTITION BY)PARTITION BY 分组依据
移动平均AVG() OVER(ROWS BETWEEN n PRECEDING)ROWS BETWEEN 限定窗口范围
相邻记录比较LAG()/LEAD()偏移量参数控制前后行
连续性行为分析ROW_NUMBER() + 日期差值生成连续性标识列

提示:实际开发中需注意

  1. 性能优化​:避免全分区无边界窗口(如 UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),优先用 ROWS 限定物理范围
  2. 空值处理​:LAG/LEAD 的第三个参数可设置默认值(如 LAG(col,1,0)
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值