Hive 窗口函数
窗口函数是 SQL 中用于在指定的数据窗口(如行或分区)内进行复杂计算的函数。Hive 从版本 0.11.0 开始支持窗口函数,它们可以用于在数据集的子集上进行计算,而无需对整个数据集进行聚合。
窗口函数包括以下类型:
- 排名函数:如
RANK()
,DENSE_RANK()
,ROW_NUMBER()
- 聚合函数:如
SUM()
,AVG()
,MAX()
,MIN()
- 值函数:如
LEAD()
,LAG()
,FIRST_VALUE()
,LAST_VALUE()
窗口函数语法
窗口函数的基本语法如下:
function_name (expression) OVER ( [PARTITION BY partition_expression] [ORDER BY order_expression] [window_specification] )
function_name
:窗口函数的名称。expression
:函数应用的表达式。OVER
:定义窗口的范围。PARTITION BY
:指定分区列。ORDER BY
:指定排序列。window_specification
:定义窗口的范围(行或范围)。
常用窗口函数示例
1. ROW_NUMBER()
ROW_NUMBER()
函数为每一行分配一个唯一的行号,从 1 开始。
示例:
SELECT id, name, amount, ROW_NUMBER() OVER (PARTITION BY category ORDER BY amount DESC) AS row_num FROM sales;
2. RANK() 和 DENSE_RANK()
RANK()
函数为分区内的每一行分配一个排名,相同值的行将具有相同的排名,跳过下一个排名。
DENSE_RANK()
函数类似于 RANK()
,但不会跳过排名。
示例:
SELECT id, name, amount, RANK() OVER (PARTITION BY category ORDER BY amount DESC) AS rank, DENSE_RANK() OVER (PARTITION BY category ORDER BY amount DESC) AS dense_rank FROM sales;
3. SUM() 和 AVG()
SUM()
和 AVG()
是聚合函数,可以在窗口内进行累计求和和平均值计算。
示例:
SELECT id, category, amount, SUM(amount) OVER (PARTITION BY category ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total, AVG(amount) OVER (PARTITION BY category ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_avg FROM sales;
4. LEAD() 和 LAG()
LEAD()
函数返回当前行之后的第 N 行的值,LAG()
函数返回当前行之前的第 N 行的值。
示例:
SELECT id, category, amount, LEAD(amount, 1) OVER (PARTITION BY category ORDER BY id) AS next_amount, LAG(amount, 1) OVER (PARTITION BY category ORDER BY id) AS prev_amount FROM sales;
5. FIRST_VALUE() 和 LAST_VALUE()
FIRST_VALUE()
函数返回窗口内的第一个值,LAST_VALUE()
函数返回窗口内的最后一个值。
示例:
SELECT id, category, amount, FIRST_VALUE(amount) OVER (PARTITION BY category ORDER BY id) AS first_amount, LAST_VALUE(amount) OVER (PARTITION BY category ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_amount FROM sales;
窗口规范
窗口规范可以通过以下方式定义窗口的范围:
- ROWS:基于物理行数的范围。
- RANGE:基于逻辑范围的范围。
示例:使用 ROWS
SELECT id, category, amount, SUM(amount) OVER (PARTITION BY category ORDER BY id ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS moving_sum FROM sales;
示例:使用 RANGE
SELECT id, category, amount, SUM(amount) OVER (PARTITION BY category ORDER BY id RANGE BETWEEN INTERVAL 1 DAY PRECEDING AND INTERVAL 1 DAY FOLLOWING) AS moving_sum FROM sales;
结论
窗口函数是 Hive 中强大的数据分析工具,能够在特定的数据窗口内进行复杂计算,适用于排名、移动平均、累计求和等场景。掌握这些窗口函数,可以大大提高数据分析的效率和灵活性。