LanguageManual WindowingAndAnalytics
Windowing and Analytics Functions
链接:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics#LanguageManualWindowingAndAnalytics-EnhancementstoHiveQL
大概:
Window function
- LEAD函数
- LAG函数
- FIRST_VALUE函数
- LAST_VALUE函数
OVER clause --- 一个OVER子句仅适用于前面的函数调用
analytics function
DISTINCT support
aggregate function
例子:
COUNT(DISTINCT a) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
其中 ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING 应该就是window specification
这些都是对Hive QL的增强.
窗口和分析函数
Windowing and Analytics Functions
- Enhancements to Hive QL(对Hive QL的增强)
- Examples
- PARTITION BY with one partitioning column, no ORDER BY or window specification(PARTITION BY一个partitioning column,没有ORDER BY或window 规范)
- PARTITION BY with two partitioning columns, no ORDER BY or window specification(PARTITION BY由两个partitioning columns,不指定ORDERBY或Window规范)
- PARTITION BY with one partitioning column, one ORDER BY column, and no window specification(PARTITION BY按一个partitioning column,一个ORDER BY column,而没有window规范)
- PARTITION BY with two partitioning columns, two ORDER BY columns, and no window specification(PARTITION BY由两个partitioning columns,两个ORDER BY columns,并且没有window规范(规格))
- PARTITION BY with partitioning, ORDER BY, and window specification(PARTITION BY分区、ORDER BY和window规范)
- WINDOW clause(WINDOW 分句)
- LEAD using default 1 row lead and not specifying default value(LEAD[领导]使用默认1 row lead[1排引线]而不指定default value)
- LAG specifying a lag of 3 rows and default value of 0(LAG[走得极慢,落后],指定a lag of 3 rows,default value为0)
- Distinct counting for each partition(每个partition的Distinct[明显的]计数)
Enhancements to Hive QL
Version
Introduced in Hive version 0.11.
This section introduces the Hive QL enhancements for windowing and analytics functions.
本节介绍用于windowing and analytics的Hive QL增强功能。
See "Windowing Specifications in HQL" (attached to HIVE-4197) for details.
HIVE-896 has more information, including links to earlier documentation in the initial comments.(hive-896有更多信息,包括the initial comments[最初注释]中的links to earlier documentation。)
All of the windowing and analytics functions operate as per the SQL standard.(所有的windowing and analytics functions都按照the SQL standard。)
The current release supports the following functions for windowing and analytics:(current release[当前版本]支持以下windowing and analytics功能:)
1. Windowing functions
LEAD(领导)
The number of rows to lead can optionally be specified. (可以选择指定要lead的The number of rows。)If the number of rows to lead is not specified, the lead is one row.(如果未指定要lead的the number of rows,则lead为 one row。)
Returns null when the lead for the current row extends beyond the end of the window.(当the current row的the lead扩展到the end of the window,返回NULL。)
LAG(滞后)
The number of rows to lag can optionally be specified.(可以选择指定要lag[滞后]的The number of rows。) If the number of rows to lag is not specified, the lag is one row.(如果未指定要lag的the number of rows,则lag为one row。)
Returns null when the lag for the current row extends before the beginning of the window.(当the current row的lag在the beginning of the window之前扩展时,返回NULL。)
FIRST_VALUE
This takes at most two parameters. (这最多需要两个parameters。)The first parameter is the column for which you want the first value, the second (optional) parameter must be a boolean which is false by default. (The first parameter是您想要第一个值的column,第二个(可选的)parameter必须是一个Boolean,默认情况下为false。)If set to true it skips null values.(如果设置为true,则跳过null values。)
LAST_VALUE
This takes at most two parameters.(这最多需要两个parameters。) The first parameter is the column for which you want the last value, the second (optional) parameter must be a boolean which is false by default. If set to true it skips null values.(The first parameter是您想要最后一个值的column,第二个参数(可选)必须是一个boolean,默认情况下为false。如果设置为true,则跳过null values。)
2. The OVER clause(从句)
OVER with standard aggregates ( aggregates函数配合OVER从句):
- COUNT
- SUM
- MIN
- MAX
- AVG
OVER with a PARTITION BY statement with one or more partitioning columns of any primitive datatype(使用带有any primitive datatype[任何原始数据类型]的one or more partitioning columns的PARTITION BY 语句结束。).
OVER with PARTITION BY and ORDER BY with one or more partitioning and/or ordering columns of any datatype.(OVER with PARTITION BY 和 ORDER BY,使用one or more partitioning and/or ordering columns of any datatype)
OVER with a window specification.(OVER with a window 规范.) Windows can be defined separately in a WINDOW clause.(Windows可以在WINDOW clause中单独定义。) Window specifications support the following formats:(Window规范支持the following formats:)
这里是Window规范的格式
UNBOUNDED:无限的
PRECEDING:在前的
FOLLOWING:后面的
(ROWS | RANGE)
BETWEEN
(UNBOUNDED | [num]) PRECEDING
AND
([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE)
BETWEEN
CURRENT ROW
AND
(CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE)
BETWEEN
[num] FOLLOWING
AND
(UNBOUNDED | [num]) FOLLOWING
When ORDER BY is specified with missing WINDOW clause, the WINDOW specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW(range between unbounded preceding and current).(当ORDER BY没有指定Window从句时时,WINDOW规范默认为RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW(range between unbounded preceding and current)。)
When both ORDER BY and WINDOW clauses are missing, the WINDOW specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.(当ORDER BY和WINDOW子句都丢失时,WINDOW规范默认为ROW在UNBOUNDED PRECEDING和UNBOUNDED FOLLOWING。)
The OVER clause supports the following functions, but it does not support a window with them (see HIVE-4797):(OVER子句支持以下函数,但它不支持带有它们的window(参见hive-4797):)
Ranking functions: Rank, NTile, DenseRank, CumeDist, PercentRank.(排名功能:Rank,NTile,DenseRank,CumeDist,PercentRank。)
Lead and Lag functions.
3. Analytics functions
- RANK(rank)
- ROW_NUMBER(row_number)
- DENSE_RANK(denser_rank)
- CUME_DIST(cume_dist)
- PERCENT_RANK(percent_rank)
- NTILE(ntile)
4. Distinct support in Hive 2.1.0 and later(Hive 2.1.0及更高版本中的Distinct support)(see HIVE-9534)
Distinct is supported for aggregation functions including SUM, COUNT and AVG, which aggregate over the distinct values within each partition.(包含SUM、COUNT和AVG在内的聚合函数支持DISTIC,这些函数在每个partition内的the distinct values上进行聚合。) Current implementation has the limitation that no ORDER BY or window specification can be supported in the partitioning clause for performance reason. (当前实现的限制是,由于性能原因,the partitioning clause中不支持ORDER BY或window规范。)The supported syntax is as follows.(支持的语法如下所示。)
COUNT(DISTINCT a) OVER (PARTITION BY c) |
ORDER BY and window specification is supported for distinct in Hive 2.2.0 (see HIVE-13453). (Hive2.2.0中支持ORDER BY和window规范(参见hive-13453)。)An example is as follows.(一个例子如下。)
COUNT(DISTINCT a) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) |
5. Aggregate functions in OVER clause support in Hive 2.1.0 and later(Aggregate函数在Hive2.1.0和后来的条款支持) (see HIVE-13475)
Support to reference aggregate functions within the OVER clause has been added. (在OVER子句中添加了对引用aggregate函数的支持。)For instance, currently we can use the SUM aggregation function within the OVER clause as follows.(例如,目前我们可以在OVER子句中使用SUM聚合函数,如下所示。)
SELECT rank() OVER (ORDER BY sum(b)) FROM T GROUP BY a; |
Examples
This section provides examples of how to use the Hive QL windowing and analytics functions in SELECT statements. (本节提供了如何在SELECT语句中使用Hive QL windowing and analytics functions的示例。)See HIVE-896 for additional examples.
PARTITION BY with one partitioning column, no ORDER BY or window specification
SELECT a, COUNT(b) OVER (PARTITION BY c) FROM T; |
PARTITION BY with two partitioning columns, no ORDER BY or window specification
SELECT a, COUNT(b) OVER (PARTITION BY c, d) FROM T; |
PARTITION BY with one partitioning column, one ORDER BY column, and no window specification
SELECT a, SUM(b) OVER (PARTITION BY c ORDER BY d) FROM T; |
PARTITION BY with two partitioning columns, two ORDER BY columns, and no window specification
SELECT a, SUM(b) OVER (PARTITION BY c, d ORDER BY e, f) FROM T; |
PARTITION BY with partitioning, ORDER BY, and window specification
SELECT a, SUM(b) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM T; |
SELECT a, AVG(b) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) FROM T; |
SELECT a, AVG(b) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING) FROM T; |
SELECT a, AVG(b) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) FROM T; |
There can be multiple OVER clauses in a single query.(在a single query中可以有多个OVER clauses。) A single OVER clause only applies to the immediately preceding function call. (一个OVER子句仅适用于前面的函数调用)In this example, the first OVER clause applies to COUNT(b) and the second OVER clause applies to SUM(b)(在本例中,第一个OVER子句适用于COUNT(b),第二个OVER子句适用于SUM(b)。):
SELECT a, COUNT(b) OVER (PARTITION BY c), SUM(b) OVER (PARTITION BY c) FROM T; |
Aliases can be used as well, with or without the keyword AS:(Aliases也可以使用,有或不带关键字AS,如下所示:)
SELECT a, COUNT(b) OVER (PARTITION BY c) AS b_count, SUM(b) OVER (PARTITION BY c) b_sum FROM T; |
WINDOW clause
SELECT a, SUM(b) OVER w FROM T WINDOW w AS (PARTITION BY c ORDER BY d ROWS UNBOUNDED PRECEDING); |
LEAD using default 1 row lead and not specifying default value
SELECT a, LEAD(a) OVER (PARTITION BY b ORDER BY C) FROM T; |
LAG specifying a lag of 3 rows and default value of 0
SELECT a, LAG(a, 3, 0) OVER (PARTITION BY b ORDER BY C) FROM T; |
Distinct counting for each partition
SELECT a, COUNT(distinct a) OVER (PARTITION BY b) FROM T; |