转载:Hive开窗函数 Window Functions

原始链接:https://cwiki.apache.org/confluence/display/ASTERIXDB/Window+Functions

Window Functions

转至元数据结尾

转至元数据起始

 

 

Motivation

Add support for SQL-compatible window functions to SQL++

 

Motivation

Add support for SQL-compatible window functions to SQL++

User Model

Window function call ::= function_name(arg1, arg2, ... argN) OVER (frame_var AS)? 
   ( (PARTITION BY expr1, expr2, …. exprN)? (ORDER BY exprA, exprB, … exprN)? Frame_Spec? )
Frame_Spec ::= ( ROWS | RANGE | GROUPS ) Frame_Extent Frame_Exclusion?
Frame_Extent ::= Frame_Bound | BETWEEN Frame_Bound AND Frame_Bound
Frame_Bound ::= UNBOUNDED PRECEDING | expr PRECEDING | CURRENT ROW | UNBOUNDED FOLLOWING | expr FOLLOWING
Frame_Exclusion ::= EXCLUDE ( CURRENT ROW | GROUP | TIES | NO OTHERS )

Function_name is one of the following:

  • Window function: cume_dist(), dense_rank(), first_value(), lag(), last_value(), lead(), nth_value(), ntile(), rank(), ratio_to_report(), row_number(), percent_rank()
  • SQL aggregate function: count(), sum(), avg(), min(), max(), etc
  • SQL++ aggregate function: array_count(), array_sum(), strict_count(), strict_sum(), etc

frame_var is a variable which is bound to the frame contents and will be in-scope for function arguments. It contains an array (or multi-set if there’s no order by) of objects where each object has the same structure as the GROUP AS variable object (one field for each variable in the current scope).

Example: 

from emp select array_sum((from w select value w.emp.salary)) over w as (partition by emp.dept) 

Design

The following components were added:

SQL++ Expression

 org.apache.asterix.lang.sqlpp.expression.WindowExpression

 This expression is for a single window function call.

 Contents:

  • Function name
  • Function arguments
  • List of ‘partition by’ expressions
  • List of ‘order by’ expressions and order modifiers
  • Frame mode: (enum) RANGE | ROWS | GROUPS
  • Frame boundary start, end : (enum) CURRENT_ROW | UNBOUNDED_PRECEDING |  UNBOUNDED_FOLLOWING | BOUNDED_PRECEDING (+ boundary expression) | BOUNDED_FOLLOWING ( + boundary expression)
  • Frame exclusion kind: (enum) CURRENT_ROW | GROUP | TIES | NO_OTHERS
  • Frame variable and its field list mapping
     

Algebricks Operator

org.apache.hyracks.algebricks.core.algebra.operators.logical.WindowOperator

This operator can evaluate multiple function calls over the same window definition

Contents:

  • List of ‘partition by’ expressions
  • List of ‘order by’ expressions and order modifiers
  • List of ‘frame start’ expressions (empty if unbounded)
  • List of ‘frame end’ expressions (empty if unbounded)
  • List of ‘frame value’ expressions
  • List of ‘frame exclusion’ expressions (empty if exclusion kind is ‘no others’)
  • List of output variables and function call expressions. These are running aggregates such as row_number(), rank() that operate on whole partitions (frame definition is not applicable to these functions)
  • Nested plan. These are regular SQL++ aggregates that operate on window frames. Each frame is sent into the nested plan through the nested tuple source operator. SQL aggregate functions are rewritten into SQL++ aggregates by SqlppWindowExpressionVisitor

The logic of the operator is a follows:

  1. Split input into partitions as specified by ‘partition by’ expressions
  2. Then order tuples within each partition as specified by ‘order by’ expressions
  3. Then for each partition
    1. For each tuple compute its running aggregates
    2. For each tuple find a window frame and compute regular aggregates by sending that frame into the nested plan

Window frame computation:

For each tuple in the partition we need to find a set of tuples from the same partition that match the frame definition. This is essentially a self-join with the following condition:

‘frame value’ >= ‘frame start’ and ‘frame value’ <= ‘frame end’ and ‘frame value’ != ‘frame exclusion’

Expressions generated for ‘frame value’, ‘frame start’ and ‘frame end’ depend on the frame mode and frame boundary specification:

Frame mode

‘frame value’ expressions

RANGE

{  ‘order by’ expr1, expr2, … exprN }
all ‘order by’ expressions

ROWS

{ row_number() }
a single expression which is a result of computing row_number() over the input dataset (this additional window operator is introduced by the compiler)

GROUPS 

{ dense_rank() }
a single expression which is a result of computing dense_rank() over the input dataset (this additional window operator is introduced by the compiler)

 

Boundary kind

‘frame start’ and ‘frame end’ expressions

CURRENT ROW{ ‘frame value’ expr1, expr2, … exprN }
all ‘frame value’ expressions’ 
BOUNDED PRECEDING

{ ‘frame value’ expression – ‘boundary’ expression }
(only a single ‘frame value’ expression is allowed in this mode)

BOUNDED FOLLOWING

{ ‘frame value’ expression + ‘boundary’ expression }
(only a single ‘frame value’ expression is allowed in this mode)

UNBOUNDED PRECEDING 

{} (empty)

UNBOUNDED FOLLOWING 

{} (empty)

 

‘frame exclusion’ expressions use the same mechanism as ‘frame value’ expressions and are generated as follows:

Exclusion kind

‘frame exclusion’ expressions

NO OTHERS

{} (empty)

CURRENT ROW { row_number() }
GROUP

{ dense_rank() }

TIES

{ dense_rank(), not row_number() }

 

Hyracks Runtime

Window operator runtime consists of an abstract class org.apache.hyracks.algebricks.runtime.operators.win.AbstractWindowPushRuntime and its subclasses:

  • WindowSimplePushRuntime – used for row_number(), rank(), dense_rank(), and some others – running aggregates that do not require information about partition length

  • WindowMaterializingPushRuntime – used for cume_dist(), ntile(), percent_rank() – running aggregates that require information about partition length

  • WindowNestedPlansPushRuntime – used for regular aggregates – this implementation supports frame computation.  

    • WindowNestedPlansUnboundedPushRuntime - optimized version used when the frame is equivalent to the whole partition (unbounded preceding to unbounded following)
    • WindowNestedPlansRunningPushRuntime - optimized version used when the frame is unbounded preceding to current row / n following

Open items

  • Optimize performance of WindowNestedPlansPushRuntime: support reverse aggregation steps for sliding frames (x preceding to y following, etc)

 

 

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值