Grouping__ID 在不同版本中的使用方法不一样

GROUPING SETS clause

Hive2.3版本之后

Grouping__ID function

When aggregates are displayed for a column its value is null. This may conflict in case the column itself has some null values. There needs to be some way to identify NULL in column, which means aggregate and NULL in column, which means value. GROUPING__ID function is the solution to that.

This function returns a bitvector corresponding to whether each column is present or not. For each column, a value of "1" is produced for a row in the result set if that column has been aggregated in that row, otherwise the value is "0". This can be used to differentiate when there are nulls in the data.

Consider the following example:

Column1 (key)

Column2 (value)

1

NULL

1

1

2

2

3

3

3

NULL

4

5

The following query:

SELECT key, value, GROUPING__ID, count(*)

FROM T1

GROUP BY key, value WITH ROLLUP;

will have the following results:

Column 1 (key)

Column 2 (value)

GROUPING__ID

count(*)

NULL

NULL

3

6

1

NULL

0

2

1

NULL

1

1

1

1

0

1

2

NULL

1

1

2

2

0

1

3

NULL

0

2

3

NULL

1

1

3

3

0

1

4

NULL

1

1

4

5

0

1

Note that the third column is a bitvector of columns being selected.
For the first row, none of the columns are being selected.
For the second row, both the columns are being selected (and the second column happens to be null), which explains the value 0.
For the third row, only the first column is being selected, which explains the value 1.

 

hive 2.3版本之前

Grouping__ID function (before Hive 2.3.0)

补充:2.3之前,的grouping__id值的二进制每一位数值1表示使用该列进行聚合。0表示不适用该列聚合。

Grouping__ID function was fixed in Hive 2.3.0, thus behavior before that release is different (this is expected). For each column, the function would return a value of "0" iif that column has been aggregated in that row, otherwise the value is "1".

Hence the following query:

SELECT key, value, GROUPING__ID, count(*)

FROM T1

GROUP BY key, value WITH ROLLUP;

will have the following results.

Column 1 (key)

Column 2 (value)

GROUPING__ID

 count(*)

NULL

NULL

0

6

1

NULL

1

2

1

NULL

3

1

1

1

3

1

2

NULL

1

1

2

2

3

1

3

NULL

1

2

3

NULL

3

1

3

3

3

1

4

NULL

1

1

4

5

3

1

解析:

对于第一行GROUPING__ID为0,因此表示使用0列进行聚合,即是整体作为一个分组聚合。

对于第一行GROUPING__ID为1,表示使用第一列,即key列进行聚合,因此共计有2条

对于第一行GROUPING__ID为3,二进制为11,表示使用key与value两列进行聚合,进行共计只有一行。

 

补充:

hive文档个人觉着不如spark文档质量高,可以尝试阅读spark文档,个人认为还是spark文档简单清晰明了,如下:

https://spark.apache.org/docs/latest/api/sql/index.html#grouping_id

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值