Table of Contents
spark-sql和支持在线多维分析函数,这里介绍几个比较常用的函数:grouping sets,grouping__id,rollup和cube。
准备数据
ldldwd.ec_tm_business_store表:
businessID businessName ID
1 CPD 11
3 LD 12
4 PPD 13
2 ACD 14
GROUPING SETS
对group后的分组再次进行分组,在sets后传入分组的组合。可以理解为:如果使用了grouping sets,那么group by字段仅限定分组字段,实际上是对grouping sets中的组合进行分组。
例如,不加grouping sets,使用group by分组
select businessID,businessName
from ldldwd.ec_tm_business_store
group by businessID,businessName
结果为:
businessID businessName
2 ACD
1 CPD
4 PPD
3 LD
使用grouping sets再对businessID,businessName划分三个子分组,分别为分组一: businessID,businessName 分组二:businessID 分组三:businessName,sql如下:
select businessID,businessName
from ldldwd.ec_tm_business_store
group by businessID,businessName
grouping sets((businessID,businessName),businessID,businessName)
;
结果为:
businessID businessName
3 NULL
NULL PPD
2 ACD
1 NULL
4 PPD
2 NULL
1 CPD
NULL LD
NULL ACD
3 LD
NULL CPD
4 NULL
其含义为:对ec_tm_business_store表,分别按照(businessID,businessName)和businessID以及businessName分组。并将三个分组的查询结果结合在一起。等同于:
select businessID,businessName
from ldldwd.ec_tm_business_store
group by businessID,businessName
union
select businessID,null as businessName
from ldldwd.ec_tm_business_store
group by businessID
union
select null as businessID,businessName
from ldldwd.ec_tm_business_store
group by businessID,businessName
因为分组二和分组三只对一个字段进行分组,所以另一个字段理所应当为null。
需要注意的是:
一:如果使用了grouping sets,那么仅对sets中的集合进行分组,不会按group by后的全字段进行分组,如:
select businessID,businessName
from ldldwd.ec_tm_business_store
group by businessID,businessName
grouping sets(businessID,businessName)
;
结果为:
businessID businessName
3 NULL
NULL PPD
1 NULL
2 NULL
NULL LD
NULL ACD
NULL CPD
4 NULL
并没有显示 group by businessID,businessName 的结果。
二:grouping sets中不能出现group by后不包含的字段,如:
select businessID,businessName
from ldldwd.ec_tm_business_store
group by businessID,businessName
grouping sets(ID,businessID,businessName)
;
以上sql的sets中多了ID字段,而group by不包含ID字段,这是会报错的。
Error in query: id#35 doesn't show up in the GROUP BY list ArrayBuffer(businessID#36 AS businessID#50, businessName#37 AS businessName#51);
GROUPING__ID
grouping__id可以结合grouping sets使用,可以区分数据属于哪个分组。它的值表示当前行中,哪些字段没有参与到分组中,如:
select businessID,businessName,ID,grouping__id
from ldldwd.ec_tm_business_store
group by businessID,businessName,ID
grouping sets((businessID,businessName),businessID,businessName,ID)
order by grouping__id
;
结果为:
businessID businessName ID grouping__id
3 LD NULL 1
2 ACD NULL 1
1 CPD NULL 1
4 PPD NULL 1
2 NULL NULL 3
1 NULL NULL 3
3 NULL NULL 3
4 NULL NULL 3
NULL LD NULL 5
NULL PPD NULL 5
NULL CPD NULL 5
NULL ACD NULL 5
NULL NULL 11 6
NULL NULL 12 6
NULL NULL 13 6
NULL NULL 14 6
乍一看是不是一头雾水?
实际上,grouping__id对分组字段按照位图策略对其附了一个id值,按照二进制,将靠近group by的字段设为高位,远离group by的字段设为低位。然后对当前子分组中,没有参与到分组字段的id值进行累加。
本例中,group by后依次是businessID,businessName,ID,那么他们对应的id值为:
businessID:4
businessName:2:
ID:1
那么就能解释grouping__id的值是怎么来的了:
businessID businessName ID grouping__id
--ID未参与分组,grouping__id等于ID字段的id值,为1
3 LD NULL 1
2 ACD NULL 1
1 CPD NULL 1
4 PPD NULL 1
--ID字段和businessName字段未参与分组,grouping__id等于 (ID字段的id值+businessName字段的id值)=1+2=3
2 NULL NULL 3
1 NULL NULL 3
3 NULL NULL 3
4 NULL NULL 3
--ID字段和businessID字段未参与分组,grouping__id等于 (ID字段的id值+businessID字段的id值)=1+4=5
NULL LD NULL 5
NULL PPD NULL 5
NULL CPD NULL 5
NULL ACD NULL 5
--businessID和businessName未参与分组,grouping__id等于 (businessID字段的id值+businessName字段的id值)=2+4=6
NULL NULL 11 6
NULL NULL 12 6
NULL NULL 13 6
NULL NULL 14 6
通过预判grouping__id的值,可以对数据进行有效的区分。
ROLLUP
rollup的使用方法通常是直接在group by之后加上with rollup。其作用为:将group by后的字段视为层级关系,并按照层级进行上卷。如:group by a,b,c with rollup 相当于group by a,b,c grouping sets((a,b,c),(a,b),(a))。
即:以group byd的所有字段进行分组,再依次去除最右边的字段进行分组,最后把所有的结果union起来。
举个例子:
准备数据:
year month day amount
2020 01 05 10
2020 01 08 10
2020 02 08 10
2020 02 05 20
2020 03 05 30
2020 03 08 30
sql:
SELECT year,month,sum(amount) as amount
from sellout_tm
group by year,month
with rollup
结果:
year month amount
2020 01 20
2020 02 30
2020 03 60
2020 null 110
上面的sql相当于
SELECT year,month,sum(amount) as amount
from sellout_tm
group by year,month
grouping sets((year,month),year)
CUBE
与kylin的cube一样,对group by的所有字段进行组合,如:group by a,b,c with cube 相当于group by a,b,c grouping sets((a,b,c),(a,b),(a,c),(b,c),a,b,c)。这里就不赘述了。