你还在用group by吗?高手都在用grouping sets

我们先来看看一个实际生成中的问题:

现有一个用户访问记录表,用户在不同的渠道、操作系统、版本下的访问记录,
如下:

create table test.test10 as 
select '10001' as uid,'xiaomi' as qid,'android' as os,'1.2.2' as ver
union all
select '10002' as uid,'xiaomi' as qid,'android' as os,'1.2.1' as ver
union all
select '10003' as uid,'oppo' as qid,'android' as os,'1.2.3' as ver
union all
select '10002' as uid,'appstore' as qid,'ios' as os,'1.2.0' as ver
union all
select '10001' as uid,'oppo' as qid,'android' as os,'1.2.3' as ver
union all
select '10003' as uid,'appstore' as qid,'ios' as os,'1.1.9' as ver
union all
select '10001' as uid,'xiaomi' as qid,'android' as os,'1.2.2' as ver
union all
select '10002' as uid,'oppo' as qid,'android' as os,'1.2.1' as ver;

问题:

  1. 每个渠道、操作系统、版本下访问次数和访问人数
  2. 每个操作系统下访问次数和访问人数
  3. 总的访问次数和访问人数

分析:
上面三个问题都是求访问次数和访问人数,涉及到人数,我们首先考虑的就是去重。
那怎么去重效率最高呢?

我们先来看看通过group by的方式:
  1. 先根据每个渠道、操作系统、版本去重
  2. 根据1的去重结果再根据操作系统去重
  3. 再根据2的结果到用户维度去重

去重:
根据每个渠道、操作系统、版本去重:

create table test.test10_qid_os_ver_uid as 
select
  uid,
  qid,
  os,
  ver,
  count(1) as pv
from test.test10
group by uid,qid,os,ver;

根据操作系统去重:

create table test.test10_os_uid as 
select
  uid,
  os,
  sum(pv) as pv
from test.test10_qid_os_ver_uid
group by uid,os;

根据用户去重:

create table test.test10_uid as 
select
  uid,
  sum(pv) as pv
from test.test10_os_uid
group by uid;

聚合:
每个渠道、操作系统、版本聚合:

spark-sql> select
         >   qid,
         >   os,
         >   ver,
         >   count(1) as uv,
         >   sum(pv) as pv
         > from test.test10_qid_os_ver_uid
         > group by qid,os,ver;
appstore        ios     1.1.9   1       1                                       
appstore		ios		1.2.0	1		1
xiaomi			android	1.2.2	1		2
oppo			android	1.2.3	2		2
xiaomi			android	1.2.1	1		1
oppo			android	1.2.1	1		1
Time taken: 6.842 seconds, Fetched 6 row(s)

每个操作系统聚合:

spark-sql> select
         >   os,
         >   count(1) as uv,
         >   sum(pv) as pv
         > from test.test10_os_uid
         > group by os;
android 3       6                                                               
ios		2		2
Time taken: 10.561 seconds, Fetched 2 row(s)

每个用户聚合:

spark-sql> select
         >   count(1) as uv,
         >   sum(pv) as pv
         > from test.test10_uid;
3       8                                                                       
Time taken: 3.711 seconds, Fetched 1 row(s)

看到上面这么多代码量是不是有点吓人,每个维度的去重,再每个维度的聚合。
如果不考虑代码的优化,可以直接从qid、os、ver这个最明细的表进行去重和聚合。

我们再来看看通过grouping sets的方式:
  1. 先根据各个需要的维度进行去重
  2. 再进行聚合

去重:
我们去重的维度:

  • uid、qid、os、ver
  • uid、os
  • uid
spark-sql> select
         >   uid,
         >   qid,
         >   os,
         >   ver,
         >   count(1) as pv
         > from test.test10
         > group by uid,qid,os,ver
         > grouping sets
         > (
         >   (uid,qid,os,ver),
         >   (uid,os),
         >   (uid)
         > )
         > order by qid,os,ver;
10001   NULL    	NULL    	NULL    3                                               
10002	NULL		NULL		NULL	3
10003	NULL		NULL		NULL	2
10001	NULL		android		NULL	3
10003	NULL		android		NULL	1
10002	NULL		android		NULL	2
10003	NULL		ios			NULL	1
10002	NULL		ios			NULL	1
10003	appstore	ios			1.1.9	1
10002	appstore	ios			1.2.0	1
10002	oppo		android		1.2.1	1
10003	oppo		android		1.2.3	1
10001	oppo		android		1.2.3	1
10002	xiaomi		android		1.2.1	1
10001	xiaomi		android		1.2.2	2
Time taken: 4.051 seconds, Fetched 15 row(s)

注:
当grouping sets只根据uid维度去重的时候,qid、os、ver三个维度为NULL,即qid、os、ver三个维度为汇总
当grouping sets根据uid、os维度去重的时候,qid、ver这两个维度为NULL,即qid、ver两个维度为汇总

聚合:
根据uid去重后的结果进行聚合:

spark-sql> select 
         >   qid,
         >   os,
         >   ver,
         >   count(1) as uv,
         >   sum(pv) as pv
         > from
         > (select
         >   uid,
         >   nvl(qid,'all') as qid,
         >   nvl(os,'all') as os,
         >   nvl(ver,'all') as ver,
         >   count(1) as pv
         > from test.test10
         > group by uid,qid,os,ver
         > grouping sets
         > (
         >   (uid,qid,os,ver),
         >   (uid,os),
         >   (uid)
         > )) t2
         > group by qid,os,ver
         > order by qid,os,ver;
all			all			all		3		8
all     	android 	all     3       6    
all			ios			all		2		2                                           
appstore	ios			1.1.9	1		1
appstore	ios			1.2.0	1		1
xiaomi	android			1.2.2	1		2
oppo	android			1.2.3	2		2
xiaomi	android			1.2.1	1		1
oppo	android			1.2.1	1		1
Time taken: 8.257 seconds, Fetched 9 row(s)

注意:这里有个坑,如果qid、os、ver维度里面有一个NULL值,数据会出现异常情况
案例:

spark-sql> with test1 as
         > (select '10001' as uid,'xiaomi' as qid
         > union all
         > select '10002' as uid,'xiaomi' as qid
         > union all
         > select '10003' as uid,'oppo' as qid
         > union all
         > select '10003' as uid,NULL as qid)
         > select 
         >   qid,
         >   count(1) as uv,
         >   sum(pv) as pv
         > from 
         > (select 
         >   uid,
         >   qid,
         >   count(1) as pv
         > from test1
         > group by uid,qid
         > grouping sets
         > (
         >   (uid,qid),
         >   (uid)
         > )) t1
         > group by qid;
NULL    4       5                                                               
oppo	1		1
xiaomi	2		2
Time taken: 6.396 seconds, Fetched 3 row(s)

qid维度里面有一个NULL值,造成结果是汇总数据有问题,本来uv是3,pv是4,现在出现了uv为4,pv为5,是因为grouing sets再处理qid维度时,算到汇总维度里了。我们怎么去解决这个问题呢?所有维度字段的值做排空处理nvl(qid,'unknow') as qid
案例:

spark-sql> with test1 as
         > (select '10001' as uid,'xiaomi' as qid
         > union all
         > select '10002' as uid,'xiaomi' as qid
         > union all
         > select '10003' as uid,'oppo' as qid
         > union all
         > select '10003' as uid,NULL as qid)
         > 
         > select 
         >   qid,
         >   count(1) as uv,
         >   sum(pv) as pv
         > from 
         > (select 
         >   uid,
         >   qid,
         >   count(1) as pv
         > from 
         > (select 
         >   uid,
         >   nvl(qid,'unknow') as qid
         > from test1) t1
         > group by uid,qid
         > grouping sets
         > (
         >   (uid,qid),
         >   (uid)
         > )) t2
         > group by qid;
NULL    	3       4                                                               
oppo		1		1
xiaomi		2		2
unknow		1		1
Time taken: 3.47 seconds, Fetched 4 row(s)

这样qid为NULL值就以unknow显示出来了

总结:
普通group by方式:
优点:各个维度的uid表都落地了,如果查询某个特定的维度,效率会高一点;逻辑比较清晰,直观
缺点:代码量大,维护起来比较麻烦
grouping sets方式:
优点:代码简洁,比较好维护
缺点:数据量大的时候,消耗的资源比较多(grouping sets下面的维度可分段,解决数据量大的问题)

  • 1
    点赞
  • 6
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值