Let's start with a problem from a real production scenario.
We have a user-access log table that records each user's visits by channel (qid), operating system, and version.
The sample data is created as follows:
create table test.test10 as
select '10001' as uid,'xiaomi' as qid,'android' as os,'1.2.2' as ver
union all
select '10002' as uid,'xiaomi' as qid,'android' as os,'1.2.1' as ver
union all
select '10003' as uid,'oppo' as qid,'android' as os,'1.2.3' as ver
union all
select '10002' as uid,'appstore' as qid,'ios' as os,'1.2.0' as ver
union all
select '10001' as uid,'oppo' as qid,'android' as os,'1.2.3' as ver
union all
select '10003' as uid,'appstore' as qid,'ios' as os,'1.1.9' as ver
union all
select '10001' as uid,'xiaomi' as qid,'android' as os,'1.2.2' as ver
union all
select '10002' as uid,'oppo' as qid,'android' as os,'1.2.1' as ver;
The questions:
- pv (visit count) and uv (visitor count) per channel, OS, and version
- pv and uv per OS
- overall pv and uv
Analysis:
All three questions ask for visit counts (pv) and visitor counts (uv). Counting people means deduplicating by user first, so the real question is: what is the most efficient way to deduplicate?
Let's start with the plain group by approach:
1. Deduplicate by channel, OS, and version (together with uid).
2. Take the result of step 1 and deduplicate by OS.
3. Take the result of step 2 and deduplicate down to the user level.
Deduplication:
Deduplicate by channel, OS, and version:
create table test.test10_qid_os_ver_uid as
select
uid,
qid,
os,
ver,
count(1) as pv
from test.test10
group by uid,qid,os,ver;
Deduplicate by OS:
create table test.test10_os_uid as
select
uid,
os,
sum(pv) as pv
from test.test10_qid_os_ver_uid
group by uid,os;
Deduplicate by user:
create table test.test10_uid as
select
uid,
sum(pv) as pv
from test.test10_os_uid
group by uid;
Aggregation:
Aggregate per channel, OS, and version:
spark-sql> select
> qid,
> os,
> ver,
> count(1) as uv,
> sum(pv) as pv
> from test.test10_qid_os_ver_uid
> group by qid,os,ver;
appstore ios 1.1.9 1 1
appstore ios 1.2.0 1 1
xiaomi android 1.2.2 1 2
oppo android 1.2.3 2 2
xiaomi android 1.2.1 1 1
oppo android 1.2.1 1 1
Time taken: 6.842 seconds, Fetched 6 row(s)
Aggregate per OS:
spark-sql> select
> os,
> count(1) as uv,
> sum(pv) as pv
> from test.test10_os_uid
> group by os;
android 3 6
ios 2 2
Time taken: 10.561 seconds, Fetched 2 row(s)
Overall aggregation (from the per-user table):
spark-sql> select
> count(1) as uv,
> sum(pv) as pv
> from test.test10_uid;
3 8
Time taken: 3.711 seconds, Fetched 1 row(s)
That is quite a lot of code: a separate deduplication per dimension, followed by a separate aggregation per dimension.
(If we did not care about optimizing the code, every dedup and aggregation could also read directly from the most detailed table, the one at qid/os/ver grain.)
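As a sanity check on the numbers above, here is a minimal Python sketch (plain stdlib, not Spark) that computes pv and uv straight from the eight sample rows; the helper name `agg` is just for illustration:

```python
from collections import defaultdict

# The eight sample rows from test.test10: (uid, qid, os, ver)
rows = [
    ("10001", "xiaomi",   "android", "1.2.2"),
    ("10002", "xiaomi",   "android", "1.2.1"),
    ("10003", "oppo",     "android", "1.2.3"),
    ("10002", "appstore", "ios",     "1.2.0"),
    ("10001", "oppo",     "android", "1.2.3"),
    ("10003", "appstore", "ios",     "1.1.9"),
    ("10001", "xiaomi",   "android", "1.2.2"),
    ("10002", "oppo",     "android", "1.2.1"),
]

def agg(key_fn):
    """pv = row count per key; uv = number of distinct uids per key."""
    pv = defaultdict(int)
    uids = defaultdict(set)
    for uid, qid, os_, ver in rows:
        k = key_fn(qid, os_, ver)
        pv[k] += 1
        uids[k].add(uid)
    return {k: (len(uids[k]), pv[k]) for k in pv}

by_os = agg(lambda qid, os_, ver: os_)
total = agg(lambda qid, os_, ver: "all")
print(by_os)   # {'android': (3, 6), 'ios': (2, 2)}
print(total)   # {'all': (3, 8)}
```

The results match the spark-sql output: android uv=3/pv=6, ios uv=2/pv=2, overall uv=3/pv=8.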
Now let's look at the grouping sets approach:
- deduplicate by every required dimension combination in a single pass
- then aggregate
Deduplication:
The dimension combinations to deduplicate by:
- uid, qid, os, ver
- uid, os
- uid
spark-sql> select
> uid,
> qid,
> os,
> ver,
> count(1) as pv
> from test.test10
> group by uid,qid,os,ver
> grouping sets
> (
> (uid,qid,os,ver),
> (uid,os),
> (uid)
> )
> order by qid,os,ver;
10001 NULL NULL NULL 3
10002 NULL NULL NULL 3
10003 NULL NULL NULL 2
10001 NULL android NULL 3
10003 NULL android NULL 1
10002 NULL android NULL 2
10003 NULL ios NULL 1
10002 NULL ios NULL 1
10003 appstore ios 1.1.9 1
10002 appstore ios 1.2.0 1
10002 oppo android 1.2.1 1
10003 oppo android 1.2.3 1
10001 oppo android 1.2.3 1
10002 xiaomi android 1.2.1 1
10001 xiaomi android 1.2.2 2
Time taken: 4.051 seconds, Fetched 15 row(s)
Note:
When grouping sets deduplicates by uid alone, qid, os, and ver come out as NULL, meaning those three dimensions are rolled up.
When grouping sets deduplicates by uid and os, qid and ver come out as NULL, meaning those two dimensions are rolled up.
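The NULL-as-rollup behaviour can be emulated in a few lines of plain Python, which also reproduces the 15 output rows above; `grouping_sets` here is a hypothetical helper for illustration, not Spark's implementation:

```python
from collections import Counter

# Sample rows from test.test10: (uid, qid, os, ver)
rows = [
    ("10001", "xiaomi",   "android", "1.2.2"),
    ("10002", "xiaomi",   "android", "1.2.1"),
    ("10003", "oppo",     "android", "1.2.3"),
    ("10002", "appstore", "ios",     "1.2.0"),
    ("10001", "oppo",     "android", "1.2.3"),
    ("10003", "appstore", "ios",     "1.1.9"),
    ("10001", "xiaomi",   "android", "1.2.2"),
    ("10002", "oppo",     "android", "1.2.1"),
]

COLS = ("uid", "qid", "os", "ver")
SETS = [("uid", "qid", "os", "ver"), ("uid", "os"), ("uid",)]

def grouping_sets(rows):
    """One GROUP BY pass per grouping set; columns outside the
    set are emitted as None (SQL NULL, i.e. rolled up)."""
    out = Counter()
    for gset in SETS:
        for row in rows:
            named = dict(zip(COLS, row))
            key = tuple(named[c] if c in gset else None for c in COLS)
            out[key] += 1
    return out

result = grouping_sets(rows)
print(len(result))                          # 15 rows, as in the spark-sql output
print(result[("10001", None, None, None)])  # 3: uid-only set, all dims NULL
```

Each grouping set is effectively its own GROUP BY over the same input, and the union of all passes is what the single grouping sets query returns.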
Aggregation:
Aggregate over the deduplicated result, relabeling the rollup NULLs as 'all':
spark-sql> select
> qid,
> os,
> ver,
> count(1) as uv,
> sum(pv) as pv
> from
> (select
> uid,
> nvl(qid,'all') as qid,
> nvl(os,'all') as os,
> nvl(ver,'all') as ver,
> count(1) as pv
> from test.test10
> group by uid,qid,os,ver
> grouping sets
> (
> (uid,qid,os,ver),
> (uid,os),
> (uid)
> )) t2
> group by qid,os,ver
> order by qid,os,ver;
all all all 3 8
all android all 3 6
all ios all 2 2
appstore ios 1.1.9 1 1
appstore ios 1.2.0 1 1
xiaomi android 1.2.2 1 2
oppo android 1.2.3 2 2
xiaomi android 1.2.1 1 1
oppo android 1.2.1 1 1
Time taken: 8.257 seconds, Fetched 9 row(s)
Watch out: there is a trap here. If any of the qid, os, or ver columns contains NULL values in the data itself, the results come out wrong.
Example:
spark-sql> with test1 as
> (select '10001' as uid,'xiaomi' as qid
> union all
> select '10002' as uid,'xiaomi' as qid
> union all
> select '10003' as uid,'oppo' as qid
> union all
> select '10003' as uid,NULL as qid)
> select
> qid,
> count(1) as uv,
> sum(pv) as pv
> from
> (select
> uid,
> qid,
> count(1) as pv
> from test1
> group by uid,qid
> grouping sets
> (
> (uid,qid),
> (uid)
> )) t1
> group by qid;
NULL 4 5
oppo 1 1
xiaomi 2 2
Time taken: 6.396 seconds, Fetched 3 row(s)
Because the qid column contains a NULL value, the rollup row is wrong: the total should be uv = 3, pv = 4, but we get uv = 4, pv = 5. When grouping sets processes the (uid) set it emits qid as NULL, so the row whose qid is genuinely NULL becomes indistinguishable from the rollup rows and gets counted into the subtotal. The fix is to null-fill every dimension column before grouping, e.g. nvl(qid,'unknow') as qid.
Example:
spark-sql> with test1 as
> (select '10001' as uid,'xiaomi' as qid
> union all
> select '10002' as uid,'xiaomi' as qid
> union all
> select '10003' as uid,'oppo' as qid
> union all
> select '10003' as uid,NULL as qid)
>
> select
> qid,
> count(1) as uv,
> sum(pv) as pv
> from
> (select
> uid,
> qid,
> count(1) as pv
> from
> (select
> uid,
> nvl(qid,'unknow') as qid
> from test1) t1
> group by uid,qid
> grouping sets
> (
> (uid,qid),
> (uid)
> )) t2
> group by qid;
NULL 3 4
oppo 1 1
xiaomi 2 2
unknow 1 1
Time taken: 3.47 seconds, Fetched 4 row(s)
Now the rows whose qid was NULL show up under 'unknow', and the rollup row is correct again.
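The collision is easy to reproduce outside of Spark. The sketch below (plain Python; `dedup` and `rollup` are hypothetical helpers) models the two grouping sets as two separate passes whose output rows are simply concatenated, which is exactly why a genuine NULL qid lands in the same bucket as the rollup NULL:

```python
from collections import Counter

# The test1 rows: (uid, qid); the last row has a genuine NULL qid
rows = [("10001", "xiaomi"), ("10002", "xiaomi"),
        ("10003", "oppo"), ("10003", None)]

def dedup(rows):
    """Emulate GROUPING SETS ((uid, qid), (uid)): each set yields its
    own output rows (uid, qid, pv); the (uid) set emits qid as None."""
    detail = Counter(rows)                              # set (uid, qid)
    per_uid = Counter((uid, None) for uid, _ in rows)   # set (uid)
    return [(u, q, n) for (u, q), n in detail.items()] \
         + [(u, q, n) for (u, q), n in per_uid.items()]

def rollup(dedup_rows):
    """Outer GROUP BY qid: uv = count(1), pv = sum(pv)."""
    uv, pv = Counter(), Counter()
    for _uid, qid, n in dedup_rows:
        uv[qid] += 1
        pv[qid] += n
    return uv, pv

uv, pv = rollup(dedup(rows))
print(uv[None], pv[None])    # 4 5 -- wrong: the real NULL fell into the rollup

# The fix from the article: map NULL to a sentinel first, like nvl(qid,'unknow')
fixed = [(u, q if q is not None else "unknow") for u, q in rows]
uv2, pv2 = rollup(dedup(fixed))
print(uv2[None], pv2[None])  # 3 4 -- the rollup is correct again
```

Replacing NULL with a sentinel before grouping keeps the data value and the rollup marker in separate buckets, which is all the nvl() trick does.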
Summary:
The plain group by approach:
Pros: the deduplicated uid table for each dimension is materialized, so querying one specific dimension later is somewhat faster; the logic is clear and easy to follow.
Cons: a lot of code, which is tedious to maintain.
The grouping sets approach:
Pros: concise code that is easy to maintain.
Cons: consumes more resources when the data volume is large (the dimension combinations under grouping sets can be split into several batches to handle large data).