1.应用背景:
实际生产中,各种指标的报表统计,往往都会涉及到多维分析,比如,统计日活数,日会话次数,日回头访客数,日新,日用户平均访问时长,访问深度……都需要从不同维度,各种角度去分析,如果上述维度分析需求,都逐个开发计算sql(逐个去group by聚合),工作繁冗!
那么,如何解决这个问题呢?
2.实现过程:
2.1 关键要点:
- 创建一个统一的目标维度分析聚合结果表,这个表应该包含所有的维度字段
- 利用hive的高阶聚合函数,在一个sql中,即可计算出所有可能的维度组合
2.2 Cube表模型:这种表,在业内通常被称之为: cube (多维数据立方体)
省 | 市 | 区 | 手机型号 | 操作系统 | App版本 | 下载渠道 | 小时段 | 日活总数 |
江西 | \n | \n | \n | \n | \n | \n | \n | 1000 |
江苏 | \n | \n | \n | \n | \n | \n | \n | 1500 |
河南 | \n | \n | \n | \n | \n | \n | \n | 1800 |
…… |
|
|
|
|
|
|
|
|
江西 | 九江 | \n | \n | \n | \n | \n | \n | 800 |
江西 | 赣州 | \n | \n | \n | \n | \n | \n | 600 |
江西 | 南昌 | \n | \n | \n | \n | \n | \n | 450 |
江西 | …… | \n | \n | \n | \n | \n | \n | 550 |
江苏 | 南通 | \n | \n | \n | \n | \n | \n | 660 |
江苏 | 苏州 | \n | \n | \n | \n | \n | \n | 540 |
江苏 | 徐州 | \n | \n | \n | \n | \n | \n | 400 |
江苏 | …… | \n | \n | \n | \n | \n | \n | 320 |
\n | \n | \n | MI6 | \n | \n | \n | \n | 1500 |
\n | \n | \n | MI8 | \n | \n | \n | \n | 2200 |
\n | \n | \n | MATE10 | \n | \n | \n | \n | 1800 |
\n | \n | \n | IPHONE6 | \n | \n | \n | \n | 1200 |
\n | \n | \n | …… | \n | \n | \n | \n | …… |
|
|
|
|
|
|
|
|
|
2.3 如果我要从上面的表中,获取到 各省份日活数,如何获取?
SELECT
province,
dau_cnt
FROM cube
WHERE province is not null and coalesce(city,district,devicetype,osname,....) is null
2.4 什么是维度的基数?
上述表的行数很大
比如按(省、市、区、手机型号、app版本、下载渠道、小时段)维度组合计算日活数,结果行数有: 省维度的基数 * 市维度的基数 * 区维度的基数 * ……
基数: 就是某个维度字段的去重值个数!
3. hive为解决以上问题所提供的高阶函数:
3.1 With cube函数:将所有可能要参与维度的字段都写在group by 后面,with cube 函数就会自动的将所有各种维度都统计出来
INSERT INTO TABLE cube
SELECT
province,
city,
district,
device_type,
os_name,
app_version,
release_channel,
hour_segement,
count(distinct guid) as dau_cnt
FROM t_src
GROUP BY
province,
city,
district,
device_type,
os_name,
app_version,
release_channel,
house_segment
WITH CUBE
;
3.2 Grouping sets函数:这个可以由用户自己决定需要哪些维度组合,将自己需要的维度组合都写在Grouping sets函数后
INSERT INTO TABLE cube
SELECT
province,
city,
district,
device_type,
os_name,
app_version,
release_channel,
hour_segement,
count(distinct guid) as dau_cnt
FROM t_src
GROUP BY
province,
city,
district,
device_type,
os_name,
app_version,
release_channel,
house_segment
GROUPING SETS(
(),
(province),
(province,city),
(province,city,district),
(device_type)
)
;
3.3 With rollup函数:主要针对层级维度的组合处理,假如现在的组合是省+市+区三个维度进行组合,with rollup 就会自动的一级一级往上卷,变成省+市,最后是省
INSERT INTO TABLE cube
SELECT
province,
city,
district,
device_type,
os_name,
app_version,
release_channel,
hour_segement,
count(distinct guid) as dau_cnt
FROM t_src
GROUP BY province,city,district
WITH ROLLUP
;