需求背景:在进行统计分析的时候有时候会有类似这样的需求 比如求某个平台某一天所有的订单总和,或者淘宝所有pc 端的交易总和,这个时候我们可以基于原本基础的数据进行炸裂处理之后得出结结果值,方便后续进行查询,这种提前进行预聚合的思想长用于即席查询分析的场景中,比如为了适用于某张报表的多条件查询可以采用此种方式结合预聚合的方式进行操作。
数据说明:
+-----+-----------+------------+------------+---------+---------+
|id |device_type|business_gmv|order_source|pay_money|event_day|
+-----+-----------+------------+------------+---------+---------+
|10001|jingdong |16 |1 |1000 |20211020 |
|10004|jingdong |15 |1 |2000 |20211021 |
+-----+-----------+------------+------------+---------+---------+
使用炸裂函数进行如下操作:
select
device_type_all,
business_gmv_all,
order_source,
pay_money,
event_day_all
from order_hbi
LATERAL VIEW OUTER EXPLODE(array('ALL', device_type)) col1 AS device_type_all
LATERAL VIEW OUTER EXPLODE(array('ALL', business_gmv)) col1 AS business_gmv_all
LATERAL VIEW OUTER EXPLODE(array('ALL', event_day)) col1 AS event_day_all
order by event_day_all asc
语句说明:
1.首先将device_type 和ALL的组合数组 进行炸裂,那么这里原来的两行数据在各自加上ALL 之后会得到一共四行数据:
+---------------+----------------+------------+---------+-------------+
|device_type_all|business_gmv_all|order_source|pay_money|event_day_all|
+---------------+----------------+------------+---------+-------------+
|ALL |16 |1 |1000 |20211020 |
|jingdong |16 |1 |1000 |20211020 |
|ALL |15 |1 |2000 |20211021 |
|jingdong |15 |1 |2000 |20211021 |
+---------------+----------------+------------+---------+-------------+
2.将business_gmv 和ALL 的组合数组进行炸裂,则由上述的数据再次翻倍,即每一行在基于business_gmv 进行和ALL 的炸裂则得到如下的八行数据
+---------------+----------------+------------+---------+-------------+
|device_type_all|business_gmv_all|order_source|pay_money|event_day_all|
+---------------+----------------+------------+---------+-------------+
|ALL |ALL |1 |1000 |20211020 |
|ALL |16 |1 |1000 |20211020 |
|jingdong |ALL |1 |1000 |20211020 |
|jingdong |16 |1 |1000 |20211020 |
|ALL |ALL |1 |2000 |20211021 |
|jingdong |15 |1 |2000 |20211021 |
|ALL |15 |1 |2000 |20211021 |
|jingdong |ALL |1 |2000 |20211021 |
+---------------+----------------+------------+---------+-------------+
再基于上述结果进行event_day 的炸裂,则最终可以得到如下的结果的16行数据:
+---------------+----------------+------------+---------+-------------+
|device_type_all|business_gmv_all|order_source|pay_money|event_day_all|
+---------------+----------------+------------+---------+-------------+
|ALL |ALL |1 |1000 |20211020 |
|ALL |16 |1 |1000 |20211020 |
|jingdong |ALL |1 |1000 |20211020 |
|jingdong |16 |1 |1000 |20211020 |
|jingdong |15 |1 |2000 |20211021 |
|ALL |ALL |1 |2000 |20211021 |
|ALL |15 |1 |2000 |20211021 |
|jingdong |ALL |1 |2000 |20211021 |
|ALL |ALL |1 |1000 |ALL |
|ALL |16 |1 |1000 |ALL |
|jingdong |16 |1 |1000 |ALL |
|jingdong |15 |1 |2000 |ALL |
|jingdong |ALL |1 |1000 |ALL |
|ALL |ALL |1 |2000 |ALL |
|ALL |15 |1 |2000 |ALL |
|jingdong |ALL |1 |2000 |ALL |
+---------------+----------------+------------+---------+-------------+
基于上述结果统计的目的是可以统计多纬度的指标的聚合结果
4.加上group by 进行统计结果的分析,聚合订单金额
select
device_type_all,
business_gmv_all,
order_source,
sum(pay_money) pay_money,
event_day_all
from order_hbi
LATERAL VIEW OUTER EXPLODE(array('ALL', device_type)) col1 AS device_type_all
LATERAL VIEW OUTER EXPLODE(array('ALL', business_gmv)) col1 AS business_gmv_all
LATERAL VIEW OUTER EXPLODE(array('ALL', event_day)) col1 AS event_day_all
group by device_type_all,business_gmv_all,order_source,event_day_all
order by event_day_all asc
可以得到如下的结果:
+---------------+----------------+------------+---------+-------------+
|device_type_all|business_gmv_all|order_source|pay_money|event_day_all|
+---------------+----------------+------------+---------+-------------+
|jingdong |ALL |1 |1000.0 |20211020 |
|ALL |ALL |1 |1000.0 |20211020 |
|ALL |16 |1 |1000.0 |20211020 |
|jingdong |16 |1 |1000.0 |20211020 |
|jingdong |15 |1 |2000.0 |20211021 |
|ALL |15 |1 |2000.0 |20211021 |
|jingdong |ALL |1 |2000.0 |20211021 |
|ALL |ALL |1 |2000.0 |20211021 |
|ALL |16 |1 |1000.0 |ALL |
|jingdong |16 |1 |1000.0 |ALL |
|ALL |ALL |1 |3000.0 |ALL |
|ALL |15 |1 |2000.0 |ALL |
|jingdong |15 |1 |2000.0 |ALL |
|jingdong |ALL |1 |3000.0 |ALL |
+---------------+----------------+------------+---------+-------------+
比如我们可以直接从这个结果中选择所有设备类型(device_type_all)并且所有的bussiness_gmv 并且所有日期的订单总量则可以直接选取如下的数据,订单总金额为3000
|ALL |ALL |1 |3000.0 |ALL
也可以宣组jingdong 所有business_gmv_all 所有日期的订单总额则为:3000
|jingdong |ALL |1 |3000.0 |ALL |
假如不想要某个字段的聚合结果可以用 !=ALL 先过滤掉,比如:business_gmv_all !=‘ALL’
SELECT device_type_all,sum(pay_money) gmv,event_day_all
FROM (
select
device_type_all,business_gmv_all,order_source,
sum(pay_money) pay_money,
event_day_all
from order_hbi
LATERAL VIEW OUTER EXPLODE(array('ALL', device_type)) col1 AS device_type_all
LATERAL VIEW OUTER EXPLODE(array('ALL', business_gmv)) col1 AS business_gmv_all
LATERAL VIEW OUTER EXPLODE(array('ALL', event_day)) col1 AS event_day_all
group by device_type_all,business_gmv_all,order_source,event_day_all
order by event_day_all asc
) tmp
where business_gmv_all !='ALL'
group by device_type_all,event_day_all
得到如下结果:可以选取想要的数据避免重复计算,直接选取即可
+---------------+------+-------------+
|device_type_all|gmv |event_day_all|
+---------------+------+-------------+
|ALL |1000.0|20211020 |
|jingdong |1000.0|20211020 |
|jingdong |2000.0|20211021 |
|ALL |2000.0|20211021 |
|ALL |3000.0|ALL |
|jingdong |3000.0|ALL |
+---------------+------+-------------+