This post covers another scenario in multidimensional analysis optimization.
[Outline]
1. Problem background
2. The thinking behind the solution
3. The fix (how Spark SQL handles Parquet row groups and how partitions are formed, parameter testing, and the final solution)
4. Results

1. Problem description
The code is as follows:
select
    netease_user,
    if(campaign_id is null, 'all', campaign_id) as campaign_id,
    if(spec_id is null, 'all', spec_id) as spec_id,
    if(app_bundle is null, 'all', app_bundle) as app_bundle,
    if(render_name is null, 'all', render_name) as render_name,
    platform,
    sum(bidfloor) as success_bidfloor,
    count(distinct clk_request_id) as click_pv,
    count(distinct exp_deviceid) as exp_uv,
    count(distinct exp_request_id) as exp_pv,
    count(distinct clk_deviceid) as click_uv,
    round(sum(case when winprice < 0 then 0 else winprice end) / 1000, 4) as cost
from
(
    select distinct
        nvl(netease_user, 'true') as netease_user,
        nvl(render_name, 'null') as render_name,
        platform,
        nvl(campaign_id, 'null') as campaign_id,
        nvl(spec_id, 'null') as spec_id,
        nvl(app_bundle, 'null') as app_bundle,
        clk_request_id, exp_deviceid, exp_request_id, clk_deviceid, winprice, bidfloor
    from table_a
    where day = '20190815' and platform is not null
) tmp
group by netease_user, campaign_id, spec_id, app_bundle, render_name, platform
grouping sets(
    (netease_user, platform),
    (netease_user, platform, campaign_id),
    (netease_user, platform, spec_id),
    (netease_user, platform, app_bundle),
    (netease_user, platform, render_name),
    (netease_user, platform, campaign_id, spec_id),
    (netease_user, platform, campaign_id, app_bundle),
    (netease_user, platform, campaign_id, render_name),
    (netease_user, platform, spec_id, app_bundle),
    (netease_user, platform, spec_id, render_name),
    (netease_user, platform, app_bundle, render_name),
    (netease_user, platform, campaign_id, spec_id, app_bundle),
    (netease_user, platform, spec_id, app_bundle, render_name),
    (netease_user, platform, campaign_id, app_bundle, render_name),
    (netease_user, platform, campaign_id, spec_id, render_name),
    (netease_user, campaign_id, spec_id, app_bundle, render_name, platform)
)
;
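For context, the GROUPING SETS clause above enumerates 16 dimension combinations, and semantically it is equivalent to a UNION ALL of 16 separate GROUP BY queries over the same deduplicated input. The sketch below illustrates that equivalence for just the first two sets; it is illustrative only, with the column list abbreviated, and "tmp" standing in for the deduplicated subquery in the query above.

-- Illustrative only: the first two grouping sets written as separate GROUP BY
-- queries. "tmp" refers to the deduplicated subquery over table_a above.
-- Dimensions missing from a grouping set come back as NULL, which the outer
-- if(... is null, 'all', ...) then rewrites as 'all'. Only three dimensions
-- and one metric are shown; the real query has six dimensions, six metrics
-- and 16 grouping sets.
select netease_user, cast(null as string) as campaign_id, platform,
       count(distinct exp_deviceid) as exp_uv
from tmp
group by netease_user, platform

union all

select netease_user, campaign_id, platform,
       count(distinct exp_deviceid) as exp_uv
from tmp
group by netease_user, platform, campaign_id;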
The overall logic is much the same as in the previous post, 【spark sql多维分析优化——细节是魔鬼】 (Spark SQL multidimensional analysis optimization: the devil is in the details).
The difference is that in the previous post the total volume of the base table table_a was very large