grouping sets和with cube以及array+explode做多维度聚合统计

1、grouping sets

原理:等同于先按不同维度group by再union all到一起

示例:

select
    nvl(union_id, 'All') as union_id,
    nvl(union_name, 'All') as union_name,
    sum(if(dt = '${hiveconf:part}', salary_wage_day, 0)) as total_this_day,
    sum(if(dt = getlastmonthpart('${hiveconf:part}', 3), salary_wage, 0)) as total_last_month_whole,
    sum(if(dt = '${hiveconf:part}', salary_wage, 0)) as total_this_month,
    sum(if(dt = getlastmonthpart('${hiveconf:part}', 2), salary_wage, 0)) as total_last_month,
    sum(if(dt = '${hiveconf:part}', audio_gift, 0)) as gift_this_month,
    sum(if(dt = getlastmonthpart('${hiveconf:part}', 2), audio_gift, 0)) as gift_last_month,
    sum(if(dt = '${hiveconf:part}', diamond_exchange, 0)) as exchange_this_month,
    sum(if(dt = getlastmonthpart('${hiveconf:part}', 2), diamond_exchange, 0)) as exchange_last_month,
    sum(if(dt = '${hiveconf:part}', diamond_balance, 0)) as balance_this_month,
    sum(if(dt = getlastmonthpart('${hiveconf:part}', 2), diamond_balance, 0)) as balance_last_month,
    sum(if(dt = '${hiveconf:part}', exchange_wallet, 0)) as ex_wallet_this_month,
    sum(if(dt = getlastmonthpart('${hiveconf:part}', 2), exchange_wallet, 0)) as ex_wallet_last_month
from
    tl_north_star_de_vj_live_diamond_summary_month
where dt in ('${hiveconf:part}', getlastmonthpart('${hiveconf:part}', 2), getlastmonthpart('${hiveconf:part}', 3)) and country = 'ID'
group by union_id, union_name
    grouping sets((), (union_id, union_name))

1、()代表union_id和union_name都为null的总和

2、(union_id, union_name)代表没有进行任何的汇总统计,等同于group by union_id, union_name

3、(union_id)代表只对union_name做汇总,即会出现union_id有值,unino_name为null的情况

优点:便于控制不同的维度,比较灵活

缺点:维度多时需要一个个手写,麻烦且容易遗漏,而且Hive有些版本不支持此功能

2、with cube

原理:一样,还是等同于先按不同维度group by再union all到一起

示例:

select
    date_udf('${hiveconf:part}', '-') as ptdate,
    nvl(nvl(game_region_name, 'Other'), 'All') as game_region,
    nvl(case game_name
        when 'LuckyGift' then 'Lucky Gift'
        when 'GuessCar' then 'Guess Car'
        when 'GiftForging' then 'Gift Forging'
        when 'Draw' then 'Draw'
        when 'RandomGift' then 'Random Gift'
        when '占卜游戏' then '占卜游戏'
        else game_name
    end, 'All') as game_name,
    round(sum(day_amount)/143, 2) as day_amount,
    round(sum(day_consume)/143, 2) as day_consume,
    round(sum(month_amount)/143, 2) as month_amount,
    round(sum(month_consume)/143, 2) as month_consume,
    round(sum(last_month_amount)/143, 2) as last_month_amount,
    round((sum(month_amount)-sum(last_month_amount))/143, 2) as difference,
    round((sum(month_amount)-sum(last_month_amount))/abs(sum(last_month_amount)), 4) as chain_ratio,
    sum(bet_pv) as players_pv,
    size(collect_set(if(day_consume != 0, uid, null))) as players_uv
from
    (
    select
        uid,
        country,
        game_name,
        sum(if(dt = '${hiveconf:part}', bet_pv, 0)) as bet_pv,
        sum(if(dt = '${hiveconf:part}', profit_amount / if(game_type != 'silver_game', 1, 10), 0)) as day_amount,
        sum(if(dt = '${hiveconf:part}', bet_amount / if(game_type != 'silver_game', 1, 10), 0)) as day_consume,
        sum(if(dt between substr('${hiveconf:part}', 0, 6)||'01' and '${hiveconf:part}', profit_amount / if(game_type != 'silver_game', 1, 10), 0)) as month_amount,
        sum(if(dt between substr('${hiveconf:part}', 0, 6)||'01' and '${hiveconf:part}', bet_amount / if(game_type != 'silver_game', 1, 10), 0)) as month_consume,
        sum(if(dt between getlastmonthpart('${hiveconf:part}', 1) and getlastmonthpart('${hiveconf:part}', 2), profit_amount / if(game_type != 'silver_game', 1, 10), 0)) as last_month_amount
    from
        ads_user_game_summary
    where dt between getlastmonthpart('${hiveconf:part}', 1) and '${hiveconf:part}'
      and game_name in ('LuckyGift', 'GuessCar', 'GiftForging', 'Draw', 'RandomGift', '占卜游戏')
    group by uid,country,game_name
    ) country_game_info
left join
    dim_country_region_mapping as country_region
on country_game_info.country = country_region.country_code
    and country_region.dt = '${hiveconf:part}'
group by nvl(game_region_name, 'Other'),game_name
    with cube

with cube会直接对nvl(game_region_name, 'Other'),game_name进行多维度聚合,会分为以下四组:

1、nvl(game_region_name, 'Other'),game_name

2、nvl(game_region_name, 'Other'),null

3、null,game_name

4、null,null

优点:写起来比grouping sets方便简单

缺点:不灵活,只能所有维度分类都统计,最后不想要哪个再最外层做过滤,而且Hive有些版本不支持此功能

3、array+explode

原理:先为所有要汇总的维度构建array数组,例如array(nvl(t3.region, 'OTHER'), 'ALL'),最后再将所有构建的数组炸裂,变成两份,一份做正常统计,一份做汇总统计

示例:

select
    C1.region as region,
    C2.country as country,
    C3.gender as gender,
    C4.media_source as media_source,
    C6.campaign as campaign,
    C7.campaign_type as campaign_type,
    size(collect_set(afid)) as install_uv,
    size(collect_set(active_1)) as active_uv,
    if(size(collect_set(active_2)) = 0, null, size(collect_set(active_2))) / size(collect_set(active_1)) as retain_rate_2,
    if(size(collect_set(active_3)) = 0, null, size(collect_set(active_3))) / size(collect_set(active_1)) as retain_rate_3,
    if(size(collect_set(active_4)) = 0, null, size(collect_set(active_4))) / size(collect_set(active_1)) as retain_rate_4,
    if(size(collect_set(active_5)) = 0, null, size(collect_set(active_5))) / size(collect_set(active_1)) as retain_rate_5,
    if(size(collect_set(active_6)) = 0, null, size(collect_set(active_6))) / size(collect_set(active_1)) as retain_rate_6,
    if(size(collect_set(active_7)) = 0, null, size(collect_set(active_7))) / size(collect_set(active_1)) as retain_rate_7,
    if(size(collect_set(active_8)) = 0, null, size(collect_set(active_8))) / size(collect_set(active_1)) as retain_rate_8,
    if(size(collect_set(active_9)) = 0, null, size(collect_set(active_9))) / size(collect_set(active_1)) as retain_rate_9,
    if(size(collect_set(active_10)) = 0, null, size(collect_set(active_10))) / size(collect_set(active_1)) as retain_rate_10,
    if(size(collect_set(active_11)) = 0, null, size(collect_set(active_11))) / size(collect_set(active_1)) as retain_rate_11,
    if(size(collect_set(active_12)) = 0, null, size(collect_set(active_12))) / size(collect_set(active_1)) as retain_rate_12,
    if(size(collect_set(active_13)) = 0, null, size(collect_set(active_13))) / size(collect_set(active_1)) as retain_rate_13,
    if(size(collect_set(active_14)) = 0, null, size(collect_set(active_14))) / size(collect_set(active_1)) as retain_rate_14,
    if(size(collect_set(active_15)) = 0, null, size(collect_set(active_15))) / size(collect_set(active_1)) as retain_rate_15,
    if(size(collect_set(active_16)) = 0, null, size(collect_set(active_16))) / size(collect_set(active_1)) as retain_rate_16,
    if(size(collect_set(active_17)) = 0, null, size(collect_set(active_17))) / size(collect_set(active_1)) as retain_rate_17,
    if(size(collect_set(active_18)) = 0, null, size(collect_set(active_18))) / size(collect_set(active_1)) as retain_rate_18,
    if(size(collect_set(active_19)) = 0, null, size(collect_set(active_19))) / size(collect_set(active_1)) as retain_rate_19,
    if(size(collect_set(active_20)) = 0, null, size(collect_set(active_20))) / size(collect_set(active_1)) as retain_rate_20,
    if(size(collect_set(active_21)) = 0, null, size(collect_set(active_21))) / size(collect_set(active_1)) as retain_rate_21,
    if(size(collect_set(active_22)) = 0, null, size(collect_set(active_22))) / size(collect_set(active_1)) as retain_rate_22,
    if(size(collect_set(active_23)) = 0, null, size(collect_set(active_23))) / size(collect_set(active_1)) as retain_rate_23,
    if(size(collect_set(active_24)) = 0, null, size(collect_set(active_24))) / size(collect_set(active_1)) as retain_rate_24,
    if(size(collect_set(active_25)) = 0, null, size(collect_set(active_25))) / size(collect_set(active_1)) as retain_rate_25,
    if(size(collect_set(active_26)) = 0, null, size(collect_set(active_26))) / size(collect_set(active_1)) as retain_rate_26,
    if(size(collect_set(active_27)) = 0, null, size(collect_set(active_27))) / size(collect_set(active_1)) as retain_rate_27,
    if(size(collect_set(active_28)) = 0, null, size(collect_set(active_28))) / size(collect_set(active_1)) as retain_rate_28,
    if(size(collect_set(active_29)) = 0, null, size(collect_set(active_29))) / size(collect_set(active_1)) as retain_rate_29,
    if(size(collect_set(active_30)) = 0, null, size(collect_set(active_30))) / size(collect_set(active_1)) as retain_rate_30,
    0 as cost,
    0 as cpi,
    0 as retain_1_cost,
    0 as revenue,
    C5.is_new as is_new,
    install_day as p_date
from
    (
    select
        t0.install_day,
        t0.afid,
        array(nvl(t3.region, 'OTHER'), 'ALL') as region_array,
        array(nvl(t2.country, 'OTHER'), 'ALL') as country_array,
        array(if(t2.gender = 1, '男', '女'), 'ALL') as gender_array,
        array(nvl(t0.media_source, 'unknown'), 'ALL') as media_source_array,
        array(if(t0.install_day = t2.regist_day, '是', '否'), 'ALL') as is_new_array,
        array(nvl(t0.campaign, 'unknown'), 'ALL') as campaign_array,
        array(nvl(t0.campaign_type, 'unknown'), 'ALL') as campaign_type_array,
        t1.uid as active_1,
        if(t0.install_day = date_sub(t4.active_day,1), t1.uid, null) as active_2,
        if(t0.install_day = date_sub(t4.active_day,2), t1.uid, null) as active_3,
        if(t0.install_day = date_sub(t4.active_day,3), t1.uid, null) as active_4,
        if(t0.install_day = date_sub(t4.active_day,4), t1.uid, null) as active_5,
        if(t0.install_day = date_sub(t4.active_day,5), t1.uid, null) as active_6,
        if(t0.install_day = date_sub(t4.active_day,6), t1.uid, null) as active_7,
        if(t0.install_day = date_sub(t4.active_day,7), t1.uid, null) as active_8,
        if(t0.install_day = date_sub(t4.active_day,8), t1.uid, null) as active_9,
        if(t0.install_day = date_sub(t4.active_day,9), t1.uid, null) as active_10,
        if(t0.install_day = date_sub(t4.active_day,10), t1.uid, null) as active_11,
        if(t0.install_day = date_sub(t4.active_day,11), t1.uid, null) as active_12,
        if(t0.install_day = date_sub(t4.active_day,12), t1.uid, null) as active_13,
        if(t0.install_day = date_sub(t4.active_day,13), t1.uid, null) as active_14,
        if(t0.install_day = date_sub(t4.active_day,14), t1.uid, null) as active_15,
        if(t0.install_day = date_sub(t4.active_day,15), t1.uid, null) as active_16,
        if(t0.install_day = date_sub(t4.active_day,16), t1.uid, null) as active_17,
        if(t0.install_day = date_sub(t4.active_day,17), t1.uid, null) as active_18,
        if(t0.install_day = date_sub(t4.active_day,18), t1.uid, null) as active_19,
        if(t0.install_day = date_sub(t4.active_day,19), t1.uid, null) as active_20,
        if(t0.install_day = date_sub(t4.active_day,20), t1.uid, null) as active_21,
        if(t0.install_day = date_sub(t4.active_day,21), t1.uid, null) as active_22,
        if(t0.install_day = date_sub(t4.active_day,22), t1.uid, null) as active_23,
        if(t0.install_day = date_sub(t4.active_day,23), t1.uid, null) as active_24,
        if(t0.install_day = date_sub(t4.active_day,24), t1.uid, null) as active_25,
        if(t0.install_day = date_sub(t4.active_day,25), t1.uid, null) as active_26,
        if(t0.install_day = date_sub(t4.active_day,26), t1.uid, null) as active_27,
        if(t0.install_day = date_sub(t4.active_day,27), t1.uid, null) as active_28,
        if(t0.install_day = date_sub(t4.active_day,28), t1.uid, null) as active_29,
        if(t0.install_day = date_sub(t4.active_day,29), t1.uid, null) as active_30
    from    --30天买量信息
        (
        select
            date_udf(dt,'-') as install_day,
            appsflyer_id as afid,
            media_source,
            campaign,
            split(campaign,'-')[2] as campaign_type
        from
            dwd_af_event_detail
        where date_udf(dt,'-') between date_sub('${yesterday}',29) and '${yesterday}'
            and event_name='install'
            and media_source is not null
            and media_source not in ('None','organic','User_invite','ShareLive','null')
        )t0
    join   --uid
        (
        select
            uid,
            afid,
            to_date(create_time) as day_key
        from
            dim_uid_did_afid_mapping
        where date_udf(dt,'-')='${yesterday}'
        group by uid,afid,to_date(create_time)
        )t1
    on t0.afid=t1.afid and t0.install_day=t1.day_key
    join   ---用户基本信息
        (
        select
            uid,
            gender,
            country,
            to_date(create_time) as regist_day
        from
            ods_user_info
        where date_udf(dt,'-')='${yesterday}' and gender is not null
        )t2
    on t1.uid=t2.uid
    left join   --region
        (
        select
            country,
            region
        from
            dim_country_region_mapping
        where date_udf(dt,'-')='${yesterday}'
        )t3
    on t2.country=t3.country
    left join --活跃信息
        (
        select
            uid,
            date_udf(dt,'-') as active_day
        from
            ads_user_active_info
        where date_udf(dt,'-') between date_sub('${yesterday}',29) and '${yesterday}'
        )t4
    on t1.uid=t4.uid
    )t
lateral view explode(region_array)C1 as region
lateral view explode(country_array)C2 as country
lateral view explode(gender_array)C3 as gender
lateral view explode(media_source_array)C4 as media_source
lateral view explode(is_new_array)C5 as is_new
lateral view explode(campaign_array)C6 as campaign
lateral view explode(campaign_type_array)C7 as campaign_type
group by install_day,C1.region,C2.country,C3.gender,C4.media_source,C5.is_new,C6.campaign,C7.campaign_type

优点:

1、比with cube灵活,可以定制化想要汇总和不想要汇总的维度,无非就是是否要构建数组以及炸裂的问题,例如这里我就不想对install_day注册日期进行汇总

2、同时也比grouping sets写起来简单,不容易遗漏,可以说是很好的避免了with cube和grouping sets的缺点

3、另外如果汇总维度超过3个以上,性能有明显提升,要知道如果使用with cube,有两个维度,底层就是4段分组sql进行union all,而如果有三个维度,就是8段sql,有多少个维度就是2的多少指数倍,而这种方式,无论多少维度底层都只分一次组

缺点:

难以理解与掌握,新手容易出错,且当汇总维度低于3个时,写起来麻烦,性能也没有太大提升

4、总结

1、如果汇总维度低于3个,且需要定制化,则用grouping sets,如果超过3个,则用array+explode

2、如果汇总维度低于3个,且无需定制化,则用with cube,如果超过3个,则用array+explode

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值