Hive面试题1:复杂sql

1.等比例抽样

此场景在工作中遇到过,笔者原创。有用户表user,字段user_id, city。现运营同事要选10w人发调查问卷,要求人群的city分布,和全量用户的city分布一致。

with city_fenbu as (
    select city, user_cnt/ sum(user_cnt) over() as zhanbi 
    from (
        select city,count(user_id) as user_cnt
        from user_info 
        group by city
    ) t1
)
,user_shuffle as (
    select city, user_id, 
    row_number() over (partition by city order by rand()) as rk --用户集打乱
    from user_info
)
select aa.*, bb.*  
from user_shuffle aa join city_fenbu bb on aa.city=bb.city
where round(100000*bb.zhanbi) >= aa.rk;

2.获取连续活跃天数>=n天的用户集

经典场景,必掌握。有用户活跃日志表user_active_log, 字段:日期dt(yyyy-MM-dd),user_id。直接上sql

with uid_dt as (
    select dt, user_id
    from user_active_log
    group by dt,user_id
)
,uid_flg as (
    select user_id, dt, 
    date_sub(dt, row_number() over (partition by imei order by dt)) as flg_dt, --取值也是一个日期,但实际意义不明。同一个连续活跃区间内,flg_dt相同
)
,uid_active_days as (
    select user_id, flg_dt, count(dt) as continue_active_days
    from uid_with_flg
    group by user_id, flg_dt
)
select user_id
from uid_active_days
group by user_id 
having max(continue_active_days) >=7;

引申题,求每个用户连续活跃区间的起止日期。

with uid_dt as (
    select dt, user_id
    from user_active_log
    group by dt,user_id
)
,uid_flg as (
    select user_id, dt, 
    date_sub(dt, row_number() over (partition by imei order by dt)) as flg_dt, --取值也是一个日期,但实际意义不明。同一个连续活跃区间内,flg_dt相同
)
,uid_start_end_dt as (
    select user_id, dt, 
    first_value(dt) over (partition by user_id, flg_dt order by dt) as start_dt,--同一连续活跃区间的起始日期
    first_value(dt) over (partition by user_id, flg_dt order by dt desc) as end_dt --结束日期
)
select user_id,start_dt,end_dt 
from uid_start_end_dt
group b
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值