1.等比例抽样
此场景在工作中遇到过,笔者原创。有用户表user,字段user_id, city。现运营同事要选10w人发调查问卷,要求人群的city分布,和全量用户的city分布一致。
with city_fenbu as (
select city, user_cnt/ sum(user_cnt) over() as zhanbi
from (
select city,count(user_id) as user_cnt
from user_info
group by city
) t1
)
,user_shuffle as (
select city, user_id,
row_number() over (partition by city order by rand()) as rk --用户集打乱
from user_info
)
select aa.*, bb.*
from user_shuffle aa join city_fenbu bb on aa.city=bb.city
where round(100000*bb.zhanbi) >= aa.rk;
2.获取连续活跃天数>=n天的用户集
经典场景,必掌握。有用户活跃日志表user_active_log, 字段:日期dt(yyyy-MM-dd),user_id。直接上sql
with uid_dt as (
select dt, user_id
from user_active_log
group by dt,user_id
)
,uid_flg as (
select user_id, dt,
date_sub(dt, row_number() over (partition by imei order by dt)) as flg_dt, --取值也是一个日期,但实际意义不明。同一个连续活跃区间内,flg_dt相同
)
,uid_active_days as (
select user_id, flg_dt, count(dt) as continue_active_days
from uid_with_flg
group by user_id, flg_dt
)
select user_id
from uid_active_days
group by user_id
having max(continue_active_days) >=7;
引申题,求每个用户连续活跃区间的起止日期。
with uid_dt as (
select dt, user_id
from user_active_log
group by dt,user_id
)
,uid_flg as (
select user_id, dt,
date_sub(dt, row_number() over (partition by imei order by dt)) as flg_dt, --取值也是一个日期,但实际意义不明。同一个连续活跃区间内,flg_dt相同
)
,uid_start_end_dt as (
select user_id, dt,
first_value(dt) over (partition by user_id, flg_dt order by dt) as start_dt,--同一连续活跃区间的起始日期
first_value(dt) over (partition by user_id, flg_dt order by dt desc) as end_dt --结束日期
)
select user_id,start_dt,end_dt
from uid_start_end_dt
group b