背景:
有一批样本,比如是1000w,包含了若干个车系,现在我想从这1000w里获取到每个车系40%的数据,为了保证样本的散列,随机获取。
实现方式:
方式一:
select
series,
brand
from
(select series,brand,
model,
count(1) over(partition by series) as count, --获取到各个分组的总数
row_number() over(partition by series order by rand()) rank
from t1
where series !='' and series is not null )t
where rank < count*0.4 --获取40%即可
;
方式二:
with
t_1 as (
select series, brand
row_number() over(partition by series order by rand()) as rank
from a
),
t_2 as (
select
series, max(rank)/40 as max_rank
from b
group by series
)
select a.*
from t_1 a
left join t_2 b
on a.series = b.series
and a.rn < b.max_rank