用hive实现了随机抽样中简单随机、系统和分层抽样的方式,记得抽样的概念还是初中数据接触的 其实很多时候不需要理论,想也是可以想到的,不过还是总结一下
- drop table songpo_test;
- create table if not exists songpo_test
- (
- refund_id string,
- user_id string,
- cat_id string,
- cat2_id string,
- org_id string,
- gmt_create string
- )
- partitioned by(pt string)
- row format delimited
- fields terminated by ','
- lines terminated by '\n' STORED AS SEQUENCEFILE;
1. 简单随机抽样 (rand()) 从表中,随机打标,排序,随机抽取100个用户数据
步骤 1). 给每行记录一个相同的标记
2). 排序,取前100
- sql:
- select *
- from(
- select user_id,flag from (select user_id,'1' as flag from songpo_test) x
- distribute by user_id sort by user_id,flag desc
- )a
- where row_number(user_id)<=100;
2.系统抽样 mod,rand() 依照userrid取模,分5组,每组随机抽取100个用户,实现如:
1). 依据user_id,取模,获取 mod_numd
2). 在mod_num组内然后随机排序,
3). 从各组取出20条
- sql:
- select *
- from(
- select refund_id,user_id,mod_num,rank_num from (select refund_id,user_id,cast(10+rand()*100 as double) rank_num,user_id%5 as mod_num from songpo_test)
- distribute by mod_num sort by mod_num,rank_num desc
- )a
- where row_number(mod_num)<=20;
3.分层抽样 按照每个组的记录数来分层抽样。假设需要抽取EXTRA_NUM条记录
1). 计算每个分区需要抽记录条数
2). 在mod_num组内然后随机排序,
3). 从各组取出cat_extra_num条
- drop table test_data_extra_indexs;
- create table test_data_extra_indexs as
- select a.cat_id,cat_num,all_num,cat_num/all_num as extra_lv,(cat_num/all_num)*'EXTRA_NUM' as cat_extra_num,c.refund_id,c.user_id,c.org_id from
- (select cat_id,count(1) as cat_num,'1' as key from songpo_test group by cat_id) a
- join
- (select '1' as key,count(1) as all_num from songpo_test) b
- on a.key=b.key
- join
- (select * from songpo_test) c
- on a.cat_id=c.cat_id;
- select *
- from(
- select refund_id,user_id,cat_id,mod_num,rank_num from
- select refund_id,user_id,cat_id,cast(10+rand()*100 as double) rank_num,user_id%5 as mod_num,cat_extra_num from(
- (select refund_id,user_id,cat_id,cast(10+rand()*100 as double) rank_num,user_id%5 as mod_num from test_data_extra_indexs) x
- )
- distribute by mod_num sort by mod_num,rank_num desc
- )a
- where row_number(mod_num)<=20;