hive 抽样统计

最新推荐文章于 2024-05-10 11:53:40 发布

liaoyu1204

最新推荐文章于 2024-05-10 11:53:40 发布

阅读量861

点赞数 1

用hive实现了随机抽样中简单随机、系统和分层抽样的方式，记得抽样的概念还是初中数据接触的其实很多时候不需要理论，想也是可以想到的，不过还是总结一下

0.测试表：

    Sql代码   
    
  
 drop table songpo_test;  
 create table if not exists songpo_test  
  (  
    refund_id string,  
    user_id string,  
    cat_id  string,  
    cat2_id  string,  
    org_id  string,  
    gmt_create string  
  )  
  partitioned by(pt string)  
  row format delimited  
  fields terminated by ','  
  lines terminated by '\n' STORED AS SEQUENCEFILE;  

   1. 简单随机抽样 (rand()) 从表中，随机打标，排序，随机抽取100个用户数据
        步骤 1). 给每行记录一个相同的标记
             2). 排序,取前100

    Sql代码   
    
  
 sql:  
 select *  
 from(  
     select user_id,flag from (select user_id,'1' as flag from songpo_test) x  
     distribute by user_id sort by user_id,flag desc  
     )a  
 where row_number(user_id)<=100;  

   2.系统抽样 mod,rand() 依照userrid取模，分5组，每组随机抽取100个用户，实现如：
                1). 依据user_id，取模，获取 mod_numd
                2). 在mod_num组内然后随机排序，
                3). 从各组取出20条

    Sql代码   
    
  
  sql:  
  select *  
   from(  
       select refund_id,user_id,mod_num,rank_num from (select refund_id,user_id,cast(10+rand()*100 as double) rank_num,user_id%5 as mod_num from songpo_test)   
       distribute by mod_num sort by mod_num,rank_num desc  
       )a  
 where row_number(mod_num)<=20;  

3.分层抽样   按照每个组的记录数来分层抽样。假设需要抽取EXTRA_NUM条记录
                1). 计算每个分区需要抽记录条数
                2). 在mod_num组内然后随机排序，
                3). 从各组取出cat_extra_num条

    Sql代码   
    
  
       drop table test_data_extra_indexs;  
       create table test_data_extra_indexs as  
       select a.cat_id,cat_num,all_num,cat_num/all_num as extra_lv,(cat_num/all_num)*'EXTRA_NUM'  as cat_extra_num,c.refund_id,c.user_id,c.org_id from   
       (select cat_id,count(1) as cat_num,'1' as key from songpo_test group by cat_id) a   
       join  
       (select '1' as key,count(1) as all_num from songpo_test) b   
       on a.key=b.key  
       join  
       (select * from songpo_test) c   
       on a.cat_id=c.cat_id;  
   
   
       select *  
       from(  
         select refund_id,user_id,cat_id,mod_num,rank_num from   
         select refund_id,user_id,cat_id,cast(10+rand()*100 as double) rank_num,user_id%5 as mod_num,cat_extra_num from(  
         (select refund_id,user_id,cat_id,cast(10+rand()*100 as double) rank_num,user_id%5 as mod_num from test_data_extra_indexs)  x  
         )  
         distribute by mod_num sort by mod_num,rank_num desc  
       )a  
 where row_number(mod_num)<=20;  

liaoyu1204

关注

1
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
hive 抽样统计

用hive实现了随机抽样中简单随机、系统和分层抽样的方式，记得抽样的概念还是初中数据接触的其实很多时候不需要理论，想也是可以想到的，不过还是总结一下 0.测试表： Sql代码 drop table songpo_test; create table if not exists songpo_test ( refund_id
复制链接

扫一扫