Hive query examples

Group by one column, order by another

row_number() alone is enough to get per-group ordering:

select id,up,row_number() over(partition by substring(id,1,2) order by up) 
from temp.setup_cleanup ;
id              up      row_number
13760778710     120     1
13926435656     132     2
13480253104     180     3
13926251106     240     4
13719199419     240     5
13826544101     264     6
15989002119     1938    1
15920133257     3156    2
15013685858     3659    3
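
If tied up values should share the same number, rank() and dense_rank() are drop-in alternatives to row_number(); a minimal sketch on the same table:

select id,up,
row_number() over(partition by substring(id,1,2) order by up) as rn,  -- unique 1,2,3,... even on ties
rank() over(partition by substring(id,1,2) order by up) as rk,        -- ties share a number, gaps follow
dense_rank() over(partition by substring(id,1,2) order by up) as drk  -- ties share a number, no gaps
from temp.setup_cleanup;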

Grouped top-N: compute the rank in a subquery, then filter in the outer query

select * from 
(select id,up,row_number() over(partition by substring(id,1,2) order by up) as rank
from temp.setup_cleanup) a 
where rank<=3;
id              up      rank
13760778710     120     1
13926435656     132     2
13480253104     180     3
15989002119     1938    1
15920133257     3156    2
15013685858     3659    3

Group aggregation plus each group's share of the total

type    qty
a       1
a       2
b       3
c       5
a       6
c       3
select type,sum(qty),
sum(sum(qty)) over(partition by 1),
sum(qty)/sum(sum(qty)) over(partition by 1)  
from temp.x group by type;
type    sum     total   per
c       8       20      0.4
b       3       20      0.15
a       9       20      0.45
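
The same ratios can be computed without a window function by cross-joining the grand total back in; a sketch for comparison (qty_sum and total are illustrative aliases):

select t.type,t.qty_sum,g.total,t.qty_sum/g.total as per
from (select type,sum(qty) as qty_sum from temp.x group by type) t
CROSS JOIN (select sum(qty) as total from temp.x) g;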

Use explain on the statement to analyze its execution

The plan shows that select type,sum(qty) from x group by type; runs first, and the window computation is then applied to that result set.
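
The plan below is the output of prefixing the statement with explain:

explain
select type,sum(qty),
sum(sum(qty)) over(partition by 1),
sum(qty)/sum(sum(qty)) over(partition by 1)
from temp.x group by type;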

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 60)
        Reducer 3 <- Reducer 2 (PARTITION-LEVEL SORT, 60)
      DagName: hadoop_20190407123138_845b100c-26b2-48f8-bfe3-a8a2e0b6e29b:19
      Vertices:
        // (Map 1 + Reducer 2) run select type,sum(qty) from x group by type;
        Map 1    
            Map Operator Tree:
                TableScan
                  alias: x   // data is read from table x here
                  Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
                  Select Operator   // reads the two columns type and qty
                    expressions: type (type: string), qty (type: int)
                    outputColumnNames: type, qty
                    Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
                    Group By Operator   // note: the first group-by sum aggregation runs on the map side, acting as a combiner
                      aggregations: sum(qty)
                      keys: type (type: string)
                      mode: hash
                      outputColumnNames: _col0, _col1   // yields _col0: type, _col1: sum(qty)
                      Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: string)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: string)
                        Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col1 (type: bigint)
        Reducer 2    // the second group-by sum aggregation runs on the reduce side
            Reduce Operator Tree:
              Group By Operator
                aggregations: sum(VALUE._col0)
                keys: KEY._col0 (type: string)    
                mode: mergepartial
                outputColumnNames: _col0, _col1    
                Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: 1 (type: int)    
                  sort order: +    
                  Map-reduce partition columns: 1 (type: int)    
                  Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col0 (type: string), _col1 (type: bigint)
        // the window computation over the aggregated result happens here
        Reducer 3
            Reduce Operator Tree:
              Select Operator
                expressions: VALUE._col0 (type: string), VALUE._col1 (type: bigint)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
                PTF Operator
                  Function definitions:    // windowing begins: a window sum is computed for every row as sum_window_0
                      Input definition
                        input alias: ptf_0
                        output shape: _col0: string, _col1: bigint
                        type: WINDOWING
                      Windowing table definition
                        input alias: ptf_1
                        name: windowingtablefunction
                        order by: 1 ASC NULLS FIRST
                        partition by: 1    // rows are partitioned by the constant 1, matching partition by 1 in the query
                        raw input shape:
                        window functions:    
                            window function definition
                              alias: sum_window_0
                              arguments: _col1
                              name: sum
                              window function: GenericUDAFSumLong
                              window frame: PRECEDING(MAX)~FOLLOWING(MAX)
                  Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
                  // assemble the output: _col0, _col1, sum_window_0, _col1/sum_window_0
                  Select Operator    
                    expressions: _col0 (type: string), _col1 (type: bigint), sum_window_0 (type: bigint), (UDFToDouble(_col1) / UDFToDouble(sum_window_0)) (type: double)
                    outputColumnNames: _col0, _col1, _col2, _col3
                    Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
                      table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Three consecutive days with spend > 100

stdate          num
2019-01-01      12
2019-01-02      135
2019-01-03      129
2019-01-04      0
2019-01-05      166
2019-01-06      110
2019-01-07      178
2019-01-08      198
2019-01-09      13
2019-01-10      178
2019-01-11      190
2019-01-12      121
2019-01-13      16
select s1.stdate,s1.num from temp.xzq_y s1
LEFT JOIN temp.xzq_y s2 on s1.stdate=date_add(s2.stdate,-2)  -- s2 is two days after s1
LEFT JOIN temp.xzq_y s3 on s1.stdate=date_add(s3.stdate,-1)  -- s3 is one day after s1
LEFT JOIN temp.xzq_y s4 on s1.stdate=date_add(s4.stdate,1)   -- s4 is one day before s1
LEFT JOIN temp.xzq_y s5 on s1.stdate=date_add(s5.stdate,2)   -- s5 is two days before s1
where (s1.num>100 and s2.num>100 and s3.num>100)  -- s1 starts a 3-day run
or (s1.num>100 and s3.num>100 and s4.num>100)     -- s1 is the middle day
or (s1.num>100 and s4.num>100 and s5.num>100);    -- s1 ends a 3-day run
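
The five-way self-join scans the table five times. A common alternative is the date-minus-row_number trick; a sketch, assuming stdate is a yyyy-MM-dd string: within a run of consecutive qualifying days, date_sub(stdate,row_number()) is constant, so the run length can be counted per group.

select stdate,num from
(select stdate,num,
count(1) over(partition by grp) as run_len  -- length of the consecutive run this day belongs to
from
(select stdate,num,
date_sub(stdate,row_number() over(order by stdate)) as grp  -- constant within a consecutive run
from temp.xzq_y where num>100) s) t
where run_len>=3;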

Customer retention rate

create table temp.a as
select DISTINCT cust_id,acct_id from temp.x where stdate='2019-01-01';
create table temp.b as
select DISTINCT cust_id,acct_id from temp.x where stdate='2019-01-02';

select s1.cust_id,sum(case when s2.cust_id is not null then 1 else 0 end)/count(s1.cust_id) 
from temp.a s1
LEFT JOIN temp.b s2
on s1.cust_id=s2.cust_id and s1.acct_id=s2.acct_id
group by s1.cust_id;
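
Dropping the group by collapses this to a single overall retention figure across all accounts; a minimal variant of the same join:

select sum(case when s2.cust_id is not null then 1 else 0 end)/count(1) as retention
from temp.a s1
LEFT JOIN temp.b s2
on s1.cust_id=s2.cust_id and s1.acct_id=s2.acct_id;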

Collect rows into an array: collect_set() deduplicates, collect_list() does not

select type,concat_ws(',',collect_list(cast(qty as string))) from temp.x group by type;

a       1
a       2          a  1,2,6  
b       3   ==>>   b  3
c       5          c  5,3
a       6
c       3
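
Swapping in collect_set() deduplicates within each group (and, like collect_list(), does not guarantee element order); if type a contained qty 1 twice, only one 1 would remain:

select type,concat_ws(',',collect_set(cast(qty as string))) from temp.x group by type;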

Two-stage group-by count in Hive with a random key suffix (data-skew mitigation)

select split(s2.year_key,'_')[0] as y,sum(cnt) from -- second aggregation: strip the suffix and merge
(select year_key,count(1) cnt from -- first aggregation
(select concat(substring(date_id,1,4),'_',round(rand()*6,0)) as year_key from ka.tb_prod) s1 -- append a random suffix
group by year_key) s2
GROUP BY split(s2.year_key,'_')[0];
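
For comparison, the plain single-stage version below routes every row of a hot year to a single reducer; the two-stage query above first spreads each year across up to seven keys (round(rand()*6,0) yields 0-6) and then merges the partial counts:

select substring(date_id,1,4) as y,count(1) from ka.tb_prod
GROUP BY substring(date_id,1,4);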

 
