Hive window functions

键盘 | 书生

Article link: https://blog.csdn.net/weixin_43976998/article/details/118067544

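1. Windowing functions
  lag(col, n, 'x'): returns the value of col from the n-th row before the current row, or the default 'x' if there is no such row. The number of rows to lag can optionally be specified. If the number of rows to lag is not specified, the lag is one row.
 Returns null when the lag for the current row extends before the beginning of the window and no default is given.
  lead(col, n, 'x'): returns the value of col from the n-th row after the current row, or the default 'x' if there is no such row. The number of rows to lead can optionally be specified. If the number of rows to lead is not specified, the lead is one row.
 Returns null when the lead for the current row extends beyond the end of the window and no default is given.
  first_value: This takes at most two parameters. The first parameter is the column for which you want the first value, the second (optional) parameter must be a boolean which is false by default. If set to true it skips null values.
  last_value: takes the same parameters as first_value but returns the last value in the window. For a worked use, see the author's notes on data warehouse / ADS / sessions.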
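
As a rough illustration of the four functions described above, here is a minimal sketch that runs on inline data. The CTE t, its rows, and the output column aliases are made up for this example. One point worth noting: with an order by and no explicit window, the window ends at the current row, so last_value needs an explicit window to see the whole partition.

```sql
-- Hypothetical inline data: three orders for one customer.
with t as (
  select 'tom' as name, '2019-05-01' as orderdate, 10 as cost
  union all
  select 'tom', '2019-05-03', 20
  union all
  select 'tom', '2019-05-07', 30
)
select
  name,
  orderdate,
  lag(orderdate, 1, 'none')  over (partition by name order by orderdate) as prev_date,
  lead(orderdate, 1, 'none') over (partition by name order by orderdate) as next_date,
  lag(cost)                  over (partition by name order by orderdate) as prev_cost,
  first_value(cost)          over (partition by name order by orderdate) as first_cost,
  last_value(cost)           over (partition by name order by orderdate
                                   rows between unbounded preceding and unbounded following) as last_cost
from t;
-- Expected rows (output order may vary):
-- tom  2019-05-01  none        2019-05-03  NULL  10  30
-- tom  2019-05-03  2019-05-01  2019-05-07  10    10  30
-- tom  2019-05-07  2019-05-03  none        20    10  30
```

prev_cost is NULL on the first row because lag(cost) is called without a default, while prev_date and next_date fall back to 'none'.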
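 Examples:
1. Customers who made a purchase in May 2019, together with the total number of such customers
	-- after group by name, the window for count is the whole grouped result set
	select name, count(1) over()
	from business where substring(orderdate, 1, 7) = '2019-05' group by name;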
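2. Each customer's purchase details together with their monthly purchase total
	-- the window for sum is the (name, month) partition
	select name, orderdate, cost, sum(cost) over(distribute by name, month(orderdate))
	from business;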
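3. The same scenario, but with the monthly total accumulated by date
	-- partition by name and month, order by date, then accumulate within each partition
	select *, sum(cost) over(partition by name, month(orderdate) order by orderdate) from business;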
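4. Each customer's previous purchase date (similar to analysing page transitions on an e-commerce site: the previous and the next page visited)
	select name, orderdate, cost, lag(orderdate, 1) over(distribute by name sort by orderdate) from business;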
5. Orders in the earliest 20% by date
	-- order by date, split evenly into 5 groups, then keep the first group
	select * from (
	select name, orderdate, cost, ntile(5) over(sort by orderdate) gid
	from business
	) t where t.gid = 1;
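
The article does not show the definition of the business table, so to try the queries above, one possible setup is sketched here. The column types and every sample row are assumptions inferred from how the queries use name, orderdate and cost (orderdate is assumed to be a string in yyyy-MM-dd form).

```sql
-- Hypothetical schema: not taken from the original article.
create table if not exists business (
  name      string,
  orderdate string,
  cost      int
);

-- Hypothetical sample rows, two customers across two months.
insert into table business values
  ('jack', '2019-04-28', 15),
  ('jack', '2019-05-01', 10),
  ('jack', '2019-05-06', 42),
  ('tony', '2019-05-02', 15),
  ('tony', '2019-05-04', 29),
  ('tony', '2019-06-12', 80);
```

With these rows, example 1 would list jack and tony, each with a count of 2, since both have at least one order in May 2019.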
 
2. The OVER clause
  Aggregate functions that can be used with over:
    COUNT, SUM, MIN, MAX, AVG
 
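  over(partitioning, ordering, window):
    Partitioning
      distribute|partition by month(date): partitions by month, so the window is all rows of that month
      Once an ordering is added and no window is given, the function's window runs from the first row of the partition to the current row (the default is range between unbounded preceding and current row)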
 
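    Ordering: order by
    Window
      Specifies the size of the data window the analytic function works on; the window can change as the current row changes
      rows between x and y, where x and y are chosen from:
        current row: the current row
        n preceding: the n-th row before the current row
        n following: the n-th row after the current row
        unbounded preceding: start at the first row of the partition
        unbounded following: end at the last row of the partition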
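
To make the window options above concrete, the following sketch computes the same sum under several windows. The CTE t, its rows, and the aliases are made up for illustration. The dates are all distinct, so the default range window behaves like a rows window here.

```sql
-- Hypothetical inline data: four orders for one customer.
with t as (
  select 'tom' as name, '2019-05-01' as orderdate, 10 as cost
  union all
  select 'tom', '2019-05-03', 20
  union all
  select 'tom', '2019-05-07', 30
  union all
  select 'tom', '2019-05-09', 40
)
select
  orderdate,
  cost,
  sum(cost) over (partition by name) as total,
  sum(cost) over (partition by name order by orderdate) as running,
  sum(cost) over (partition by name order by orderdate
                  rows between 1 preceding and current row) as prev_and_cur,
  sum(cost) over (partition by name order by orderdate
                  rows between 1 preceding and 1 following) as neighbours,
  sum(cost) over (partition by name order by orderdate
                  rows between current row and unbounded following) as remaining
from t;
-- Expected rows (output order may vary):
-- orderdate   cost  total  running  prev_and_cur  neighbours  remaining
-- 2019-05-01  10    100    10       10            30          100
-- 2019-05-03  20    100    30       30            60          90
-- 2019-05-07  30    100    60       50            90          70
-- 2019-05-09  40    100    100      70            70          40
```

total has no order by, so its window is the whole partition. running has an order by but no explicit window, so it accumulates from the first row to the current row.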
 
 
 
 
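3. Analytics functions
  RANK
    Rows with equal values get the same rank, and the next distinct value skips ahead, leaving a gap (for example 1, 1, 3)
 
  ROW_NUMBER
    Sequential numbering, with no ties (1, 2, 3)
 
  DENSE_RANK
    Rows with equal values get the same rank, and the next distinct value follows immediately, with no gap (for example 1, 1, 2)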
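
The difference between the three ranking functions is easiest to see side by side on data with a tie. This is a minimal sketch. The CTE s, its rows, and the aliases are invented for the example.

```sql
-- Hypothetical inline data: a and b are tied on score.
with s as (
  select 'a' as name, 90 as score
  union all
  select 'b', 90
  union all
  select 'c', 80
  union all
  select 'd', 70
)
select
  name,
  score,
  rank()       over (order by score desc) as rnk,
  dense_rank() over (order by score desc) as drnk,
  row_number() over (order by score desc) as rn
from s;
-- Expected values per score:
-- score  rnk  drnk  rn
-- 90     1    1     1 or 2 (the order among tied rows is not defined)
-- 90     1    1     2 or 1
-- 80     3    2     3
-- 70     4    3     4
```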
 
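  CUME_DIST
    (number of rows with a value less than or equal to the current value) / (total number of rows in the partition)
 
  PERCENT_RANK
    (RANK of the current row within the partition - 1) / (total number of rows in the partition - 1)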
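
The two formulas above can be checked by hand on a small data set. The sketch below uses four made-up rows in a single partition, ordered ascending by score. The CTE s and the aliases are assumptions for the example.

```sql
-- Hypothetical inline data: four rows, with a tie at the top score.
with s as (
  select 'a' as name, 90 as score
  union all
  select 'b', 90
  union all
  select 'c', 80
  union all
  select 'd', 70
)
select
  name,
  score,
  cume_dist()    over (order by score) as cd,
  percent_rank() over (order by score) as pr
from s;
-- Worked by hand from the formulas (4 rows in the partition):
-- score 70: cd = 1/4 = 0.25, rank 1, pr = (1-1)/(4-1) = 0
-- score 80: cd = 2/4 = 0.5,  rank 2, pr = (2-1)/(4-1) = 1/3
-- score 90: cd = 4/4 = 1.0,  rank 3, pr = (3-1)/(4-1) = 2/3 (both tied rows)
```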
 
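  NTILE
    Splits the ordered rows of a partition into n slices and returns the number of the slice the current row falls in
    NTILE does not support ROWS BETWEEN
    If the rows cannot be divided evenly, the extra rows go to the earliest slices, starting with the first
    Often used to fetch a leading share of the rows, such as the top 30%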
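
To show how uneven slices are filled, here is a small sketch that splits seven made-up rows into three slices. The CTE n, its id column, and the alias slice are assumptions for the example.

```sql
-- Hypothetical inline data: ids 1 to 7.
with n as (
  select 1 as id
  union all select 2
  union all select 3
  union all select 4
  union all select 5
  union all select 6
  union all select 7
)
select id, ntile(3) over (order by id) as slice
from n;
-- 7 rows do not divide evenly by 3, so the first slice gets the extra row:
-- ids 1, 2, 3 -> slice 1
-- ids 4, 5    -> slice 2
-- ids 6, 7    -> slice 3
```

Filtering on the slice number in an outer query, as example 5 in section 1 does with gid = 1, then keeps only the leading share of the rows.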