Hive3窗口函数

概念

我们都知道在sql中有一类函数叫做聚合函数例如sum()、avg()、max()等等这类函数可以将多行数据按照规则聚集为一行一般来讲聚集后的行数是要少于聚集前的行数的但是有时我们想要既显示聚集前的数据又要显示聚集后的数据这时我们便引入了窗口函数

在深入研究Over字句之前,一定要注意:在SQL处理中,窗口函数都是最后一步执行,而且仅位于Order by字句之前。

--准备数据

jack,2015-01-01,10

tony,2015-01-02,15

jack,2015-02-03,23

tony,2015-01-04,29

jack,2015-01-05,46

jack,2015-04-06,42

tony,2015-01-07,50

jack,2015-01-08,55

mart,2015-04-08,62

mart,2015-04-09,68

neil,2015-05-10,12

mart,2015-04-11,75

neil,2015-06-12,80

mart,2015-04-13,94

--创建表

use kb23hivedb;
create table t_window(
    name string,
    orderdate date,
    cost int
)
row format delimited fields terminated by ',';

加载本地数据
load data local inpath '/opt/kb23/oider.txt' into table t_window;

查询所有
select * from t_window;

补充:聚合函数

Hive聚合运算-group by

Hive聚合运算-having

Hive聚合运算-基础聚合函数(max, min, count, sum, avg,Collect_set,collect_list

1group by 用于分组

1、Hive基本内置聚合函数与group by一起使用

2、如果没有指定group by子句,则默认聚合整个表

3、group by支持使用case when或表达式

(2)having:对group by 聚合结果的条件过滤

1、可以避免在group by 之后使用子查询

2、Having之后可以使用表达式,但不建议使用

(3)基础聚合函数

  1. max、min、count、sum、avg
  2. Collect_set、collect_list

注:

一般与group by 一起使用,可应用于列或表达式,对null 的count聚合为0(即select count(null)=0)。

--查询2015-04的数据的名字name和总数count
select name,count(*) over() from t_window where substring(orderdate,1,7)='2015-04';

--去重 distinct
select distinct name,count(*) over() from t_window where substring(orderdate,1,7)='2015-04';


--分组 group by
select name,count(*) over() from t_window where substring(orderdate,1,7)='2015-04' group by name;

窗口函数

概述:

窗口函数是一组特殊函数。扫描多个输入行来计算每个输出值,为每行数据生成一行结果,可以通过窗户函数来实现复杂的计算和聚合。

语法:

Function(arg1。。。argn) over(【partion by()】【order by()】【window_clause】)

①Partition by 类似于group by ,未指定则按整个结果集

②只有指定order by 子句之后才能进行窗口定义

③可同时使用多个窗口函数

④过滤窗口函数计算结果必须在外面一层

按功能可划分为:

排序、聚合、分析

聚合类

--窗口函数+求每月总分 olap分析
select name,orderdate,cost,sum(cost) over(partition by month(orderdate)) from t_window;

--1、分区,不排序,计算(sum、min、max、avg、count)

select name,orderdate,cost,
       sum(cost) over(partition by month(orderdate)) sumcost,
       min(cost) over(partition by month(orderdate)) mincost,
       max(cost) over(partition by month(orderdate)) maxcost,
       avg(cost) over(partition by month(orderdate)) avgcost,
       count(cost) over(partition by month(orderdate)) countcost
from t_window;

--2、分区,排序,计算(sum、min、max、avg、count)

select name,orderdate,cost,
       sum(cost) over(partition by month(orderdate) order by orderdate) sumcost,
       min(cost) over(partition by month(orderdate) order by orderdate) mincost,
       max(cost) over(partition by month(orderdate) order by orderdate) maxcost,
       avg(cost) over(partition by month(orderdate) order by orderdate) avgcost,
       count(cost) over(partition by month(orderdate) order by orderdate) countcost
from t_window;

排序类

--排序类
--row_number、rank、dense_rank、percent_rank、ntile
select name,orderdate,cost,
       --row_number()对所有数值输出不同的序号,序号唯一连续
       --排名【123456...】
    row_number() over () as rn,
       --分组,后排名【123456...】
    row_number() over (order by name) as rn2,
       --rank() 对相同数值,输出相同的序号,下一个序号跳过
       --占位符【1111166699】
    rank() over (order by name) as rn3,
       --dense_rank() 对相同数值,输入相同的序号,下一个序号连续
       --占位符,允许有并列存在【1111122233】
    dense_rank() over (order by name) as rn4,
       --percent_rank() (目前排名-1)/(总行数-1),值相对于一组值的百分比排名
       --分区、排序、求占比【归一化:最低0,最高1,其他平分】
    percent_rank() over (partition by name order by cost) as rn5
from t_window;


--ntile(n) 切片
--将有序的数据集合平均分配到n个桶中,将桶号分配给每一行,根据桶号,选取前或后n分之几的数据
select name,orderdate,cost,
       --分成两份
    ntile(2) over() as nt1,
       --分成三份
    ntile(3) over() as nt2,
       --先分区,在分区内部分成三分
    ntile(3) over(partition by name) as nt3,
       --先排序,再分成三份
    ntile(3) over(order by cost) as nt4,
       --分区、排序、切三份
    ntile(3) over(partition by name order by cost) as nt5
from t_window;

分析类

--分析类
-- cume_dist\lag\lead\first_value\last_value
--lag和lead
select name,orderdate,cost,
       --查看前一次或上一次时间和cost,没有值默认为null
    lag(orderdate,1) over (partition by name order by orderdate)as predate,
    lag(cost,1) over(partition by name order by orderdate) as precost,
       --定义空值的默认值为1900-01-01
    lag(orderdate,1,'1900-01-01') over (partition by name order by orderdate)as predate_DY,
    lag(cost,1,'hello') over(partition by name order by orderdate) as precost_DY,
       --前前次/上上次的日期和cost
    lag(orderdate,2) over(partition by name order by orderdate) as pre2_date,
    lag(cost,2) over(partition by name order by orderdate) as pre2_cost,
       ---后一次的日期和cost
    lead(orderdate,1) over(partition by name order by orderdate) as nextdate,
    lead(cost,1) over(partition by name order by orderdate) as nextcost
from t_window;


--first_value、last_value
select name,orderdate,cost,
       --分区、排序、第一个
    first_value(orderdate) over (partition by name order by orderdate) as time1,
       --分区、排序、自己
    last_value(orderdate) over (partition by name order by orderdate) as time1
from t_window;

        窗口函数--窗口定义

(1)窗口定义由【<window_clause>】子句描述

用于进一步细分结果并应用分析函数

(2)支持两类窗口定义

行类型窗口:根据当前行或之后的行号确定的窗口

范围类型窗口:取分组内的值在指定范围区间内的行

(3)Pank、ntile、dense_rank、cume_dist、percent_rank、lead、lag和ronumber函数不支持与窗口子句一起使用

        行类型窗口

--rows between XX and XX
select name,orderdate,cost,
       --全局所有数据参加运算
       sum(cost) over() as sum1,
       --按名字分组,求每个名字分别的花销
       sum(cost) over(partition by name) as sum2_1,
       sum(cost) over(partition by name order by orderdate rows between unbounded preceding and unbounded following) as sum2_2,
       --按名字分组,再按月份排序,求第一个月份至当前月份的消费总和
       sum(cost) over(partition by name order by orderdate) as sum3,
       --效果同sum3
       --unbounded起点 unbounded preceding最前面一行为起点   current row当前行为终点 current following当前行后为终点
       sum(cost) over(partition by name order by orderdate rows between unbounded preceding and current row ) as sum4,
       --计算前一行和当前行的开销总和即求最近两个月开销
       --1 preceding 表示前一行
       sum(cost) over(partition by name order by orderdate rows between 1 preceding and current row) as sum5,
       --求当前前一行,当前行,当前后一行的三个的开销总和
       sum(cost) over(partition by name order by orderdate rows between 1 preceding and 1 following) as sum6,
       --求当前行至最后一行的总和
       sum(cost) over(partition by name order by orderdate rows between current row and unbounded following) as sum7
from t_window;

        范围类型窗口

注:该范围值/区间必须是数字或日期类型,目前只支持一个order by列
--range between XX and XX

select name,orderdate,cost,
    max(cost) over(order by name range between 50 preceding and 100 following) as maxValue
from t_window;

面试题:

排序时如何把NULL值排到最后面?(降序)

with
     tb as (
        select name,orderdate,cost,
        lag(orderdate,1) over (partition by name order by orderdate)as predate
        from t_window
     )
select * from tb order by predate desc;

拓展:排序时如何把NULL值排到最前面?(升序)

with
     tb as (
        select name,orderdate,cost,
        lag(orderdate,1) over (partition by name order by orderdate)as predate
        from t_window
     )
select * from tb order by predate asc;

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值