Hive的分析窗口函数SUM,AVG,MIN,MAX的例子

15 篇文章 1 订阅
13 篇文章 12 订阅

Hive中提供了越来越多的分析函数,用于完成负责的统计分析。抽时间将所有的分析窗口函数理一遍,将陆续发布。

今天先看几个基础的,SUM、AVG、MIN、MAX。

用于实现分组内所有和连续累积的统计。

目录

1、数据准备

2、SUM(xxx) OVER(PARTITION BY yyy ORDER BY zzz) 示例

3、窗户函数定义:

4、求不同贷款金额区间的合同数占比

5、其他AVG,MIN,MAX,和SUM用法一样。

5.1、AVG(xxx) OVER(PARTITION BY yyy ORDER BY zzz) 示例

5.2、MiN(xxx) OVER(PARTITION BY yyy ORDER BY zzz) 示例

5.3、MAX(xxx) OVER(PARTITION BY yyy ORDER BY zzz) 示例

6、SQL复杂场景实现求“实时/最大在线人数”

6.1、背景

6.2、数据

6.3、解法1:(join关联)

6.4、解法2 (sum() over())


1、数据准备

CREATE EXTERNAL TABLE lxw1234 (

cookieid string,

createtime string, --day

pv INT

) ROW FORMAT DELIMITED

FIELDS TERMINATED BY ','

stored as textfile location '/tmp/lxw11/';

 

DESC lxw1234;

cookieid STRING

createtime STRING

pv INT

 

hive> select * from lxw1234;

OK

cookie1 2015-04-10 1

cookie1 2015-04-11 5

cookie1 2015-04-12 7

cookie1 2015-04-13 3

cookie1 2015-04-14 2

cookie1 2015-04-15 4

cookie1 2015-04-16 4

2、SUM(xxx) OVER(PARTITION BY yyy ORDER BY zzz) 示例

SUM — 注意,结果和ORDER BY相关,默认为升序

SELECT cookieid,

createtime,

pv,

SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行

SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1

SUM(pv) OVER(PARTITION BY cookieid) AS pv3, --分组内所有行

SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, --当前行+往前3行

SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, --当前行+往前3行+往后1行

SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 ---当前行+往后所有行

FROM lxw1234;

 

cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6

-----------------------------------------------------------------------------

cookie1 2015-04-10 1 1 1 26 1 6 26

cookie1 2015-04-11 5 6 6 26 6 13 25

cookie1 2015-04-12 7 13 13 26 13 16 20

cookie1 2015-04-13 3 16 16 26 16 18 13

cookie1 2015-04-14 2 18 18 26 17 21 10

cookie1 2015-04-15 4 22 22 26 16 20 8

cookie1 2015-04-16 4 26 26 26 13 13 4

pv1: 分组内从起点到当前行的pv累积,如,11号的pv1=10号的pv+11号的pv, 12号=10号+11号+12号
pv2: 同pv1
pv3: 分组内(cookie1)所有的pv累加
pv4: 分组内当前行+往前3行,如,11号=10号+11号, 12号=10号+11号+12号, 13号=10号+11号+12号+13号, 14号=11号+12号+13号+14号
pv5: 分组内当前行+往前3行+往后1行,如,14号=11号+12号+13号+14号+15号=5+7+3+2+4=21
pv6: 分组内当前行+往后所有行,如,13号=13号+14号+15号+16号=3+2+4+4=13,14号=14号+15号+16号=2+4+4=10

3、窗户函数定义:

如果不指定ROWS BETWEEN,默认为从起点到当前行;
如果不指定ORDER BY,则将分组内所有值累加;

关键是理解ROWS BETWEEN含义,也叫做WINDOW子句
PRECEDING:往前
FOLLOWING:往后
CURRENT ROW:当前行
UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING:表示到后面的终点

4、求不同贷款金额区间的合同数占比

sum(contract_cnt)  over()  相当于 sum(contract_cnt) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),相当于把列 contract_cnt中的所有值相加

-- 第一版中间SQL
select amt, count(contract_id) as contract_cnt
from 
(
    select 
    case when principal<=1000000 then "0-1W" 
    when principal>1000000 and principal<=5000000 then "1-5W" 
    when principal>5000000 and principal<=10000000 then "5-10W" 
    when principal>10000000 and principal<=15000000 then "10-15W" 
    when principal>15000000 and principal<=20000000 then "15-20W" 
    end as amt,
    principal,contract_id
    from ods.ods_user_user_contract_d
    where contract_status != 4 
    and product_id in  (2,4,5,7,8,9,16,17,18) 
    and from_unixtime(cast(create_time/1000 as int),'yyyy-MM-dd')<current_date
    and dt=DATE_FORMAT(date_sub(current_date(),1),'yyyyMMdd')  
) a 
group by  amt

--输出结果
amt	contract_cnt
1-5W	5110
0-1W	5414
10-15W	1161
5-10W	3460
15-20W	3194
Time taken: 1.983 seconds, Fetched 5 row(s)

--最终版本SQL

select amt,contract_cnt,sum(contract_cnt)  over() ,contract_cnt/sum(contract_cnt)  over() as contract_ratio
from
(
   select amt, count(contract_id) as contract_cnt
   from 
   (
       select 
       case when principal<=1000000 then "0-1W" 
       when principal>1000000 and principal<=5000000 then "1-5W" 
       when principal>5000000 and principal<=10000000 then "5-10W" 
       when principal>10000000 and principal<=15000000 then "10-15W" 
       when principal>15000000 and principal<=20000000 then "15-20W" 
       end as amt,
       principal,contract_id
       from ods.ods_user_contract_d
       where contract_status != 4 
       and product_id in  (2,4,5,7,8,9,16,17,18) 
       and from_unixtime(cast(create_time/1000 as int),'yyyy-MM-dd')<current_date
       and dt=DATE_FORMAT(date_sub(current_date(),1),'yyyyMMdd')  
   ) a 
   group by  amt
) t

--输出结果
amt	contract_cnt	sum(contract_cnt) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)	contract_ratio
1-5W	5110	18339	0.27864114728174927
0-1W	5414	18339	0.2952178417580021
10-15W	1161	18339	0.06330770489121544
5-10W	3460	18339	0.1886689568678772
15-20W	3194	18339	0.174164349201156
Time taken: 6.033 seconds, Fetched 5 row(s)


5、其他AVG,MIN,MAX,和SUM用法一样。

5.1、AVG(xxx) OVER(PARTITION BY yyy ORDER BY zzz) 示例

SELECT cookieid,

createtime,

pv,

AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行

AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1

AVG(pv) OVER(PARTITION BY cookieid) AS pv3, --分组内所有行

AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, --当前行+往前3行

AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, --当前行+往前3行+往后1行

AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 ---当前行+往后所有行

FROM lxw1234;

cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6

-----------------------------------------------------------------------------

cookie1 2015-04-10 1 1.0 1.0 3.7142857142857144 1.0 3.0 3.7142857142857144

cookie1 2015-04-11 5 3.0 3.0 3.7142857142857144 3.0 4.333333333333333 4.166666666666667

cookie1 2015-04-12 7 4.333333333333333 4.333333333333333 3.7142857142857144 4.333333333333333 4.0 4.0

cookie1 2015-04-13 3 4.0 4.0 3.7142857142857144 4.0 3.6 3.25

cookie1 2015-04-14 2 3.6 3.6 3.7142857142857144 4.25 4.2 3.3333333333333335

cookie1 2015-04-15 4 3.6666666666666665 3.6666666666666665 3.7142857142857144 4.0 4.0 4.0

cookie1 2015-04-16 4 3.7142857142857144 3.7142857142857144 3.7142857142857144 3.25 3.25 4.0

5.2、MiN(xxx) OVER(PARTITION BY yyy ORDER BY zzz) 示例

SELECT cookieid,

createtime,

pv,

MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行

MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1

MIN(pv) OVER(PARTITION BY cookieid) AS pv3, --分组内所有行

MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, --当前行+往前3行

MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, --当前行+往前3行+往后1行

MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 ---当前行+往后所有行

FROM lxw1234;

 

cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6

-----------------------------------------------------------------------------

cookie1 2015-04-10 1 1 1 1 1 1 1

cookie1 2015-04-11 5 1 1 1 1 1 2

cookie1 2015-04-12 7 1 1 1 1 1 2

cookie1 2015-04-13 3 1 1 1 1 1 2

cookie1 2015-04-14 2 1 1 1 2 2 2

cookie1 2015-04-15 4 1 1 1 2 2 4

cookie1 2015-04-16 4 1 1 1 2 2 4

5.3、MAX(xxx) OVER(PARTITION BY yyy ORDER BY zzz) 示例

----MAX

SELECT cookieid,

createtime,

pv,

MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行

MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1

MAX(pv) OVER(PARTITION BY cookieid) AS pv3, --分组内所有行

MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, --当前行+往前3行

MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, --当前行+往前3行+往后1行

MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 ---当前行+往后所有行

FROM lxw1234;

 

cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6

-----------------------------------------------------------------------------

cookie1 2015-04-10 1 1 1 7 1 5 7

cookie1 2015-04-11 5 5 5 7 5 7 7

cookie1 2015-04-12 7 7 7 7 7 7 7

cookie1 2015-04-13 3 7 7 7 7 7 4

cookie1 2015-04-14 2 7 7 7 7 7 4

cookie1 2015-04-15 4 7 7 7 7 7 4

cookie1 2015-04-16 4 7 7 7 4 4 4

6、SQL复杂场景实现求“实时/最大在线人数”

6.1、背景

能不能用SQL求某一天哪个时刻进行中的订单量最多,这个数是多少?我寻思挺有意思,就细想了一下。其实思考下可以发现,如果要求一段时间内的同时处于服务过程中的最大订单量,相当于也要知道每个时刻的数量,所以求最大和求实时是等同的。

这个问题在不同的业务场景下有不同的意义,比如一个游戏的同时在线人数,比如一个服务器的实时并发数,比如一个仓库的货物积压数量等等。

一般而言对于需要知道并发数的业务场景都会通过实时监控来完成,而不会通过sql进行离线计算。但本着深挖sql潜能的想法,如下提供一个不切实际的解法与一个真实可行的方法。

6.2、数据

假设我们的业务场景是回溯某一天的某游戏的最大同时在线人数,并有表connection_detail记录用户每一次上线和下线时间:

user_id    login_time    logout_time    dt
213142    2019-01-01 12:21:22    2019-01-01 13:45:21    20190101
412321    2019-01-01 13:35:11    2019-01-01 16:01:49    20190101
…    …    …    …

6.3、解法1:(join关联)

假设我们实时有一个变量online_cnt记录着当前在线人数,那么什么情况下这个数会发生变化呢?聪明的你一定想到是当有用户登录或者有用户登出的时刻。说到这思路应该就清晰了,我们需要去求每一个用户登入或登出时刻的在线人数。从另一个角度,我们需要知道每当一个用户登录或登出的时刻,用多少人还在线上。
 


select 
    max(online_cnt) as max_online_cnt
from
(
    select 
        b.`timestamp`,
        count(1) as online_cnt --该时刻的在线人数
    from 
    (
        select
            `timestamp`
        from 
        (
            select
                unix_timestamp(login_time) as `timestamp`, --将时间转换为unix时间戳
            from
                connection_detail
            where
                dt = '20190101' --限定某一天
            union all
            select
                unix_timestamp(logout_time) as `timestamp`,
            from
                connection_detail
            where
                dt = '20190101'
        )a --取出每个并发数变化的时刻
        group by
            `timestamp` 
    )b --去重,可以思考下不去重会有什么影响
    left join 
    (
        select
            unix_timestamp(login_time) as login_timestamp,
            unix_timestamp(login_time) as logout_timestamp, --将时间转换为unix时间戳
        from
            connection_detail
        where
            dt = '20190101'
    )c
    --此处的关联为笛卡尔积
    where
        b.`timestamp`>=c.login_timestamp
        and b.`timestamp`<=c.logout_timestamp
    group by
        b.`timestamp`
)d

不愿吐槽这个方法,一是逻辑上比较绕,二是在真实的业务数据下根本跑不出来,假设原表中有n行数据,则在计算过程中数据将一度膨胀到O(n^2)。

6.4、解法2 (sum() over())

先抛开sql,来考虑实时计算中我们怎么处理该问题。是不是我们会实时记录着一个变量online_cnt,当一个用户登录时加1,用户下线时减1?

再假如我让你用python离线处理这个问题你会怎么做。应该先把登录时间和登出时间这两列打散成行,然后进行排序,最后一行行读入,当读到一个登录时间时online_cnt加1,读到登出时间时online_cnt减1。

回到sql,我们来尝试用SQL实现上述逻辑。我们给登录时间加一个数值标记1,登出时间加标记-1。然后对排序后的数据求和该字段,最终得到我们的结果。

select
    max(max_index)
from 
(
    select
        sum(index) over(order by `timestamp`) as max_index --排序后第一行到本行的和
    from
    (
        select
            order_id,
            unix_timestamp(login_time) as `timestamp`,
            1 as index
        from
            connection_detail
        where
            dt = '20190101'
            and is_td_finish = 1
        union all
        select
            order_id,
            unix_timestamp(logout_time) as `timestamp`,
            -1 as index
        from
            connection_detail
        where
            dt = '20190101'
    )a  --将登录时间和登出时间多列成多行
)b

可能阻碍大家想到这一逻辑的点在于sum() over()这一用法,该窗口函数对每一行数据实现了计算第一行到该行的求和计算。

该代码对于千万量级的数据sparksql计算了65秒,属于一个可以接受的范围。

理解了上述代码的同学可以发现过程中我们一度得到了每个时刻的在线人数(子查询b)。对这一数据进行可视化可以直观了解服务器的负载变化情况。

参考:http://lxw1234.com/archives/2015/04/176.htm
           https://blog.csdn.net/Adrian_Wang/article/details/89840671

  • 0
    点赞
  • 3
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值