Hive provides a growing set of analytic functions for complex statistical analysis. I plan to work through all of the analytic window functions and publish the notes over time.
Today we start with a few basic ones: SUM, AVG, MIN, and MAX.
They are used to compute per-group totals as well as running (cumulative) aggregates within a group.
Contents
1. Preparing the data
2. SUM(xxx) OVER(PARTITION BY yyy ORDER BY zzz) example
3. Window frame definition
4. Share of contracts per loan-amount bucket
5. AVG, MIN, MAX
5.1. AVG(xxx) OVER(PARTITION BY yyy ORDER BY zzz) example
5.2. MIN(xxx) OVER(PARTITION BY yyy ORDER BY zzz) example
5.3. MAX(xxx) OVER(PARTITION BY yyy ORDER BY zzz) example
6. A complex SQL scenario: real-time / peak online user count
1. Preparing the data
CREATE EXTERNAL TABLE lxw1234 (
cookieid string,
createtime string, -- day
pv INT
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/tmp/lxw11/';
DESC lxw1234;
cookieid STRING
createtime STRING
pv INT
hive> select * from lxw1234;
OK
cookie1 2015-04-10 1
cookie1 2015-04-11 5
cookie1 2015-04-12 7
cookie1 2015-04-13 3
cookie1 2015-04-14 2
cookie1 2015-04-15 4
cookie1 2015-04-16 4
2. SUM(xxx) OVER(PARTITION BY yyy ORDER BY zzz) example
SUM — note that the result depends on ORDER BY (ascending by default).
SELECT cookieid,
createtime,
pv,
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- default: from partition start to current row
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, -- from start to current row, same as pv1
SUM(pv) OVER(PARTITION BY cookieid) AS pv3, -- all rows in the partition
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, -- current row + 3 preceding rows
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, -- current row + 3 preceding rows + 1 following row
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 -- current row + all following rows
FROM lxw1234;
cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6
-----------------------------------------------------------------------------
cookie1 2015-04-10 1 1 1 26 1 6 26
cookie1 2015-04-11 5 6 6 26 6 13 25
cookie1 2015-04-12 7 13 13 26 13 16 20
cookie1 2015-04-13 3 16 16 26 16 18 13
cookie1 2015-04-14 2 18 18 26 17 21 10
cookie1 2015-04-15 4 22 22 26 16 20 8
cookie1 2015-04-16 4 26 26 26 13 13 4
pv1: running total of pv from the partition start to the current row; e.g. pv1 on the 11th = pv on the 10th + pv on the 11th; on the 12th = 10th + 11th + 12th
pv2: same as pv1
pv3: sum of all pv in the partition (cookie1)
pv4: current row + 3 preceding rows; e.g. 11th = 10th + 11th; 12th = 10th + 11th + 12th; 13th = 10th + 11th + 12th + 13th; 14th = 11th + 12th + 13th + 14th
pv5: current row + 3 preceding rows + 1 following row; e.g. 14th = 11th + 12th + 13th + 14th + 15th = 5+7+3+2+4 = 21
pv6: current row + all following rows; e.g. 13th = 13th + 14th + 15th + 16th = 3+2+4+4 = 13; 14th = 14th + 15th + 16th = 2+4+4 = 10
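The hand calculations above are easy to cross-check outside Hive. A minimal Python sketch, recomputing pv4 and pv6 from the pv series 1, 5, 7, 3, 2, 4, 4 used in the example:

```python
# Recompute pv4 (current row + 3 preceding) and pv6 (current row + all
# following) over the pv series from the example table, to verify the output.
pv = [1, 5, 7, 3, 2, 4, 4]

# pv4: sliding window of the 3 preceding rows plus the current row
pv4 = [sum(pv[max(0, i - 3):i + 1]) for i in range(len(pv))]
# pv6: suffix sums, current row through the last row
pv6 = [sum(pv[i:]) for i in range(len(pv))]

print(pv4)  # [1, 6, 13, 16, 17, 16, 13]
print(pv6)  # [26, 25, 20, 13, 10, 8, 4]
```

Both lists match the pv4 and pv6 columns in the query output above.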
3. Window frame definition
If ROWS BETWEEN is not specified, the frame defaults to partition start through the current row;
if ORDER BY is not specified either, the aggregate is taken over all rows in the partition.
The key is understanding ROWS BETWEEN, also called the WINDOW clause:
PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: the partition boundary; UNBOUNDED PRECEDING means from the first row of the partition, UNBOUNDED FOLLOWING means through the last row
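These frame semantics are standard SQL, so they can be demonstrated with any engine that supports window functions. A minimal sketch using Python's built-in sqlite3 (SQLite 3.25+), loading the same lxw1234 data as above:

```python
# Verify the frame semantics described above using SQLite window functions.
# Table and column names mirror the lxw1234 example from this article.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lxw1234 (cookieid TEXT, createtime TEXT, pv INT)")
rows = [("cookie1", f"2015-04-{d}", pv)
        for d, pv in zip(range(10, 17), [1, 5, 7, 3, 2, 4, 4])]
conn.executemany("INSERT INTO lxw1234 VALUES (?, ?, ?)", rows)

sql = """
SELECT createtime, pv,
       SUM(pv) OVER (PARTITION BY cookieid ORDER BY createtime) AS pv1,  -- default frame
       SUM(pv) OVER (PARTITION BY cookieid ORDER BY createtime
                     ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)  AS pv4,
       SUM(pv) OVER (PARTITION BY cookieid ORDER BY createtime
                     ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING)  AS pv5
FROM lxw1234
ORDER BY createtime
"""
for r in conn.execute(sql):
    print(r)
```

The pv1, pv4, and pv5 columns come out identical to the Hive output above, since the default frame (partition start through current row) is the same in both engines.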
4. Share of contracts per loan-amount bucket
sum(contract_cnt) over() is equivalent to sum(contract_cnt) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), i.e. it adds up every value in the contract_cnt column.
-- intermediate SQL, first version
select amt, count(contract_id) as contract_cnt
from
(
select
case when principal<=1000000 then "0-1W"
when principal>1000000 and principal<=5000000 then "1-5W"
when principal>5000000 and principal<=10000000 then "5-10W"
when principal>10000000 and principal<=15000000 then "10-15W"
when principal>15000000 and principal<=20000000 then "15-20W"
end as amt,
principal,contract_id
from ods.ods_user_contract_d
where contract_status != 4
and product_id in (2,4,5,7,8,9,16,17,18)
and from_unixtime(cast(create_time/1000 as int),'yyyy-MM-dd')<current_date
and dt=DATE_FORMAT(date_sub(current_date(),1),'yyyyMMdd')
) a
group by amt
-- output
amt contract_cnt
1-5W 5110
0-1W 5414
10-15W 1161
5-10W 3460
15-20W 3194
Time taken: 1.983 seconds, Fetched 5 row(s)
-- final SQL
select amt, contract_cnt, sum(contract_cnt) over(), contract_cnt/sum(contract_cnt) over() as contract_ratio
from
(
select amt, count(contract_id) as contract_cnt
from
(
select
case when principal<=1000000 then "0-1W"
when principal>1000000 and principal<=5000000 then "1-5W"
when principal>5000000 and principal<=10000000 then "5-10W"
when principal>10000000 and principal<=15000000 then "10-15W"
when principal>15000000 and principal<=20000000 then "15-20W"
end as amt,
principal,contract_id
from ods.ods_user_contract_d
where contract_status != 4
and product_id in (2,4,5,7,8,9,16,17,18)
and from_unixtime(cast(create_time/1000 as int),'yyyy-MM-dd')<current_date
and dt=DATE_FORMAT(date_sub(current_date(),1),'yyyyMMdd')
) a
group by amt
) t
-- output
amt contract_cnt sum(contract_cnt) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) contract_ratio
1-5W 5110 18339 0.27864114728174927
0-1W 5414 18339 0.2952178417580021
10-15W 1161 18339 0.06330770489121544
5-10W 3460 18339 0.1886689568678772
15-20W 3194 18339 0.174164349201156
Time taken: 6.033 seconds, Fetched 5 row(s)
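The empty-OVER() trick is portable. A minimal sketch reproducing the ratio calculation with sqlite3, starting from the aggregated counts shown in the first-version output (table name bucket_counts is made up for the demo):

```python
# Reproduce contract_cnt / SUM(contract_cnt) OVER () from the final query,
# run against the per-bucket counts from the intermediate output above.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bucket_counts (amt TEXT, contract_cnt INT)")
conn.executemany("INSERT INTO bucket_counts VALUES (?, ?)",
                 [("0-1W", 5414), ("1-5W", 5110), ("5-10W", 3460),
                  ("10-15W", 1161), ("15-20W", 3194)])

sql = """
SELECT amt,
       contract_cnt,
       SUM(contract_cnt) OVER ()                       AS total_cnt,
       1.0 * contract_cnt / SUM(contract_cnt) OVER ()  AS contract_ratio
FROM bucket_counts
"""
for row in conn.execute(sql):
    print(row)
```

Every row sees the same grand total (18339), and the ratios match the Hive output, e.g. 5414 / 18339 ≈ 0.2952 for the 0-1W bucket.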
5. AVG, MIN, and MAX work the same way as SUM.
5.1. AVG(xxx) OVER(PARTITION BY yyy ORDER BY zzz) example
SELECT cookieid,
createtime,
pv,
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- default: from partition start to current row
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, -- from start to current row, same as pv1
AVG(pv) OVER(PARTITION BY cookieid) AS pv3, -- all rows in the partition
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, -- current row + 3 preceding rows
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, -- current row + 3 preceding rows + 1 following row
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 -- current row + all following rows
FROM lxw1234;
cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6
-----------------------------------------------------------------------------
cookie1 2015-04-10 1 1.0 1.0 3.7142857142857144 1.0 3.0 3.7142857142857144
cookie1 2015-04-11 5 3.0 3.0 3.7142857142857144 3.0 4.333333333333333 4.166666666666667
cookie1 2015-04-12 7 4.333333333333333 4.333333333333333 3.7142857142857144 4.333333333333333 4.0 4.0
cookie1 2015-04-13 3 4.0 4.0 3.7142857142857144 4.0 3.6 3.25
cookie1 2015-04-14 2 3.6 3.6 3.7142857142857144 4.25 4.2 3.3333333333333335
cookie1 2015-04-15 4 3.6666666666666665 3.6666666666666665 3.7142857142857144 4.0 4.0 4.0
cookie1 2015-04-16 4 3.7142857142857144 3.7142857142857144 3.7142857142857144 3.25 3.25 4.0
5.2. MIN(xxx) OVER(PARTITION BY yyy ORDER BY zzz) example
SELECT cookieid,
createtime,
pv,
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- default: from partition start to current row
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, -- from start to current row, same as pv1
MIN(pv) OVER(PARTITION BY cookieid) AS pv3, -- all rows in the partition
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, -- current row + 3 preceding rows
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, -- current row + 3 preceding rows + 1 following row
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 -- current row + all following rows
FROM lxw1234;
cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6
-----------------------------------------------------------------------------
cookie1 2015-04-10 1 1 1 1 1 1 1
cookie1 2015-04-11 5 1 1 1 1 1 2
cookie1 2015-04-12 7 1 1 1 1 1 2
cookie1 2015-04-13 3 1 1 1 1 1 2
cookie1 2015-04-14 2 1 1 1 2 2 2
cookie1 2015-04-15 4 1 1 1 2 2 4
cookie1 2015-04-16 4 1 1 1 2 2 4
5.3. MAX(xxx) OVER(PARTITION BY yyy ORDER BY zzz) example
SELECT cookieid,
createtime,
pv,
MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- default: from partition start to current row
MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, -- from start to current row, same as pv1
MAX(pv) OVER(PARTITION BY cookieid) AS pv3, -- all rows in the partition
MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, -- current row + 3 preceding rows
MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, -- current row + 3 preceding rows + 1 following row
MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 -- current row + all following rows
FROM lxw1234;
cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6
-----------------------------------------------------------------------------
cookie1 2015-04-10 1 1 1 7 1 5 7
cookie1 2015-04-11 5 5 5 7 5 7 7
cookie1 2015-04-12 7 7 7 7 7 7 7
cookie1 2015-04-13 3 7 7 7 7 7 4
cookie1 2015-04-14 2 7 7 7 7 7 4
cookie1 2015-04-15 4 7 7 7 7 7 4
cookie1 2015-04-16 4 7 7 7 4 4 4
6. A complex SQL scenario: computing the real-time / peak online user count
6.1. Background
Can SQL answer, for a given day, at which moment the number of in-progress orders peaked, and what that number was? It struck me as an interesting question, so I thought it through. On reflection, computing the maximum number of orders simultaneously in progress over a period requires knowing the count at every moment, so computing the maximum and computing the real-time series are equivalent problems.
The same problem shows up in many guises: the concurrent player count of a game, the live connection count of a server, the backlog of goods in a warehouse, and so on.
In practice, scenarios that need concurrency numbers are usually handled by real-time monitoring rather than offline SQL. But in the spirit of pushing SQL to its limits, below are one impractical solution and one that actually works.
6.2. The data
Suppose the task is to reconstruct the peak concurrent player count of a game on a given day, and a table connection_detail records each user's login and logout times:
user_id login_time logout_time dt
213142 2019-01-01 12:21:22 2019-01-01 13:45:21 20190101
412321 2019-01-01 13:35:11 2019-01-01 16:01:49 20190101
… … … …
6.3. Solution 1: join
Suppose we maintained a live variable online_cnt holding the current online count. When would it change? Exactly when some user logs in or logs out. So the idea is clear: for each login/logout instant, compute how many users are online at that moment.
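That brute-force idea (for every event instant, count the sessions covering it) can be sketched in plain Python first; the sample sessions below are made up for illustration:

```python
# Brute force: for every login/logout instant, count how many sessions'
# [login, logout] intervals cover it; the max of those counts is the answer.
sessions = [(0, 10), (2, 6), (4, 12), (5, 7)]  # (login_ts, logout_ts), made up

# every instant at which the concurrency can change (deduplicated)
instants = {t for session in sessions for t in session}

# O(n^2): each instant is checked against every session, mirroring
# the Cartesian-product join in the SQL below
online_at = {t: sum(1 for lo, hi in sessions if lo <= t <= hi)
             for t in instants}
max_online = max(online_at.values())
print(max_online)  # 4 (all four sessions overlap at t = 5)
```

The nested loop makes the quadratic cost explicit, which is exactly why the SQL version of this approach does not scale.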
select
max(online_cnt) as max_online_cnt
from
(
select
b.`timestamp`,
count(1) as online_cnt -- online count at that instant
from
(
select
`timestamp`
from
(
select
unix_timestamp(login_time) as `timestamp` -- convert the time to a unix timestamp
from
connection_detail
where
dt = '20190101' -- restrict to one day
union all
select
unix_timestamp(logout_time) as `timestamp`
from
connection_detail
where
dt = '20190101'
)a -- every instant at which the concurrency can change
group by
`timestamp`
)b -- deduplicate; think about what would happen without it
cross join
(
select
unix_timestamp(login_time) as login_timestamp,
unix_timestamp(logout_time) as logout_timestamp -- convert the times to unix timestamps
from
connection_detail
where
dt = '20190101'
)c
-- the join here is a Cartesian product
where
b.`timestamp`>=c.login_timestamp
and b.`timestamp`<=c.logout_timestamp
group by
b.`timestamp`
)d
I won't dwell on this approach: the logic is convoluted, and on real business data it simply never finishes. With n rows in the source table, the intermediate data balloons to O(n^2).
6.4. Solution 2: sum() over()
Setting SQL aside, consider how we would handle this in a real-time job: keep a variable online_cnt, add 1 whenever a user logs in, and subtract 1 whenever a user logs out.
Now suppose you had to solve it offline in Python. You would explode the login-time and logout-time columns into rows, sort them, and then read them one by one, incrementing online_cnt on each login timestamp and decrementing it on each logout timestamp.
Back to SQL: to express that logic, tag each login time with the marker 1 and each logout time with -1, then take a running sum of the marker column over the sorted data; its maximum is the answer.
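The Python version described above is a classic sweep line; a minimal sketch with made-up session data:

```python
# Sweep line: tag each login with +1 and each logout with -1, sort the
# events by timestamp, and the running sum at each event is the online
# count; its maximum is the peak concurrency.
sessions = [(0, 10), (2, 6), (4, 12), (5, 7)]  # (login_ts, logout_ts), made up

events = ([(lo, 1) for lo, hi in sessions] +     # logins: +1
          [(hi, -1) for lo, hi in sessions])     # logouts: -1
events.sort()

online = 0
max_online = 0
for _, delta in events:
    online += delta
    max_online = max(max_online, online)
print(max_online)  # 4 (all four sessions overlap at t = 5)
```

One sort plus one linear pass, i.e. O(n log n) overall, which is why this formulation scales where the join-based one does not.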
select
max(online_cnt)
from
(
select
sum(flag) over(order by `timestamp`) as online_cnt -- running sum from the first row to this row
from
(
select
user_id,
unix_timestamp(login_time) as `timestamp`,
1 as flag
from
connection_detail
where
dt = '20190101'
union all
select
user_id,
unix_timestamp(logout_time) as `timestamp`,
-1 as flag
from
connection_detail
where
dt = '20190101'
)a -- explode the login and logout columns into rows
)b
The step most people miss is the sum() over() usage: for each row, this window function computes the sum from the first row up to that row.
On data in the tens of millions of rows, this query took Spark SQL 65 seconds, which is an acceptable cost.
If you followed the code, you may have noticed that along the way we also obtained the online count at every instant (subquery b); visualizing that series gives a direct view of how the server load changes over time.
References: http://lxw1234.com/archives/2015/04/176.htm
https://blog.csdn.net/Adrian_Wang/article/details/89840671