HiveSQL: Computing Cumulative Visit Counts

小手追梦

Category: hive. Tags: dba, database

Article link: https://blog.csdn.net/epitomizelu/article/details/122223126

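Data

Note that the last two rows use uppercase IDs (U02, U01). String comparison in Hive is case-sensitive, so the queries below treat them as users distinct from u02 and u01 unless userId is normalized first, for example with lower(userId).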

userId	visitDate	visitCount
u01	2017/1/21	5
u02	2017/1/23	6
u03	2017/1/22	8
u04	2017/1/20	3
u01	2017/1/23	6
u01	2017/2/21	8
U02	2017/1/23	6
U01	2017/2/22	4

Requirement 1: running total of visits per user, row by row

There are two approaches:

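1. Window function

The key is the rows between ... and ... frame clause that follows the window definition.

rows between unbounded preceding and current row means "from the first row of the partition up to and including the current row".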

select
  userId,
  visitDate,
  sum(visitCount) over (
    partition by userId
    order by visitDate asc
    rows between unbounded preceding and current row
  )
from visit;
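
To make the frame semantics concrete, here is a minimal Python sketch that mimics what the window query does on the sample data: partition by userId, order by visitDate, then take a running sum. The names `ROWS`, `parse_date` and `running_totals` are invented for this illustration and are not part of Hive. The sketch parses the dates before sorting; this is an assumption about intent, since a visitDate stored as a string like 2017/1/21 is ordered lexicographically in SQL, which happens to be correct for this sample but would place 2017/10/1 before 2017/2/1.

```python
from collections import defaultdict
from datetime import datetime
from itertools import accumulate

# Sample rows copied from the data section: (userId, visitDate, visitCount)
ROWS = [
    ("u01", "2017/1/21", 5),
    ("u02", "2017/1/23", 6),
    ("u03", "2017/1/22", 8),
    ("u04", "2017/1/20", 3),
    ("u01", "2017/1/23", 6),
    ("u01", "2017/2/21", 8),
    ("U02", "2017/1/23", 6),
    ("U01", "2017/2/22", 4),
]


def parse_date(text):
    """Parse a 'yyyy/M/d' string so that sorting is chronological."""
    return datetime.strptime(text, "%Y/%m/%d")


def running_totals(rows):
    """Partition by userId, order by visitDate, sum from the first row to the current row."""
    partitions = defaultdict(list)
    for user_id, visit_date, visit_count in rows:
        partitions[user_id].append((visit_date, visit_count))

    result = []
    for user_id in sorted(partitions):
        ordered = sorted(partitions[user_id], key=lambda r: parse_date(r[0]))
        totals = accumulate(count for _, count in ordered)
        for (visit_date, _), total in zip(ordered, totals):
            result.append((user_id, visit_date, total))
    return result


if __name__ == "__main__":
    for row in running_totals(ROWS):
        print(row)
```

For u01 the running totals are 5, 11 and 19, one per visit date. U01 and U02 end up in their own partitions because the comparison is case-sensitive.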

2. JOIN

This solution is rather clever; the idea is similar to the one used for finding the top 3 scores.

select 
  tmp1.userId,
  tmp2.visitDate,
  sum(tmp1.visitCount) as visitCount
from visit tmp1
left join  visit tmp2
on tmp1.userId = tmp2.userId
where tmp1.visitDate <= tmp2.visitDate
group by tmp1.userId, tmp2.visitDate;

The rows with userId = u01 are used below to walk through the query step by step, with code and intermediate tables.

userId	visitDate	visitCount
u01	2017/1/21	5
u01	2017/1/23	6
u01	2017/2/21	8

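Effect of the join

Leave out the group by for now and look at what the join alone produces.
Pay attention to the join condition and the where condition that follows it. Compare them with the result table below to see why the where condition is the clever part.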

select 
 *
from visit tmp1
left join  visit tmp2
on tmp1.userId = tmp2.userId
where tmp1.visitDate <= tmp2.visitDate

The result after the JOIN and the where filter is:

tmp1.userId	tmp1.visitDate	tmp1.visitCount	tmp2.userId	tmp2.visitDate	tmp2.visitCount
u01	2017/1/21	5	u01	2017/1/21	5
u01	2017/1/21	5	u01	2017/1/23	6
u01	2017/1/23	6	u01	2017/1/23	6
u01	2017/1/21	5	u01	2017/2/21	8
u01	2017/1/23	6	u01	2017/2/21	8
u01	2017/2/21	8	u01	2017/2/21	8

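Look closely: what happens if we group by tmp2.visitDate and sum tmp1.visitCount?

For each tmp2.visitDate, the group contains exactly the tmp1 rows dated on or before it, so the sum is the total number of visits from the first day up to that day.

Once more: understanding the where condition is the key.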
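
The grouping step can be checked end to end with a small sketch. It runs the self-join query against an in-memory SQLite database from Python; SQLite is only a stand-in here, assumed to behave like Hive for this particular query, and the names `U01_ROWS`, `JOIN_QUERY` and `cumulative_by_self_join` are made up for the example. An order by is added so the output order is deterministic.

```python
import sqlite3

# The three u01 rows from the walkthrough: (userId, visitDate, visitCount)
U01_ROWS = [
    ("u01", "2017/1/21", 5),
    ("u01", "2017/1/23", 6),
    ("u01", "2017/2/21", 8),
]

# Same shape as the join query in the article, with the group by column
# qualified and an order by added for stable output.
JOIN_QUERY = """
select
  tmp1.userId,
  tmp2.visitDate,
  sum(tmp1.visitCount) as visitCount
from visit tmp1
left join visit tmp2
  on tmp1.userId = tmp2.userId
where tmp1.visitDate <= tmp2.visitDate
group by tmp1.userId, tmp2.visitDate
order by tmp2.visitDate
"""


def cumulative_by_self_join(rows):
    """Load rows into an in-memory table and run the self-join query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table visit (userId text, visitDate text, visitCount integer)"
        )
        conn.executemany("insert into visit values (?, ?, ?)", rows)
        return conn.execute(JOIN_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in cumulative_by_self_join(U01_ROWS):
        print(row)
```

The three groups correspond to the one, two and three joined rows per tmp2.visitDate in the table above, giving 5, 5 + 6 = 11 and 5 + 6 + 8 = 19. Because the where clause filters on tmp2, the left join behaves like an inner join in this query.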

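Requirement 2: cumulative visits per user, by month

The idea is much the same as for the previous requirement; only the granularity differs. There the total was accumulated by day, here it is accumulated by month. So first convert each date to its month, then group by month to get the total visits per user per month. After that the approach is identical to Requirement 1.

1. Convert the date to a month and aggregate by month

The conversion uses two functions, from_unixtime and unix_timestamp.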

select
	userId,
	from_unixtime(unix_timestamp(visitDate, 'yyyy/MM/dd'), 'yyyy-MM') visitMonth,
	sum(visitCount) vc
from
	visit
group by
	userId,
	from_unixtime(unix_timestamp(visitDate, 'yyyy/MM/dd'), 'yyyy-MM')
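
As a rough cross-check of this step, the following Python sketch performs the same conversion and aggregation on the sample data. It is not Hive code: `to_month` and `monthly_totals` are hypothetical helpers, and `strptime`/`strftime` merely play the role of unix_timestamp/from_unixtime.

```python
from collections import defaultdict
from datetime import datetime

# Sample rows copied from the data section: (userId, visitDate, visitCount)
ROWS = [
    ("u01", "2017/1/21", 5),
    ("u02", "2017/1/23", 6),
    ("u03", "2017/1/22", 8),
    ("u04", "2017/1/20", 3),
    ("u01", "2017/1/23", 6),
    ("u01", "2017/2/21", 8),
    ("U02", "2017/1/23", 6),
    ("U01", "2017/2/22", 4),
]


def to_month(visit_date):
    """Turn 'yyyy/M/d' into a zero-padded 'yyyy-MM' string."""
    return datetime.strptime(visit_date, "%Y/%m/%d").strftime("%Y-%m")


def monthly_totals(rows):
    """Group by (userId, month) and sum visitCount."""
    totals = defaultdict(int)
    for user_id, visit_date, visit_count in rows:
        totals[(user_id, to_month(visit_date))] += visit_count
    return dict(totals)


if __name__ == "__main__":
    for key, total in sorted(monthly_totals(ROWS).items()):
        print(key, total)
```

Because the month is zero-padded (2017-01, 2017-02), comparing the month strings in later steps agrees with chronological order.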

2. From here on, the same two approaches as in Requirement 1 apply

JOIN

with tmp as (
select
	userId,
	from_unixtime(unix_timestamp(visitDate, 'yyyy/MM/dd'), 'yyyy-MM') visitMonth,
	sum(visitCount) vc
from
	visit
group by
	userId,
	from_unixtime(unix_timestamp(visitDate, 'yyyy/MM/dd'), 'yyyy-MM'))
select
	tmp1.userId,
	tmp2.visitMonth,
	sum(tmp1.vc)
from
	tmp tmp1
left join tmp tmp2 on
	tmp1.userId = tmp2.userId
where
	tmp1.visitMonth <= tmp2.visitMonth
group by
	tmp1.userId,
	tmp2.visitMonth;

Window function

select
	userId,
	visitMonth,
	sum(vc) over (
		partition by userId
		order by visitMonth
		rows between unbounded preceding and current row
	)
from
	(
	select
		userId,
		from_unixtime(unix_timestamp(visitDate, 'yyyy/MM/dd'), 'yyyy-MM') visitMonth,
		sum(visitCount) vc
	from
		visit
	group by
		userId,
		from_unixtime(unix_timestamp(visitDate, 'yyyy/MM/dd'), 'yyyy-MM') )tmp;

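Difference between the window function and the JOIN

Both produce the correct result. When the execution engine is MapReduce, the window-function version is faster than the JOIN version. The execution plans show why: the JOIN version involves several shuffles, whereas the window function needs only one.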
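
The claim that both approaches agree can be illustrated with a short Python sketch over the monthly totals. This is only a model of the two queries, not Hive code: `MONTHLY` holds the per-user monthly sums worked out by hand from the sample data, and `cumulative_by_window` and `cumulative_by_join` are names invented for the example. It says nothing about performance, which depends on the engine and the execution plan.

```python
from itertools import accumulate

# Per-user monthly totals derived from the sample data: (userId, visitMonth, vc)
MONTHLY = [
    ("u01", "2017-01", 11),
    ("u01", "2017-02", 8),
    ("u02", "2017-01", 6),
    ("u03", "2017-01", 8),
    ("u04", "2017-01", 3),
    ("U01", "2017-02", 4),
    ("U02", "2017-01", 6),
]


def cumulative_by_window(monthly):
    """Model of sum(vc) over (partition by userId order by visitMonth ...)."""
    result = {}
    for user_id in {u for u, _, _ in monthly}:
        ordered = sorted((m, vc) for u, m, vc in monthly if u == user_id)
        totals = accumulate(vc for _, vc in ordered)
        for (month, _), total in zip(ordered, totals):
            result[(user_id, month)] = total
    return result


def cumulative_by_join(monthly):
    """Model of the self-join: pair rows, filter, then group and sum."""
    result = {}
    for u1, m1, vc1 in monthly:        # rows of tmp1
        for u2, m2, _ in monthly:      # rows of tmp2
            if u1 == u2 and m1 <= m2:  # join condition plus where filter
                key = (u1, m2)         # group by tmp1.userId, tmp2.visitMonth
                result[key] = result.get(key, 0) + vc1
    return result


if __name__ == "__main__":
    assert cumulative_by_window(MONTHLY) == cumulative_by_join(MONTHLY)
    print(sorted(cumulative_by_window(MONTHLY).items()))
```

The nested loop in the join model also hints at why the join does more work: every row is paired with every other row of the same user before grouping, while the window model makes a single ordered pass per user.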
