Several approaches to computing weekly WAU and retention in Hive

扫大街的程序员

Category: hadoop&hive

Article link: https://blog.csdn.net/moon_yang_bj/article/details/18049155

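Note: programs can be fast or slow, but business requirements vary endlessly. Pick the solution that fits the requirement at hand.

WAU: the number of distinct user ids that logged in within one week.

WAU retention: the distinct user ids that logged in during a later week, joined with the distinct user ids that logged in during the first week. In other words, the users who logged in during the first week and also logged in during that later week. Dividing that count by the first week's WAU gives the retention rate.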

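Requirement:

For every week starting in July 2013 (2013-07-01~2013-07-07, 2013-07-08~2013-07-14, ..., 2013-12-16~2013-12-22, 2013-12-23~2013-12-29), compute the WAU,
and the retention relative to the first week (7.1-7.7) for the 25 weeks that follow it.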

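There are many ways to compute these numbers. Three common solutions are given here, and they carry over to similar DAU and MAU retention problems. (Writing HQL is writing MapReduce, and designing the statistics with that in mind pays off.)

Solution 1: slow and steady

1. Compute the WAU of each week
2. Compute each week's retention relative to 7.1-7.7
The SQL is as follows: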

# WAU of week 1
hive -e "use acorn_3g;
	select count(distinct uid) 
	from tmp_user_info 
	where log_date>='2013-07-01' and log_date<='2013-07-07' and uid>0 and type='client';"
# WAU of week 2 (took about 190s)
hive -e "use acorn_3g;
	select count(distinct uid) 
	from tmp_user_info 
	where log_date>='2013-07-08' and log_date<='2013-07-14' and uid>0 and type='client';"
# Retention of week 2: users who logged in during both week 1 and week 2 (took about 510s)
hive -e "use acorn_3g;
	select count(distinct t.uid) 
	from (select distinct uid 
		from tmp_user_info 
		where log_date>='2013-07-01' and log_date<='2013-07-07' and uid>0 and type='client'
		) t 
	join (select distinct uid 
		from tmp_user_info 
		where log_date>='2013-07-08' and log_date<='2013-07-14' and uid>0 and type='client') m 
	on t.uid=m.uid;"

The logic is clear and simple, but done this way it takes 51 SQL statements and more than 100 rounds of I/O, for a total of roughly (190+510)*25=17500s.
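
The weekly WAU statements all follow one pattern, so they do not have to be typed by hand. The following is a minimal sketch, assuming GNU `date` (for the `-d` option) and a POSIX shell. The helper names `day_offset` and `wau_sql` are hypothetical, and the script only prints the queries instead of running them.

```shell
#!/bin/sh
# Sketch: print the weekly WAU queries of Solution 1.
# Assumes GNU date; the helper names are illustrative, not from the original post.

FIRST_DAY=2013-07-01

# Print the date that lies $1 days after FIRST_DAY, as YYYY-MM-DD.
day_offset() {
    date -u -d "$FIRST_DAY +$1 days" +%Y-%m-%d
}

# Print the WAU query for week number $1 (0 = 2013-07-01~2013-07-07).
wau_sql() {
    start=$(day_offset $(( $1 * 7 )))
    end=$(day_offset $(( $1 * 7 + 6 )))
    echo "select count(distinct uid) from tmp_user_info where log_date>='$start' and log_date<='$end' and uid>0 and type='client';"
}

# Weeks 0..25 cover 2013-07-01 through 2013-12-29.
week=0
while [ "$week" -le 25 ]; do
    wau_sql "$week"
    week=$(( week + 1 ))
done
```

Each printed line could then be passed to `hive -e` together with `use acorn_3g;`. This only saves typing: the number of jobs and I/O rounds stays the same.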

Solution 2: bring in outside help

1. Export the user ids behind each week's WAU

2. Join the user ids with shell commands

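# User ids that logged in during week 1 (took about 180s); sorted because comm needs sorted input
hive -e "use acorn_3g; select distinct uid from tmp_user_info where log_date>='2013-07-01' and log_date<='2013-07-07' and uid>0 and type='client';" | sort > 0701-0707.txt

# User ids that logged in during week 2
hive -e "use acorn_3g; select distinct uid from tmp_user_info where log_date>='2013-07-08' and log_date<='2013-07-14' and uid>0 and type='client';" | sort > 0708-0714.txt

# Weekly WAU: one line per distinct user id
wc -l 0701-0707.txt

# Weekly retention: -12 suppresses the lines unique to either file, leaving the intersection
comm -12 0701-0707.txt 0708-0714.txt | wc -l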

Driven by a shell script, this takes 25 SQL statements for the 25 weeks of data and still at least 50 rounds of I/O, for roughly 180s*25=4500s.
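
`comm` only yields a correct intersection when both inputs are sorted in the same collation order, which the output of a Hive query does not guarantee. Below is a small sketch of a helper that sorts before comparing. The name `retained_count` is hypothetical, and the input files are assumed to hold one user id per line.

```shell
# Count the user ids that appear in both files.
# $1, $2: files with one user id per line, in any order.
retained_count() {
    _a=$(mktemp) && _b=$(mktemp) || return 1
    # Sort both sides with the same byte-wise collation that comm will use.
    LC_ALL=C sort -u "$1" > "$_a"
    LC_ALL=C sort -u "$2" > "$_b"
    # -12 drops lines unique to either file, keeping only the common ones.
    LC_ALL=C comm -12 "$_a" "$_b" | wc -l | tr -d ' '
    rm -f "$_a" "$_b"
}
```

With the files from above, `retained_count 0701-0707.txt 0708-0714.txt` would print the number of week 1 users who came back in week 2.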

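Solution 3: a little force moves a great weight

1. Based on the requirement, write every qualifying user id into a temporary table

2. Compute WAU and WAU retention per week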

# Temporary table: one row per (week, user), where week 0 is 2013-07-01~2013-07-07 (took about 420s)
hive --hiveconf hive.exec.compress.output=true --hiveconf io.seqfile.compression.type=BLOCK -e "use acorn_3g;
create table tmp_xinyan_client_wau as  
select cast ((datediff(log_date,'2013-07-01')/7) as int) as week,uid as user_id 
	from tmp_user_info 
	where log_date>='2013-07-01' and log_date<='2013-12-29' and uid>0 and type='client' 
	group by cast ((datediff(log_date,'2013-07-01')/7) as int),uid;"

# WAU of all weeks in one pass (took about 48s)
hive -e "use acorn_3g;select week,count(1) from  tmp_xinyan_client_wau group by week" > wau.txt

# Retention of all weeks in one pass (took about 360s)
# The hint must start with /*+ (no space before the +), otherwise Hive treats it as a plain comment
hive -e "set hive.mapjoin.smalltable.filesize=250000000;
	use acorn_3g;
	select  /*+ MAPJOIN(t7) */ tu.week,count(1) from 
		(select user_id from tmp_xinyan_client_wau where week=0)  t7  
		join tmp_xinyan_client_wau tu 
		on t7.user_id = tu.user_id group by tu.week;"
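
The `week` column is simply the day offset from 2013-07-01 divided by 7 and truncated. The same bucketing can be checked outside Hive with a short sketch, assuming GNU `date`. `week_index` is a hypothetical helper, not part of the original queries.

```shell
# Map a date (YYYY-MM-DD) to its week number relative to 2013-07-01,
# mirroring cast((datediff(log_date,'2013-07-01')/7) as int).
week_index() {
    first=$(date -u -d 2013-07-01 +%s)
    day=$(date -u -d "$1" +%s)
    # Seconds -> whole days -> whole weeks (integer division truncates).
    echo $(( (day - first) / 86400 / 7 ))
}

week_index 2013-07-07   # last day of week 0
week_index 2013-07-08   # first day of week 1
week_index 2013-12-29   # last day of week 25
```

Because every log date collapses to a single integer, one `group by week` replaces the per-week date-range filters of the first two solutions.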

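The 3 SQL statements above need only 3 rounds of I/O. By cutting the number of I/O rounds, the whole job finishes in roughly 830s (420+48+360=828s by the timings above), so the requirement is met quickly and everyone saves some time. Remember to drop the temporary table afterwards.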
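
The retention query returns user counts, not rates. If its output is saved to a file (a hypothetical `retention.txt` here, assumed to contain tab-separated `week<TAB>count` lines as printed by the Hive CLI), the rates can be derived with a few lines of `awk`. Week 0 joined with itself equals the week 0 WAU, so it serves as the denominator. `retention_rate` is an illustrative name, and this is a sketch rather than part of the original solution.

```shell
# Print "week<TAB>rate", where rate = retained users / week-0 users.
# $1: file with tab-separated "week<TAB>count" lines, in any order.
retention_rate() {
    awk -F'\t' '
        { cnt[$1] = $2 }
        END {
            if (cnt[0] == 0) exit 1      # no week-0 baseline, nothing to divide by
            for (w in cnt) printf "%d\t%.4f\n", w, cnt[w] / cnt[0]
        }' "$1" | sort -n
}
```

Calling `retention_rate retention.txt` would then list one rate per week, with week 0 at 1.0000.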
