Hive的row_number() over(partition by 字段) 字段过长导致统计误差

有如下统计的hql:

select *,
	row_number() over(partition by stay_time_concat order by time asc) as stay_time_rn
from (
	select *,
		row_number() over(partition by event_id) as event_id_rn,
		row_number() over(partition by concat(j_client_id,buy_adslot_id,event_type) order by time asc) as j_client_id_rn,
		row_number() over(partition by concat(j_visitor_id,page_id,event_type) order by event_type) as j_visitor_id_rn,
		row_number() over(partition by concat(j_redirect_id,page_id,event_type) order by event_type) as j_redirect_id_rn,
		case 
			when cast(stay_time AS int)>=0 and cast(stay_time AS int) <= 4999 and event_type in ('stay_time', 'page_view') 
				then concat(j_client_id,buy_adslot_id,'event_type','0s_day') 
			when cast(stay_time AS int)>=5000 and event_type in ('stay_time', 'page_view') and cast(stay_time AS int) <= 9999 
				then concat(j_client_id,buy_adslot_id,'event_type','5s_day') 
			when cast(stay_time AS int)>=10000 and event_type in ('stay_time', 'page_view') 
				then concat(j_client_id,buy_adslot_id,'event_type','10s_day') 
			else null 
		end as stay_time_concat
	from tmp.stay_time
	where data_date>=${bizStartDate} and data_date<=${bizEndDate}
		and from_unixtime(floor(time/1000),'yyyyMMdd')=${etlDate}
)t1 where event_id_rn=1 and page_id is not null

统计时候,由于使用了concat函数,字段长度过长,row_number() over(partition by 字段)会导致一些细微统计误差。
使用concat函数后,stay_time_concat的字符串长度过长,导致统计有误差。

18005e6f2b67f9ff24a2e724222fd8b17885565386692209event_type0s_day

改造后的统计代码如下:

select *,
	row_number() over(partition by j_client_id,buy_adslot_id,stay_time_concat order by time asc) as stay_time_rn
from (
	select *,
		row_number() over(partition by event_id) as event_id_rn,
		row_number() over(partition by j_client_id,buy_adslot_id,event_type order by time asc) as j_client_id_rn,
		row_number() over(partition by j_visitor_id,page_id,event_type order by event_type) as j_visitor_id_rn,
		row_number() over(partition by j_redirect_id,page_id,event_type order by event_type) as j_redirect_id_rn,
		case 
			when cast(stay_time AS int)>=0 and cast(stay_time AS int) <= 4999 and event_type in ('stay_time', 'page_view') then '0s_day'
			when cast(stay_time AS int)>=5000 and event_type in ('stay_time', 'page_view') and cast(stay_time AS int) <= 9999 then '5s_day'
			when cast(stay_time AS int)>=10000 and event_type in ('stay_time', 'page_view') then '10s_day'
			else null 
		end as stay_time_concat
	from tmp.stay_time
	where data_date>=${bizStartDate} and data_date<=${bizEndDate}
		and from_unixtime(floor(time/1000),'yyyyMMdd')=${etlDate}
)t1 where event_id_rn=1 and page_id is not null

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值