Performance degradation that Hive multi-distinct can cause

forever_ai

Reposted from: http://wolfskin.blog.163.com/blog/static/2081731282013812104016406/

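Current Hive versions already support multi-distinct, which is convenient in practice. I often compute PV, UV and VV in the same query, but usually only over one or two days of data. Even though each day holds hundreds of millions of rows, I never noticed anything wrong.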

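Recently, though, I got a nasty requirement that none of the existing aggregated data could satisfy, so I had to write HiveQL that aggregates straight from the raw log table. Unfortunately it had to aggregate a full month of data in one go, and it used multi-distinct. This is where the performance degradation finally showed up: once reduce reached 100%, the job basically stopped moving and sat at 100% for hours. The job status showed it was stuck on the last reducer. After some reading I was fairly sure of the cause: with multi-distinct, the final count(distinct ...) computation all lands in a single reducer, so the job ends up processing a month of data on what is effectively one machine. It would be strange if that were not slow.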

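After some digging I found that the problem can be solved by trading space for time: turn every count(distinct ...) into a plain sum(1) computation, which sidesteps multi-distinct. After the change the HiveQL ran roughly 2 to 3 times faster, that is, in less than half the original time. I did not time it precisely, but the difference was obvious.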

The basic idea of the rewrite is shown below with a simple example:

Suppose there is a table sample_tb with the columns pvid and cookie_id, and the requirement is to compute PV, UV and VV. The simplest way to write it is:

select sum(1) as pv
    , count(distinct pvid) as vv
    , count(distinct cookie_id) as uv
from sample_tb;

But this involves multi-distinct, and with very large data it can degrade performance. The fix is to remove multi-distinct and compute everything with sum instead.

Step 1: deduplicate on pvid, cookie_id

create table sample_step_1
as
select pvid, cookie_id
    , sum(1) as pv
from sample_tb
group by pvid, cookie_id;

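Step 2: trade space for time by expanding the data. Use union all to stack the columns that need a distinct count into (type, type_value) rows, then use rownumber = 1 to deduplicate. If PV is not needed, a plain group by does the deduplication and the rownumber = 1 step can be skipped.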

create table sample_step_2
as
select type, type_value, pv
    , rownumber(type, type_value) as rn
from (
    select type, type_value, pv
    from (
        select 'pvid' as type, pvid as type_value, pv
        from sample_step_1
        union all
        select 'cookie_id' as type, cookie_id as type_value, pv
        from sample_step_1
    ) a
    distribute by type, type_value
    sort by type, type_value
) a;
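The rownumber(type, type_value) call in step 2 does not look like a Hive built-in, so it is presumably a custom UDF that restarts its counter whenever the sorted key changes. As a hedged alternative, assuming Hive 0.11 or later (where windowing functions are available), the same rn column can be sketched with the built-in row_number(). This statement is meant to be run instead of the step 2 statement, not in addition to it; the alias u and the order by pv tie-break are choices made for this sketch, not part of the original.

```sql
-- Sketch only: assumes Hive 0.11+ with windowing functions.
-- Builds the same (type, type_value, pv, rn) rows as step 2,
-- using the built-in row_number() instead of a custom rownumber UDF.
create table sample_step_2
as
select type, type_value, pv
    -- rn = 1 marks exactly one row per (type, type_value);
    -- which row gets it does not matter for the distinct counts.
    , row_number() over (partition by type, type_value order by pv) as rn
from (
    select 'pvid' as type, pvid as type_value, pv
    from sample_step_1
    union all
    select 'cookie_id' as type, cookie_id as type_value, pv
    from sample_step_1
) u;
```

With the window function, the explicit distribute by / sort by is no longer needed, because partition by already groups the rows that share a key.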

Step 3: use sum() in place of count(distinct ...) to compute each value

select
   sum(case when type='pvid' then pv else cast(0 as bigint) end) as pv,
   sum(case when type='pvid' and rn=1 then 1 else 0 end) as vv,
   sum(case when type='cookie_id' and rn=1 then 1 else 0 end) as uv
from sample_step_2;
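Step 2 mentions that a plain group by is enough when PV is not needed. A minimal sketch of that variant follows, reading sample_tb directly; the subquery aliases u and d are made up for this example. One caveat to check against real data: count(distinct col) ignores NULL values, while this version counts NULL as one more distinct value unless NULLs are filtered out first.

```sql
-- Sketch only: VV and UV without PV, so no row numbering is needed.
-- The inner group by leaves one row per distinct (type, type_value),
-- and the outer query just counts those rows per type.
select
    sum(case when type = 'pvid' then 1 else 0 end) as vv,
    sum(case when type = 'cookie_id' then 1 else 0 end) as uv
from (
    select type, type_value
    from (
        select 'pvid' as type, pvid as type_value
        from sample_tb
        union all
        select 'cookie_id' as type, cookie_id as type_value
        from sample_tb
    ) u
    group by type, type_value
) d;
```

The group by spreads the deduplication across reducers by key, so only the final count of already-distinct rows is left for the last stage.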