Hive query optimization: implementing DISTINCT with GROUP BY (on very large data sets, deduplicating with DISTINCT easily causes data skew)

Latest recommended article published 2022-06-20 10:06:54

圣☞摧枯拉朽


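A colleague wrote a Hive SQL statement that ran extremely slowly: after more than an hour the job had only finished the map phase, and the reduce phase was at 20%.
The Hive statement was: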

select count(distinct ip)
from (
  select ip as ip from comprehensive.f_client_boot_daily where year="2013" and month="10"
  union all
  select pub_ip as ip from f_app_boot_daily where year="2013" and month="10"
  union all
  select ip as ip from format_log.format_pv1 where year="2013" and month="10" and url_first_id=1
) d

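Analysis: the filter on comprehensive.f_client_boot_daily (year="2013" and month="10") returns about 1 billion rows, the filter on f_app_boot_daily (year="2013" and month="10") returns about 1 billion rows, and the filter on format_log.format_pv1 (year="2013" and month="10" and url_first_id=1) returns about 1 billion rows, so the total is roughly 3 billion rows. With this much data, a global count(distinct ip) shuffles every row to a single reducer, which leaves that reducer severely skewed.
Solution:
First, use GROUP BY to group the rows by ip, so that the deduplication is spread across reducers. The rewritten SQL statement is: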

select count(*)
from (
  select ip
  from (
    select ip as ip from comprehensive.f_client_boot_daily where year="2013" and month="10"
    union all
    select pub_ip as ip from f_app_boot_daily where year="2013" and month="10"
    union all
    select ip as ip from format_log.format_pv1 where year="2013" and month="10" and url_first_id=1
  ) d
  group by ip
) b
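
One caveat on equivalence: count(distinct ip) ignores NULL values, while group by ip produces one group for NULL, which count(*) then counts. Whether any of these tables actually contain NULL ips is not stated in the post, so the following is only a sketch of a NULL-safe variant, under the assumption that the result should match count(distinct ip) exactly:

```sql
-- Sketch: NULL-safe form of the GROUP BY rewrite.
-- count(ip) counts only non-NULL values, so the NULL group (if any)
-- produced by "group by ip" is not included in the result.
select count(ip)
from (
  select ip
  from (
    select ip as ip from comprehensive.f_client_boot_daily where year="2013" and month="10"
    union all
    select pub_ip as ip from f_app_boot_daily where year="2013" and month="10"
    union all
    select ip as ip from format_log.format_pv1 where year="2013" and month="10" and url_first_id=1
  ) d
  group by ip
) b;
```

If the ip columns are known to be non-NULL, count(*) and count(ip) give the same answer here.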

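Next, set a sensible number of reducers so that the data is spread across multiple machines: set mapred.reduce.tasks=50;
After this optimization the speedup was very noticeable: the whole job finished in a little over 20 minutes.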
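
For reference, the reducer setting can be written in a few ways. The sketch below is not from the original post beyond the first line; the property names are standard Hadoop/Hive settings, but the numeric values are illustrative assumptions and would need tuning for a real cluster:

```sql
-- Setting used in the post (older MapReduce property name).
set mapred.reduce.tasks=50;

-- Assumption: on Hadoop 2 and later the same knob is named mapreduce.job.reduces.
set mapreduce.job.reduces=50;

-- Alternative (assumption: let Hive estimate the reducer count from input size).
-- The values below are illustrative only.
-- set mapred.reduce.tasks=-1;
-- set hive.exec.reducers.bytes.per.reducer=256000000;
-- set hive.exec.reducers.max=200;
```

The reducer count matters for the group by ip stage, where the 3 billion rows are deduplicated. The final count(*) has far less to process, since it only sees one row per distinct ip.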


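Analysis:

1. On small data sets the first statement (count(distinct)) is the better choice; on large data sets the second (count over GROUP BY) holds up better, mainly because it does not run out of memory. count(distinct) has to load all of the sort keys into memory and then compare them to deduplicate and count. The GROUP BY form sorts the keys first and then counts, which actually costs more time overall, but it reduces the risk of OOM.

2. With current Hive engines this is much less of a problem: Tez or Spark can often hold all of the keys in memory.
count(distinct) can still cause data skew.
The GROUP BY form also increases the number of jobs; it is only an optimization of DISTINCT that is paid for with extra time.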
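
One way to check these points on a given cluster is to compare the plans of the two forms. The sketch below assumes a hypothetical table tmp_ips with a single ip column, standing in for the unioned subquery; the engine setting only applies where Tez is actually installed:

```sql
-- Hypothetical table: tmp_ips(ip string), standing in for the unioned subquery d.

-- Optional, assumption: the cluster has Tez available.
-- set hive.execution.engine=tez;

-- Form 1: global count(distinct).
explain
select count(distinct ip) from tmp_ips;

-- Form 2: deduplicate with GROUP BY, then count the groups.
explain
select count(*)
from (select ip from tmp_ips group by ip) b;
```

The number of stages in each plan, and the reducer count of each stage, show where the extra job comes from. The actual plans depend on the Hive version and its optimizer settings, so no output is shown here.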
