Hive: Data Skew

非本人文章

Last modified: 2022-03-13 23:17:15


Category: hive. Tags: hive, big data

First published: 2022-03-13 23:14:13

Article link: https://blog.csdn.net/onlybymyself/article/details/123468691


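About data skew

Before defining data skew, it helps to think about how data is distributed:

How it happens: put simply, data skew occurs when the data is not spread evenly enough during a computation, so a large share of it lands on one or a few machines. Those machines take far longer than the average task, and because the job finishes only when its slowest task does, the whole computation slows down.

Symptoms: the job's progress stays at around 99% or 100% for a long time, and the job monitoring page shows that only a few reduce tasks are still unfinished.

Scenarios that cause data skew:

4-1: Joining a large table with a small table (the small table is 25 MB)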
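When a job shows this symptom, one way to confirm the cause is to look at how many rows each key value has. The query below is a minimal sketch that assumes the `log` table, the `dt` partition, and the `user_id` key used in the examples of this post; the limit of 20 is an arbitrary choice.

```sql
-- List the most frequent key values. A few values with row counts far above
-- the rest (often '' or NULL) point to a skewed key.
select
    user_id,
    count(*) as cnt
from log
where dt = '2022-01-23'
group by user_id
order by cnt desc
limit 20;
```

If a handful of values dominate the result, the reducers that receive those values are the ones still running at 99%.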

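Solution: use a map join to avoid the skew caused by joining a small table with a large one. This method is used very often.

We have already set this optimization in our shared parameters, so this case is now avoided automatically. If the default of 25 MB is not enough, the threshold can be adjusted.

The parameter is hive.mapjoin.smalltable.filesize, which defaults to 25000000 bytes (about 25 MB). It can be tuned as needed, but values above 100 MB are not recommended: the small table is loaded into memory, so a large threshold tends to cause out-of-memory errors.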
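As an illustration, the settings might look like the sketch below. `hive.auto.convert.join` and `hive.mapjoin.smalltable.filesize` are standard Hive properties; the value 50000000 is only an example, and the `dim_city` table with its `city_id` and `city_name` columns is hypothetical.

```sql
-- Let Hive convert eligible joins to map joins automatically.
set hive.auto.convert.join=true;
-- Raise the small-table threshold from the default 25000000 bytes to 50000000 bytes.
set hive.mapjoin.smalltable.filesize=50000000;

-- dim_city is a hypothetical small dimension table. If its size is below the
-- threshold, it is loaded into memory and the join runs on the map side,
-- so no reduce stage (and no reducer skew) is involved.
select
    a.*,
    b.city_name
from log a
join dim_city b on a.city_id = b.city_id
where a.dt = '2022-01-23';
```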

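4-2: The join key contains '' or NULL values

All rows whose user_id is '' or NULL are sent to the same reducer, which then has far more rows to process than the others. For example:

select
    a.*
from
(
    select * from log where dt='2022-01-23'
) a
left join
(
    select * from users where dt='2022-01-23'
) b on a.user_id = b.user_id;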

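Method 1: filter out rows with an empty user_id before the join. Note that these rows no longer appear in the result.

select
    a.*
from
(
    -- user_id <> '' also removes NULL values, because NULL <> '' evaluates to NULL
    select * from log where dt='2022-01-23' and user_id<>''
) a
left join
(
    select * from users where dt='2022-01-23'
) b on a.user_id = b.user_id;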

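Method 2: replace empty or NULL keys with a random value so that those rows are spread across reducers. The random keys match nothing in users, so the result of the left join is unchanged.

select
    a.*
from
(
    select * from log where dt='2022-01-23'
) a
left join
(
    select * from users where dt='2022-01-23'
) b on case when nvl(a.user_id,'') = '' then concat('hive',rand()) else a.user_id end = b.user_id;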
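Another commonly used variant, which is not from the original post, joins only the rows that have a usable key and appends the empty-key rows with `union all`. The sketch below assumes `user_id` is a string column and, like the queries above, selects only the columns of `log`.

```sql
-- Rows with a real key take part in the join.
select
    a.*
from
(
    select * from log where dt='2022-01-23' and nvl(user_id,'') <> ''
) a
left join
(
    select * from users where dt='2022-01-23'
) b on a.user_id = b.user_id

union all

-- Rows with an empty or NULL key cannot match anything in users,
-- so they skip the join and are appended unchanged.
select
    c.*
from log c
where c.dt='2022-01-23' and nvl(c.user_id,'') = '';
```

Compared with method 1 this keeps the empty-key rows in the result, and compared with method 2 it avoids shuffling them at all, at the cost of reading `log` twice.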

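4-3: Skew when joining two large tables

4-4: Skew caused by count(distinct) without group by

When count(distinct) is used without a group by clause, a single reducer handles all of the deduplication, which skews the job.

select count(distinct user_id) from shuidi_dwb.dwb_sdb_order_info_full_d where dt='2022-01-01';

This query either runs for a very long time or fails.

Solution: deduplicate with group by first, which can run on many reducers, and then count the rows.

select count(1) from (select user_id from shuidi_dwb.dwb_sdb_order_info_full_d where dt='2022-01-01' group by user_id) t;

Note that the rewritten query counts a NULL user_id as one value, whereas count(distinct) ignores NULLs. Add user_id is not null to the inner query if the two results must match exactly.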

4-5: Skew caused by count(distinct) with group by

With large data volumes, count(distinct) skews easily, because rows are partitioned by the group by column and sorted by the distinct column. All rows of one group therefore go to the same reducer.

select os_type,count(distinct crypto_mobile) from shuidi_sdm.sdm_user_account_full_d where dt='2022-01-23'
group by os_type;

Solution:

select
    os_type
    ,sum(partial_uv) as uv
from
(
    select
        os_type,
        random_key,
        sum(1) as partial_uv -- counts the deduplicated crypto_mobile values; this is a partial aggregate
    from (
        -- one row per (os_type, crypto_mobile), tagged with a random key between 0 and 999
        select os_type,crypto_mobile,cast(rand() * 1000 as int) as random_key from shuidi_sdm.sdm_user_account_full_d where dt='2022-01-23'
        group by os_type,crypto_mobile
    ) t1
    group by os_type,random_key
) t2
group by
    os_type;
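The same two-stage idea can also be written with a deterministic bucket instead of `rand()`. This is only a sketch of a variant, not the method from the post: the bucket count of 1000 mirrors the random key range above, `hash` and `pmod` are built-in Hive functions, and the `is not null` filter is added so that the result matches count(distinct), which ignores NULLs.

```sql
select
    os_type,
    sum(partial_uv) as uv
from
(
    -- Stage 2: partial counts per (os_type, bucket), spread over many reducers.
    select
        os_type,
        bucket,
        count(*) as partial_uv
    from
    (
        -- Stage 1: one row per distinct (os_type, crypto_mobile) pair,
        -- with a bucket number between 0 and 999 derived from the value itself.
        select
            os_type,
            crypto_mobile,
            pmod(hash(crypto_mobile), 1000) as bucket
        from shuidi_sdm.sdm_user_account_full_d
        where dt='2022-01-23'
          and crypto_mobile is not null
        group by os_type, crypto_mobile, pmod(hash(crypto_mobile), 1000)
    ) dedup
    group by os_type, bucket
) partial_counts
group by os_type;
```

Because the bucket depends only on `crypto_mobile`, rerunning the query or retrying a failed task assigns each value to the same bucket.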