Data Skew in Hive

流浮影

Article link: https://blog.csdn.net/weixin_44273391/article/details/101107024

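Data skew:

Data skew is what happens when keys are unevenly distributed, so that the data piles up in one direction.
The data itself is already skewed:
join statements easily cause skew
count(distinct col) is very prone to skew
group by can also cause skew
Pay attention to skewed joins in Hive
Keys are distributed unevenly across the reducers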
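
As a small illustration of the count(distinct col) case: a global count(distinct) is typically computed by a single reducer, so a common workaround (not from the original post) is to deduplicate with group by first and count afterwards. The sketch below assumes the t_user2 table and its id column from the join example in this post; adjust the names to your own schema.

```sql
-- Skew-prone form: every id value is sent to one reducer for the distinct count.
select count(distinct id) from t_user2;

-- Rewritten form: the inner group by deduplicates id across many reducers,
-- and the outer query only counts the already-deduplicated rows.
-- The "is not null" filter keeps the result equal to count(distinct id),
-- which ignores NULL values.
select count(*)
from (
  select id
  from t_user2
  where id is not null
  group by id
) t;
```

The rewrite costs an extra stage, so it tends to pay off only when the column has many distinct values.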
Symptom of skew:
The job gets stuck on a single reduce task.
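
When a job stalls like this, a quick way to check whether a key is to blame is to count rows per key and look at the largest groups. This is only a sketch; it assumes the join or group by field is the id column of the t_user2 table used elsewhere in this post.

```sql
-- List the ten most frequent id values.
-- If one value has far more rows than the rest, it is a likely skewed key.
select id, count(*) as cnt
from t_user2
group by id
order by cnt desc
limit 10;
```

On a very large table it may be cheaper to run the same query against a sample of the data first.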
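Solutions:
1. Find the key that causes the skew, then work around it in the HQL statement: check the logs to see which task failed -> find the fields that task uses in join, group by, or count(distinct col) -> sample the number of rows per value of those fields -> decide whether a key is skewed. Process the skewed key on its own, then combine it with the result for the normal keys using union all.
2. Add a random number to the skewed key. The random number must not cause a second round of skew, and it must not change the original business logic.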

 select 
 t2.*
 from t_user2 t2
 join t_user2 t1
 on t2.id = t1.id
 ;
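
The query above is a plain join on id. One possible way to add a random number to it, shown here as a sketch rather than the author's exact method, is to give each row on one side a random salt and to replicate each row on the other side once per salt value. The bucket count of 10 and the salt column name are assumptions made for illustration.

```sql
-- Salted version of the join above, assuming 10 salt buckets (0..9).
select t2s.*   -- note: the output also contains the helper column "salt"
from (
  -- Left side: each row gets one random salt in 0..9,
  -- so rows sharing a hot id are spread over up to 10 reducers.
  select t2.*, cast(floor(rand() * 10) as int) as salt
  from t_user2 t2
) t2s
join (
  -- Right side: each row is copied once for every salt value,
  -- so every left row still finds exactly one matching copy.
  select t1.id, s.salt
  from t_user2 t1
  lateral view explode(array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)) s as salt
) t1s
on t2s.id = t1s.id
and t2s.salt = t1s.salt
;
```

Because every left row matches exactly one of the replicated right rows, the set of joined rows is the same as in the unsalted query, which is what keeps the business logic intact. The price is that the right side grows by the number of buckets, so the bucket count should stay small.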
3. Set the skew-related properties (the values shown are the defaults)
hive.map.aggr=true;
hive.groupby.skewindata=false;  (enabling this is recommended)
hive.optimize.skewjoin=false;
4. If none of the above works, go back over the business requirements and optimize the flow of the statements.
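
Solution 1 (process the skewed key separately, then union all) could look roughly like the following. The skewed value id = 0 is a made-up placeholder, and id is assumed to be numeric; in practice the value comes out of the sampling step.

```sql
select u.*
from (
  -- Normal keys: regular join with the skewed key filtered out on both sides.
  select a.*
  from (select * from t_user2 where id <> 0) a
  join (select * from t_user2 where id <> 0) b
  on a.id = b.id

  union all

  -- Skewed key: joined in its own branch, so it can be tuned separately
  -- (for example with a random salt) without touching the normal keys.
  select a.*
  from (select * from t_user2 where id = 0) a
  join (select * from t_user2 where id = 0) b
  on a.id = b.id
) u;
```

The union all is wrapped in a subquery because older Hive versions only accept it there.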

Check the current value:

set hive.skewjoin.key;
hive.skewjoin.key=100000

skewjoin-related properties:

  <property>
    <name>hive.skewjoin.key</name>
    <value>100000</value>
    <description>
      Determine if we get a skew key in join. If we see more than the specified number of rows with the same key in join operator,
      we think the key as a skew join key. 
    </description>
  </property>
  <property>
    <name>hive.skewjoin.mapjoin.map.tasks</name>
    <value>10000</value>
    <description>
      Determine the number of map task used in the follow up map join job for a skew join.
      It should be used together with hive.skewjoin.mapjoin.min.split to perform a fine grained control.
    </description>
  </property>
  <property>
    <name>hive.skewjoin.mapjoin.min.split</name>
    <value>33554432</value>
    <description>
      Determine the number of map task at most used in the follow up map join job for a skew join by specifying 
      the minimum split size. It should be used together with hive.skewjoin.mapjoin.map.tasks to perform a fine grained control.
    </description>
  </property>

33554432 bytes / 1024 / 1024 = 32 MB
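
These properties can also be set for a single session before running the problem query. The sketch below turns on the two skew switches and otherwise keeps the default values listed above; whether those thresholds suit a given job is something to test, not a recommendation.

```sql
-- Map-side partial aggregation (already the default).
set hive.map.aggr=true;
-- Two-stage group by: rows are first spread randomly for partial aggregation,
-- then aggregated by the real key in a second job.
set hive.groupby.skewindata=true;
-- Runtime skew join: keys with more rows than hive.skewjoin.key
-- are handled in a follow-up map join job.
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;
```

Running set with a property name and no value afterwards, for example set hive.optimize.skewjoin; prints the value currently in effect.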

skew-related properties:
