[Hive] Hive Data Skew, Optimization Strategies, Hive Execution Process, and Garbage Collection

时间的美景

Categories: Hadoop, Hive. Tags: hive, hadoop, big data

Article link: https://blog.csdn.net/jiajane/article/details/103974270

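This article examines data skew in Hive in detail: its definition, common cases, causes, and solutions. It also analyzes Hive's execution process and introduces optimization strategies such as setting a reasonable number of MapReduce tasks, bucketing, partitioning, and choosing a file storage format. Finally, it discusses Hive's garbage collection and its effect on data safety.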

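1. Data skew

1.1 What is data skew?

Data skew occurs when unevenly distributed data piles up on a single point and creates a data hotspot.
Characteristics of the Hadoop framework:
- Large data volumes are not a problem; skewed data is.
- A query that compiles into many jobs, for example one with many subqueries, runs relatively inefficiently.
- Aggregate functions such as sum, count, max, and min usually do not cause data skew.
Main symptoms:
The job progress stays at around 99% or 100% for a long time. The task monitoring page shows that only a few reduce subtasks are unfinished, because they process far more data than the other reducers.
The number of records handled by a single reducer differs greatly from the average, often by several times, and the longest task duration is far above the average duration.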
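
The second symptom can be expressed as a number. The following is a minimal Python sketch, not Hive code: it uses `zlib.crc32` as a stand-in for the partitioner's hash, and the key list, the `hot_key` name, and the reducer count are made-up values for illustration. It counts how many records each reducer would receive and divides the largest load by the average load.

```python
import zlib
from collections import Counter


def reducer_for(key: str, num_reducers: int) -> int:
    """Stand-in for hash partitioning: hash(key) mod number of reducers."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_loads(keys, num_reducers):
    """Count how many records each reducer would receive."""
    loads = Counter(reducer_for(k, num_reducers) for k in keys)
    return [loads.get(r, 0) for r in range(num_reducers)]


def skew_ratio(loads):
    """Largest reducer load divided by the average reducer load."""
    return max(loads) / (sum(loads) / len(loads))


# Hypothetical data: 1,000 distinct keys plus one hot key with 9,000 records.
keys = [f"user{i}" for i in range(1000)] + ["hot_key"] * 9000
loads = reducer_loads(keys, num_reducers=10)
print(loads, round(skew_ratio(loads), 2))
```

A ratio close to 1 means the reducers are balanced. Here the reducer that owns `hot_key` receives at least 9,000 of the 10,000 records, so the ratio is at least 9.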

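1.2 Situations prone to data skew

Operation	Situation	Consequence
join	One table is small, but its keys are concentrated	The data sent to one or a few reducers is far above the average
join	Large table joined with large table, but the bucketing key has too many 0 or null values	All of these null values are handled by a single reducer, which is very slow
group by	The group by dimension is too small and one value has too many rows	The reducer handling that value takes a long time
count distinct	A particular value occurs too often	The reducer handling that value takes a long time

reduce join: when Hive executes a join as a reduce join, data skew is very likely.
group by used without an aggregate function.
select count(*) from course; a global count (runs in a single reduce task).
count(distinct): with large data volumes this skews easily, because count(distinct) groups by the group by column and sorts by the distinct column.
A small table joined with a very large table.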

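1.3 Causes of data skew

Uneven key distribution
Characteristics of the business data itself
Poorly thought-out table design
Some HQL statements are inherently prone to data skew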

1.4 Cases that do not cause data skew

Queries that do not run an MR job
A fetch task is not converted into an MR job.
Configuration parameter:

<property>
   <name>hive.fetch.task.conversion</name>
   <value>more</value>
   <description>
     Allowed values: [none, minimal, more].
     Some select queries can be converted to single FETCH task minimizing latency.
     Currently the query should be single sourced not having any subquery and should not have
     any aggregations or distincts (which incurs RS), lateral views and joins.
     0. none : disable hive.fetch.task.conversion
     1. minimal : SELECT *, FILTER on partition columns, LIMIT only
		That is: select *, filters on partition columns only, and limit
     2. more    : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
		That is: select on any columns, filters on any columns, and limit
   </description>
 </property>

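group by used together with aggregate functions (sum, count, max, min)
When group by is used with the aggregate functions above, a combiner runs on the map side by default (a partial aggregation that reduces the data sent to the reduce tasks). The reduce side then receives far less data, so data skew generally does not occur.
select id,count(*) from course group by id;
map join
Pros and cons of map join versus reduce join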
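
The map-side partial aggregation described above for group by with count can be sketched in a few lines. This is a Python simulation under stated assumptions, not Hive internals: the two input splits and the course ids are invented, and `map_side_combine` and `reduce_counts` are hypothetical helper names.

```python
from collections import Counter


def map_side_combine(split):
    """Partial aggregation inside one map task: one (key, count) pair per distinct key."""
    return list(Counter(split).items())


def reduce_counts(pairs):
    """Final aggregation on the reduce side: sum the partial counts per key."""
    totals = Counter()
    for key, partial in pairs:
        totals[key] += partial
    return dict(totals)


# Hypothetical input: two map splits of course ids, with id 1 heavily repeated.
splits = [[1] * 5000 + [2] * 10, [1] * 4000 + [3] * 20]

shuffled = [pair for split in splits for pair in map_side_combine(split)]
raw_records = sum(len(split) for split in splits)
result = reduce_counts(shuffled)
print(raw_records, len(shuffled), result)  # 9030 4 {1: 9000, 2: 10, 3: 20}
```

Although id 1 accounts for 9,000 of the 9,030 input rows, only 4 partial records cross the shuffle, so the reducer that owns id 1 is not overloaded.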

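1.5 Business scenarios

1.5.1 Data skew caused by null values

Scenario:

user	user information table  userid  one row is added each time a user registers; userid != null
log		log table  userid  one row is added for every user action; actions by users who are not logged in have userid = null, and there is a great deal of this data (5 TB)

Operation: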

select * from user a join log b on a.userid=b.userid;

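map-key: userid
	userid contains a large number of null values, and all of them end up in one reduce task
	so the reduce task that receives the nulls has a very large data volume
	100 reduce tasks
	99 reduce tasks   0.05 TB
	1 reduce task     5 TB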
	
	reducetask    99%    
	reducetask    99% 
	reducetask    99%

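Solutions:
Solution 1: exclude null values from the join

select 
* 
from user a join 
(select * from log where userid is not null) b 
on a.userid=b.userid;
-- when the null rows are also needed
-- note: both branches of union all must return the same columns, so in practice
-- the second branch has to pad the user columns with nulls to match the actual schema
select 
* 
from user a join 
(select * from log where userid is not null) b 
on a.userid=b.userid 
union all 
select * from log where userid is null;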

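Solution 2: give null values a new key

-- Normally all null values go to one reduce task; here each null is turned into a different key before the join
-- e.g. null123, null456, null112; each key is then assigned by hash % number of reducers
select 
* 
from user a 
join log b 
on case when b.userid is null then concat('null',cast(cast(rand()*1000 as int) as string)) else b.userid end = a.userid;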

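Summary:
Solution 2 is more efficient than solution 1: it needs less IO and fewer jobs.
In solution 1 the log table is read twice, so there are necessarily 2 jobs, while solution 2 needs only 1.
This optimization suits data skew caused by invalid ids (such as -99, '', or null). Turning the null key into a string plus a random number spreads the skewed rows across different reducers and resolves the skew.

What changes:
Records whose key is null no longer crowd into the same reduce task; the substitute random string keys scatter them across multiple reduce tasks. Because null values never match in the join anyway, the final result is unaffected.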
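
The effect of replacing null keys with a string plus a random number can be checked with a small simulation. This is a Python sketch under stated assumptions rather than Hive's actual partitioner: `zlib.crc32` stands in for the hash function, the row counts are invented, and `salted_key` is a hypothetical helper that mirrors the case expression in solution 2.

```python
import random
import zlib
from collections import Counter

NUM_REDUCERS = 100


def reducer_for(key: str) -> int:
    """Stand-in for hash partitioning of a join key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS


def salted_key(userid, rng):
    """Replace a null userid with 'null' plus a random integer in [0, 1000)."""
    if userid is None:
        return "null" + str(rng.randrange(1000))
    return userid


rng = random.Random(42)
# Hypothetical log table: 50,000 anonymous rows (userid is None) and 5,000 logged-in rows.
log_userids = [None] * 50_000 + [str(i) for i in range(5_000)]

# Without salting, every null row carries the same key, so one reducer receives all of them.
unsalted = Counter(reducer_for("null" if u is None else u) for u in log_userids)
# With salting, the null rows are spread over up to 1,000 distinct keys.
salted = Counter(reducer_for(salted_key(u, rng)) for u in log_userids)

print(max(unsalted.values()), max(salted.values()))
```

The salted keys all start with `null`, so they cannot collide with the numeric ids used here, which is why the join result stays the same while the heaviest reducer load drops sharply.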

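1.5.2 Data skew caused by joining different data types

Scenario:

-- user_id is int in the user table, while user_id in the log table holds both string and int values
user userid  int 
log  userid  string 
-- When the two tables are joined on user_id, the default hash operation distributes rows by the int id,
-- so all records with a string id are sent to the same reducer

Solution:
Convert the numeric id to a string id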

select * from user a left outer join log b on b.user_id = cast(a.user_id as string)

1.5.3 Data skew caused by joining large and small tables

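Large-small or small-small joins (small table <= 23.8 MB)
By default Hive performs these as map-side joins; the small table must not exceed about 23.8 MB.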

<property>
	<name>hive.auto.convert.join</name>
	<value>true</value>
	<description>Whether Hive enables the optimization about converting common join into mapjoin based on the input file size
	Specifies whether map join is enabled
	</description>
	</property>
	
<property>
	<name>hive.mapjoin.smalltable.filesize</name>
	<value>25000000</value>
	<description>
	  The threshold for the input file size of the small tables; if the file size is smaller 
	  than this threshold, it will try to convert the common join into map join
	  Size limit for the small table in a map join; by default it must not exceed about 23.8 MB
	  If the smaller table is no larger than 23.8 MB, the join runs as a map join
	</description>
</property>
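
The figure of about 23.8 MB follows directly from the configured value of 25000000 bytes. A short Python check of that arithmetic is shown here; `qualifies_for_auto_mapjoin` is a hypothetical helper that only applies the "smaller than this threshold" rule from the description and is not a Hive API.

```python
# Value of hive.mapjoin.smalltable.filesize from the configuration above, in bytes.
SMALLTABLE_FILESIZE_BYTES = 25_000_000


def bytes_to_mib(num_bytes: int) -> float:
    """Convert bytes to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


def qualifies_for_auto_mapjoin(table_bytes: int,
                               threshold: int = SMALLTABLE_FILESIZE_BYTES) -> bool:
    """Hypothetical check: the small table must be smaller than the threshold."""
    return table_bytes < threshold


print(round(bytes_to_mib(SMALLTABLE_FILESIZE_BYTES), 2))  # 23.84
print(qualifies_for_auto_mapjoin(20 * 1024 * 1024))       # True
print(qualifies_for_auto_mapjoin(300 * 1024 * 1024))      # False
```

A 300 MB table such as the log table in the next example is above the threshold, which is why it is not converted automatically.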

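Large × medium joins (medium table: larger than 23.8 MB, but small enough to fit in memory)
By default a reduce join is executed (which is prone to data skew).
Force a map join with the hint /*+mapjoin(medium table to load into memory)*/. On Hive versions where hive.ignore.mapjoin.hint defaults to true, the hint is honored only after that property is set to false.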
```
-- user  1 TB   log 300 MB
select 
/*+mapjoin(a)*/ 
* 
from log a join user b on a.userid=b.userid;
-- this still runs as a map join
```
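
Why a map join avoids skew can be seen from how it works: the smaller table is loaded into an in-memory hash table and the large table is streamed past it, so no rows are shuffled to reducers by key. The following is a minimal Python sketch of that idea, not Hive's implementation; the `map_join` function and the sample rows are hypothetical.

```python
def map_join(big_rows, small_rows, key):
    """Inner equi-join done entirely on the map side: no shuffle by key is needed."""
    # Build phase: load the smaller table into an in-memory hash table.
    lookup = {}
    for row in small_rows:
        if row[key] is not None:  # a null key never matches in an equi-join
            lookup.setdefault(row[key], []).append(row)
    # Probe phase: stream the big table and emit matches row by row.
    for big in big_rows:
        for small in lookup.get(big[key], []):
            yield small, big


# Hypothetical rows: log plays the in-memory table, user is the streamed table.
log_rows = [
    {"userid": 1, "action": "click"},
    {"userid": 1, "action": "buy"},
    {"userid": None, "action": "view"},
]
user_rows = [{"userid": 1, "name": "a"}, {"userid": 2, "name": "b"}]

joined = list(map_join(user_rows, log_rows, "userid"))
print(len(joined))  # 2
```

Each map task can hold its own copy of the hash table, so a hot key only lengthens the probe output of whichever tasks happen to read it and never concentrates on a single reducer. The cost is memory: the approach only works while the hashed table fits in memory.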