Hive Tuning - CSDN Blog

Article link: https://blog.csdn.net/qq_36382679/article/details/106994818

Hive Tuning

1 Fetch task conversion

Purpose: whenever a query can be answered without MapReduce, do not launch a MapReduce job.

Property

<property>
    <name>hive.fetch.task.conversion</name>
    <value>more</value>
    <description>
      Expects one of [none, minimal, more].
      Some select queries can be converted to single FETCH task minimizing latency.
      Currently the query should be single sourced not having any subquery and should not have
      any aggregations or distincts (which incurs RS), lateral views and joins.
      0. none : disable hive.fetch.task.conversion
      1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
      2. more    : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
    </description>
</property>

Parameter

set hive.fetch.task.conversion;
+----------------------------------+--+
|               set                |
+----------------------------------+--+
| hive.fetch.task.conversion=more  |
+----------------------------------+--+

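With the value more, three kinds of queries are typically served by a fetch task directly, without running a MapReduce job:
- full-row select: select * from student;
- column select: select sno,sname,sage from student;
- limit query: select sno,sname,sage from student limit 5;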

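2 MapReduce local mode

Purpose: when a MapReduce job has to run, run it locally rather than on the YARN cluster whenever possible.

MapReduce execution modes

local

Local mode: a single-machine process simulates the runtime environment, so the job behaves like a standalone program.

yarn

Cluster mode: YARN schedules the resources and the computation runs distributed.

Which parameter decides the MapReduce mode?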
```
mapreduce.framework.name= local|yarn
```

Hive provides a parameter that switches between local mode and cluster mode automatically.

Parameter: set hive.exec.mode.local.auto=true;

Switching conditions

The total input size of the job is lower than: hive.exec.mode.local.auto.inputbytes.max (128MB by default)
The total number of map-tasks is less than: hive.exec.mode.local.auto.tasks.max (4 by default)
The total number of reduce tasks required is 1 or 0.

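In other words:
1. the job's total input is smaller than 128 MB;
2. the job has fewer than 4 map tasks;
3. the job has 0 or 1 reduce tasks.

When all three conditions hold, Hive runs the MapReduce job in local mode automatically.
If any one of them fails, Hive runs the job on the YARN cluster instead.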
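The switching rule can be written down as a small predicate. This is only a sketch of the three conditions as worded above, not Hive's own code; the function and argument names are made up for illustration, and the thresholds are the defaults quoted in the text.

```python
# Hypothetical helper that models the three switching conditions.
LOCAL_AUTO_INPUT_BYTES_MAX = 128 * 1024 * 1024  # hive.exec.mode.local.auto.inputbytes.max
LOCAL_AUTO_TASKS_MAX = 4                        # hive.exec.mode.local.auto.tasks.max


def runs_in_local_mode(input_bytes, map_tasks, reduce_tasks,
                       max_bytes=LOCAL_AUTO_INPUT_BYTES_MAX,
                       max_tasks=LOCAL_AUTO_TASKS_MAX):
    """Return True only when all three switching conditions hold."""
    return (input_bytes < max_bytes
            and map_tasks < max_tasks
            and reduce_tasks in (0, 1))


print(runs_in_local_mode(50 * 1024 * 1024, 2, 1))   # True: small job
print(runs_in_local_mode(500 * 1024 * 1024, 2, 1))  # False: input too large
```

A job that fails any single check, for example one with 2 reduce tasks, goes to the cluster even if its input is tiny.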

3 Hive join optimization

Background

Map side join

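If a join qualifies for a map-side join, Hive converts it to a map join automatically.
step1: a local task reads the small table and puts its data into the distributed cache.
step2: a map-only MapReduce job reads the large table, joins each row against the cached data, and writes the result.

Reduce side join (Hive common join)

The join column is used as the key and all rows are sent to the reducers. Rows with the same key arrive at the same reducer in the same group, where the join is completed.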
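To make the difference concrete, here is a minimal in-memory sketch of the two strategies for an inner join. It is an illustration under simplifying assumptions (plain Python lists stand in for tables, a dict stands in for the distributed cache and for the shuffle), and the table contents and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample data: (id, value) rows.
small_table = [(1, "beijing"), (2, "shanghai")]
large_table = [(1, "order-a"), (2, "order-b"), (1, "order-c"), (3, "order-d")]


def map_side_join(small, large):
    """Build a hash table from the small table, then stream the large table."""
    lookup = defaultdict(list)
    for key, value in small:          # step 1: cache the small table in memory
        lookup[key].append(value)
    joined = []
    for key, value in large:          # step 2: map-only pass over the large table
        for small_value in lookup.get(key, []):
            joined.append((key, value, small_value))
    return joined


def reduce_side_join(left, right):
    """Shuffle both tables by join key, then join inside each key group."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:           # map phase: rows are tagged by source table
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    joined = []
    for key, (left_values, right_values) in groups.items():  # reduce phase
        for lv in left_values:
            for rv in right_values:
                joined.append((key, lv, rv))
    return joined


print(sorted(map_side_join(small_table, large_table)))
print(sorted(reduce_side_join(large_table, small_table)))
```

Both functions produce the same rows. The map-side version never moves the large table's rows between workers, which is where the saving comes from.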

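Parameters for enabling map-side join

(1) Enable automatic map join selection
set hive.auto.convert.join = true;  -- default: true
(2) Set the size threshold that separates a small table from a large one:
set hive.mapjoin.smalltable.filesize= 25000000;  -- 25,000,000 bytes, about 23.84 MB

# In practice, if the small table is under this threshold, Hive tries to convert the join into a map-side join automatically.
# The only thing left to decide is how small a table has to be to count as small.
# So set the threshold according to your workload, and Hive takes care of the map join and the resulting speedup.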

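Joining a large table with a large table

If the business really needs two large tables joined, the first question is whether to run the join at all.

If you do run it, pay special attention to how null keys are handled.

Filtering null keys
	SELECT * FROM nullidtable WHERE id IS NOT NULL
Converting null keys
	Fixed value (all null keys still hash to the same reducer):
	CASE WHEN a.id IS NULL THEN 'hive' ELSE a.id END
	Random value (null keys are spread across reducers):
	CASE WHEN a.id IS NULL THEN concat('hive', rand()) ELSE a.id END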
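The effect of the two conversions on reducer load can be simulated. The sketch below is an assumption-laden model: `zlib.crc32` stands in for the partitioner's hash function, the reducer count of 4 is arbitrary, and all names are hypothetical.

```python
import random
import zlib
from collections import Counter

NUM_REDUCERS = 4  # hypothetical reducer count


def reducer_for(key):
    """Hash-partition a join key, as a shuffle assigns keys to reducers."""
    return zlib.crc32(str(key).encode("utf-8")) % NUM_REDUCERS


def fixed_key(row_id):
    # NULL key replaced by the constant 'hive'
    return "hive" if row_id is None else row_id


def salted_key(row_id, rng):
    # NULL key replaced by 'hive' plus a random suffix, as concat('hive', rand()) does
    return "hive" + str(rng.random()) if row_id is None else row_id


rng = random.Random(42)          # seeded so the sketch is repeatable
null_rows = [None] * 1000        # 1000 rows whose join key is NULL

fixed_load = Counter(reducer_for(fixed_key(r)) for r in null_rows)
salted_load = Counter(reducer_for(salted_key(r, rng)) for r in null_rows)

print("fixed :", dict(fixed_load))    # every NULL row lands on one reducer
print("salted:", dict(salted_load))   # NULL rows are spread over the reducers
```

The salted keys match nothing on the other side of the join, so the join result is unchanged; only the distribution of the null rows differs.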

Large-small and small-large table joins
- The tables can be written in either order; just keep one parameter under control:
```
set hive.mapjoin.smalltable.filesize= 25000000;
```
- If the small table is within this threshold, Hive tries to convert the join into a map-side join, which is faster.

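4 Data skew: GROUP BY optimization

Parameters

(1) Whether to aggregate on the map side (default: true)
set hive.map.aggr = true;
(2) Number of rows on which map-side aggregation is performed
set hive.groupby.mapaggr.checkinterval = 100000;
(3) Load-balance when the data is skewed (default: false)
set hive.groupby.skewindata = true;

Interpretation

When the data volume is small, Hive tries to aggregate on the map side.

If the reduce-side aggregation suffers from data skew, enable the load-balancing parameter.

Hive then launches two MapReduce jobs to process the data:
step1
	Rows are distributed randomly across the reducers, which breaks up the skewed keys, and each reducer computes partial aggregates.
step2
	The partial results of step1 are grouped by key and aggregated into the final result.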
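The two-job idea can be sketched for a simple count per key. This is an illustrative model only: a list of `Counter` objects plays the role of the first job's reducers, and the data and function name are hypothetical.

```python
import random
from collections import Counter


def two_stage_count(keys, num_reducers=4, seed=0):
    """Count rows per key in two passes, as described for skewed GROUP BY."""
    rng = random.Random(seed)

    # Job 1: send each row to a random reducer, so a hot key is split up,
    # and let every reducer produce partial counts.
    partials = [Counter() for _ in range(num_reducers)]
    for key in keys:
        partials[rng.randrange(num_reducers)][key] += 1

    # Job 2: merge the partial counts by key into the final counts.
    final = Counter()
    for partial in partials:
        final.update(partial)
    return partials, final


# Hypothetical skewed input: one hot key and a few rare ones.
rows = ["hot"] * 9000 + ["a"] * 10 + ["b"] * 5
partials, final = two_stage_count(rows)
print(final["hot"], final["a"], final["b"])  # 9000 10 5
print([p["hot"] for p in partials])          # the hot key's load per first-stage reducer
```

No first-stage reducer sees all 9000 rows of the hot key, and the second stage only has to merge a handful of partial counts per key.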

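5 Map task and reduce task parallelism

Map tasks

The number of map tasks is determined by the logical input split mechanism.

Merge the files when there are many small ones; adjust the block size when the files are large.

Reduce tasks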
```
(1) Amount of data each reducer processes; the default is 256 MB
hive.exec.reducers.bytes.per.reducer=256000000
(2) Maximum number of reducers per job; the default is 1009
hive.exec.reducers.max=1009
(3) mapreduce.job.reduces
The default is -1, which lets Hive decide based on the job.
```
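- If the third parameter is not set, Hive estimates the number of reduce tasks automatically from the amount of data entering the reduce phase.
- If mapreduce.job.reduces is set to a number, that number becomes the reduce task count.

  The value you set does not always take effect. With order by, for example, Hive also considers the SQL logic at compile time and puts logical correctness first.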
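How the three parameters interact can be approximated in a few lines. Treat this as a simplified model rather than Hive's planner: it assumes 256,000,000 bytes per reducer, uses a hypothetical function name, and ignores the cases (such as order by) where Hive overrides the setting for correctness.

```python
def estimate_reducers(reduce_input_bytes,
                      bytes_per_reducer=256_000_000,  # hive.exec.reducers.bytes.per.reducer
                      max_reducers=1009,              # hive.exec.reducers.max
                      configured=-1):                 # mapreduce.job.reduces
    """Simplified model of how the reduce task count is chosen."""
    if configured > 0:
        return configured                      # an explicit setting wins
    # Ceiling division: one reducer per bytes_per_reducer of input.
    estimated = -(-reduce_input_bytes // bytes_per_reducer)
    return max(1, min(max_reducers, estimated))


print(estimate_reducers(10 * 1000**3))                # 40: 10 GB of reduce input
print(estimate_reducers(10 * 1000**4))                # 1009: capped by the maximum
print(estimate_reducers(10 * 1000**3, configured=5))  # 5: explicit setting
```

Lowering the bytes-per-reducer value raises the estimate, up to the configured maximum.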

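6 Execution plan

Syntax: explain + sql
Use it to check whether the execution plan Hive generates matches what you expect from the SQL logic.
Parallel execution
- A query is broken into multiple stages. By default Hive runs them one after another, whether or not they depend on each other.
- Enable parallel execution: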
```
set hive.exec.parallel=true; 
set hive.exec.parallel.thread.number=16;
```
- Drawback: cluster resource usage rises at the moment stages run in parallel.
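Which stages could run side by side follows from the stage dependencies that explain prints. The sketch below groups a hypothetical plan into waves of independent stages; the stage names and the dependency map are invented for illustration, and the thread limit set by hive.exec.parallel.thread.number is not modeled.

```python
def execution_waves(dependencies):
    """Group stages into waves; stages in one wave have no pending parents."""
    remaining = {stage: set(parents) for stage, parents in dependencies.items()}
    waves = []
    while remaining:
        ready = sorted(s for s, parents in remaining.items() if not parents)
        if not ready:
            raise ValueError("cyclic stage dependencies")
        waves.append(ready)
        for stage in ready:
            del remaining[stage]
        for parents in remaining.values():
            parents.difference_update(ready)
    return waves


# Hypothetical plan: Stage-1 and Stage-2 are independent, Stage-3 needs both.
plan = {
    "Stage-1": [],
    "Stage-2": [],
    "Stage-3": ["Stage-1", "Stage-2"],
    "Stage-0": ["Stage-3"],
}
print(execution_waves(plan))
# [['Stage-1', 'Stage-2'], ['Stage-3'], ['Stage-0']]
```

Run serially, this plan takes four steps; with parallel execution the first two stages share one wave, so it takes three.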

7 Hive strict mode

Controlled by a property

<property>
    <name>hive.mapred.mode</name>
    <value>nonstrict</value>
    <description>
      The mode in which the Hive operations are being performed. 
      In strict mode, some risky queries are not allowed to run. They include:
        Cartesian Product.
        No partition being picked up for a query.
        Comparing bigints and strings.
        Comparing bigints and doubles.
        Orderby without limit.
    </description>
  </property>

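nonstrict: non-strict mode
strict: strict mode; the following SQL statements are rejected
- Cartesian products are not allowed
  - that is, a join that specifies no join condition
- Querying a partitioned table without specifying a partition (no where filter on the partition column)
- Comparing a bigint with a string
- Comparing a bigint with a double
- Global sorting without a limit (order by without limit)

With the default shown above, Hive runs in non-strict mode, which means it executes any query as long as the SQL is valid.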

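8 JVM reuse

Hive SQL is translated into MapReduce jobs for execution.
In MapReduce, every map task and reduce task is a Java process running in a JVM.
By default one JVM runs only one task.
With reuse enabled, a JVM can run several tasks in turn, which improves JVM utilization.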
```
set mapred.job.reuse.jvm.num.tasks=10;
```
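MapReduce speculative execution
```
Find tasks that run slowly and start a backup task for each of them.
The two tasks process the same data with the same logic.
Whichever finishes first provides the final result.
```
It is enabled by default, and in production it is usually recommended to turn it off. Speculation is only a guess: a task may simply be slow rather than faulty, and then the backup task wastes resources.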