Hive SQL优化

最新推荐文章于 2024-07-01 13:18:56 发布
allen sue
最新推荐文章于 2024-07-01 13:18:56 发布
阅读量283
点赞数
分类专栏： # hive
本文链接：https://blog.csdn.net/fish2009122/article/details/106355012
版权
hive 专栏收录该内容
7 篇文章 0 订阅
订阅专栏
                    
                        
                    
                    参考文档 
Hive之——Hive SQL优化
常规优化以及执行计划解析
 
基本思想 
HQL优化，本质是对MR/RDD的优化，可以通过explain hql来查看执行计划
 
常见思路 
表优化 
  分区（分区多，文件多） 
    静态分区
动态分区set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
 
 
分桶 
    规则：对分桶字段值进行哈希，哈希值除以桶的个数求余，余数决定了该条记录在哪个桶中，也就是余数相同的在一个桶中
好处 
      方便抽样
提高join的效率：若join的两张表有相同的列，且该列均进行了分桶，则join时将相同值的桶进行join操作即可，大大减少join的数据量
 
 set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;  # 开启强制排序，插数据到表中会进行强制排序，默认false
 
 
query优化 
  IO 
    只查询需要的列
尽可能用分区表，避免全表扫描
 
数据倾斜 
    joinset hive.optimize.skewjoin=true;  # 如果是join过程中出现倾斜 应该设置为true
set hive.skewjoin.key=100000;   # 这个是join的键对应的记录条数，超过这个值则会进行优化
 
group byset hive.group.skewindata=true;   # 如果是group by过程出现倾斜，应该设置为true
set hive.groupby.mapaggr.checkinterval=100000;  # 这个是group的键对应的记录条数超过这个值则会进行优化
 
count distinct：执行的MR是以GroupBy分组,再对distinct列排序,然后输出交给Reduce，所以，在reduce之前，本地的map已经完成预计算，并且提前做了一次聚合运算，如果此时的distinct造成了数据不平衡，则reduce时就会出现计算的数据量有大有小，即数据倾斜 
      解决方法：考虑用group by的子查询来替换count(distinct)# 替换前
select count(distinct id) from tablename;

select a, sum(b), count(distinct c), count(distinct d) from test group by a;

# 替换后
select count(1) from (select id from tablename group by id) tmp;

select a, sum(b) as b, count(c) as c, count(d) as d
from(
select a,0 as b, c, null as d from test group by a,c
union all
select a,0  as b, null as c, d from test group by a, d
union all
select a,b,null as c, null as d from test
)tmp1 group by a;          ```
 
 
常见值(null， 0， 1， -1， -99等默认值)：如果常见值的占比比较大时，较容易出现数据倾斜
 
表关联 
    大表放后：MR从后往前构建数据
同列关联：如可能,用同一列关联 同列关联,无论关联多少表都是一个Map搞定,如果不是同列,就会新开一个MR
mapjoin 
      关联操作中有一张小表
不等值的连接操作
 set hive.auto.convert.join=true;  #  hive.mapjoin.smalltable.filesize默认值是25mb，小表小于25mb自动启动mapjoin
 
bucket join 
      两个表以相同方式划分桶
两个表的桶个数是倍数关系
 # 优化前
select m.cid, u.id
from order m  join customer u on m.cid = u.id
where m.dt='2018-06-08'

# 优化后： where条件放在map端，而不是reduce端
select m.cid, u.id
from  (select cid from order where dt = '2018-06-08') m join customer u on m.cid = u.id;
 
 
Hive job优化 
    并行化执行：关联性不大的阶段，可以并行化执行set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
 
本地化执行set hive.exec.mode.local.auto=true;
 
      join满足如下条件才可用本地模式 
        job的输入数据大小必须小于参数：hive.exec.mode.local.auto.inputbytes.max(默认128MB)
job的map数必须小于参数：hive.exec.mode.local.auto.tasks.max(默认4)
job的reduce数必须为0或1
 
 
小文件合并：合并文件数由mapred.max.split.size限制的大小决定 
      合并输入set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
 
合并输出set hive.merge.smallfiles.avgsize=256000000;  # 当输出文件平均大小小于该值，启动新job合并文件
set hive.merge.size.per.task=64000000;  # 合并之后的文件大小
 
 
 
JVM重利用：Job长时间保留slot,直到作业结束。若有较多任务和较多小文件的任务时，该参数很有意义：减少执行时间。但此值不宜过大，有些作业会有reduce任务，一旦没有完成，则map占用slot不会释放，其它作业需要等待set mapred.job.reuse.jvm.num.tasks=20;
 
压缩数据 
    中间压缩：处理hive查询的多个job之间的数据，进行压缩set hive.exec.compress.intermediate=true;
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.intermediate.compression.type=BLOCK;
 
输出压缩set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapred.output.compression.type=BLOCK;
 
 
Hive Map优化 
    map参数 
      原则 
        如果想增加map个数，则设置mapred.map.tasks为一个较大的值
如果想减少map个数，则设置mapred.min.split.size为一个较大的值
输入文件size巨大，但不是小文件：增大mapred.min.split.size的值
输入文件数量巨大，且都是小文件，就是单个文件的size小于blockSize：增大mapred.min.split.size不可行，需要使用CombineFileInputFormat将多个input path合并成一个InputSplit送给mapper处理，从而减少mapper的数量
 
参数# 默认map个数
default_num = total_size / block_size;

# 期望大小
goal_num = mapred.map.tasks;

# 设置处理的文件大小
split_size=max(mapred.min.split.size, block_size)
split_num = total_size / split_size;

# 计算map个数
compute_map_num = min(split_num, max(default_num, goal_num))          ```
 
 
map端聚合set hive.map.aggr=true;
 
推测执行mapred.map.tasks.speculative.execution  # 默认为true
 
 
shuffle优化 
    map端参数# 默认100M。map节点运行没完成时，若内存数据过多，该设置就是内存缓冲的大小，在shuffle之前该项定了map输出结果在内存占用buffer的大小，当buffer达到阈值，则启动后台线程对buffer内存sort，然后spill到硬盘
io.sort.mb
  
# 上面参数buffer的阈值，默认0.8（80%）
io.sort.spill.percent  

# 默认值3。当spill数量不低于该值时，则combiner函数会在merge产生结果文件之间运行
min.num.spill.for.combine  

# 默认10。当一个map task执行完成后，本地磁盘上有若干spill文件，map task最后一件事就是执行merge sort，执行时每次同时打开多个spill文件，同时打开的文件数量由该值决定。说明：打开的文件越多，不一定merge sort就越快，也要根据数据情况适当的调整
io.sort.factor  

# 默认值0.05。io.sort.mb中用来保存map output记录边界的百分比，其他缓存用来保存数据
io.sort.record.percent  
 
reduce端参数# 默认5，reduce copy数据的线程数
mapred.reduce.parallel.copies  

# 默认 300(s)。reduce下载线程最大等待时间
mapred.reduce.copy.backoff  

# 默认0.7(70%)。Reduce用来存放从Map节点取过来的数据所用的内存占堆内存的比例
mapred.job.shuffle.input.buffer.percent  

# 默认值0。sort完成后reduce计算阶段用来缓存数据的百分比
mapred.job.reduce.input.buffer.percent  
 
 
reduce优化 
    reduce操作 
      聚合函数：sum, count, distinct …
高级查询：group by; join; distribute by; cluster by; order by…
 
推测执行：默认为truemapred.reduce.tasks.speculative.execution  # hadoop
hive.mapred.reduce.tasks.speculative.execution  # hive，二者效果一样，二选一即可
 
reduce优化set mapred.reduce.tasks=10;  # 直接设置
hive.exec.reducers.max   # 默认999
hive.exec.reducers.bytes.per.reducer   # 默认1G
 
计算公式numTasks = min(maxReducers, input.size / perReducer)
maxReducers = hive.exec.reducers.max
perReducer=hive.exec.reducers.bytes.per.reducer
 
 
相关参数 
    hive# Hive 自动转换联接无条件(mapjoin)大小，默认20M
hive.auto.convert.join.noconditionaltask.size

#  小文件平均大小合并阈值。默认16M
hive.merge.smallfiles.avgsize

# Spark 执行程序最大 Java 堆栈大小。默认256M
spark.executor.memory

# Spark 驱动程序最大 Java 堆栈大小。默认256M
spark.driver.memory

# Spark 驱动程序内存开销。默认26M
spark.yarn.driver.memoryOverhead

# Spark 驱动程序内存开销。默认26M
spark.yarn.executor.memoryOverhead
 
yarn# 每个作业的 Reduce 任务的默认数量，默认1
mapreduce.job.reduces

# 容器内存。默认8G
yarn.nodemanager.resource.memory-mb

# 最大容器内存。默认64G
yarn.scheduler.maximum-allocation-mb

#  最大容器虚拟 CPU 内核数量。默认32
yarn.scheduler.maximum-allocation-vcores
 
 
 
 
Explain执行计划 
不走MR 
  SQLexplain
select date, week, week_num, quarter, rank, pay_amt, pct_rank 
from home_page_pay_rank
where year='2020'
 
explain1	  STAGE DEPENDENCIES:
2	  Stage-0 is a root stage
3	
4	STAGE PLANS:
5	  Stage: Stage-0
6	    Fetch Operator
7	      limit: -1
8	      Processor Tree:
9	        TableScan
10	          alias: home_page_pay_rank
11	          filterExpr: (year = '2020') (type: boolean)
12	          Select Operator
13	            expressions: date (type: string), week (type: string), week_num (type: string), quarter (type: string), rank (type: int), pay_amt (type: string), pct_rank (type: string)
14	            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
15	            ListSink
16	
 
 

                
allen sue
关注
0
点赞
踩
2

收藏

觉得还不错? 一键收藏
0
评论
Hive SQL优化

参考文档Hive之——Hive SQL优化常规优化以及执行计划解析基本思想HQL优化，本质是对MR/RDD的优化，可以通过explain hql来查看执行计划常见思路表优化分区（分区多，文件多）静态分区动态分区set hive.exec.dynamic.partition=true;set hive.exec.dynamic.partition.mode=nonstrict;分桶规则：对分桶字段值进行哈希，哈希值除以桶的个数求余，余数决定了该条记录在哪个桶中，
复制链接

扫一扫
专栏目录