Hive SQL Optimization

徐小慧_Blog

Article link: https://blog.csdn.net/weixin_42073408/article/details/118308249

I. Hive Optimization Goal
Improve execution efficiency with limited resources.

II. How Hive Executes
HQL -> Job -> Map/Reduce

III. Execution Plan
View the execution plan:

explain [extended] hql

IV. Hive Table Optimization
1. Partitioning
Switch from static partitioning to dynamic partitioning:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

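2. Bucketing

set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;

These two properties apply to Hive 1.x and earlier. From Hive 2.0 on they were removed, and bucketing and sorting are always enforced.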

3. Data
Keep identical data clustered together wherever possible.

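V. Hive Query Optimization
1. Join optimization

set hive.optimize.skewjoin=true;

Set this to true if data skew shows up during a join.

set hive.skewjoin.key=100000;

This is the number of records for a single join key. A key with more records than this is treated as skewed and handled by the skew-join optimization.
2. mapjoin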

set hive.auto.convert.join=true;

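hive.mapjoin.smalltable.filesize

The default is about 25 MB (25000000 bytes). A table smaller than this is a candidate for automatic conversion to a mapjoin. A mapjoin can also be requested with a hint that names the alias of the small table:

select /*+ mapjoin(t) */ f.a, f.b from A t join B f on (f.a = t.a);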

3. When to use mapjoin
One of the joined tables is very small
Non-equi joins
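
Why a mapjoin wants one very small table can be seen in a minimal Python sketch of a map-side hash join: the small table is loaded into an in-memory dictionary and the large table is streamed past it, so no shuffle or reduce phase is needed. The function name and the sample rows are hypothetical, and this only illustrates the idea; it is not Hive's implementation.

```python
from collections import defaultdict

def map_side_join(small_rows, big_rows):
    """Join two lists of (key, value) pairs without a shuffle.

    small_rows must fit in memory; big_rows is only iterated once.
    """
    # Build phase: hash the small table by join key.
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)

    # Probe phase: each mapper streams its part of the big table.
    joined = []
    for key, value in big_rows:
        for small_value in lookup.get(key, []):
            joined.append((key, small_value, value))
    return joined

# Hypothetical data: a tiny dimension table and a larger fact table.
customers = [(1, "alice"), (2, "bob")]
orders = [(1, 9.5), (2, 3.0), (1, 7.25), (3, 1.0)]
result = map_side_join(customers, orders)
```

The memory cost is driven entirely by the small side, which is why the size threshold matters.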
4. bucket join
Both tables are bucketed in the same way (on the join key)
The bucket counts of the two tables are multiples of each other

create table `order`(cid int, price float) clustered by(cid) into 32 buckets;
create table customer(id int, first string) clustered by(id) into 32 buckets;
select price from `order` t join customer s on t.cid = s.id;
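
The reason the bucket counts need to be multiples of each other can be sketched in a few lines of Python. The sketch assumes non-negative integer keys that are bucketed by `key % num_buckets`; the function names are hypothetical. Under that assumption, a bucket of the smaller table only ever has to be matched against a fixed, small set of buckets in the larger table.

```python
def bucket_of(key, num_buckets):
    # Assumption: non-negative integer keys hash to themselves.
    return key % num_buckets

def matching_buckets(small_bucket, small_count, big_count):
    """Buckets of the larger table that can hold keys from small_bucket."""
    if big_count % small_count != 0:
        raise ValueError("bucket counts must be multiples of each other")
    return list(range(small_bucket, big_count, small_count))

# Bucket 5 of a 32-bucket table vs. a 64-bucket table.
candidates = matching_buckets(5, 32, 64)
```

With 32 and 64 buckets, each bucket of the first table pairs with exactly two buckets of the second, so the join can proceed bucket by bucket instead of over the whole tables.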

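Before join optimization (the filter sits in the where clause of the joined query):

select m.cid, u.id from `order` m join customer u on m.cid = u.id where m.dt = '2018-06-08';

After join optimization (filter first in a subquery, so fewer rows enter the join):

select m.cid, u.id from (select cid from `order` where dt = '2018-06-08') m join customer u on m.cid = u.id;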
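
As a quick sanity check that the two forms return the same rows, here is a small sketch using Python's built-in sqlite3 module rather than Hive. The table is named `orders` (hypothetical) to avoid the reserved word, a `dt` column is added, and the sample rows are made up. It checks equivalence only and says nothing about Hive's execution cost.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table orders(cid int, price float, dt text);
    create table customer(id int, first text);
    insert into orders values (1, 9.5, '2018-06-08'), (2, 3.0, '2018-06-07'),
                              (1, 7.0, '2018-06-08'), (3, 1.0, '2018-06-08');
    insert into customer values (1, 'alice'), (2, 'bob');
""")

# Filter applied after the join.
before = conn.execute("""
    select m.cid, u.id from orders m join customer u on m.cid = u.id
    where m.dt = '2018-06-08' order by 1, 2
""").fetchall()

# Filter applied in a subquery, before the join.
after = conn.execute("""
    select m.cid, u.id
    from (select cid from orders where dt = '2018-06-08') m
    join customer u on m.cid = u.id order by 1, 2
""").fetchall()
```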

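5. group by optimization

set hive.groupby.skewindata=true;

Set this to true if data skew shows up during a group by.

set hive.groupby.mapaggr.checkinterval=100000;

This is a row count used by map-side aggregation: after this many rows, Hive checks whether aggregating on the map side is still worthwhile.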
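
With hive.groupby.skewindata enabled, Hive plans the aggregation in two stages: rows are first spread randomly over reducers for partial aggregation, and the partial results are then grouped by key. A minimal Python sketch of that idea follows; the function name, the reducer count and the sample keys are hypothetical, and a fixed seed is used only to make the run repeatable.

```python
import random
from collections import Counter

def two_stage_count(keys, num_reducers=4, seed=0):
    """Count rows per key in two stages to soften the effect of a hot key."""
    rng = random.Random(seed)
    # Stage 1: random distribution, so a hot key is shared by all reducers.
    partials = [Counter() for _ in range(num_reducers)]
    for key in keys:
        partials[rng.randrange(num_reducers)][key] += 1
    # Stage 2: merge the partial counts by key.
    final = Counter()
    for partial in partials:
        final.update(partial)
    loads = [sum(p.values()) for p in partials]
    return final, loads

# Hypothetical skewed input: one hot key and two rare ones.
rows = ["hot"] * 1000 + ["a", "b"]
final, loads = two_stage_count(rows)
```

In a single-stage plan all 1000 "hot" rows would land on one reducer. Here `loads` shows them spread across the first-stage reducers, and the second stage only sees a handful of partial counts per key.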

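6. count distinct optimization
Before:

select count(distinct id) from tablename;

After (equivalent when id contains no NULLs; otherwise add where id is not null to the subquery):

select count(1) from (select distinct id from tablename) tmp;

select count(1) from (select id from tablename group by id) tmp;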
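
The rewrite helps because a plain count(distinct) tends to funnel all rows through a single reducer, while the subquery can deduplicate across many reducers before the final count. The sketch below uses Python's built-in sqlite3 module (not Hive) with made-up rows to check that the forms agree, and to show the NULL caveat: count(distinct id) ignores NULL, but a deduplicating subquery keeps it as one row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tablename(id int)")
conn.executemany("insert into tablename values (?)",
                 [(1,), (2,), (2,), (3,), (3,), (None,)])

def scalar(sql):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]

original = scalar("select count(distinct id) from tablename")
via_distinct = scalar(
    "select count(1) from "
    "(select distinct id from tablename where id is not null) tmp")
via_group_by = scalar(
    "select count(1) from "
    "(select id from tablename where id is not null group by id) tmp")
# Without the NULL filter the subquery keeps one extra row for NULL.
without_filter = scalar(
    "select count(1) from (select distinct id from tablename) tmp")
```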

Before:

select a, sum(b), count(distinct c), count(distinct d) from test group by a;

After:

select a, sum(b) as b, count(c) as c, count(d) as d
from (
  select a, 0 as b, c, null as d from test group by a, c
  union all
  select a, 0 as b, null as c, d from test group by a, d
  union all
  select a, b, null as c, null as d from test
) tmp1
group by a;
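
The union all rewrite works because each branch contributes to exactly one output column: the first two branches produce one row per distinct (a, c) and (a, d) pair, and the third carries the b values to be summed. A small check with Python's built-in sqlite3 module (not Hive) and made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table test(a text, b int, c int, d int)")
conn.executemany("insert into test values (?, ?, ?, ?)", [
    ("x", 1, 10, 100), ("x", 2, 10, 200), ("x", 3, 20, 200),
    ("y", 5, 30, 300), ("y", 5, 30, 300),
])

before = conn.execute("""
    select a, sum(b), count(distinct c), count(distinct d)
    from test group by a order by a
""").fetchall()

after = conn.execute("""
    select a, sum(b) as b, count(c) as c, count(d) as d
    from (
        select a, 0 as b, c, null as d from test group by a, c
        union all
        select a, 0 as b, null as c, d from test group by a, d
        union all
        select a, b, null as c, null as d from test
    ) tmp1
    group by a order by a
""").fetchall()
```

The NULL placeholders matter: count(c) and count(d) skip NULLs, so rows from the other branches do not inflate the counts.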

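VI. Hive Job Optimization
1. Parallel execution
Hive turns each query into multiple stages. Stages that do not depend on each other can run in parallel, which shortens execution time.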

set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;

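2. Local execution

set hive.exec.mode.local.auto=true;

A job actually runs in local mode only when it meets all of the following conditions:
The job's input size must be smaller than

hive.exec.mode.local.auto.inputbytes.max (default 128 MB)
The job's number of map tasks must be smaller than

hive.exec.mode.local.auto.tasks.max (default 4)
The job's number of reduce tasks must be 0 or 1
3. Merging small input files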

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

How many files are merged into one split is determined by the size limit mapred.max.split.size.
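
How a size limit turns many small files into a few splits can be sketched as a greedy packing loop in Python. This is a deliberate simplification: the real combine input format also takes node and rack locality into account and splits oversized files. The function name and file sizes are hypothetical.

```python
def combine_splits(file_sizes, max_split_size):
    """Greedily pack file sizes (bytes) into splits of at most max_split_size.

    A single file larger than the limit gets a split of its own here.
    """
    splits, current, current_size = [], [], 0
    for size in file_sizes:
        if current and current_size + size > max_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits

MB = 1024 * 1024
small_files = [10 * MB] * 100   # hypothetical: 100 files of 10 MB each
splits = combine_splits(small_files, 256 * MB)
```

Without combining, 100 small files would mean 100 mappers. With a 256 MB limit the same input fits into 4 splits in this sketch.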
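4. Merging small output files

set hive.merge.smallfiles.avgsize=256000000;

When the average output file size is below this value, Hive starts an additional job to merge the files.

set hive.merge.size.per.task=64000000;

The target size of the merged files.
VII. JVM Reuse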

set mapred.job.reuse.jvm.num.tasks=20;

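JVM reuse lets a job hold on to its slots for a long time, until the job finishes. This pays off for jobs with many tasks or many small files, because it cuts execution time. Do not set the value too high, though: some jobs have reduce tasks, and until those finish, the slots held by map tasks cannot be released, so other jobs may have to wait.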

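VIII. Compressing Data
Intermediate compression applies to the data passed between the multiple jobs of a Hive query. For intermediate compression, pick a codec with low CPU cost.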

set hive.exec.compress.intermediate=true;

set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

set hive.intermediate.compression.type=BLOCK;

The final output of a Hive query can also be compressed:

set hive.exec.compress.output=true;

set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

set mapred.output.compression.type=BLOCK;

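IX. Hive Map Optimization
set mapred.map.tasks=10; has no effect on its own
1. How the number of maps is computed

(1) Default number of maps

default_num = total_size / block_size;

(2) Desired number of maps

goal_num = mapred.map.tasks;

(3) Size of the data each map processes

split_size = max(mapred.min.split.size, block_size);
split_num = total_size / split_size;

(4) Computed number of maps

compute_map_num = min(split_num, max(default_num, goal_num))

To summarize how to set the number of maps:
(1) To increase the number of maps, set mapred.map.tasks to a larger value
(2) To reduce the number of maps, set mapred.min.split.size to a larger value
Case 1: the input is huge but does not consist of small files. Increase mapred.min.split.size.
Case 2: the input consists of a huge number of small files, each smaller than blockSize.
In this case increasing mapred.min.split.size does not help. Use CombineFileInputFormat to pack multiple input paths into one InputSplit per mapper, which reduces the number of mappers.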
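
Taken literally, step (3) never lets split_size drop below block_size, so a larger mapred.map.tasks could not raise the count. The Python sketch below instead follows the rule in Hadoop's old-API FileInputFormat, where the split size is max(minSize, min(goalSize, blockSize)) and goalSize is the total size divided by mapred.map.tasks. It assumes one large splittable input and ignores per-file boundaries and the split slop factor; the function name and sizes are hypothetical.

```python
import math

MB = 1024 * 1024

def estimate_map_num(total_size, block_size, min_split_size=1, goal_num=1):
    """Approximate map count for one large splittable input."""
    # goal_size comes from mapred.map.tasks (goal_num).
    goal_size = total_size // max(goal_num, 1)
    # min_split_size comes from mapred.min.split.size.
    split_size = max(min_split_size, min(goal_size, block_size))
    return math.ceil(total_size / split_size)

total = 10 * 1024 * MB   # hypothetical 10 GB input
block = 128 * MB

default_maps = estimate_map_num(total, block)
more_maps = estimate_map_num(total, block, goal_num=160)
fewer_maps = estimate_map_num(total, block, min_split_size=256 * MB)
unchanged = estimate_map_num(total, block, goal_num=10)
```

The last call shows why a small mapred.map.tasks does nothing: a goal of 10 maps gives a goal size above the block size, so the block size still decides the split.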

2. Map-side aggregation

set hive.map.aggr=true;

3. Speculative execution

mapred.map.tasks.speculative.execution

X. Hive Shuffle Optimization
Map side

io.sort.mb

io.sort.spill.percent

min.num.spills.for.combine

io.sort.factor

io.sort.record.percent

Reduce side

mapred.reduce.parallel.copies

mapred.reduce.copy.backoff

io.sort.factor

mapred.job.shuffle.input.buffer.percent

mapred.job.reduce.input.buffer.percent

XI. Hive Reduce Optimization
Queries that need a reduce phase
Aggregate functions
sum, count, distinct...
Advanced queries

group by, join, distribute by, cluster by...
order by is a special case: it uses only one reducer
Speculative execution

mapred.reduce.tasks.speculative.execution

hive.mapred.reduce.tasks.speculative.execution

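Reduce optimization
Set the number of reducers directly:

set mapred.reduce.tasks=10;

Or let Hive derive it from these two properties:

hive.exec.reducers.max (default 999; 1009 from Hive 0.14)
hive.exec.reducers.bytes.per.reducer (default 1 GB; 256 MB from Hive 0.14)
Formula
numTasks = min(maxReducers, input.size / perReducer)
maxReducers = hive.exec.reducers.max
perReducer = hive.exec.reducers.bytes.per.reducer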
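
The formula above can be tried out with a short Python sketch. Rounding up and the floor of one reducer are assumptions added here so that small inputs still get a reducer; the function name and input sizes are hypothetical, and the defaults are the 999 and 1 GB values quoted above.

```python
import math

def estimate_reducers(input_size, bytes_per_reducer=1_000_000_000,
                      max_reducers=999):
    """numTasks = min(maxReducers, input.size / perReducer), at least 1."""
    wanted = math.ceil(input_size / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))

ten_gb = estimate_reducers(10_000_000_000)
huge = estimate_reducers(5_000_000_000_000)     # capped by max_reducers
tiny = estimate_reducers(100)
smaller_chunks = estimate_reducers(10_000_000_000,
                                   bytes_per_reducer=256_000_000)
```

Lowering hive.exec.reducers.bytes.per.reducer raises the reducer count for the same input, up to the hive.exec.reducers.max cap.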
