Hive Tuning

流浮影

Article link: https://blog.csdn.net/weixin_44273391/article/details/101106517

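1. Environment: server configuration, container configuration, and environment setup

2. Software configuration parameters:

3. Code-level optimization: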

Execution plan

explain and explain extended:

explain select * from text1;

explain extended select * from text1;

explain extended
select
  d.deptno as deptno,
  d.dname as dname
from dept d
union all
select
  d.deptno as deptno,
  d.dname as dname
from dept d
;

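A stage is roughly equivalent to one job. A stage can be a limit, a subquery, a group by, and so on.
By default Hive executes only one stage at a time, but stages that do not depend on each other can run in parallel.
The more complex the task and its HQL code, the more stages there are, and the longer the query generally runs.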

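join

In a Hive query, always let the small table (or result set) drive the large table (or result set).
In older Hive releases the on condition can only be an equi-join; Hive 2.2.0 and later also accept more complex expressions in the on clause.
Check whether Hive is configured to convert a common join into a map-side join, and check the small-table file size threshold used for map joins.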
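As a minimal sketch of the map-join settings mentioned above, the session below turns on automatic conversion and sets the small-table threshold. The threshold shown is the usual default; the table emp and its column ename are hypothetical names used only for illustration (dept is the table from the explain example).

```sql
-- Allow Hive to convert a common (reduce-side) join into a map-side join
-- when one side of the join is small enough.
set hive.auto.convert.join=true;
-- Size in bytes below which a table counts as "small" (about 25 MB here).
set hive.mapjoin.smalltable.filesize=25000000;

-- Hypothetical example: emp is assumed to be large, dept small.
select e.ename, d.dname
from emp e
join dept d on e.deptno = d.deptno;
```

Running explain on the join is one way to check whether the plan actually contains a map join operator.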

limit optimization

hive.limit.row.max.size=100000
hive.limit.optimize.limit.file=10
hive.limit.optimize.enable=false  (recommended to enable when limit queries are frequent)
hive.limit.optimize.fetch.max=50000
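The limit settings above could be applied per session as in the sketch below. The values are the ones listed above; text1 is the table from the explain example. Because this optimization works on a sample of the input, it is best suited to simple limit queries where any matching rows will do.

```sql
-- Try a smaller subset of the data first for simple LIMIT queries.
set hive.limit.optimize.enable=true;
-- Size to guarantee for each row when sampling.
set hive.limit.row.max.size=100000;
-- Maximum number of files that may be sampled.
set hive.limit.optimize.limit.file=10;
-- Maximum number of rows allowed for the subset in a fetch query.
set hive.limit.optimize.fetch.max=50000;

select * from text1 limit 10;
```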

Local mode

hive.exec.mode.local.auto=false (recommended to enable)
hive.exec.mode.local.auto.inputbytes.max=134217728  (128M)
hive.exec.mode.local.auto.input.files.max=4
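A minimal sketch of enabling local mode for a session, using the thresholds listed above. A query is only a candidate for local execution when its input stays under both limits; text1 is the table from the explain example and is assumed here to be small.

```sql
-- Let Hive decide automatically whether a job can run locally.
set hive.exec.mode.local.auto=true;
-- Total input must be smaller than this many bytes (128 MB).
set hive.exec.mode.local.auto.inputbytes.max=134217728;
-- The number of input files must not exceed this value.
set hive.exec.mode.local.auto.input.files.max=4;

select count(*) from text1;
```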

Parallel execution

hive.exec.parallel=false (recommended to enable)
hive.exec.parallel.thread.number=8;
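The sketch below enables parallel execution and runs a query whose two branches do not depend on each other. The metric and value aliases are made up for illustration; whether the branches really run at the same time depends on the generated plan and on free cluster resources.

```sql
-- Allow independent stages of one query to run concurrently.
set hive.exec.parallel=true;
-- Upper bound on how many stages may run at the same time.
set hive.exec.parallel.thread.number=8;

-- Two independent aggregations over dept; neither needs the other's result.
select 'row_count' as metric, count(*) as value from dept
union all
select 'max_deptno' as metric, cast(max(deptno) as bigint) as value from dept;
```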

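Strict mode

hive.mapred.mode=nonstrict
Strict mode blocks five kinds of queries:
1. Cartesian products
2. Queries on a partitioned table with no filter on a partition column
3. order by without limit
4. Comparisons between bigint (8) and string
5. Comparisons between bigint and double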
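To make the restrictions concrete, here is a small sketch with strict mode switched on. The rejected statements are left as comments. The partitioned table logs and its partition column dt are hypothetical; dept is the table from the explain example.

```sql
set hive.mapred.mode=strict;

-- Rejected in strict mode: order by without limit.
-- select * from dept order by deptno;

-- Accepted: the same query with a limit.
select * from dept order by deptno limit 100;

-- Rejected in strict mode: no filter on the partition column dt.
-- select * from logs;

-- Accepted: the partition column is filtered.
select * from logs where dt = '2019-09-01';
```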

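Number of mappers and reducers

Number of mappers and reducers:
More mappers and reducers is not always better, and neither is fewer.

Merge small files before processing (set the input format class to CombineTextInputFormat).
Merging small files through configuration: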
mapred.max.split.size=256000000   
mapred.min.split.size.per.node=1
mapred.min.split.size.per.rack=1
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat

Manual setting:
set mapred.map.tasks=2;

Number of reducers (determined automatically or set manually):
mapred.reduce.tasks=-1
hive.exec.reducers.max=1009
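The two modes for reducers can be sketched as follows. hive.exec.reducers.bytes.per.reducer is not listed above; it is added here as the setting that drives the automatic estimate, and its value is only illustrative.

```sql
-- Automatic: -1 lets Hive estimate the number of reducers.
set mapred.reduce.tasks=-1;
-- The estimate is roughly input size divided by this many bytes per reducer ...
set hive.exec.reducers.bytes.per.reducer=256000000;
-- ... and is capped at this maximum.
set hive.exec.reducers.max=1009;

-- Manual: uncomment to force a fixed number of reducers instead.
-- set mapred.reduce.tasks=4;

select deptno, count(*) as cnt
from dept
group by deptno;
```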

JVM reuse

Configure JVM reuse:
mapreduce.job.jvm.numtasks=1   ###
mapred.job.reuse.jvm.num.tasks=1
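The two properties above are the newer and the older name of the same setting. A minimal sketch of raising it for a session, with 10 as an arbitrary illustrative value:

```sql
-- Run up to 10 tasks of the same job in one JVM before starting a new one.
-- A value of 1 disables reuse; -1 means no limit.
set mapreduce.job.jvm.numtasks=10;
```

Reuse mainly helps jobs made up of many short tasks, where JVM start-up time is a noticeable share of the run time. Whether the setting has any effect depends on the Hadoop version and execution framework in use.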

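Data skew:

Data skew: keys are unevenly distributed, so the data piles up in one direction (on a few tasks).
The data itself is skewed
join statements easily cause it
count(distinct col) very easily causes skew
group by can also cause it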
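Two common HQL patterns for the causes listed above are sketched here: finding the heaviest keys, and rewriting count(distinct col) as a group by followed by a count. The table visits and its column user_id are hypothetical.

```sql
-- Find the keys with the most rows; these are the skew candidates.
select user_id, count(*) as cnt
from visits
group by user_id
order by cnt desc
limit 20;

-- Skew-prone form: the distinct values end up on a single reducer.
-- select count(distinct user_id) from visits;

-- Rewrite: deduplicate with group by (spread over many reducers),
-- then count the deduplicated rows.
select count(*) as user_cnt
from (
  select user_id
  from visits
  group by user_id
) t;
```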

Find the keys that cause the skew, then avoid them through the HQL statement.
hive.map.aggr=true
hive.groupby.skewindata=false  (recommended to enable)
hive.optimize.skewjoin=false
  Whether to enable skew join optimization. 
  The algorithm is as follows: At runtime, detect the keys with a large skew. Instead of
  processing those keys, store them temporarily in an HDFS directory. In a follow-up map-reduce
  job, process those skewed keys. The same key need not be skewed for all the tables, and so,
  the follow-up map-reduce job (for the skewed keys) would be much faster, since it would be a
  map-join.
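The skew-related switches above could be set per session as in this sketch. hive.skewjoin.key is not listed above; it is included as the threshold that decides when a join key counts as skewed, and the value shown is the usual default.

```sql
-- Partial aggregation on the map side for group by queries.
set hive.map.aggr=true;
-- Spread skewed group by keys over two jobs: the first distributes rows
-- randomly and pre-aggregates, the second aggregates by the real key.
set hive.groupby.skewindata=true;
-- Handle heavily skewed join keys in a follow-up map-join job.
set hive.optimize.skewjoin=true;
-- A join key with more rows than this is treated as skewed.
set hive.skewjoin.key=100000;
```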

10. Indexes are one kind of Hive optimization:

11. Partitioning is itself a Hive optimization:
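As an illustration of why partitioning helps, the sketch below creates a partitioned table and filters on the partition column so that only one partition directory has to be read. The table logs and its columns are hypothetical.

```sql
-- Each distinct dt value is stored in its own directory.
create table if not exists logs (
  user_id bigint,
  url     string
)
partitioned by (dt string);

-- Only the dt='2019-09-01' partition is scanned (partition pruning).
select count(*)
from logs
where dt = '2019-09-01';
```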

12. Number of jobs:
In general one query produces one job. A job typically corresponds to an operation such as a subquery, a join, a group by, or a limit.
