Hive Tuning

Latest recommended article published on 2023-04-18 15:44:44

稳哥的哥

Article link: https://blog.csdn.net/shufangreal/article/details/103846688

General Hive optimizations

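1. When the data volume is small, run the map-reduce job in local mode on the workstation instead of submitting it to the cluster. Local mode starts only one reducer, so it is not suitable for large data volumes.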

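SET mapreduce.framework.name=local;
-- mapred.local.dir must point to a path that is valid on the local machine
-- (otherwise the user gets an exception when local disk space is allocated).
SET mapred.local.dir=/tmp/<username>/mapred/local;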

2. Starting with release 0.7, Hive also supports running map-reduce jobs in local mode automatically. The relevant options are `hive.exec.mode.local.auto`, `hive.exec.mode.local.auto.inputbytes.max`, and `hive.exec.mode.local.auto.tasks.max`.

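-- Default is false.
SET hive.exec.mode.local.auto=true;
-- With this option enabled, Hive runs a job locally only when all of the following hold:
--   the total input size of the job is below hive.exec.mode.local.auto.inputbytes.max (128 MB by default);
--   the total number of map tasks is below hive.exec.mode.local.auto.tasks.max (4 by default);
--   the total number of reduce tasks required is 1 or 0.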
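
As a minimal sketch of item 2, a session might enable automatic local mode and tighten the input-size threshold before running a small query. The table `small_dim` and the 64 MB threshold are hypothetical choices for illustration, not values from the original post.

```sql
-- Let Hive decide per query whether local mode is appropriate.
SET hive.exec.mode.local.auto=true;

-- Illustrative threshold: only inputs under 64 MB (67108864 bytes) qualify.
SET hive.exec.mode.local.auto.inputbytes.max=67108864;

-- small_dim is a hypothetical small table; a query like this needs at most
-- one reducer, so it can qualify for local mode if the input is small enough.
SELECT COUNT(*) FROM small_dim;
```

Larger queries in the same session are unaffected by these settings because they exceed the threshold and run on the cluster as usual.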

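3. Hive writes its log to /tmp/<user.name>/hive.log by default. To use a different directory, set hive.log.dir=<other_location> in conf/hive-log4j.properties. To send log output to the console instead, pass hive.root.logger on the command line: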

-  bin/hive --hiveconf hive.root.logger=INFO,console  //for HiveCLI (deprecated)
-  bin/hiveserver2 --hiveconf hive.root.logger=INFO,console

4. Hive on Spark: Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

5. Hive configuration properties reference: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

6. Manually set the number of reducers for each job

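-- Hive's default is -1, which lets Hive work out the number of reducers automatically.
SET mapred.reduce.tasks=5;

-- Amount of input handled per reducer, in bytes.
-- Before Hive 0.14.0 the default was 1 GB, so an input of 1 GB got 1 reducer.
-- From 0.14.0 on the default is 256 MB, so an input of 1 GB uses 4 reducers.
SET hive.exec.reducers.bytes.per.reducer=1000000000;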

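hive.exec.reducers.max
-- Maximum number of reducers a job may use.
-- When mapred.reduce.tasks is negative, Hive uses this value as the upper bound when it determines the number of reducers automatically. The default is 999 before Hive 0.14.0 and 1009 from 0.14.0 on.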
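
The three settings in item 6 interact, so a worked example may help. The sketch below assumes the commonly described estimate, min(hive.exec.reducers.max, ceil(input size / bytes per reducer)); the 10 GB input size is a made-up figure, and the actual count Hive picks can differ by version and execution engine.

```sql
-- Let Hive estimate the reducer count instead of fixing it.
SET mapred.reduce.tasks=-1;

-- 256 MB of input per reducer (the default from Hive 0.14.0 on).
SET hive.exec.reducers.bytes.per.reducer=256000000;

-- Upper bound on the estimate.
SET hive.exec.reducers.max=1009;

-- Rough estimate for a hypothetical 10 GB input:
--   10,000,000,000 / 256,000,000 = 39.06, rounded up to 40
--   min(1009, 40) = 40 reducers
```

Setting mapred.reduce.tasks to a positive number, as in the earlier line, bypasses this estimate entirely.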

hive.jar.path  -- default: empty. Location of hive_cli.jar, used when submitting jobs in a separate JVM.
hive.aux.jars.path  -- default: empty. Location of the plugin jars that contain UDF and SerDe implementations.

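hive.hadoop.supports.splittable.combineinputformat
-- Default: false.
-- Whether to combine small input files so that fewer mappers are spawned. This acts on the source (input) files of the job.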

11. Pre-aggregate on the map side

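hive.map.aggr  -- default: true
-- Whether to use map-side pre-aggregation in Hive GROUP BY queries.

hive.groupby.mapaggr.checkinterval  -- default: 100000 rows
-- Number of rows after which Hive checks the size of the grouping keys and aggregation classes held in the map-side hash table.

hive.new.job.grouping.set.cardinality  -- default: 30
-- Applies to GROUPING SETS, ROLLUP, and CUBE queries. When the grouping-set cardinality (the number of rows generated per input row) exceeds this value, Hive adds an extra map-reduce job, on the assumption that the original GROUP BY will reduce the data size first.

hive.map.aggr.hash.force.flush.memory.threshold  -- default: 0.9
-- Maximum memory, as a fraction, that the map-side group-by aggregation hash table may use. When usage exceeds this value, the data is forcibly flushed.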
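
To make item 11 concrete, here is a small sketch of a GROUP BY session that relies on map-side pre-aggregation. The table `page_views` and its columns are hypothetical, and the values shown are simply the defaults listed above written out explicitly.

```sql
-- Aggregate partially inside each mapper before the shuffle.
SET hive.map.aggr=true;

-- Check the hash table size every 100000 rows.
SET hive.groupby.mapaggr.checkinterval=100000;

-- Flush the hash table once it uses more than 90% of its memory budget.
SET hive.map.aggr.hash.force.flush.memory.threshold=0.9;

-- page_views is a hypothetical table; each mapper emits one partial count
-- per url it has seen rather than one row per input record.
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;
```

Pre-aggregation pays off most when the number of distinct grouping keys per mapper is small relative to the number of input rows.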

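12. Handling skewed data in GROUP BY

hive.groupby.skewindata  -- default: false
-- Whether the data is skewed, so that Hive optimizes GROUP BY queries accordingly.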
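
A minimal sketch of how this setting might be used, assuming a hypothetical `clicks` table in which a few `user_id` values account for most rows. The usual description of the behavior is that Hive plans two map-reduce jobs: the first distributes rows randomly across reducers for partial aggregation, and the second groups the partial results by the real key.

```sql
-- Tell Hive that the grouping key is skewed.
SET hive.groupby.skewindata=true;

-- clicks and user_id are hypothetical names. Without the setting, all rows
-- for a hot user_id would land on a single reducer.
SELECT user_id, COUNT(*) AS click_count
FROM clicks
GROUP BY user_id;
```

The extra job adds overhead, so the setting is generally worth enabling only for queries where one reducer is visibly lagging behind the others.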

13. Whether to use indexes automatically

hive.optimize.index.filter  -- default: false

14. Predicate pushdown is enabled by default

-- Default is true.
SET hive.optimize.ppd=true;
-- Also SET hive.optimize.index.filter=true to use file-format-specific indexes with predicate pushdown (PPD).
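
One way to see predicate pushdown at work is to inspect the plan with EXPLAIN. The sketch below assumes two hypothetical tables, `orders` and `customers`; with pushdown enabled, the filter on `order_date` would be expected to appear next to the scan of `orders` rather than after the join.

```sql
SET hive.optimize.ppd=true;
SET hive.optimize.index.filter=true;

-- orders and customers are hypothetical tables used only for illustration.
EXPLAIN
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date = '2020-01-01';
```

Comparing this plan with the one produced after SET hive.optimize.ppd=false shows where the filter operator moves.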

hive.join.cache.size  -- default: 25000
-- How many rows of the joined tables (except the streaming table) should be cached in memory.

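16. Specify the size of the small table

hive.smalltable.filesize  (old name)
    or
hive.mapjoin.smalltable.filesize  (new name)  -- default: 25000000 bytes (about 25 MB)
-- Threshold, in bytes, for the input file size of the small table. If the file size is below this threshold, Hive tries to convert the common join into a map join.

hive.mapjoin.bucket.cache.size  -- default: 100
-- How many values per key of the map-joined table should be cached in memory.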

19. Hive Tutorial (user guide): https://cwiki.apache.org/confluence/display/Hive/Tutorial

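20. Parallel job execution, typically useful for stages or jobs that do not depend on each other

hive.exec.parallel  -- default: false. Whether to execute jobs in parallel.
hive.exec.parallel.thread.number  -- default: 8. Maximum number of jobs that may run in parallel.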
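
As an illustration of item 20, the two branches of a UNION ALL do not depend on each other, so their stages are candidates for parallel execution. The tables `orders` and `refunds` are hypothetical.

```sql
-- Allow independent stages to run at the same time, up to 8 at once.
SET hive.exec.parallel=true;
SET hive.exec.parallel.thread.number=8;

-- Each branch of the UNION ALL reads a different hypothetical table,
-- so neither aggregation has to wait for the other.
SELECT u.source, u.cnt
FROM (
  SELECT 'orders' AS source, COUNT(*) AS cnt FROM orders
  UNION ALL
  SELECT 'refunds' AS source, COUNT(*) AS cnt FROM refunds
) u;
```

Parallel stages compete for the same cluster resources, so the gain depends on how much spare capacity the cluster has.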

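21. Merging small output files

hive.merge.mapfiles  -- default: true. Merge small files at the end of a map-only job.
hive.merge.smallfiles.avgsize  -- default: 16000000 bytes (about 16 MB)
-- When the average output file size of a job is below this value, Hive starts an additional map-reduce job to merge the output files into larger ones.
-- This is done for map-only jobs only if hive.merge.mapfiles is true,
-- and for map-reduce jobs only if hive.merge.mapredfiles is true.

hive.merge.size.per.task  -- default: 256000000 bytes (about 256 MB). Size of the merged files at the end of the job.
hive.merge.mapredfiles  -- default: false. Merge small files at the end of a map-reduce job.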
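
The merge settings in item 21 are usually set together. The following is a sketch with the default thresholds written out and merging turned on for both job types; the file counts in the comments are invented to show when the merge step would be triggered.

```sql
-- Merge after map-only jobs and after map-reduce jobs.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;

-- Trigger a merge when the average output file is smaller than about 16 MB.
SET hive.merge.smallfiles.avgsize=16000000;

-- Aim for merged files of about 256 MB.
SET hive.merge.size.per.task=256000000;

-- Hypothetical case: a job writes 200 files averaging 2 MB each.
--   2 MB is below the 16 MB threshold, so Hive adds a merge job.
-- Hypothetical case: a job writes 10 files averaging 100 MB each.
--   100 MB is above the threshold, so no merge job is added.
```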

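22. Convert common joins into map joins

-- Default is false in Hive 0.7.0 to 0.10.0 and true from 0.11.0 on.
SET hive.auto.convert.join=true;
-- Whether Hive enables the optimization that converts a common join into a map join based on the input file size.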
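
A short sketch of automatic map-join conversion, assuming a large hypothetical fact table `fact_orders` and a small hypothetical dimension table `dim_region` whose files are under the small-table threshold. Running EXPLAIN on the query is one way to check whether a map join operator appears in the plan.

```sql
-- Let Hive convert joins automatically when one side is small enough.
SET hive.auto.convert.join=true;

-- Small-table threshold in bytes (the default, about 25 MB).
SET hive.mapjoin.smalltable.filesize=25000000;

-- fact_orders and dim_region are hypothetical. If dim_region is below the
-- threshold, it can be loaded into memory and joined on the map side,
-- avoiding the shuffle of fact_orders.
EXPLAIN
SELECT f.order_id, d.region_name
FROM fact_orders f
JOIN dim_region d ON f.region_id = d.region_id;
```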

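23. Dynamic partitions (DML, DDL)

-- Default is false before Hive 0.9.0 and true from 0.9.0 on.
SET hive.exec.dynamic.partition=true;
-- Default is strict.
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Default is 1000.
SET hive.exec.max.dynamic.partitions=1000;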
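
For item 23, a typical use is loading a partitioned table from a staging table without naming each partition. This is a minimal sketch; `staging_events`, `events`, and their columns are hypothetical, and `events` is assumed to be partitioned by `dt`.

```sql
SET hive.exec.dynamic.partition=true;
-- nonstrict allows every partition column to be dynamic.
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The dynamic partition column (dt) must come last in the SELECT list;
-- Hive creates one partition per distinct dt value in the result.
INSERT OVERWRITE TABLE events PARTITION (dt)
SELECT user_id, event_type, dt
FROM staging_events;
```

If the query would create more partitions than hive.exec.max.dynamic.partitions allows, the job fails, so the limit may need raising for loads that span many dates.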

24. Maximum number of files a single map-reduce job may create

-- Default is 100000.
SET hive.exec.max.created.files=100000;
-- Maximum number of HDFS files that all mappers and reducers of one job may create.

https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.mapjoin.optimized.hashtable
