Hive Study Notes (4) -- A Summary of Hive Parameter Tuning

Hive configuration properties:
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
MapReduce configuration properties:
https://hadoop.apache.org/docs/r3.0.3/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

1. General Parameter Tuning

1.1 Enable Data Compression

  • Purpose: reduce storage usage and I/O
  • Compress Hive's final output and intermediate results
    • hive.exec.compress.output=true
  <property>
    <name>hive.exec.compress.output</name>
    <value>false</value>
    <description>
      This controls whether the final outputs of a query (to a local/HDFS file or a Hive table) is compressed. 
      The compression codec and other options are determined from Hadoop config variables mapred.output.compress*
    </description>
  </property>
  • hive.exec.compress.intermediate=true
  <property>
    <name>hive.exec.compress.intermediate</name>
    <value>false</value>
    <description>
      This controls whether intermediate files produced by Hive between multiple map-reduce jobs are compressed. 
      The compression codec and other options are determined from Hadoop config variables mapred.output.compress*
    </description>
  </property>
  • Set the file format for intermediate query results
    • hive.query.result.fileformat=SequenceFile
<property>
  <name>hive.query.result.fileformat</name>
  <value>SequenceFile</value>
  <description>File format to use for a query's intermediate results. Options are TextFile, SequenceFile, and RCfile. Default value is changed to SequenceFile since Hive 2.1.0</description>
</property>
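Putting the three settings above together, they can also be enabled for a single session with SET statements. The sketch below is illustrative; it assumes a Snappy codec is available on the cluster, and note that the codec choice comes from the Hadoop side rather than from Hive:

  -- Sketch: enable compression for one session (values are illustrative)
  SET hive.exec.compress.output=true;          -- compress final query output
  SET hive.exec.compress.intermediate=true;    -- compress data passed between MR jobs
  SET hive.query.result.fileformat=SequenceFile;
  -- Codec selection is a Hadoop setting; Snappy is assumed to be installed here
  SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;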

1.2 Job Execution Tuning

1. Run multiple jobs in parallel

  • hive.exec.parallel=true (default false)
  <property>
    <name>hive.exec.parallel</name>
    <value>false</value>
    <description>Whether to execute jobs in parallel</description>
  </property>
  • hive.exec.parallel.thread.number=8 (default 8)
  <property>
    <name>hive.exec.parallel.thread.number</name>
    <value>8</value>
    <description>How many jobs at most can be executed in parallel</description>
  </property>

Note: by default, Hive runs the jobs in a query's physical plan one after another. If some stages do not depend on each other, enabling parallel execution lets those jobs run concurrently and speeds up the query.
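A minimal sketch of turning this on for one session is shown below; src_a and src_b are hypothetical tables, and since the two UNION ALL branches have no dependency on each other, their jobs can be launched concurrently:

  SET hive.exec.parallel=true;
  SET hive.exec.parallel.thread.number=8;  -- run at most 8 independent jobs at once

  -- The two branches below are independent stages and can execute in parallel
  SELECT 'a' AS src, COUNT(*) AS cnt FROM src_a
  UNION ALL
  SELECT 'b' AS src, COUNT(*) AS cnt FROM src_b;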

2. Local execution mode

  • hive.exec.mode.local.auto=true
  <property>
    <name>hive.exec.mode.local.auto</name>
    <value>false</value>
    <description>Let Hive determine whether to run in local mode automatically</description>
  </property>
  • hive.exec.mode.local.auto.inputbytes.max (default 128MB)
  <property>
    <name>hive.exec.mode.local.auto.inputbytes.max</name>
    <value>134217728</value>
    <description>When hive.exec.mode.local.auto is true, input bytes should less than this for local mode.</description>
  </property>
  • hive.exec.mode.local.auto.input.files.max(default 4)
  <property>
    <name>hive.exec.mode.local.auto.input.files.max</name>
    <value>4</value>
    <description>When hive.exec.mode.local.auto is true, the number of tasks should less than this for local mode.</description>
  </property>

Note: for testing, or when the workload is small, local mode can be used. If any of the three conditions above is not met, the job is submitted to the cluster for execution instead.
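For example, a small test query can go through the automatic local-mode check with the settings below (a sketch; the thresholds shown are simply the defaults):

  SET hive.exec.mode.local.auto=true;
  SET hive.exec.mode.local.auto.inputbytes.max=134217728;  -- 128 MB
  SET hive.exec.mode.local.auto.input.files.max=4;
  -- If the query reads no more than 128 MB from no more than 4 files,
  -- Hive runs it locally; otherwise it is submitted to the cluster.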

1.3 Choose a Suitable Execution Engine

  • TEZ
  • Spark
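Switching engines is a single setting (a sketch; it assumes Tez or Spark has already been installed and configured for the cluster):

  SET hive.execution.engine=tez;   -- valid values: mr (default), tez, spark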

[1] An introduction to Apache Tez: https://www.cnblogs.com/rongfengliang/p/6991020.html
[2] Detailed configuration and test runs of Hive on Tez: https://blog.csdn.net/duguyiren3476/article/details/46349177

1.4 MapReduce Parameter Tuning

  1. Map phase optimization (see the consolidated sketch after this list)
  • Parameters that determine the input split size, and therefore num_map_tasks
    (the effective split size is max(minSize, min(maxSize, blockSize)))

    • mapreduce.input.fileinputformat.split.maxsize  default: Long.MAX_VALUE
    • mapreduce.input.fileinputformat.split.minsize  default: 0
    • dfs.block.size  default: 128 MB
  • Column pruning: hive.optimize.cp=true

  <property>
    <name>hive.optimize.cp</name>
    <value>true</value>
    <description>Whether to enable column pruner.</description>
  </property>
  • Map-side aggregation: hive.map.aggr=true
  <property>
    <name>hive.map.aggr</name>
    <value>true</value>
    <description>Whether to use map-side aggregation in Hive Group By queries</description>
  </property>
  • Map-side predicate pushdown: hive.optimize.ppd=true
  <property>
    <name>hive.optimize.ppd</name>
    <value>true</value>
    <description>Whether to enable predicate pushdown</description>
  </property>
  2. Reduce phase optimization
  • mapred.reduce.tasks can be set directly
  • Parameters that determine num_reduce_tasks when it is not set
    • hive.exec.reducers.max  default: 1009
  <property>
    <name>hive.exec.reducers.max</name>
    <value>1009</value>
    <description>
      max number of reducers will be used. If the one specified in the configuration parameter mapred.reduce.tasks is
      negative, Hive will use this one as the max number of reducers when automatically determine number of reducers.
    </description>
  </property>
  • hive.exec.reducers.bytes.per.reducer  default: 1 GB (256 MB since Hive 0.14.0)
<property>
  <name>hive.exec.reducers.bytes.per.reducer</name>
  <value>1000000000</value>
  <description>
    Size per reducer. The default in Hive 0.14.0 and earlier is 1 GB, that is, if the input size is 10 GB then 10 reducers will be used. In Hive 0.14.0 and later the default is 256 MB, that is, if the input size is 1 GB then 4 reducers will be used.
  </description>
</property>
  • How the reducer count is derived
    • numRTasks=min(maxReducers,input.size/perReducer)
      • maxReducers = ${hive.exec.reducers.max}
      • perReducer= ${hive.exec.reducers.bytes.per.reducer}
  3. Shuffle phase optimization
  • Purpose: compress intermediate map output to reduce disk I/O and the volume of data shuffled over the network.
  • Configuration
    • Set mapreduce.map.output.compress to true
    • mapreduce.map.output.compress.codec, for example:
      • org.apache.hadoop.io.compress.LzoCodec
      • org.apache.hadoop.io.compress.SnappyCodec
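As mentioned above, here is a consolidated sketch of these map/reduce/shuffle settings; the numbers are illustrative, not recommendations, and Snappy is assumed to be available:

  -- Map phase: control split size (and thus the number of map tasks) and map-side work
  SET mapreduce.input.fileinputformat.split.maxsize=268435456;  -- 256 MB, illustrative
  SET hive.optimize.cp=true;    -- column pruning
  SET hive.map.aggr=true;       -- map-side aggregation for GROUP BY
  SET hive.optimize.ppd=true;   -- predicate pushdown

  -- Reduce phase: either fix the reducer count, or let Hive derive it.
  -- e.g. with 100 GB of input and 256 MB per reducer:
  --   numRTasks = min(1009, 100 * 1024 / 256) = min(1009, 400) = 400
  SET hive.exec.reducers.bytes.per.reducer=268435456;   -- 256 MB, illustrative
  SET hive.exec.reducers.max=1009;
  -- SET mapred.reduce.tasks=200;  -- uncomment to override the formula entirely

  -- Shuffle phase: compress map output before it is shuffled
  SET mapreduce.map.output.compress=true;
  SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;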

2. Join Optimization

Hive queries rely heavily on joins, so optimizing them can noticeably improve query performance. A typical reduce-side join looks like the figure below:

(Figure: data flow of a typical reduce-side join)

When the tables involved are large, a reduce-side join has to ship large amounts of data to the reducers, which makes the shuffle phase very network-intensive. The following two map-side join techniques avoid this:

2.1 Map Join

When one table is very large and the other is small, the small table can be copied to every map task and loaded into memory there, so each split of the large table is joined on the map side. This avoids shuffling the large table across the network, cutting network overhead and speeding up the whole job. The relevant settings are:
1. hive.auto.convert.join=true (default false)

  <property>
    <name>hive.auto.convert.join</name>
    <value>true</value>
    <description>Whether Hive enables the optimization about converting common join into mapjoin based on the input file size</description>
  </property>

2. hive.mapjoin.smalltable.filesize=600M (default 25M)

  <property>
    <name>hive.mapjoin.smalltable.filesize</name>
    <value>25000000</value>
    <description>
      The threshold for the input file size of the small tables; if the file size is smaller 
      than this threshold, it will try to convert the common join into map join
    </description>
  </property>

3. Force a map join on table a with a query hint:
select /*+ MAPJOIN(a) */ … a join b
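Put together, a sketch of a map join against hypothetical tables big_tbl (large) and dim_small (a small dimension table) looks like this; the 600 MB threshold simply mirrors the value suggested above:

  SET hive.auto.convert.join=true;
  SET hive.mapjoin.smalltable.filesize=600000000;  -- ~600 MB, per the suggestion above

  -- Hive broadcasts dim_small to every map task and joins it there;
  -- the MAPJOIN hint shown above can also force this explicitly.
  SELECT /*+ MAPJOIN(d) */ b.uid, d.region, COUNT(*) AS cnt
  FROM big_tbl b
  JOIN dim_small d ON b.region_id = d.region_id
  GROUP BY b.uid, d.region;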

2.2 Bucket Map Join

  • set hive.optimize.bucketmapjoin=true
  <property>
    <name>hive.optimize.bucketmapjoin</name>
    <value>false</value>
    <description>Whether to try bucket mapjoin.</description>
  </property>
  • Works together with map join
  • All joined tables must be bucketed, and the number of buckets in the large table must be an integer multiple of the number in the small table
  • The bucketing column must be the join column
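A sketch of the required table layout and settings, using hypothetical tables big_tbl (32 buckets) and small_tbl (8 buckets, so the large table's bucket count is a multiple of the small table's), both bucketed on the join column uid:

  CREATE TABLE small_tbl (uid BIGINT, name STRING)
  CLUSTERED BY (uid) INTO 8 BUCKETS;

  CREATE TABLE big_tbl (uid BIGINT, event STRING)
  CLUSTERED BY (uid) INTO 32 BUCKETS;

  -- Older Hive versions also need bucketing enforced while loading the tables:
  -- SET hive.enforce.bucketing=true;

  SET hive.optimize.bucketmapjoin=true;

  -- Only the matching buckets of small_tbl are loaded into each map task
  SELECT /*+ MAPJOIN(s) */ b.uid, s.name
  FROM big_tbl b
  JOIN small_tbl s ON b.uid = s.uid;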