Hive Study Notes (4) -- A Summary of Hive Parameter Tuning

Hive configuration properties:
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
MapReduce configuration properties:
https://hadoop.apache.org/docs/r3.0.3/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

1. General Parameter Tuning

1.1 Enable Data Compression

  • Purpose: reduce storage usage and I/O
  • Compress Hive's final output and intermediate results
    • hive.exec.compress.output=true
  <property>
    <name>hive.exec.compress.output</name>
    <value>false</value>
    <description>
      This controls whether the final outputs of a query (to a local/HDFS file or a Hive table) is compressed. 
      The compression codec and other options are determined from Hadoop config variables mapred.output.compress*
    </description>
  </property>
  • hive.exec.compress.intermediate=true
  <property>
    <name>hive.exec.compress.intermediate</name>
    <value>false</value>
    <description>
      This controls whether intermediate files produced by Hive between multiple map-reduce jobs are compressed. 
      The compression codec and other options are determined from Hadoop config variables mapred.output.compress*
    </description>
  </property>
  • Set the file format for intermediate query results
    • hive.query.result.fileformat=SequenceFile
<property>
  <name>hive.query.result.fileformat</name>
  <value>SequenceFile</value>
  <description>File format to use for a query's intermediate results. Options are TextFile, SequenceFile, and RCfile. Default value is changed to SequenceFile since Hive 2.1.0</description>
</property>
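Putting the three settings above together, they can also be enabled for a single session with SET statements. The sketch below is illustrative; it assumes a Snappy codec is available on the cluster, and note that the codec choice comes from the Hadoop side rather than from Hive:

  -- Sketch: enable compression for one session (values are illustrative)
  SET hive.exec.compress.output=true;          -- compress final query output
  SET hive.exec.compress.intermediate=true;    -- compress data passed between MR jobs
  SET hive.query.result.fileformat=SequenceFile;
  -- Codec selection is a Hadoop setting; Snappy is assumed to be installed here
  SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;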

1.2 Job Execution Tuning

1. Run multiple jobs in parallel

  • hive.exec.parallel=true (default false)
  <property>
    <name>hive.exec.parallel</name>
    <value>false</value>
    <description>Whether to execute jobs in parallel</description>
  </property>
  • hive.exec.parallel.thread.number=8 (default 8)
  <property>
    <name>hive.exec.parallel.thread.number</name>
    <value>8</value>
    <description>How many jobs at most can be executed in parallel</description>
  </property>

Note: by default, Hive runs the jobs in a query's physical plan one after another. If some stages do not depend on each other, enabling parallel execution lets those jobs run concurrently and speeds up the query.
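A minimal sketch of turning this on for one session is shown below; src_a and src_b are hypothetical tables, and since the two UNION ALL branches have no dependency on each other, their jobs can be launched concurrently:

  SET hive.exec.parallel=true;
  SET hive.exec.parallel.thread.number=8;  -- run at most 8 independent jobs at once

  -- The two branches below are independent stages and can execute in parallel
  SELECT 'a' AS src, COUNT(*) AS cnt FROM src_a
  UNION ALL
  SELECT 'b' AS src, COUNT(*) AS cnt FROM src_b;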

2. Local execution mode

  • hive.exec.mode.local.auto=true
  <property>
    <name>hive.exec.mode.local.auto</name>
    <value>false</value>
    <description>Let Hive determine whether to run in local mode automatically</description>
  </property>
  • hive.exec.mode.local.auto.inputbytes.max (default 128MB)
  <property>
    <name>hive.exec.mode.local.auto.inputbytes.max</name>
    <value>134217728</value>
    <description>When hive.exec.mode.local.auto is true, input bytes should less than this for local mode.</description>
  </property>
  • hive.exec.mode.local.auto.input.files.max(default 4)
  <property>
    <name>hive.exec.mode.local.auto.input.files.max</name>
    <value>4</value>
    <description>When hive.exec.mode.local.auto is true, the number of tasks should less than this for local mode.</description>
  </property>

Note: for testing, or when the workload is small, local mode can be used. If any of the three conditions above is not met, the job is submitted to the cluster for execution instead.
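For example, a small test query can go through the automatic local-mode check with the settings below (a sketch; the thresholds shown are simply the defaults):

  SET hive.exec.mode.local.auto=true;
  SET hive.exec.mode.local.auto.inputbytes.max=134217728;  -- 128 MB
  SET hive.exec.mode.local.auto.input.files.max=4;
  -- If the query reads no more than 128 MB from no more than 4 files,
  -- Hive runs it locally; otherwise it is submitted to the cluster.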

1.3 Choose a Suitable Execution Engine

  • TEZ
  • Spark
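Switching engines is a single setting (a sketch; it assumes Tez or Spark has already been installed and configured for the cluster):

  SET hive.execution.engine=tez;   -- valid values: mr (default), tez, spark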

[1] An introduction to Apache Tez: https://www.cnblogs.com/rongfengliang/p/6991020.html
[2] Detailed configuration and test runs of Hive on Tez: https://blog.csdn.net/duguyiren3476/article/details/46349177

1.4 MapReduce Parameter Tuning

  1. Map phase optimization (see the consolidated sketch after this list)
  • Parameters that determine the input split size, and therefore num_map_tasks
    (the effective split size is max(minSize, min(maxSize, blockSize)))

    • mapreduce.input.fileinputformat.split.maxsize  default: Long.MAX_VALUE
    • mapreduce.input.fileinputformat.split.minsize  default: 0
    • dfs.block.size  default: 128 MB
  • Column pruning: hive.optimize.cp=true

  <property>
    <name>hive.optimize.cp</name>
    <value>true</value>
    <description>Whether to enable column pruner.</description>
  </property>
  • Map-side aggregation: hive.map.aggr=true
  <property>
    <name>hive.map.aggr</name>
    <value>true</value>
    <description>Whether to use map-side aggregation in Hive Group By queries</description>
  </property>
  • Map-side predicate pushdown: hive.optimize.ppd=true
  <property>
    <name>hive.optimize.ppd</name>
    <value>true</value>
    <description>Whether to enable predicate pushdown</description>
  </property>
  2. Reduce phase optimization
  • mapred.reduce.tasks can be set directly
  • Parameters that determine num_reduce_tasks when it is not set
    • hive.exec.reducers.max  default: 1009
  <property>
    <name>hive.exec.reducers.max</name>
    <value>1009</value>
    <description>
      max number of reducers will be used. If the one specified in the configuration parameter mapred.reduce.tasks is
      negative, Hive will use this one as the max number of reducers when automatically determine number of reducers.
    </description>
  </property>
  • hive.exec.reducers.bytes.per.reducer  default: 1 GB (256 MB since Hive 0.14.0)
<property>
  <name>hive.exec.reducers.bytes.per.reducer</name>
  <value>1000000000</value>
  <description>
    Size per reducer. The default in Hive 0.14.0 and earlier is 1 GB, that is, if the input size is 10 GB then 10 reducers will be used. In Hive 0.14.0 and later the default is 256 MB, that is, if the input size is 1 GB then 4 reducers will be used.
  </description>
</property>
  • How the reducer count is derived
    • numRTasks=min(maxReducers,input.size/perReducer)
      • maxReducers = ${hive.exec.reducers.max}
      • perReducer= ${hive.exec.reducers.bytes.per.reducer}
  3. Shuffle phase optimization
  • Purpose: compress intermediate map output to reduce disk I/O and the volume of data shuffled over the network.
  • Configuration
    • Set mapreduce.map.output.compress to true
    • mapreduce.map.output.compress.codec, for example:
      • org.apache.hadoop.io.compress.LzoCodec
      • org.apache.hadoop.io.compress.SnappyCodec
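As mentioned above, here is a consolidated sketch of these map/reduce/shuffle settings; the numbers are illustrative, not recommendations, and Snappy is assumed to be available:

  -- Map phase: control split size (and thus the number of map tasks) and map-side work
  SET mapreduce.input.fileinputformat.split.maxsize=268435456;  -- 256 MB, illustrative
  SET hive.optimize.cp=true;    -- column pruning
  SET hive.map.aggr=true;       -- map-side aggregation for GROUP BY
  SET hive.optimize.ppd=true;   -- predicate pushdown

  -- Reduce phase: either fix the reducer count, or let Hive derive it.
  -- e.g. with 100 GB of input and 256 MB per reducer:
  --   numRTasks = min(1009, 100 * 1024 / 256) = min(1009, 400) = 400
  SET hive.exec.reducers.bytes.per.reducer=268435456;   -- 256 MB, illustrative
  SET hive.exec.reducers.max=1009;
  -- SET mapred.reduce.tasks=200;  -- uncomment to override the formula entirely

  -- Shuffle phase: compress map output before it is shuffled
  SET mapreduce.map.output.compress=true;
  SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;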

2. Join Optimization

Hive queries rely heavily on joins, so optimizing them can noticeably improve query performance. A typical reduce-side join looks like the figure below:

(Figure: data flow of a typical reduce-side join)

When the tables involved are large, a reduce-side join has to ship large amounts of data to the reducers, which makes the shuffle phase very network-intensive. The following two map-side join techniques avoid this:

2.1 Map Join

When one table is very large and the other is small, the small table can be copied to every map task and loaded into memory there, so each split of the large table is joined on the map side. This avoids shuffling the large table across the network, cutting network overhead and speeding up the whole job. The relevant settings are:
1. hive.auto.convert.join=true (default false)

  <property>
    <name>hive.auto.convert.join</name>
    <value>true</value>
    <description>Whether Hive enables the optimization about converting common join into mapjoin based on the input file size</description>
  </property>

2. hive.mapjoin.smalltable.filesize=600M (default 25M)

  <property>
    <name>hive.mapjoin.smalltable.filesize</name>
    <value>25000000</value>
    <description>
      The threshold for the input file size of the small tables; if the file size is smaller 
      than this threshold, it will try to convert the common join into map join
    </description>
  </property>

3. Force a map join on table a with a query hint:
select /*+ MAPJOIN(a) */ … a join b
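Put together, a sketch of a map join against hypothetical tables big_tbl (large) and dim_small (a small dimension table) looks like this; the 600 MB threshold simply mirrors the value suggested above:

  SET hive.auto.convert.join=true;
  SET hive.mapjoin.smalltable.filesize=600000000;  -- ~600 MB, per the suggestion above

  -- Hive broadcasts dim_small to every map task and joins it there;
  -- the MAPJOIN hint shown above can also force this explicitly.
  SELECT /*+ MAPJOIN(d) */ b.uid, d.region, COUNT(*) AS cnt
  FROM big_tbl b
  JOIN dim_small d ON b.region_id = d.region_id
  GROUP BY b.uid, d.region;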

2.2 Bucket Map Join

  • set hive.optimize.bucketmapjoin=true
  <property>
    <name>hive.optimize.bucketmapjoin</name>
    <value>false</value>
    <description>Whether to try bucket mapjoin.</description>
  </property>
  • Works together with map join
  • All joined tables must be bucketed, and the number of buckets in the large table must be an integer multiple of the number in the small table
  • The bucketing column must be the join column
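A sketch of the required table layout and settings, using hypothetical tables big_tbl (32 buckets) and small_tbl (8 buckets, so the large table's bucket count is a multiple of the small table's), both bucketed on the join column uid:

  CREATE TABLE small_tbl (uid BIGINT, name STRING)
  CLUSTERED BY (uid) INTO 8 BUCKETS;

  CREATE TABLE big_tbl (uid BIGINT, event STRING)
  CLUSTERED BY (uid) INTO 32 BUCKETS;

  -- Older Hive versions also need bucketing enforced while loading the tables:
  -- SET hive.enforce.bucketing=true;

  SET hive.optimize.bucketmapjoin=true;

  -- Only the matching buckets of small_tbl are loaded into each map task
  SELECT /*+ MAPJOIN(s) */ b.uid, s.name
  FROM big_tbl b
  JOIN small_tbl s ON b.uid = s.uid;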