Common Hive optimization strategies

三井08

Tags: hadoop

Article link: https://blog.csdn.net/sinat_36572927/article/details/111067113

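This article introduces several Hive optimization strategies, including fetch task conversion, local mode, enabling map join, avoiding count(distinct), setting the number of map and reduce tasks appropriately, merging small files, parallel execution, JVM reuse, pre-aggregation, and data skew optimization, with the aim of improving Hive query efficiency.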

Several ways to optimize Hive:

1. Fetch task conversion

Some HQL statements can be executed without MapReduce, for example select * from table1; In this case Hive reads the files under the table's storage directory directly and outputs them.

It is enabled with the following configuration:

<property>
    <name>hive.fetch.task.conversion</name>
    <value>more</value>
    <description>
      Expects one of [none, minimal, more].
      Some select queries can be converted to single FETCH task minimizing latency.
      Currently the query should be single sourced not having any subquery and should not have any aggregations or distincts (which incurs RS), lateral views and joins.
      0. none : disable hive.fetch.task.conversion
      1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
      2. more  : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
    </description>
</property>

If the value is set to none, even select * from table goes through MapReduce, which is very slow.
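The same property can also be overridden for a single session instead of in hive-site.xml. The following is a minimal sketch, assuming a hypothetical table `table1` with columns `id` and `name` (these names are illustrative and not from the original configuration). It uses EXPLAIN to check how each query is planned. The exact plan output, and whether conversion actually happens, may depend on the Hive version and on other settings.

```sql
-- Session-level override of the value in hive-site.xml.
set hive.fetch.task.conversion=more;

-- table1, id and name are hypothetical names used only for illustration.
-- With "more", a simple projection with a filter and LIMIT is eligible
-- to run as a single fetch task.
explain select id, name from table1 where id > 100 limit 10;

-- A GROUP BY aggregation is not eligible for fetch conversion,
-- so it is still planned as a job on the execution engine.
explain select name, count(*) as cnt from table1 group by name;

-- With "none", conversion is disabled, so even select * is planned as a job.
set hive.fetch.task.conversion=none;
explain select * from table1;
```

Comparing the EXPLAIN output of the first and last statements is a quick way to confirm which mode a session is actually using.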

2. Local mode

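Most Hadoop jobs need the full scalability that Hadoop provides in order to process large data sets. Sometimes, however, the input to a Hive query is very small. In that case, the time spent launching the job for the query can be much longer than the actual execution time of the job. For most of these cases, Hive can use local mode to run all tasks on a single machine (without going through YARN). For small data sets, execution time can be shortened noticeably.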

Setting hive.exec.mode.local.auto to true lets Hive apply this optimization automatically when appropriate.

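-- Enable automatic local-mode MR
set hive.exec.mode.local.auto=true;

-- Maximum input size for local MR. A job runs in local mode only when its input is smaller than this value. The default is 134217728 bytes, i.e. 128 MB.
set hive.exec.mode.local.auto.inputbytes.max=50000000;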
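To see the effect of local mode, one option is to run the same small query with the setting on and off and compare the elapsed time. This is only a sketch under assumptions: `small_dim` and its `category` column are hypothetical names, and whether Hive really chooses local execution depends on the input size and the Hive version.

```sql
-- Let Hive decide automatically whether a job is small enough to run locally.
set hive.exec.mode.local.auto=true;
-- Input-size threshold of about 50 MB, the same value as in the settings above.
set hive.exec.mode.local.auto.inputbytes.max=50000000;

-- small_dim and category are hypothetical names used only for illustration.
-- The aggregation needs a job; with a small input it is a candidate for local execution.
select category, count(*) as cnt from small_dim group by category;

-- Disable the optimization and rerun the same query on the cluster for comparison.
set hive.exec.mode.local.auto=false;
select category, count(*) as cnt from small_dim group by category;
```

If the first run is noticeably faster, job launch overhead was dominating the query, which is the case local mode is intended for.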