Hive Enterprise Tuning (Local Mode / JVM Reuse / Strict Mode / Parallel Mode)

Latest recommended article published on 2022-11-13 22:02:34

梦里Coding

Category: Hive. Tags: hive, big data, hadoop

Article link: https://blog.csdn.net/weixin_43586713/article/details/120855986

Enterprise-level tuning explained in detail

Execution Plan (Explain)
Fetch Task Conversion
Local Mode
JVM Reuse
Strict Mode
Parallel Mode

Execution Plan (Explain)

EXPLAIN [EXTENDED | DEPENDENCY | AUTHORIZATION] query
A query that does not launch an MR job:

hive (default)> explain select * from emp;
OK
Explain
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: emp
          Select Operator
            expressions: empno (type: int), ename (type: string), job (type: string), mgr (type: int), hiredate (type: string), sal (type: double), comm (type: double), deptno (type: int)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
            ListSink

Time taken: 4.468 seconds, Fetched: 15 row(s)

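A query that launches a distributed job (the statement itself is not shown; the plan below aggregates avg(sal) by deptno over emp and was produced with the Tez engine rather than MR):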

OK
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
      DagName: root_20211020101254_ea47aaf7-89cf-444f-b3cb-37c114e3b0db:1
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: emp
                  Statistics: Num rows: 53 Data size: 646 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: deptno (type: int), sal (type: double)
                    outputColumnNames: deptno, sal
                    Statistics: Num rows: 53 Data size: 646 Basic stats: COMPLETE Column stats: NONE
                    Group By Operator
                      aggregations: avg(sal)
                      keys: deptno (type: int)
                      mode: hash
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 53 Data size: 646 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 53 Data size: 646 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col1 (type: struct<count:bigint,sum:double,input:double>)
        Reducer 2 
            Reduce Operator Tree:
              Group By Operator
                aggregations: avg(VALUE._col0)
                keys: KEY._col0 (type: int)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 26 Data size: 316 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 26 Data size: 316 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.49 seconds, Fetched: 54 row(s)
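
The statement that produced this plan is not shown above. Judging only from the operators in the plan (avg(sal), keyed by deptno, over table emp), it was presumably close to the following. The exact wording and the alias avg_sal are assumptions.

```sql
-- Assumed reconstruction: the original statement is not shown in the article.
-- The plan aggregates avg(sal) per deptno over table emp.
explain
select deptno, avg(sal) as avg_sal
from emp
group by deptno;
```

The Tez vertices in the plan indicate that hive.execution.engine was set to tez for that session. With the mr engine, the same statement would show a Map Reduce stage instead.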

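Fetch Task Conversion

Fetch task conversion means that Hive can answer certain queries without running a MapReduce job. For example, for SELECT * FROM employees; Hive can simply read the files under the storage directory of the employees table and print the result to the console.

In hive-default.xml.template, hive.fetch.task.conversion defaults to more; older Hive versions default to minimal. With the value more, full-table selects, column selects, and limit queries all run without MapReduce.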

<property>
	<name>hive.fetch.task.conversion</name>
	<value>more</value>
	<description>
	Expects one of [none, minimal, more].
	Some select queries can be converted to single FETCH task minimizing latency.
	Currently the query should be single sourced not having any subquery and should not have any aggregations or distincts (which incurs RS), lateral views and joins.
	1.	none : disable hive.fetch.task.conversion
	2.	minimal : SELECT STAR, FILTER on partition columns, LIMIT only
	3.	more : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
	</description>
</property>

(1) Set hive.fetch.task.conversion to none; every query below then runs a MapReduce job.

hive (default)> set hive.fetch.task.conversion=none; 
hive (default)> select * from emp;
hive (default)> select ename from emp;
hive (default)> select ename from emp limit 3;

(2) Set hive.fetch.task.conversion to more; none of the queries below runs a MapReduce job.

hive (default)> set hive.fetch.task.conversion=more; 
hive (default)> select * from emp;
hive (default)> select ename from emp;
hive (default)> select ename from emp limit 3;
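
The intermediate level, minimal, can be probed in the same way. The sketch below relies only on the property description quoted earlier (minimal covers SELECT *, filters on partition columns, and LIMIT) and on the emp table from this article. The outcomes in the comments are what that description suggests, not captured output. EXPLAIN shows a Fetch Operator as the only stage when the conversion applies.

```sql
set hive.fetch.task.conversion=minimal;
-- SELECT * with LIMIT: should be converted to a single fetch stage.
explain select * from emp limit 3;
-- Column projection is not covered by minimal: should plan a job stage.
explain select ename from emp;

set hive.fetch.task.conversion=more;
-- Projection, a filter on a regular column, and LIMIT: should be a single fetch stage.
explain select ename from emp where sal > 1000 limit 3;
```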

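Local Mode

Most Hadoop jobs need the full scalability that Hadoop provides in order to process large data sets. Sometimes, however, the input to a Hive query is very small. In that case, the overhead of launching the job can be far greater than the actual execution time. For most of these cases, Hive can run all tasks on a single machine in local mode, which noticeably shortens execution time for small data sets.
Setting hive.exec.mode.local.auto to true lets Hive apply this optimization automatically when appropriate.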

-- enable local MR
set hive.exec.mode.local.auto=true;
-- maximum input size for local MR; local MR is used when the input is smaller than this value (default 134217728, i.e. 128 MB)
set hive.exec.mode.local.auto.inputbytes.max=50000000;
-- maximum number of input files for local MR; local MR is used when the number of input files is smaller than this value (default 4)
set hive.exec.mode.local.auto.input.files.max=10;

Hands-on example:
(1) With local mode disabled (the default), run the query:

hive (default)> select count(*) from emp group by deptno;

(2) Enable local mode and run the query again:

hive (default)> set hive.exec.mode.local.auto=true;
hive (default)> select count(*) from emp group by deptno;

In a local test, the query took 77 s without local mode and only 19 s with local mode enabled, which is a large difference. When the data volume is small, local mode is much faster.

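Note: there are two strategies for dealing with a large number of small files:
1. Use CombineTextInputFormat on the input side. It logically packs multiple small files into one split, so that a single MapTask can process them.
2. Enable local mode and run the MR job on a single machine, which can noticeably shorten execution time.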
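
A sketch of the two strategies as session settings. Two assumptions to note: instead of the CombineTextInputFormat named above, it uses Hive's own CombineHiveInputFormat through hive.input.format, and the 256 MB split size is an arbitrary illustrative value.

```sql
-- Strategy 1: combine small files into fewer splits.
-- Assumption: CombineHiveInputFormat (Hive's combining input format) is used
-- here in place of the CombineTextInputFormat mentioned in the text.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- Upper bound for a combined split, in bytes (256 MB, an arbitrary example value).
set mapreduce.input.fileinputformat.split.maxsize=268435456;

-- Strategy 2: let Hive run sufficiently small jobs locally.
set hive.exec.mode.local.auto=true;

select deptno, count(*) from emp group by deptno;
```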

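JVM Reuse

JVM reuse only applies to the MR engine. In an MR job, every MapTask and ReduceTask runs in its own JVM process, and each one goes through the cycle of requesting resources -> running the task -> releasing resources. Note that the resources held by a MapTask/ReduceTask must be released when it finishes, and they cannot be handed over to other tasks of the same job.

Enabling JVM reuse therefore relieves, to some extent, the overhead MapReduce incurs by acquiring resources for every task and releasing them as soon as the task ends.

JVM reuse does not mean that several tasks run in parallel inside one JVM process. It sets the maximum number of tasks of the same job that one JVM may execute sequentially. This is controlled by mapred.job.reuse.jvm.num.tasks (the older name of mapreduce.job.jvm.numtasks, which is used in the snippet below), whose default is 1. Typical values are between 10 and 20; the right number has to be determined by testing against the actual workload.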

<property>
  <name>mapreduce.job.jvm.numtasks</name>
  <value>10</value>
  <description>How many tasks to run per jvm. If set to -1, there is
  no limit. 
  </description>
</property>

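The drawback of this feature is that with JVM reuse enabled, the task slots in use stay occupied for reuse and are released only when the job completes. If an "unbalanced" job has a few reduce tasks that run much longer than the others, the reserved slots sit idle, unavailable to other jobs, until all tasks have finished.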
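
For a single session, the same property can also be set from the Hive prompt instead of the XML file. A minimal sketch, assuming the MR engine is used. The value 10 is taken from the XML snippet above, and whether the setting is honored may depend on the Hadoop/MapReduce version, so it is worth verifying on the target cluster.

```sql
-- JVM reuse only applies to the MR engine, so select it explicitly.
set hive.execution.engine=mr;
-- Let one JVM run up to 10 tasks of the same job sequentially.
set mapreduce.job.jvm.numtasks=10;
-- Print the effective value to verify the setting.
set mapreduce.job.jvm.numtasks;
```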

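With the Spark engine, tasks run as threads rather than as separate processes: a JVM is started only once, when an Executor launches, and tasks are executed by reusing threads inside it. By comparison, each JVM start can take several seconds or even more than ten seconds.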

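Strict Mode

Strict mode prevents users from running SQL that could have a large unintended impact. It is enabled with set hive.mapred.mode = strict (the default is strict in Hive 2.x; older releases default to nonstrict).

1. A query on a partitioned table must filter on the partition column in its WHERE clause; scanning all partitions is not allowed.

2. A query that uses ORDER BY must also use LIMIT. To perform the sort, ORDER BY sends all result rows to a single Reducer; forcing the user to add LIMIT keeps that Reducer from running for a very long time.

3. Cartesian-product queries are restricted. Users who know relational databases well may expect to write a JOIN without an ON clause and use WHERE instead, relying on the optimizer to turn the WHERE condition into an ON condition efficiently. Unfortunately, Hive does not perform this optimization, so if the tables are large enough the query can get out of control.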
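
The three restrictions can be illustrated as follows. The tables emp_part (assumed to be partitioned by a string column dt) and dept (assumed to have a dname column) are hypothetical and only serve the example; emp is the table used earlier in this article.

```sql
set hive.mapred.mode=strict;

-- 1. Partitioned table: a partition filter is required.
-- Rejected in strict mode: select * from emp_part;
select * from emp_part where dt = '2021-10-20';

-- 2. ORDER BY must be paired with LIMIT.
-- Rejected in strict mode: select ename, sal from emp order by sal desc;
select ename, sal from emp order by sal desc limit 10;

-- 3. No Cartesian products: state the join condition in ON.
-- Rejected in strict mode: select e.ename, d.dname from emp e join dept d;
select e.ename, d.dname
from emp e
join dept d on e.deptno = d.deptno;
```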

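Parallel Mode

Depending on the SQL statement, execution may be split into several stages, and not all of them depend on one another. By default, Hive executes these stages one at a time. Parallel execution lets the cluster run stages that are independent of each other at the same time.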

-- enable parallel execution of tasks
set hive.exec.parallel=true;
-- maximum number of threads for running parallel tasks of one SQL statement
set hive.exec.parallel.thread.number=8;

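In the SQL below, the two queries on either side of UNION ALL are not directly related, so there is no need to run them one after the other; the optimization is to run them in parallel.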

select t.id, t.name
from
(
select id, name from a
union all
select id, name from b
) t;
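
A concrete variant using the emp table from this article; the column aliases and the metric labels are made up for illustration. The two aggregations do not depend on each other, so with hive.exec.parallel enabled their stages may run concurrently on the MR engine. EXPLAIN lists the stage dependencies, which makes it possible to check whether the stages are in fact independent.

```sql
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;

-- Two independent aggregations over emp, combined with UNION ALL.
explain
select t.metric, t.deptno, t.val
from (
  select 'avg_sal' as metric, deptno, avg(sal) as val from emp group by deptno
  union all
  select 'max_sal' as metric, deptno, max(sal) as val from emp group by deptno
) t;
```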
