Hive基础知识(二一):优化篇-执行计划与Fetch抓取

1. 执行计划(Explain)

1)基本语法

EXPLAIN [EXTENDED | DEPENDENCY | AUTHORIZATION] query

2)案例实操

(1)查看下面这条语句的执行计划

没有生成 MR 任务的

hive (default)> explain select * from emp;
OK
Explain
STAGE DEPENDENCIES:
  Stage-0 is a root stage
​
STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: emp
          Statistics: Num rows: 1 Data size: 6460 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: empno (type: int), ename (type: string), job (type: string), mgr (type: int), hiredate (type: string), sal (type: double), comm (type: double), deptno (type: int)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
            Statistics: Num rows: 1 Data size: 6460 Basic stats: COMPLETE Column stats: NONE
            ListSink
​
Time taken: 14.564 seconds, Fetched: 17 row(s)

有生成 MR 任务的

hive (default)> explain select deptno, avg(sal) avg_sal from emp group by deptno;
OK
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1  #stage-1依赖于stage-0阶段
​
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree: #Map操作
          TableScan
            alias: emp #扫描的表
            Statistics: Num rows: 1 Data size: 6460 Basic stats: COMPLETE Column stats: NONE
            Select Operator #查询的操作
              expressions: sal (type: double), deptno (type: int)
              outputColumnNames: sal, deptno
              Statistics: Num rows: 1 Data size: 6460 Basic stats: COMPLETE Column stats: NONE
              Group By Operator  #读取的字段
                aggregations: sum(sal), count(sal)
                keys: deptno (type: int)
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 6460 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 1 Data size: 6460 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col1 (type: double), _col2 (type: bigint)
      Execution mode: vectorized
      Reduce Operator Tree: #reduce操作
        Group By Operator
          aggregations: sum(VALUE._col0), count(VALUE._col1)
          keys: KEY._col0 (type: int)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 1 Data size: 6460 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: int), (_col1 / _col2) (type: double)
            outputColumnNames: _col0, _col1
            Statistics: Num rows: 1 Data size: 6460 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 1 Data size: 6460 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
​
  Stage: Stage-0 #第二个阶段
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
​
Time taken: 2.475 seconds, Fetched: 53 row(s)

(2)查看详细执行计划(多加一个extends)

hive (default)> explain extended select * from emp;
hive (default)> explain extended select deptno, avg(sal) avg_sal from emp group by deptno;

2. Fetch 抓取

Fetch 抓取是指,Hive 中对某些情况的查询可以不必使用 MapReduce 计算。例如:SELECT * FROM employees;在这种情况下,Hive 可以简单地读取 employee 对应的存储目录下的文件, 然后输出查询结果到控制台。

在 hive-default.xml.template 文件中 hive.fetch.task.conversion 默认是 more,老版本 hive 默认是 minimal,该属性修改为 more 以后,在全局查找、字段查找、limit 查找等都不走 mapreduce

<property>
 <name>hive.fetch.task.conversion</name>
 <value>more</value>
 <description>
 Expects one of [none, minimal, more].Some select queries can be converted to single FETCH task minimizing 
latency. Currently the query should be single sourced not having any subquery and should not have any aggregations or distincts (which incurs RS), lateral views and joins. 0. none : disable hive.fetch.task.conversion 1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only 2. more : SELECT, FILTER, LIMIT only (support TABLESAMPLE and 
virtual columns --虚拟字段)
 </description>
</property>

1)案例实操:

(1)把 hive.fetch.task.conversion 设置成 none,然后执行查询语句,都会执行 mapreduce 程序。

hive (default)> set hive.fetch.task.conversion=none;
hive (default)> select * from emp;
hive (default)> select ename from emp;
hive (default)> select ename from emp limit 3;

(2)把 hive.fetch.task.conversion 设置成 more,然后执行查询语句,如下查询方式都不 会执行 mapreduce 程序。

hive (default)> set hive.fetch.task.conversion=more; 
hive (default)> select * from emp; 
hive (default)> select ename from emp; 
hive (default)> select ename from emp limit 3;

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值