Hive Tuning Series: Hive Execution Plans

This article introduces the basic usage of Hive execution plans.

Contents

I. Basic Syntax

II. Basic Usage of explain

III. Basic Usage of explain dependency


I. Basic Syntax

explain
Shows the basic information of the execution plan.

explain dependency
Shows the various properties of the query's inputs.

explain authorization
Shows the permission information related to the SQL operation.

explain vectorization
Shows the vectorization description of the query; supported since Hive 2.3.0.

explain cbo
Outputs the plan generated by the Calcite optimizer; CBO support starts with Hive 4.0.0.

explain ast
Outputs the abstract syntax tree of the query. The AST option was removed in Hive 2.1.0 because of a bug: dumping the AST could cause an OOM error, which is expected to be fixed in 4.0.0.

explain extended
Shows extended information about the execution plan.
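
All of these variants are written directly in front of the query to be analyzed. The sketch below shows how each one is invoked; it assumes the test_user table used in the examples that follow exists, and that your Hive version supports the variant in question (see the version notes above).

explain select * from test_user;
explain dependency select * from test_user;
explain authorization select * from test_user;
explain extended select * from test_user;
-- Hive 2.3.0 and later only
explain vectorization select * from test_user;
-- Hive 4.0.0 and later only
explain cbo select * from test_user;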

II. Basic Usage of explain

1. explain example 1: ordinary functions (or none)

A select-from-where style query. The SQL is as follows:


explain
select 
nvl(region,'unknown') region,
age*2 age,
concat('prefix_',name) name
from test_user where age >=20;

The output is as follows:

# stage dependencies
STAGE DEPENDENCIES:
  Stage-0 is a root stage

# detailed stage information
STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        # table scan
        TableScan
          # table name
          alias: test_user
          # statistics for this stage: row count and data size
          Statistics: Num rows: 4 Data size: 1105 Basic stats: COMPLETE Column stats: NONE
          # filtering performed at this stage
          Filter Operator
            # filter predicate
            predicate: (age >= 20) (type: boolean)
            # statistics after filtering
            Statistics: Num rows: 1 Data size: 276 Basic stats: COMPLETE Column stats: NONE
            # column projection
            Select Operator
              expressions: NVL(region,'unknown') (type: string), (age * 2) (type: int), concat('prefix_', name) (type: string)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 1 Data size: 276 Basic stats: COMPLETE Column stats: NONE
              ListSink

The execution plan has several layers and is read from the outside in.
The outermost level consists of two parts:
STAGE DEPENDENCIES: describes the dependencies between the stages
STAGE PLANS: detailed information for each stage
Most of the useful information sits inside STAGE PLANS, which in the later examples is itself split into two parts:
Map Operator Tree: the execution plan tree of the map phase
Reduce Operator Tree: the execution plan tree of the reduce phase
The remaining keywords are explained in the comments inside the plans.
This example is the simplest select-from-where query. Its plan contains only a single fetch stage and no MapReduce job, so as long as the amount of data involved stays small the query can be executed locally as a fetch task, as sketched below.
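
Whether such a simple query actually runs as a local fetch task is controlled by hive.fetch.task.conversion (a standard Hive setting with the values none / minimal / more). The following is a minimal sketch to experiment with; the expected plan change is something to verify in your own environment.

-- check the current value ("more" is the default in recent Hive versions)
set hive.fetch.task.conversion;
-- force a MapReduce job even for simple queries; re-running the explain
-- should then show a Map Reduce stage instead of a single Fetch Operator
set hive.fetch.task.conversion=none;
explain
select nvl(region,'unknown') region, age*2 age, concat('prefix_',name) name
from test_user where age >=20;
-- restore local fetch execution for simple select/filter/limit queries
set hive.fetch.task.conversion=more;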

2. explain example 2: aggregate functions

A select-agg_func-from-where-groupby style query. The SQL is as follows:

explain select age,count(*) from test_user where age>20 group by age;

The output is as follows:

# describes the dependencies between stages
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

# detailed stage information
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      # execution plan tree of the map phase
      Map Operator Tree:
          # table scan on alias: test_user
          TableScan
            alias: test_user
            # statistics for this stage: row count and data size
            Statistics: Num rows: 4 Data size: 1105 Basic stats: COMPLETE Column stats: NONE
            # filter the rows produced by the TableScan
            Filter Operator
              # filter condition (filter predicate)
              # implicit filters added by Hive also show up here, e.g. the null filter on join keys
              predicate: (age > 20) (type: boolean)
              # statistics after filtering
              Statistics: Num rows: 1 Data size: 276 Basic stats: COMPLETE Column stats: NONE
              # map-side aggregation is enabled, so rows are pre-aggregated on the map side
              # the setting can be checked with:
              # set hive.map.aggr;
              # group by on the filtered rows
              Group By Operator
                # aggregation function
                aggregations: count()
                # group-by key (grouping column)
                keys: age (type: int)
                # aggregation mode
                # hash: hash-based aggregation on the map side
                # partial: partial aggregation
                # final: final aggregation
                mode: hash
                outputColumnNames: _col0, _col1
                # statistics after aggregation
                Statistics: Num rows: 1 Data size: 276 Basic stats: COMPLETE Column stats: NONE
                # the map-phase output, which is itself an operator
                Reduce Output Operator
                  # key of the key/value pairs emitted by the map phase
                  key expressions: _col0 (type: int)
                  # ascending sort order
                  sort order: +
                  # column used to partition the map output
                  Map-reduce partition columns: _col0 (type: int)
                  # statistics of the output
                  Statistics: Num rows: 1 Data size: 276 Basic stats: COMPLETE Column stats: NONE
                  # value of the key/value pairs emitted by the map phase
                  value expressions: _col1 (type: bigint)
      # execution plan tree of the reduce phase
      Reduce Operator Tree:
        # group-by operation
        Group By Operator
          # aggregation function
          aggregations: count(VALUE._col0)
          # grouping column
          keys: KEY._col0 (type: int)
          # aggregation mode mergepartial: merge the partial (map-side) aggregates
          # complete: all aggregation happens on the reduce side
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 1 Data size: 276 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            # whether the output is compressed
            compressed: false
            Statistics: Num rows: 1 Data size: 276 Basic stats: COMPLETE Column stats: NONE
            # information about the output table/format
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      # -1 means the number of rows is not limited
      limit: -1
      Processor Tree:
        ListSink
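
The hash-mode Group By Operator on the map side only appears because map-side aggregation is enabled (hive.map.aggr, true by default). The sketch below toggles it; the expected plan change is a hedged guess to verify yourself.

-- check whether map-side aggregation is enabled
set hive.map.aggr;
-- disable it and inspect the plan again; the map side should then contain no
-- Group By Operator, and the reduce-side Group By Operator should switch
-- from mode: mergepartial to mode: complete
set hive.map.aggr=false;
explain select age,count(*) from test_user where age>20 group by age;
-- restore the default
set hive.map.aggr=true;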

3. explain example 3: window functions

The SQL is as follows:


explain
select id,row_number() over(partition by region order by age) rown
from test_user;

The output is as follows:


# stage dependencies
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

# detailed stage information
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      # map-phase operator tree
      Map Operator Tree:
          # table scan
          TableScan
            # table name
            alias: test_user
            Statistics: Num rows: 4 Data size: 1105 Basic stats: COMPLETE Column stats: NONE
            # output to the reduce phase
            Reduce Output Operator
              # key of the key/value pairs sent to the reduce phase
              # note that both region and age from the over() clause are included
              key expressions: region (type: string), age (type: int)
              # two plus signs: both columns are sorted ascending
              sort order: ++
              # only the region column is used to partition the map output
              Map-reduce partition columns: region (type: string)
              Statistics: Num rows: 4 Data size: 1105 Basic stats: COMPLETE Column stats: NONE
              # value of the key/value pairs sent to the reduce phase
              # the id column is simply carried along as the value
              value expressions: id (type: int)
      # vectorized execution
      Execution mode: vectorized
      # reduce-phase operator tree
      Reduce Operator Tree:
        # column projection
        Select Operator
          # VALUE._col0 is the value emitted by the map phase
          # KEY.reducesinkkey0 is the first map output key, i.e. region
          # KEY.reducesinkkey1 is the second map output key, i.e. age
          expressions: VALUE._col0 (type: int), KEY.reducesinkkey0 (type: string), KEY.reducesinkkey1 (type: int)
          outputColumnNames: _col0, _col1, _col3
          Statistics: Num rows: 4 Data size: 1105 Basic stats: COMPLETE Column stats: NONE
          # Partitioned Table Function (PTF), used here for the window function
          PTF Operator
            Function definitions:
                Input definition
                  input alias: ptf_0
                  output shape: _col0: int, _col1: string, _col3: int
                  type: WINDOWING
                Windowing table definition
                  input alias: ptf_1
                  name: windowingtablefunction
                  # ordering column
                  order by: _col3 ASC NULLS FIRST
                  # window partition column
                  partition by: _col1
                  raw input shape:
                  window functions:
                      window function definition
                        alias: row_number_window_0
                        # window function name
                        name: row_number
                        # the Java implementation class of row_number
                        window function: GenericUDAFRowNumberEvaluator
                        # window frame of the window function
                        window frame: PRECEDING(MAX)~FOLLOWING(MAX)
                        isPivotResult: true
            Statistics: Num rows: 4 Data size: 1105 Basic stats: COMPLETE Column stats: NONE
            # column projection after the window function
            Select Operator
              expressions: _col0 (type: int), row_number_window_0 (type: int)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 4 Data size: 1105 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 4 Data size: 1105 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
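
The line Execution mode: vectorized indicates that the map phase of this stage runs with vectorized execution. The settings below are the standard switches for it, and the explain vectorization variant from section I can show per-operator details; note that explain vectorization requires Hive 2.3.0 or later, so it may not be available on the version used in this article.

-- vectorized execution switches
set hive.vectorized.execution.enabled;
set hive.vectorized.execution.reduce.enabled;
-- Hive 2.3.0 and later only
explain vectorization detail
select id,row_number() over(partition by region order by age) rown
from test_user;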

4. explain example 4: join (common join, not map join)

The SQL is as follows:

set hive.auto.convert.join=false;

explain
select a.id from test_partition1 a
inner join test_partition2 b 
on a.id=b.id;

The output is as follows:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      # map-phase plan
      Map Operator Tree:
          # first table scan, i.e. the left table
          TableScan
            alias: a
            Statistics: Num rows: 5 Data size: 485 Basic stats: COMPLETE Column stats: NONE
            # row filtering
            Filter Operator
              # implicit filter condition
              predicate: id is not null (type: boolean)
              Statistics: Num rows: 5 Data size: 485 Basic stats: COMPLETE Column stats: NONE
              # column projection
              Select Operator
                expressions: id (type: int)
                outputColumnNames: _col0
                Statistics: Num rows: 5 Data size: 485 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 5 Data size: 485 Basic stats: COMPLETE Column stats: NONE
          # second table scan, i.e. the right table
          TableScan
            alias: b
            Statistics: Num rows: 5 Data size: 40 Basic stats: COMPLETE Column stats: NONE
            # row filtering
            Filter Operator
              # implicit filter condition
              predicate: id is not null (type: boolean)
              Statistics: Num rows: 5 Data size: 40 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: id (type: int)
                outputColumnNames: _col0
                Statistics: Num rows: 5 Data size: 40 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 5 Data size: 40 Basic stats: COMPLETE Column stats: NONE
      # reduce-phase plan
      Reduce Operator Tree:
        # join operation
        # in this example the join happens in the reduce phase
        Join Operator
          condition map:
               # inner join
               Inner Join 0 to 1
          keys:
            0 _col0 (type: int)
            1 _col0 (type: int)
          outputColumnNames: _col0
          Statistics: Num rows: 5 Data size: 533 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 5 Data size: 533 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
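
The predicate: id is not null filters in both table scans were added by Hive itself: rows with a NULL join key can never match in an inner join, so they are dropped as early as possible. As a quick cross-check, writing the filter explicitly should produce essentially the same Filter Operator in the plan; this is an expectation to verify on your own tables rather than a guaranteed outcome.

explain
select a.id from test_partition1 a
inner join test_partition2 b
on a.id=b.id
where a.id is not null and b.id is not null;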

5. explain example 5: join (map join)

The SQL is as follows:

set hive.auto.convert.join=true;

explain
select a.id from test_partition1 a
inner join test_partition2 b 
on a.id=b.id;

The output is as follows:


STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    # local work
    Map Reduce Local Work
      Alias -> Map Local Tables:
        $hdt$_1:b
          # fetch the data
          Fetch Operator
            limit: -1
      # local map operator tree
      Alias -> Map Local Operator Tree:
        $hdt$_1:b
          # table scan
          TableScan
            alias: b
            # statistics
            Statistics: Num rows: 5 Data size: 40 Basic stats: COMPLETE Column stats: NONE
            # implicit filter condition
            Filter Operator
              predicate: id is not null (type: boolean)
              Statistics: Num rows: 5 Data size: 40 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: id (type: int)
                outputColumnNames: _col0
                Statistics: Num rows: 5 Data size: 40 Basic stats: COMPLETE Column stats: NONE
                # the projected rows are loaded into a hash table (the small-table side of the map join)
                HashTable Sink Operator
                  keys:
                    0 _col0 (type: int)
                    1 _col0 (type: int)

  Stage: Stage-3
    Map Reduce
      # this stage has a map phase only
      # with the map join the data is fully processed on the map side
      # no shuffle or reduce phase is needed
      Map Operator Tree:
          TableScan
            alias: a
            Statistics: Num rows: 5 Data size: 485 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: id is not null (type: boolean)
              Statistics: Num rows: 5 Data size: 485 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: id (type: int)
                outputColumnNames: _col0
                Statistics: Num rows: 5 Data size: 485 Basic stats: COMPLETE Column stats: NONE
                # map join operation
                Map Join Operator
                  # inner join
                  condition map:
                       Inner Join 0 to 1
                  keys:
                    0 _col0 (type: int)
                    1 _col0 (type: int)
                  outputColumnNames: _col0
                  Statistics: Num rows: 5 Data size: 533 Basic stats: COMPLETE Column stats: NONE
                  # data output
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 5 Data size: 533 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
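
Whether Hive actually converts a join into a map join depends not only on hive.auto.convert.join but also on how large it estimates the small table to be. The settings below are standard Hive configuration options worth checking when a map join does or does not appear in the plan; treat the default values in the comments as approximate.

-- automatic map-join conversion (set to true in this example)
set hive.auto.convert.join;
-- size threshold for the small table (roughly 25 MB by default)
set hive.mapjoin.smalltable.filesize;
-- merge several map joins into one map-only job, and its combined size limit
set hive.auto.convert.join.noconditionaltask;
set hive.auto.convert.join.noconditionaltask.size;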

III. Basic Usage of explain dependency

explain dependency is mainly used to analyze where the data of a query comes from. With two partitioned tables and some test data prepared, the example SQL is as follows: two join queries whose join conditions include non-equality predicates on the partition column.


explain dependency
select a.id from test_partition1 a
inner join test_partition2 b 
on a.id=b.id and a.part=b.part and a.part>=1 and a.part <=2;

explain dependency
select a.id from test_partition1 a
inner join test_partition2 b 
on a.id=b.id and a.part=b.part 
where a.part>=1 and a.part <=2;

The output is shown below and is identical for both queries. Other articles I have seen claim that the two statements above produce different results, but in my own tests (on Hive 2.1.1) the results are the same.


{
  "input_tables": [
    {
      "tablename": "test_mall@test_partition1",
      "tabletype": "MANAGED_TABLE"
    },
    {
      "tablename": "test_mall@test_partition2",
      "tabletype": "MANAGED_TABLE"
    }
  ],
  "input_partitions": [
    {
      "partitionName": "test_mall@test_partition1@part=1"
    },
    {
      "partitionName": "test_mall@test_partition1@part=2"
    },
    {
      "partitionName": "test_mall@test_partition2@part=1"
    },
    {
      "partitionName": "test_mall@test_partition2@part=2"
    }
  ]
}
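
Because the output lists every partition the query would read, explain dependency is also a convenient way to confirm partition pruning before running an expensive query. The sketch below runs it against the same table; the expectation, to be verified on your own data, is that input_partitions then only contains the part=1 partition.

explain dependency
select a.id from test_partition1 a where a.part=1;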
