How to optimize SQL with the explain execution plan (using Hive as an example)

Applicable scenarios

Before execution, Hive SQL is translated into MapReduce jobs, so it helps to understand what that translation actually produces. The explain keyword lets you inspect the concrete execution plan. From the plan you can see the execution logic the SQL is compiled into for the underlying engine, which makes targeted optimization possible (for example, some SQL rewrites that look equivalent are in fact not).

The syntax is simply to prefix the query with the explain keyword:

explain select * from ads.ads_grow_ad_material_statistic_df  where date = 20220609;
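Depending on the Hive version, explain also accepts modifiers that expose more or different detail. Two commonly available variants are sketched below, reusing the same table as above:

-- Print extra low-level detail such as file paths and serde properties
explain extended select * from ads.ads_grow_ad_material_statistic_df where date = 20220609;

-- Print only the input tables/partitions the query reads, as JSON
explain dependency select * from ads.ads_grow_ad_material_statistic_df where date = 20220609;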

Background

A Hive query is compiled into a directed acyclic graph (DAG) made up of one or more stages, with dependencies between them. More complex queries usually introduce more stages, and more stages generally mean more time to finish the job.

By default Hive executes one stage at a time. A stage can be a MapReduce job, a sampling stage, a merge stage, or a limit stage.
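Because only stages with no dependency on each other can overlap, Hive can optionally run such independent stages in parallel. A minimal sketch of the relevant session settings (these are standard Hive settings; the values shown are only illustrative and defaults vary by version):

-- Allow stages that do not depend on each other to run concurrently
set hive.exec.parallel=true;
-- Upper bound on how many stages may run at the same time
set hive.exec.parallel.thread.number=8;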

Let's walk through a simple query as an example:

-- Revenue by app_id and plan_id
explain select app_id, plan_id, sum(pay_mnt) 
from ads.ads_grow_ad_material_statistic_df  
where date = 20220609 
and material_id = 'ALL' 
group by app_id, plan_id;

The explain output is as follows:

STAGE DEPENDENCIES:                                                                                           |
|   Stage-1 is a root stage                                                                                     |
|   Stage-0 depends on stages: Stage-1                                                                          |
|                                                                                                               |
| STAGE PLANS:                                                                                                  |
|   Stage: Stage-1                                                                                              |
|     Map Reduce                                                                                                |
|       Map Operator Tree:                                                                                      |
|           TableScan                                                                                           |
|             alias: ads_grow_ad_material_statistic_df                                                          |
|             Statistics: Num rows: 262845 Data size: 7096850 Basic stats: COMPLETE Column stats: NONE          |
|             Filter Operator                                                                                   |
|               predicate: (material_id = 'ALL') (type: boolean)                                                |
|               Statistics: Num rows: 131422 Data size: 3548411 Basic stats: COMPLETE Column stats: NONE        |
|               Select Operator                                                                                 |
|                 expressions: app_id (type: bigint), plan_id (type: bigint), pay_mnt (type: bigint)            |
|                 outputColumnNames: app_id, plan_id, pay_mnt                                                   |
|                 Statistics: Num rows: 131422 Data size: 3548411 Basic stats: COMPLETE Column stats: NONE      |
|                 Group By Operator                                                                             |
|                   aggregations: sum(pay_mnt)                                                                  |
|                   keys: app_id (type: bigint), plan_id (type: bigint)                                         |
|                   mode: hash                                                                                  |
|                   outputColumnNames: _col0, _col1, _col2                                                      |
|                   Statistics: Num rows: 131422 Data size: 3548411 Basic stats: COMPLETE Column stats: NONE    |
|                   Reduce Output Operator                                                                      |
|                     key expressions: _col0 (type: bigint), _col1 (type: bigint)                               |
|                     sort order: ++                                                                            |
|                     Map-reduce partition columns: _col0 (type: bigint), _col1 (type: bigint)                  |
|                     Statistics: Num rows: 131422 Data size: 3548411 Basic stats: COMPLETE Column stats: NONE  |
|                     value expressions: _col2 (type: bigint)                                                   |
|       Reduce Operator Tree:                                                                                   |
|         Group By Operator                                                                                     |
|           aggregations: sum(VALUE._col0)                                                                      |
|           keys: KEY._col0 (type: bigint), KEY._col1 (type: bigint)                                            |
|           mode: mergepartial                                                                                  |
|           outputColumnNames: _col0, _col1, _col2                                                              |
|           Statistics: Num rows: 65711 Data size: 1774205 Basic stats: COMPLETE Column stats: NONE             |
|           File Output Operator                                                                                |
|             compressed: false                                                                                 |
|             Statistics: Num rows: 65711 Data size: 1774205 Basic stats: COMPLETE Column stats: NONE           |
|             table:                                                                                            |
|                 input format: org.apache.hadoop.mapred.TextInputFormat                                        |
|                 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                     |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                                     |
|                                                                                                               |
|   Stage: Stage-0                                                                                              |
|     Fetch Operator                                                                                            |
|       limit: -1                                                                                               |
|       Processor Tree:                                                                                         |
|         ListSink  

Let's break the output down, starting from the outermost level. It has two major parts:

  • STAGE DEPENDENCIES: the dependencies between the stages
  • STAGE PLANS: the execution plan of each stage

In the first part, STAGE DEPENDENCIES, there are two stages: Stage-1 is the root stage, i.e. where execution starts, and Stage-0 depends on Stage-1, so Stage-0 only runs after Stage-1 has finished.

The second part, STAGE PLANS, shows that Stage-1 contains most of the processing for this job and triggers a MapReduce job. Inside it there is a Map Reduce block, and an MR execution plan is split into two parts:

  • Map Operator Tree: the execution plan tree on the map side
  • Reduce Operator Tree: the execution plan tree on the reduce side

These two plan trees contain the operators of this SQL statement:

TableScan: the first operation on the map side is always loading the table, i.e. the table scan, which corresponds to FROM. TableScan takes this table as input and produces an output containing only the required columns.

alias: the table name, i.e. which table the data is read from.

Statistics:
  1. table statistics, including row count, data size, etc.;
  2. the row count, data size, etc. produced after a join;
  3. the row count, data size, etc. produced after a group-by aggregation.

Select Operator: the projection (column selection) operation; its common attributes are listed below.

expressions: the required column names and their types.

outputColumnNames:
  1. the output column names;
  2. the column names output after an aggregation;
  3. the columns output after a join.

Group By Operator: the group-by aggregation operation, triggered when the query contains GROUP BY. It produces output columns such as _col0, which are temporary column names generated by rule.

aggregations: the aggregate function(s) being computed.

mode: the aggregation mode:
  1. hash: hash-based aggregation, i.e. hash partition;
  2. partial: partial aggregation;
  3. final: final aggregation.

keys:
  1. the grouping columns; absent if there is no grouping;
  2. if there is a join, they also serve as the join key columns.

Reduce Output Operator: the operation that emits data to the reduce side; its common attributes are listed below.

sort order:
  1. empty: no sorting;
  2. +: one column sorted ascending; -: one column sorted descending;
  3. +-: two sort columns, the first ascending and the second descending.

Filter Operator: the filter operation, triggered by WHERE or HAVING.

predicate: the filter condition; e.g. for WHERE id >= 1 in the SQL it shows (id >= 1).

Map Join Operator: the join operation; its common attributes are listed below.

condition map: the join type, e.g. Inner Join 0 to 1, Left Outer Join 0 to 2.

File Output Operator: the file output operation.

compressed: whether the output is compressed.

table: table information, including the input/output file formats, the SerDe, etc.

Fetch Operator: the operation in which the client fetches the result data; usually found in Stage-0.

limit: -1 means no limit on the number of rows; any other value is the maximum number of rows returned; usually found in Stage-0.

Examples

1. When joining, do we need an extra filter to drop NULL values?

We run the following query in the Hive client:

SELECT  game_dict.package_name
       ,game_dict.game_name
       ,temp.business
FROM
(
    SELECT  package_name
           ,business
    FROM ods.ods_grow_ad_plan_info
    WHERE date = 20220609
    AND deli_busi = 3
) temp
JOIN
(
    SELECT  package_name
           ,game_name
    FROM gamesdk.game_info_all
    WHERE date = 20220609
) game_dict
ON temp.package_name = game_dict.package_name
LIMIT 10;

The question: will the join above filter out rows where package_name is NULL?

Let's verify with explain (only part of the output is shown):

TableScan                                                                                           |
alias: ods_grow_ad_plan_info                                                                      |
   Statistics: Num rows: 2593 Data size: 300879 Basic stats: COMPLETE Column stats: NONE             |
   Filter Operator                                                                                   |
   predicate: ((deli_busi = 3) and package_name is not null) (type: boolean)                       |
   Statistics: Num rows: 648 Data size: 75190 Basic stats: COMPLETE Column stats: NONE             |
   Select Operator                                                                                 |
     expressions: package_name (type: string), business (type: bigint)                             |
     outputColumnNames: _col0, _col1                                                               |
     Statistics: Num rows: 648 Data size: 75190 Basic stats: COMPLETE Column stats: NONE           |
    HashTable Sink Operator                                                                       |
       keys:                                                                                       |
         0 _col0 (type: string)                                                                    |
         1 _col0 (type: string)     
...

In the output above, the Filter Operator's predicate contains package_name is not null, which shows that a join automatically filters out rows whose join key is NULL. Note, however, that left/right outer joins and full joins do not filter NULL keys automatically.
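As a cross-check, you can explain a LEFT JOIN variant of the same query; the sketch below only changes the join keyword, everything else is reused from above. In that plan the Filter Operator on the temp side should no longer contain package_name is not null, so rows with a NULL key are kept (their game_dict columns simply come back as NULL):

explain
SELECT  game_dict.package_name
       ,game_dict.game_name
       ,temp.business
FROM
(
    SELECT  package_name
           ,business
    FROM ods.ods_grow_ad_plan_info
    WHERE date = 20220609
    AND deli_busi = 3
) temp
LEFT JOIN
(
    SELECT  package_name
           ,game_name
    FROM gamesdk.game_info_all
    WHERE date = 20220609
) game_dict
ON temp.package_name = game_dict.package_name
LIMIT 10;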

2. After a GROUP BY, do we still need to sort the result?

SELECT  app_id
       ,plan_id
       ,SUM(pay_mnt)
FROM ads.ads_grow_ad_material_statistic_df
WHERE date = 20220609
AND material_id = 'ALL'
GROUP BY  app_id
         ,plan_id;

Question: after grouping by app_id and plan_id, do we still need to sort the result?

Let's look directly at the explain output (only part of it is shown):

TableScan                                                                                           |
    alias: ads_grow_ad_material_statistic_df                                                          |
    Statistics: Num rows: 262845 Data size: 7096850 Basic stats: COMPLETE Column stats: NONE          |
    Filter Operator                                                                                   |
        predicate: (material_id = 'ALL') (type: boolean)                                                |
        Statistics: Num rows: 131422 Data size: 3548411 Basic stats: COMPLETE Column stats: NONE        |
        Select Operator                                                                                 |
            expressions: app_id (type: bigint), plan_id (type: bigint), pay_mnt (type: bigint)            |
            outputColumnNames: app_id, plan_id, pay_mnt                                                   |
            Statistics: Num rows: 131422 Data size: 3548411 Basic stats: COMPLETE Column stats: NONE      |
            Group By Operator                                                                             |
                aggregations: sum(pay_mnt)                                                                  |
                keys: app_id (type: bigint), plan_id (type: bigint)                                         |
                mode: hash                                                                                  |
                outputColumnNames: _col0, _col1, _col2                                                      |
                Statistics: Num rows: 131422 Data size: 3548411 Basic stats: COMPLETE Column stats: NONE    |
                Reduce Output Operator                                                                      |
                    key expressions: _col0 (type: bigint), _col1 (type: bigint)                               |
                    sort order: ++                                                                            |
                    Map-reduce partition columns: _col0 (type: bigint), _col1 (type: bigint)                  |
                    Statistics: Num rows: 131422 Data size: 3548411 Basic stats: COMPLETE Column stats: NONE  |
                    value expressions: _col2 (type: bigint)
...   

For the grouping, look at the Group By Operator: keys: app_id (type: bigint), plan_id (type: bigint) means the data is grouped by app_id and plan_id. Further down, the Reduce Output Operator shows sort order: ++, meaning the rows are already sorted in ascending order on both key columns as part of the shuffle, so adding another explicit sort on these columns is unnecessary.
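For comparison, the sketch below appends an explicit ORDER BY on the same keys to the query above (nothing else changes). In a typical plan this adds another MapReduce stage that funnels all rows through a single reducer for the global sort, which is the cost you avoid by relying on the shuffle ordering alone:

explain
SELECT  app_id
       ,plan_id
       ,SUM(pay_mnt)
FROM ads.ads_grow_ad_material_statistic_df
WHERE date = 20220609
AND material_id = 'ALL'
GROUP BY  app_id
         ,plan_id
ORDER BY  app_id
         ,plan_id;
-- Expect an extra stage in STAGE DEPENDENCIES (e.g. Stage-2 depending on Stage-1)
-- whose reduce side performs the global ordering.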

3. Which is more efficient, a subquery or a join?

Consider the following two SQL statements:

-- SQL1
select  a.id, b.id
from (
    select id
    from ods.ods_grow_ad_plan_info
    where date = 20220609
) a
join (
    select id, caid
    from ods.ods_grow_ad_create_info
    where date = 20220609
) b
on a.id = b.caid;

-- SQL2
SELECT  a.id
FROM
(
    SELECT  id
    FROM ods.ods_grow_ad_plan_info
    WHERE date = 20220609
) a
WHERE a.id IN ( SELECT id FROM ods.ods_grow_ad_create_info WHERE date = 20220609);

Both statements return the same result, but which one runs more efficiently, and where does the difference come from?

The explain output of the first statement:

Map Reduce Local Work                                                                                    |
   Alias -> Map Local Tables:                                                                             |
     a:ods_grow_ad_plan_info                                                                              |
       Fetch Operator                                                                                     |
         limit: -1                                                                                        |
   Alias -> Map Local Operator Tree:                                                                      |
     a:ods_grow_ad_plan_info                                                                              |
       TableScan                                                                                          |
         alias: ods_grow_ad_plan_info                                                                     |
         Statistics: Num rows: 37609 Data size: 300879 Basic stats: COMPLETE Column stats: NONE           |
         Filter Operator                                                                                  |
           predicate: id is not null (type: boolean)                                                      |
           Statistics: Num rows: 18805 Data size: 150443 Basic stats: COMPLETE Column stats: NONE         |
           Select Operator                                                                                |
             expressions: id (type: bigint)                                                               |
             outputColumnNames: _col0                                                                     |
             Statistics: Num rows: 18805 Data size: 150443 Basic stats: COMPLETE Column stats: NONE       |
             HashTable Sink Operator                                                                      |
               keys:                                                                                      |
                 0 _col0 (type: bigint)                                                                   |
                 1 _col1 (type: bigint)
...

The explain output of the second statement:

Map Reduce Local Work                                                                                     |
   Alias -> Map Local Tables:                                                                              |
     sq_1:ods_grow_ad_create_info                                                                          |
       Fetch Operator                                                                                      |
         limit: -1                                                                                         |
   Alias -> Map Local Operator Tree:                                                                       |
     sq_1:ods_grow_ad_create_info                                                                          |
       TableScan                                                                                           |
         alias: ods_grow_ad_create_info                                                                    |
         Statistics: Num rows: 242672 Data size: 1941378 Basic stats: COMPLETE Column stats: NONE          |
         Filter Operator                                                                                   |
           predicate: id is not null (type: boolean)                                                       |
           Statistics: Num rows: 121336 Data size: 970689 Basic stats: COMPLETE Column stats: NONE         |
           Select Operator                                                                                 |
             expressions: id (type: bigint)                                                                |
             outputColumnNames: _col0                                                                      |
             Statistics: Num rows: 121336 Data size: 970689 Basic stats: COMPLETE Column stats: NONE       |
             Group By Operator                                                                             |
               keys: _col0 (type: bigint)                                                                  |
               mode: hash                                                                                  |
               outputColumnNames: _col0                                                                    |
               Statistics: Num rows: 121336 Data size: 970689 Basic stats: COMPLETE Column stats: NONE     |
               HashTable Sink Operator                                                                     |
                 keys:                                                                                     |
                   0 _col0 (type: bigint)                                                                  |
                   1 _col0 (type: bigint) 
 ...

Compared with the first statement, the second one's plan contains an extra Group By Operator, i.e. an additional hash-partition aggregation that deduplicates the subquery keys. Replacing the subquery with a join therefore avoids this step and improves performance.
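One caveat when doing this rewrite by hand: an IN subquery has semi-join semantics, so a plain join is only row-for-row equivalent when the key is unique on the subquery side. A sketch of SQL2 rewritten as a join under that assumption (names reused from above):

SELECT  a.id
FROM
(
    SELECT  id
    FROM ods.ods_grow_ad_plan_info
    WHERE date = 20220609
) a
JOIN
(
    SELECT  id
    FROM ods.ods_grow_ad_create_info
    WHERE date = 20220609
) b
ON a.id = b.id;
-- If id can repeat in ods_grow_ad_create_info, the join would duplicate rows from a;
-- in that case keep a DISTINCT in subquery b (which brings the Group By back) or use
-- Hive's LEFT SEMI JOIN instead.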

4. Interview question: compare the efficiency of two SQL statements

-- SQL1
SELECT
 a.id,
 b.user_name
FROM
 test1 a
JOIN test2 b ON a.id = b.id
WHERE
 a.id > 2;
 
-- SQL2
SELECT
 a.id,
 b.user_name
FROM
(SELECT * FROM test1 WHERE id > 2) a
JOIN test2 b ON a.id = b.id;

Both statements return the same result, but which one runs more efficiently?

  • Maybe the first one, because the second contains a subquery, and subqueries can hurt performance;
  • or maybe the second one, because it filters first, so fewer rows take part in the join and it should therefore run faster.

Running explain on both statements shows (full output omitted here) that apart from the table alias the execution plans are identical: both apply the WHERE filter first and then perform the join. Hive's optimizer handles this for us automatically, so the two statements are equally efficient.
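The optimization at work here is predicate pushdown, which is governed by a standard Hive setting (enabled by default in recent versions). A sketch for reference; if it were disabled, the two plans could differ:

-- Show the current value of the predicate-pushdown switch
set hive.optimize.ppd;
-- Make sure WHERE conditions are pushed below the join
set hive.optimize.ppd=true;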


Being able to read explain plans helps us optimize Hive SQL and gives us better control over our queries in general, whether that means checking stage dependencies, troubleshooting data skew, or general Hive tuning. Feel free to try it out yourself.
