Predicate Pushdown in Hive: Principles and Impact

Article link: https://blog.csdn.net/d905133872/article/details/131245092

1. What is predicate pushdown

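Predicate pushdown means evaluating as many filter conditions as possible close to the data source, so that a query can skip irrelevant data instead of reading it and discarding it later. With columnar file formats such as Parquet or ORC, whole blocks of data, or even whole files, can be skipped when they cannot contain matching rows.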
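The block-skipping idea can be sketched in a few lines of Python. This is only an illustration, not the Parquet or ORC reader API: `RowGroup`, `scan`, and the sample rows are hypothetical names and data, and the sketch assumes each block stores the minimum and maximum `part_date` it contains.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    """Hypothetical block of rows plus the min/max statistics stored with it."""
    min_day: str
    max_day: str
    rows: List[dict]


def scan(row_groups, day):
    """Return rows with part_date == day and the number of blocks actually read."""
    matched, groups_read = [], 0
    for rg in row_groups:
        # The pushed-down predicate is checked against the statistics first;
        # a block whose range cannot contain `day` is skipped without reading rows.
        if day < rg.min_day or day > rg.max_day:
            continue
        groups_read += 1
        matched.extend(r for r in rg.rows if r["part_date"] == day)
    return matched, groups_read


row_groups = [
    RowGroup("2019-12-30", "2019-12-31", [
        {"role_id": 1, "part_date": "2019-12-30"},
        {"role_id": 2, "part_date": "2019-12-31"},
    ]),
    RowGroup("2020-01-01", "2020-01-02", [
        {"role_id": 3, "part_date": "2020-01-01"},
        {"role_id": 4, "part_date": "2020-01-02"},
    ]),
    RowGroup("2020-01-03", "2020-01-04", [
        {"role_id": 5, "part_date": "2020-01-03"},
        {"role_id": 6, "part_date": "2020-01-04"},
    ]),
]

rows, groups_read = scan(row_groups, "2020-01-01")
print(rows)         # [{'role_id': 3, 'part_date': '2020-01-01'}]
print(groups_read)  # 1 (two of the three blocks were skipped)
```

Only one of the three blocks is read here. Without the statistics check, all six rows would be read and five of them thrown away.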

2. Predicate pushdown in Hive

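Predicate Pushdown (PPD) in Hive moves filter conditions down to the map side so that they are applied early. This reduces the amount of data transferred from the map side to the reduce side and improves overall performance. In short: filter first, then aggregate or join.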

-- The relevant setting (default: true):
set hive.optimize.ppd = true;

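Summary:

1. Predicate pushdown filters out large amounts of irrelevant data from big tables at the storage layer, which reduces the data scanned. "Pushed down" means the predicate is evaluated on the map side; "not pushed down" means it is evaluated on the reduce side.
2. For an inner join, predicates are pushed down wherever they are written (ON or WHERE).
3. For a left join, predicates on the left table should be written in the WHERE clause.
4. For a right join, predicates on the left table should be written in the ON clause.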
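Rule 2 holds because, for an inner join, a filter in ON and the same filter in WHERE select the same rows, so the optimizer can move it freely. The sketch below checks this on toy data. It runs on SQLite through Python's `sqlite3` module as a stand-in for Hive, on the assumption that the join semantics shown are standard SQL. The tables `t1` and `t2` and their rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (role_id INTEGER, part_date TEXT);
CREATE TABLE t2 (role_id INTEGER, part_date TEXT);
INSERT INTO t1 VALUES (1, '2020-01-01'), (2, '2020-01-01'), (3, '2020-01-02');
INSERT INTO t2 VALUES (1, '2020-01-01'), (2, '2020-01-02'), (3, '2020-01-01');
""")

# Filters written in the ON clause of an inner join.
on_rows = conn.execute("""
    SELECT t1.role_id
    FROM t1 JOIN t2
      ON t1.role_id = t2.role_id
     AND t1.part_date = '2020-01-01'
     AND t2.part_date = '2020-01-01'
    ORDER BY t1.role_id
""").fetchall()

# The same filters written in the WHERE clause.
where_rows = conn.execute("""
    SELECT t1.role_id
    FROM t1 JOIN t2
      ON t1.role_id = t2.role_id
    WHERE t1.part_date = '2020-01-01'
      AND t2.part_date = '2020-01-01'
    ORDER BY t1.role_id
""").fetchall()

print(on_rows)     # [(1,)]
print(where_rows)  # [(1,)]
```

Both forms return the same rows, so placement only matters for outer joins, where ON and WHERE mean different things.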

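3. Inconsistent results caused by predicate pushdown

Let's look at a few typical SQL queries. The two numbers after each label are the values the query returned for new_role_cnt and pay_role_cnt.

SQL1: 20672 and 9721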

select
    count(distinct t1.role_id) as new_role_cnt,
    count(distinct t2.role_id) as pay_role_cnt
from(
    select
        role_id, part_date
    from ods_game_dev.ods_role_create
    where part_date = '2020-01-01'
) t1
left join ods_game_dev.ods_role_recharge t2
on t1.role_id = t2.role_id and t2.part_date = '2020-01-01'

SQL2: 9721 and 9721

select
    count(distinct t1.role_id) as new_role_cnt,
    count(distinct t2.role_id) as pay_role_cnt
from ods_game_dev.ods_role_create t1
left join ods_game_dev.ods_role_recharge t2
on t1.role_id = t2.role_id
where t1.part_date = '2020-01-01' and t2.part_date = '2020-01-01'

SQL3: 20672 and 9721

select
    count(distinct t1.role_id) as new_role_cnt,
    count(distinct t2.role_id) as pay_role_cnt
from ods_game_dev.ods_role_create t1
left join ods_game_dev.ods_role_recharge t2
on t1.role_id = t2.role_id and t2.part_date = '2020-01-01'
where t1.part_date = '2020-01-01'

SQL4: 184125 and 9721

select
    count(distinct t1.role_id) as new_role_cnt,
    count(distinct t2.role_id) as pay_role_cnt
from ods_game_dev.ods_role_create t1
left join ods_game_dev.ods_role_recharge t2
on t1.role_id = t2.role_id and t2.part_date = '2020-01-01' and t1.part_date = '2020-01-01'

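From the queries above we can see:

        1) SQL1: t1 is filtered first in the subquery, and the t2 condition in ON qualifies for pushdown. Each table is filtered on its own condition before the join, so each count reflects only that table's own filter.

        2) SQL2: the t1 condition in WHERE qualifies for pushdown, but the t2 condition in WHERE does not. The t2 condition is therefore applied after the join, where it discards every row in which t2 is NULL, so the left join effectively behaves like an inner join. Both counts pass through the t2 condition, which is why they are equal.

        3) SQL3: the left table t1 is filtered in WHERE and qualifies for pushdown; the right table t2 is filtered in ON and also qualifies. Both tables are filtered before the join, so, as in SQL1, each count reflects that table's own filter.

        4) SQL4: the t1 condition in ON does not qualify for pushdown, while the t2 condition in ON does. A filter on the left table t1 must go in WHERE. In ON it does not filter t1 at all: for a left join, the ON clause only decides which t2 rows match, and every t1 row is still returned. So t1 is counted in full, while t2 is counted after filtering.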
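The same pattern can be reproduced on a tiny data set. The sketch below runs the four query shapes on SQLite through Python's `sqlite3` module rather than on Hive, assuming the ON/WHERE semantics of a left join are the same standard SQL in both. The tables `role_create` and `role_recharge` and their rows are made-up stand-ins for the original tables, so the counts differ from the ones reported above; only the pattern is comparable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE role_create   (role_id INTEGER, part_date TEXT);
CREATE TABLE role_recharge (role_id INTEGER, part_date TEXT);
INSERT INTO role_create VALUES
    (1, '2020-01-01'), (2, '2020-01-01'), (3, '2020-01-01'),
    (4, '2019-12-31'), (5, '2019-12-31');
INSERT INTO role_recharge VALUES
    (1, '2020-01-01'), (2, '2019-12-31'),
    (4, '2020-01-01'), (5, '2019-12-31');
""")

COUNTS = "SELECT count(DISTINCT t1.role_id), count(DISTINCT t2.role_id) "
DAY = "'2020-01-01'"

queries = {
    # t1 filtered in a subquery, t2 filtered in ON
    "SQL1": COUNTS + f"""
        FROM (SELECT role_id FROM role_create WHERE part_date = {DAY}) t1
        LEFT JOIN role_recharge t2
          ON t1.role_id = t2.role_id AND t2.part_date = {DAY}""",
    # both filters in WHERE: NULL-extended rows are discarded after the join
    "SQL2": COUNTS + f"""
        FROM role_create t1
        LEFT JOIN role_recharge t2 ON t1.role_id = t2.role_id
        WHERE t1.part_date = {DAY} AND t2.part_date = {DAY}""",
    # t2 filter in ON, t1 filter in WHERE
    "SQL3": COUNTS + f"""
        FROM role_create t1
        LEFT JOIN role_recharge t2
          ON t1.role_id = t2.role_id AND t2.part_date = {DAY}
        WHERE t1.part_date = {DAY}""",
    # both filters in ON: t1 is not filtered at all
    "SQL4": COUNTS + f"""
        FROM role_create t1
        LEFT JOIN role_recharge t2
          ON t1.role_id = t2.role_id AND t2.part_date = {DAY}
         AND t1.part_date = {DAY}""",
}

results = {name: conn.execute(sql).fetchone() for name, sql in queries.items()}
for name, counts in results.items():
    print(name, counts)
# SQL1 (3, 1)
# SQL2 (1, 1)
# SQL3 (3, 1)
# SQL4 (5, 1)
```

SQL1 and SQL3 agree, SQL2 collapses to the inner-join result, and SQL4 counts every row of the left table. This matches the behaviour described for the four queries above.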