Hive 自0.14.0开始,加入了一项”Cost based Optimizer”来对HQL执行计划进行优化,这个功能通过”hive.cbo.enable”来开启。在Hive 1.1.0之后,这个feature是默认开启的,它可以自动优化HQL中多个JOIN的顺序,并选择合适的JOIN算法
Join reordering and join algorithm selection are few of the optimizations that can benefit from a cost based optimizer. Cost based optimizer would free up user from having to rearrange joins in the right order or from having to specify join algorithm by using query hints and configuration options. This can potentially free up users to model their reporting and ETL needs close to business process without having to worry about query optimizations.
不过在工作中,发现这个功能在某些情况下反而会导致Query的执行效率降低,这里写个blog记录一下。以下仅针对Hive on tez , 至于 Hive on mapreduce 的情况待考察。
假如我们有三张表:
1. 事实表 table_fact
col_name | data_type | comment |
---|---|---|
field_dim_1 | string | |
field_dim_2 | string | |
score | bigint |
2. 维度表 table_dim1
col_name | data_type | comment |
---|---|---|
field_dim_1 | string | |
field_res_1 | string |
3. 维度表 table_dim2
col_name | data_type | comment |
---|---|---|
field_dim_1 | string | |
field_dim_2 | string | |
field_res_2 | bigint |
这是一个常见的星型模型的缩影,由一个事实表和两个维度表组成。通常我们在查询时需要从维度表中关联出我们需要的信息,比如:
select
d1.field_res_1,
d2.field_res_2,
sum(f.score)
from
table_fact f
left join table_dim_1 d1 on f.field_dim_1=d1.field_dim_1
left join table_dim_2 d2 on f.field_dim_1=d2.field_dim_1 and f.field_dim_2=d2.field_dim_2
where
f.field_dim_2="abc"
group by
d1.field_res_1,
d2.field_res_2;
这是一个比较常见的查询语句。在cbo开关打开的情况下,执行计划是这个样子的:
+--------------------------------------------------