首先我们要了解MAPJOIN优化原理,这里简要说明下
Spark Broadcast hash join(Hive map join同理)
1,把小表广播到所有大表分布的节点上,在每个节点上分别进行单机hash join
2,left join时只能广播右表
执行基本条件:小表必须小于参数spark.sql.autoBroadcastJoinThreshold, 默认为10M
sql 场景小表需要left join大表 小表150M左右 大表1T左右
原始sql(广播左表,未成功,因为left join只支持广播右表)
create table tmp.iapp_mkt_per_imei_pkg_repair_t2_not_default_20200812_1 as
select /*+MAPJOIN(a) */ a.imei,b.package,1 status,1 install_type,b.imei imei_b
from
(select imei from tmp.iapp_mkt_per_imei_only_repair_20200812 where install_type = 0 and tail != 'default') a
left join join
(select imei,package from edw.sys_app_list_fact where data_date = 20200812 and tail != 'default' and imei is not null and status in (1,2))b
on a.imei = b.imei