A case study: optimizing Spark SQL with broadcast joins


L13763338360


Article link: https://blog.csdn.net/L13763338360/article/details/106616788


A common SQL example

insert overwrite table tbl_result
select
……
from table1
left join table2 on table2.id = table1.id
left join table3 on table3.id = table1.id
left join table4 on table4.id = table1.id
left join table5 on table5.id = table1.id
left join table6 on table6.id = table1.id

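Real workloads can be more complex than this. They may mix joins between large and small tables with group by and other operations, and they also have to deal with small files and shuffles, so a single job can end up running very slowly.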

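Watching the job run shows that some tables are large and some are small. When a join involves a large table and a small table, the usual optimization is to broadcast the small table so that a map join replaces the reduce join.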
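As a minimal sketch of requesting a map join explicitly, Spark SQL accepts a BROADCAST join hint (MAPJOIN is an alias for it). The example assumes table6 is small enough to fit in memory on every executor and that both tables have an id column, as in the query above. The selected columns are illustrative only.

```sql
-- Ask the planner to broadcast table6 rather than shuffle both sides of the join.
-- Assumption: table6 fits comfortably in memory on every executor.
select /*+ BROADCAST(table6) */
    table1.id,
    table6.id as table6_id  -- illustrative column list
from table1
left join table6 on table6.id = table1.id;
```

With the hint, Spark prioritizes broadcasting the named table even when its estimated size is above spark.sql.autoBroadcastJoinThreshold, so it is worth confirming that the table really is small. For a left join, only the right-hand table can be the broadcast side.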

For example, if table5 and table6 are small, they can be broadcast first and then joined:

-- set spark.merge.mergefiles=true ;
-- set spark.merge.per.size_mb=128 ;

cache table table5_cache as SELECT * from table5;
cache table table6_cache as SELECT * from table6;

insert overwrite table tbl_result
select
……
from table1
left join table2 on table2.id = table1.id
left join table3 on table3.id = table1.id
left join table4 on table4.id = table1.id
left join table5_cache on table5_cache.id = table1.id
left join table6_cache on table6_cache.id = table1.id
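Whether a cached table is actually broadcast can be checked in the query plan. The sketch below assumes table6_cache has already been created by the cache table statement above. It uses a reduced, illustrative query rather than the full insert.

```sql
-- Print the physical plan for a join against the cached small table.
-- Assumption: table6_cache exists in the current session.
explain
select table1.id
from table1
left join table6_cache on table6_cache.id = table1.id;
-- In the output, BroadcastHashJoin indicates a map-side (broadcast) join;
-- SortMergeJoin indicates the table was not broadcast.
```

If the plan still shows SortMergeJoin, the cached table is probably larger than the broadcast threshold described next.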

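By default, a table is broadcast to all worker nodes only if it is no larger than 10 MB (10485760 bytes). To broadcast a table larger than that, set the following parameter when starting the job, for example raising the threshold to 100 MB: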

--conf spark.sql.autoBroadcastJoinThreshold=104857600
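The threshold can also be changed inside a running Spark SQL session rather than at startup. A minimal sketch, assuming the session allows this property to be set at runtime:

```sql
-- The value is in bytes: 100 * 1024 * 1024 = 104857600 (100 MB).
set spark.sql.autoBroadcastJoinThreshold=104857600;

-- Setting the value to -1 disables automatic broadcast joins.
-- set spark.sql.autoBroadcastJoinThreshold=-1;
```

Raising the threshold means each executor holds a full copy of every broadcast table, so the limit should stay well within the available executor memory.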
