1. Column pruning and partition pruning
Read only the columns and partitions the query actually needs. For example:
select * from shuidi_dwb.dwb_cf_case_info_full_d
should be rewritten as:
select case_id,ckr_id from shuidi_dwb.dwb_cf_case_info_full_d where dt='2019-08-28';
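One way to check that partition pruning actually takes effect is Hive's EXPLAIN DEPENDENCY, which reports the tables and partitions a query will read. This is only a sketch, and it assumes the table is partitioned by dt, as the example above suggests.

```sql
-- Assumption: shuidi_dwb.dwb_cf_case_info_full_d is partitioned by dt.
-- The JSON output lists input_tables and input_partitions; with pruning in
-- effect, only the dt='2019-08-28' partition should appear.
EXPLAIN DEPENDENCY
SELECT case_id, ckr_id
FROM shuidi_dwb.dwb_cf_case_info_full_d
WHERE dt = '2019-08-28';
```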
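2. Filter as early and as much as possible to shrink the data at every stage
When a query joins several tables, put the filter (where) conditions inside each subquery, before the join, so that less data flows into the next job. Note that in the first query below the where condition on b.ds also drops rows of a that have no match in b, whereas the rewritten query keeps them with a NULL b value.
Before: SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key)
WHERE a.ds='2009-07-07' AND b.ds='2009-07-07'
After: SELECT x.val, y.val FROM
(select key,val from a where a.ds='2009-07-07' ) x LEFT OUTER JOIN
(select key,val from b where b.ds='2009-07-07' ) y ON x.key=y.key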
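The same pattern extends to longer join chains. The sketch below is purely illustrative: the third table c and its columns are hypothetical and are assumed to share the key, val and ds layout of a and b.

```sql
-- Hypothetical third table c; every input is filtered and trimmed to the
-- needed columns before it reaches the join.
SELECT x.val, y.val, z.val
FROM (SELECT key, val FROM a WHERE ds = '2009-07-07') x
LEFT OUTER JOIN
     (SELECT key, val FROM b WHERE ds = '2009-07-07') y ON x.key = y.key
LEFT OUTER JOIN
     (SELECT key, val FROM c WHERE ds = '2009-07-07') z ON x.key = z.key;
```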
3. Make good use of multi-insert
-- scans a twice
insert overwrite table tmp1
select ... from a where condition1;
insert overwrite table tmp2
select ... from a where condition2;
-- scans a once
from a
insert overwrite table tmp1
select ... where condition1
insert overwrite table tmp2
select ... where condition2;
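A filled-in version of the single-scan form might look like the following. The columns key and val and the two predicates are assumptions made for illustration; they are not given in the text above.

```sql
-- Hypothetical columns (key, val) and predicates. tmp1 and tmp2 are assumed
-- to exist already with a matching two-column schema.
FROM a
INSERT OVERWRITE TABLE tmp1
  SELECT key, val WHERE val >= 100
INSERT OVERWRITE TABLE tmp2
  SELECT key, val WHERE val < 100;
```

The whole statement ends with one semicolon, so both inserts are fed from the same scan of a.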
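4. Using with as correctly
with as, also known as subquery factoring, behaves much like a view: you define a SQL fragment once and reference it throughout the statement, mainly to make the SQL more readable.
However, when all of the following hold, it is better to create a temporary table instead of using with:
  1. The logic is complex and joins several tables, for example four or more large tables
  2. The with part produces more than 3 million rows (roughly over 25 MB)
  3. Downstream logic references it frequently, for example three or more times
with a as (select ... from table)
select * from a
left join (select * from a) a1 on ...
left join (select * from a) a2 on ...
left join (select * from a) a3 on ...
In this case the with part should be changed to: create table tmp.tmp1 as select ... from table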
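In Hive, a with body is typically inlined at each reference and may be recomputed every time, which is presumably the motivation for this rule. A minimal sketch of the temporary-table version follows; the source table src, its columns, and the join conditions are all hypothetical.

```sql
-- Hypothetical source table src with columns (key, val, dt).
DROP TABLE IF EXISTS tmp.tmp1;
CREATE TABLE tmp.tmp1 AS
SELECT key, val FROM src WHERE dt = '2019-08-28';

-- Each reference now reads the materialized table instead of re-running the
-- original logic. Real join conditions would differ per reference.
SELECT t0.key, t1.val AS val1, t2.val AS val2, t3.val AS val3
FROM tmp.tmp1 t0
LEFT JOIN tmp.tmp1 t1 ON t0.key = t1.key
LEFT JOIN tmp.tmp1 t2 ON t0.key = t2.key
LEFT JOIN tmp.tmp1 t3 ON t0.key = t3.key;
```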
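5. Sorting
order by: global sort. All rows pass through a single reducer, so use it sparingly on large data sets.
select mid, money, name from store order by mid asc, money asc
distribute by ... sort by: local sort. Rows are first distributed to reducers by the distribute by key, then sorted within each reducer by the sort by columns.
select mid, money, name from store distribute by mid sort by mid asc, money asc
cluster by: combines distribute by and sort by on the same columns. The following two statements are equivalent:
select mid, money, name from store cluster by mid
is equivalent to
select mid, money, name from store distribute by mid sort by mid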
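Because cluster by distributes and sorts on the same columns in ascending order, the spelled-out form is still needed whenever the sort differs from the distribution key. As an illustrative sketch using the same store table:

```sql
-- Rows with the same mid land on the same reducer; within each reducer they
-- are ordered by mid, then by money from highest to lowest.
SELECT mid, money, name
FROM store
DISTRIBUTE BY mid
SORT BY mid ASC, money DESC;
```

The result is sorted within each reducer's output only, not across the whole result set.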