Hive Optimization
1) Multi-table join optimization. Code structure (schematic form, not literal HiveQL):
select .. from JOINTABLES (A,B,C) WITH KEYS (A.key, B.key, C.key) where ....
Multi-table joins that share the same join key are optimized into a single job.
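In literal HiveQL, the rule can be sketched as follows. The tables a, b, c and the columns key, key1, key2, val are hypothetical names chosen for illustration. When every join clause uses the same column of b, Hive can evaluate the joins in one MapReduce job; when the second clause switches to a different column of b, the joins are typically split into two jobs.

```sql
-- Hypothetical tables a, b, c. Both join clauses use b.key1,
-- so Hive can evaluate the two joins in a single MapReduce job.
SELECT a.val, b.val, c.val
FROM a
JOIN b ON (a.key = b.key1)
JOIN c ON (c.key = b.key1);

-- Here the second clause uses b.key2, a different column of b,
-- so the joins are typically executed as two MapReduce jobs.
SELECT a.val, b.val, c.val
FROM a
JOIN b ON (a.key = b.key1)
JOIN c ON (c.key = b.key2);
```

Running EXPLAIN on each statement shows how many stages the planner actually generates on a given Hive version.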
2) LEFT SEMI JOIN efficiently implements the semantics of IN/EXISTS subqueries, for example:
SELECT a.key,a.value FROM a WHERE a.key in (SELECT b.key FROM b);
A. Before LEFT SEMI JOIN was implemented, Hive expressed the above semantics with:
SELECT t1.key, t1.value FROM a t1 LEFT OUTER JOIN (SELECT DISTINCT key FROM b) t2 ON t1.key = t2.key WHERE t2.key IS NOT NULL;
B. This can be replaced by a LEFT SEMI JOIN as follows:
SELECT a.key, a.value FROM a LEFT SEMI JOIN b ON (a.key = b.key);
This form saves at least one MR (MapReduce) pass. Note that the join condition of a LEFT SEMI JOIN must be an equality.
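One more restriction is worth keeping in mind: in a LEFT SEMI JOIN the right-hand table may be referenced only in the ON clause, not in the SELECT list or the WHERE clause. A minimal sketch of how to filter the right-hand side anyway, assuming b has a numeric column named value (a hypothetical column used only for this illustration):

```sql
-- Filter the right-hand side inside a subquery, then semi-join on the key.
-- b.value is a hypothetical column; the threshold 100 is arbitrary.
SELECT a.key, a.value
FROM a
LEFT SEMI JOIN (SELECT key FROM b WHERE value > 100) t
  ON (a.key = t.key);

-- Not allowed: the right-hand table appears outside the ON clause.
-- SELECT a.key, b.value FROM a LEFT SEMI JOIN b ON (a.key = b.key);
```

Unlike an inner join, the semi join returns each row of a at most once even when b contains several rows with the same key, which is why no DISTINCT step is needed.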
3) Pre-sorting reduces the data scanned by map join and group by
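A rough sketch of what pre-sorting can look like in practice, assuming the data is stored in tables that are bucketed and sorted on the join/group key. The table names, column names, and the bucket count of 32 are hypothetical. Whether the optimizer actually uses these paths depends on the Hive version and on both tables having compatible bucketing.

```sql
-- Hypothetical tables, bucketed and sorted on the same key.
CREATE TABLE orders_sorted (key BIGINT, value STRING)
CLUSTERED BY (key) SORTED BY (key ASC) INTO 32 BUCKETS;
CREATE TABLE users_sorted (key BIGINT, name STRING)
CLUSTERED BY (key) SORTED BY (key ASC) INTO 32 BUCKETS;

-- Older Hive releases need these so that INSERT writes bucketed, sorted files
-- (Hive 2.x always enforces them and removed the options).
SET hive.enforce.bucketing = true;
SET hive.enforce.sorting = true;

-- Let the map join merge matching sorted buckets instead of scanning whole tables.
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;

SELECT /*+ MAPJOIN(u) */ o.key, o.value, u.name
FROM orders_sorted o
JOIN users_sorted u ON (o.key = u.key);

-- Let GROUP BY on the sort key be computed in the mapper.
SET hive.map.groupby.sorted = true;
SELECT key, count(*) FROM orders_sorted GROUP BY key;
```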