1. Optimizing distinct
Before optimization: all the data is sent to a single reducer
select count(distinct id)
from
(select id from tablea
union all
select id from tableb) ta
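After optimization: the group by first spreads the data across multiple reducers to deduplicate it, and the results are then counted together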
select
count(*)
from
(select id
from
(select id from tablea
union all
select id from tableb
) ta group by id) tb
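
One caveat worth checking before swapping the queries: count(distinct id) ignores NULLs, while group by id keeps a single NULL group that count(*) then counts as one row. If id can be NULL, the two results may differ by one. A minimal sketch, assuming the same tablea and tableb as above, that keeps the group by rewrite but counts id instead of * in the outer query:

```sql
-- group by id still deduplicates across multiple reducers.
-- count(id) skips the one NULL group (if any), so the result
-- lines up with what count(distinct id) would return.
select
count(id)
from
(select id
from
(select id from tablea
union all
select id from tableb
) ta group by id) tb
```

If id is known to be non-null, count(*) and count(id) return the same value here.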
Impala does not support multiple distinct aggregates in a single query; Presto and Hive do
select count(distinct id),count(distinct name)
from
jt_dw_a.black_test
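
Where the query above is rejected, one possible workaround is to compute each distinct count in its own subquery and join the one-row results. This is only a sketch: the aliases a, b, id_cnt and name_cnt are made up for illustration, and the table is scanned twice.

```sql
-- Each subquery contains a single distinct aggregate.
-- Both subqueries return exactly one row, so the cross join
-- also returns exactly one row with the two counts side by side.
select a.id_cnt, b.name_cnt
from
(select count(distinct id) as id_cnt from jt_dw_a.black_test) a
cross join
(select count(distinct name) as name_cnt from jt_dw_a.black_test) b
```

If an approximate result is acceptable, Impala's ndv() function can be applied to several columns in one query instead.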