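The difference between union and unionAll in Spark.
In Spark SQL, union (UNION) scans all of the data and then removes duplicate rows;
unionAll (UNION ALL) simply appends one dataset to the other and returns the result, so it skips the deduplication step and is usually much faster.
In practice unionAll is used more often, typically when duplicates cannot occur or do not matter.
Note that this distinction applies to SQL. In the DataFrame API, both union() and unionAll() keep duplicate rows (they behave like SQL UNION ALL); call distinct() on the result to deduplicate.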
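union returns the union of two datasets without duplicate rows. Both sides must have the same number of columns; columns are matched by position, and their types may differ as long as Spark can cast them to a common type.
unionAll returns the union of two datasets, including duplicate rows.
Intersect returns the intersection of two datasets, without duplicate rows.
Minus (a synonym for EXCEPT in Spark SQL) returns the rows of the first dataset that do not appear in the second, without duplicate rows.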
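The four set operators above can be compared side by side on a tiny dataset. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for Spark so that it runs without a cluster, and the tables a and b are made up for the example. SQLite has no MINUS keyword, so EXCEPT is used instead; the duplicate-handling behaviour shown is standard SQL and should match Spark SQL.

```python
import sqlite3

# In-memory database with two hypothetical single-column tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a VALUES (1), (2), (2), (3);
    INSERT INTO b VALUES (2), (3), (4);
""")

def run(op):
    """Run `SELECT x FROM a <op> SELECT x FROM b` and return the values sorted."""
    rows = conn.execute(f"SELECT x FROM a {op} SELECT x FROM b").fetchall()
    return sorted(r[0] for r in rows)

union_rows = run("UNION")          # duplicates removed
union_all_rows = run("UNION ALL")  # duplicates kept
intersect_rows = run("INTERSECT")  # rows present in both, deduplicated
except_rows = run("EXCEPT")        # rows in a but not in b, deduplicated

print("UNION:    ", union_rows)
print("UNION ALL:", union_all_rows)
print("INTERSECT:", intersect_rows)
print("EXCEPT:   ", except_rows)
```

Only UNION ALL returns more rows than there are distinct values, which is why it can skip the deduplication work the other operators have to do.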
spark.sql("""
  -- dis = '1': top 3 rows per cgi, ranked by left2
  (
    select t.cgi, t.n_cgi
    from (
      select a.cgi, b.n_cgi, dis, b.left2 aoa,
             row_number() over (partition by a.cgi order by b.left2 desc) as rn
      from nloc_out a
      left join ratio_cgi b on a.cgi = b.cgi
      where a.dis = '1'
    ) t
    where t.rn <= 3
  )
  union all
  -- dis = '2': top 3 rows per cgi, ranked by right1
  (
    select t.cgi, t.n_cgi
    from (
      select a.cgi, b.n_cgi, dis, b.right1 aoa,
             row_number() over (partition by a.cgi order by b.right1 desc) as rn
      from nloc_out a
      left join ratio_cgi b on a.cgi = b.cgi
      where a.dis = '2'
    ) t
    where t.rn <= 3
  )
  union all
  -- dis = '3': top 3 rows per cgi, ranked by left1 + right2
  (
    select t.cgi, t.n_cgi
    from (
      select a.cgi, b.n_cgi, dis, (b.left1 + b.right2) aoa,
             row_number() over (partition by a.cgi order by (b.left1 + b.right2) desc) as rn
      from nloc_out a
      left join ratio_cgi b on a.cgi = b.cgi
      where a.dis = '3'
    ) t
    where t.rn <= 3
  )
""").createOrReplaceTempView("nloc_ncgis_prb_out")