sparkSQL支持的子查询:
spark.sql("select u1 from (select u1,u2,u3 from tableTemp)").show()
spark.sql("select u1 from tableTemp1 where u1 > (select max(pre) from tableTemp2)").show()
sparkSQL不支持的子查询:
- exists和in子查询(新版本已支持)
比如:
Select * From feeData Where FeeID Not in (Select ID From FeeInfo)
FeeInfo 表最好有索引
优化语句:Select * From feeData as d Left Join FeeInfo f on d.FeeID =f.ID Where
f.Id is null
原因参考Although Apache Spark SQL currently does not support IN or EXISTS subqueries, you can efficiently implement the semantics by rewriting queries to use LEFT SEMI JOIN.
in子查询不支持,in后直接跟值做查询条件还是支持的:
spark.sql("select * from tableTemp where pre_prob in (0.9,1,2)").show()
但还应注意in与括号之间的距离,因为sparkSQL内置了in(expr2,expr3,…)函数,返回的是布尔值:
> SELECT 1 in(1, 2, 3);
true
> SELECT 1 in(2, 3, 4);
false
> SELECT named_struct('a', 1, 'b', 2) in(named_struct('a', 1, 'b', 1), named_struct('a', 1, 'b', 3));
false
> SELECT named_struct('a', 1, 'b', 2) in(named_struct('a', 1, 'b', 2), named_struct('a', 1, 'b', 3));
true
补充
在新版本的hive以及spark中均支持了exists和in作为子查询,参考:
spark官网描述http://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-where.html
hive官网描述https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries
- 子查询作为字段
比如:
val result = spark.sql("select
ssdggs,
ssdggsid,
ssdsmc,
ssds,
count(*) sl,
round(count(*) / (select count(*) from dyycjl dy where dy.xgsbm=sb.ssgdgsid) * 100 , 2) zb
from sbxx sb group by sb.ssdggs,sb.ssdggsid,sb.ssdsmc,sb.ssds")
改为:
val tempTable1 = spark.sql("select
ssdggs,
ssdggsid,
ssdsmc,
ssds,
count(*) sl
from sbxx sb group by sb.ssdggs,sb.ssdggsid,sb.ssdsmc,sb.ssds")
tempTable1.createTempView("tempTable1")
val tempTable2 = spark.sql("select xgsbm,count(*) num from dyycjl group by xgsbm")
tempTable2.createTempView("tempTable2")
spark.sql("select t1.*,round(t1.sl / num *100 , 2) zb from tempTable1 t1 left join tempTable2 t2 on t1.xgsbm=t2.ssgdgsid")