关于所使用的spark版本中的spark sql不支持exists和in等子查询语句的解决方案记录

stackoverflow上一篇很好的问题解答解决方法:


SparkSQL doesn't currently have EXISTS & IN. "(Latest) Spark SQL / DataFrames and Datasets Guide / Supported Hive Features"

EXISTS & IN can always be rewritten using JOIN or LEFT SEMI JOIN. "Although Apache Spark SQL currently does not support IN or EXISTS subqueries, you can efficiently implement the semantics by rewriting queries to use LEFT SEMI JOIN." OR can always be rewritten using UNION. AND NOT can be rewritten using EXCEPT.

A table holds the rows that make some predicate (statement parameterized by column names) true:

  • The DBA gives the predicates for each base table T with columns T.C,...T(T.C,...)
  • JOIN holds the rows that make the AND of its arguments' predicates true; for a UNION, the OR; for an EXCEPT, the AND NOT.
  • SELECTkept columnsFROMT holds the rows where EXISTS dropped columns [predicate of T].
  • TLEFT SEMI JOINU holds the rows where EXISTS U-only columns [predicate of T AND predicate of U].
  • TWHEREcondition holds the rows where predicate of T AND condition.

(Re querying generally see this answer.)

So by keeping in mind predicate expressions corresponding to SQL you can use straightforward logic rewrite rules to compose and/or reorganize queries. Eg using UNION here need not be "clumsy" either in terms of readability or execution.

Your original question indicated that you understood that you could use UNION and you have edited variants into your question that excise EXISTS and IN from your original queries. Here is another variant also excising OR.

    select <...>    
    from A, B, C, (select ID from ...) as e
    where
      A.FK_1 = B.PK and
      A.FK_2 = C.PK and
      A.ID = e.id
union
    select <...>    
    from A, B, C, (select ID from ...) as e
    where
      A.FK_1 = B.PK and
      A.FK_2 = C.PK and
      A.ID = e.ID

Your Solution 1 does not do what you think it does. If just one of the exists_clause tables are empty, ie even if there are ID matches available in the other, the FROM cross product of tables is empty and no rows are returned. ("An Unintuitive Consequence of SQL Semantics": Chapter 6 The Database Language SQL sidebar page 264 of Database Systems: The Complete Book 2nd Edition.) A FROM is not just introducing names for rows of tables, it is CROSS JOINing and/or OUTER JOINing them after which ON (for INNER JOINs) and WHERE filter some out.

Performance is typically different for different expressions returning the same rows. This depends on DBMS optimization. Many details, which the DBMS and/or programmer may be able to know and if so may or may not know and may or may not best balance, affect the best way to evaluate a query and the best way to write it. But executing two ORed subselects per row in a WHERE (as in your original queries but also your late Solution 2) is not necessarily better than running one UNION of two SELECTs (as in my query).


原链接:http://stackoverflow.com/questions/34861516/spark-replacement-for-exists-and-in


  • 1
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值