Notes on a MINUS problem in Spark SQL

Latest recommended article published on 2022-06-10 20:09:43

小朋友2D

Category: Spark SQL

Article link: https://blog.csdn.net/ct2020129/article/details/103306666

Today I wrote a rather convoluted SQL query. It looked roughly like this:

-- I may well have overcomplicated this
with
org_year_view as (
    select distinct org, year from A
)
select *
from A
minus
-- find data that cannot be used because of missing data in formula
select f.*
from A f
join (
    -- find missing val
    select c.org, c.year, min(c.val) as val
    from
        (select a.*, b.* from org_year_view a cross join val_view b) c
    left join
        A d
    on c.org = d.org and c.year = d.year and c.val = d.val
    where d.amt is null
    group by c.org, c.year
) e
on f.org = e.org
and f.year = e.year
and f.val >= e.val

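A quick explanation: table A is itself the result of various queries over other tables.
The goal is to run a few more selects on it and then take the set difference against A itself.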
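To make the intent of the query easier to follow, here is a rough model of the same logic on plain Scala collections, with no Spark involved. Everything named here is hypothetical: `Rec` carries only the columns the query touches (`value` stands in for `val`, which is a keyword in Scala), and `requiredVals` stands in for `val_view`, whose definition is not shown above. It is a sketch of what the query seems meant to compute, not the query itself.

```scala
// Hypothetical row type: only the columns used by the joins and filters above.
final case class Rec(org: String, year: Int, value: Int, amt: Double)

object MinusSketch {
  // Rows of `a` that survive the MINUS: for each (org, year), drop every row
  // whose value is >= the smallest required value that is absent from `a`.
  def usableRows(a: Seq[Rec], requiredVals: Seq[Int]): Set[Rec] = {
    // Values present per (org, year); the keys play the role of org_year_view.
    val present: Map[(String, Int), Set[Int]] =
      a.groupBy(r => (r.org, r.year)).map { case (k, rows) => k -> rows.map(_.value).toSet }

    // Smallest missing value per (org, year): the left join with the null check
    // plus min(c.val) in subquery e. Groups with nothing missing produce no entry.
    val firstMissing: Map[(String, Int), Int] =
      present.toSeq.flatMap { case (k, vals) =>
        val missing = requiredVals.filterNot(vals.contains)
        if (missing.isEmpty) Nil else List(k -> missing.min)
      }.toMap

    // Right-hand side of the MINUS: rows at or beyond the first gap.
    val unusable = a.filter(r => firstMissing.get((r.org, r.year)).exists(r.value >= _))

    // MINUS is a distinct set difference, hence a Set rather than a Seq.
    a.toSet -- unusable
  }
}
```

For example, if one (org, year) group has values 1 and 3 while 1, 2 and 3 are required, the first gap is 2, so the row with value 3 is dropped and only the row with value 1 remains.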

The error

I extracted the relevant part:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Resolved attribute(s) res#2276 missing from amt#1119,val#1129, cd#1124,res#190, year#1113 in operator
!Project [org#1120, year#1113, cd#1124, val#1129, amt#1119, res#2276, res#2276].
Attribute(s) with the same name appear in the operation: res. Please check if the right attribute(s) are used.;;

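Roughly, it says that the Project references res#2276, but its input only provides res#190: same column name, different attribute.

This is where it gets interesting. If [all of A] minus [part of A] hits this problem, then as far as the analyzer is concerned, [all of A] and [part of A] actually come from different tables.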

Checking the execution plan

Again, I extracted the relevant part:
Except
:- Project [org#58, year#51, cd#62, val#766, amt#57, res#190]
+- Project [org#2617, year#2610, cd#2621, val#1361, amt#2616, res#2276]

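The plan does confirm that the two sides come from different tables.

However, an operation like [A] minus [A limit 10] works fine, provided that A is a temp view declared on a DataFrame read directly from an external data source.

For now I have not optimized the SQL statement itself any further. I used a bit of a cheat instead.

The cheat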

([all of A]: DataFrame).except(([part of A]: DataFrame))
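Spelled out a little more, the cheat could look something like the sketch below. The object name, the method name and the `unusableSql` parameter are all made up for illustration: `unusableSql` is assumed to hold the right-hand select of the original minus as a string, and the view name "A" is taken from the query above. The sketch only shows the shape of the call and is not a claim about why it sidesteps the analysis error.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExceptSketch {
  // `fullA` stands for [all of A]. `unusableSql` is a hypothetical string with
  // the right-hand SELECT of the original MINUS, run against the temp view "A".
  def usableRows(spark: SparkSession, fullA: DataFrame, unusableSql: String): DataFrame = {
    // Expose the DataFrame under the name the SQL text expects.
    fullA.createOrReplaceTempView("A")

    // [part of A]: built separately instead of inside one big MINUS statement.
    val partOfA: DataFrame = spark.sql(unusableSql)

    // Dataset.except is a distinct set difference (EXCEPT DISTINCT in SQL).
    // Columns are matched by position, so both sides need the same column order.
    fullA.except(partOfA)
  }
}
```

Note that `except` removes duplicate rows from the result, just like minus in SQL. If duplicates need to be preserved, `exceptAll` (available since Spark 2.4) is the variant to look at.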

Follow-up

The requirement ended up being cancelled -,-
