一次由于join粒度不统一导致数据膨胀的错误

最新推荐文章于 2024-10-07 12:04:27 发布

共勉zzz

最新推荐文章于 2024-10-07 12:04:27 发布

阅读量255

点赞数 4

文章标签： spark 大数据数据仓库

本文链接：https://blog.csdn.net/zfxhz/article/details/140552742

版权

前段时间犯了一个因为sql中join连接的各部分粒度不一致导致数据膨胀的问题，
下面是我最初的错误代码，这段sql主要是为了查询时间区间内用户id对各个任务域的调用次数情况：

SELECT
t1.id,
COALESCE(sum(t2.pay_order_cnt),0) as `收单单量`,
COALESCE(sum(t3.pay_cnt),0) as `付款单量`,
COALESCE(sum(t4.settle_order_cnt), 0) as `结算单量`,
COALESCE(sum(t5.mwallet_cnt),0) as `钱包单量`,
COALESCE(sum(t6.merchant_create_cnt), 0) as `商户`,
COALESCE(sum(t7.settle_order_cnt), 0) as `实名`
from
upload_table.no t1
LEFT JOIN
trade_daily_inc t2
on t1.id = t2.no
and t2.partition_date BETWEEN '$$begindate' and '$$enddate'
LEFT JOIN
(SELECT
no,count(1) pay_cnt
from pay_inc
where partition_date BETWEEN '$$begindate' and '$$enddate'
GROUP BY no)t3 on t1.id = t3.no
LEFT JOIN
(
SELECT
no,count(1) settle_order_cnt
from settle_order_inc
where partition_date BETWEEN '$$begindate' and '$$enddate'
GROUP BY no
)t4 on t1.id = t4.no
LEFT JOIN
(
SELECT
id,
count(1) mwallet_cnt
from mwallet_info_ss
where partition_date = '$$yesterday'
and substr(mwallet_open_account_time,1,10) BETWEEN '$$begindate' and '$$enddate'
GROUP BY id
) t5 on t1.id = t5.id
LEFT JOIN
(
SELECT
id,
count(1) merchant_create_cnt
from pay_merchant_info_ss
where partition_date = '$$yesterday'
and substr(create_time,1,10) BETWEEN '$$begindate' and '$$enddate'
GROUP BY id
)t6 on t1.id = t6.id
LEFT JOIN
(
SELECT
merchant_id, count(1) settle_order_cnt
from usr_auth_inc
where partition_date BETWEEN '$$begindate' and '$$enddate'
GROUP BY merchant_id
)t7 on t1.id = t7.merchant_id
GROUP BY t1.id

考虑到收单交易的宽表太大，计算耗时明显大于其他业务部分的表，为了方便sql的查询，我提前针对这部分在app层中加工了一层对每一天的收单量进行汇总的表。
于是就有了错误的代码：

COALESCE(sum(t2.pay_order_cnt),0) as `收单单量`,
-------------
LEFT JOIN
trade_daily_inc t2
on t1.id = t2.no
and t2.partition_date BETWEEN '$$begindate' and '$$enddate'
--------------
GROUP BY t1.id

可以看到，经过加工后的t2表，因为无需再按天汇总，我便直接以每天的订单量为粒度，在最外层进行了sum，并在最后按照id做group by。
这样写如果仅仅针对收单单量这一任务域，无疑是正确的，那我们接着往下看：

LEFT JOIN
(SELECT
no,count(1) pay_cnt
from pay_inc
where partition_date BETWEEN '$$begindate' and '$$enddate'
GROUP BY no)t3 on t1.id = t3.no

LEFT JOIN
(
SELECT
no,count(1) settle_order_cnt
from settle_order_inc
where partition_date BETWEEN '$$begindate' and '$$enddate'
GROUP BY no
)t4 on t1.id = t4.no
......

对于剩下需要join的任务域，我以每个id的调用次数为粒度进行了join操作，单独看也是没问题的。
但是合在一起呢？粒度上的不统一就导致了膨胀：
如果某一id有如下数据

Id      t2          
1   2024-1-1       
1	2024-1-2
1   2024-1-3

Id        t3          
1	    50000

经过粒度不一致的join后，就会变成：

Id      t2        t3        
1	 2024-1-1    50000   
1    2024-1-2    50000
1    2024-1-3    50000

最后再sum汇总的话，对于id 1在t3上的作用域调用次数，就由50000变成了150000，这就是join粒度不统一导致的数据膨胀。
解决方法也很简单，在t2表上提前sum,统一粒度为每个id的总调用量，即可，这样也无需最外层的sum和group by了。
最终修改后的代码如下：

SELECT t1.id,
      COALESCE(t2.pay_order_cnt,0) as `收单单量`,
      COALESCE(t3.pay_cnt,0) as `付款单量`,
      COALESCE(t4.settle_order_cnt, 0) as `结算单量`,
      COALESCE(t5.mwallet_cnt,0) as `钱包单量`,
      COALESCE(t6.merchant_create_cnt, 0) as `商户进件单量`,
      COALESCE(t7.settle_order_cnt, 0) as `实名单量`
 from upload_table. no t1
 LEFT JOIN (
       select no,
              sum(pay_order_cnt) as pay_order_cnt
         from trade_merchant_daily_inc
        where partition_date BETWEEN '$$begindate' and '$$enddate'
        group by no
      ) t2
   on t1. id = t2. no
 LEFT JOIN (
       SELECT no,
              count(1) pay_cnt
         from trd_pay_inc
        where partition_date BETWEEN '$$begindate' and '$$enddate'
        GROUP BY no
      )t3
   on t1.id = t3.no
 LEFT JOIN (
       SELECT no,
              count(1) settle_order_cnt
         from settle_order_inc
        where partition_date BETWEEN '$$begindate' and '$$enddate'
        GROUP BY no
      )t4
   on t1.id = t4.no
 LEFT JOIN (
       SELECT id,
              count(1) mwallet_cnt
         from mwallet_info_ss
        where partition_date = '$$yesterday'
          and substr(mwallet_open_account_time,1,10) BETWEEN '$$begindate' and '$$enddate'
        GROUP BY id
      ) t5
   on t1. id = t5. id
 LEFT JOIN (
       SELECT id,
              count(1) merchant_create_cnt
         from merchant_info_ss
        where partition_date = '$$yesterday'
          and substr(create_time,1,10) BETWEEN '$$begindate' and '$$enddate'
        GROUP BY id
      )t6
   on t1.id = t6.id
 LEFT JOIN (
       SELECT merchant_id,
              count(1) settle_order_cnt
         from usr_auth_inc
        where partition_date BETWEEN '$$begindate' and '$$enddate'
        GROUP BY merchant_id
      )t7
   on t1. id = t7.merchant_id