前段时间犯了一个因为sql中join连接的各部分粒度不一致导致数据膨胀的问题,
下面是我最初的错误代码,这段sql主要是为了查询时间区间内用户id对各个任务域的调用次数情况:
SELECT
t1.id,
COALESCE(sum(t2.pay_order_cnt),0) as `收单单量`,
COALESCE(sum(t3.pay_cnt),0) as `付款单量`,
COALESCE(sum(t4.settle_order_cnt), 0) as `结算单量`,
COALESCE(sum(t5.mwallet_cnt),0) as `钱包单量`,
COALESCE(sum(t6.merchant_create_cnt), 0) as `商户`,
COALESCE(sum(t7.settle_order_cnt), 0) as `实名`
from
upload_table.no t1
LEFT JOIN
trade_daily_inc t2
on t1.id = t2.no
and t2.partition_date BETWEEN '$$begindate' and '$$enddate'
LEFT JOIN
(SELECT
no,count(1) pay_cnt
from pay_inc
where partition_date BETWEEN '$$begindate' and '$$enddate'
GROUP BY no)t3 on t1.id = t3.no
LEFT JOIN
(
SELECT
no,count(1) settle_order_cnt
from settle_order_inc
where partition_date BETWEEN '$$begindate' and '$$enddate'
GROUP BY no
)t4 on t1.id = t4.no
LEFT JOIN
(
SELECT
id,
count(1) mwallet_cnt
from mwallet_info_ss
where partition_date = '$$yesterday'
and substr(mwallet_open_account_time,1,10) BETWEEN '$$begindate' and '$$enddate'
GROUP BY id
) t5 on t1.id = t5.id
LEFT JOIN
(
SELECT
id,
count(1) merchant_create_cnt
from pay_merchant_info_ss
where partition_date = '$$yesterday'
and substr(create_time,1,10) BETWEEN '$$begindate' and '$$enddate'
GROUP BY id
)t6 on t1.id = t6.id
LEFT JOIN
(
SELECT
merchant_id, count(1) settle_order_cnt
from usr_auth_inc
where partition_date BETWEEN '$$begindate' and '$$enddate'
GROUP BY merchant_id
)t7 on t1.id = t7.merchant_id
GROUP BY t1.id
考虑到收单交易的宽表太大,计算耗时明显大于其他业务部分的表,为了方便sql的查询,我提前针对这部分在app层中加工了一层对每一天的收单量进行汇总的表。
于是就有了错误的代码:
COALESCE(sum(t2.pay_order_cnt),0) as `收单单量`,
-------------
LEFT JOIN
trade_daily_inc t2
on t1.id = t2.no
and t2.partition_date BETWEEN '$$begindate' and '$$enddate'
--------------
GROUP BY t1.id
可以看到,经过加工后的t2表,因为无需再按天汇总,我便直接以每天的订单量为粒度,在最外层进行了sum,并在最后按照id做group by。
这样写如果仅仅针对收单单量这一任务域,无疑是正确的,那我们接着往下看:
LEFT JOIN
(SELECT
no,count(1) pay_cnt
from pay_inc
where partition_date BETWEEN '$$begindate' and '$$enddate'
GROUP BY no)t3 on t1.id = t3.no
LEFT JOIN
(
SELECT
no,count(1) settle_order_cnt
from settle_order_inc
where partition_date BETWEEN '$$begindate' and '$$enddate'
GROUP BY no
)t4 on t1.id = t4.no
......
对于剩下需要join的任务域,我以每个id的调用次数为粒度进行了join操作,单独看也是没问题的。
但是合在一起呢?粒度上的不统一就导致了膨胀:
如果某一id有如下数据
Id t2
1 2024-1-1
1 2024-1-2
1 2024-1-3
Id t3
1 50000
经过粒度不一致的join后,就会变成:
Id t2 t3
1 2024-1-1 50000
1 2024-1-2 50000
1 2024-1-3 50000
最后再sum汇总的话,对于id 1在t3上的作用域调用次数,就由50000变成了150000,这就是join粒度不统一导致的数据膨胀。
解决方法也很简单,在t2表上提前sum,统一粒度为每个id的总调用量,即可,这样也无需最外层的sum和group by了。
最终修改后的代码如下:
SELECT t1.id,
COALESCE(t2.pay_order_cnt,0) as `收单单量`,
COALESCE(t3.pay_cnt,0) as `付款单量`,
COALESCE(t4.settle_order_cnt, 0) as `结算单量`,
COALESCE(t5.mwallet_cnt,0) as `钱包单量`,
COALESCE(t6.merchant_create_cnt, 0) as `商户进件单量`,
COALESCE(t7.settle_order_cnt, 0) as `实名单量`
from upload_table. no t1
LEFT JOIN (
select no,
sum(pay_order_cnt) as pay_order_cnt
from trade_merchant_daily_inc
where partition_date BETWEEN '$$begindate' and '$$enddate'
group by no
) t2
on t1. id = t2. no
LEFT JOIN (
SELECT no,
count(1) pay_cnt
from trd_pay_inc
where partition_date BETWEEN '$$begindate' and '$$enddate'
GROUP BY no
)t3
on t1.id = t3.no
LEFT JOIN (
SELECT no,
count(1) settle_order_cnt
from settle_order_inc
where partition_date BETWEEN '$$begindate' and '$$enddate'
GROUP BY no
)t4
on t1.id = t4.no
LEFT JOIN (
SELECT id,
count(1) mwallet_cnt
from mwallet_info_ss
where partition_date = '$$yesterday'
and substr(mwallet_open_account_time,1,10) BETWEEN '$$begindate' and '$$enddate'
GROUP BY id
) t5
on t1. id = t5. id
LEFT JOIN (
SELECT id,
count(1) merchant_create_cnt
from merchant_info_ss
where partition_date = '$$yesterday'
and substr(create_time,1,10) BETWEEN '$$begindate' and '$$enddate'
GROUP BY id
)t6
on t1.id = t6.id
LEFT JOIN (
SELECT merchant_id,
count(1) settle_order_cnt
from usr_auth_inc
where partition_date BETWEEN '$$begindate' and '$$enddate'
GROUP BY merchant_id
)t7
on t1. id = t7.merchant_id
本次分享就到这里,字段已做特殊脱敏处理,欢迎指正!