Notes on Troubleshooting Data Issues

盛源_01

Last modified 2024-07-04 22:33:26

Tags: big data, spark, distributed systems, experience sharing

First published 2023-10-10 09:58:16

Article link: https://blog.csdn.net/weixin_40829577/article/details/133739021

I. Partition Pruning Failures

1. Subqueries

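When a subquery is used to restrict partitions, the IN operation is rewritten as a join. The table scan then cannot apply the partition filter, and the rows for the required partitions are only filtered out at the join stage. Partition restrictions should therefore use constants where possible.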

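-- Restricting the partition with a subquery may defeat partition pruning
select
    *
from
    tbl_a
where
    dt = '20231005'
    and app_name in ( select distinct app_name from fm_event_cfg )
;

-- Restricting the partition with an externally supplied parameter or a constant keeps partition pruning intact
select
    *
from
    tbl_a
where
    dt = '20231005'
    -- works for any partition column type; note that instr is a substring match,
    -- so no app name in ${app_names} may be contained in another app name
    and instr('${app_names}', app_name) > 0
    -- and app_name in ('${app_names}')   -- when the partition column is numeric
;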
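
One way to check whether a partition filter actually reaches the scan is to inspect the query plan. The sketch below assumes Spark SQL (the article is tagged spark) and reuses the tbl_a example, with a hypothetical literal list of app names standing in for the parameter. For file-based tables, Spark's physical plan lists pruned partition predicates under `PartitionFilters` on the scan node, though the exact layout varies with Spark version and table format.

```sql
-- Sketch, assuming Spark SQL 3.x. 'app_a' and 'app_b' are hypothetical values
-- standing in for the expanded ${app_names} parameter.
explain formatted
select
    *
from
    tbl_a
where
    dt = '20231005'
    and app_name in ('app_a', 'app_b')
;
-- In the output, look at the scan node for tbl_a: predicates listed under
-- PartitionFilters are applied while listing partitions, before data is read.
-- Running the same explain on the subquery variant shows whether app_name
-- still appears there or is only enforced by a join.
```

Comparing the two plans side by side is usually enough to tell whether a given rewrite changed what the scan reads.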

2. OR conditions

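An OR condition can override the partition condition, because AND binds more tightly than OR. Wrap the OR branches in parentheses.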

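-- (dt is the date partition column; name is an ordinary table column)
-- Written this way, the OR defeats the dt partition filter and scans the whole table:
-- the condition parses as (dt = 20231027 and name = 'shy') or name = 'ssjt'
select
    id
    ,name
from
    tbl_name
where
    dt = 20231027
    and name = 'shy' or name = 'ssjt'
;
-- To keep the partition filter and avoid a full table scan, wrap the OR condition in parentheses
select
    id
    ,name
from
    tbl_name
where
    dt = 20231027
    and ( name = 'shy' or name = 'ssjt' )
;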
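
Another way to sidestep the precedence pitfall is to avoid OR altogether when both branches test the same column. A minimal sketch, reusing tbl_name and the values from the example above: an IN list is a single predicate, so it cannot detach from the dt condition.

```sql
-- Equivalent to: dt = 20231027 and ( name = 'shy' or name = 'ssjt' )
select
    id
    ,name
from
    tbl_name
where
    dt = 20231027
    and name in ('shy', 'ssjt')
;
```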


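-- (dt is the first-level partition; type is the second-level partition)
-- In the query below the second-level partition filter is effectively lost:
-- the second branch only excludes type = '22', so every type partition except '22' is scanned
select
    *
from
    tbl_name
where
    dt = '20231106'
    and (
        (
            type in ('11') and event = 'aa'
        )
        or
        (
            type not in ('22') and event = 'bb'
        )
    )
;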
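
Considering only the partition column, the condition above reduces to `type in ('11') or type not in ('22')`, which is simply `type not in ('22')`, so the wide scan is what the query literally asks for. If the type partitions that can contain event = 'bb' are known in advance, listing them explicitly narrows the scan. The sketch below assumes, purely for illustration, that those partitions are '33' and '44'. These values are hypothetical and do not come from the original table.

```sql
-- Hypothetical: assumes event = 'bb' only occurs in type partitions '33' and '44'.
-- The partition-only part of the filter becomes type in ('11', '33', '44'),
-- so only those three type partitions under dt = '20231106' need to be read.
select
    *
from
    tbl_name
where
    dt = '20231106'
    and (
        ( type in ('11') and event = 'aa' )
        or
        ( type in ('33', '44') and event = 'bb' )
    )
;
```

This rewrite returns the same rows as the original only if the assumption about where event = 'bb' occurs actually holds, so it is worth verifying against the data first.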