优化join or情形

通常,在hive中对于模糊匹配关联方面的查询效率是非常低的,如or 关联,基于like的模糊匹配关联,对于此类问题往往需要找到好的优化方案。

对于join关联时涉及多个or连接,本次优化方案转化为union 或 union all的实现形式。

1、需求

有一天,旁边的做数据分析的同事,发我一个sql语句,说跑了15min多了,查询进度条一直没有进度,叫我帮忙优化一下,语句如下:

select 
  list.u_type as `黑名单拉黑维度`
  ,list.val as `拉黑的筛选值`
  ,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
  ,reg.order_no as `订单号`
  ,reg.hos_name as `医院名称`
  ,reg.first_dept_name as `一级科室`
  ,reg.second_dept_name as `二级科室`
  ,reg.doctor_name as `医生名称`
  ,split_part(split_part(reg.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
  ,reg.order_create_time as `挂号时间`
  ,reg.treatment_dt as `就诊时间`
  ,reg.order_status as `订单状态`
  ,reg.product_price as `订单金额`
  ,reg.patient_name as `就诊卡姓名`
  ,reg.patient_cred_no as `身份证号`
  ,reg.patient_phone as `预留手机号`
  ,reg.order_ip as `ip`
from
( 
select
   order_no,hos_name,first_dept_name,second_dept_name,doctor_name,pay_flow_info,order_create_time,treatment_dt,order_status
   ,product_price,patient_name,patient_cred_no,patient_phone,order_ip,patient_card_no,user_id,wx_openid
from dw.aggr_reg_entity 
where month>='2022-01' and order_status='TOKEN'
)reg --15837934条数据
left join
(select 
   u_type,val,op_log 
from dw.fact_black_list 
where id is not null
) list --402422条数据
on case when list.u_type='PHONE' then val else '非' end=reg.patient_phone or 
case when list.u_type='IP' then val else '非' end=reg.order_ip or 
case when list.u_type ='CARD_RISK' then val else '非' end=reg.patient_card_no or 
case when list.u_type ='UID' then val else '非' end=reg.user_id or
 
group by 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
;

2、问题分析

看到以上逻辑后,第一感觉就是在join关联时,严重影响了查询效率,并且其中涉及多个join on ... or ... or 关联的情形。

于是进行了尝试,验证上边的假设

(1) 首先去掉or后边的所有条件,只保留on匹配,结果很快就跑完了;

(2) 当保留on 匹配,外加1个or匹配时,出结果的速度明显比上边慢下来了;

(3) 当保留on 匹配,外加2个or匹配时,sql查询根本跑不动,查询进度条就一直停滞不前进了。

于是验证了上边的猜想,是由于join on ... or ... or 模糊匹配关联时,or条件导致的查询速度太慢,接下来进行hive查询语句的优化。

3、进行优化

思路:需要避免上边join on ... or ... or 模糊匹配的情况,需要把其拆分开。并且把case...when尽量不要放在on后边。

拆分开后,使用union的方式,查询出数据用了 57秒 完成。

with base_entity as( --15837934条数据
select 
   order_no,hos_name,first_dept_name,second_dept_name,doctor_name,pay_flow_info,order_create_time,treatment_dt,order_status
   ,product_price,patient_name,patient_cred_no,patient_phone,order_ip,patient_card_no,user_id,wx_openid
from
dw.aggr_reg_entity 
where month>='2022-01' and order_status='TOKEN'
)

,

base_list as          --402422条数据
(select 
  u_type,val,op_log 
    ,case when u_type='PHONE' then val else '非' end aa
    ,case when u_type='IP' then val else '非' end bb
    ,case when u_type ='CARD_RISK' then val else '非' end cc
    ,case when u_type ='UID' then val else '非' end dd
from dw.fact_black_list 
where id is not null   
)


select 
  list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity 
left join base_list list 
on list.aa=entity.patient_phone

union 

select 
  list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity 
left join base_list list 
on  list.bb=entity.order_ip 

union

select 
  list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity 
left join base_list list 
on list.cc=entity.patient_card_no

union

select 
  list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity 
left join base_list list 
on list.dd=entity.user_id

;

拆分开后,使用union all的方式,查询出数据只用了 11秒 就完成了。

with base_entity as( --15837934条数据
select 
   order_no,hos_name,first_dept_name,second_dept_name,doctor_name,pay_flow_info,order_create_time,treatment_dt,order_status
   ,product_price,patient_name,patient_cred_no,patient_phone,order_ip,patient_card_no,user_id,wx_openid
from
dw.aggr_reg_entity 
where month>='2022-01' and order_status='TOKEN'
)

,

base_list as          --402422条数据
(select 
  u_type,val,op_log 
    ,case when u_type='PHONE' then val else '非' end aa
    ,case when u_type='IP' then val else '非' end bb
    ,case when u_type ='CARD_RISK' then val else '非' end cc
    ,case when u_type ='UID' then val else '非' end dd
from dw.fact_black_list 
where id is not null   
)


select 
  list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity 
left join base_list list 
on list.aa=entity.patient_phone

union all 

select 
  list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity 
left join base_list list 
on  list.bb=entity.order_ip 

union all

select 
  list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity 
left join base_list list 
on list.cc=entity.patient_card_no

union all

select 
  list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity 
left join base_list list 
on list.dd=entity.user_id

group by 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
;

但是上述的方式显示显得很啰嗦,如果后面需要匹配的or比较多,比如有n个的时候,那么同样的逻辑就要union all  n-1次代码看起来相当繁琐,且性能较低。

针对以上问题,也可以采用一种优雅的实现方式:我们知道采用or连接的时候,无非就是base_entity表中的字段在base_list表中匹配到了就成功,对于这种需要匹配就成功的连接方式,我们自然想到hive中高效的实现方式locate()函数,对于该函数的理解,可以具体参考如下文章:
https://blog.csdn.net/godlovedaniel/article/details/125126193

hive中字符串查找函数 instr 和 locate_奔跑者-辉的博客-CSDN博客

4、小结

要避免hive中 join 基于or 形式模糊匹配关联,可以借助于union all的实现方式,或借助于locate()模糊匹配的方法代码简洁优雅,在hive中用途较广,读者也需要务必掌握。

  • 3
    点赞
  • 7
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值