优化join or情形

奔跑者-辉

已于 2023-03-07 09:14:50 修改

阅读量2.5k

点赞数 3

分类专栏： hive 文章标签： hive hadoop 数据仓库

于 2022-09-03 20:24:59 首次发布

本文链接：https://blog.csdn.net/index_test/article/details/126681005

版权

hive 专栏收录该内容

9 篇文章 1 订阅

订阅专栏

通常，在hive中对于模糊匹配关联方面的查询效率是非常低的，如or 关联，基于like的模糊匹配关联，对于此类问题往往需要找到好的优化方案。

对于join关联时涉及多个or连接，本次优化方案转化为union 或 union all的实现形式。

1、需求

有一天，旁边的做数据分析的同事，发我一个sql语句，说跑了15min多了，查询进度条一直没有进度，叫我帮忙优化一下，语句如下：

select 
  list.u_type as `黑名单拉黑维度`
  ,list.val as `拉黑的筛选值`
  ,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
  ,reg.order_no as `订单号`
  ,reg.hos_name as `医院名称`
  ,reg.first_dept_name as `一级科室`
  ,reg.second_dept_name as `二级科室`
  ,reg.doctor_name as `医生名称`
  ,split_part(split_part(reg.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
  ,reg.order_create_time as `挂号时间`
  ,reg.treatment_dt as `就诊时间`
  ,reg.order_status as `订单状态`
  ,reg.product_price as `订单金额`
  ,reg.patient_name as `就诊卡姓名`
  ,reg.patient_cred_no as `身份证号`
  ,reg.patient_phone as `预留手机号`
  ,reg.order_ip as `ip`
from
( 
select
   order_no,hos_name,first_dept_name,second_dept_name,doctor_name,pay_flow_info,order_create_time,treatment_dt,order_status
   ,product_price,patient_name,patient_cred_no,patient_phone,order_ip,patient_card_no,user_id,wx_openid
from dw.aggr_reg_entity 
where month>='2022-01' and order_status='TOKEN'
)reg --15837934条数据
left join
(select 
   u_type,val,op_log 
from dw.fact_black_list 
where id is not null
) list --402422条数据
on case when list.u_type='PHONE' then val else '非' end=reg.patient_phone or 
case when list.u_type='IP' then val else '非' end=reg.order_ip or 
case when list.u_type ='CARD_RISK' then val else '非' end=reg.patient_card_no or 
case when list.u_type ='UID' then val else '非' end=reg.user_id or
 
group by 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
;

2、问题分析

看到以上逻辑后，第一感觉就是在join关联时，严重影响了查询效率，并且其中涉及多个join on ... or ... or 关联的情形。

于是进行了尝试，验证上边的假设：

(1) 首先去掉or后边的所有条件，只保留on匹配，结果很快就跑完了；

(2) 当保留on 匹配，外加1个or匹配时，出结果的速度明显比上边慢下来了；

(3) 当保留on 匹配，外加2个or匹配时，sql查询根本跑不动，查询进度条就一直停滞不前进了。

于是验证了上边的猜想，是由于join on ... or ... or 模糊匹配关联时，or条件导致的查询速度太慢，接下来进行hive查询语句的优化。

3、进行优化

思路：需要避免上边join on ... or ... or 模糊匹配的情况，需要把其拆分开。并且把case...when尽量不要放在on后边。

拆分开后，使用union的方式，查询出数据用了 57秒完成。

with base_entity as( --15837934条数据
select 
   order_no,hos_name,first_dept_name,second_dept_name,doctor_name,pay_flow_info,order_create_time,treatment_dt,order_status
   ,product_price,patient_name,patient_cred_no,patient_phone,order_ip,patient_card_no,user_id,wx_openid
from
dw.aggr_reg_entity 
where month>='2022-01' and order_status='TOKEN'
)

,

base_list as          --402422条数据
(select 
  u_type,val,op_log 
    ,case when u_type='PHONE' then val else '非' end aa
    ,case when u_type='IP' then val else '非' end bb
    ,case when u_type ='CARD_RISK' then val else '非' end cc
    ,case when u_type ='UID' then val else '非' end dd
from dw.fact_black_list 
where id is not null   
)


select 
  list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity 
left join base_list list 
on list.aa=entity.patient_phone

union 

select 
  list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity 
left join base_list list 
on  list.bb=entity.order_ip 

union

select 
  list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity 
left join base_list list 
on list.cc=entity.patient_card_no

union

select 
  list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity 
left join base_list list 
on list.dd=entity.user_id

;

拆分开后，使用union all的方式，查询出数据只用了 11秒就完成了。

with base_entity as( --15837934条数据
select 
   order_no,hos_name,first_dept_name,second_dept_name,doctor_name,pay_flow_info,order_create_time,treatment_dt,order_status
   ,product_price,patient_name,patient_cred_no,patient_phone,order_ip,patient_card_no,user_id,wx_openid
from
dw.aggr_reg_entity 
where month>='2022-01' and order_status='TOKEN'
)

,

base_list as          --402422条数据
(select 
  u_type,val,op_log 
    ,case when u_type='PHONE' then val else '非' end aa
    ,case when u_type='IP' then val else '非' end bb
    ,case when u_type ='CARD_RISK' then val else '非' end cc
    ,case when u_type ='UID' then val else '非' end dd
from dw.fact_black_list 
where id is not null   
)


select 
  list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity 
left join base_list list 
on list.aa=entity.patient_phone

union all 

select 
  list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity 
left join base_list list 
on  list.bb=entity.order_ip 

union all

select 
  list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity 
left join base_list list 
on list.cc=entity.patient_card_no

union all

select 
  list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity 
left join base_list list 
on list.dd=entity.user_id

group by 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
;

但是上述的方式显示显得很啰嗦，如果后面需要匹配的or比较多，比如有n个的时候，那么同样的逻辑就要union all n-1次代码看起来相当繁琐，且性能较低。

针对以上问题，也可以采用一种优雅的实现方式：我们知道采用or连接的时候，无非就是base_entity表中的字段在base_list表中匹配到了就成功，对于这种需要匹配就成功的连接方式，我们自然想到hive中高效的实现方式locate()函数，对于该函数的理解，可以具体参考如下文章：
https://blog.csdn.net/godlovedaniel/article/details/125126193

hive中字符串查找函数 instr 和 locate_奔跑者-辉的博客-CSDN博客