Hive0.14在left outer join多级连接中,执行计划生成BUG记录

前言:

        这几天遇到一个很诡异的问题,一个三级left outer join的句子,在hive0.9和0.14上的执行结果会不一样。

        而且在0.14上通过转换右表连接的顺序可以达到正确输出的目的,但是其中是为什么却不得而知,情况非常

        诡异,猜测是编译器编译执行计划有问题!(所谓转换右表连接顺序是指把A left outer join B left outer join C

        改成A left outer join C left outer join B,出问题的在B子句中有个聚合出的结果在最终结果中表现不对。)


下面详细介绍下问题:

                      原语句:

select  A.state_date,
            A.customer,
            A.channel_2,
            A.id,
            A.pid,
            A.type,
            A.pv,
            A.uv,
            A.visits,
            if (C.stay_visits is  null , 0 ,C.stay_visits) as stay_visits,
            A.stay_time,
            if (B.bounce is  null , 0 ,B.bounce) as bounce
  from
      (select a.state_date,
             a.customer,
             b.url as channel_2,
             b.id,
             b.pid,
             b.type,
             count( 1 ) as pv,
             count(distinct a.gid) uv,
             count(distinct a.session_id) as visits,
             sum(a.stay_time) as stay_time
        from      
                ( select state_date,
                            customer,
                            gid,
                            session_id,
                            ep,
                            stay_time
                     from bdi_fact.mid_pageview_dt0
                     where l_date = '$v_date'
                   )a
                   join
                   (select l_date as state_date ,
                           url,
                           id,
                           pid,
                           type,
                           cid
                    from bdi_fact.frequency_channel
                    where l_date = '$v_date'
                    and type = '2'
                    and dr= '0'
                   )b
                    on  a.customer=b.cid 
                    where a.ep  rlike b.url
                    group by a.state_date, a.customer, b.url,b.id,b.pid,b.type
        )A
       
     left outer join
        (   select
                    c.state_date ,
                    c.customer ,
                    d.url as channel_2,
                    d.id,
                    sum(pagedepth) as bounce
             from
                   ( select
                               t1.state_date ,
                               t1.customer ,
                               t1.session_id,
                               t1.ep,
                               t2.pagedepth
                     from          
                          ( select
                                      state_date ,
                                      customer ,
                                      session_id,
                                      exit_url as ep
                           from ods.mid_session_enter_exit_dt0
                           where l_date = '$v_date'
                           )t1
                          join
                           ( select
                                     state_date ,
                                     customer ,
                                     session_id,
                                     pagedepth
                             from ods.mid_session_action_dt0
                             where l_date = '$v_date'
                             and  pagedepth= '1'
                           )t2
                          on t1.customer=t2.customer
                          and t1.session_id=t2.session_id
                    )c
                    join
                    (select *
                    from bdi_fact.frequency_channel
                    where l_date = '$v_date'
                    and type = '2'
                    and dr= '0'
                    )d
                    on c.customer=d.cid
                    where c.ep  rlike d.url
                    group by  c.state_date,c.customer,d.url,d.id
              )B
              on
          A.customer=B.customer
              and A.channel_2=B.channel_2
              and A.id=B.id
       left outer join
      (
              select e.state_date,
             e.customer,
             f.url as channel_2,
             f.id,
             f.pid,
             f.type,
             count(distinct e.session_id) as stay_visits
        from      
                ( select state_date,
                            customer,
                            gid,
                            session_id,
                            ep,
                            stay_time
                     from bdi_fact.mid_pageview_dt0
                     where l_date = '$v_date'
                   )e
                   join
                   (select l_date as state_date,
                           url,
                           id,
                           pid,
                           type,
                           cid
                    from bdi_fact.frequency_channel
                    where l_date = '$v_date'
                    and type = '2'
                    and dr= '0'
                   )f
                    on  e.customer=f.cid 
                    where e.ep  rlike f.url
                    and e.stay_time is not  null
                    and e.stay_time <> '0'
                    group by e.state_date, e.customer, f.url,f.id,f.pid,f.type
            )C
     on
         A.customer=C.customer
         and   A.channel_2=C.channel_2
         and   A.id=C.id
         and   A.pid=C.pid
         and   A.type=C.type
  where A.customer= 'Cdianyingwang'  and A.channel_2= 'http://www.1905.com/film/filmnews/jk/'  and A.id= '127' ;"

                  在B子句中算出的结果B.bounce在最终结果中会显示错误(如正确结果是500,但是却显示是100)

                  但是,通过调整连接顺序后就很正常了。

 

                  打印执行计划出来看:

                                  

 

                可以很明显的看见同一个阶段有一个字段没有输出出来,这个阶段就是B子句的任务。而这个字段就是B.bounce。

                这应该就能解释为什么最后结果是不对的

                好了,既然知道问题在哪儿了,就来看源码吧!

                通过查找ExprNodeColumnDesc.java(负责解析并生成输出字段)类,有以下发现:

                

                (图中箭头所指是我修改后的代码:

                                               原代码是:if (tabAlias != null && dest.tabAlias != null && !tabAlias.equals(dest.tabAlias)){

                                                                                                   return false;
                                                                })

                解析器在判断是否输出字段时,会判断当前字段是否跟最终表的字段相同(在查询中有很多中间表,如多级连接中):

                                (以下所说的表都是指表别名

                                 如果中间表和最终表都不为空且当前表不是最终表的话,返回false。即当前字段跟最终字段不同,需要输出。

                                 但是有个漏洞,就是当最终表为null的时候,这个时候如果是原代码的话,会返回true,字段会被直接丢弃

                                 因为返回true的话编译器判断最终结果中已有这个字段不需要输出。

                                 但是想想,如果最终表的别名为null,当前表的字段应该输出才对啊!不然数据的去处没了,源头也会没了。

                                 仔细想想这应该也是一个写代码时粗心犯下的一个错误吧!(在left outer join多级连接中有可能目标表的别名会为空)

 

               通过修改以上代码并编译后在集群上测试,执行计划输出正常了,数据结果也正常了:

                      图为patch前的结果(最后一列为B.bounce):

                      

                      图为patch后的结果:

                      

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值