where top1000(X)<=10等条件谓词提前问题记录

最新推荐文章于 2021-12-01 10:27:24 发布

tomson8975

最新推荐文章于 2021-12-01 10:27:24 发布

阅读量649

点赞数

分类专栏： hive

本文链接：https://blog.csdn.net/tomson8975/article/details/49465635

版权

hive 专栏收录该内容

11 篇文章 0 订阅

订阅专栏

今天，遇到这样一个问题，在下面sql中：

 
          add jar /opt/hadoop/yuanfeng/bdiudf.jar; 
         
          create temporary function top1000 as  
          'com.bfd.id.Id' 
          ; 
         
          select a.state_date,a.customer,a.taskid,a.step_id,a.exit_title,a.pv,top1000(a.only_id) 
         
          from 
         
          (  select t1.state_date,t1.customer,t1.taskid,t1.step_id,t1.exit_title,t1.pv,t1.only_id 
         
          from  
         
          ( select t11.state_date, 
         
          t11.customer, 
         
          t11.taskid, 
         
          t11.step_id, 
         
          t11.exit_title, 
         
          t11.pv, 
         
          concat(t11.customer,t11.taskid,t11.step_id) as only_id 
         
          from 
         
          (  select state_date,customer,taskid,step_id,exit_title,count(*) as pv 
         
          from bdi_fact2.mid_url_step 
         
          where exit_url!= 
          '-1' 
         
          and exit_title != 
          '-1' 
         
          and l_date= 
          '2015-08-31' 
         
          group by state_date,customer,taskid,step_id,exit_title 
         
          )t11 
         
          )t1 
         
          order by t1.only_id,t1.pv desc 
         
          )a 
         
          where  a.customer= 
          'Cdianyingwang' 
         
          and a.taskid= 
          '33' 
         
          and a.step_id= 
          '0'  
         
          and top1000(a.only_id)<= 
          10 
          ;

在0.9和0.14的hive中查询结果不一样，究其根本，是0.9hive没有谓词提前功能（ppd），而0.14有这个功能。

问题就出现在这里，请看执行计划：

 
          STAGE DEPENDENCIES: 
         
          Stage- 
          1  
          is a root stage 
         
          Stage- 
          2  
          depends on stages: Stage- 
          1 
         
          Stage- 
          0  
          is a root stage

可以看到，一共两个运算阶段，stage-1,stage-2

stage-1：

 
          //即子查询a.t1表的内容  
         
          ( select t11.state_date, 
         
          t11.customer, 
         
          t11.taskid, 
         
          t11.step_id, 
         
          t11.exit_title, 
         
          t11.pv, 
         
          concat(t11.customer,t11.taskid,t11.step_id) as only_id 
         
          from 
         
          (  select state_date,customer,taskid,step_id,exit_title,count(*) as pv 
         
          from bdi_fact2.mid_url_step 
         
          where exit_url!= 
          '-1' 
         
          and exit_title != 
          '-1' 
         
          and l_date= 
          '2015-08-31' 
         
          group by state_date,customer,taskid,step_id,exit_title 
         
          )t11 
         
          )t1

来看看这个阶段做了什么：

看到红框里的内容了吗？编译器在这里做了一个愚蠢的事情（其实也不能怪编译器），在错误的时间错误的地点做了错误的事。

——在这个时候提前把数据过滤了，为什么说是提前呢，是因为原语句的意图是在所有记录根据only_id和pv全排序后再取top1000，而不是在t1表

这个中间结果这里，而就是这样的动作导致了后续的一些错误：

看下这个图：

这个是stage-1的执行日志，可以看到reduce有两个。因为在这个阶段过早的top1000了（top1000(only_id)<=10），每个reduce取了10条记录。

好，stage-1先说到这里，我们来看看stage-2:

 
          select a.state_date,a.customer,a.taskid,a.step_id,a.exit_title,a.pv,top1000(a.only_id) 
         
          from 
         
          (  select t1.state_date,t1.customer,t1.taskid,t1.step_id,t1.exit_title,t1.pv,t1.only_id 
         
          from  
         
          (  
         
          ... 
         
          )t1 
         
          order by t1.only_id,t1.pv desc 
         
          )a 
         
          where  a.customer= 
          'Cdianyingwang' 
         
          and a.taskid= 
          '33' 
         
          and a.step_id= 
          '0'  
         
          and top1000(a.only_id)<= 
          10 
          ;

其实，stage-2就只有一个reduce阶段：

可以看到，本来在这个阶段应该做top1000(only_id)<=10操作的，但是在谓词条件这一块是空的！因为这个条件已经被提前了！

好的，已经比较清晰了，为什么top1000(only_id)<=10明明限定了10条，结果会出20条——因为stage-1有2个reduce.

为什么会这样？

因为对于编译器来说，top1000(only_id)<=10只是一个谓词条件，跟a=1,b="abc"是没有差异的，编译器会毫不犹豫的把这个条件下推到语句的最内层（t1）。

所以，这不是bug，编译器无法感知谓词条件或者说是udf函数是什么意图，只会简单的把他下推到底。

所以该怎么做？

可以:set hive.optimize.ppd=false;

tomson8975

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
where top1000(X)<=10等条件谓词提前问题记录

今天，遇到这样一个问题，在下面sql中：add jar /opt/hadoop/yuanfeng/bdiudf.jar; create temporary function top1000 as 'com.bfd.id.Id'; select a.state_date,a.customer,
复制链接

扫一扫

专栏目录