Using HDInsight tools to optimize hive script

Hive job issue

Below is a hive script coming from TPC-DS and I hadsubmitted to Microsoft Azure HDInsight cluster for a long time (more than twohours), but it was still in running status. Why it was so slow?

use tpcds_text_2;
set hive.execution.engine=tez;
select avg(ss_quantity)
       ,avg(ss_ext_sales_price)
       ,avg(ss_ext_wholesale_cost)
       ,sum(ss_ext_wholesale_cost)
 from store_sales
     ,store
     ,customer_demographics
     ,household_demographics
     ,customer_address
     ,date_dim
 where store.s_store_sk = store_sales.ss_store_sk
 and  store_sales.ss_sold_date_sk = date_dim.d_date_sk and date_dim.d_year = 2001
 and((store_sales.ss_hdemo_sk=household_demographics.hd_demo_sk
  and customer_demographics.cd_demo_sk = store_sales.ss_cdemo_sk
  and customer_demographics.cd_marital_status = 'M'
  and customer_demographics.cd_education_status = '4 yr Degree'
  and store_sales.ss_sales_price between 100.00 and 150.00
  and household_demographics.hd_dep_count = 3   
     )or
     (store_sales.ss_hdemo_sk=household_demographics.hd_demo_sk
  and customer_demographics.cd_demo_sk = store_sales.ss_cdemo_sk
  and customer_demographics.cd_marital_status = 'D'
  and customer_demographics.cd_education_status = 'Primary'
  and store_sales.ss_sales_price between 50.00 and 100.00   
  and household_demographics.hd_dep_count = 1
     ) or 
     (store_sales.ss_hdemo_sk=household_demographics.hd_demo_sk
  and customer_demographics.cd_demo_sk = ss_cdemo_sk
  and customer_demographics.cd_marital_status = 'U'
  and customer_demographics.cd_education_status = 'Advanced Degree'
  and store_sales.ss_sales_price between 150.00 and 200.00 
  and household_demographics.hd_dep_count = 1  
     ))
 and((store_sales.ss_addr_sk = customer_address.ca_address_sk
  and customer_address.ca_country = 'United States'
  and customer_address.ca_state in ('KY', 'GA', 'NM')
  and store_sales.ss_net_profit between 100 and 200  
     ) or
     (store_sales.ss_addr_sk = customer_address.ca_address_sk
  and customer_address.ca_country = 'United States'
  and customer_address.ca_state in ('MT', 'OR', 'IN')
  and store_sales.ss_net_profit between 150 and 300  
     ) or
     (store_sales.ss_addr_sk = customer_address.ca_address_sk
  and customer_address.ca_country = 'United States'
  and customer_address.ca_state in ('WI', 'MO', 'WV')
  and store_sales.ss_net_profit between 50 and 250  
     ));

Using HDInsight Tool to find Issue

In order to find the root cause for slow job, it isnecessary to know what the job was doing (blocked at somewhere). One option isto read job log, but it is not so intuitionistic. I prefer to using a morevisual tool. Microsoft HDInsight tools is the one I choose.

Using it, we can get below job execution graph

 

Through the graph, we can see all the map vertices hadfinished and the hive job is blocked at the Reducer5. Next, let us see what theexecution plan for Reducer 5. Double-click Reduce 5 node and below graphs show up.

It is clear what execution plan for reducer5 is.  Merge JoinàmapJoinà mapJoinà  map JoinàfilteràselectàGroupby. It is easy tounderstand what “four joins” do. Map7 and map4 do the merger join, then do mapjoin with map3, map8, map1 (these three maps only several kb data, so do themap side join). 

Putting mouse on three Mapside Join operator, I found somethingstrange, the mapside join for Map3 and Map1 are cross join because there nojoin keys for both of them. It is quite different with map side join for Map8.  

It can explain why the job are blocked here. Even you runthis job for two days, it will still stay here!

 

 

Optimize hive script

Though finding the reason for running slowly, it is alsoworth investigating why it generate such bad execution plan.  

Apache hive document https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins said

 

Version 0.13.0+: Implicit join notation
Implicit join notation is supported starting with Hive 0.13.0 (see HIVE-5558). This allows the FROM clause to join a comma-separated list of tables, omitting the JOIN keyword. For example:
SELECT * 
FROM table1 t1, table2 t2, table3 t3 
WHERE t1.id = t2.id AND t2.id = t3.id AND t1.zipcode = '02535'; 

Go back to our hive script, it has the similar code withabove and should be an implicit join. But why some mapside joins are convertedto cross joins?

Map8 is the only mapsidejoin converted to inner join as we expected, its corresponding table is “data_dim”and look up it in script andits where condition is “store_sales.ss_sold_date_sk = date_dim.d_date_sk anddate_dim.d_year = 2001”.Compared with other two mapside join conditions,it is much simple. So I think perhaps it is an optimization issue for hive and itcannot deal with some complex situations for implicit join. I also wonder the latest hive version can solve this issue.

Let us change this script by handand we get below script. the only difference is to extract some common “where condition” and their logic aretotally same, then rerun it.

 

use tpcds_text_2;
set hive.execution.engine=tez;
select avg(ss_quantity)
       ,avg(ss_ext_sales_price)
       ,avg(ss_ext_wholesale_cost)
       ,sum(ss_ext_wholesale_cost)
from store_sales
     ,store
     ,customer_demographics
     ,household_demographics
     ,customer_address
     ,date_dim
where store_sales.ss_store_sk=store.s_store_sk
and  store_sales.ss_sold_date_sk = date_dim.d_date_sk
and  store_sales.ss_hdemo_sk=household_demographics.hd_demo_sk
and  store_sales.ss_cdemo_sk=customer_demographics.cd_demo_sk
and  store_sales.ss_addr_sk = customer_address.ca_address_sk
and date_dim.d_year = 2001
and((
  customer_demographics.cd_marital_status = 'M'
  and customer_demographics.cd_education_status = '4 yr Degree'
  and store_sales.ss_sales_price between 100.00 and 150.00
  and household_demographics.hd_dep_count = 3   
     )or
     (customer_demographics.cd_marital_status = 'D'
  and customer_demographics.cd_education_status = 'Primary'
  and store_sales.ss_sales_price between 50.00 and 100.00   
  and household_demographics.hd_dep_count = 1
     ) or 
     (customer_demographics.cd_marital_status = 'U'
  and customer_demographics.cd_education_status = 'Advanced Degree'
  and store_sales.ss_sales_price between 150.00 and 200.00 
  and household_demographics.hd_dep_count = 1  
     ))

and((customer_address.ca_country = 'United States'
  and customer_address.ca_state in ('KY', 'GA', 'NM')
  and store_sales.ss_net_profit between 100 and 200  
     ) or
     (
  customer_address.ca_country = 'United States'
  and customer_address.ca_state in ('MT', 'OR', 'IN')
  and store_sales.ss_net_profit between 150 and 300  
     ) or
     (customer_address.ca_country = 'United States'
  and customer_address.ca_state in ('WI', 'MO', 'WV')
  and store_sales.ss_net_profit between 50 and 250  
     ));

 

Below is job execution graph, it took1 minute and 46 seconds to complete.

 

 

 

Compared this graph with above, bothhive job execution plan and Reducer5 execution plan are same, the difference isall the mapside joins at reduce5 are inner join.

Go further, we continue to changethis script to using Join on operation explicitly. After moving some conditionsin “where” to “join on”,we get below script.

use tpcds_text_2;
set hive.execution.engine=tez;
select avg(ss_quantity)
       ,avg(ss_ext_sales_price)
       ,avg(ss_ext_wholesale_cost)
       ,sum(ss_ext_wholesale_cost)
from store_sales JOIN store On store.s_store_sk = store_sales.ss_store_sk  
                  Join  date_dim on store_sales.ss_sold_date_sk = date_dim.d_date_sk
                  Join  household_demographics on store_sales.ss_hdemo_sk=household_demographics.hd_demo_sk
                  Join customer_demographics on customer_demographics.cd_demo_sk = store_sales.ss_cdemo_sk
                  Join customer_address ON store_sales.ss_addr_sk = customer_address.ca_address_sk
where date_dim.d_year = 2001
and((customer_demographics.cd_marital_status = 'M'
  and customer_demographics.cd_education_status = '4 yr Degree'
  and store_sales.ss_sales_price between 100.00 and 150.00
  and household_demographics.hd_dep_count = 3   
     )or
     (customer_demographics.cd_marital_status = 'D'
  and customer_demographics.cd_education_status = 'Primary'
  and store_sales.ss_sales_price between 50.00 and 100.00   
  and household_demographics.hd_dep_count = 1
     ) or 
     (customer_demographics.cd_marital_status = 'U'
  and customer_demographics.cd_education_status = 'Advanced Degree'
  and store_sales.ss_sales_price between 150.00 and 200.00 
  and household_demographics.hd_dep_count = 1  
     ))
and((customer_address.ca_country = 'United States'
  and customer_address.ca_state in ('KY', 'GA', 'NM')
  and store_sales.ss_net_profit between 100 and 200  
     ) or
    (customer_address.ca_country = 'United States'
  and customer_address.ca_state in ('MT', 'OR', 'IN')
 and store_sales.ss_net_profit between 150 and 300  
     ) or
     (customer_address.ca_country = 'United States'
  and customer_address.ca_state in ('WI', 'MO', 'WV')
  and store_sales.ss_net_profit between 50 and 250  
     ));

Below it its job execution graph and its exaction plan haschanged. It took 1 minute 26 seconds to complete and 20 seconds faster thanabove script.


 

Conclusion

 HDInsight tool is an excellent tool to do hivejob analysis

 Using explicit join on if the script is complex


  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值