hive> create table lpx_mapjoin as
> SELECT '2012-04-17' as stat_date
> ,b.admin_member_id
> ,a.category_level2_id
> ,b.keywords
> ,sum(shownum) as sum_prod_show_cnt
> ,sum(clicknum) as sum_prod_click_cnt
> ,sysdate('yyyy-MM-dd hh:mm:ss')as dw_ins_date
> FROM (SELECT category_id
> ,category_level2_id
> FROM pub_en_category_dimt0_h
> WHERE dw_begin_date <= '2012-04-17'
> AND hp_dw_end_date > '2012-04-17'
> AND category_type = '02'
> ) a
> JOIN (SELECT admin_member_id
> ,lower(trim(regexp_replace(keyword,'[\\\\t|\\\\s]+', ' '))) as keywords
> ,coalesce(shownum,cast(0 as bigint)) as shownum
> ,coalesce(clicknum,cast(0 as bigint)) as clicknum
> ,post_category_id
> FROM idl_en_ctr_click_sdt0
> WHERE hp_stat_date = '2012-04-17'
> AND keyword IS NOT NULL
> ) b
> ON (a.category_id = b.post_category_id)
> GROUP BY b.admin_member_id
> ,a.category_level2_id
> ,b.keywords ;
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified.Estimated from input data size: 3
In order to change the average loadfor a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum numberof reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number ofreducers:
set mapred.reduce.tasks=<number>
Cannot run job locally: InputSize (= 2511131447) is larger than hive.exec.mode.local.auto.inputbytes.max (= -1)
Warning:Can't find tianshu info:markid missing!
Starting Job = job_201205151343_1062053,Tracking URL = http://hdpjt:50030/jobdetails.jsp?jobid=job_201205151343_1062053
Kill Command = /dhwdata/hadoop/bin/../bin/hadoopjob -Dmapred.job.tracker=hdpjt:9001 -killjob_201205151343_1062053
Hadoop job information for Stage-1: number of mappers: 111; number of reducers: 3
2012-05-25 13:24:05,753 Stage-1 map= 26%, reduce = 0%
2012-05-25 13:24:23,957 Stage-1 map= 77%, reduce = 0%
2012-05-25 13:24:41,265 Stage-1 map= 92%, reduce = 0%
2012-05-25 13:25:00,561 Stage-1 map= 100%, reduce = 14%
2012-05-25 13:25:11,320 Stage-1 map= 100%, reduce = 23%
2012-05-25 13:25:29,255 Stage-1 map= 100%, reduce = 32%
2012-05-25 13:25:48,604 Stage-1 map= 100%, reduce = 57%
2012-05-25 13:26:07,861 Stage-1 map= 100%, reduce = 70%
2012-05-25 13:26:31,523 Stage-1 map= 100%, reduce = 73%
2012-05-25 13:26:54,908 Stage-1 map= 100%, reduce = 77%
20

本文介绍了在Hive中使用MapJoin优化查询性能的过程。通过创建`lpx_mapjoin`表并执行MapJoin操作,减少了Reduce任务,显著提升了处理速度。在Map阶段即完成两个表的JOIN,避免了大量数据的shuffle过程,从而提高了大数据处理的效率。
最低0.47元/天 解锁文章
678

被折叠的 条评论
为什么被折叠?



