Hive中mapjoin优化例子

1 基本信息

3个表,1个事实表,2个维度表
事实表 test_fact (mid string,sex_id string,age_id string )
维度表dim_user_demography_age (age_id  string,age_name string )
维度表dim_user_demography_sex (sex_id string,sex_name  string

测试的sql

select mid,sex_name,age_name from test_fact   f
join dim_user_demography_age d1 on d1.age_id=f.age_id
join dim_user_demography_sex d2 on d2.sex_id=f.sex_id ;

d1 ,d2是维度表,仅有几行数据,数据量<10k

2 测试用到的参数及说明

测试用到的参数

set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;

2.1 hive.auto.convert.join

启用该参数后,hive会基于表的size自动的将普通join转换成mapjoin

2.2 hive.auto.convert.join.noconditionaltask

启用该参数后,hive会基于表的size自动的将普通join转换成mapjoin.对于一个n路连接来说,如果其中n-1个表或者分区的数据大小小于特定值,将会把一个普通join转换成mapjoin
测试用的sql,可以看做是一个3路连接

select mid,sex_name,age_name from test_fact   f
join dim_user_demography_age d1 on d1.age_id=f.age_id
join dim_user_demography_sex d2 on d2.sex_id=f.sex_id ;

2.3 hive.auto.convert.join.noconditionaltask.size

默认为10M,以上述sql举例,就是要求d1 d2 2个表的数据大小<10M。只有 hive.auto.convert.join.noconditionaltask=true的时候才生效

3 测试说明

下面分情况讨论下上上面参数对任务的影响。

3.1 启用hive.auto.convert.join.noconditionaltask

如果该参数启用,该sql只有有1个job。
日志如下,维度表d1,d2会被本地被转成成1个hash表,上传到hdfs。在mr任务中,用于做mapjoin。任务速度最快。

Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/xitong/software/hadoop-2.7.2U7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/xitong/software/spark-1.6.0-U22-bin-2.7.2U6/lib/spark-assembly-1.6.0-U22-hadoop2.7.2U6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2018-04-24 16:17:09     Starting to launch local task to process map join;      maximum memory = 514523136
2018-04-24 16:17:10     Dump the side-table for tag: 1 with group count: 2 into file: file:/tmp/hdp-xxx/7cc4814c-60ef-4a3d-8b2b-402bb9cc317f/hive_2018-04-24_16-17-03_272_6992344947047046417-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile01--.hashtable
2018-04-24 16:17:10     Uploaded 1 File to: file:/tmp/hdp-xxx/7cc4814c-60ef-4a3d-8b2b-402bb9cc317f/hive_2018-04-24_16-17-03_272_6992344947047046417-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile01--.hashtable (308 bytes)
2018-04-24 16:17:10     Dump the side-table for tag: 1 with group count: 5 into file: file:/tmp/hdp-xxx/7cc4814c-60ef-4a3d-8b2b-402bb9cc317f/hive_2018-04-24_16-17-03_272_6992344947047046417-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile11--.hashtable
2018-04-24 16:17:10     Uploaded 1 File to: file:/tmp/hdp-xxx/7cc4814c-60ef-4a3d-8b2b-402bb9cc317f/hive_2018-04-24_16-17-03_272_6992344947047046417-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile11--.hashtable (387 bytes)

3.2 不启用hive.auto.convert.join.noconditionaltask

日志如下,会有两个mapjoin的过程,性能差。
1)将d1本地生成map表,上传hdfs
2)f与d1做一次mapjoin,保留结果
3)将d2本地生成map表,上传hdfs
4)与d2的map表做一次mapjoin
5)查询结果

hive> select mid,sex_name,age_name from test_fact   f
    > join dim_user_demography_age d1 on d1.age_id=f.age_id
    > join dim_user_demography_sex d2 on d2.sex_id=f.sex_id ;
Query ID = hdp-ads-audit_20180424161919_0663afcb-af1f-4b0f-b17b-23117c257969
Total jobs = 5
Stage-13 is selected by condition resolver.
Stage-1 is filtered out by condition resolver.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/xitong/software/hadoop-2.7.2U7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/xitong/software/spark-1.6.0-U22-bin-2.7.2U6/lib/spark-assembly-1.6.0-U22-hadoop2.7.2U6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2018-04-24 16:19:25     Starting to launch local task to process map join;      maximum memory = 514523136
2018-04-24 16:19:26     Dump the side-table for tag: 1 with group count: 5 into file: file:/tmp/dddddd/1363bd2b-f5bf-44f8-a3cb-c9427c70cbab/hive_2018-04-24_16-19-19_514_7098521400798527233-1/-local-10008/HashTable-Stage-8/MapJoin-mapfile21--.hashtable
2018-04-24 16:19:26     Uploaded 1 File to: file:/tmp/dddddd/1363bd2b-f5bf-44f8-a3cb-c9427c70cbab/hive_2018-04-24_16-19-19_514_7098521400798527233-1/-local-10008/HashTable-Stage-8/MapJoin-mapfile21--.hashtable (387 bytes)
2018-04-24 16:19:26     End of local task; Time Taken: 0.984 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 2 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1524125616287_138024, Tracking URL = http://mhdp12.namenodetest:8888/proxy/application_1524125616287_138024/
Kill Command = /usr/bin/hadoop/software/yarn//bin/hadoop job  -kill job_1524125616287_138024
Hadoop job information for Stage-8: number of mappers: 1; number of reducers: 0
2018-04-24 16:19:40,865 Stage-8 map = 0%,  reduce = 0%
2018-04-24 16:19:59,570 Stage-8 map = 100%,  reduce = 0%, Cumulative CPU 15.58 sec
MapReduce Total cumulative CPU time: 15 seconds 580 msec
Ended Job = job_1524125616287_138024
Moved to trash: /home/dddddd/hive/scratchdir/dddddd/1363bd2b-f5bf-44f8-a3cb-c9427c70cbab/hive_2018-04-24_16-19-19_514_7098521400798527233-1/-mr-10013/bd115834-c22a-4d5e-b960-d09ddaf5a361/map.xml
Moved to trash: /home/dddddd/hive/scratchdir/dddddd/1363bd2b-f5bf-44f8-a3cb-c9427c70cbab/hive_2018-04-24_16-19-19_514_7098521400798527233-1/_task_tmp.-mr-10003
Stage-11 is selected by condition resolver.
Stage-12 is filtered out by condition resolver.
Stage-2 is filtered out by condition resolver.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/xitong/software/hadoop-2.7.2U7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/xitong/software/spark-1.6.0-U22-bin-2.7.2U6/lib/spark-assembly-1.6.0-U22-hadoop2.7.2U6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2018-04-24 16:20:05     Starting to launch local task to process map join;      maximum memory = 514523136
2018-04-24 16:20:06     Dump the side-table for tag: 1 with group count: 2 into file: file:/tmp/dddddd/1363bd2b-f5bf-44f8-a3cb-c9427c70cbab/hive_2018-04-24_16-19-19_514_7098521400798527233-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile01--.hashtable
2018-04-24 16:20:06     Uploaded 1 File to: file:/tmp/dddddd/1363bd2b-f5bf-44f8-a3cb-c9427c70cbab/hive_2018-04-24_16-19-19_514_7098521400798527233-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile01--.hashtable (308 bytes)
2018-04-24 16:20:06     End of local task; Time Taken: 0.95 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 4 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1524125616287_138033, Tracking URL = http://mhdp12.namenodetest:8888/proxy/application_1524125616287_138033/
Kill Command = /usr/bin/hadoop/software/yarn//bin/hadoop job  -kill job_1524125616287_138033
Interrupting... Be patient, this might take some time.
Press Ctrl+C again to kill JVM
已标记关键词 清除标记
©️2020 CSDN 皮肤主题: Age of Ai 设计师:meimeiellie 返回首页