Hive on Tez, and Tez/MapReduce Engine Parameter Tuning

Hive On Tez

Hive supports several execution engines, including MapReduce (the default), Tez, and Spark (as used by Spark SQL). Switching Hive over to Tez is therefore very simple: just set the following in hive-site.xml:

<property>
    <name>hive.execution.engine</name>
    <value>tez</value>
</property>
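
The engine can also be checked or switched per session without editing hive-site.xml; here is a quick sketch using the standard set command (set with no value prints the current setting):

set hive.execution.engine;
set hive.execution.engine=tez;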

With hive.execution.engine set to tez, open the Hive CLI and run some SQL:

hive> select count(*) as c from userinfo;
Query ID = zhenqin_20161104150743_4155afab-4bfa-4e8a-acb0-90c8c50ecfb5
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id application_1478229439699_0007)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      2          2        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 6.19 s     
--------------------------------------------------------------------------------
OK
1000000
Time taken: 6.611 seconds, Fetched: 1 row(s)

As you can see, my userinfo table has 1,000,000 rows, and a full count finishes in 6.19 s. Now switch the engine back to mr:

set hive.execution.engine=mr;

Run the count on userinfo again:

hive> select count(*) as c from userinfo;
Query ID = zhenqin_20161104152022_c7e6c5bd-d456-4ec7-b895-c81a369aab27
Total jobs = 1
Launching Job 1 out of 1
Starting Job = job_1478229439699_0010, Tracking URL = http://localhost:8088/proxy/application_1478229439699_0010/
Kill Command = /Users/zhenqin/software/hadoop/bin/hadoop job  -kill job_1478229439699_0010
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-11-04 15:20:28,323 Stage-1 map = 0%,  reduce = 0%
2016-11-04 15:20:34,587 Stage-1 map = 100%,  reduce = 0%
2016-11-04 15:20:40,796 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_1478229439699_0010
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   HDFS Read: 215 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1000000
Time taken: 19.46 seconds, Fetched: 1 row(s)
hive> 

As you can see, Tez is roughly 3x faster than MapReduce here. In addition, when Hive runs on the Tez engine it shows a dynamic ==>> progress bar, whereas with mr you only get log lines with map and reduce completion percentages. The output with Tez is also much cleaner.

For the many complex SQL statements I have tested, Tez is much faster than MapReduce; how much faster depends on the complexity of the SQL. A simple select does not really show off Tez. Internally, Tez can translate SQL into arbitrary combinations such as Map -> Reduce -> Reduce, while MR is limited to chains of Map -> Reduce -> Map -> Reduce, so the advantage of Tez is most obvious on complex SQL.
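
To make the difference concrete, here is a minimal sketch of a query whose plan differs between the two engines; the city and age columns are assumed for illustration and are not part of the userinfo example above:

-- GROUP BY followed by ORDER BY: on the mr engine this typically compiles into
-- two chained MapReduce jobs with the intermediate result written to HDFS,
-- while on Tez it becomes a single DAG such as Map 1 -> Reducer 2 -> Reducer 3.
select city, avg(age) as avg_age
from userinfo
group by city
order by avg_age desc
limit 10;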

 

Tez and MapReduce Engine Parameter Tuning

Tuning parameters (under the same conditions, switching to Tez brought one job from 300+ s down to 200+ s):

set hive.execution.engine=tez;
set mapred.job.name=recommend_user_profile_$idate;
set mapred.reduce.tasks=-1;
set hive.exec.reducers.max=160;
set hive.auto.convert.join=true;
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=16; 
set hive.optimize.skewjoin=true;
set hive.exec.reducers.bytes.per.reducer=100000000;
set mapred.max.split.size=200000000;
set mapred.min.split.size.per.node=100000000;
set mapred.min.split.size.per.rack=100000000;
set hive.hadoop.supports.splittable.combineinputformat=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

Tez Memory Tuning

1. AM and Container Size

  • tez.am.resource.memory.mb

Description: Set tez.am.resource.memory.mb to be the same as yarn.scheduler.minimum-allocation-mb, the YARN minimum container size.

  • hive.tez.container.size

Description: Set hive.tez.container.size to be the same as, or a small multiple (1 or 2 times) of, the YARN container size yarn.scheduler.minimum-allocation-mb, but NEVER more than yarn.scheduler.maximum-allocation-mb.
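
A sketch with illustrative numbers, assuming a cluster where yarn.scheduler.minimum-allocation-mb is 4096 and yarn.scheduler.maximum-allocation-mb is 16384 (size these from your own YARN scheduler settings):

-- AM matches the YARN minimum container size; Hive's Tez containers use 2x
-- the minimum, which is still below the 16384 MB YARN maximum.
set tez.am.resource.memory.mb=4096;
set hive.tez.container.size=8192;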

2. AM and Container JVM Options

  • tez.am.launch.cmd-opts 

Default: 80% * tez.am.resource.memory.mb

Description: usually does not need to be adjusted.

  • hive.tez.java.opts

Default: 80% * hive.tez.container.size

Description: Hortonworks recommends "-server -Djava.net.preferIPv4Stack=true -XX:NewRatio=8 -XX:+UseNUMA -XX:+UseG1GC"

  • tez.container.max.java.heap.fraction

Default: 0.8

Description: the fraction of the container size that tasks/the AM use as the JVM -Xmx. This parameter is worth tuning; adjust it to the specific workload.
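
A minimal sketch for an assumed 4096 MB container, with -Xmx at roughly 80% of the container (mirroring the default heap fraction of 0.8); all numbers are illustrative:

-- Assumed 4096 MB Tez container; the JVM heap is capped at about 80% of it.
set hive.tez.container.size=4096;
set hive.tez.java.opts=-server -Djava.net.preferIPv4Stack=true -Xmx3276m -XX:+UseNUMA -XX:+UseG1GC;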

3. Hive Memory / Map Join Settings

  • tez.runtime.io.sort.mb

Default: 100

Description: memory used for sorting output. Recommended: 40% * hive.tez.container.size, generally no more than 2 GB.

  • hive.auto.convert.join.noconditionaltask

Default: true

Description: whether to merge multiple map joins into one; keep the default.

  • hive.auto.convert.join.noconditionaltask.size

Default:

Description: when multiple map joins are merged into one, the maximum allowed total file size of all the small tables. This only limits the size of the input table files; it does not reflect the actual size of the hash table built during the map join. Recommended: 1/3 * hive.tez.container.size.

  • tez.runtime.unordered.output.buffer.size-mb

Default: 100

Description: size of the buffer to use if not writing directly to disk. Recommended: 10% * hive.tez.container.size.
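
Pulling this section together, a sketch for an assumed hive.tez.container.size of 4096 MB (all derived numbers are illustrative):

-- Sort buffer at ~40% of the container (value in MB).
set tez.runtime.io.sort.mb=1638;
-- Merge map joins; cap the small-table total at ~1/3 of the container (value in bytes).
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=1431655765;
-- Unordered output buffer at ~10% of the container (value in MB).
set tez.runtime.unordered.output.buffer.size-mb=410;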

4. Container Reuse

  • tez.am.container.reuse.enabled

Default: true

Description: toggles Tez container reuse.
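
It is already on by default; a one-line sketch if you want to pin it explicitly:

set tez.am.container.reuse.enabled=true;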

Mapper/Reducer Tuning

1. Number of Mappers

  • tez.grouping.min-size

Default: 50*1024*1024

Description: lower bound on the size (in bytes) of a grouped split, to avoid generating too many small splits.

  • tez.grouping.max-size

Default: 1024*1024*1024

Description: upper bound on the size (in bytes) of a grouped split, to avoid generating excessively large splits.
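
A sketch of setting the two bounds together; the 64 MB lower bound is an illustrative choice, and the upper bound keeps the documented 1 GB default (both values in bytes):

set tez.grouping.min-size=67108864;
set tez.grouping.max-size=1073741824;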

2. Number of Reducers

  • hive.tez.auto.reducer.parallelism

Default: false

Description: turns on Tez's auto reducer parallelism feature. When enabled, Hive will still estimate data sizes and set parallelism estimates; Tez will sample source vertices' output sizes and adjust the estimates at runtime as necessary.

Recommended: set this to true.

  • hive.tez.min.partition.factor

Default: 0.25

Description: when auto reducer parallelism is enabled, this factor is used to put a lower limit on the number of reducers that Tez specifies.

  • hive.tez.max.partition.factor

Default: 2.0

Description: when auto reducer parallelism is enabled, this factor is used to over-partition data on shuffle edges.

  • hive.exec.reducers.bytes.per.reducer

Default: 256,000,000

Description: size per reducer. The default in Hive 0.14.0 and earlier is 1 GB; that is, if the input size is 10 GB then 10 reducers will be used. In Hive 0.14.0 and later the default is 256 MB; that is, if the input size is 1 GB then 4 reducers will be used.

 

The number of reducers is determined by the following formula:

Max(1, Min(hive.exec.reducers.max [1009], ReducerStage estimate / hive.exec.reducers.bytes.per.reducer)) x hive.tez.max.partition.factor [2]
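
A worked sketch with assumed numbers: if the reducer-stage input estimate is 10 GB (10,000,000,000 bytes), hive.exec.reducers.bytes.per.reducer is 256,000,000, hive.exec.reducers.max is 1009, and hive.tez.max.partition.factor is 2, then 10,000,000,000 / 256,000,000 ≈ 39, Min(1009, 39) = 39, Max(1, 39) = 39, and 39 x 2 = 78, so Tez starts with roughly 78 reducer tasks as an upper bound and lets auto reducer parallelism trim the number at runtime.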

3. Shuffle Settings

  • tez.shuffle-vertex-manager.min-src-fraction

Default: 0.25

Description: the fraction of source tasks which should complete before tasks for the current vertex are scheduled.

  • tez.shuffle-vertex-manager.max-src-fraction

Default: 0.75

Description: once this fraction of source tasks have completed, all tasks on the current vertex can be scheduled. The number of tasks ready for scheduling on the current vertex scales linearly between min-fraction and max-fraction.

 

Example:

hive.exec.reducers.bytes.per.reducer=1073741824; // 1 GB

tez.shuffle-vertex-manager.min-src-fraction=0.25;

tez.shuffle-vertex-manager.max-src-fraction=0.75;

This indicates that the decision will be made between 25% of mappers finishing and 75% of mappers finishing, provided there is at least 1 GB of data being output (i.e. if 25% of mappers don't send 1 GB of data, we will wait until at least 1 GB is sent out).
