Hive: Parameter Optimization and Map/Reduce Task Count Optimization

Contents

1 Hive Parameter Optimization

1.1 hive.fetch.task.conversion

1.2 hive.exec.mode.local.auto

1.3 hive.mapred.mode

1.4 hive.mapred.reduce.tasks.speculative.execution

1.5 hive.optimize.cp

1.6 hive.optimize.ppd

2 Optimizing the Number of Map and Reduce Tasks in the MapReduce Stage

2.1 Optimizing the Number of Map Tasks

2.2 Optimizing the Number of Reduce Tasks


  • Hive currently supports three execution engines: MapReduce, Spark, and Tez
  • This article assumes MapReduce as the execution engine

 

1 Hive Parameter Optimization

Reference: Hive official documentation, Configuration Properties

1.1 hive.fetch.task.conversion

Default Value: minimal in Hive 0.10.0 through 0.13.1, more in Hive 0.14.0 and later
Added In: Hive 0.10.0 with HIVE-2925; default changed in Hive 0.14.0 with HIVE-7397
Some select queries can be converted to a single FETCH task, minimizing latency. Currently the query should be single sourced not having any subquery and should not have any aggregations or distincts (which incur RS – ReduceSinkOperator, requiring a MapReduce task), lateral views and joins.

Supported values are none, minimal and more.

0. none:  Disable hive.fetch.task.conversion (value added in Hive 0.14.0 with HIVE-8389)
1. minimal:  SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only
2. more:  SELECT, FILTER, LIMIT only (including TABLESAMPLE, virtual columns)

"more" can take any kind of expressions in the SELECT clause, including UDFs.
(UDTFs and lateral views are not yet supported – see HIVE-5718.)
  • Recommendation: use more mode to speed up SQL execution
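A minimal session sketch of switching the mode (the same property can also be set in hive-site.xml; the table here is the demo table used in the sections below):

set hive.fetch.task.conversion=more;
-- with "more", simple SELECT/FILTER/LIMIT queries are answered by a FETCH task, no MapReduce:
select emp_no, emp_name from bigdata.emp where dept_no = '20' limit 3;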

1.1.1 none mode

  • none: disables fetch-task conversion; no matter what the SQL looks like, it goes through MapReduce
hive> set hive.fetch.task.conversion;
hive.fetch.task.conversion=none

hive> select * from bigdata.emp;
Query ID = work_20201216094245_d44ea4d3-0a5b-4302-93dd-4ef9a5252517
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1608016084001_0020, Tracking URL = http://bigdatatest02:8088/proxy/application_1608016084001_0020/
Kill Command = /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop job  -kill job_1608016084001_0020
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-16 09:43:02,342 Stage-1 map = 0%,  reduce = 0%
2020-12-16 09:43:10,641 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.28 sec
MapReduce Total cumulative CPU time: 2 seconds 280 msec
Ended Job = job_1608016084001_0020
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 2.28 sec   HDFS Read: 4413 HDFS Write: 451 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 280 msec
OK
7369	SMITH	20
7499	ALLEN	30
7521	WARD	30
7566	JONES	20
7654	MARTIN	30
7698	BLAKE	30
7782	CLARK	10
7788	SCOTT	20
7839	KING	10
7844	TURNER	30
7876	ADAMS	20
7900	JAMES	30
7902	FORD	20
7934	MILLER	10
Time taken: 27.002 seconds, Fetched: 14 row(s)

1.1.2 minimal mode

  • minimal: a plain full-table scan does not trigger MapReduce, but a FILTER does trigger MapReduce
  • Exception: a FILTER on the partition column of a partitioned table does not trigger MapReduce
  • Regular (non-partitioned) table:
hive> set hive.fetch.task.conversion;
hive.fetch.task.conversion=minimal

hive> select * from bigdata.emp where dept_no = '20';
Query ID = work_20201216094750_3df492b8-bbd8-4e41-b378-bba5fe1b3dc7
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1608016084001_0022, Tracking URL = http://bigdatatest02:8088/proxy/application_1608016084001_0022/
Kill Command = /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop job  -kill job_1608016084001_0022
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-16 09:48:07,799 Stage-1 map = 0%,  reduce = 0%
2020-12-16 09:48:17,119 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.33 sec
MapReduce Total cumulative CPU time: 4 seconds 330 msec
Ended Job = job_1608016084001_0022
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 4.33 sec   HDFS Read: 4952 HDFS Write: 216 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 330 msec
OK
7369	SMITH	20
7566	JONES	20
7788	SCOTT	20
7876	ADAMS	20
7902	FORD	20
Time taken: 27.728 seconds, Fetched: 5 row(s)
hive> select * from bigdata.emp;
OK
7369	SMITH	20
7499	ALLEN	30
7521	WARD	30
7566	JONES	20
7654	MARTIN	30
7698	BLAKE	30
7782	CLARK	10
7788	SCOTT	20
7839	KING	10
7844	TURNER	30
7876	ADAMS	20
7900	JAMES	30
7902	FORD	20
7934	MILLER	10
Time taken: 0.144 seconds, Fetched: 14 row(s)
  • Partitioned table
  • Create the partitioned table and load data:
CREATE TABLE IF NOT EXISTS bigdata.emp_partition(
emp_no String,
emp_name String
)
PARTITIONED BY (dept_no String)
ROW FORMAT 
DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
-- enable dynamic partitioning
set hive.exec.dynamic.partition=true; 
-- This property defaults to strict (restricted mode): strict forbids making all partition
-- columns dynamic; at least one partition column must be given a static value, and it must
-- come first. With nonstrict, all partition columns may be dynamic.
set hive.exec.dynamic.partition.mode=nonstrict;
hive> load data local inpath '/home/work/data/hive/emp.txt' overwrite into table bigdata.emp_partition;
hive> select * from bigdata.emp_partition;
OK
7782	CLARK	10
7839	KING	10
7934	MILLER	10
7369	SMITH	20
7566	JONES	20
7788	SCOTT	20
7876	ADAMS	20
7902	FORD	20
7499	ALLEN	30
7521	WARD	30
7654	MARTIN	30
7698	BLAKE	30
7844	TURNER	30
7900	JAMES	30
Time taken: 0.204 seconds, Fetched: 14 row(s)
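Note that the two set commands above matter for INSERT ... SELECT rather than for LOAD DATA (older Hive versions require a static PARTITION clause on LOAD). A sketch of the dynamic-partition insert path, reusing the non-partitioned bigdata.emp table as the source:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- dept_no is resolved per row and decides which partition each row lands in
insert overwrite table bigdata.emp_partition partition (dept_no)
select emp_no, emp_name, dept_no from bigdata.emp;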
  • Test whether a FILTER on the partition column of the partitioned table goes through MapReduce:
hive> set hive.fetch.task.conversion;
hive.fetch.task.conversion=minimal
hive> select * from bigdata.emp_partition where dept_no = '20';
OK
7369	SMITH	20
7566	JONES	20
7788	SCOTT	20
7876	ADAMS	20
7902	FORD	20
Time taken: 0.172 seconds, Fetched: 5 row(s)

1.1.3 more mode

  • In more mode, a FILTER does not go through MapReduce, regardless of whether the table is partitioned
hive> set hive.fetch.task.conversion=more;
hive> select * from bigdata.emp where dept_no = '20';
OK
7369	SMITH	20
7566	JONES	20
7788	SCOTT	20
7876	ADAMS	20
7902	FORD	20
Time taken: 0.153 seconds, Fetched: 5 row(s)
hive> select * from bigdata.emp_partition where dept_no = '20';
OK
7369	SMITH	20
7566	JONES	20
7788	SCOTT	20
7876	ADAMS	20
7902	FORD	20
Time taken: 0.224 seconds, Fetched: 5 row(s)

1.2 hive.exec.mode.local.auto

  • Defaults to false, i.e., local mode is disabled
  • Recommendation: keep local mode disabled in production; enable it for development and testing
  • When the data volume is small, or for local testing, there is no need to submit the job to YARN and run full MapReduce; local mode answers the query faster and speeds up development
  • After enabling local mode, the following parameters also need attention (a combined session sketch follows this list):
  • hive.exec.mode.local.auto.inputbytes.max: the maximum input size local mode will handle; defaults to 128M
  • hive.exec.mode.local.auto.tasks.max: the maximum number of tasks in local mode; defaults to 4
  • hive.exec.mode.local.auto.input.files.max: the maximum number of input files in local mode; defaults to 4
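A combined session sketch using the defaults quoted above (adjust the thresholds to your data sizes):

set hive.exec.mode.local.auto=true;
-- run locally only if the total input is below this many bytes (128M)
set hive.exec.mode.local.auto.inputbytes.max=134217728;
-- ...and the job needs no more than this many tasks / input files
set hive.exec.mode.local.auto.tasks.max=4;
set hive.exec.mode.local.auto.input.files.max=4;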
  • The following uses count(1) to compare query speed with local mode off and on
  • Local mode disabled:
hive> set hive.exec.mode.local.auto;
hive.exec.mode.local.auto=false

hive> select count(1) from bigdata.emp;
Query ID = work_20201216102827_ff3113d0-5c91-4a4f-a330-f9ce782d0e62
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1608016084001_0024, Tracking URL = http://bigdatatest02:8088/proxy/application_1608016084001_0024/
Kill Command = /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop job  -kill job_1608016084001_0024
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-12-16 10:28:43,829 Stage-1 map = 0%,  reduce = 0%
2020-12-16 10:28:54,149 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.28 sec
2020-12-16 10:29:00,337 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.24 sec
MapReduce Total cumulative CPU time: 6 seconds 240 msec
Ended Job = job_1608016084001_0024
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.24 sec   HDFS Read: 8334 HDFS Write: 102 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 240 msec
OK
14
Time taken: 34.122 seconds, Fetched: 1 row(s)
  • Local mode enabled:
hive> set hive.exec.mode.local.auto=true;
hive> set hive.exec.mode.local.auto;
hive.exec.mode.local.auto=true

hive> select count(1) from bigdata.emp;
Automatically selecting local only mode for query
Query ID = work_20201216103030_6c88b989-8348-4521-aa41-c23dee70931e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
20/12/16 10:30:33 INFO mapred.LocalDistributedCacheManager: Creating symlink: /tmp/hadoop-work/mapred/local/1608085830754/3.0.0-cdh6.2.0-mr-framework.tar.gz <- /home/work/mr-framework
20/12/16 10:30:33 INFO mapred.LocalDistributedCacheManager: Localized hdfs://nameservice1/user/yarn/mapreduce/mr-framework/3.0.0-cdh6.2.0-mr-framework.tar.gz as file:/tmp/hadoop-work/mapred/local/1608085830754/3.0.0-cdh6.2.0-mr-framework.tar.gz
20/12/16 10:30:33 INFO mapred.LocalDistributedCacheManager: Creating symlink: /tmp/hadoop-work/mapred/local/1608085830755/libjars <- /home/work/libjars/*
20/12/16 10:30:33 WARN mapred.LocalDistributedCacheManager: Failed to create symlink: /tmp/hadoop-work/mapred/local/1608085830755/libjars <- /home/work/libjars/*
20/12/16 10:30:33 INFO mapred.LocalDistributedCacheManager: Localized file:/tmp/hadoop/mapred/staging/work558758749/.staging/job_local558758749_0001/libjars as file:/tmp/hadoop-work/mapred/local/1608085830755/libjars
Job running in-process (local Hadoop)
20/12/16 10:30:33 INFO mapred.LocalJobRunner: OutputCommitter set in config org.apache.hadoop.hive.ql.io.HiveFileFormatUtils$NullOutputCommitter
20/12/16 10:30:33 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.hive.ql.io.HiveFileFormatUtils$NullOutputCommitter
20/12/16 10:30:33 INFO mapred.LocalJobRunner: Waiting for map tasks
20/12/16 10:30:33 INFO mapred.LocalJobRunner: Starting task: attempt_local558758749_0001_m_000000_0
20/12/16 10:30:33 INFO mapred.LocalJobRunner: 
20/12/16 10:30:33 INFO mapred.LocalJobRunner: hdfs://nameservice1/user/hive/warehouse/bigdata.db/emp/emp.txt:0+195
20/12/16 10:30:33 INFO mapred.LocalJobRunner: Finishing task: attempt_local558758749_0001_m_000000_0
20/12/16 10:30:33 INFO mapred.LocalJobRunner: map task executor complete.
20/12/16 10:30:33 INFO mapred.LocalJobRunner: Waiting for reduce tasks
20/12/16 10:30:33 INFO mapred.LocalJobRunner: Starting task: attempt_local558758749_0001_r_000000_0
20/12/16 10:30:33 INFO mapred.LocalJobRunner: 1 / 1 copied.
20/12/16 10:30:33 INFO mapred.LocalJobRunner: 1 / 1 copied.
20/12/16 10:30:33 INFO mapred.LocalJobRunner: reduce > reduce
20/12/16 10:30:33 INFO mapred.LocalJobRunner: Finishing task: attempt_local558758749_0001_r_000000_0
20/12/16 10:30:33 INFO mapred.LocalJobRunner: reduce task executor complete.
2020-12-16 10:30:34,084 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local558758749_0001
MapReduce Jobs Launched: 
Stage-Stage-1:  HDFS Read: 464136294 HDFS Write: 848647374 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
14
Time taken: 3.85 seconds, Fetched: 1 row(s)

1.3 hive.mapred.mode

Default Value: 
Hive 0.x: nonstrict
Hive 1.x: nonstrict
Hive 2.x: strict (HIVE-12413)
Added In: Hive 0.3.0
The mode in which the Hive operations are being performed. In strict mode, some risky queries are not allowed to run. For example, full table scans are prevented (see HIVE-10454) and ORDER BY requires a LIMIT clause.
  • In nonstrict mode, no extra restrictions are imposed on SQL
  • In strict mode, ORDER BY must be followed by LIMIT, a FILTER on a partitioned table must include the partition column, and Cartesian-product joins are rejected
  • Strict mode is recommended for the vast majority of scenarios, as it effectively protects the data platform; nonstrict mode can be enabled for a few special cases
  • First, test in nonstrict mode:
hive> set hive.mapred.mode;
hive.mapred.mode=nonstrict
-- ORDER BY on a regular table
hive> select * from bigdata.emp order by emp_no;
Automatically selecting local only mode for query
Query ID = work_20201216104456_5f7b9b48-11d8-4268-9101-0e93e77d9e28
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
20/12/16 10:44:58 INFO mapred.LocalDistributedCacheManager: Creating symlink: /tmp/hadoop-work/mapred/local/1608086696446/3.0.0-cdh6.2.0-mr-framework.tar.gz <- /home/work/mr-framework
20/12/16 10:44:58 INFO mapred.LocalDistributedCacheManager: Localized hdfs://nameservice1/user/yarn/mapreduce/mr-framework/3.0.0-cdh6.2.0-mr-framework.tar.gz as file:/tmp/hadoop-work/mapred/local/1608086696446/3.0.0-cdh6.2.0-mr-framework.tar.gz
20/12/16 10:44:58 INFO mapred.LocalDistributedCacheManager: Creating symlink: /tmp/hadoop-work/mapred/local/1608086696447/libjars <- /home/work/libjars/*
20/12/16 10:44:58 WARN mapred.LocalDistributedCacheManager: Failed to create symlink: /tmp/hadoop-work/mapred/local/1608086696447/libjars <- /home/work/libjars/*
20/12/16 10:44:58 INFO mapred.LocalDistributedCacheManager: Localized file:/tmp/hadoop/mapred/staging/work1994370039/.staging/job_local1994370039_0002/libjars as file:/tmp/hadoop-work/mapred/local/1608086696447/libjars
Job running in-process (local Hadoop)
20/12/16 10:44:58 INFO mapred.LocalJobRunner: OutputCommitter set in config org.apache.hadoop.hive.ql.io.HiveFileFormatUtils$NullOutputCommitter
20/12/16 10:44:58 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.hive.ql.io.HiveFileFormatUtils$NullOutputCommitter
20/12/16 10:44:58 INFO mapred.LocalJobRunner: Waiting for map tasks
20/12/16 10:44:58 INFO mapred.LocalJobRunner: Starting task: attempt_local1994370039_0002_m_000000_0
20/12/16 10:44:58 INFO mapred.LocalJobRunner: 
20/12/16 10:44:58 INFO mapred.LocalJobRunner: hdfs://nameservice1/user/hive/warehouse/bigdata.db/emp/emp.txt:0+195
20/12/16 10:44:58 INFO mapred.LocalJobRunner: Finishing task: attempt_local1994370039_0002_m_000000_0
20/12/16 10:44:58 INFO mapred.LocalJobRunner: map task executor complete.
20/12/16 10:44:58 INFO mapred.LocalJobRunner: Waiting for reduce tasks
20/12/16 10:44:58 INFO mapred.LocalJobRunner: Starting task: attempt_local1994370039_0002_r_000000_0
20/12/16 10:44:58 INFO mapred.LocalJobRunner: 1 / 1 copied.
20/12/16 10:44:58 INFO mapred.LocalJobRunner: 1 / 1 copied.
20/12/16 10:44:58 INFO mapred.LocalJobRunner: reduce > reduce
20/12/16 10:44:58 INFO mapred.LocalJobRunner: Finishing task: attempt_local1994370039_0002_r_000000_0
20/12/16 10:44:58 INFO mapred.LocalJobRunner: reduce task executor complete.
2020-12-16 10:44:59,685 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local1994370039_0002
MapReduce Jobs Launched: 
Stage-Stage-1:  HDFS Read: 928267738 HDFS Write: 848647927 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
7369	SMITH	20
7499	ALLEN	30
7521	WARD	30
7566	JONES	20
7654	MARTIN	30
7698	BLAKE	30
7782	CLARK	10
7788	SCOTT	20
7839	KING	10
7844	TURNER	30
7876	ADAMS	20
7900	JAMES	30
7902	FORD	20
7934	MILLER	10
Time taken: 3.617 seconds, Fetched: 14 row(s)
-- filter a partitioned table without using the partition column
hive> select * from bigdata.emp_partition where emp_no='7782';
OK
7782	CLARK	10
Time taken: 0.154 seconds, Fetched: 1 row(s)
  • Now test in strict mode:
hive> set hive.mapred.mode;
hive.mapred.mode=strict
-- ORDER BY on a regular table
hive> select * from bigdata.emp order by emp_no;
FAILED: SemanticException 1:35 Order by-s without limit are disabled for safety reasons. If you know what you are doing, please set hive.strict.checks.orderby.no.limit to false and make sure that hive.mapred.mode is not set to 'strict' to proceed. Note that you may get errors or incorrect results if you make a mistake while using some of the unsafe features.. Error encountered near token 'emp_no'
-- FILTER on a partitioned table without the partition column
hive> select * from bigdata.emp_partition where emp_no='20';
FAILED: SemanticException [Error 10056]: Queries against partitioned tables without a partition filter are disabled for safety reasons. If you know what you are doing, please set hive.strict.checks.no.partition.filter to false and make sure that hive.mapred.mode is not set to 'strict' to proceed. Note that you may get errors or incorrect results if you make a mistake while using some of the unsafe features. No partition predicate for Alias "emp_partition" Table "emp_partition"
-- Cartesian product test
hive> select * from bigdata.emp a join bigdata.emp b;
FAILED: SemanticException Cartesian products are disabled for safety reasons. If you know what you are doing, please set hive.strict.checks.cartesian.product to false and make sure that hive.mapred.mode is not set to 'strict' to proceed. Note that you may get errors or incorrect results if you make a mistake while using some of the unsafe features.
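The same three queries pass under strict mode once they are rewritten to satisfy the checks:

-- ORDER BY now carries a LIMIT
select * from bigdata.emp order by emp_no limit 10;
-- the partitioned table is additionally filtered on its partition column
select * from bigdata.emp_partition where dept_no = '20' and emp_no = '7788';
-- the join carries an ON condition instead of producing a Cartesian product
select a.*, b.* from bigdata.emp a join bigdata.emp b on a.emp_no = b.emp_no;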

1.4 hive.mapred.reduce.tasks.speculative.execution

Default Value: true
Added In: Hive 0.5.0
Whether speculative execution for reducers should be turned on.
  • Speculative execution; defaults to true (enabled). If a long-tail (straggler) task appears, a duplicate of the task is launched on another machine, and whichever attempt finishes first supplies the result
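If a reducer has side effects that must not run twice (for example, writing to an external store), disabling speculation is a common judgment call; a sketch:

-- Hive-level switch for reducer speculation (the parameter this section describes)
set hive.mapred.reduce.tasks.speculative.execution=false;
-- the underlying Hadoop switches can be turned off as well
set mapreduce.map.speculative=false;
set mapreduce.reduce.speculative=false;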

1.5 hive.optimize.cp

Default Value: true
Added In: Hive 0.4.0 with HIVE-626
Removed In: Hive 0.13.0 with HIVE-4113
Whether to enable column pruner. (This configuration property was removed in release 0.13.0.)
  • Column pruning; enabled by default (the property was removed in Hive 0.13.0, where the optimization is always on)
  • In SELECT, fetch only the columns you need rather than select * for all columns; this effectively reduces I/O
  • The counterpart is partition pruning: for a partitioned table, filter on the partition column first, which narrows the set of files scanned and thus reduces I/O
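On the demo tables, the pruning-friendly form of a query looks like this (contrast with select * and no partition filter):

-- column pruning: read only the needed columns instead of select *
-- partition pruning: the dept_no filter limits the scan to one partition directory
select emp_no, emp_name from bigdata.emp_partition where dept_no = '20';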

1.6 hive.optimize.ppd

Default Value: true
Added In: Hive 0.4.0 with HIVE-279, default changed to true in Hive 0.4.0 with HIVE-626
Whether to enable predicate pushdown (PPD). 

Note: Turn on Configuration Properties#hive.optimize.index.filter as well to use file format specific indexes with PPD.
  • Predicate pushdown; enabled by default. When two tables are joined, the filter conditions are applied to each table first, so that less data reaches the join; this reduces the amount scanned and thus the I/O
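Conceptually, with PPD enabled the predicates below are evaluated at each table scan before the join instead of after it; a sketch on the demo tables:

set hive.optimize.ppd=true;
-- both dept_no conditions are pushed below the join, so each side emits only dept 20 rows
select a.emp_no, b.emp_name
from bigdata.emp a
join bigdata.emp_partition b on a.emp_no = b.emp_no
where a.dept_no = '20' and b.dept_no = '20';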

 

2 Optimizing the Number of Map and Reduce Tasks in the MapReduce Stage

2.1 Optimizing the Number of Map Tasks

  • The number of Map Tasks usually does not need tuning; inspecting the MapReduce InputFormat shows that it is governed by mapreduce.input.fileinputformat.split.maxsize
  • mapreduce.input.fileinputformat.split.maxsize is the maximum size the InputFormat uses when splitting input files: the larger this value, the fewer Map Tasks in the map stage
  • Check the default value of mapreduce.input.fileinputformat.split.maxsize:
hive> set mapreduce.input.fileinputformat.split.maxsize;
mapreduce.input.fileinputformat.split.maxsize=256000000
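Roughly, the number of map tasks ≈ ceil(total input size / split size) for splittable files. A sketch of forcing more mappers by shrinking the maximum split size (128000000 is an arbitrary illustrative value):

-- halve the maximum split size: roughly twice the map tasks on large inputs
set mapreduce.input.fileinputformat.split.maxsize=128000000;
select count(1) from bigdata.emp;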

2.2 Optimizing the Number of Reduce Tasks

  • The Hive console output shows the parameters related to the number of Reduce Tasks:
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
  • In the source, the reducer count is computed by the setNumberOfReducers function in org.apache.hadoop.hive.ql.exec.mr.MapRedTask
  • It first checks whether mapred.reduce.tasks is set; if so, that value is taken as the final number of Reduce Tasks. mapred.reduce.tasks defaults to -1, in which case the count is computed
  • The reducer-count computation (a commented sketch of the Hive source):
// MapRedTask.setNumberOfReducers() delegates to:
int reducers = Utilities.estimateNumberOfReducers(conf, inputSummary,
    work.getMapWork(), work.isFinalMapRed());

// Inside estimateNumberOfReducers:
// bytesPerReducer is hive.exec.reducers.bytes.per.reducer, default 256000000L
long bytesPerReducer = conf.getLongVar(HiveConf.ConfVars.BYTESPERREDUCER);
// maxReducers is hive.exec.reducers.max, default 1009
int maxReducers = conf.getIntVar(HiveConf.ConfVars.MAXREDUCERS);

// Inside estimateReducers(totalInputFileSize, bytesPerReducer, maxReducers, powersOfTwo):
// totalInputFileSize is the total byte size of the query's input
double bytes = Math.max(totalInputFileSize, bytesPerReducer);
int reducers = (int) Math.ceil(bytes / bytesPerReducer);
reducers = Math.max(1, reducers);
reducers = Math.min(maxReducers, reducers);
// In short: Reduce Task count =
//   min(ceil(total input bytes / hive.exec.reducers.bytes.per.reducer), hive.exec.reducers.max)
  • So to change the number of Reduce Tasks, adjust hive.exec.reducers.bytes.per.reducer: to increase the number of Reduce Tasks, decrease hive.exec.reducers.bytes.per.reducer
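A worked example of the formula: with 10737418240 bytes (10 GB) of input and the default bytesPerReducer of 256000000, reducers = min(ceil(10737418240 / 256000000), 1009) = 42. The two ways to change the count:

-- indirect: lower bytes-per-reducer for more reducers (~84 for the same 10 GB here)
set hive.exec.reducers.bytes.per.reducer=128000000;
-- direct: pin an exact reducer count, bypassing the estimate (20 is an arbitrary example)
set mapreduce.job.reduces=20;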