Hive on Tez : How to control the number of Mappers and Reducers

Goal:

How to control the number of Mappers and Reducers in Hive on Tez.

Env:

Hive 2.1
Tez 0.8

Solution:

1. # of Mappers

Which Tez parameters control this?

  • tez.grouping.max-size (default 1073741824, i.e. 1 GB)
  • tez.grouping.min-size (default 52428800, i.e. 50 MB)
  • tez.grouping.split-count (not set by default; see the session example after this list)
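
These parameters can be overridden per session from the Hive CLI before running the query. A minimal sketch (the values are only for illustration and match the tests below):

hive> set tez.grouping.min-size=52428800;
hive> set tez.grouping.max-size=1073741824;
hive> set tez.grouping.split-count=13;

Normally only the min/max sizes are tuned; tez.grouping.split-count overrides the desired split count directly, as the third test below shows.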

Which log should we check when debugging the # of Mappers?
The DAG syslog in the DAG Application Master container directory.
Search for "grouper.TezSplitGrouper", for example:

# grep grouper.TezSplitGrouper syslog_dag_1475192050844_0026_1

2017-05-23 15:00:50,285 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Grouping splits in Tez

2017-05-23 15:00:50,288 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Desired numSplits: 59 lengthPerGroup: 97890789 numLocations: 4 numSplitsPerLocation: 40 numSplitsInGroup: 2 totalLength: 5775556608 numOriginalSplits: 161 . Grouping by length: true count: false nodeLocalOnly: false

2017-05-23 15:00:50,291 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Doing rack local after iteration: 18 splitsProcessed: 139 numFullGroupsInRound: 0 totalGroups: 68 lengthPerGroup: 73418096 numSplitsInGroup: 1

2017-05-23 15:00:50,291 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Allowing small groups after iteration: 19 splitsProcessed: 139 numFullGroupsInRound: 0 totalGroups: 68

2017-05-23 15:00:50,291 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Number of splits desired: 59 created: 69 splitsProcessed: 161

This means this Hive on Tez query eventually spawns 69 Mappers.
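
The numbers line up: lengthPerGroup 97890789 is roughly totalLength / desired numSplits = 5775556608 / 59, and since this falls between the 50 MB minimum and the 1 GB maximum, the default limits do not change the desired count. The created count (69) ends up a bit above the desired count (59) because, as the log shows, grouping falls back to rack-local and small groups near the end.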

If we set tez.grouping.max-size=tez.grouping.min-size=1073741824 (1 GB), here is the result:

# grep grouper.TezSplitGrouper syslog_dag_1475192050844_0030_1

2017-05-23 17:16:11,851 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Grouping splits in Tez

2017-05-23 17:16:11,852 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Desired splits: 6 too large.  Desired splitLength: 97890789 Min splitLength: 1073741824 New desired splits: 6 Final desired splits: 6 All splits have localhost: false Total length: 5775556608 Original splits: 161

2017-05-23 17:16:11,854 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Desired numSplits: 6 lengthPerGroup: 962592768 numLocations: 4 numSplitsPerLocation: 40 numSplitsInGroup: 26 totalLength: 5775556608 numOriginalSplits: 161 . Grouping by length: true count: false nodeLocalOnly: false

2017-05-23 17:16:11,856 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Doing rack local after iteration: 3 splitsProcessed: 135 numFullGroupsInRound: 0 totalGroups: 6 lengthPerGroup: 721944576 numSplitsInGroup: 19

2017-05-23 17:16:11,856 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Allowing small groups after iteration: 4 splitsProcessed: 135 numFullGroupsInRound: 0 totalGroups: 6

2017-05-23 17:16:11,856 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Number of splits desired: 6 created: 7 splitsProcessed: 161

This time only 7 Mappers are spawned because of "Min splitLength: 1073741824".
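
This matches expectations: the default split length (97890789 bytes per group) is now below the 1 GB minimum, so Tez recomputes the desired count as roughly totalLength / min-size = 5775556608 / 1073741824 ≈ 5.4, i.e. "New desired splits: 6" in the log, and grouping ends up creating 7.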

If we set tez.grouping.split-count=13, here is the result:

# grep grouper.TezSplitGrouper syslog_dag_1475192050844_0039_1

2017-05-24 16:27:05,523 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Grouping splits in Tez

2017-05-24 16:27:05,523 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Desired numSplits overridden by config to: 13

2017-05-24 16:27:05,526 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Desired numSplits: 13 lengthPerGroup: 444273585 numLocations: 4 numSplitsPerLocation: 40 numSplitsInGroup: 12 totalLength: 5775556608 numOriginalSplits: 161 . Grouping by length: true count: false nodeLocalOnly: false

2017-05-24 16:27:05,528 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Doing rack local after iteration: 5 splitsProcessed: 156 numFullGroupsInRound: 0 totalGroups: 14 lengthPerGroup: 333205184 numSplitsInGroup: 9

2017-05-24 16:27:05,528 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Allowing small groups after iteration: 6 splitsProcessed: 156 numFullGroupsInRound: 0 totalGroups: 14

2017-05-24 16:27:05,528 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Number of splits desired: 13 created: 15 splitsProcessed: 161

This time 15 Mappers are spawned because of "Desired numSplits overridden by config to: 13".

By the way, the query tested is "select count(*) from passwords".
Here "Original splits: 161" means there are 161 files in total:

# ls  passwords|wc -l

161

Here "Total length: 5775556608" means the table size is about 5.4G:

# du -b passwords

5775556769 passwords
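
(5,775,556,769 bytes / 1024^3 ≈ 5.38, hence "about 5.4G".)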


The detailed algorithm is in tez-mapreduce/src/main/java/org/apache/tez/mapreduce/grouper/TezSplitGrouper.java.
The Apache Tez wiki article "How initial task parallelism works" also explains the logic.

2. # of Reducers

As with a Hive on MR query, the parameters below control the # of Reducers; a rough estimate formula follows the list:

  • hive.exec.reducers.bytes.per.reducer (default 256000000)
  • hive.exec.reducers.max (default 1009)
  • hive.tez.auto.reducer.parallelism (default false)
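
Roughly speaking, when auto reducer parallelism is off, Hive estimates the reducer count for a vertex from its input data size, along the lines of:

# of Reducers ≈ min(hive.exec.reducers.max, ceil(estimated input bytes / hive.exec.reducers.bytes.per.reducer))

So doubling hive.exec.reducers.bytes.per.reducer roughly halves the count, and hive.exec.reducers.max caps it, which is what the tests below show.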

Take the query below as an example, and focus on "Reducer 2", which performs the join:

hive> select count(*) from passwords a, passwords b where a.col0=b.col1;

Query ID = mapr_20170524140623_bc36636e-c295-4e75-b7ac-fe066320dce1

Total jobs = 1

Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1475192050844_0034)

----------------------------------------------------------------------------------------------

        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

----------------------------------------------------------------------------------------------

Map 1 .......... container     SUCCEEDED     69         69        0        0       0       0

Map 4 .......... container     SUCCEEDED     69         69        0        0       0       0

Reducer 2 ...... container     SUCCEEDED     45         45        0        0       1       0

Reducer 3 ...... container     SUCCEEDED      1          1        0        0       0       0

----------------------------------------------------------------------------------------------

VERTICES: 04/04  [==========================>>] 100%  ELAPSED TIME: 164.32 s

----------------------------------------------------------------------------------------------

OK

0

"Reducer 2" spawns 45 Reducers.

If we double hive.exec.reducers.bytes.per.reducer to 512000000, "Reducer 2" spawns only about half as many Reducers -- 23 this time.

hive> set hive.exec.reducers.bytes.per.reducer=512000000;

hive> select count(*) from passwords a, passwords b where a.col0=b.col1;

Query ID = mapr_20170524142206_d07caa6a-0061-43a6-b5e9-4f67880cf118

Total jobs = 1

Launching Job 1 out of 1

Tez session was closed. Reopening...

Session re-established.

Status: Running (Executing on YARN cluster with App id application_1475192050844_0035)

----------------------------------------------------------------------------------------------

        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

----------------------------------------------------------------------------------------------

Map 1 .......... container     SUCCEEDED     69         69        0        0       0       0

Map 4 .......... container     SUCCEEDED     69         69        0        0       0       0

Reducer 2 ...... container     SUCCEEDED     23         23        0        0       0       0

Reducer 3 ...... container     SUCCEEDED      1          1        0        0       0       0

----------------------------------------------------------------------------------------------

VERTICES: 04/04  [==========================>>] 100%  ELAPSED TIME: 179.62 s

----------------------------------------------------------------------------------------------

OK

0


Of course, we can put a hard cap on the # of Reducers by setting hive.exec.reducers.max=10:

hive> set hive.exec.reducers.max=10;

hive> select count(*) from passwords a, passwords b where a.col0=b.col1;

Query ID = mapr_20170524142736_4367dee2-b695-4162-ad47-99d7ff2311bc

Total jobs = 1

Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1475192050844_0035)

----------------------------------------------------------------------------------------------

        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

----------------------------------------------------------------------------------------------

Map 1 .......... container     SUCCEEDED     69         69        0        0       0       0

Map 4 .......... container     SUCCEEDED     69         69        0        0       0       0

Reducer 2 ...... container     SUCCEEDED     10         10        0        0       1       0

Reducer 3 ...... container     SUCCEEDED      1          1        0        0       0       0

----------------------------------------------------------------------------------------------

VERTICES: 04/04  [==========================>>] 100%  ELAPSED TIME: 153.35 s

----------------------------------------------------------------------------------------------

OK

0


Another feature is controlled by hive.tez.auto.reducer.parallelism. Its description in the Hive configuration reads:
"Turn on Tez' auto reducer parallelism feature. When enabled, Hive will still estimate data sizes and set parallelism estimates. Tez will sample source vertices' output sizes and adjust the estimates at runtime as necessary."

hive> set hive.tez.auto.reducer.parallelism = true;

hive> select count(*) from passwords a, passwords b where a.col0=b.col1;

Query ID = mapr_20170524143541_18b3c2b6-75a8-4fbb-8a0d-2cf354fd7a72

Total jobs = 1

Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1475192050844_0036)

----------------------------------------------------------------------------------------------

        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

----------------------------------------------------------------------------------------------

Map 1 .......... container     SUCCEEDED     69         69        0        0       0       0

Map 4 .......... container     SUCCEEDED     69         69        0        0       0       0

Reducer 2 ...... container     SUCCEEDED     12         12        0        0       1       0

Reducer 3 ...... container     SUCCEEDED      1          1        0        0       0       0

----------------------------------------------------------------------------------------------

VERTICES: 04/04  [==========================>>] 100%  ELAPSED TIME: 143.69 s

----------------------------------------------------------------------------------------------

OK

0

From the DAG syslog file, we can see that "Reducer 2" initially planned to spawn 90 Reducers, which was then reduced to 12 at runtime:

[root@s3 container_e02_1475192050844_0036_01_000001]# grep "Reducer 2" syslog_dag_1475192050844_0036_1 |grep parallelism

2017-05-24 14:36:36,831 [INFO] [App Shared Pool - #0] |vertexmanager.ShuffleVertexManager|: Reducing auto parallelism for vertex: Reducer 2 from 90 to 12

2017-05-24 14:36:36,837 [INFO] [App Shared Pool - #0] |impl.VertexImpl|: Resetting vertex location hints due to change in parallelism for vertex: vertex_1475192050844_0036_1_02 [Reducer 2]

2017-05-24 14:36:36,841 [INFO] [App Shared Pool - #0] |impl.VertexImpl|: Vertex vertex_1475192050844_0036_1_02 [Reducer 2] parallelism set to 12 from 90
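
The initial figure of 90 is presumably the size-based estimate (45 in the first run) multiplied by hive.tez.max.partition.factor (default 2), which Hive uses to over-partition when auto reducer parallelism is enabled; Tez's ShuffleVertexManager then samples the actual output of the upstream Map vertices and shrinks the parallelism to 12 at runtime, as the log above shows.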
