Configuring Mapper and Reducer heap sizes in Hadoop 2

If a YARN container grows beyond its heap size setting, the map or reduce task will fail with an error similar to the one below:

"Container [pid=14639,containerID=container_1400188786457_0006_01_001609] is running beyond physical memory limits. Current usage: 2.5 GB of 2.5 GB physical memory used; 3.1 GB of 12.5 GB virtual memory used. Killing container."

The default heapsize for mappers is 1.5GB and for reducers is 2.5GB on the Altiscale platform.

You can solve this by increasing the heap size for the container for mappers or reducers, depending on which one is having the problem when you look at the job history UI or container logs.

mapreduce.{map|reduce}.memory.mb vs. mapreduce.{map|reduce}.java.opts

In Hadoop 2, tasks are run within containers launched by YARN.  mapreduce.{map|reduce}.memory.mb is used by YARN to set the memory size of the container being used to run the map or reduce task. If the task grows beyond this limit, YARN will kill the container.

To execute the actual map or reduce task, YARN will run a JVM within the container. The Hadoop property mapreduce.{map|reduce}.java.opts is intended to pass options to this JVM. This can include -Xmx to set the max heap size of the JVM. However, the subsequent growth in the memory footprint of the JVM due to the settings in mapreduce.{map|reduce}.java.opts is limited by the actual size of the container as set by mapreduce.{map|reduce}.memory.mb.

Consequently, you should ensure that the heap you specify in mapreduce.{map|reduce}.java.opts is set to be less than the memory specified by mapreduce.{map|reduce}.memory.mb.  If, for example, you see the following fatal error reported by your mapper or reducer:

2014-10-10 00:19:39,693 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space

This is a good indication that you need to make adjustments to mapreduce.{map|reduce}.java.opts and commensurate changes to mapreduce.{map|reduce}.memory.mb.

For example:

  hadoop jar <jarName> -Dmapreduce.map.memory.mb=4096 -Dmapreduce.map.java.opts=-Xmx3686m

and from the Hive CLI, you would run:

  hive> set mapreduce.map.memory.mb=4096;

  hive> set mapreduce.map.java.opts=-Xmx3686m;

Note: The two properties yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb cannot be set by customers.

 

Setting the heap size for Mappers or Reducers

You can solve the memory error by increasing the heap size for the container for mappers or reducers, depending on which one is having the problem when you look at the job history UI or container logs.

   SET mapreduce.{map|reduce}.memory.mb=<numerical memory value>;

Important: You should also raise the java heap specified by mapreduce.{map|reduce}.java.opts. However, you should ensure that the heap you specify in mapreduce.{map|reduce}.java.opts is set to be less than the container memory specified by mapreduce.{map|reduce}.memory.mb.

Specifically, a good rule of thumb is to set the java heap size to be 10% less than the container size:
mapreduce.{map|reduce}.java.opts = mapreduce.{map|reduce}.memory.mb x 0.9
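The rule of thumb above can be sketched as a small helper (the function name and 0.9 ratio are illustrative; the ratio matches the 90% guideline, with the remaining ~10% left for JVM overhead such as metaspace and thread stacks):

```python
def heap_opts_for_container(container_mb, ratio=0.9):
    """Derive an -Xmx value from a container size using the ~90% rule of thumb.

    container_mb: value you would pass to mapreduce.{map|reduce}.memory.mb
    ratio: fraction of the container budgeted for the JVM heap
    """
    heap_mb = int(container_mb * ratio)
    return "-Xmx{}m".format(heap_mb)

# A 4096 MB container yields -Xmx3686m, matching the earlier example
print(heap_opts_for_container(4096))  # -Xmx3686m
```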

See  the section above for further explanation of these two settings. 


For example (in Hive), to configure Reducer memory allocation:

   SET mapreduce.reduce.memory.mb=3809;
   SET mapreduce.reduce.java.opts=-Xmx3428m;

or to configure Mapper memory allocation:

   SET mapreduce.map.memory.mb=3809;
   SET mapreduce.map.java.opts=-Xmx3428m;

As a Hadoop job option, for example:

hadoop jar <jarName> <yourClassName> -Dmapreduce.reduce.memory.mb=5120 -Dmapreduce.reduce.java.opts=-Xmx4608m <otherArgs>

Increasing the memory size of mappers or reducers comes at the expense of reduced parallelism on your cluster, since it can now launch fewer containers simultaneously. Experiment with the memory settings to find the lowest heap size that still allows your jobs to complete comfortably.

If you have run into similar exceptions, we suggest setting the new values at least 20% higher than the virtual memory usage reported in the logs. For example, given the following error:

"Container [pid=14639,containerID=container_1400188786457_0006_01_001609] is running beyond physical memory limits. Current usage: 2.5 GB of 2.5 GB physical memory used; 3.1 GB of 12.5 GB virtual memory used. Killing container."

you should try at least 3809 MB (3174 x 1.2, since 3.1 GB ≈ 3174 MB) as the new container size (mapreduce.{map|reduce}.memory.mb) and adjust the heap size (mapreduce.{map|reduce}.java.opts) accordingly.
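The 20% bump can be computed mechanically from the usage reported in the log; this is a minimal sketch (function name is illustrative):

```python
import math

def bumped_container_mb(observed_usage_gb, factor=1.2):
    """Suggest a container size `factor` above the observed memory usage.

    observed_usage_gb: memory usage from the container-killed log message,
    e.g. the "3.1 GB of 12.5 GB virtual memory used" figure.
    """
    observed_mb = int(observed_usage_gb * 1024)  # 3.1 GB -> 3174 MB
    return math.ceil(observed_mb * factor)       # round up to whole MB

# Reproduces the 3809 MB suggested above for the example error
print(bumped_container_mb(3.1))  # 3809
```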

Setting the container heapsize in Hive

Most tools that operate on top of the Hadoop MapReduce framework provide ways to tune these Hadoop-level settings for their jobs. For example, in Hive there are multiple ways to do this. Three of them are shown here:

1) Pass directly via the Hive command line:

hive -hiveconf mapreduce.map.memory.mb=5120 -hiveconf mapreduce.reduce.memory.mb=5120 -hiveconf mapreduce.map.java.opts=-Xmx4608m -hiveconf mapreduce.reduce.java.opts=-Xmx4608m -e "select count(*) from test_table;"

2) Set the ENV variable before invoking Hive:

export HIVE_OPTS="-hiveconf mapreduce.map.memory.mb=5120 -hiveconf mapreduce.reduce.memory.mb=5120 -hiveconf mapreduce.map.java.opts=-Xmx4608m -hiveconf mapreduce.reduce.java.opts=-Xmx4608m"


3) Use the "set" command within the Hive CLI.

   set mapreduce.map.memory.mb=5120;
   set mapreduce.map.java.opts=-Xmx4608m;

   set mapreduce.reduce.memory.mb=5120;
   set mapreduce.reduce.java.opts=-Xmx4608m;

   select count(*) from test_table;

The above 3 examples use illustrative values, not recommendations. To identify whether to bump up the mapper's or the reducer's memory settings, check the Job History UI, which indicates whether the job is failing in the map phase or the reduce phase. The right values vary from application to application and also depend on the input data and the algorithm.

Setting the Tez memory footprint through Hive

To change Tez memory footprints through Hive, you need to set the following configuration parameters:

  • SET hive.tez.container.size=<numerical memory value>;  Sets the size of the container spawned by YARN.
  • SET hive.tez.java.opts=-Xmx<numerical max heap size>m;  Java command line options for Tez.

For example:

SET hive.tez.container.size=6656;
SET hive.tez.java.opts=-Xmx4096m;

"hive.tez.container.size" and "hive.tez.java.opts" are the parameters that alter Tez memory settings in Hive. If "hive.tez.container.size" is set to "-1" (default value), it picks the value of "mapreduce.map.memory.mb". If "hive.tez.java.opts" is not specified, it relies on the "mapreduce.map.java.opts" setting.  Thus, if Tez specific memory settings are left as default values, memory sizes are picked from mapreduce mapper memory settings "mapreduce.map.memory.mb".

Important: The heap specified by "hive.tez.java.opts" must be smaller than the size specified by "hive.tez.container.size" (or by "mapreduce.map.memory.mb" if "hive.tez.container.size" is not set). Review both settings whenever you change either one.
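The fallback behavior described above can be summarized in a short sketch (a simplification with illustrative names, not Hive's actual implementation):

```python
def effective_tez_container_mb(conf):
    """Mimic how Hive resolves the Tez container size.

    If hive.tez.container.size is unset or -1 (its default), Hive falls
    back to mapreduce.map.memory.mb.
    """
    size = int(conf.get("hive.tez.container.size", -1))
    if size <= 0:
        # Fallback: mapper container size (1024 MB here is an assumed default)
        size = int(conf.get("mapreduce.map.memory.mb", 1024))
    return size

# Explicit Tez setting wins; otherwise the mapper setting is used
print(effective_tez_container_mb({"hive.tez.container.size": "6656"}))   # 6656
print(effective_tez_container_mb({"mapreduce.map.memory.mb": "4096"}))   # 4096
```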

See OOM when running Tez jobs in YARN in the Troubleshooting area of this site for more information. 

Setting the container heapsize for HiveServer2 sessions

HiveServer2 provides a different access channel than the Hive CLI. If you are submitting queries to HiveServer2 via the JDBC or ODBC driver, or a Python module such as pyhs2, the following examples show how to customize the values.

1) Beeline / JDBC URL

The JDBC URL string will look like this:

jdbc:hive2://localhost:10000/default?mapreduce.map.memory.mb=3809;mapreduce.map.java.opts=-Xmx3428m;mapreduce.reduce.memory.mb=2560;mapreduce.reduce.java.opts=-Xmx2304m;

Use semicolons to separate multiple key-value pairs that customize this HiveServer2 session. The "default" before the question mark refers to the default database.
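Assembling such a URL from a dictionary of overrides can be sketched as follows (the function name is illustrative; it reproduces the host/port/database/conf layout shown above):

```python
def hive2_jdbc_url(host, port, database, conf):
    """Build a HiveServer2 JDBC URL with per-session Hadoop conf overrides.

    conf: mapping of Hadoop property names to values; pairs after the '?'
    are joined with semicolons, as in the example URL above.
    """
    pairs = ";".join("{}={}".format(k, v) for k, v in sorted(conf.items()))
    return "jdbc:hive2://{}:{}/{}?{}".format(host, port, database, pairs)

print(hive2_jdbc_url("localhost", 10000, "default",
                     {"mapreduce.map.memory.mb": "3809",
                      "mapreduce.map.java.opts": "-Xmx3428m"}))
```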

2) pyhs2 module example 

You will need to issue the SET statements, without trailing semicolons, through the cursor.

import pyhs2

# Connect to HiveServer2; authMechanism depends on your server configuration
conn = pyhs2.connect(host='hostname_to_your_hiveserver2', port=10000,
                     user='alti-test', authMechanism="PLAIN", database='default')
cur = conn.cursor()
# Per-session memory overrides (no trailing semicolons via the cursor)
cur.execute("SET mapreduce.map.memory.mb=3809")
cur.execute("SET mapreduce.map.java.opts=-Xmx3428m")
cur.execute("SET mapreduce.reduce.memory.mb=2560")
cur.execute("SET mapreduce.reduce.java.opts=-Xmx2304m")
cur.execute("SELECT COUNT(*) FROM yourtable_example")
cur.close()
conn.close()
 

Note: The authMechanism depends on what is enabled in your HiveServer2 settings.
