Configuring Mapper and Reducer heap sizes in Hadoop 2

If a YARN container grows beyond its heap size setting, the map or reduce task will fail with an error similar to the one below:

"Container [pid=14639,containerID=container_1400188786457_0006_01_001609] is running beyond physical memory limits. Current usage: 2.5 GB of 2.5 GB physical memory used; 3.1 GB of 12.5 GB virtual memory used. Killing container."

The default heapsize for mappers is 1.5GB and for reducers is 2.5GB on the Altiscale platform.

You can solve this by increasing the heap size for the container for mappers or reducers, depending on which one is having the problem when you look at the job history UI or container logs.

mapreduce.{map|reduce}.memory.mb vs. mapreduce.{map|reduce}.java.opts

In Hadoop 2, tasks are run within containers launched by YARN.  mapreduce.{map|reduce}.memory.mb is used by YARN to set the memory size of the container being used to run the map or reduce task. If the task grows beyond this limit, YARN will kill the container.

To execute the actual map or reduce task, YARN will run a JVM within the container. The Hadoop property mapreduce.{map|reduce}.java.opts is intended to pass options to this JVM. This can include -Xmx to set the max heap size of the JVM. However, the subsequent growth in the memory footprint of the JVM due to the settings in mapreduce.{map|reduce}.java.opts is limited by the actual size of the container as set by mapreduce.{map|reduce}.memory.mb.

Consequently, you should ensure that the heap you specify in mapreduce.{map|reduce}.java.opts is set to be less than the memory specified by mapreduce.{map|reduce}.memory.mb.  If, for example, you see the following fatal error reported by your mapper or reducer:

2014-10-10 00:19:39,693 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space

This is a good indication that you need to make adjustments to mapreduce.{map|reduce}.java.opts and commensurate changes to mapreduce.{map|reduce}.memory.mb.

For example:

  hadoop jar <jarName> -Dmapreduce.map.memory.mb=4096 -Dmapreduce.map.java.opts=-Xmx3686m

and from the Hive CLI, you would run:

  hive> set mapreduce.map.memory.mb=4096;

  hive> set mapreduce.map.java.opts=-Xmx3686m;

Note: The two properties yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb cannot be set by customers.

 

Setting the heap size for Mappers or Reducers

You can solve the memory error by increasing the heap size for the container for mappers or reducers, depending on which one is having the problem when you look at the job history UI or container logs.

   SET mapreduce.{map|reduce}.memory.mb=<numerical memory value>;

Important: You should also raise the java heap specified by mapreduce.{map|reduce}.java.opts. However, you should ensure that the heap you specify in mapreduce.{map|reduce}.java.opts is set to be less than the container memory specified by mapreduce.{map|reduce}.memory.mb.

Specifically, a good rule of thumb is to set the java heap size to be 10% less than the container size:
mapreduce.{map|reduce}.java.opts = mapreduce.{map|reduce}.memory.mb x 0.9
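The rule of thumb above can be sketched as a small helper (the function name and 0.9 ratio are illustrative; the ratio matches the 90% guideline, with the remaining ~10% left for JVM overhead such as metaspace and thread stacks):

```python
def heap_opts_for_container(container_mb, ratio=0.9):
    """Derive an -Xmx value from a container size using the ~90% rule of thumb.

    container_mb: value you would pass to mapreduce.{map|reduce}.memory.mb
    ratio: fraction of the container budgeted for the JVM heap
    """
    heap_mb = int(container_mb * ratio)
    return "-Xmx{}m".format(heap_mb)

# A 4096 MB container yields -Xmx3686m, matching the earlier example
print(heap_opts_for_container(4096))  # -Xmx3686m
```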

See  the section above for further explanation of these two settings. 


For example (in Hive), to configure Reducer memory allocation:

   SET mapreduce.reduce.memory.mb=3809;
   SET mapreduce.reduce.java.opts=-Xmx3428m;

or to configure Mapper memory allocation:

   SET mapreduce.map.memory.mb=3809;
   SET mapreduce.map.java.opts=-Xmx3428m;

As a Hadoop job option, for example:

hadoop jar <jarName> <yourClassName> -Dmapreduce.reduce.memory.mb=5120 -Dmapreduce.reduce.java.opts=-Xmx4608m <otherArgs>

Increasing the memory size of mappers or reducers comes at the expense of reduced parallelism on your cluster, since it can now launch fewer containers simultaneously. Experiment with the memory settings to find the lowest heap size that still allows your jobs to complete comfortably.

If you have run into similar exceptions, we suggest setting the new values at least 20% higher than the virtual memory usage reported in the logs. For example, given the following error:

"Container [pid=14639,containerID=container_1400188786457_0006_01_001609] is running beyond physical memory limits. Current usage: 2.5 GB of 2.5 GB physical memory used; 3.1 GB of 12.5 GB virtual memory used. Killing container."

you should try at least 3809 MB (3174 x 1.2, since 3.1 GB ≈ 3174 MB) as the new container size (mapreduce.{map|reduce}.memory.mb) and adjust the heap size (mapreduce.{map|reduce}.java.opts) accordingly.
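The 20% bump can be computed mechanically from the usage reported in the log; this is a minimal sketch (function name is illustrative):

```python
import math

def bumped_container_mb(observed_usage_gb, factor=1.2):
    """Suggest a container size `factor` above the observed memory usage.

    observed_usage_gb: memory usage from the container-killed log message,
    e.g. the "3.1 GB of 12.5 GB virtual memory used" figure.
    """
    observed_mb = int(observed_usage_gb * 1024)  # 3.1 GB -> 3174 MB
    return math.ceil(observed_mb * factor)       # round up to whole MB

# Reproduces the 3809 MB suggested above for the example error
print(bumped_container_mb(3.1))  # 3809
```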

Setting the container heapsize in Hive

Most tools that operate on top of the Hadoop MapReduce framework provide ways to tune these Hadoop-level settings for their jobs. For example, in Hive there are multiple ways to do this. Three of them are shown here:

1) Pass directly via the Hive command line:

hive -hiveconf mapreduce.map.memory.mb=5120 -hiveconf mapreduce.reduce.memory.mb=5120 -hiveconf mapreduce.map.java.opts=-Xmx4608m -hiveconf mapreduce.reduce.java.opts=-Xmx4608m -e "select count(*) from test_table;"

2) Set the ENV variable before invoking Hive:

export HIVE_OPTS="-hiveconf mapreduce.map.memory.mb=5120 -hiveconf mapreduce.reduce.memory.mb=5120 -hiveconf mapreduce.map.java.opts=-Xmx4608m -hiveconf mapreduce.reduce.java.opts=-Xmx4608m"


3) Use the "set" command within the Hive CLI.

   set mapreduce.map.memory.mb=5120;
   set mapreduce.map.java.opts=-Xmx4608m;

   set mapreduce.reduce.memory.mb=5120;
   set mapreduce.reduce.java.opts=-Xmx4608m;

   select count(*) from test_table;

The above 3 examples use illustrative values, not recommendations. To identify whether to bump up the mapper's or the reducer's memory settings, check the Job History UI, which indicates whether the job is failing in the map phase or the reduce phase. The right values vary from application to application and also depend on the input data and the algorithm.

Setting the Tez memory footprint through Hive

To change Tez memory footprints through Hive, you need to set the following configuration parameters:

  • SET hive.tez.container.size=<numerical memory value>;  Sets the size of the container spawned by YARN.
  • SET hive.tez.java.opts=-Xmx<numerical max heap size>m;  Java command line options for Tez.

For example:

SET hive.tez.container.size=6656;
SET hive.tez.java.opts=-Xmx4096m;

"hive.tez.container.size" and "hive.tez.java.opts" are the parameters that alter Tez memory settings in Hive. If "hive.tez.container.size" is set to "-1" (default value), it picks the value of "mapreduce.map.memory.mb". If "hive.tez.java.opts" is not specified, it relies on the "mapreduce.map.java.opts" setting.  Thus, if Tez specific memory settings are left as default values, memory sizes are picked from mapreduce mapper memory settings "mapreduce.map.memory.mb".

Important: The heap specified by "hive.tez.java.opts" must be smaller than the size specified by "hive.tez.container.size" (or by "mapreduce.map.memory.mb" if "hive.tez.container.size" is not set). Review both settings whenever you change either one.
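The fallback behavior described above can be summarized in a short sketch (a simplification with illustrative names, not Hive's actual implementation):

```python
def effective_tez_container_mb(conf):
    """Mimic how Hive resolves the Tez container size.

    If hive.tez.container.size is unset or -1 (its default), Hive falls
    back to mapreduce.map.memory.mb.
    """
    size = int(conf.get("hive.tez.container.size", -1))
    if size <= 0:
        # Fallback: mapper container size (1024 MB here is an assumed default)
        size = int(conf.get("mapreduce.map.memory.mb", 1024))
    return size

# Explicit Tez setting wins; otherwise the mapper setting is used
print(effective_tez_container_mb({"hive.tez.container.size": "6656"}))   # 6656
print(effective_tez_container_mb({"mapreduce.map.memory.mb": "4096"}))   # 4096
```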

See OOM when running Tez jobs in YARN in the Troubleshooting area of this site for more information. 

Setting the container heapsize for HiveServer2 sessions

HiveServer2 provides a different access channel than the Hive CLI. If you are submitting queries to HiveServer2 via the JDBC or ODBC driver, or a Python module such as pyhs2, the following examples show how to customize the values.

1) Beeline / JDBC URL

The JDBC URL string will look like this:

jdbc:hive2://localhost:10000/default?mapreduce.map.memory.mb=3809;mapreduce.map.java.opts=-Xmx3428m;mapreduce.reduce.memory.mb=2560;mapreduce.reduce.java.opts=-Xmx2304m;

Use semicolons to separate multiple key-value pairs that customize this HiveServer2 session. The "default" before the question mark refers to the default database.
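Assembling such a URL from a dictionary of overrides can be sketched as follows (the function name is illustrative; it reproduces the host/port/database/conf layout shown above):

```python
def hive2_jdbc_url(host, port, database, conf):
    """Build a HiveServer2 JDBC URL with per-session Hadoop conf overrides.

    conf: mapping of Hadoop property names to values; pairs after the '?'
    are joined with semicolons, as in the example URL above.
    """
    pairs = ";".join("{}={}".format(k, v) for k, v in sorted(conf.items()))
    return "jdbc:hive2://{}:{}/{}?{}".format(host, port, database, pairs)

print(hive2_jdbc_url("localhost", 10000, "default",
                     {"mapreduce.map.memory.mb": "3809",
                      "mapreduce.map.java.opts": "-Xmx3428m"}))
```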

2) pyhs2 module example 

You will need to issue the SET statements, without trailing semicolons, through the cursor.

import pyhs2

# Connect to HiveServer2; authMechanism depends on your server configuration
conn = pyhs2.connect(host='hostname_to_your_hiveserver2', port=10000,
                     user='alti-test', authMechanism="PLAIN", database='default')
cur = conn.cursor()
# Per-session memory overrides (no trailing semicolons via the cursor)
cur.execute("SET mapreduce.map.memory.mb=3809")
cur.execute("SET mapreduce.map.java.opts=-Xmx3428m")
cur.execute("SET mapreduce.reduce.memory.mb=2560")
cur.execute("SET mapreduce.reduce.java.opts=-Xmx2304m")
cur.execute("SELECT COUNT(*) FROM yourtable_example")
cur.close()
conn.close()
 

Note: The authMechanism depends on what is enabled in your HiveServer2 settings.
