Configuration Parameters: What can you just ignore?

Configuring a Hadoop cluster is something akin to voodoo. There are a large number of variables in hadoop-default.xml that you can override in hadoop-site.xml. Some specify file paths on your system, but others adjust levers and knobs deep inside Hadoop’s guts. Unfortunately, there’s little or no documentation on how to set them well. Is there a single optimal configuration? Are there some settings that can just be “set to 11?”

Nigel’s guitar goes to 11, but your cluster might not.

At Cloudera, we’re working hard to make Hadoop easier to use and to make configuration less painful. Our Hadoop Configuration Tool gives you a web-based guide to help set up your cluster. Once it’s running, though, you might want to look under the hood and tune things a bit.

The rest of this post discusses why it’s a bad idea to just set all the limits as high as they’ll go, and gives you some pointers to get started on finding a happy medium.

Why can’t you just set all the limits to 1,000,000?

Increasing most settings has a direct impact on memory consumption. Increasing DataNode and TaskTracker settings, therefore, has an adverse impact on the RAM available to individual MapReduce tasks. On large hardware, they can be set generously high. In general, though, unless you have several dozen or more nodes working together, dialing up settings very high wastes system resources like RAM that could be better applied to running your mapper and reducer code.

That having been said, here’s a list of some things that can be cranked up higher than the defaults by a fair margin:

File descriptor limits

A busy Hadoop daemon might need to open a lot of files. The open fd ulimit in Linux defaults to 1024, which might be too low. You can set it to something more generous, maybe 16384. Setting it an order of magnitude higher (e.g., 128K) is probably not a good idea. No individual Hadoop daemon is supposed to need hundreds of thousands of fds; if it’s consuming that many, then there’s probably an fd leak or other bug that needs fixing. A very high limit would just mask the true problem until errors started showing up somewhere else.

You can view your ulimits in bash by running:
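
A minimal sketch using the shell’s ulimit builtin; -n selects the open-file-descriptor limit, and -S and -H pick the soft and hard values. The variable names are only for illustration.

```shell
# Show every resource limit for the current shell
ulimit -a

# Show just the open-file-descriptor limit: soft first, then hard
soft_fd_limit=$(ulimit -Sn)
hard_fd_limit=$(ulimit -Hn)
echo "soft=$soft_fd_limit hard=$hard_fd_limit"
```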

To set the fd ulimit for a process, you’ll need to be root. As root, open a shell, and run:
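
Something along these lines should work, assuming the 16384 figure suggested above; treat the number as a starting point rather than a recommendation for every cluster.

```shell
# Run as root. 16384 is the value suggested above; pick what suits your cluster.
# With neither -S nor -H, the builtin sets both the soft and the hard limit.
ulimit -n 16384

# Confirm the limit now in effect for this shell and anything it launches
ulimit -n
```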

You can then run the Hadoop daemon from that shell; the ulimits will be inherited. e.g.:
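
As a sketch only: the install location below is a hypothetical default, and the DataNode is just one example of a daemon you might start this way with the hadoop-daemon.sh script that ships in Hadoop’s bin directory.

```shell
# Assumption: HADOOP_HOME points at your Hadoop install; the fallback path is hypothetical.
HADOOP_HOME="${HADOOP_HOME:-/usr/lib/hadoop}"
daemon_script="$HADOOP_HOME/bin/hadoop-daemon.sh"

# Whatever limit this shell has now is what the daemon will inherit
echo "fd limit to be inherited: $(ulimit -n)"

if [ -x "$daemon_script" ]; then
    "$daemon_script" start datanode
else
    echo "hadoop-daemon.sh not found under $HADOOP_HOME" >&2
fi
```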

You can also set the ulimit for the hadoop user in /etc/security/limits.conf; this mechanism will set the value persistently. Make sure pam_limits is enabled for whatever auth mechanism the hadoop daemon is using. The entry will look something like:
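
A sketch of such an entry, assuming the daemons run as a user named hadoop and reusing the 16384 figure from above; nofile is the limits.conf item for open file descriptors.

```
# /etc/security/limits.conf
# <domain>  <type>  <item>   <value>
hadoop      soft    nofile   16384
hadoop      hard    nofile   16384
```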

If you’re running our distribution, we ship a modified version of Hadoop 0.18.3 that includes HADOOP-4346, a fix for the “soft fd leak” that has affected Hadoop since 0.17, so this should be less critical for our users. Users of the official Apache Hadoop release are affected by the fd leak for all 0.17, 0.18, and 0.19 versions. (The fix is committed for 0.20.) For the curious, we’ve published a list of all differences between our release of Hadoop and the stock 0.18.3 release.

If you’re running Linux 2.6.27, you should also set the epoll limit to something generous; maybe 4096 or 8192.
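
One way to apply this to a running system, assuming the limit in question is fs.epoll.max_user_instances (the per-user cap on epoll instances that 2.6.27-era kernels expose under /proc; other kernels may not have it, hence the check):

```shell
# Assumption: the epoll limit meant here is fs.epoll.max_user_instances,
# which only exists on some kernels (2.6.27-era); hence the existence check.
epoll_knob=/proc/sys/fs/epoll/max_user_instances
epoll_limit=4096

if [ -w "$epoll_knob" ]; then
    # Run as root: takes effect immediately, but is lost on reboot
    echo "$epoll_limit" > "$epoll_knob"
    cat "$epoll_knob"
else
    echo "$epoll_knob not present or not writable on this kernel" >&2
fi
```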

Then put the following text in /etc/sysctl.conf:
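
A sketch of the persistent setting, again assuming the relevant key is fs.epoll.max_user_instances and using 4096 from the range suggested above:

```
# /etc/sysctl.conf
fs.epoll.max_user_instances = 4096
```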

See http://pero.blogs.aprilmayjune.org/2009/01/22/hadoop-and-linux-kernel-2627-epoll-limits/ for more details.

Internal settings

If there is more RAM available than is consumed by task instances, set io.sort.factor to 25 or 32 (up from 10). io.sort.mb should be 10 * io.sort.factor. Don’t forget to multiply io.sort.mb by the number of concurrent tasks to determine how much RAM you’re actually allocating here, so that you can prevent swapping. (So 10 task instances with io.sort.mb = 320 means you’re actually allocating 3.2 GB of RAM for sorting, up from 1.0 GB at the default io.sort.mb of 100.) An open ticket on the Hadoop bug tracking database suggests making the default value here 100. This would likely result in a lower per-stream cache size than 10 MB.
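
The arithmetic above can be checked with a few lines of shell. The variable names are made up for illustration; the numbers are the ones from the example.

```shell
# Values from the paragraph above
io_sort_factor=32
io_sort_mb=$((10 * io_sort_factor))   # 10 MB per merge stream
concurrent_tasks=10                   # task instances running at once on a node

# RAM set aside for sort buffers across all concurrent tasks on one node
total_sort_mb=$((io_sort_mb * concurrent_tasks))
echo "io.sort.mb=$io_sort_mb total=$total_sort_mb MB"
```

If the total is more than the RAM left over after the daemons and task heaps, lower either io.sort.mb or the number of concurrent tasks.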

io.file.buffer.size – this is one of the more “magic” parameters. You can set this to 65536 and leave it there. (I’ve profiled this in a bunch of scenarios; this seems to be the sweet spot.)

If the NameNode and JobTracker are on big hardware, set dfs.namenode.handler.count to 64 and do the same with mapred.job.tracker.handler.count. If you’ve got more than 64 GB of RAM in this machine, you can double it again.

dfs.datanode.handler.count defaults to 3 and could be set a bit higher. (Maybe 8 or 10.) More than this takes up memory that could be devoted to running MapReduce tasks, and I don’t know that it gives you any more performance. (An increased number of HDFS clients implies an increased number of DataNodes to handle the load.)

mapred.child.ulimit should be 2–3x higher than the heap size specified in mapred.child.java.opts and left there to prevent runaway child task memory consumption.
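
A quick way to work out the range, assuming a 512 MB child heap (that is, -Xmx512m in mapred.child.java.opts; the heap size is an example, not a recommendation) and assuming, as the Hadoop documentation describes it, that mapred.child.ulimit is expressed in kilobytes:

```shell
# Assumption: child heap of 512 MB, i.e. -Xmx512m in mapred.child.java.opts
heap_mb=512

# mapred.child.ulimit is given in kilobytes, so convert MB to KB
ulimit_low_kb=$((2 * heap_mb * 1024))
ulimit_high_kb=$((3 * heap_mb * 1024))
echo "mapred.child.ulimit between $ulimit_low_kb and $ulimit_high_kb KB"
```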

Setting tasktracker.http.threads higher than 40 will deprive individual tasks of RAM, and you won’t see a positive impact on shuffle performance until your cluster is approaching 100 nodes or more.
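
Pulled together, the internal settings discussed in this section might look like the following hadoop-site.xml fragment. This is a sketch for big hardware with RAM to spare; the specific values are picked from the ranges suggested above and should be treated as a starting point, not a tested configuration.

```xml
<configuration>
  <!-- Sorting: io.sort.mb is 10 * io.sort.factor -->
  <property>
    <name>io.sort.factor</name>
    <value>32</value>
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>320</value>
  </property>

  <property>
    <name>io.file.buffer.size</name>
    <value>65536</value>
  </property>

  <!-- Handler threads on the NameNode and JobTracker (big hardware only) -->
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>64</value>
  </property>
  <property>
    <name>mapred.job.tracker.handler.count</name>
    <value>64</value>
  </property>

  <!-- A bit above the default of 3 -->
  <property>
    <name>dfs.datanode.handler.count</name>
    <value>8</value>
  </property>
</configuration>
```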

Conclusions

Configuring Hadoop for “optimal performance” is a moving target, and depends heavily on your own applications. There are settings that need to be moved off their defaults, but finding the best value for each is difficult. Our configurator for Hadoop will do a reasonable job of getting you started.

We’d love to hear from you about your own configurations. Did you discover a combination of settings that really made your cluster sing? Please share in the comments.

The photo of Nigel’s amplifier is from the movie This is Spinal Tap, distributed by Embassy Pictures.

Reference: http://blog.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/
