Improve Hadoop MapReduce performance

[url]http://hadoop.apache.org/common/docs/current/mapred_tutorial.html[/url]
[url]http://hadoop.group.iteye.com/group/topic/18294[/url]

1. Set a combiner:
Users can optionally specify a combiner, via JobConf.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer. Setting a combiner class pre-aggregates the map output on each local machine before it is sent to the reducers, avoiding a large-scale data transfer (process the local data first, then pass the reduced result to the reducers).
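The idea behind local aggregation can be shown without a cluster. Below is a toy, Hadoop-free simulation (class and method names are made up for illustration, this is not the Hadoop API): the map phase emits one (word, 1) pair per occurrence, and the combiner sums them locally so far fewer pairs have to cross the network in the shuffle.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CombinerDemo {
    // Map phase on one node: emit one (word, 1) pair per occurrence.
    static List<Map.Entry<String, Integer>> mapPhase(String[] words) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : words) out.add(new AbstractMap.SimpleEntry<>(w, 1));
        return out;
    }

    // Combiner: locally sum the counts per word before the shuffle.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        String[] words = {"a", "b", "a", "a", "c", "b"};
        List<Map.Entry<String, Integer>> raw = mapPhase(words);
        Map<String, Integer> combined = combine(raw);
        // 6 pairs would be shuffled without the combiner; only 3 with it.
        System.out.println(raw.size() + " pairs -> " + combined.size() + " pairs");
    }
}
```

In real word-count-style jobs the Reducer class itself is often reused as the combiner, since summing partial counts is associative.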

2. How many maps:
Ideally, number of maps = sizeOf(inputData) / blockSize (to verify: is this the ideal number or the maximum?)
Task setup takes a while, so it is best if each map takes at least a minute to execute. The right level of parallelism for maps seems to [color=red]be around 10-100 maps per-node[/color], although it has been set up to 300 maps for very cpu-light map tasks (CPU-light tasks can be set higher).

Thus, if you expect 10TB of input data and have a blocksize of 128MB, you'll end up with 82,000 maps, unless [color=red]setNumMapTasks(int)[/color] (which [color=red]only provides a hint[/color] to the framework) is used to set it even higher. setNumMapTasks() is only a hint to the MapReduce framework; the actual number of map tasks at run time is not necessarily that value.
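That arithmetic can be sanity-checked as a runnable sketch (the class name is made up for illustration):

```java
// 10 TB of input divided into 128 MB blocks yields one map per block,
// i.e. the roughly 82,000 maps quoted above.
public class MapCountEstimate {
    static long estimateMaps(long inputBytes, long blockBytes) {
        return inputBytes / blockBytes;
    }

    public static void main(String[] args) {
        long tenTB = 10L * 1024 * 1024 * 1024 * 1024; // 10 TB in bytes
        long block = 128L * 1024 * 1024;              // 128 MB block size
        System.out.println(estimateMaps(tenTB, block)); // prints 81920
    }
}
```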

3. How many reduces:

[color=red]The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum).[/color]

With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing.

Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures.
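The 0.95 / 1.75 rule can be worked through for a hypothetical cluster; the node count and slots-per-node below are made-up example values, not recommendations:

```java
// Worked example of: reduces = factor * (nodes * mapred.tasktracker.reduce.tasks.maximum)
public class ReduceCountEstimate {
    static int reduces(double factor, int nodes, int maxReducesPerNode) {
        return (int) (factor * nodes * maxReducesPerNode);
    }

    public static void main(String[] args) {
        int nodes = 20;      // hypothetical cluster size
        int maxPerNode = 2;  // hypothetical mapred.tasktracker.reduce.tasks.maximum
        // 0.95: every reduce gets a slot, all launch as the maps finish
        System.out.println(reduces(0.95, nodes, maxPerNode)); // prints 38
        // 1.75: fast nodes run a second wave, improving load balancing
        System.out.println(reduces(1.75, nodes, maxPerNode)); // prints 70
    }
}
```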

4. Reducer NONE:

It is legal to set the number of reduce-tasks to zero if no reduction is desired, i.e. the reduce phase is skipped entirely.
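In the classic mapred API this can be requested in code with JobConf.setNumReduceTasks(0), or per job in configuration; a minimal sketch (property name from the pre-0.21 mapred configuration):

```xml
<!-- Zero reduces: map output is written directly to the output path,
     skipping the shuffle and sort phases entirely. -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>0</value>
</property>
```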

5. mapred.tasktracker.map.tasks.maximum (default value = 2):
The maximum number of map tasks that will be run simultaneously by a task tracker, i.e. how many map tasks a single machine (TaskTracker) runs at the same time. Each map task runs the map function over one split of the input data (to verify).
Page 79 of Pro Hadoop says this is best set to the effective number of CPUs on the node (meaning equal to the CPU/core count? so 2 for a dual-core machine?).

6. mapred.map.tasks: the total number of map tasks for the job (default value = 2). If you set it, then per point 2 above it should be roughly numberOfMachines * (10-100) (to verify).
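Points 5 and 6 together, as a mapred-site.xml sketch for a hypothetical 10-node cluster with 4 effective cores per node (both values are illustrative assumptions, not recommendations):

```xml
<!-- Per point 5: at most one concurrent map task per effective CPU core. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<!-- Per point 6: a hint for the job's total map count; with 10 nodes and
     the suggested 10-100 maps per node, this lands somewhere in 100-1000. -->
<property>
  <name>mapred.map.tasks</name>
  <value>200</value>
</property>
```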

7. dfs.block.size: TBD