In Hadoop, when a job does not set these values explicitly, the number of map tasks it runs is determined by the volume of the job's input data (the exact calculation is explained below), while the number of reduce tasks defaults to 1. Why 1? Because the number of output files a job produces is determined by the number of reduces, and by default a job writes its result to a single file, so the reduce count is set to 1. The question, then, is how to adjust the number of maps and reduces to speed up job execution.
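For instance, the reduce count can be raised explicitly on the job configuration so that the work (and the output) is spread across several files. The sketch below uses the classic org.apache.hadoop.mapred.JobConf API; the class name and the values are placeholders of mine, not part of the original article:

    import org.apache.hadoop.mapred.JobConf;

    public class TuneTaskCounts {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            // Taken literally: 4 reducers produce 4 output files (part-00000 .. part-00003).
            conf.setNumReduceTasks(4);   // equivalent to -D mapred.reduce.tasks=4
            // Only a hint passed to the InputFormat, as the documentation below explains.
            conf.setNumMapTasks(10);     // equivalent to -D mapred.map.tasks=10
            System.out.println(conf.get("mapred.reduce.tasks")); // prints 4
        }
    }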
Before going into the details, let's first look at how the official Hadoop documentation explains it.
Number of Maps
The number of maps is usually driven by the number of DFS blocks in the input files. Although that causes people to adjust their DFS block size to adjust the number of maps. The right level of parallelism for maps seems to be around 10-100 maps/node, although we have taken it up to 300 or so for very cpu-light map tasks. Task setup takes awhile, so it is best if the maps take at least a minute to execute.
Actually controlling the number of maps is subtle. The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default InputFormat behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size. Thus, if you expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k maps, unless your mapred.map.tasks is even smaller. Ultimately the InputFormat determines the number of maps.
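To make the upper-bound / lower-bound interaction concrete, here is a minimal sketch of the split-size computation as the classic org.apache.hadoop.mapred.FileInputFormat performs it, namely splitSize = max(minSize, min(goalSize, blockSize)). The numbers reproduce the 10TB/128MB example above; this is an illustration of the formula under those assumptions, not the library code itself:

    public class SplitSizeSketch {
        // Mirrors the old-API FileInputFormat: the mapred.map.tasks hint only
        // enters through goalSize, while the DFS block size caps the split from
        // above and mapred.min.split.size bounds it from below.
        static long computeSplitSize(long goalSize, long minSize, long blockSize) {
            return Math.max(minSize, Math.min(goalSize, blockSize));
        }

        public static void main(String[] args) {
            long totalSize = 10L * 1024 * 1024 * 1024 * 1024; // 10 TB of input
            long blockSize = 128L * 1024 * 1024;              // 128 MB DFS blocks
            long minSize   = 1;                               // default mapred.min.split.size
            int  mapHint   = 2;                               // mapred.map.tasks hint
            long goalSize  = totalSize / mapHint;             // 5 TB, far above the block size
            long splitSize = computeSplitSize(goalSize, minSize, blockSize);
            System.out.println(totalSize / splitSize);        // 81920, i.e. the ~82k maps above
        }
    }

So asking for 2 maps over 10TB of data is simply ignored: the block size wins, and only a larger mapred.min.split.size (or a custom InputFormat) can push the number of maps below one per block.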