MapReduce Data Flow

海兰

Category: Hadoop Development

Article link: https://blog.csdn.net/hadoop_/article/details/9300925


Hadoop does its best to run the map task on a node where the input data resides in HDFS, which is where the task performs best. This is called the data locality optimization because it doesn’t use valuable cluster bandwidth. Sometimes, however, all three nodes hosting the HDFS block replicas for a map task’s input split are running other map tasks, so the job scheduler will look for a free map slot on a node in the same rack as one of the blocks. Very occasionally even this is not possible, so an off-rack node is used, which results in an inter-rack network transfer.

The three possibilities:

a: Data-local map tasks

b: Rack-local map tasks

c: Off-rack map tasks
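The preference order among these three cases can be sketched in a few lines of Python. This is only an illustration of the idea, not Hadoop's scheduler: the `Node` class and the `classify_placement` and `pick_node` helpers are hypothetical names invented for this sketch, and a real scheduler weighs many more factors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical cluster node: a host name plus the rack it sits in."""
    name: str
    rack: str


def classify_placement(candidate, replica_nodes):
    """Return the locality level of running a map task on `candidate`."""
    if candidate in replica_nodes:
        return "data-local"      # case a: the node holds a replica of the block
    if any(candidate.rack == replica.rack for replica in replica_nodes):
        return "rack-local"      # case b: same rack as some replica
    return "off-rack"            # case c: needs an inter-rack transfer


# Lower number means more preferred.
_PREFERENCE = {"data-local": 0, "rack-local": 1, "off-rack": 2}


def pick_node(free_nodes, replica_nodes):
    """Pick the free node with the best locality, or None if nothing is free."""
    if not free_nodes:
        return None
    return min(
        free_nodes,
        key=lambda node: _PREFERENCE[classify_placement(node, replica_nodes)],
    )


# Assumed example layout: three replicas spread over two racks.
n1, n2, n3 = Node("n1", "rack1"), Node("n2", "rack1"), Node("n3", "rack1")
n5, n7 = Node("n5", "rack2"), Node("n7", "rack3")
replicas = [n1, n2, n5]

chosen = pick_node([n3, n7], replicas)
print(chosen.name, classify_placement(chosen, replicas))
```

In this layout every replica holder is assumed busy, so the sketch falls back to `n3`, which shares a rack with two of the replicas, rather than `n7`, which would require an inter-rack transfer.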

MapReduce data flow with a single reduce task:

Dashed box: a node

Dashed arrow: data transfer within a node

Solid arrow: data transfer between nodes
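The single-reducer flow can be imitated in memory with a small Python sketch. It assumes a word-count job purely for illustration, and the function names `map_task`, `reduce_task`, and `run_single_reduce` are made up for this example. Each map task produces key-sorted output, the outputs are merged, and the merged stream feeds the one reduce task. In a real cluster the merge step involves copying map output across the network to the reducer's node.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def map_task(split):
    """Emit (word, 1) pairs for one input split, sorted by key."""
    pairs = [(word, 1) for line in split for word in line.split()]
    return sorted(pairs, key=itemgetter(0))


def reduce_task(sorted_pairs):
    """Sum the values of each key in a key-sorted stream of pairs."""
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(sorted_pairs, key=itemgetter(0))
    }


def run_single_reduce(splits):
    """Run one map task per split and feed all outputs to a single reducer."""
    map_outputs = [map_task(split) for split in splits]
    # Merge the already-sorted map outputs into one key-sorted stream.
    merged = heapq.merge(*map_outputs, key=itemgetter(0))
    return reduce_task(merged)


# Two assumed input splits, each a list of text lines.
splits = [["a b a"], ["b c"]]
print(run_single_reduce(splits))
```

Because there is only one reducer, every key from every map task ends up in the same place, so no routing decision is needed.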

MapReduce data flow with multiple reduce tasks:

This diagram makes it clear why the data flow between map and reduce tasks is colloquially known as “the shuffle,” as each reduce task is fed by many map tasks. The shuffle is more complicated than this diagram suggests, and tuning it can have a big impact on job execution time.
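A toy version of the many-to-many pattern behind the shuffle might look like the following. The sketch assumes that each key is routed to a reducer by hashing the key modulo the number of reducers. `zlib.crc32` stands in for the hash only because it is deterministic across Python runs, and the `partition` and `shuffle` names are hypothetical. Sorting, spilling to disk, and network copying, which make the real shuffle complicated, are deliberately left out.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Deterministic stand-in for a hash partitioner (an assumption here)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Route every (key, value) pair from every map task to one reducer.

    Returns (inputs, sources): inputs[r] maps each key to its list of values
    for reducer r, and sources[r] is the set of map task indices that sent
    at least one pair to reducer r.
    """
    inputs = [defaultdict(list) for _ in range(num_reducers)]
    sources = [set() for _ in range(num_reducers)]
    for map_id, pairs in enumerate(map_outputs):
        for key, value in pairs:
            r = partition(key, num_reducers)
            inputs[r][key].append(value)
            sources[r].add(map_id)
    return inputs, sources


# Assumed output of three map tasks.
map_outputs = [
    [("a", 1), ("b", 1)],
    [("a", 1), ("c", 1)],
    [("b", 1), ("c", 1)],
]
inputs, sources = shuffle(map_outputs, num_reducers=2)
for r, (keys, fed_by) in enumerate(zip(inputs, sources)):
    print(f"reducer {r}: keys={sorted(keys)} fed by map tasks {sorted(fed_by)}")
```

All values for a given key land on the same reducer, while a single reducer typically receives pairs from several map tasks, which is the crossing pattern the diagram shows.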
