Why do it
The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code.
Programming Model
Map
Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.
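As a concrete illustration, here is the paper's word-count Map function as a Python sketch (the real library uses a C++ interface with an explicit Emit call; the generator style here is my own stand-in):

```python
def word_count_map(key, value):
    """Map: takes an input pair (document name, document contents)
    and emits intermediate (word, count) pairs -- one per occurrence,
    as in the paper's word-count example."""
    for word in value.split():
        yield (word, 1)

# Each intermediate pair associates a word with a partial count of 1.
pairs = list(word_count_map("doc1", "the quick the"))
# pairs == [("the", 1), ("quick", 1), ("the", 1)]
```

The library then groups these pairs by their intermediate key (here, the word) before handing them to Reduce.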
Reduce
The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. The intermediate values are supplied to the user’s reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.
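The corresponding word-count Reduce function, again as a hedged Python sketch: because the values arrive through an iterator, the sum is computed in streaming fashion without ever materializing the full value list in memory.

```python
def word_count_reduce(key, values):
    """Reduce: receives an intermediate key (a word) and an iterator
    over all values emitted for that key; merges them into a single
    count. The iterator is consumed once, so this works even when the
    value list is too large to fit in memory."""
    total = 0
    for v in values:   # streaming: no full list held in memory
        total += v
    return (key, total)

# Values can be supplied lazily, e.g. via a generator or iterator:
result = word_count_reduce("the", iter([1, 1, 1]))
# result == ("the", 3)
```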
Execution overview
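These notes do not reproduce the paper's execution diagram, but the overall flow can be modeled as three phases: a map phase over all input pairs, a shuffle that groups intermediate values by key, and a reduce phase over each group. A toy single-process sketch (function and variable names are my own, and the real library distributes each phase across thousands of workers):

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy sequential model of the MapReduce execution flow:
    map phase -> shuffle (group values by intermediate key) -> reduce phase."""
    groups = defaultdict(list)
    # Map phase: apply map_fn to every input (key, value) pair.
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)   # shuffle/group step
    # Reduce phase: one reduce_fn call per distinct intermediate key.
    return [reduce_fn(ikey, iter(ivals)) for ikey, ivals in sorted(groups.items())]

# Word count expressed against this toy driver:
out = run_mapreduce(
    [("d1", "a b a"), ("d2", "b")],
    lambda k, v: ((w, 1) for w in v.split()),
    lambda k, vs: (k, sum(vs)),
)
# out == [("a", 2), ("b", 2)]
```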
Conclusions
why this model is a success
- the model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault-tolerance, locality optimization, and load balancing.
- a large variety of problems are easily expressible as MapReduce computations.
- we have developed an implementation of MapReduce that scales to large clusters comprising thousands of machines.
Experiences
- restricting the programming model makes it easy to parallelize and distribute computations and to make such computations fault-tolerant.
- network bandwidth is a scarce resource: the locality optimization allows us to read data from local disks, and writing a single copy of the intermediate data to local disk saves network bandwidth.
- redundant execution can be used to reduce the impact of slow machines, and to handle machine failures and data loss.