MapReduce is a software architecture proposed by Google for parallel computation on large data sets (larger than 1 TB).


geneculture


MapReduce is a patented^[1] software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers.^[2]

The framework is inspired by the map and reduce functions commonly used in functional programming,^[3] although their purpose in the MapReduce framework is not the same as their original forms.^[4]

MapReduce libraries have been written in C++, C#, Erlang, Java, OCaml, Perl, Python, PHP, Ruby, F#, R and other programming languages.


Overview

MapReduce is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or as a grid (if the nodes use different hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or within a database (structured).

"Map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output – the answer to the problem it was originally trying to solve.
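The two steps above can be sketched in a few lines of Python. This is only a single-process illustration, not Google's implementation: the names `partition`, `worker`, and `master` are hypothetical, the "worker nodes" are ordinary function calls, and the sub-problem (a sum of squares) is chosen purely for demonstration.

```python
from functools import reduce


def partition(data, n_chunks):
    """Split the input into at most n_chunks contiguous sub-problems."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def worker(chunk):
    """Hypothetical worker: solve one sub-problem (sum of squares here)."""
    return sum(x * x for x in chunk)


def master(data, n_workers=4):
    """Hypothetical master: distribute sub-problems, then combine answers."""
    chunks = partition(data, n_workers)
    # "Map" step: each worker processes its own sub-problem independently.
    partial_answers = [worker(chunk) for chunk in chunks]
    # "Reduce" step: combine the partial answers into the final output.
    return reduce(lambda a, b: a + b, partial_answers, 0)


if __name__ == "__main__":
    print(master(list(range(10))))
```

In a real cluster the calls to `worker` would run on separate machines; here they run one after another so that the control flow stays easy to follow.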

The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel – though in practice it is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase – all that is required is that all outputs of the map operation that share the same key are presented to the same reducer, at the same time. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle – a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled – assuming the input data are still available.
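The requirement that all map outputs sharing a key reach the same reducer can be illustrated with a small word count. This is a minimal sketch under stated assumptions: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical, a thread pool from Python's standard library stands in for a cluster of mapper nodes, and fault handling is left out.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(line):
    """Map: emit a (key, value) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key so each key is handled by exactly one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce: combine all values that share a key."""
    return key, sum(values)


def word_count(lines, max_workers=4):
    # Each line is mapped independently, so the maps may run in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        mapped = list(pool.map(map_words, lines))
    pairs = [pair for part in mapped for pair in part]
    groups = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(word_count(["a b a", "b c"]))
```

Because `map_words` depends only on its own line, the order in which the pool runs the calls does not affect the result; the grouping in `shuffle` is what guarantees that every count for a given word is summed in one place.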

http://en.wikipedia.org/wiki/MapReduce
