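## MapReduce (Chinese Version)

### Other Implementations

The Nutch project has developed an experimental implementation of MapReduce [2].

### References

* Dean, Jeffrey & Ghemawat, Sanjay (2004). "MapReduce: Simplified Data Processing on Large Clusters". Retrieved Apr. 6, 2005.

^ "Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages." -"MapReduce: Simplified Data Processing on Large Clusters"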

## MapReduce (English Version)

MapReduce is a programming tool developed by Google in C++ for performing parallel computations over large (> 1 terabyte) data sets. The terms "Map" and "Reduce", and the general idea behind them, are borrowed from the map and reduce constructs of functional programming languages and from features of array programming languages. [1]

A computation is expressed by specifying a Map function that maps key-value pairs to new key-value pairs, followed by a Reduce function that consolidates all mapped key-value pairs sharing the same key into a single key-value pair.
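The key-value flow described above can be sketched in a few lines of single-process Python. This is only an illustration of the model, not Google's C++ implementation: the names `map_fn`, `reduce_fn`, and `run_mapreduce`, and the word-count task itself, are assumptions chosen for the example.

```python
from collections import defaultdict

def map_fn(doc_name, text):
    """Map: turn one (name, text) pair into many (word, 1) pairs."""
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    """Reduce: consolidate all values sharing a key into a single pair."""
    return (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical single-process driver: map, group by key, then reduce."""
    intermediate = defaultdict(list)
    # Map phase: apply map_fn to every input key-value pair.
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            # Grouping step: collect the values that share the same key.
            intermediate[out_key].append(out_value)
    # Reduce phase: one call per distinct intermediate key.
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

docs = [("a.txt", "the quick fox"), ("b.txt", "the lazy dog the end")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
print(word_counts)
```

In a real deployment the map calls and the per-key reduce calls would run on different machines; here they run in one loop so the data flow is easy to follow.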

### Map and Reduce

In simpler terms, a map function goes over a conceptual list of independent elements (for example, a list of test scores) and performs a specified operation on each element. Continuing the example, suppose a flaw in the test gave each student a score too high by one; one could define a map function of "minus 1" that subtracts one from each score, correcting them. Because each element is operated on independently, and the original list is not modified (a new list is created to hold the results), a map operation is very easy to make highly parallel, and is thus useful in high-performance applications and in domains like parallel programming.
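The "minus 1" example can be written out directly. The sketch below uses made-up scores and a hypothetical `minus_one` function, and runs the same map both sequentially and across a small thread pool to show that the result does not depend on the order in which elements are processed.

```python
from concurrent.futures import ThreadPoolExecutor

def minus_one(score):
    """The per-element operation: correct a score that is too high by one."""
    return score - 1

scores = [91, 78, 85, 100]  # illustrative data, not from the source

# Sequential map: builds a new list and leaves `scores` untouched.
corrected = list(map(minus_one, scores))

# Parallel map: each element is independent, so workers can share the list.
# Executor.map returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    corrected_parallel = list(pool.map(minus_one, scores))

print(corrected, corrected_parallel)
```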

A reduce operation, on the other hand, usually takes a list and combines its elements appropriately. Continuing the preceding example, what if one wanted to know the class average? One could define a reduce function that halves the size of the list by adding each entry to its neighbor, repeating recursively until only one (large) entry remains, and then divide that total by the original number of elements to get the average. Since a reduce always ends up with a single answer, it is not as parallelizable as a map function; still, the large number of fairly independent calculations means that reduce functions remain useful in highly parallel environments.
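The halving scheme for the class average might look like the following sketch. The function name `pairwise_sum` and the sample scores are assumptions for illustration; the additions within each round are independent of one another, which is where the remaining parallelism comes from.

```python
def pairwise_sum(values):
    """Repeatedly add each entry to its neighbor until one entry remains."""
    values = list(values)
    if not values:
        raise ValueError("cannot reduce an empty list")
    while len(values) > 1:
        # Every addition in this round is independent of the others.
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2 == 1:
            # An odd element out is carried into the next round unchanged.
            paired.append(values[-1])
        values = paired
    return values[0]

scores = [90, 77, 84, 99]  # illustrative data
average = pairwise_sum(scores) / len(scores)
print(average)
```

Each round roughly halves the list, so a list of n entries needs about log2(n) rounds rather than n - 1 sequential additions.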

### Distribution and reliability

MapReduce achieves reliability by parceling out operations on the data set to each node in the network; each node is expected to report back periodically with completed work and status updates. If a node falls silent for longer than that interval, the master node (similar to the master server in the Google File System) records the node as dead and sends the node's assigned data out to other nodes. Individual operations use atomic operations for naming file outputs as a double check to ensure that no conflicting threads are running in parallel; when files are renamed, it is also possible to copy them to another name in addition to the name of the task (allowing for side effects).
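The failure-detection behavior described above can be modeled with a small simulation. Everything in this sketch is hypothetical (the `Master` class, the round-robin reassignment, and the use of explicit `now` timestamps instead of a real clock); it only illustrates the idea of marking a silent node as dead and handing its work to others.

```python
class Master:
    """Toy master that tracks worker heartbeats and reassigns orphaned tasks."""

    def __init__(self, timeout):
        self.timeout = timeout   # longest allowed silence
        self.last_seen = {}      # worker -> time of last report
        self.assignments = {}    # worker -> list of assigned tasks
        self.dead = set()
        self.pending = []        # tasks with no live worker to take them

    def assign(self, worker, task, now):
        self.assignments.setdefault(worker, []).append(task)
        self.last_seen.setdefault(worker, now)

    def heartbeat(self, worker, now):
        if worker not in self.dead:
            self.last_seen[worker] = now

    def check(self, now):
        """Record silent workers as dead and hand their tasks to live ones."""
        for worker, seen in list(self.last_seen.items()):
            if worker in self.dead or now - seen <= self.timeout:
                continue
            self.dead.add(worker)
            orphaned = self.assignments.pop(worker, [])
            live = [w for w in self.last_seen if w not in self.dead]
            for i, task in enumerate(orphaned):
                if live:
                    # Round-robin over the surviving workers (an assumption).
                    self.assignments.setdefault(live[i % len(live)], []).append(task)
                else:
                    self.pending.append(task)

master = Master(timeout=10)
master.assign("w1", "split-0", now=0)
master.assign("w2", "split-1", now=0)
master.heartbeat("w1", now=8)   # w1 reports in; w2 stays silent
master.check(now=15)
print(master.dead, master.assignments)
```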

The reduce operations operate in much the same way, but because they parallelize less well, the master node attempts to schedule each reduce operation on the node holding the data being operated on, or as close to it as possible. This property is desirable for Google because it conserves bandwidth, which is scarce on its internal networks.
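One way to read "the same node, or as close as possible" is as a tiered preference when choosing a worker. The text does not say how closeness is measured, so the sketch below assumes a simple rack map (`rack_of`) and a hypothetical `pick_worker` helper: same node first, then same rack, then any idle worker.

```python
def pick_worker(data_hosts, idle_workers, rack_of):
    """Prefer a worker that holds the data, then one on the same rack."""
    # 1. A worker on a node that already stores the data needs no transfer.
    for worker in idle_workers:
        if worker in data_hosts:
            return worker
    # 2. Otherwise prefer a worker sharing a rack with a data host.
    data_racks = {rack_of[host] for host in data_hosts}
    for worker in idle_workers:
        if rack_of[worker] in data_racks:
            return worker
    # 3. Fall back to any idle worker, or None if there is none.
    return idle_workers[0] if idle_workers else None

# Hypothetical cluster layout: two racks with two nodes each.
rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
print(pick_worker({"n1"}, ["n3", "n2"], rack_of))
```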

### Uses

According to Google, MapReduce is used in a wide range of applications, including: "distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation..." Most significantly, when MapReduce was finished, it was used to completely regenerate Google's index of the Internet, replacing the old ad hoc programs that updated the index.
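As an illustration of one item from that list, inverted index construction fits the model naturally: map emits a (word, document) pair for each word, and reduce gathers the documents for each word. The function names and the three tiny documents below are assumptions made for the example.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map: emit (word, doc_id) once for each distinct word in the document."""
    return [(word, doc_id) for word in set(text.lower().split())]

def reduce_fn(word, doc_ids):
    """Reduce: collect the sorted list of documents containing the word."""
    return (word, sorted(doc_ids))

def build_inverted_index(docs):
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for word, found_in in map_fn(doc_id, text):
            grouped[word].append(found_in)
    return dict(reduce_fn(word, ids) for word, ids in grouped.items())

docs = {"d1": "Map then reduce", "d2": "reduce the list", "d3": "map the list"}
index = build_inverted_index(docs)
print(index["reduce"])
```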

MapReduce generates a large number of intermediate, temporary files, which are generally managed by, and accessed through, Google File System, for greater performance.

### Other Implementations

The Nutch project has developed an experimental implementation [2] of MapReduce.

### References

* Dean, Jeffrey & Ghemawat, Sanjay (2004). "MapReduce: Simplified Data Processing on Large Clusters". Retrieved Apr. 6, 2005.

↑ "Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages." -"MapReduce: Simplified Data Processing on Large Clusters"
