MapReduce (repost)

lyso1

MapReduce: A Major Step Backwards

http://www.pgsqldb.org/mwiki/index.php/MapReduce:_%E4%B8%80%E4%B8%AA%E5%B7%A8%E5%A4%A7%E7%9A%84%E5%80%92%E9%80%80

What is MapReduce?

The basic idea of MapReduce is straightforward. It consists of two programs that the user writes called map and reduce plus a framework for executing a possibly large number of instances of each program on a compute cluster.

The map program reads a set of "records" from an input file, does any desired filtering and/or transformations, and then outputs a set of records of the form (key, data). As the map program produces output records, a "split" function partitions the records into M disjoint buckets by applying a function to the key of each output record. This split function is typically a hash function, though any deterministic function will suffice. When a bucket fills, it is written to disk. The map program terminates with M output files, one for each bucket.
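
The behaviour of one map instance can be sketched as follows. This is only an illustration under stated assumptions: the names `map_instance`, `split`, and `word_count_map` are hypothetical, in-memory lists stand in for the M bucket files, and `zlib.crc32` is just one possible deterministic hash (Python's built-in `hash()` is randomized per process for strings, so it would not give every map instance the same answer).

```python
import zlib
from typing import Callable, Iterable, Iterator

Record = tuple[str, int]  # (key, data)


def split(key: str, m: int) -> int:
    """Deterministic split function: assign a key to one of m buckets."""
    return zlib.crc32(key.encode("utf-8")) % m


def map_instance(
    records: Iterable[str],
    map_fn: Callable[[str], Iterator[Record]],
    m: int,
) -> list[list[Record]]:
    """Run map_fn over the input and partition its output into m buckets.

    Each inner list stands in for one of the M output files.
    """
    buckets: list[list[Record]] = [[] for _ in range(m)]
    for record in records:
        for key, data in map_fn(record):
            buckets[split(key, m)].append((key, data))
    return buckets


def word_count_map(line: str) -> Iterator[Record]:
    """Example map program: emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word.lower(), 1


buckets = map_instance(["a b a", "b c"], word_count_map, m=3)
```

Because `split` depends only on the key, every record with the same key lands in the same bucket, which is the property the reduce phase relies on.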

In general, there are multiple instances of the map program running on different nodes of a compute cluster. Each map instance is given a distinct portion of the input file by the MapReduce scheduler to process. If N nodes participate in the map phase, then there are M files on disk storage at each of N nodes, for a total of N * M files; Fi,j, 1 ≤ i ≤ N, 1 ≤ j ≤ M.

The key thing to observe is that all map instances use the same hash function. Hence, all output records with the same hash value will be in corresponding output files.

The second phase of a MapReduce job executes M instances of the reduce program, Rj, 1 ≤ j ≤ M. The input for each reduce instance Rj consists of the files Fi,j, 1 ≤ i ≤ N. Again, notice that all output records from the map phase with the same hash value will be consumed by the same reduce instance -- no matter which map instance produced them. After being collected by the map-reduce framework, the input records to a reduce instance are grouped on their keys (by sorting or hashing) and fed to the reduce program. Like the map program, the reduce program is an arbitrary computation in a general-purpose language. Hence, it can do anything it wants with its records. For example, it might compute some additional function over other data fields in the record. Each reduce instance can write records to an output file, which forms part of the "answer" to a MapReduce computation.
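
The two phases can be put together in a small single-process simulation. It is a sketch, not a real framework: the values of `N` and `M`, the word-count workload, and the names `run_map` and `run_reduce` are assumptions made for illustration, and nested lists stand in for the files Fi,j.

```python
import zlib
from collections import defaultdict

N, M = 3, 2  # N map instances, M reduce instances (illustrative values)


def split(key: str) -> int:
    """The same deterministic split function is used by every map instance."""
    return zlib.crc32(key.encode("utf-8")) % M


def run_map(lines: list[str]) -> list[list[tuple[str, int]]]:
    """One map instance: emit (word, 1) and partition into M buckets."""
    buckets = [[] for _ in range(M)]
    for line in lines:
        for word in line.split():
            buckets[split(word)].append((word, 1))
    return buckets


def run_reduce(files: list[list[tuple[str, int]]]) -> dict[str, int]:
    """One reduce instance: group its input by key, then sum each group."""
    groups = defaultdict(list)
    for f in files:
        for key, value in f:
            groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}


# Each map instance gets a distinct portion of the input.
portions = [["a b a"], ["b c"], ["c c a"]]
assert len(portions) == N

# F[i][j] is the file written by map instance i for bucket j: N * M in total.
F = [run_map(portion) for portion in portions]

# Reduce instance j consumes F[i][j] for every i.
outputs = [run_reduce([F[i][j] for i in range(N)]) for j in range(M)]

# Together, the reduce outputs form the answer.
answer = {k: v for out in outputs for k, v in out.items()}
print(answer)
```

Each word ends up in exactly one reduce output, regardless of which map instance emitted it, because all map instances share `split`.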

To draw an analogy to SQL, map is like the group-by clause of an aggregate query. Reduce is analogous to the aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute.
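
To make the analogy concrete, here is a hedged sketch of an average-per-group query written as a map function and a reduce function. The table, its columns, and the names `map_fn` and `reduce_fn` are hypothetical; the grouping loop plays the part that the framework would normally handle.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows of an employees table: (department, salary).
rows = [("eng", 100), ("ops", 80), ("eng", 120), ("ops", 60)]

# Roughly: SELECT department, AVG(salary) FROM employees GROUP BY department;


def map_fn(row):
    """Plays the role of GROUP BY: pick the grouping attribute as the key."""
    department, salary = row
    return department, salary


def reduce_fn(key, values):
    """Plays the role of the aggregate function, here AVG."""
    return key, mean(values)


# Grouping by key is what the framework does between the two phases.
groups = defaultdict(list)
for key, value in map(map_fn, rows):
    groups[key].append(value)

result = dict(reduce_fn(k, v) for k, v in groups.items())
```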

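MapReduce is an important Google technology: a programming model for computing over large volumes of data. The usual way to handle computation at that scale is parallel computing. For now, at least, parallel computing is still remote from many developers. MapReduce is a programming model that simplifies parallel computing, letting developers with little parallel-computing experience build parallel applications. In my view, this is where the value of MapReduce lies: by simplifying the programming model, it lowers the barrier to entry for developing parallel applications. Parallel computing demands more specialized knowledge than ordinary development does today; with MapReduce, it can be applied far more widely.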

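The name MapReduce comes from the model's two core operations, Map and Reduce. Anyone familiar with functional programming (FP) will recognize the two words at once. Put simply, Map transforms one collection of data into another, element by element, according to a rule given by a function; for example, mapping "multiply by 2" over [1, 2, 3, 4] yields [2, 4, 6, 8]. Reduce folds a collection of data into a result, again according to a rule given by a function; for example, reducing [1, 2, 3, 4] by summation yields 10, and reducing it by multiplication yields 24.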
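
The same three examples, written with Python's built-in `map` and `functools.reduce` as a small illustration:

```python
from functools import reduce
from operator import add, mul

data = [1, 2, 3, 4]

doubled = list(map(lambda x: x * 2, data))  # [2, 4, 6, 8]
total = reduce(add, data)                    # 10
product = reduce(mul, data)                  # 24
```

Note that `data` itself is left untouched; `map` produces a new sequence.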

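Map operates on each element independently. In FP, operations have no side effects; in other words, Map produces an entirely new set of data and leaves the original data unchanged. It is therefore highly parallel. Reduce does not parallelize as well as Map, but it always produces a relatively simple result, and in large-scale computations its pieces are relatively independent, so it too is reasonably well suited to parallel execution.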

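Both Map and Reduce take another function as an argument; in FP, such functions are called higher-order functions. Because they can be combined with other functions, we need only parallelize these two higher-order functions, Map and Reduce, rather than consider every function individually. This yields a framework built on Map and Reduce: application-specific logic lives in user code, which gains parallel processing capability by plugging into MapReduce. The precondition, of course, is that the computation be expressed as Map and Reduce operations, as the framework requires. Why Map and Reduce? As we have seen, in the Map step we parallelize the data, that is, split it apart, while Reduce brings the separated data back together. In other words, Map is a process of splitting and Reduce the corresponding merging, and between the split and the merge the computation gets done almost without our noticing. Viewed from the two ends of the computation, then, it is no different from the serial computation we are used to; all the complexity is hidden in the middle.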
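
A toy version of such a framework might look like the sketch below. The function `map_reduce` and its parameters are hypothetical, and it assumes that the reduce function is associative and that the input is non-empty. A thread pool is used only to show the split-and-merge structure; in CPython, threads do not speed up CPU-bound work, so a real system would distribute chunks across processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def map_reduce(
    map_fn: Callable[[T], U],
    reduce_fn: Callable[[U, U], U],
    data: Sequence[T],
    workers: int = 4,
) -> U:
    """Apply map_fn to every item, then fold the results with reduce_fn.

    Assumes reduce_fn is associative and data is non-empty, so that partial
    results from separate chunks can be combined in order.
    """
    chunk_size = max(1, -(-len(data) // workers))  # ceiling division
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    def process(chunk: Sequence[T]) -> U:
        # "Split": each chunk is mapped and partially reduced independently.
        return reduce(reduce_fn, map(map_fn, chunk))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process, chunks))

    # "Merge": combine the per-chunk results into the final answer.
    return reduce(reduce_fn, partials)


# User code only supplies the two functions.
sum_of_squares = map_reduce(lambda x: x * x, lambda a, b: a + b, range(1, 11))
```

From the caller's side this reads like an ordinary serial computation; the chunking and the pool are hidden inside `map_reduce`.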

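All of this capacity for parallelization is inseparable from FP. In fact, MapReduce is not the only model to draw inspiration from FP; several other parallel programming models have taken the same road. FP has many good things to offer, such as automatic memory management and dynamic typing. In earlier days, machine performance kept them from being widely adopted; now that machine performance is no longer the bottleneck, these ideas are gradually coming back to life.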

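As mentioned earlier, parallel computing poses a fairly high barrier for ordinary developers. We could perhaps ignore it in the past, but as Intel brings multi-core processors into everyday life, parallel computing will become far more mainstream; after all, nobody wants only one of the cores in their machine doing any work. Many current operating systems treat multiple cores as multiple processors, but it still takes multiple tasks to get a larger share of the CPU. For server-side applications, multitasking is the norm. Many desktop applications, however, run down a single path from start to finish. Moreover, multitasking was not designed specifically for parallel computing, so its granularity of control is coarse. If finer-grained parallelism is needed, multitasking becomes awkward, at least in terms of expressiveness.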

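What makes it hard for parallel computing to enter everyday development is the programming model. Anything too complex gets rejected; CORBA already serves as a cautionary example. MapReduce has shown us an acceptable programming model. More changes will follow, and Intel and AMD are both working on it. How quickly this happens, though, depends on how fast multi-core CPUs take over the market.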

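Copyright notice: when reposting, please cite the article's original source and author, and include this notice, in the form of a hyperlink.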
http://dreamhead.blogbus.com/logs/2617482.html
