Study Notes on Google's MapReduce Paper

Remoa

Category: Hadoop. Tags: mapreduce, Google paper translation

Article link: https://blog.csdn.net/Remoa_Dengqinyi/article/details/79405155

MapReduce: Simplified Data Processing on Large Clusters

Contents:

1. Abstract

2. Introduction

3. Programming Model

Abstract

（1）MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

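Paraphrase: MapReduce is a programming model, together with an implementation of it, for processing and generating large data sets. The user supplies a map function that turns one key/value pair into a set of intermediate key/value pairs, and a reduce function that merges all intermediate values sharing the same intermediate key. As the paper shows, many real-world tasks fit this model.

associated adj. related, connected

implementation [ˌɪmplɪmənˈteʃən] n. carrying out; a concrete realization of a design

processing n. handling or treatment (of data)

intermediate [ˌɪntərˈmi:diət] adj. in between, in the middle

expressible [ɪk'spresəbəl] adj. able to be expressed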

（2）Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

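Paraphrase: Programs written in this functional style are automatically parallelized and run on a large cluster of commodity machines. The run-time system handles the details: partitioning the input data, scheduling the program's execution across machines, handling machine failures, and managing the necessary communication between machines. As a result, programmers with no experience in parallel or distributed systems can easily use the resources of a large distributed system.

functional [ˈfʌŋkʃənəl] adj. relating to functions; here, in the style of functional programming (not "practical")

parallelize ['pærəlelˌaɪz] v. to make a computation run in parallel

commodity [kəˈmɑ:dəti] n. a product or article of trade; here, ordinary off-the-shelf hardware

take care of phr. to look after; here, to handle

required [rɪ'kwaɪəd] adj. necessary

inter-machine adj. between machines

utilize [ˈjutlˌaɪz] v. to use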

（3）Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

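Paraphrase: Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been written, and more than one thousand MapReduce jobs run on Google's clusters every day.

scalable ['skeɪləbəl] adj. able to grow to handle more work

computation [ˌkɑ:mpjuˈteɪʃn] n. a calculation

terabyte [ˈtɛrəˌbaɪt] n. TB, a unit of data size

upwards of [ˈʌpwərdz] phr. more than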

1 Introduction

（4）Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

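Paraphrase: Over the past five years, the authors and many colleagues at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents and web request logs, in order to compute various kinds of derived data: inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries on a given day, and so on. Most of these computations are conceptually simple. The input, however, is usually large, so the work has to be spread across hundreds or thousands of machines to finish in an acceptable time. The questions of how to parallelize the computation, distribute the data, and handle failures combine to bury the originally simple computation under large amounts of complex code.

raw data phr. unprocessed data

derived [dɪ'raɪvd] adj. obtained from something else; here, data computed from the raw data

inverted indices phr. plural of inverted index, a mapping from each word to the documents that contain it

conceptually [kənˈseptʃuəli] adv. in terms of concepts

straightforward [ˌstreɪtˈfɔ:rwərd] adj. simple, easy to understand

reasonable [ˈrizənəbəl] adj. sensible, acceptable

conspire [kənˈspaɪr] v. to plot; to act together toward a result

obscure [əbˈskjʊr] v. to hide; to make hard to understand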

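（5）As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical “record” in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

Paraphrase: In response to this complexity, we designed a new abstraction that lets us express the simple computations we wanted to perform while hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing inside a library. The abstraction is inspired by the map and reduce primitives found in Lisp and many other functional languages. We realized that most of our computations apply a map operation to each logical "record" of the input to produce a set of intermediate key/value pairs, and then apply a reduce operation to all values that share the same key in order to combine the derived data appropriately. Using a functional model with user-specified map and reduce operations lets us parallelize large computations easily and use re-execution as the primary fault-tolerance mechanism.

primitive [ˈprɪmɪtɪv] n. a basic built-in operation (as an adjective: original, basic)

present adj. existing, found (in)

intermediate [ˌɪntərˈmi:diət] adj. in between, in the middle

appropriately [əˈproʊpriətli] adv. in a suitable way

user-specified adj. specified by the user

re-execution n. running again

mechanism [ˈmɛkəˌnɪzəm] n. a means by which something is done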

（6）The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.

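Paraphrase: The main contributions of this work are a simple yet powerful interface that enables automatic parallelization and distribution of large-scale computations, and an implementation of that interface that achieves high performance on large clusters of commodity PCs.

scale [skeɪl] n. size, extent, level

combined with phr. together with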

（7）Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis for a rewrite of our production indexing system. Section 7 discusses related and future work.

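Paraphrase: Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored to our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 presents performance measurements of our implementation on a variety of tasks. Section 6 explores how MapReduce is used within Google, including our experience using it as the basis for rewriting our production indexing system. Section 7 discusses related and future work.

tailored [ˈteɪlərd] adj. customized, made to fit

refinement [rɪˈfaɪnmənt] n. an improvement, a small enhancement

measurement [ˈmeʒərmənt] n. the act or result of measuring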

2 Programming Model

（8）The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce.

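Paraphrase: The computation takes a set of input key/value pairs and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce.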

（9）Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.

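Paraphrase: Map, written by the user, takes one input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.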

（10）The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user’s reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.

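Paraphrase: Reduce, also written by the user, accepts an intermediate key I and the set of values for that key. It merges these values to form a possibly smaller set of values. Typically each Reduce invocation produces zero or one output value. The intermediate values reach the user's reduce function through an iterator, which makes it possible to handle lists of values too large to fit in memory.

form [fɔ:rm] v. to make up, to produce

typically [ˈtɪpɪklɪ] adv. usually, in a typical case

invocation [ˌɪnvəˈkeʃən] n. a call (of a function)

supply [səˈplaɪ] v. to provide

via [ˈvaɪə, ˈviə] prep. by way of, through

iterator [ɪtə'reɪtə] n. an object that yields the elements of a sequence one at a time

fit v. to be small enough to go into (as in "fit in memory")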
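
The iterator point in paragraph (10) can be illustrated with a minimal Python sketch. The names `stream_counts` and `reduce_sum` are made up for this example and are not part of the MapReduce library; the generator only stands in for values that the library would stream to the reduce function.

```python
from typing import Iterator


def stream_counts(n: int) -> Iterator[str]:
    """Hypothetical stand-in for intermediate values streamed by the library."""
    for _ in range(n):
        yield "1"  # values are produced one at a time; no full list is built


def reduce_sum(key: str, values: Iterator[str]) -> Iterator[str]:
    """Reduce that keeps only a running total, however many values arrive."""
    total = 0
    for v in values:
        total += int(v)
    yield str(total)


result = list(reduce_sum("the", stream_counts(1_000_000)))
print(result)
```

Because `reduce_sum` only ever holds the running total, its memory use does not grow with the number of values for the key.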

2.1 Example

（11）Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:

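Paraphrase: Consider the problem of counting how many times each word occurs in a large collection of documents. The user would write code similar to the following pseudo-code.

occurrence [əˈkɜ:rəns] n. an instance of something happening or appearing

pseudo ['su:doʊ] adj. not genuine; "pseudo-code" is informal, code-like notation not tied to a real language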

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

（12）The map function emits each word plus an associated count of occurrences (just ‘1’ in this simple example). The reduce function sums together all counts emitted for a particular word.

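Paraphrase: The map function emits each word together with an associated occurrence count (just "1" in this simple example). The reduce function adds up all the counts emitted for a particular word.

emit [ɪˈmɪt] v. to send out, to output

associated [əˈsoʊʃieɪtɪd] adj. connected, accompanying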
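
For readers who want to run the word-count example, here is a minimal single-process Python sketch. It is not the C++ library from the paper: `run_mapreduce` is a hypothetical stand-in that groups intermediate values by key in memory, and the two sample documents are invented for illustration.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    # key: document name, value: document contents
    for word in contents.split():
        yield word, "1"


def reduce_fn(word, counts):
    # key: a word, values: an iterator over counts (as strings)
    yield str(sum(int(c) for c in counts))


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    return [(k2, out)
            for k2 in sorted(groups)
            for out in reduce_fn(k2, iter(groups[k2]))]


docs = [("a.txt", "the quick fox"), ("b.txt", "the lazy dog the end")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)
```

As in the pseudo-code, counts travel as strings and the reduce function converts them; a real deployment would distribute the map and reduce calls across machines instead of looping locally.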

（13）In addition, the user writes code to fill in a mapreduce specification object with the names of the input and output files, and optional tuning parameters. The user then invokes the MapReduce function, passing it the specification object. The user’s code is linked together with the MapReduce library (implemented in C++). Appendix A contains the full program text for this example.

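Paraphrase: In addition, the user writes code that fills in a mapreduce specification object with the names of the input and output files and optional tuning parameters. The user then invokes the MapReduce function, passing it the specification object. The user's code is linked with the MapReduce library, which is implemented in C++. Appendix A contains the full program text for this example.

tune [tu:n] v. to adjust for better performance

appendix [əˈpɛndɪks] n. supplementary material at the end of a document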

2.2 Types

（14）Even though the previous pseudo-code is written in terms of string inputs and outputs, conceptually the map and reduce functions supplied by the user have associated types:

map (k1,v1) → list(k2,v2)

reduce (k2,list(v2)) → list(v2)

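Paraphrase: Although the pseudo-code above is written in terms of string inputs and outputs, conceptually the user-supplied map and reduce functions have associated types: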

map (key1,value1) → list(key2,value2)

reduce (key2,list(value2)) → list(value2)

in terms of phr. with regard to, expressed using
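
The two signatures above can be written down with Python type hints. This is only a sketch of the conceptual types; the aliases `MapFn` and `ReduceFn` and the two sample functions are invented here and do not come from the paper.

```python
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

K1 = TypeVar("K1")  # input key type
V1 = TypeVar("V1")  # input value type
K2 = TypeVar("K2")  # intermediate key type
V2 = TypeVar("V2")  # intermediate and output value type

# map (k1, v1) -> list(k2, v2)
MapFn = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
# reduce (k2, list(v2)) -> list(v2)
ReduceFn = Callable[[K2, Iterator[V2]], Iterable[V2]]


def words_by_length(doc_name: str, contents: str) -> Iterable[Tuple[int, str]]:
    # k1 = str, v1 = str; k2 = int (word length), v2 = str (the word)
    for word in contents.split():
        yield len(word), word


def distinct_words(length: int, words: Iterator[str]) -> Iterable[str]:
    # Output values have the same type (str) as the intermediate values.
    yield from sorted(set(words))


mapper: MapFn = words_by_length
reducer: ReduceFn = distinct_words
```

The sample pair shows the input domain (document name, text) differing from the intermediate and output domain (word length, word).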

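（15）I.e., the input keys and values are drawn from a different domain than the output keys and values. Furthermore, the intermediate keys and values are from the same domain as the output keys and values.

Paraphrase: That is, the input keys and values are drawn from a different domain (type) than the output keys and values, while the intermediate keys and values come from the same domain as the output keys and values.

i.e. abbr. that is, in other words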

（16）Our C++ implementation passes strings to and from the user-defined functions and leaves it to the user code to convert between strings and appropriate types.

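Paraphrase: Our C++ implementation passes strings to and from the user-defined functions; the user code is responsible for converting between strings and the appropriate types.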

2.3 More Examples

（17）Here are a few simple examples of interesting programs that can be easily expressed as MapReduce computations.

Paraphrase: Here are a few simple but interesting programs that are easy to express as MapReduce computations.

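（18）Distributed Grep: The map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.

Paraphrase: Distributed grep: the map function emits a line if it matches the supplied pattern. The reduce function is an identity function that simply copies the intermediate data it is given to the output.

grep n. (computing) a command that searches for lines matching a pattern

identity [aɪˈdɛntɪti] n. sameness; an identity function returns its input unchanged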
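
A minimal Python sketch of distributed grep follows. The paper does not say what the intermediate key is, so using the file name as the key is an assumption of this sketch, as are the helper names `make_grep_map` and `run_mapreduce` and the sample log contents.

```python
import re
from collections import defaultdict


def make_grep_map(pattern):
    """Build a map function that emits lines matching `pattern`."""
    regex = re.compile(pattern)

    def map_fn(filename, contents):
        for line in contents.splitlines():
            if regex.search(line):
                yield filename, line  # assumed key: the file name
    return map_fn


def identity_reduce(key, values):
    # Copies the intermediate data to the output unchanged.
    yield from values


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    return [(k2, out)
            for k2 in sorted(groups)
            for out in reduce_fn(k2, iter(groups[k2]))]


files = [("log1", "ok\nERROR disk\nok"), ("log2", "ERROR net")]
matches = run_mapreduce(files, make_grep_map(r"ERROR"), identity_reduce)
print(matches)
```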

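（19）Count of URL Access Frequency: The map function processes logs of web page requests and outputs <URL, 1>. The reduce function adds together all values for the same URL and emits a <URL, total count> pair.

Paraphrase: Count of URL access frequency: the map function processes logs of web page requests and outputs a <URL, 1> pair for each request. The reduce function adds together all values for the same URL and emits a <URL, total count> pair.

frequency [ˈfrikwənsi] n. how often something occurs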
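
The URL-frequency job can be sketched in Python as below. The log format ("method, then URL" on each line), the sample logs, and the in-memory driver `run_mapreduce` are all assumptions made for this illustration.

```python
from collections import defaultdict


def map_fn(log_name, log_text):
    # Assumed log format for this sketch: "<method> <url>" on each line.
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            yield parts[1], 1


def reduce_fn(url, counts):
    yield sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    return [(k2, out)
            for k2 in sorted(groups)
            for out in reduce_fn(k2, iter(groups[k2]))]


logs = [("day1.log", "GET /a\nGET /b\nGET /a"), ("day2.log", "GET /a")]
url_counts = run_mapreduce(logs, map_fn, reduce_fn)
print(url_counts)
```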

（20）Reverse Web-Link Graph: The map function outputs <target, source> pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: <target, list(source)>

Paraphrase: Reverse web-link graph: for each link to a target URL found in a page named source, the map function outputs a <target, source> pair. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair <target, list(source)>.

concatenate [kɑnˈkæt(ə)ˌneɪt] v. to join end to end
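
A minimal Python sketch of the reverse web-link graph follows. It assumes the links of each page have already been extracted into a list, which sidesteps HTML parsing; the page names and the driver `run_mapreduce` are hypothetical.

```python
from collections import defaultdict


def map_fn(source, targets):
    # Emit one <target, source> pair per outgoing link.
    for target in targets:
        yield target, source


def reduce_fn(target, sources):
    # Collect every page that links to `target` into one list.
    yield sorted(sources)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    return [(k2, out)
            for k2 in sorted(groups)
            for out in reduce_fn(k2, iter(groups[k2]))]


pages = [("a.html", ["b.html", "c.html"]), ("b.html", ["c.html"])]
reverse_links = run_mapreduce(pages, map_fn, reduce_fn)
print(reverse_links)
```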

（21）Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of <word, frequency> pairs. The map function emits a <hostname, term vector> pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final <hostname, term vector> pair.

Paraphrase: Term vector per host: a term vector summarizes the most important words in a document or set of documents as a list of <word, frequency> pairs. The map function emits a <hostname, term vector> pair for each input document, with the hostname extracted from the document's URL. The reduce function receives all per-document term vectors for a given host, adds them together, discards infrequent terms, and emits a final <hostname, term vector> pair.

summarize [ˈsʌməˌraɪz] v. to give a short account of the main points

extract [ɪkˈstrækt] v. to take out, to obtain

throw away phr. to discard
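
The term-vector job can be sketched with `collections.Counter`. The paper does not define "infrequent", so the threshold `MIN_COUNT`, like the sample URLs and the driver `run_mapreduce`, is an assumption of this sketch.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

MIN_COUNT = 2  # assumed cutoff below which a term counts as infrequent


def map_fn(url, contents):
    # One <hostname, term vector> pair per document.
    yield urlparse(url).hostname, Counter(contents.split())


def reduce_fn(hostname, term_vectors):
    total = sum(term_vectors, Counter())  # add the per-document vectors
    yield {term: n for term, n in total.items() if n >= MIN_COUNT}


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    return [(k2, out)
            for k2 in sorted(groups)
            for out in reduce_fn(k2, iter(groups[k2]))]


docs = [
    ("http://x.org/1", "map reduce map"),
    ("http://x.org/2", "map shuffle"),
    ("http://y.org/1", "sort sort"),
]
host_vectors = run_mapreduce(docs, map_fn, reduce_fn)
print(host_vectors)
```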

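（22）Inverted Index: The map function parses each document, and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.

Paraphrase: Inverted index: the map function parses each document and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs, and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to extend this computation so that it also records word positions.

parse [pɑ:rs] v. to analyze text into its parts

corresponding [ˌkɔ:rəˈspɑ:ndɪŋ] adj. matching, related

augment [ɔɡˈmɛnt] v. to extend, to add to

track v. to follow, to record

keep track of phr. to keep a record of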
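
A minimal Python sketch of the inverted index follows. Splitting on whitespace stands in for real document parsing, and the sample documents and the driver `run_mapreduce` are hypothetical.

```python
from collections import defaultdict


def map_fn(doc_id, contents):
    # Emit <word, document ID> once per distinct word in the document.
    for word in set(contents.split()):
        yield word, doc_id


def reduce_fn(word, doc_ids):
    # Sort the document IDs for this word into a posting list.
    yield sorted(doc_ids)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    return [(k2, out)
            for k2 in sorted(groups)
            for out in reduce_fn(k2, iter(groups[k2]))]


docs = [(2, "map reduce"), (1, "map sort")]
index = run_mapreduce(docs, map_fn, reduce_fn)
print(index)
```

Tracking word positions, as the paragraph suggests, would amount to emitting <word, (document ID, position)> from the map function instead.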

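（23）Distributed Sort: The map function extracts the key from each record, and emits a <key, record> pair. The reduce function emits all pairs unchanged. This computation depends on the partitioning facilities described in Section 4.1 and the ordering properties described in Section 4.2.

Paraphrase: Distributed sort: the map function extracts the key from each record and emits a <key, record> pair. The reduce function emits all pairs unchanged. The computation relies on the partitioning facilities described in Section 4.1 and the ordering properties described in Section 4.2.

facility [fəˈsɪləti] n. a capability or mechanism provided by a system

ordering n. arrangement in sequence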
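
The sketch below shows one way the sort could work in a single process. Sections 4.1 and 4.2 are not reproduced in these notes, so the range partitioner and the "keys processed in increasing order within a partition" behavior are assumptions of this sketch; `run_sort`, `range_partition`, and the sample records are hypothetical.

```python
import bisect
from collections import defaultdict


def map_fn(_, record):
    # Assumed record format: "key,payload"; the key is the first field.
    yield record.split(",")[0], record


def reduce_fn(key, records):
    yield from records  # emit everything unchanged


def range_partition(key, boundaries):
    # Keys below boundaries[0] go to partition 0, and so on.
    return bisect.bisect_right(boundaries, key)


def run_sort(records, boundaries):
    partitions = [defaultdict(list) for _ in range(len(boundaries) + 1)]
    for i, record in enumerate(records):
        for key, value in map_fn(i, record):
            partitions[range_partition(key, boundaries)][key].append(value)
    output = []
    for part in partitions:          # partitions cover increasing key ranges
        for key in sorted(part):     # keys in increasing order per partition
            output.extend(reduce_fn(key, iter(part[key])))
    return output


records = ["pear,4", "apple,1", "zebra,9", "mango,2"]
sorted_records = run_sort(records, boundaries=["m"])
print(sorted_records)
```

Because each partition holds a contiguous key range and is processed in key order, concatenating the partition outputs yields a globally sorted result.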