MapReduce = functional programming meets distributed processing
MapReduce can refer to:
The programming model;
The execution framework (a.k.a. “runtime”);
The specific implementation.
Functional programming
Two important concepts in functional programming
- Map: takes a function f and applies it to every element in a list
- Fold: iteratively applies a function g to aggregate results
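These two concepts map directly onto Python built-ins: `map` applies a function element-wise, and `functools.reduce` is a fold. A minimal sketch (the function names `f` and `g` follow the definitions above):

```python
from functools import reduce

nums = [1, 2, 3, 4, 5]

# Map: take a function f and apply it to every element in a list.
f = lambda x: x * x
squares = list(map(f, nums))            # [1, 4, 9, 16, 25]

# Fold: iteratively apply a function g to aggregate results,
# starting from an initial accumulator (0 here).
g = lambda acc, x: acc + x
total = reduce(g, squares, 0)           # 55
```

Note that the map step is trivially parallelizable because `f` is applied to each element independently; the fold carries an accumulator and is inherently sequential unless `g` is associative.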
MapReduce
Assume a long list of records. Imagine if:
- we can parallelize map operations
- we have a mechanism for bringing map results back together in the fold operation
That’s MapReduce! (and Hadoop)
Typical Big Data Problem
- Iterate over a large number of records
- Extract something of interest from each (Map)
- Shuffle and sort intermediate results
- Aggregate intermediate results (Reduce)
- Generate final output
Mappers and Reducers
In MapReduce, the programmer defines a mapper and a reducer with the following signatures:
map: (k1, v1) -> [(k2, v2)]
reduce: (k2, [v2]) -> [(k3, v3)]
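The canonical example is word count. The sketch below is an in-memory simulation of the model (not actual Hadoop code): the mapper takes a line offset `k1` and a line of text `v1` and emits `(word, 1)` pairs; a grouping step stands in for the framework's shuffle and sort; the reducer sums the counts for each word.

```python
from collections import defaultdict

# map: (k1, v1) -> [(k2, v2)]  -- k1 is a byte offset, v1 a line of text
def mapper(offset, line):
    return [(word, 1) for word in line.split()]

# reduce: (k2, [v2]) -> [(k3, v3)]
def reducer(word, counts):
    return [(word, sum(counts))]

def run(records):
    # Map phase
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(mapper(k1, v1))
    # Shuffle and sort: group all values with the same key
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce phase, keys processed in sorted order
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output

print(run([(0, "the quick fox"), (14, "the lazy dog")]))
# [('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

In a real cluster, the map calls run in parallel on different machines and the shuffle moves data over the network; the programmer still writes only `mapper` and `reducer`.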
All values with the same key are sent to the same reducer
The execution framework handles everything else
- Handles scheduling
- Handles “data distribution”
- Handles synchronization
- Handles errors and faults
- Everything happens on top of a distributed FS
Partitioners and Combiners
There are two additional elements that complete the programming model: partitioners and combiners.
Partitioners are responsible for dividing up the intermediate key space and
assigning intermediate key-value pairs to reducers.
Combiners are an optimization in MapReduce that allows for local aggregation before the shuffle and sort phase.
One can think of combiners as “mini-reducers” that take place on the output of the mappers.
partition: (k', number of partitions) -> partition for k'
combine: (k', [v']) -> [(k', v')]
Distributed File System
Don’t move data to workers, move workers to the data!
- GFS (Google File System) for Google’s MapReduce
- HDFS (Hadoop Distributed File System) for Hadoop
GFS
The distributed file system adopts a master-slave architecture in which the master maintains the file namespace (metadata, directory structure, file-to-block mapping, location of blocks, and access permissions) and the slaves manage the actual data blocks.
In GFS, the master is called the GFS master, and the slaves are called GFS chunkservers.
HDFS
In HDFS, the roles are filled by the namenode and datanodes, respectively.
Namenode Responsibilities:
Namespace management. The namenode is responsible for maintaining the file namespace, which includes metadata, directory structure, file-to-block mapping, location of blocks, and access permissions. These data are held in memory for fast access and all mutations are persistently logged.
Coordinating file operations. The namenode directs application clients to datanodes for read operations, and allocates blocks on suitable datanodes for write operations.
All data transfers occur directly between clients and datanodes. No data is moved through the namenode.
Maintaining overall health of the file system. Periodic contact with the datanodes; block re-replication and rebalancing; garbage collection.
Putting Everything Together: Hadoop Cluster Architecture
Putting everything together, we get the architecture of a complete Hadoop cluster.
Architecture of a complete Hadoop cluster, which consists of three separate components: the HDFS master (called the namenode), the job submission node (called the jobtracker), and many slave nodes (three shown here). Each of the slave nodes runs a tasktracker for executing map and reduce tasks and a datanode daemon for serving HDFS data.