MapReduce = functional programming meets distributed processing
MapReduce can refer to:
The programming model;
The execution framework (a.k.a. “runtime”);
The specific implementation.
Functional programming
Two important concepts in functional programming
- Map: takes a function f and applies it to every element in a list
- Fold: iteratively applies a function g to aggregate results
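These two concepts map directly onto Python built-ins: `map` applies a function element-wise, and `functools.reduce` is a fold. A minimal sketch (the function names `f` and `g` follow the definitions above):

```python
from functools import reduce

nums = [1, 2, 3, 4, 5]

# Map: take a function f and apply it to every element in a list.
f = lambda x: x * x
squares = list(map(f, nums))            # [1, 4, 9, 16, 25]

# Fold: iteratively apply a function g to aggregate results,
# starting from an initial accumulator (0 here).
g = lambda acc, x: acc + x
total = reduce(g, squares, 0)           # 55
```

Note that the map step is trivially parallelizable because `f` is applied to each element independently; the fold carries an accumulator and is inherently sequential unless `g` is associative.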
MapReduce
Assume a long list of records. Imagine if:
- we can parallelize map operations
- we have a mechanism for bringing map results back together in the fold operation
That’s MapReduce! (and Hadoop)
Typical Big Data Problem
- Iterate over a large number of records
- Extract something of interest from each (Map)
- Shuffle and sort intermediate results
- Aggregate intermediate results (Reduce)
- Generate final output
Mappers and Reducers
In MapReduce, the programmer defines a mapper and a reducer with the following signatures:
map: (k1, v1) -> [(k2, v2)]
reduce: (k2, [v2]) -> [(k3, v3)]
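The canonical example is word count. The sketch below is an in-memory simulation of the model (not actual Hadoop code): the mapper takes a line offset `k1` and a line of text `v1` and emits `(word, 1)` pairs; a grouping step stands in for the framework's shuffle and sort; the reducer sums the counts for each word.

```python
from collections import defaultdict

# map: (k1, v1) -> [(k2, v2)]  -- k1 is a byte offset, v1 a line of text
def mapper(offset, line):
    return [(word, 1) for word in line.split()]

# reduce: (k2, [v2]) -> [(k3, v3)]
def reducer(word, counts):
    return [(word, sum(counts))]

def run(records):
    # Map phase
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(mapper(k1, v1))
    # Shuffle and sort: group all values with the same key
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce phase, keys processed in sorted order
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output

print(run([(0, "the quick fox"), (14, "the lazy dog")]))
# [('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

In a real cluster, the map calls run in parallel on different machines and the shuffle moves data over the network; the programmer still writes only `mapper` and `reducer`.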
All values with the same key are sent to the same reducer
The execution framework handles everything else
- Handles scheduling
- Handles “data distribution”
- Handles synchronization
- Handles errors and faults
- Everything happens on top of a distributed FS
Partitioners and Combiners
There are two additional elements that complete the programming model: partitioners and combiners.
Partitioners are responsible for dividing up the intermediate key space and
assigning intermediate key-value pairs to reducers.
Combiners are an optimization in MapReduce that allows for local aggregation before the shuffle and sort phase.
One can think of combiners as “mini-reducers” that take place on the output of the mappers.
partition: (k', number of partitions) -> partition for k'
combine: (k', [v']) -> [(k', v')]
Distributed File System
Don’t move data to workers, move workers to the data!
- GFS (Google File System) for Google’s MapReduce
- HDFS (Hadoop Distributed File System) for Hadoop
GFS
The distributed file system adopts a master-slave architecture in which the master maintains the file namespace (metadata, directory structure, file-to-block mapping, location of blocks, and access permissions) and the slaves manage the actual data blocks.
In GFS, the master is called the GFS master, and the slaves are called GFS chunkservers.
HDFS
In HDFS, the roles are filled by the namenode and datanodes, respectively.
Namenode Responsibilities:
Namespace management. The namenode is responsible for maintaining the file namespace, which includes metadata, directory structure, file-to-block mapping, location of blocks, and access permissions. These data are held in memory for fast access and all mutations are persistently logged.
Coordinating file operations. The namenode directs application clients to datanodes for read operations, and allocates blocks on suitable datanodes for write operations.
All data transfers occur directly between clients and datanodes. No data is moved through the namenode.
Maintaining overall health of the file system. Periodic contact with the datanodes; block re-replication and rebalancing; garbage collection.
Putting Everything Together: Hadoop Cluster Architecture
Putting everything together, we get the architecture of a complete Hadoop cluster.
Architecture of a complete Hadoop cluster, which consists of three separate components: the HDFS master (called the namenode), the job submission node (called the jobtracker), and many slave nodes (three shown here). Each of the slave nodes runs a tasktracker for executing map and reduce tasks and a datanode daemon for serving HDFS data.