Cloud Computing (1): Introduction to MapReduce

MapReduce = functional programming meets distributed processing

MapReduce can refer to:
- The programming model;
- The execution framework (aka “runtime”);
- The specific implementation.

Functional programming

Two important concepts in functional programming:
- Map: takes a function f and applies it to every element in a list
- Fold: iteratively applies a function g to aggregate results
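In Python, for instance, these two primitives look like this (fold is `functools.reduce`):

```python
from functools import reduce  # Python's fold

# Map: apply a function f to every element in a list.
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]

# Fold: iteratively apply a function g to aggregate the results.
total = reduce(lambda acc, x: acc + x, squares, 0)   # 30
```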
Map and Fold

MapReduce

Let’s assume a long list of records. Imagine if:

  • we can parallelize the map operations
  • we have a mechanism for bringing map results back together in the fold operation

That’s MapReduce! (and Hadoop)
Typical Big Data Problem
  • Iterate over a large number of records
  • Extract something of interest from each (Map)
  • Shuffle and sort intermediate results
  • Aggregate intermediate results (Reduce)
  • Generate final output
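The steps above can be simulated in a single Python process. This is a toy word count under an assumed input of text lines; a real MapReduce job distributes each phase across machines:

```python
from itertools import groupby
from operator import itemgetter

# Iterate over a large number of records (toy input).
records = ["the quick brown fox", "the lazy dog", "the fox"]

# MAP: extract something of interest — a (word, 1) pair per word.
intermediate = [(word, 1) for line in records for word in line.split()]

# SHUFFLE AND SORT: bring all pairs with the same key together.
intermediate.sort(key=itemgetter(0))

# REDUCE: aggregate the intermediate results per key.
counts = {key: sum(v for _, v in group)
          for key, group in groupby(intermediate, key=itemgetter(0))}
```

Here `counts` is the final output, mapping each word to its total occurrences.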
Mappers and Reducers

MapReduce
In MapReduce, the programmer defines a mapper and a reducer with the following signatures:

map: (k1, v1) -> [(k2, v2)]
reduce: (k2, [v2]) -> [(k3, v3)]

All values with the same key are sent to the same reducer.
The execution framework handles everything else:
- Handles scheduling
- Handles “data distribution”
- Handles synchronization
- Handles errors and faults
- Everything happens on top of a distributed FS
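A word-count mapper and reducer matching these signatures can be sketched in Python (the key/value types are illustrative assumptions: k1 a line offset, v1 the line's text; Hadoop itself exposes a Java API):

```python
# map: (k1, v1) -> [(k2, v2)]
# Assumed types: k1 = byte offset of a line, v1 = the line's text.
def mapper(key, value):
    # Emit (word, 1) for every word in the line.
    return [(word, 1) for word in value.split()]

# reduce: (k2, [v2]) -> [(k3, v3)]
def reducer(key, values):
    # All counts for one word arrive at the same reducer; sum them.
    return [(key, sum(values))]
```

Because all values with the same key are sent to the same reducer, `reducer("fox", ...)` is guaranteed to see every 1 emitted for "fox", no matter which mapper produced it.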

Partitioners and Combiners

There are two additional elements that complete the programming model: partitioners and combiners.
Partitioners are responsible for dividing up the intermediate key space and
assigning intermediate key-value pairs to reducers.

Combiners are an optimization in MapReduce that allows for local aggregation before the shuffle and sort phase.
One can think of combiners as “mini-reducers” that run on the output of the mappers.

partition(k', number of partitions)
combine(k', v') -> <k', v'>*
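A minimal Python sketch of both, using Python's built-in `hash` as a stand-in for Hadoop's default hash partitioner (the function bodies are illustrative assumptions, not Hadoop's implementation):

```python
def partition(key, num_partitions):
    # Assign an intermediate key to one of the reducers.
    # Deterministic: the same key always lands in the same partition,
    # so all values for that key reach the same reducer.
    return hash(key) % num_partitions

def combine(key, values):
    # "Mini-reducer" run on a mapper's local output before the
    # shuffle and sort phase, cutting the data sent over the network.
    return [(key, sum(values))]
```

Note that combining is only safe when the reduce operation is associative and commutative, as summation is here.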

Workflow

Distributed File System

Don’t move data to workers, move workers to the data!

  • GFS(Google File System) for Google’s MapReduce
  • HDFS(Hadoop Distributed File System) for Hadoop

GFS

The distributed file system adopts a master-slave architecture in which the master maintains the file namespace (metadata, directory structure, file-to-block mapping, location of blocks, and access permissions) and the slaves manage the actual data blocks.
In GFS, the master is called the GFS master, and the slaves are called GFS chunkservers.

HDFS

In HDFS, the roles are filled by the namenode and datanodes, respectively.
    HDFS

Namenode Responsibilities:

  • Namespace management. The namenode is responsible for maintaining the file namespace, which includes metadata, directory structure, file-to-block mapping, location of blocks, and access permissions. These data are held in memory for fast access and all mutations are persistently logged.

  • Coordinating file operations. The namenode directs application clients to datanodes for read operations, and allocates blocks on suitable datanodes for write operations.
    All data transfers occur directly between clients and datanodes. No data is moved through the namenode.

  • Maintaining overall health of the file system. Periodic contact with the datanodes; block re-replication and rebalancing; garbage collection.

Putting Everything Together: Hadoop Cluster Architecture

Putting everything together, we get the architecture of a complete Hadoop cluster.
Hadoop
Architecture of a complete Hadoop cluster, which consists of three separate components: the HDFS master (called the namenode), the job submission node (called the jobtracker), and many slave nodes (three shown here). Each of the slave nodes runs a tasktracker for executing map and reduce tasks and a datanode daemon for serving HDFS data.
