MapReduce

On this page, I will explain the following important MapReduce (MR) concepts.

1) Job: how a job is initialized and executed.

2) MR components: how they work together to process data.

3) Data Flow: a word-count example to illustrate point 2.

4) Shuffle: the whole picture.

Job

The job submitter computes the input splits and submits the job.

The job scheduler initializes the job object, creating the map tasks, reduce tasks, a job-setup task, and a job-cleanup task.

For each input split, one map task is created. By default a split is one HDFS block, but a split can be configured to span multiple blocks. The FileSplit Java class represents a split; it has path, start (offset), length, and hosts (a list) attributes. With this information, HDFS knows where, how, and how much data to read for the split. Note that the record reader (built on FSInputStream) reads complete lines: if the last line of a split crosses into the next block, the reader fetches the rest of the line from that block. You can configure the split size to be n × the block size, but this is usually a bad idea because it breaks the data-locality optimization (the other blocks would have to be copied from other data nodes to the task).
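The split computation described above can be sketched as follows. This is a simplified, single-machine illustration of what the job submitter does (the real logic lives in Hadoop's FileInputFormat and is in Java); the function name and signature here are hypothetical:

```python
def compute_splits(file_length, split_size):
    """Divide a file of file_length bytes into (start, length) splits.

    Simplified sketch of Hadoop's split computation: one split per
    split_size bytes (by default, split_size equals the HDFS block
    size). Record boundaries are NOT handled here; the record reader
    fixes lines that cross split boundaries at read time.
    """
    splits = []
    start = 0
    while start < file_length:
        length = min(split_size, file_length - start)
        splits.append((start, length))
        start += length
    return splits

# A 300-unit file with a 128-unit block size yields three splits.
print(compute_splits(300, 128))  # [(0, 128), (128, 128), (256, 44)]
```

Each (start, length) pair corresponds to one map task; the hosts attribute (omitted here) is what lets the scheduler place the task near the data.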

The number of reduce tasks is configured in the MR program.

The JobTracker assigns these tasks to TaskTrackers based on data-locality optimization and on how much load each TaskTracker already carries. TaskTrackers send heartbeats (HB) to the JobTracker; a heartbeat includes the available resources (CPU, memory, and so on). A TaskTracker also reports task progress to the JobTracker periodically, and the JobTracker forwards it to the application, so overall progress is known.

The TaskTracker runs each task in a separate JVM, so that a faulty application cannot bring down the TaskTracker itself.

The job-setup task runs first to create the output folder; then the map tasks run, then the reduce tasks, and finally the job-cleanup task. The job succeeds only when the job-cleanup task finishes without a problem, at which point a _SUCCESS marker is written to the output directory. Before that, all output is stored in a temporary folder and is moved to the final output folder by the cleanup task. Reduce tasks learn the locations of map output from the JobTracker, which assigned the map tasks. When reduce tasks start can be configured based on what percentage of map tasks have completed.

Map output is written to the local file system. Reduce output is written to HDFS (since we want it to be reliable).

Components

The diagram below clearly describes the data flow among the components. For some jobs, such as a map-side join, there is no need to run reduce tasks at all, so not all final output is generated by reducers.

Data Flow

Some things to note:

1) The client / application / submitter specifies the number of reduce tasks, which determines how many partitions there are.

2) Values (or records) with the same key will be in the same partition and, later, will go to the same reducer.

3) The records in each partition are sorted in memory (Hadoop's default in-memory sort is quicksort; on-disk merging uses merge sort).
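The partition assignment in points 1 and 2 can be sketched like this. Hadoop's default HashPartitioner works the same way (hash of the key modulo the number of reduce tasks), though it uses Java's hashCode rather than Python's hash:

```python
def partition(key, num_reduce_tasks):
    """Sketch of a hash partitioner: records with the same key always
    map to the same partition, hence to the same reducer."""
    return hash(key) % num_reduce_tasks

# Every occurrence of the same word lands in the same partition.
words = ["the", "quick", "the", "fox"]
parts = [partition(w, 3) for w in words]
assert parts[0] == parts[2]  # both "the" records go to one reducer
```

This is why the number of reduce tasks fixes the number of partitions: the partitioner can only produce values in the range 0 to num_reduce_tasks − 1.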

 

Shuffle: the whole picture

  • Map output is not written directly to disk but to an in-memory buffer (100 MB by default); writing to disk for each record would be very slow, and sorting could not be done. When the buffer is 80% full, a background thread partitions and then sorts that portion of the data in memory, then dumps it to disk as an immutable spill file, while the map task keeps running unless the buffer fills up completely.
  • Before writing the (partitioned and sorted) data to disk, you can optionally set a combiner. In this example it has the same logic as the reducer, and it optimizes the job by saving I/O and transferring less data. See the change in the third output: the values (each 1) are summed per key, so the number of records written to disk is reduced to 4 (originally 6).
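The combiner's effect can be sketched with a small word-count example; the function names are hypothetical, and the 6-to-4 reduction here mirrors the counts mentioned above:

```python
from collections import defaultdict

def map_phase(text):
    """Word-count mapper: emit (word, 1) for every word."""
    return [(word, 1) for word in text.split()]

def combine(pairs):
    """Combiner with the same logic as the reducer: sum the values per
    key before the records are spilled to disk, so fewer records are
    written locally and later shuffled over the network."""
    sums = defaultdict(int)
    for key, value in pairs:
        sums[key] += value
    return sorted(sums.items())  # spill output is sorted by key

pairs = map_phase("a b a c b d")   # 6 records from the mapper
combined = combine(pairs)          # 4 records after combining
print(combined)  # [('a', 2), ('b', 2), ('c', 1), ('d', 1)]
```

The combiner is an optimization only: the job must produce the same result whether it runs zero, one, or many times, which is why it fits operations like summation.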

  • As we know, an HDFS file does not support modification (it is immutable), and each spill file the map task writes is likewise immutable. As the map task continues to emit <key, value> records into the memory buffer, more and more spill files are produced. A background thread merges these spill files (10 at a time by default) into a larger file that is still partitioned, sorted, and combined. When the map task is done, all the data is merged into a single partitioned file on the local file system (ext3/4), with one sorted region per reducer partition, so each reducer can fetch just its own partition. You can compress it to save I/O and network transmission to the reduce tasks. These files are removed once the job completes.

 

 

  • Reducers do not wait for all map tasks to complete; a reducer starts copying as soon as some map output is available. By default, 5 threads copy map output into the reducer's memory buffer. When the buffer is 80% full, just as on the map side, spill files start to be written to disk. Before each spill is written, it is merged, sorted, and combined in memory (because the data comes from multiple map outputs for the same partition, there may be records with the same key). Multiple on-disk spill files are also merged/sorted/combined concurrently by a background thread in the reduce task. Finally, the merged output is consumed by the reduce function. As you can see, sorting and combining are executed on both the map side and the reduce side.
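Putting the whole shuffle together, a minimal end-to-end word-count simulation (map → partition/sort → copy → merge → reduce) might look like the following. This is a single-process sketch under the assumptions above, not how Hadoop actually distributes the work:

```python
from collections import defaultdict

def word_count(documents, num_reducers=2):
    """End-to-end sketch of the shuffle: each mapper partitions and
    sorts its own output; each reducer merges the pieces it copied
    from every mapper and sums the counts per key."""
    # Map side: one partitioned, sorted output per mapper (document).
    map_outputs = []
    for doc in documents:
        partitions = defaultdict(list)
        for word in doc.split():
            partitions[hash(word) % num_reducers].append((word, 1))
        map_outputs.append({p: sorted(kv) for p, kv in partitions.items()})

    # Reduce side: copy this reducer's partition from every mapper,
    # merge the records, and sum the values per key.
    result = {}
    for p in range(num_reducers):
        counts = defaultdict(int)
        for mo in map_outputs:
            for word, one in mo.get(p, []):
                counts[word] += one
        result.update(counts)
    return result

print(word_count(["the quick fox", "the lazy dog"]))
# "the" appears in both documents, so its count is 2; all others are 1.
```

Note how the same key always hashes to the same partition, so both occurrences of "the" meet in one reducer even though they came from different mappers.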

 

Reposted from: https://www.cnblogs.com/nativestack/p/9670690.html
