Hadoop (4): MapReduce on YARN Workflow

RayBreslin

Article link: https://blog.csdn.net/u010886217/article/details/89299416

I. Key Concepts

1. Client

Role: the machine that submits the MapReduce job.

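2. ResourceManager (RM)

Role: manages resource scheduling and allocation for the whole cluster. It consists of the ApplicationsManager and the ResourceScheduler.

(1) ApplicationsManager: manages every submitted job and creates the ApplicationMaster for each one.

(2) ResourceScheduler: manages resource allocation for every job and grants each job its resources (containers).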

3. ApplicationMaster (AM)

Role: the process that schedules a job submitted by the client. Each job has its own ApplicationMaster.

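4. NodeManager (NM)

Role: the resource manager of a single node. It hosts the compute resources (containers) in which each map or reduce task runs. One NodeManager can host several containers, and each container runs one map or reduce task.

Note: by default each map task and each reduce task gets 1 GB of memory and 1 CPU core.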
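
As a rough illustration of the note above, the sketch below estimates how many default-sized task containers fit on one NodeManager. The node size (8 GB, 4 vcores) and the class and method names are assumptions made for this example, not values from a real cluster. It also assumes that the scheduler counts both memory and CPU, and it ignores minimum-allocation rounding.

```java
// Hypothetical helper: not part of Hadoop, it only illustrates the arithmetic.
final class ContainerCapacity {
    // Number of task containers a node can host when each task needs
    // taskMemMb of memory and taskVcores cores; the scarcer resource decides.
    static int maxContainers(int nodeMemMb, int nodeVcores, int taskMemMb, int taskVcores) {
        int byMemory = nodeMemMb / taskMemMb;
        int byCpu = nodeVcores / taskVcores;
        return Math.min(byMemory, byCpu);
    }

    public static void main(String[] args) {
        // Assumed node: 8192 MB and 4 vcores offered to YARN.
        // Default task size from the note above: 1024 MB and 1 vcore.
        System.out.println(maxContainers(8192, 4, 1024, 1)); // CPU is the limit here
    }
}
```

With these assumed numbers memory would allow 8 containers, but only 4 cores are available, so at most 4 tasks run on the node at once.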

II. Workflow

1. Architecture diagram

2. Submission flow

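(1) The client submits the job to the RM. The RM holds the resource information of every NM and allocates a container for the job's ApplicationMaster on one of them (the NodeManager's hostname plus the allocated memory and CPU).

(2) The RM passes this allocation to the chosen NodeManager. The NodeManager creates the container and runs the launch code inside it to start the ApplicationMaster.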

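(3) The ApplicationMaster reads the settings of the job submitted by the client, for example the number of map tasks (determined by the InputFormat's getSplits method) and the number of reduce tasks (set by the parameter mapreduce.job.reduces).

(4) The ApplicationMaster asks the RM for resources to run the map tasks (each map or reduce task needs its own container), and the RM returns the allocated resources to the ApplicationMaster.

(5) and (6) The ApplicationMaster remotely calls the NMs to start the allocated containers and launches the corresponding map task in each container.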

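(7) There are two kinds of tasks:

-> Map task: when a map task finishes, it notifies the AM that it has completed.

-> Reduce task: once 5% of all map tasks have completed (the default value of mapreduce.job.reduce.slowstart.completedmaps), the AM asks the RM for resources to run the reduce tasks (one container per task). The RM sends the resource information to the AM, and the AM starts the containers on the corresponding NMs and runs the reduce tasks in them.

(8) When a reduce task finishes, it notifies the ApplicationMaster, which in turn notifies the ApplicationsManager. Once all reduce tasks have finished, the AM tells the client that the job is complete.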
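
The 5% threshold in step (7) can be sketched as a small calculation. The class and method names below are hypothetical, and rounding the threshold up to a whole task is an assumption of this sketch rather than a statement about the AM's exact code.

```java
// Hypothetical helper illustrating the reduce slow-start threshold.
final class ReduceSlowStart {
    // Completed map tasks required before reduce containers are requested.
    static int mapsNeeded(int totalMaps, double slowStartFraction) {
        return (int) Math.ceil(slowStartFraction * totalMaps);
    }

    // True once enough map tasks have finished for the AM to ask for reducers.
    static boolean canRequestReducers(int completedMaps, int totalMaps, double slowStartFraction) {
        return completedMaps >= mapsNeeded(totalMaps, slowStartFraction);
    }

    public static void main(String[] args) {
        double fraction = 0.05; // the 5% described in step (7)
        System.out.println(mapsNeeded(200, fraction));             // 10
        System.out.println(canRequestReducers(9, 200, fraction));  // false
        System.out.println(canRequestReducers(10, 200, fraction)); // true
    }
}
```

For a job with only a handful of map tasks, the threshold rounds up to a single completed map, so reducers are requested almost immediately.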

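III. Map and Reduce Scheduling Details

1. Where map tasks run

Where a map task runs is decided when the RM grants resources to the AM. The RM knows the total and the used CPU and memory of the cluster and, following its built-in scheduling algorithm, returns the information of the selected NM to the AM.

When allocating resources, the RM tries to place each map task on a node that stores its input data, so that the AM launches the task on the data node that holds the data to be processed. This is MapReduce's data locality mechanism. If that data node does not have enough free resources, another machine in the same rack is chosen; the last resort is a server in a different rack.

A map task uses 1 CPU core and 1 GB of memory by default.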
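
The preference order described above (data node first, then the same rack, then any other rack) can be modelled roughly as follows. This is a simplified sketch, not the scheduler's actual algorithm; the class name, the host and rack names, and the idea of a flat list of hosts with free capacity are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical model of locality preference: node-local, then rack-local, then off-rack.
final class LocalityChooser {
    // Picks a host with free capacity, preferring hosts that store the split's data,
    // then hosts in the same rack as such a host, then any remaining host.
    static Optional<String> choose(List<String> dataHosts,
                                   Map<String, String> rackOf,
                                   List<String> hostsWithFreeCapacity) {
        // 1. Node-local: a free host that stores the data itself.
        for (String host : hostsWithFreeCapacity) {
            if (dataHosts.contains(host)) {
                return Optional.of(host);
            }
        }
        // 2. Rack-local: a free host in the same rack as one of the data hosts.
        for (String host : hostsWithFreeCapacity) {
            String rack = rackOf.get(host);
            for (String dataHost : dataHosts) {
                if (rack != null && rack.equals(rackOf.get(dataHost))) {
                    return Optional.of(host);
                }
            }
        }
        // 3. Off-rack: any free host, or empty if nothing is free.
        return hostsWithFreeCapacity.stream().findFirst();
    }

    public static void main(String[] args) {
        Map<String, String> rackOf = new HashMap<>();
        rackOf.put("node1", "/rack1");
        rackOf.put("node2", "/rack1");
        rackOf.put("node3", "/rack2");
        // The split's block is stored on node1, but only node3 and node2 have free capacity.
        System.out.println(choose(Arrays.asList("node1"), rackOf, Arrays.asList("node3", "node2")));
    }
}
```

In the usage above node2 is selected, because it shares a rack with node1 even though node3 appears first in the list of free hosts.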

2. Where reduce tasks run

Reduce tasks are placed by the default scheduling algorithm; their location generally cannot be specified.

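3. Setting the number of map tasks

MapReduce has two APIs, org.apache.hadoop.mapred (old API) and org.apache.hadoop.mapreduce (new API); development mostly uses the new one. The numbers of map and reduce tasks are governed by two operations, split and partition: each split corresponds to one map task, and each partition is the input of one reduce task.

In the new API, TextInputFormat uses the parameters mapreduce.input.fileinputformat.split.maxsize and mapreduce.input.fileinputformat.split.minsize to control the split size, and therefore the number of map tasks.

(1) Formula: split_size = max(minsize, min(maxsize, blocksize))

By default maxsize is Long.MAX_VALUE and minsize is 0, so the default split size equals the block size, which is 128 MB. To get more map tasks, set maxsize below the block size; to get fewer map tasks, set minsize above the block size.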
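
To make the formula concrete, here is a stand-alone sketch that applies it to a single file and counts the resulting splits. The class and method names are hypothetical, the 300 MB file is an invented example, and the 10% tolerance that lets the last split run slightly over the split size is an assumption of this sketch; it also assumes a splittable, non-empty file.

```java
// Hypothetical stand-alone version of the split-size formula above.
final class SplitSizeDemo {
    // Assumed tolerance: the final split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    // split_size = max(minsize, min(maxsize, blocksize))
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits (and therefore map tasks) for one file of fileLength bytes.
    static long countSplits(long fileLength, long splitSize) {
        long splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining > 0) {
            splits++; // whatever is left becomes the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long block = 128 * mb;
        long file = 300 * mb;
        // Defaults: split size equals the block size.
        System.out.println(countSplits(file, splitSize(0L, Long.MAX_VALUE, block)));        // 3
        // maxsize below the block size: more, smaller splits.
        System.out.println(countSplits(file, splitSize(0L, 64 * mb, block)));               // 5
        // minsize above the block size: fewer, larger splits.
        System.out.println(countSplits(file, splitSize(256 * mb, Long.MAX_VALUE, block)));  // 2
    }
}
```

The three calls in main correspond to the three cases in the paragraph above: the default, a smaller maxsize, and a larger minsize.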

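(2) Method 1: when submitting the job

bin/hadoop jar Hadoop.jar Hadoop.WordCountOnline -Dmapreduce.input.fileinputformat.split.maxsize=20191226 -Dmapreduce.input.fileinputformat.split.minsize=1000 /core-site.xml /output8

Note: -D generic options take effect only when the driver parses them (for example by running through ToolRunner), and they must come before the input and output paths; options placed after the paths reach the program as ordinary arguments. The value must also be a plain integer, without the Java-style l suffix.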

Result (console output of the original run, which passed the -D options after the paths):

[root@hadoop01 hadoop-2.6.0-cdh5.7.0]# bin/hadoop jar Hadoop.jar Hadoop.WordCountOnline /core-site.xml /output8 -Dmapreduce.input.fileinputformat.split.maxsize=20191226l -Dmapreduce.input.fileinputformat.split.minsize=1000 

19/04/04 12:52:19 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=1357
		FILE: Number of bytes written=337515
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=1080
		HDFS: Number of bytes written=998
		HDFS: Number of read operations=9
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=4
	Job Counters 
		Launched map tasks=1
		Launched reduce tasks=2
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=3219
		Total time spent by all reduces in occupied slots (ms)=13345
		Total time spent by all map tasks (ms)=3219
		Total time spent by all reduce tasks (ms)=13345
		Total vcore-seconds taken by all map tasks=3219
		Total vcore-seconds taken by all reduce tasks=13345
		Total megabyte-seconds taken by all map tasks=3296256
		Total megabyte-seconds taken by all reduce tasks=13665280
	Map-Reduce Framework
		Map input records=28
		Map output records=133
		Map output bytes=1513
		Map output materialized bytes=1357
		Input split bytes=99
		Combine input records=133
		Combine output records=87
		Reduce input groups=87
		Reduce shuffle bytes=1357
		Reduce input records=87
		Reduce output records=87
		Spilled Records=174
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=366
		CPU time spent (ms)=3300
		Physical memory (bytes) snapshot=625266688
		Virtual memory (bytes) snapshot=8269815808
		Total committed heap usage (bytes)=445120512
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=981
	File Output Format Counters 
		Bytes Written=998
success:0

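(3) Method 2: set it through the API in the program

// Upper bound on the split size, in bytes
FileInputFormat.setMaxInputSplitSize(job, 20190101L);
// Lower bound on the split size, in bytes
FileInputFormat.setMinInputSplitSize(job, 1000L);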

4. Setting the number of reduce tasks

Method 1: set it through the API

job.setNumReduceTasks(2);
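
Since each partition feeds one reduce task, the number set here is also the number of partitions that map output keys are spread across. The sketch below shows the idea of hash partitioning in plain Java. The class and method names are hypothetical, and because it hashes java.lang.String rather than Hadoop's own key types, the partition a given word lands in may differ from a real job.

```java
// Hypothetical stand-alone sketch of hash partitioning; it does not depend on Hadoop classes.
final class HashPartitionDemo {
    // Maps a key to a partition index in the range [0, numReduceTasks).
    static int partitionFor(Object key, int numReduceTasks) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 2; // same value as job.setNumReduceTasks(2)
        for (String word : new String[] {"hadoop", "yarn", "mapreduce"}) {
            System.out.println(word + " -> partition " + partitionFor(word, numReduceTasks));
        }
    }
}
```

Every occurrence of the same key maps to the same partition, which is what lets a single reduce task see all values for that key.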

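IV. MapReduce Program

1. Reference: Big Data Debugging Environment Setup (2): Connecting IDEA to an External Hadoop Debugging Environment and Deploying the JAR to the Server, https://blog.csdn.net/u010886217/article/details/89278390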
