MapReduce Principles

liweihope

Last modified 2023-03-10 23:35:37

Tags: big data, mapreduce, learning

First published 2019-03-02 11:59:57

Article link: https://blog.csdn.net/liweihope/article/details/88073653

This section covers:

1. MapReduce overview

2. MapReduce on Yarn architecture

3. Walkthrough of submitting a jar to Yarn

4. Shuffle walkthrough

5. Common MapReduce commands

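1. MapReduce overview

MapReduce is a distributed computing framework.
It is rarely, if ever, used in enterprise development: jobs have to be written in Java, which is verbose and cumbersome, and its shuffle is disk-based and therefore slow, whereas Spark is a memory-based framework.
It still matters for interviews and as a reference point when learning other components, because it is an early big data computing framework.

MapReduce consists of two parts, the Map computation and the Reduce computation.

Map: mapping

Reduce: reduction (aggregation)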

Map:
x --> (x,1)   key,value pair
y --> (y,1)
z --> (z,1)
x --> (x,1)

Reduce:
x,2
y,1
z,1

This is equivalent to the SQL statement: select name,sum(value) from xxx group by name
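The x/y/z example above can be sketched in plain Python. This is only a single-process illustration of the two phases, not Hadoop code; the function names map_phase and reduce_phase are made up for this sketch.

```python
from collections import defaultdict


def map_phase(words):
    """Emit a (word, 1) key,value pair for every input word."""
    return [(word, 1) for word in words]


def reduce_phase(pairs):
    """Sum the values of all pairs that share the same key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


pairs = map_phase(["x", "y", "z", "x"])
counts = reduce_phase(pairs)
print(pairs)
print(counts)
```

The reduce step plays the role of sum(value) ... group by name in the SQL statement.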

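2. MapReduce on Yarn architecture (asked in every interview, memorize it)

MapReduce runs on Yarn and has no daemon processes of its own, so why discuss its architecture?

Because interviewers ask about the MapReduce architecture, the Yarn architecture, the Yarn workflow, or the workflow of submitting a MapReduce job to Yarn, and these are all really the same question.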

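Architecture diagram:

Interview question: what is a container?

(This passage is copied from another blog.) My own understanding: if the container concept is unclear, think of it as CPU and memory. Production tuning is largely container tuning.

Answer: a container is an abstract concept that encapsulates CPU and memory and runs Map Task and Reduce Task tasks.

Yarn's NodeManager takes the machine's resources (CPU, memory, network, disk, and so on) and packages them into something like a water tank that can hold many things. Each container is an independent unit that runs only its own work, and what it holds is CPU and memory.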

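Container: a Yarn resource abstraction. As an abstraction it is not a daemon that you can look up with ps, and it is not a child process of one; its CPU count, memory size, and so on are set in code. (While a task is running inside a container, that task does run as a process, as section 3 notes.)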
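Since a container is essentially a slice of CPU and memory, a rough feel for container tuning comes from simple arithmetic. The sketch below is a toy estimate, not how the Yarn scheduler actually decides: the real scheduler also applies minimum and maximum allocation settings and rounding. The parameter names mirror the properties yarn.nodemanager.resource.memory-mb, yarn.nodemanager.resource.cpu-vcores, mapreduce.map.memory.mb and mapreduce.map.cpu.vcores, and all numbers are hypothetical.

```python
def containers_per_node(node_memory_mb, node_vcores,
                        container_memory_mb, container_vcores):
    """Estimate how many equally sized containers fit on one NodeManager.

    Toy model: the tighter of the memory limit and the CPU limit wins.
    """
    by_memory = node_memory_mb // container_memory_mb
    by_cpu = node_vcores // container_vcores
    return min(by_memory, by_cpu)


# Hypothetical node: 8192 MB and 8 vcores offered to Yarn.
small = containers_per_node(8192, 8, 1024, 1)   # 1 GB, 1 vcore per container
large = containers_per_node(8192, 8, 2048, 1)   # 2 GB, 1 vcore per container
print(small, large)
```

Doubling the memory per container halves how many fit on this hypothetical node, which is the kind of trade-off container tuning deals with.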

Think of a MapReduce job as a jar, that is, a program, submitted to Yarn to run there. Yarn has two processes, ResourceManager and NodeManager. ResourceManager has two modules: the Applications Manager and the ResourceScheduler (the scheduler). A NodeManager runs containers; each container holds CPU + memory and runs one packaged task, either a MapTask or a ReduceTask.

As the diagram shows:

hadoop001 runs a ResourceManager process
hadoop002 runs a NodeManager process
hadoop003 runs a NodeManager process
The ResourceManager, which manages resources and jobs, has two components: the Applications Manager and the ResourceScheduler.

What the Applications Manager manages goes by several names: job, app, application.

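1. The user submits an application (a job) to Yarn, specifically to the Applications Manager in the ResourceManager. The submission includes the applicationMaster program, the command to start the applicationMaster, and so on.

2. The ResourceManager allocates the first container for the job (the app) on one of the machines, communicates with the corresponding NM, and asks that NM to start the job's MapReduce applicationMaster program in the container. (Here this is the MapReduce bootstrap/main program, an entry point like the main function in programming; for Spark it would be the Spark bootstrap/main program.)

3. The applicationMaster first registers with the Applications Manager, after which the user can view the job's overall status and logs directly in the web UI.

4. The applicationMaster polls the Resource Scheduler over RPC to request and receive a resource list (CPU and memory). This is not the resources themselves, only resource information, such as which machine offers how many CPUs and how much memory, which the applicationMaster then goes and claims.

(Polling: if, for example, there are too many tasks and too few resources, it runs some of the tasks first; when those finish, their resources are released and it requests again.)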

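5. Once the applicationMaster has obtained resources, it communicates with the corresponding NodeManager nodes (either the NM on its own machine, hadoop002, or the NM on the more distant hadoop003) and asks them to start tasks. Put simply: start a container and start the task inside it.

(When a NodeManager receives the request, it gets the list of requested resources: how many containers to start on this machine, how much memory and CPU each container gets, what kind of task it runs, and what code it executes. With that list in hand it sets up its containers.)

(Typically there are more map tasks than reduce tasks.)

6. The NodeManager prepares the runtime environment for the task (environment variables, jar files, and so on), writes the task's start command into a script, and [starts the task] through that script. (In an interview, be clear about which step starts the task.)

7. Each task reports its status and progress to the applicationMaster over RPC, so that the applicationMaster always knows the state of every task and can restart a task when it fails. The user can watch the job's current status in the web UI in real time.

8. When the job finishes, the applicationMaster unregisters from the ResourceManager and shuts itself down.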

Overall, the flow has two phases:
2.1 Start the applicationMaster.
2.2 The applicationMaster creates the job, requests resources for it, and monitors it until it completes.
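The polling in step 4 (run some tasks, wait for their resources to be released, then request again) can be illustrated with a toy model. This is a deliberately simplified sketch, not the Yarn protocol: it assumes every round is granted exactly free_containers containers and that all tasks in a round finish before the next request. The function name and the task labels are made up.

```python
def run_in_rounds(pending_tasks, free_containers):
    """Group tasks into rounds, each limited by the free containers."""
    if free_containers < 1:
        raise ValueError("at least one free container is required")
    rounds = []
    pending = list(pending_tasks)
    while pending:
        # The applicationMaster is granted as many containers as are free.
        granted = pending[:free_containers]
        pending = pending[free_containers:]
        rounds.append(granted)
        # When these tasks finish, their containers are released and
        # the loop asks for resources again.
    return rounds


rounds = run_in_rounds(["map-1", "map-2", "map-3", "map-4", "reduce-1"], 2)
for number, tasks in enumerate(rounds, start=1):
    print(number, tasks)
```

With five tasks and two free containers the job takes three rounds, which is why a job with few resources still completes, only more slowly.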

3. Walkthrough of submitting a jar to Yarn

[hadoop@10-9-140-90 hadoop-2.6.0-cdh5.7.0]$ hadoop jar ./share/hadoop/mapreduce2/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount \
> /wordcount/input /wordcount/output1
19/03/02 16:16:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/03/02 16:16:45 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 (the client connects to the RM)
19/03/02 16:16:46 INFO input.FileInputFormat: Total input paths to process : 2 (file input: two input files to process)
19/03/02 16:16:46 INFO mapreduce.JobSubmitter: number of splits:2 (there are two files, a.txt and b.txt; each is far smaller than 128M and so occupies a single block, giving two splits in total)
19/03/02 16:16:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1551513522488_0001 (the ID of the submitted job)
19/03/02 16:16:46 INFO impl.YarnClientImpl: Submitted application application_1551513522488_0001
19/03/02 16:16:46 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1551513522488_0001/ (open this URL in a browser to view the job)
19/03/02 16:16:46 INFO mapreduce.Job: Running job: job_1551513522488_0001 (the job is running)
19/03/02 16:16:54 INFO mapreduce.Job: Job job_1551513522488_0001 running in uber mode : false
19/03/02 16:16:54 INFO mapreduce.Job: map 0% reduce 0%
19/03/02 16:17:00 INFO mapreduce.Job: map 50% reduce 0%
19/03/02 16:17:01 INFO mapreduce.Job: map 100% reduce 0%
19/03/02 16:17:06 INFO mapreduce.Job: map 100% reduce 100%
19/03/02 16:17:07 INFO mapreduce.Job: Job job_1551513522488_0001 completed successfully
19/03/02 16:17:07 INFO mapreduce.Job: Counters: 49
File System Counters
...........
Job Counters
Launched map tasks=2 (two map tasks)
Launched reduce tasks=1 (one reduce task)

..........

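As the figure above shows, a container process exists while a container is running, and it is an independent process. Once the container has been used and released, the container process is gone.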
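In the log above, the job ID job_1551513522488_0001 and the application ID application_1551513522488_0001 differ only in their prefix. A small helper, written here purely for illustration (the function name is made up), converts one into the other, which is handy when moving between mapred job commands and yarn application commands.

```python
def job_id_to_application_id(job_id):
    """Turn 'job_<timestamp>_<sequence>' into 'application_<timestamp>_<sequence>'."""
    parts = job_id.split("_")
    if len(parts) != 3 or parts[0] != "job":
        raise ValueError(f"not a job ID: {job_id}")
    _, cluster_timestamp, sequence = parts
    return f"application_{cluster_timestamp}_{sequence}"


app_id = job_id_to_application_id("job_1551513522488_0001")
print(app_id)
```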

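4. Shuffle walkthrough (important: this comes up often later)

Refer to the official site: hadoop.apache.org --> MapReduce. Look at the relevant parts, such as the wordcount source (Java code) and how shuffle works.

(Written in Spark, wordcount takes only two lines.)

map --> shuffle --> reduce: shuffle is the stage between map and reduce.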
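What shuffle does between map and reduce can be sketched in a few lines: partition the map output by key, sort within each partition, and group the values of each key for its reducer. This is an in-memory toy, whereas the real shuffle spills to disk and copies data across machines. The partition function is a deterministic stand-in chosen for this sketch, not Hadoop's partitioner, and all names are made up.

```python
from collections import defaultdict


def partition(key, num_reducers):
    """Toy partitioner: byte sum of the key modulo the reducer count."""
    return sum(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Partition, sort, and group map outputs; one dict per reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_outputs:
        partitions[partition(key, num_reducers)][key].append(value)
    # Sort keys inside each partition, as reducers see keys in order.
    return [dict(sorted(p.items())) for p in partitions]


map_outputs = [("x", 1), ("y", 1), ("z", 1), ("x", 1)]
reducer_inputs = shuffle(map_outputs, 2)
reducer_outputs = [{k: sum(v) for k, v in part.items()} for part in reducer_inputs]
print(reducer_inputs)
print(reducer_outputs)
```

Every pair with the same key lands in the same partition, so each reducer can sum its keys without talking to the others.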

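Suppose the input file is 260M and a block is 128M. The file is then divided into 3 blocks, which are processed in parallel.

splitting: the degree of parallelism is determined by the blocks.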
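The block arithmetic for the 260M example works out as follows. The sketch only divides a file size by the block size; Hadoop's actual input split calculation applies further rules, so the number of map tasks may not always equal the block count. The function name is made up.

```python
def block_sizes(file_size_mb, block_size_mb=128):
    """Return the size of each block a file of the given size occupies."""
    sizes = []
    remaining = file_size_mb
    while remaining > 0:
        size = min(block_size_mb, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


blocks = block_sizes(260)
print(len(blocks), blocks)
```

Two full 128M blocks plus a final 4M block give the 3 blocks mentioned above.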

The details are not covered here.

For more, read this blog post carefully: MapReduce工作原理图文详解 (ITPUB blog)

5. Common MapReduce commands

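MapReduce commands are rarely used, but they may come up in interviews, and they help when the web UI cannot be opened and you still want to know which jobs have run.

To see the help: mapred --help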

[hadoop@10-9-140-90 hadoop-2.6.0-cdh5.7.0]$ mapred --help
Usage: mapred [--config confdir] COMMAND
where COMMAND is one of:
  pipes                run a Pipes job
  job                  manipulate MapReduce jobs
  queue                get information regarding JobQueues
  classpath            prints the class path needed for running
                       mapreduce subcommands
  historyserver        run job history servers as a standalone daemon
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  archive-logs         combine aggregated logs into hadoop archives
  hsadmin              job history server admin interface

Most commands print help when invoked w/o parameters.
[hadoop@10-9-140-90 hadoop-2.6.0-cdh5.7.0]$ mapred job
Usage: CLI <command> <args>
[-submit <job-file>]
[-status <job-id>]
[-counter <job-id> <group-name> <counter-name>]
[-kill <job-id>]
[-set-priority <job-id> <priority>]. Valid values for priorities are: VERY_HIGH HIGH NORMAL LOW VERY_LOW
[-events <job-id> <from-event-#> <#-of-events>]
[-history [all] <jobHistoryFile|jobId> [-outfile <file>] [-format <human|json>]]
[-list [all]]
[-list-active-trackers]
[-list-blacklisted-trackers]
[-list-attempt-ids <job-id> <task-type> <task-state>]. Valid values for <task-type> are MAP REDUCE. Valid values for <task-state> are running, completed
[-kill-task <task-attempt-id>]
[-fail-task <task-attempt-id>]
[-logs <job-id> <task-attempt-id>]

Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|resourcemanager:port> specify a ResourceManager
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

[hadoop@10-9-140-90 hadoop-2.6.0-cdh5.7.0]$ mapred job -list
19/03/02 18:32:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/03/02 18:32:15 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
Total jobs:0
JobId State StartTime UserName Queue Priority UsedContainers RsvdContainers UsedMem RsvdMem NeededMem AM info
[hadoop@10-9-140-90 hadoop-2.6.0-cdh5.7.0]$

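For example, first run mapred job -list to see the job list, then kill the job you want to stop, or view the logs of the job you are interested in.

For example: mapred job -logs job_1551521482026_0001

You can also use yarn: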
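When these commands are driven from a script, it helps to build the argument list in one place. The helper below is a hypothetical convenience wrapper, not part of Hadoop; it only assembles the command line for the -list, -status, -kill and -logs options shown in the usage text above and does not run anything.

```python
import shlex

# Subset of the "mapred job" options listed in the usage output above.
ALLOWED_ACTIONS = {"list", "status", "kill", "logs"}


def mapred_job_command(action, *args):
    """Build the argument list for a 'mapred job' invocation."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return ["mapred", "job", f"-{action}", *args]


list_cmd = mapred_job_command("list")
logs_cmd = mapred_job_command("logs", "job_1551521482026_0001")
print(shlex.join(list_cmd))
print(shlex.join(logs_cmd))
```

On a machine with Hadoop installed, the returned list could be handed to subprocess.run to execute the command.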

[hadoop@10-9-140-90 hadoop-2.6.0-cdh5.7.0]$ yarn --help
Usage: yarn [--config confdir] COMMAND
where COMMAND is one of:
  resourcemanager -format-state-store   deletes the RMStateStore
  resourcemanager                       run the ResourceManager
  nodemanager                           run a nodemanager on each slave
  timelineserver                        run the timeline server
  rmadmin                               admin tools
  version                               print the version
  jar <jar>                             run a jar file
  application                           prints application(s)
                                        report/kill application
  applicationattempt                    prints applicationattempt(s)
                                        report
  container                             prints container(s) report
  node                                  prints node report(s)
  queue                                 prints queue information
  logs                                  dump container logs
  classpath                             prints the class path needed to
                                        get the Hadoop jar and the
                                        required libraries
  daemonlog                             get/set the log level for each
                                        daemon
  top                                   run cluster usage tool
 or
  CLASSNAME                             run the class named CLASSNAME

Most commands print help when invoked w/o parameters.
[hadoop@10-9-140-90 hadoop-2.6.0-cdh5.7.0]$

[hadoop@10-9-140-90 hadoop-2.6.0-cdh5.7.0]$ yarn application
19/03/02 18:43:18 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/03/02 18:43:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Invalid Command Usage :
usage: application
 -appStates <States>             Works with -list to filter applications
                                 based on input comma-separated list of
                                 application states. The valid application
                                 state can be one of the following:
                                 ALL,NEW,NEW_SAVING,SUBMITTED,ACCEPTED,RUNNING,FINISHED,FAILED,KILLED
 -appTypes <Types>               Works with -list to filter applications
                                 based on input comma-separated list of
                                 application types.
 -help                           Displays help for all commands.
 -kill <Application ID>          Kills the application.
 -list                           List applications. Supports optional use
                                 of -appTypes to filter applications based
                                 on application type, and -appStates to
                                 filter applications based on application
                                 state.
 -movetoqueue <Application ID>   Moves the application to a different
                                 queue.
 -queue <Queue Name>             Works with the movetoqueue command to
                                 specify which queue to move an
                                 application to.
 -status <Application ID>        Prints the status of the application.
[hadoop@10-9-140-90 hadoop-2.6.0-cdh5.7.0]$

[hadoop@10-9-140-90 hadoop-2.6.0-cdh5.7.0]$ yarn application -list
19/03/02 18:44:00 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/03/02 18:44:00 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):0
Application-Id Application-Name Application-Type User Queue State Final-Stat Progress Tracking-URL
[hadoop@10-9-140-90 hadoop-2.6.0-cdh5.7.0]$