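MapReduce: jobs are submitted to YARN for computation; MapReduce itself does not need to be deployed.
YARN: resource (CPU, memory) management and job scheduling. E.g. for a cluster running in a Shanghai or Hangzhou data center with 48 CPU cores and 120 GB of memory, YARN might give one job 10 cores and 20 GB of memory.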
YARN on a Single Node:
You can run a MapReduce job on YARN in a pseudo-distributed mode by setting a few parameters and running ResourceManager daemon and NodeManager daemon in addition.
In other words: set a few parameters, start the ResourceManager and NodeManager processes, and you can submit a MapReduce job to YARN in pseudo-distributed mode.
I. Configure the XML file parameters
1. Configure parameters as follows: vi etc/hadoop/mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
2. vi etc/hadoop/yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name> <!-- shuffle -->
<value>mapreduce_shuffle</value>
</property>
</configuration>
II. Start YARN
Start ResourceManager daemon and NodeManager daemon:
[hadoop@hadoop004 hadoop]$ sbin/start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/logs/yarn-hadoop-resourcemanager-hadoop004.out
hadoop004: starting nodemanager, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/logs/yarn-hadoop-nodemanager-hadoop004.out
[hadoop@hadoop004 hadoop]$ jps
31158 Jps
31128 NodeManager
31037 ResourceManager
Both daemons are running, so the YARN deployment is OK.
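The jps check above can also be scripted. The following is a minimal sketch, not part of Hadoop: check_yarn_daemons is a hypothetical helper name, and it only assumes that jps prints one "<pid> <class name>" pair per line, as in the transcript above.

```shell
# Hypothetical helper (not a Hadoop command): reads `jps` output on stdin
# and reports which of the two YARN daemons, if any, is missing.
check_yarn_daemons() {
  jps_output=$(cat)
  missing=""
  for daemon in ResourceManager NodeManager; do
    # jps prints "<pid> <class name>", so compare the second field exactly.
    if ! printf '%s\n' "$jps_output" | awk '{print $2}' | grep -qx "$daemon"; then
      missing="$missing $daemon"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "all YARN daemons running"
}
```

On the node itself this would be used as: jps | check_yarn_daemons. A non-zero exit status means at least one daemon is not running and its log should be checked.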
If any process fails to start, go to the logs folder under the hadoop directory and use tail -200f on the corresponding file ending in .log to troubleshoot.
Interview question: hadoop-hadoop-datanode-hadoop002.log
hadoop-<user>-<process name>-<hostname> // HDFS; the prefix here is hadoop
yarn-hadoop-nodemanager-hadoop004.log // YARN
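To make the naming convention concrete, here is a small sketch that splits such a file name into its parts. parse_log_name is a hypothetical helper, and it assumes that the prefix, user, and process name contain no hyphens (the hostname may).

```shell
# Hypothetical helper: split a daemon log file name of the form
#   <prefix>-<user>-<process name>-<hostname>.log
# Assumes prefix, user, and process name contain no "-" characters.
parse_log_name() {
  name=${1##*/}      # drop any directory part
  name=${name%.*}    # drop the .log or .out extension
  # The last variable receives the rest of the line, so a hostname
  # containing hyphens stays in one piece.
  IFS=- read -r prefix user daemon host <<EOF
$name
EOF
  echo "prefix=$prefix user=$user daemon=$daemon host=$host"
}
```

For example, parse_log_name hadoop-hadoop-datanode-hadoop002.log separates the HDFS prefix, the hadoop user, the datanode process, and the hadoop002 machine.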
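How to locate a problem quickly from the logs:
1. ll -h * // check the size of the log files
If the file is small, open it directly in vi and search for errors by typing /error.
2. View the last 1000 lines and keep following the file: tail -1000f
Open a new window, restart the process, and watch the log with tail -200f while it starts.
eg. call from hadoop004/10.0.0.135 to hadoop004:9000 failed on connection exception: java.net.ConnectException: Connection refused
Is the port already in use? Note that Connection refused usually means nothing is listening on hadoop004:9000 (for example, the NameNode is not running), so check that first.
3. Use sz to download the file from Linux to Windows, then open it in EditPlus to troubleshoot.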
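The search steps above can be combined into one command. This is a minimal sketch using only grep and tail; last_errors is a hypothetical helper name, and the default of 20 lines is an arbitrary choice.

```shell
# Hypothetical helper: show the most recent lines of a log file that
# mention an error or exception, prefixed with their line numbers.
last_errors() {
  log_file=$1
  max_lines=${2:-20}   # how many matching lines to show (assumed default)
  # -n prints line numbers, -i ignores case, -E enables the "|" alternation.
  grep -n -i -E 'error|exception' "$log_file" | tail -n "$max_lines"
}
```

Something like last_errors logs/yarn-hadoop-nodemanager-hadoop004.log would then print the latest matches, and the line numbers can be used to jump to the surrounding context in vi.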
III. Run MR job examples
From the hadoop directory, run find ./ -name "*example*.jar" // fuzzy search under the current directory for files whose name contains "example" and ends in .jar.
[hadoop@hadoop004 hadoop]$ find ./ -name "*example*.jar"
./share/hadoop/mapreduce2/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar
./share/hadoop/mapreduce2/sources/hadoop-mapreduce-examples-2.6.0-cdh5.7.0-sources.jar
./share/hadoop/mapreduce2/sources/hadoop-mapreduce-examples-2.6.0-cdh5.7.0-test-sources.jar
./share/hadoop/mapreduce1/hadoop-examples-2.6.0-mr1-cdh5.7.0.jar
Now we learn a new command: hadoop jar
Example 1: estimating pi
hadoop jar ./share/hadoop/mapreduce2/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar // press Enter at this point to see the command usage
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
hadoop jar ./share/hadoop/mapreduce2/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar pi 5 20
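In the command above, the two arguments to pi are the number of map tasks (5) and the number of samples per map (20). To show the idea behind the estimate, here is a local sketch that needs no Hadoop. It is not the example's actual code: it swaps the quasi-Monte Carlo (Halton sequence) sampling for a plain regular grid, and it tests points against a quarter circle of radius 1 rather than a full circle inscribed in the square, which gives the same area ratio. estimate_pi is a hypothetical helper name.

```shell
# Hypothetical helper: estimate pi from an n x n grid of sample points
# in the unit square (a regular grid, NOT the Halton sequence that the
# Hadoop example uses).
estimate_pi() {
  awk -v n="$1" 'BEGIN {
    inside = 0
    for (i = 0; i < n; i++) {
      for (j = 0; j < n; j++) {
        x = (i + 0.5) / n   # midpoint of the grid cell
        y = (j + 0.5) / n
        if (x * x + y * y <= 1) inside++
      }
    }
    # The quarter circle has area pi/4, so pi is about 4 * inside / total.
    printf "%.4f\n", 4 * inside / (n * n)
  }'
}
```

Calling estimate_pi 200 uses 40000 sample points. As with the MapReduce version, more samples give a closer estimate, which is why pi 5 20 (only 100 samples in total) returns a rough value.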
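Example 2: word count
Explanation: wordcount must be followed by an input path, /wordcount/input, and an output path, /wordcount/output1.
Both paths are on the HDFS file system. Write your own input files (a.txt in the commands below) and upload them.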
hdfs dfs -mkdir -p /wordcount/input
hdfs dfs -put /home/hadoop/data/a.txt /wordcount/input
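Note: do not create the output1 directory yourself; it is created automatically when the job runs, so you only need to choose its name. If it already exists, the job will not run at all and fails with an error saying the output directory already exists.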
hadoop jar ./share/hadoop/mapreduce2/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount /wordcount/input /wordcount/output1
HDFS command: hdfs dfs -get /wordcount/output1/ ./ downloads the output1 directory to the current directory.
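5. Running MR
map: mapping
reduce: reduction
Execution order of map and reduce: it is not strictly true that every map must finish before reduce starts. Reduce tasks can be launched and begin fetching map output while maps are still running, although the reduce function itself only runs once all map output has been fetched.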
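The map and reduce roles in word count can be imitated locally with a shell pipeline. This is only an analogy, not how Hadoop implements it: everything runs in one process chain on one machine, and word_count is a hypothetical helper name.

```shell
# Local sketch of word count (no Hadoop involved).
word_count() {
  tr -s '[:space:]' '\n' |     # "map": emit one word per line
    grep -v '^$' |             # drop the empty line that leading whitespace produces
    LC_ALL=C sort |            # "shuffle": bring identical words together
    uniq -c |                  # "reduce": count each group of identical words
    awk '{print $2 "\t" $1}'   # print as word<TAB>count
}
```

For example, cat /home/hadoop/data/a.txt | word_count prints one line per distinct word with its count, which can be compared with the part files downloaded from /wordcount/output1.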