Abstract:
Downloading, installing, and configuring Hadoop 2.7.3 on a three-machine cluster, along with the problems encountered and a small demo. The configuration covers both the Hadoop runtime and the YARN runtime, the goal being a MapReduce environment running on top of YARN; it also shows a working example of setting resource limits on low-memory machines.
Prerequisites
- A working LAN cluster, for example one built on virtual machines as in [1] 大数据学习前夕[01]:系统-网络-SSH
- JDK installed and its environment variables configured, as in [2] 大数据学习前夕[02]:JDK安装升级
Download
wget http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
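It is worth checking that the tarball was not corrupted in transit before extracting it. Apache publishes a checksum for each release on the download page; the snippet below only prints the local digest (guarded so it is safe to run even if the file is absent) — the comparison against the published value is done by eye:

```shell
# Print the SHA-256 of the tarball, to be compared against the checksum
# published on the Apache download page for hadoop-2.7.3.
if [ -f hadoop-2.7.3.tar.gz ]; then
    sha256sum hadoop-2.7.3.tar.gz
else
    echo "hadoop-2.7.3.tar.gz not found in the current directory"
fi
```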
Extract
[hadoop@hadoop01 ~]$ tar -zxvf hadoop-2.7.3.tar.gz
[hadoop@hadoop01 ~]$ mv hadoop-2.7.3 hadoop
Create data directories
Inside the hadoop directory, create the directories HDFS will use for NameNode and DataNode storage (they must match the dfs.namenode.name.dir and dfs.datanode.data.dir values set in hdfs-site.xml below):
mkdir dfs
mkdir dfs/name
mkdir dfs/data
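The three mkdir calls can be collapsed into one with the -p flag, which creates any missing parent directories:

```shell
# -p creates intermediate directories, so one command replaces the three
# mkdir calls above; run it from inside the hadoop directory so the paths
# line up with dfs.namenode.name.dir and dfs.datanode.data.dir.
mkdir -p dfs/name dfs/data
```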
Configuration
1. hadoop-env.sh
[hadoop@hadoop01 ~]$ vim hadoop/etc/hadoop/hadoop-env.sh
2. yarn-env.sh
[hadoop@hadoop01 hadoop]$ vim /home/hadoop/hadoop/etc/hadoop/yarn-env.sh
In both hadoop-env.sh and yarn-env.sh, set JAVA_HOME explicitly by uncommenting the line and pointing it at your JDK:
export JAVA_HOME=/home/hadoop/jdk1.8.0_144
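If you prefer a non-interactive edit over vim, a sed one-liner does the same job. The snippet below works on a throwaway copy so it is safe to run anywhere; on a real install you would point the sed at hadoop/etc/hadoop/hadoop-env.sh and yarn-env.sh instead:

```shell
# Pin JAVA_HOME non-interactively. The temp file stands in for
# hadoop-env.sh / yarn-env.sh, which contain the default line
# "export JAVA_HOME=${JAVA_HOME}".
env_file=$(mktemp)
echo 'export JAVA_HOME=${JAVA_HOME}' > "$env_file"
sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/home/hadoop/jdk1.8.0_144|' "$env_file"
grep '^export JAVA_HOME=' "$env_file"
```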
3. slaves
Here hadoop02 and hadoop03 serve as slaves; hadoop01 can also be added if the master should double as a worker.
[hadoop@hadoop01 hadoop]$ vim /home/hadoop/hadoop/etc/hadoop/slaves
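With hadoop02 and hadoop03 as the worker nodes, the slaves file lists one hostname per line:

```
hadoop02
hadoop03
```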
4. core-site.xml
[hadoop@hadoop01 hadoop]$ vim /home/hadoop/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop01:9000</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/home/hadoop/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
</configuration>
5. hdfs-site.xml
[hadoop@hadoop01 hadoop]$ vim /home/hadoop/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop01:9001</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/hadoop/hadoop/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/hadoop/hadoop/dfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>
6. mapred-site.xml
MapReduce needs its memory limits set here, otherwise MR jobs hang on these small machines; the YARN configuration below sets the matching memory and CPU resources.
[hadoop@hadoop01 hadoop]$ vim /home/hadoop/hadoop/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop01:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop01:19888</value>
</property>
<!-- MapReduce task memory limits -->
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>512</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>512</value>
</property>
</configuration>
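These 512 MB map and reduce container sizes only work because they fit inside YARN's scheduler ceiling (yarn.scheduler.maximum-allocation-mb, set to 512 in yarn-site.xml below); a request above that ceiling is never granted, so the job just waits forever. A quick sanity check using the values from this guide — substitute your own numbers:

```shell
# Values taken from this guide's configuration. A container request
# larger than the scheduler maximum is never granted, and the job hangs.
map_mb=512
reduce_mb=512
max_alloc_mb=512    # yarn.scheduler.maximum-allocation-mb
if [ "$map_mb" -le "$max_alloc_mb" ] && [ "$reduce_mb" -le "$max_alloc_mb" ]; then
    echo "container requests fit within the scheduler maximum"
else
    echo "WARNING: a container request exceeds yarn.scheduler.maximum-allocation-mb"
fi
```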
7. yarn-site.xml
[hadoop@hadoop01 hadoop]$ vim /home/hadoop/hadoop/etc/hadoop/yarn-site.xml
On small machines the memory and CPU resource options below are required; they can be left at the defaults only if your machines have more resources than the defaults assume.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoop01:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoop01:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoop01:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>hadoop01:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hadoop01:8088</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<!-- Minimum physical memory a single container may request; the default is 1024 MB -->
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>256</value>
</property>
<!-- Maximum physical memory a single container may request; the default is 8192 MB -->
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>512</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>512</value>
<description>The amount of memory the MR AppMaster needs.</description>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.cpu-vcores</name>
<value>1</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>512</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>1</value>
</property>
</configuration>
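With yarn.nodemanager.resource.memory-mb at 512 and every container (the ApplicationMaster as well as each map or reduce task) also requesting 512 MB, each NodeManager can host exactly one container at a time. The capacity arithmetic, using this guide's values:

```shell
# Each NodeManager advertises 512 MB; a 512 MB container therefore
# consumes the whole node, so at most one container runs per node.
nm_mb=512         # yarn.nodemanager.resource.memory-mb
container_mb=512  # AM / map / reduce request size
echo "containers per NodeManager: $(( nm_mb / container_mb ))"
```

This is why at least two workers are needed with these settings: the AM occupies one node while a task container runs on the other. On a single worker the AM alone would exhaust the node and the job could never get a task container.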
8. Copy the configuration
Once configured, copy the hadoop directory to the other machines in the cluster:
[hadoop@hadoop01 ~]$ scp -r hadoop hadoop@hadoop02:~/
[hadoop@hadoop01 ~]$ scp -r hadoop hadoop@hadoop03:~/
9. Hadoop environment variables
[hadoop@hadoop01 ~]$ sudo vim /etc/profile
#hadoop
export HADOOP_HOME=/home/hadoop/hadoop
export PATH=$PATH:$HADOOP_HOME/sbin
export PATH=$PATH:$HADOOP_HOME/bin
[hadoop@hadoop01 ~]$ source /etc/profile
Format the NameNode
[hadoop@hadoop01 hadoop]$ bin/hdfs namenode -format
Start the cluster
[hadoop@hadoop01 hadoop]$ sbin/start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [hadoop01]
hadoop01: starting namenode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-namenode-hadoop01.out
hadoop02: starting datanode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-datanode-hadoop02.out
hadoop03: starting datanode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-datanode-hadoop03.out
Starting secondary namenodes [hadoop01]
hadoop01: starting secondarynamenode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-secondarynamenode-hadoop01.out
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-resourcemanager-hadoop01.out
hadoop03: starting nodemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-nodemanager-hadoop03.out
hadoop02: starting nodemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-nodemanager-hadoop02.out
hadoop01
hadoop02
hadoop03
Run the demo
Make sure the input directory exists in HDFS and contains some text files, then run the bundled wordcount example:
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /data/input /data/output/result
Check the running job and its result in the web UI.
Note: if the resource limits above are not configured, the job hangs at this point and never runs; if they are set too low, it fails with an error because a task has exceeded its memory allowance. For virtual-memory overruns, one more parameter is needed in yarn-site.xml (already included above):
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
Other possible problems
Problem 1: when accessing the Hadoop file system, an error like this appears:
ls: Call From localhost/127.0.0.1 to hadoop01:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
Problem 2: a similar connection error appears when uploading a file to HDFS (a put operation).
These problems can have several causes. Fix 1: stop the firewall:
[hadoop@hadoop01 sbin]$ sudo service iptables stop
[hadoop@hadoop01 sbin]$ sudo chkconfig iptables off
Fix 2: reformat HDFS
Delete the hadoop/tmp directory on every node, reformat with bin/hdfs namenode -format,
then start HDFS only: start-dfs.sh
Check the cluster report
[hadoop@hadoop01 sbin]$ hdfs dfsadmin -report
or in the web UI:
http://192.168.137.101:50070/dfshealth.html#tab-overview
Postscript
With this configuration done, Hadoop is complete. It mainly consists of two things: a distributed file system (HDFS) and the MapReduce distributed computing framework. In this setup YARN manages MapReduce, which addresses most of the problems of the classic MR runtime.
References
[1] 大数据学习前夕[01]:系统-网络-SSH
[2] 大数据学习前夕[02]:JDK安装升级
[Author: happyprince]