Setting up a Hadoop 2.7.2 cluster
This article first builds a two-node cluster and then adds nodes to it dynamically. The nodes are configured as follows:
master 192.168.101.26
the master node, which holds the public keys of all nodes
hadoop1 192.168.101.28
the DataNode used when the cluster is first created
hadoop2 192.168.101.29
hadoop3 192.168.101.30
the DataNodes added later
Append these mappings to the /etc/hosts file on every machine:
192.168.101.26 master
192.168.101.28 hadoop1
192.168.101.29 hadoop2
192.168.101.30 hadoop3
Preparing the files
Remove OpenJDK:
$ rpm -qa | grep -i openjdk #list the installed OpenJDK packages; -i makes the match case-insensitive
$ sudo yum remove java-1.6.0-openjdk-devel-1.6.0.0-6.1.13.4.el7_0.x86_64 \
java-1.7.0-openjdk-devel-1.7.0.65-2.5.1.2.el7_0.x86_64 \
java-1.7.0-openjdk-headless-1.7.0.65-2.5.1.2.el7_0.x86_64 \
java-1.7.0-openjdk-1.7.0.65-2.5.1.2.el7_0.x86_64 \
java-1.6.0-openjdk-1.6.0.0-6.1.13.4.el7_0.x86_64
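The query and removal can also be chained so the package names need not be typed out by hand (a sketch; review the matched packages before letting it run):

```shell
# Remove all installed OpenJDK packages in one pass (CentOS/RHEL 7).
# Double-check the matches first with: rpm -qa | grep -i openjdk
rpm -qa | grep -i openjdk | xargs -r sudo yum remove -y
```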
Download the JDK and extract it to the target directory.
$ sudo tar -zxvf jdk-8u60-linux-x64.tar.gz -C /usr/lib #-C, --directory=DIR extracts into DIR
$ sudo mv /usr/lib/jdk1.8.0_60/ /usr/lib/jdk #rename to a version-independent path, creating /usr/lib/jdk
Configure the environment variables:
$ sudo vim /etc/profile
Append the Java environment at the end of the file:
export JAVA_HOME=/usr/local/java/jdk1.8.0_92
export JRE_HOME=/usr/local/java/jdk1.8.0_92/jre
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH
export CLASSPATH=$CLASSPATH:.:$JAVA_HOME/lib:$JRE_HOME/lib
Then reload the profile (source is a shell builtin, so it is run without sudo):
$ source /etc/profile
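Once the profile is reloaded, a quick check confirms the variables took effect (a sanity check only; the expected path is the JAVA_HOME exported above):

```shell
# Verify the Java environment set in /etc/profile.
if [ -n "$JAVA_HOME" ] && [ -x "$JAVA_HOME/bin/java" ]; then
    "$JAVA_HOME/bin/java" -version
else
    echo "JAVA_HOME is unset or does not point at a JDK: '$JAVA_HOME'" >&2
fi
```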
Download hadoop-2.7.2, extract it to the target directory (/usr/local/hadoop), and create the required folders.
# tar -xzvf hadoop-2.7.2.tar.gz
Passwordless SSH
When building a Hadoop cluster, the machines must be able to SSH to each other without a password. On all four machines run:
# ssh-keygen -t rsa
Copy the generated public key to the master machine, renaming it on the way; taking hadoop1 as an example:
# scp ~/.ssh/id_rsa.pub root@master:~/id_rsa.pub.1
Append all the public keys to authorized_keys:
# cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
cat ~/id_rsa.pub.1 >> ~/.ssh/authorized_keys
cat ~/id_rsa.pub.2 >> ~/.ssh/authorized_keys
cat ~/id_rsa.pub.3 >> ~/.ssh/authorized_keys
On master, fix the permissions:
# chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
Then distribute the file to the other nodes:
# scp ~/.ssh/authorized_keys root@hadoop1:~/.ssh/authorized_keys
scp ~/.ssh/authorized_keys root@hadoop2:~/.ssh/authorized_keys
scp ~/.ssh/authorized_keys root@hadoop3:~/.ssh/authorized_keys
After this, the nodes can log in to each other without a password.
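A small loop verifies that every node can be reached without a password prompt (BatchMode makes ssh fail instead of asking for one; the host names are the ones added to /etc/hosts above):

```shell
# Check passwordless SSH from this machine to every cluster node.
for host in master hadoop1 hadoop2 hadoop3; do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
        echo "$host: OK"
    else
        echo "$host: passwordless login FAILED"
    fi
done
```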
Extract the Hadoop files to the target directory, here /usr/local/hadoop/hadoop-2.7.2. The files to configure are:
/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/hadoop-env.sh
/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/yarn-env.sh
/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/slaves
/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/core-site.xml
/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/hdfs-site.xml
/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/mapred-site.xml
/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/yarn-site.xml
1. Configure the hadoop-env.sh file -> set JAVA_HOME
# The java implementation to use.
export JAVA_HOME=/usr/local/java/jdk1.8.0_92
2. Configure the yarn-env.sh file -> set JAVA_HOME
# some Java parameters
export JAVA_HOME=/usr/local/java/jdk1.8.0_92
3. Configure the slaves file -> add the slave nodes
hadoop1
hadoop2
hadoop3
4. Configure the core-site.xml file -> add the Hadoop core settings (HDFS on port 9000, temporary files under file:/hadoop/tmp)
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>hadoop.proxyuser.spark.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.spark.groups</name>
<value>*</value>
</property>
</configuration>
5. Configure the hdfs-site.xml file -> add the HDFS settings (NameNode and DataNode ports and directory locations)
<configuration>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>master:9001</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/hadoop/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/hadoop/dfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>
6. Configure the mapred-site.xml file -> add the MapReduce settings (use the YARN framework; the JobHistory service and web addresses)
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>master:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>master:19888</value>
</property>
</configuration>
7. Configure the yarn-site.xml file -> add the YARN settings
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>master:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>master:8035</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>master:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>master:8088</value>
</property>
</configuration>
Copy the configured Hadoop directory to all the slave machines.
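The whole tree can be pushed with a small loop (a sketch; it assumes the install path used in this guide and the root SSH access configured earlier):

```shell
# Copy the configured Hadoop tree from master to every slave.
for host in hadoop1 hadoop2 hadoop3; do
    scp -r /usr/local/hadoop/hadoop-2.7.2 "root@$host:/usr/local/hadoop/" \
        || echo "copy to $host failed" >&2
done
```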
1. Format the NameNode (on master):
[root@master hadoop-2.7.2]$ ./bin/hdfs namenode -format
2. Start HDFS:
[root@master hadoop-2.7.2]$ ./sbin/start-dfs.sh
3. Start YARN:
[root@master hadoop-2.7.2]$ ./sbin/start-yarn.sh
4. Check the cluster status:
[root@master hadoop-2.7.2]$ ./bin/hdfs dfsadmin -report
5. View HDFS at http://master:50070/
6. View the ResourceManager at http://master:8088/
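With HDFS and YARN running, jps on each node shows which daemons came up; roughly, master should list NameNode, SecondaryNameNode and ResourceManager, and each slave DataNode and NodeManager (host names as configured above):

```shell
# List the Java daemons on every node over SSH.
for host in master hadoop1 hadoop2 hadoop3; do
    echo "== $host =="
    ssh -o BatchMode=yes "$host" jps 2>/dev/null || echo "(unreachable)"
done
```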
Testing the wordcount program
Let's create a test case. Create a shell script with the following contents:
touch data1 data2
for ((i=1;i<999999;i++)); do
echo "this is a test data1" >> data1
echo "and the data2 will be always created" >> data2
done
When the script finishes it has created two files, data1 and data2, with a combined size of roughly 60 MB. Put them into the cluster file system:
# ./bin/hadoop fs -mkdir -p /tmp/input
./bin/hadoop fs -put ~/data1 /tmp/input
./bin/hadoop fs -put ~/data2 /tmp/input
Then run wordcount:
# ./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /tmp/input /output
When the job completes it prints output similar to the following:
15/09/16 20:39:21 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.4.140:8032
15/09/16 20:39:23 INFO input.FileInputFormat: Total input paths to process : 2
15/09/16 20:39:24 INFO mapreduce.JobSubmitter: number of splits:2
15/09/16 20:39:24 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1442448885469_0002
15/09/16 20:39:24 INFO impl.YarnClientImpl: Submitted application application_1442448885469_0002
15/09/16 20:39:24 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1442448885469_0002/
15/09/16 20:39:24 INFO mapreduce.Job: Running job: job_1442448885469_0002
15/09/16 20:39:37 INFO mapreduce.Job: Job job_1442448885469_0002 running in uber mode : false
15/09/16 20:39:37 INFO mapreduce.Job: map 0% reduce 0%
15/09/16 20:39:56 INFO mapreduce.Job: map 7% reduce 0%
15/09/16 20:39:59 INFO mapreduce.Job: map 34% reduce 0%
15/09/16 20:40:03 INFO mapreduce.Job: map 42% reduce 0%
15/09/16 20:40:06 INFO mapreduce.Job: map 45% reduce 0%
15/09/16 20:40:09 INFO mapreduce.Job: map 69% reduce 0%
15/09/16 20:40:16 INFO mapreduce.Job: map 83% reduce 0%
15/09/16 20:40:24 INFO mapreduce.Job: map 100% reduce 0%
15/09/16 20:40:32 INFO mapreduce.Job: map 100% reduce 100%
15/09/16 20:40:32 INFO mapreduce.Job: Job job_1442448885469_0002 completed successfully
15/09/16 20:40:32 INFO mapreduce.Job: Counters: 50
File System Counters
FILE: Number of bytes read=580
FILE: Number of bytes written=318160
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=64000070
HDFS: Number of bytes written=148
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Killed map tasks=1
Launched map tasks=3
Launched reduce tasks=1
Data-local map tasks=3
Total time spent by all maps in occupied slots (ms)=90046
Total time spent by all reduces in occupied slots (ms)=18039
Total time spent by all map tasks (ms)=90046
Total time spent by all reduce tasks (ms)=18039
Total vcore-seconds taken by all map tasks=90046
Total vcore-seconds taken by all reduce tasks=18039
Total megabyte-seconds taken by all map tasks=92207104
Total megabyte-seconds taken by all reduce tasks=18471936
Map-Reduce Framework
Map input records=1999996
Map output records=11999976
Map output bytes=111999776
Map output materialized bytes=205
Input split bytes=198
Combine input records=11999997
Combine output records=38
Reduce input groups=12
Reduce shuffle bytes=205
Reduce input records=17
Reduce output records=12
Spilled Records=65
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=2646
CPU time spent (ms)=24030
Physical memory (bytes) snapshot=368271360
Virtual memory (bytes) snapshot=6227111936
Total committed heap usage (bytes)=254312448
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=63999872
File Output Format Counters
Bytes Written=148
Open the cluster file browser, locate the part-r-00000 file in the output directory, and view it:
a 999998
always 999998
and 999998
be 999998
created 999998
data1 999998
data2 999998
is 999998
test 999998
the 999998
this 999998
will 999998
This section explains in detail how to add nodes to a running Hadoop 2.7.2 cluster, in three parts: basic preparation, adding a DataNode, and adding a NodeManager.
Basic preparation
The basic preparation sets up the system environment Hadoop runs in:
- Change the system hostname (via the hostname command and /etc/sysconfig/network)
- Update the hosts file with every node in the cluster (keep the hosts file identical on all nodes)
- Set up passwordless login from the NameNode (both nodes, if running HA) to the DataNode (ssh-copy-id does this and avoids the permission fixes needed after copying *.pub files by hand)
- Add the new node's IP information to the master's slaves file (used when the cluster restarts)
- scp the Hadoop configuration files to the new node
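Run from the master, the checklist above boils down to a few commands (a sketch for a hypothetical new node named hadoop4; paths are the ones used in this guide):

```shell
# Prepare a new node (hadoop4 here as a stand-in) from the master.
HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.2
NEW_NODE=hadoop4

ssh-copy-id "root@$NEW_NODE"                          # passwordless login, no manual chmod needed
echo "$NEW_NODE" >> "$HADOOP_HOME/etc/hadoop/slaves"  # picked up on cluster restarts
scp -r "$HADOOP_HOME/etc/hadoop" "root@$NEW_NODE:$HADOOP_HOME/etc/"  # ship the config
```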
Adding a DataNode
A newly added DataNode joins the cluster once its datanode process is started:
- On the new node, run sbin/hadoop-daemon.sh start datanode
- On the NameNode, check the cluster state with hdfs dfsadmin -report
- HDFS then needs to be rebalanced. The default data-transfer bandwidth for balancing is quite low; it can be raised to 64 MB with hdfs dfsadmin -setBalancerBandwidth 67108864
- The balancer's default threshold is 10%, meaning each node's storage utilization may deviate from the cluster average by up to 10%; we can tighten it to 5%
- Then start the balancer with sbin/start-balancer.sh -threshold 5 and wait for the cluster to finish rebalancing
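Put together, the DataNode steps above look like this (run the first two commands on the new node; the balancer can run anywhere in the cluster; the install path is the one used in this guide):

```shell
HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.2   # install path from this guide

# On the new node: start the DataNode process.
"$HADOOP_HOME/sbin/hadoop-daemon.sh" start datanode

# On the NameNode: confirm the new node has registered.
"$HADOOP_HOME/bin/hdfs" dfsadmin -report

# Raise the balancer bandwidth to 64 MB (67108864 bytes), then rebalance
# until no node deviates more than 5% from the cluster average.
"$HADOOP_HOME/bin/hdfs" dfsadmin -setBalancerBandwidth 67108864
"$HADOOP_HOME/sbin/start-balancer.sh" -threshold 5
```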
Adding a NodeManager
Because Hadoop 2.x introduced the YARN framework, every compute node is managed through a NodeManager; as with the DataNode, starting the NodeManager process joins the node to the cluster:
- On the new node, run sbin/yarn-daemon.sh start nodemanager
- On the ResourceManager, check the cluster with yarn node -list
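The NodeManager side is analogous (start on the new node, verify from the ResourceManager host; paths as above):

```shell
HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.2

# On the new node: start the NodeManager.
"$HADOOP_HOME/sbin/yarn-daemon.sh" start nodemanager

# On the ResourceManager host: the new node should appear in the list.
"$HADOOP_HOME/bin/yarn" node -list
```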