Environment
Two virtual machine nodes (192.168.100.171 <debian171>, 192.168.100.172 <debian172>)
Debian jessie 8.5
Hadoop 2.7.2
Note: for single-node and pseudo-distributed setups, see the official Hadoop documentation; this post covers only the fully distributed setup.
Distributed environment setup
# Install ssh and rsync
sudo apt-get install ssh rsync
# Configure passwordless ssh login
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Run the script above on both the 171 and 172 machines, then use scp to fetch the other node's id_dsa.pub:
debian171:
scp surfin@192.168.100.172:/home/surfin/.ssh/id_dsa.pub ~/.ssh/id_dsa_172.pub
cat ~/.ssh/id_dsa_172.pub >> ~/.ssh/authorized_keys
debian172:
scp surfin@192.168.100.171:/home/surfin/.ssh/id_dsa.pub ~/.ssh/id_dsa_171.pub
cat ~/.ssh/id_dsa_171.pub >> ~/.ssh/authorized_keys
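The cat steps above append blindly, so re-running the setup duplicates key entries. A small idempotent variant can avoid that (a sketch; the helper name append_key_once is my own, not from this post):

```shell
# append_key_once PUBKEY_FILE AUTH_FILE : append the key only if that exact
# line is not already present, so the step can be re-run safely.
# (The helper name is illustrative, not from the original post.)
append_key_once() {
    key=$(cat "$1")
    touch "$2" && chmod 600 "$2"    # authorized_keys must not be group/world writable
    grep -qxF "$key" "$2" || printf '%s\n' "$key" >> "$2"
}

# e.g. on debian171, instead of the plain cat:
#   append_key_once ~/.ssh/id_dsa_172.pub ~/.ssh/authorized_keys
```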
Try ssh:
surfin@debian171:~/.ssh$ ssh debian171
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Mon Jul 25 10:17:29 2016 from debian171.surfin.org
surfin@debian171:~$ exit
Note: some tutorials copy authorized_keys directly to the other nodes, and you may then see a warning such as "Warning: Permanently added 'debian171,192.168.100.171' (ECDSA) to the list of known hosts." This warning only means the host key was just recorded in known_hosts. If it bothers you, delete the authorized_keys and known_hosts files under .ssh, redo the cat step, and ssh once; from the second ssh on, the warning no longer appears.
# Install hadoop
sudo mkdir -p /usr/local/hadoop
sudo chown -R surfin:surfin /usr/local/hadoop
tar xvf hadoop-2.7.2.tar.gz -C /usr/local/hadoop
Edit hadoop-env.sh and replace the export JAVA_HOME=${JAVA_HOME} line:
# The java implementation to use.
export JAVA_HOME=/usr/local/java/jdk1.8.0_92
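A typo in JAVA_HOME only surfaces later, when the daemons fail to start, so a quick sanity check can help (a sketch; check_java_home is my own helper, not part of Hadoop):

```shell
# check_java_home DIR : succeeds only if DIR contains an executable bin/java.
# (Helper name is illustrative, not from the original post.)
check_java_home() {
    [ -x "$1/bin/java" ]
}

# e.g. before starting any daemons:
if check_java_home /usr/local/java/jdk1.8.0_92; then
    echo "JAVA_HOME looks good"
else
    echo "JAVA_HOME looks wrong" >&2
fi
```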
Edit core-site.xml and add the following properties:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://debian171:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>file:/usr/local/hadoop/hadoop-2.7.2/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
Note: setting hadoop.tmp.dir is strongly recommended. It defaults to a directory under /tmp, so after a Linux reboot the NameNode metadata is lost and HDFS has to be formatted again.
Edit hdfs-site.xml:
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>debian171:50090</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoop/hadoop-2.7.2/tmp/dfs/name</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoop/hadoop-2.7.2/tmp/dfs/data</value>
</property>
<property>
  <name>dfs.permissions.enabled</name>
  <value>false</value>
</property>
Note: dfs.replication defaults to 3; with a single DataNode we use 1 for now. dfs.permissions.enabled defaults to true; since a Windows development environment will be set up later (and the Windows user name differs from the Linux one), it is temporarily set to false. Do not disable it in production.
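If you would rather keep permission checks enabled, one commonly used alternative (my suggestion, not what this post does) is to make the Windows client identify itself as the Linux user via the HADOOP_USER_NAME environment variable, which Hadoop honors under simple authentication:

```
:: On the Windows development machine (cmd.exe), before running the client:
set HADOOP_USER_NAME=surfin
```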
Edit mapred-site.xml:
cp mapred-site.xml.template mapred-site.xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>debian171:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>debian171:19888</value>
</property>
Note: YARN is used to manage resources here.
Edit yarn-site.xml:
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>debian171</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
Edit slaves:
debian172
Note: make sure /etc/hosts on every machine has entries for both debian171 and debian172.
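For reference, the /etc/hosts entries on each node might look like this (the .surfin.org domain is taken from the ssh banner and dfsadmin output in this post):

```
192.168.100.171   debian171.surfin.org debian171
192.168.100.172   debian172.surfin.org debian172
```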
Copy the files under etc/hadoop on debian171 to the corresponding Hadoop directory on debian172 (scp or rsync both work).
# Format HDFS
bin/hdfs namenode -format
# OK, start hadoop
sbin/start-dfs.sh
# Start yarn
sbin/start-yarn.sh
# Start the jobhistory server
sbin/mr-jobhistory-daemon.sh start historyserver
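After the three start commands, jps on each node should list the expected JVM daemons. A small check helper (a sketch; check_daemons is my own name, and the per-node daemon lists assume the single-master layout of this post):

```shell
# check_daemons NAME... : read jps-style output on stdin and fail if any of
# the named daemons is missing. (Helper name is illustrative.)
check_daemons() {
    out=$(cat)
    for d in "$@"; do
        echo "$out" | grep -qw "$d" || { echo "missing: $d" >&2; return 1; }
    done
}

# e.g. on debian171 (master):
#   jps | check_daemons NameNode SecondaryNameNode ResourceManager JobHistoryServer
# and on debian172 (slave):
#   jps | check_daemons DataNode NodeManager
```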
# Check the cluster status
bin/hdfs dfsadmin -report
Configured Capacity: 121511374848 (113.17 GB)
Present Capacity: 111463813120 (103.81 GB)
DFS Remaining: 111463260160 (103.81 GB)
DFS Used: 552960 (540 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
-------------------------------------------------
Live datanodes (1):
Name: 192.168.100.172:50010 (debian172.surfin.org)
Hostname: debian172.surfin.org
Decommission Status : Normal
Configured Capacity: 121511374848 (113.17 GB)
DFS Used: 552960 (540 KB)
Non DFS Used: 10047561728 (9.36 GB)
DFS Remaining: 111463260160 (103.81 GB)
DFS Used%: 0.00%
DFS Remaining%: 91.73%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Mon Jul 25 11:04:23 HKT 2016
Note the Live datanodes value: it should match the number of DataNodes (here, one).
# View the cluster at http://debian171:8088/cluster/nodes
# Test the cluster
# (relative HDFS paths resolve to /user/<username>, which must be created first)
bin/hdfs dfs -mkdir -p /user/surfin
bin/hdfs dfs -mkdir input
bin/hdfs dfs -put etc/hadoop input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
View the result:
bin/hdfs dfs -cat output/*