Pre-installation
Make sure JDK 1.6+ and ssh are installed on all hosts.
Add the hostnames to /etc/hosts
Edit /etc/hosts:
sudo vi /etc/hosts
and add:
10.5.5.3 master
10.5.5.4 slave1
10.5.5.5 slave2
10.5.5.6 slave3
The hostname itself is set in /etc/sysconfig/network.
Set up passwordless ssh (so that master can ssh to the slaves without a password)
Generate an ssh key pair on every host:
ssh-keygen -t rsa
On the master host, create authorized_keys:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
scp the master's public key to each slave and append it to that slave's authorized_keys.
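What the append amounts to on each slave can be sketched locally. This is a simulation: the temporary directory stands in for the slave's ~/.ssh, and the key string is a fake placeholder (on a real slave the file arrives from the master via scp):

```shell
# Simulate appending the master's public key to a slave's authorized_keys.
# SSH_DIR stands in for the slave's ~/.ssh; the key below is a placeholder.
SSH_DIR=$(mktemp -d)
echo "ssh-rsa AAAAB3...placeholder root@master" > "$SSH_DIR/id_rsa.pub"
cat "$SSH_DIR/id_rsa.pub" >> "$SSH_DIR/authorized_keys"
# sshd refuses keys kept with loose permissions, so tighten them:
chmod 700 "$SSH_DIR"
chmod 600 "$SSH_DIR/authorized_keys"
```

On the real hosts the per-slave equivalent is `scp ~/.ssh/id_rsa.pub root@slave1:` followed by `cat id_rsa.pub >> ~/.ssh/authorized_keys` on the slave, or simply `ssh-copy-id root@slave1` where it is available.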
Installation
Download Apache Hadoop
Apache Hadoop: http://www.apache.org/dyn/closer.cgi/hadoop/common/
Make sure Hadoop is extracted to the same directory on the master and on all slaves.
Set Environment Variables
#export HADOOP_HOME=/home/hadoop/hadoop-2.2.0
#export JAVA_HOME=/home/hadoop/jdk1.6.0_45
Set JAVA_HOME to the local JDK path; I am not sure whether a JRE path would also work.
Writing these into .bashrc works as well.
According to material found online, the following also needs to be added to /etc/profile:
#hadoop variable settings
export HADOOP_HOME=/root/hadoop-2.2.0
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HADOOP_HOME/lib
#export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
#export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
Not sure whether this is strictly required.
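Before moving on, it is worth checking that the variables actually resolve in a new shell. A small bash sketch (the variable names are the ones from the block above; `${!var}` is bash indirect expansion):

```shell
# Warn about any Hadoop-related variable that is still unset in this shell.
check_vars() {
  for var in "$@"; do
    # ${!var} reads the value of the variable whose name is in $var
    if [ -z "${!var}" ]; then
      echo "WARNING: $var is not set"
    fi
  done
}

check_vars JAVA_HOME HADOOP_HOME HADOOP_CONF_DIR
```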
Hadoop Configuration
Modify the following files in the configuration folder ($HADOOP_CONF_DIR, i.e. etc/hadoop in Hadoop 2.x); the changes below are compiled from various sources:
- core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/root/hadoop-2.2.0/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>
- hdfs-site.xml
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/root/hadoop-2.2.0/name</value>
    <description>Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.</description>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/root/hadoop-2.2.0/data</value>
    <description>Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description>
    <final>true</final>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
- mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
- yarn-site.xml
<configuration>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8031</value>
    <description>host is the hostname of the resource manager and port is the port on which the NodeManagers contact the Resource Manager.</description>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8030</value>
    <description>host is the hostname of the resourcemanager and port is the port on which the Applications in the cluster talk to the Resource Manager.</description>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    <description>In case you do not want to use the default scheduler</description>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8032</value>
    <description>the host is the hostname of the ResourceManager and the port is the port on which the clients can talk to the Resource Manager.</description>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>${hadoop.tmp.dir}/nodemanager/local</value>
    <description>the local directories used by the nodemanager</description>
  </property>
  <property>
    <name>yarn.nodemanager.address</name>
    <value>0.0.0.0:8034</value>
    <description>the nodemanagers bind to this port</description>
  </property>
  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>${hadoop.tmp.dir}/nodemanager/remote</value>
    <description>directory on hdfs where the application logs are moved to</description>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>${hadoop.tmp.dir}/nodemanager/logs</value>
    <description>the directories used by Nodemanagers as log directories</description>
  </property>
  <!-- Use mapreduce_shuffle instead of mapreduce.shuffle (YARN-1229) -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
- slaves
slave1
slave2
slave3
- hadoop-env.sh
export JAVA_HOME=/home/hadoop/jdk1.6.0_45
JAVA_HOME here can be set according to each server's own setup.
Format the NameNode
Before starting the cluster for the first time, the NameNode must be formatted:
hadoop namenode -format
Start Hadoop
~/hadoop-2.2.0/sbin/start-all.sh
This prints the startup logs for the various daemons as they are created.
When it finishes, run jps in a shell on the master and on a slave; they should show, respectively:
[root@master ~]# jps
15096 ResourceManager
7980 Jps
14782 NameNode
14956 SecondaryNameNode
[root@slave1 ~]# jps
7839 NodeManager
7733 DataNode
20035 Jps
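The jps checks above can also be scripted. In this sketch the jps output is stubbed with the master listing shown above so that the snippet runs anywhere; on a real master, replace the stub with `JPS_OUT="$(jps)"`:

```shell
# Stubbed `jps` output (the master listing from above) so this runs anywhere;
# on a real master, replace the stub with: JPS_OUT="$(jps)"
JPS_OUT='15096 ResourceManager
14782 NameNode
14956 SecondaryNameNode'

# Report whether each expected daemon name appears in the jps output.
check_daemons() {
  for daemon in "$@"; do
    if printf '%s\n' "$JPS_OUT" | grep -q "$daemon"; then
      echo "$daemon running"
    else
      echo "$daemon MISSING"
    fi
  done
}

check_daemons NameNode SecondaryNameNode ResourceManager
```

The same function with `DataNode NodeManager` as arguments covers the slave side.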
Test
HDFS Web UI: http://10.108.100.18:50070/
YARN Web UI: http://10.108.100.18:8088/
YARN && MapReduce test (run wordcount)
[root@master hadoop-2.2.0]# hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /input /root/output
Both paths are paths in Hadoop's namespace. Operations on files and directories in the namespace follow the form: hadoop dfs -<command> <path>
e.g.:
hadoop dfs -mkdir /home/hadoop/input
hadoop dfs -cat /home/hadoop/output/part-r-00000
To upload files from the local filesystem to HDFS: hadoop dfs -put input/* /home/hadoop/input
The first path is on the local filesystem, the second is in HDFS.
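As a point of reference for what the wordcount job computes, the same per-word counts over a tiny local file can be produced with standard tools (purely illustrative; no cluster involved):

```shell
# Local illustration of what the wordcount job computes (no cluster needed):
# split on spaces, sort, count duplicates, print "word count".
INPUT=$(mktemp)
printf 'hello hadoop\nhello yarn\n' > "$INPUT"
tr -s ' ' '\n' < "$INPUT" | sort | uniq -c | awk '{print $2, $1}'
# prints: hadoop 1, hello 2, yarn 1 (one pair per line)
```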
At this point, a Hadoop cluster has been successfully set up across several VMs in OpenStack. Note that each slave should be deployed on a different compute node to achieve good efficiency.
In March 2014, Sahara (formerly Savanna), the Hadoop project within OpenStack, graduated from OpenStack incubation and will become one of the OpenStack core projects starting with the next release, Juno.
I wonder how the experts at Hortonworks will schedule and allocate nova VMs to build Hadoop clusters.