I. Environment preparation
Suppose the cluster consists of five machines: master, slave1, slave2, slave3, and slave4.
1. Create a hadoop user account on every machine.
2. Set each machine's hostname in /etc/sysconfig/network.
On the master machine:
NETWORKING=yes
HOSTNAME=master (the name is arbitrary; pick something easy to remember)
On slave1:
NETWORKING=yes
HOSTNAME=slave1
On slave2:
... and so on for the remaining machines.
After editing the file, remember to run hostname master (or hostname slave1, etc., matching the name you chose) on the corresponding machine so the change takes effect immediately.
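Step 2 on a single node can be sketched as a short script. This is a sketch only: it edits a scratch copy of the file so it can be tried safely; on a real RHEL/CentOS node, point NET_FILE at /etc/sysconfig/network and run as root.

```shell
# Sketch of step 2 for one node. NET_FILE is a scratch stand-in here;
# on the real machine use /etc/sysconfig/network and run as root.
NEW_NAME=master                       # use slave1, slave2, ... on the other nodes
NET_FILE=$(mktemp)
printf 'NETWORKING=yes\nHOSTNAME=localhost\n' > "$NET_FILE"
# Rewrite the HOSTNAME= line in place.
sed -i "s/^HOSTNAME=.*/HOSTNAME=$NEW_NAME/" "$NET_FILE"
# On the real machine, also apply the name immediately:
#   hostname "$NEW_NAME"
```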
3. Edit /etc/hosts on every machine so that every machine can resolve every other by hostname. Note that both master and all slaves must be updated, and the hosts file contents should be identical on all machines.
For example:
192.168.30.60 master
192.168.30.61 slave1
192.168.30.62 slave2
192.168.30.63 slave3
192.168.30.65 slave4
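Step 3 can be consolidated into one script: write the entries once and push the same file to every node so the hosts files stay identical. A sketch, using the IPs above; HOSTS_FILE is a scratch stand-in for /etc/hosts, and the commented scp loop assumes root ssh access is already available.

```shell
# Build the shared hosts block once; HOSTS_FILE stands in for /etc/hosts
# so the snippet can be run safely.
HOSTS_FILE=$(mktemp)
cat >> "$HOSTS_FILE" <<'EOF'
192.168.30.60 master
192.168.30.61 slave1
192.168.30.62 slave2
192.168.30.63 slave3
192.168.30.65 slave4
EOF
# To keep every node identical (assumes root ssh access works):
#   for h in slave1 slave2 slave3 slave4; do scp "$HOSTS_FILE" root@"$h":/etc/hosts; done
```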
4. Set up passwordless SSH login
Hadoop uses SSH to log in to each node and launch its services, so make sure the network between all Hadoop nodes is working and that SSH is installed on every machine.
(1) Log in to the master machine as the hadoop user.
(2) Run ssh-keygen -t rsa and press Enter at every prompt (do not type a passphrase). This generates the private key id_rsa and the public key id_rsa.pub under /home/hadoop/.ssh.
A sample id_rsa.pub looks like:
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA3XYLxqxNfltkbKuCpJJDTuQekVJ0L3XA6dLoLQpPLbZxJNQ7DsogcMYM9opg+R1baTMvm1Cbj/cfIwELHPSRLFjN7E6x9S7PWnS2tObXosBNZ/eo6+eZiAF0h0LL+1Rsfsne2cP3amhdztbudSzm1ezLRPBLNUh0FKwDjbgnK2ZZy49h6vCvOZRKJPQf+B3xTSTbix/omalecCdYc1bCFvifOy1pgWVchKSQsynN0V901dA7CAfIjsAKc4DfyGcdoFNFp+POz6+q4AiYUmO+QTh7wPRa2vTg6FRlaaqvTUfnep6prFSVPe/Jh6dt6yyH0k7sIPDIl/kca6cZX0YgNw== hadoop@master
(3) Append the public key id_rsa.pub to authorized_keys:
cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys
(4) Copy authorized_keys to each slave machine, e.g. scp /home/hadoop/.ssh/authorized_keys hadoop@192.168.30.61:/home/hadoop/.ssh/, then scp /home/hadoop/.ssh/authorized_keys hadoop@192.168.30.62:/home/hadoop/.ssh/, and so on. First make sure the .ssh directory exists on each slave; create it manually if it does not.
(5) Set directory permissions (on all machines):
chmod 750 hadoop
chmod 750 .ssh
chmod 644 authorized_keys
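The permission fixes in step (5) can be sketched as follows; it runs against a scratch directory standing in for /home/hadoop, so the same chmod values can be verified safely before touching a real node.

```shell
# Scratch directory standing in for /home/hadoop.
HADOOP_HOME_DIR=$(mktemp -d)
mkdir -p "$HADOOP_HOME_DIR/.ssh"
touch "$HADOOP_HOME_DIR/.ssh/authorized_keys"
# Same modes as in step (5): home 750, .ssh 750, authorized_keys 644.
chmod 750 "$HADOOP_HOME_DIR"
chmod 750 "$HADOOP_HOME_DIR/.ssh"
chmod 644 "$HADOOP_HOME_DIR/.ssh/authorized_keys"
```

The key requirement is that none of these files be group- or world-writable, otherwise sshd will silently ignore authorized_keys.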
(6) Verify that SSH works:
On the master machine, run ssh slave1.
If it logs in without prompting for a password, the setup succeeded.
5. Install the JDK
This follows the usual JDK installation steps: download a recent JDK, run the installer, and set the environment variables.
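The environment-variable part typically amounts to a few export lines in ~/.bashrc or /etc/profile. A sketch; the install path is an assumption, so point JAVA_HOME at wherever your JDK actually landed.

```shell
# Assumed install path; adjust to your actual JDK location.
export JAVA_HOME=/usr/java/jdk1.6.0_45
export PATH="$JAVA_HOME/bin:$PATH"
export CLASSPATH=".:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar"
```

After sourcing the file, java -version should report the JDK you installed.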
II. Installing Hadoop
1. Add the CDH3 yum repository and install Hadoop
(1) wget -c http://archive.cloudera.com/redhat/cdh/cdh3-repository-1.0-1.noarch.rpm
(2) yum --nogpgcheck localinstall cdh3-repository-1.0-1.noarch.rpm // installing this drops the cloudera-cdh3.repo file into place
(3) rpm --import http://archive.cloudera.com/redhat/cdh/RPM-GPG-KEY-cloudera // import the RPM GPG key
(4) yum install hadoop-0.20
(5) yum install hadoop-0.20-namenode (install on the machine that will act as namenode; it is referenced in /etc/hadoop/conf/core-site.xml, covered below)
yum install hadoop-0.20-datanode (install on all slave machines; it can also go on the namenode machine if you want the namenode to double as a datanode)
yum install hadoop-0.20-jobtracker (install on the machine that will act as jobtracker; the jobtracker address is configured in /etc/hadoop/conf/mapred-site.xml)
yum install hadoop-0.20-tasktracker
Install the services that match each machine's role: every datanode machine also needs the tasktracker package, and the namenode machine can additionally serve as a datanode.
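The role-to-package mapping above can be sketched as a small helper that prints the yum command for a given role. A sketch only: the role names are my own labels, and the output is printed rather than executed so it can be reviewed first (pipe it to sh to actually install).

```shell
# Print (rather than run) the install command for a node role.
role_packages() {
    case "$1" in
        namenode)   echo "hadoop-0.20 hadoop-0.20-namenode" ;;
        datanode)   echo "hadoop-0.20 hadoop-0.20-datanode hadoop-0.20-tasktracker" ;;
        jobtracker) echo "hadoop-0.20 hadoop-0.20-jobtracker" ;;
        *)          return 1 ;;
    esac
}
echo "yum install -y $(role_packages datanode)"
```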
2. Edit the configuration files (HDFS side)
// the slaves file only needs to be configured on the namenode
cat /etc/hadoop/conf/slaves
192.168.30.61
192.168.30.62
192.168.30.63
192.168.30.64
cat /etc/hadoop/conf/masters
192.168.30.60
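Both files can be generated in one go. A sketch; CONF_DIR is a scratch directory standing in for /etc/hadoop/conf on the namenode.

```shell
# Scratch stand-in for /etc/hadoop/conf on the namenode.
CONF_DIR=$(mktemp -d)
# One datanode IP per line in slaves, the namenode IP in masters.
printf '%s\n' 192.168.30.61 192.168.30.62 192.168.30.63 192.168.30.64 > "$CONF_DIR/slaves"
printf '%s\n' 192.168.30.60 > "$CONF_DIR/masters"
```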
3. Edit the /etc/hadoop/conf/hdfs-site.xml configuration file
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<!-- Immediately exit safemode as soon as one DataNode checks in.
On a multi-node cluster, these configurations must be removed. -->
<property>
<name>dfs.safemode.extension</name>
<value>0</value>
</property>
<property>
<name>dfs.safemode.min.datanodes</name>
<value>1</value>
</property>
<!--
specify this so that running 'hadoop namenode -format' formats the right dir
<property>
<name>dfs.name.dir</name>
<value>/var/lib/hadoop-0.20/cache/hadoop/dfs/name</value>
</property>
-->
<!-- add by dongnan -->
<property>
<name>dfs.data.dir</name>
<value>/data/dfs/data</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/data/dfs/tmp</value>
</property>
<property>
<name>dfs.datanode.max.xcievers</name>
<value>200000</value>
</property>
</configuration>
4. Edit the /etc/hadoop/conf/core-site.xml configuration file
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://namenode:8020</value>
</property>
</configuration>
5. Edit /etc/hadoop/conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>192.168.30.61:9001</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx1024m -XX:+UseConcMarkSweepGC</value>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>1</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>1</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/data1/hdfs/</value>
<description>The local directory where MapReduce stores intermediate
data files. May be a comma-separated list of
directories on different devices in order to spread disk i/o.
Directories that do not exist are ignored.
</description>
</property>
<property>
<name>mapreduce.jobtracker.staging.root.dir</name>
<value>/user</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/mapred/system</value>
</property>
<property>
<name>io.sort.mb</name>
<value>256</value>
<description>The total amount of buffer memory to use while sorting
files, in megabytes. By default, gives each merge stream 1MB, which
should minimize seeks.</description>
</property>
<property>
<name>io.sort.factor</name>
<value>64</value>
</property>
<property>
<name>mapred.max.map.failures.percent</name>
<value>10</value>
</property>
<property>
<name>mapred.job.reuse.jvm.num.tasks</name>
<value>1</value>
<description>jvm reuse tasks count. default is 1. If it is -1, there is no limit</description>
</property>
<property>
<name>mapred.reduce.parallel.copies</name>
<value>64</value>
</property>
<!--
<property>
<name>job.end.notification.url</name>
<value>http://182.61.128.18:50030/test_url.jsp?jobid=$jobId&amp;jobStatus=$jobStatus</value>
<description>URL to notify when a job finishes; $jobId and $jobStatus are replaced with the job's id and final status</description>
</property>
-->
<!--
<property>
<name>mapred.jobtracker.taskScheduler</name>
<value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>
<property>
<name>mapred.queue.names</name>
<value>default,ca</value>
</property>
-->
</configuration>
6. Start the corresponding Hadoop daemons
[root@namenode ~]# /etc/init.d/hadoop-0.20-namenode start (the 1 namenode)
[root@slave1 /]# /etc/init.d/hadoop-0.20-datanode start (the 4 datanodes)
[root@slave2 /]# /etc/init.d/hadoop-0.20-datanode start
[root@slave1 /]# /etc/init.d/hadoop-0.20-tasktracker start (the 4 tasktrackers, matching the datanodes)
[root@slave1 /]# /etc/init.d/hadoop-0.20-jobtracker start (the 1 jobtracker)
Start on each machine only the services that match its role.
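The per-role startup can be sketched as a small helper. It echoes the init-script invocations instead of running them, so the output can be checked before use; drop the echo to actually start the daemons.

```shell
# Print the init.d start commands for the services a node should run.
start_role() {
    for svc in "$@"; do
        echo "/etc/init.d/hadoop-0.20-$svc start"
    done
}
start_role namenode                # on the namenode machine
start_role datanode tasktracker    # on each slave
start_role jobtracker              # on the jobtracker machine (slave1 here)
```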
7. Done: the installation is complete. Verify via the web UIs:
http://192.168.30.60:50070/ (namenode)
http://192.168.30.61:50030/jobtracker.jsp (jobtracker)
Original article from: http://blog.chinaunix.net/uid-12014716-id-3987394.html