Notes on Setting Up a Fully Distributed Hadoop Environment

0. Environment Overview

  On a CentOS 6.3 system I installed xen-4.1.2 + linux-2.6.31.8 to get a Xen-enabled host. For testing, I then started four virtual machines, centos1, centos2, centos3, and centos4, to simulate a distributed cluster and built a fully distributed Hadoop environment on them. Without that many physical machines this is the only way to practice, and it conveniently gives a taste of virtualization and cloud computing.

1. Preparation

  (1) Node assignment:
         Physical host     192.168.77.88     monitoring machine (browser)
         centos1           192.168.77.89     namenode
         centos2           192.168.77.90     datanode
         centos3           192.168.77.91     datanode
         centos4           192.168.77.92     datanode

    
  (2) Edit the /etc/hosts file on every machine in the deployment.
         Physical host:
        [root@as img]# cat /etc/hosts
        127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
        ::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
        127.0.0.1   as as.localdomanin
        192.168.77.89   centos1
        192.168.77.90   centos2
        192.168.77.91   centos3
        192.168.77.92   centos4
        Virtual machines (all four use the same configuration):
        [root@centos1 ~]# cat /etc/hosts
        #127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
        #::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
        #127.0.0.1   as as.localdomanin
        192.168.77.89 centos1
        192.168.77.90 centos2
        192.168.77.91 centos3
        192.168.77.92 centos4
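
  Before moving on, it is worth confirming that every hostname resolves on every machine. A minimal check of my own (run on each host):

        for h in centos1 centos2 centos3 centos4; do ping -c 1 $h; done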
  
2. set-SSH

  Configure SSH so that the namenode [centos1] can log in to the three datanodes [centos2, centos3, centos4] without a password.
       (1) Cluster assignment:
           192.168.77.89 centos1    // namenode
           192.168.77.90 centos2    // datanode
           192.168.77.91 centos3    // datanode
           192.168.77.92 centos4    // datanode
       
       (2) Generate a key pair on each of centos1, centos2, centos3, and centos4 (press Enter or answer "y" at every prompt):

       [root@centos1 ~]# ssh-keygen -t rsa
       Generating public/private rsa key pair.
       Enter file in which to save the key (/root/.ssh/id_rsa):
       Enter passphrase (empty for no passphrase):
       Enter same passphrase again:
       Your identification has been saved in /root/.ssh/id_rsa.
       Your public key has been saved in /root/.ssh/id_rsa.pub.
       The key fingerprint is:
       68:a7:db:bf:13:9b:f3:e8:65:f9:92:0b:a9:f1:16:16 root@centos1
       The key's randomart image is:
       +--[ RSA 2048]----+
       |                 |
       |                 |
       |                 |
       |       .  E      |
       |      o S  .     |
       |     . o  +. .   |
       |      . ..o=+.   |
       |       o +*=o.   |
       |      . o+*=oo.  |
       +-----------------+
     
       (3) Run the same command, ssh-keygen -t rsa, on centos2, centos3, and centos4.
      
       (4) On centos1, copy the public key into authorized_keys:
      [root@centos1 ~]# cd /root/.ssh
      [root@centos1 .ssh]# ls
      id_rsa    id_rsa.pub  known_hosts
      [root@centos1 .ssh]# cp id_rsa.pub authorized_keys
      [root@centos1 .ssh]# ls
      authorized_keys  id_rsa  id_rsa.pub  known_hosts
   
     
  (5) Distribute centos1's public key to the datanodes [centos2, centos3, centos4]:

       [root@centos1 .ssh]# scp authorized_keys centos2:/root/.ssh
       [root@centos1 .ssh]# scp authorized_keys centos3:/root/.ssh
       [root@centos1 .ssh]# scp authorized_keys centos4:/root/.ssh
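
       As an aside, ssh-copy-id can do the same distribution; unlike the scp above, it appends to the remote authorized_keys instead of replacing it, which matters if a node already holds other keys. An alternative sketch of my own:

       [root@centos1 .ssh]# for h in centos2 centos3 centos4; do ssh-copy-id -i /root/.ssh/id_rsa.pub root@$h; done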

       (6) Fix the file permissions; run this on each of centos1, centos2, centos3, and centos4 (in /root/.ssh/ in every case):

       [root@centos1 .ssh]# chmod 644 authorized_keys
       [root@centos2 .ssh]# chmod 644 authorized_keys
       [root@centos3 .ssh]# chmod 644 authorized_keys
       [root@centos4 .ssh]# chmod 644 authorized_keys


       Now when centos1 opens an SSH connection to centos2, centos3, or centos4, only the very first connection prompts for confirmation (accepting the remote host key); after that it logs in without a password.
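
       A quick way to verify (a sketch of my own): each ssh call below should print the remote hostname with no password prompt.

       [root@centos1 ~]# for h in centos2 centos3 centos4; do ssh $h hostname; done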
    
      
 
3. set-hadoop

  The configuration can be identical on the namenode and all datanodes (and is in this example); for real work and testing, adjust it per node as needed (see the official documentation for the details).
  All of the following operations take place on the namenode, centos1, in /usr/hadoop/conf/; my Hadoop installation lives under /usr/.
     
  (1) Download the latest Hadoop from the official site; I used hadoop-0.21.0.tar.gz.
      Reference: http://hadoop.apache.org/
      After downloading, put it in /usr/ on centos1, extract it, and rename the directory to hadoop.

      [root@centos1 usr]# tar xvzf hadoop-0.21.0.tar.gz
      [root@centos1 usr]# mv hadoop-0.21.0 hadoop
      [root@centos1 usr]# cd /usr/hadoop/conf/

       (2) Configure the hadoop-env.sh file
         The main task is setting JAVA_HOME to the JDK path on this system. If the JDK path differs among the four virtual machines, this file has to be configured separately on each of them.
         [root@centos1 conf]# cat hadoop-env.sh
        # Set Hadoop-specific environment variables here.
       
        # The only required environment variable is JAVA_HOME.  All others are
        # optional.  When running a distributed configuration it is best to
        # set JAVA_HOME in this file, so that it is correctly defined on
        # remote nodes.

        # The java implementation to use.  Required.
        # export JAVA_HOME=/usr/lib/j2sdk1.6-sun
        export JAVA_HOME=/usr/java/jdk1.7.0_05

        # Extra Java CLASSPATH elements.  Optional.
        # export HADOOP_CLASSPATH=
          ...........................
          ...........................
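
         Since a single shared hadoop-env.sh assumes the JDK sits at the same path everywhere, a quick check of my own (using the JDK path set above) can confirm this before the configuration is copied around:

         [root@centos1 conf]# for h in centos1 centos2 centos3 centos4; do ssh $h ls -d /usr/java/jdk1.7.0_05; done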

    
  (3) Configure the core-site.xml file
       [root@centos1 conf]# cat core-site.xml
       <?xml version="1.0"?>
       <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
       <!-- Put site-specific property overrides in this file. -->
       <configuration>
           <property>
                 <name>fs.default.name</name>
                 <value>hdfs://192.168.77.89:9000/</value>
           </property>
          <property>
                 <name>hadoop.tmp.dir</name>
                 <value>/usr/local/hadoop/hadooptmp</value>
          </property>
       </configuration>

    
  (4) Configure the mapred-site.xml file
       [root@centos1 conf]# cat mapred-site.xml
       <?xml version="1.0"?>
       <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
       <!-- Put site-specific property overrides in this file. -->
       <configuration>
           <property>
                  <name>mapred.job.tracker</name>
                  <value>192.168.77.89:9001</value>
           </property>
           <property>
                  <name>mapred.local.dir</name>
                  <value>/usr/local/hadoop/mapred/local</value>
           </property>
           <property> 
                  <name>mapred.system.dir</name>
                  <value>/tmp/hadoop/mapred/system</value>
           </property>
       </configuration>


    
  (5) Configure hdfs-site.xml
       [root@centos1 conf]# cat hdfs-site.xml
       <?xml version="1.0"?>
       <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
       <!-- Put site-specific property overrides in this file. -->
       <configuration>
            <property>
                   <name>dfs.name.dir</name>
                   <value>/usr/local/hadoop/hdfs/name</value>
            </property>
            <property>
                   <name>dfs.data.dir</name>
                   <value>/usr/local/hadoop/hdfs/data</value>
            </property>
            <property>
                   <name>dfs.replication</name>
                   <value>3</value>
            </property>
        </configuration>
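
       The name, data, and tmp paths above live on each node's local disk. Hadoop creates most of them on demand, but creating them up front on every node is a harmless precaution (an optional step of my own):

       [root@centos1 conf]# for h in centos1 centos2 centos3 centos4; do ssh $h mkdir -p /usr/local/hadoop/hdfs/name /usr/local/hadoop/hdfs/data /usr/local/hadoop/hadooptmp; done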
    
     
  (6) Configure the masters and slaves files
      (Despite the name, the masters file lists the host(s) that run the secondary namenode; the slaves file lists the datanode/tasktracker hosts.)
        [root@centos1 conf]# cat masters
        #localhost
        192.168.77.89  #centos1
        [root@centos1 conf]# cat slaves
        #localhost
        192.168.77.90 #centos2
        192.168.77.91 #centos3
        192.168.77.92 #centos4

     
  (7) Distribute the configured Hadoop to the datanodes, into /usr/ so the layout matches centos1:

          [root@centos1 usr]# scp -r hadoop centos2:/usr/
          [root@centos1 usr]# scp -r hadoop centos3:/usr/
          [root@centos1 usr]# scp -r hadoop centos4:/usr/
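
          The same distribution written as a loop (an equivalent sketch, assuming /usr/ is laid out identically on every node):

          [root@centos1 usr]# for h in centos2 centos3 centos4; do scp -r /usr/hadoop $h:/usr/; done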

 
4. run-hadoop
     Before running Hadoop, temporarily stop the firewall on the physical host and all four virtual machines with service iptables stop; it can also be disabled permanently (see below).
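     Both options on CentOS 6:

     service iptables stop        # stops the firewall for the current session
     chkconfig iptables off       # optional: keeps it disabled across reboots
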
      (1) Format the cluster's distributed filesystem:
[root@centos1 hadoop]# bin/hadoop namenode -format
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

12/07/31 02:13:49 INFO namenode.NameNode: STARTUP_MSG:
12/07/31 02:13:49 WARN common.Util: Path /usr/local/hadoop/hdfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
12/07/31 02:13:49 WARN common.Util: Path /usr/local/hadoop/hdfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
12/07/31 02:13:49 INFO namenode.FSNamesystem: defaultReplication = 3
12/07/31 02:13:49 INFO namenode.FSNamesystem: maxReplication = 512
12/07/31 02:13:49 INFO namenode.FSNamesystem: minReplication = 1
12/07/31 02:13:49 INFO namenode.FSNamesystem: maxReplicationStreams = 2
12/07/31 02:13:49 INFO namenode.FSNamesystem: shouldCheckForEnoughRacks = false
12/07/31 02:13:49 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
12/07/31 02:13:49 INFO namenode.FSNamesystem: fsOwner=root
12/07/31 02:13:49 INFO namenode.FSNamesystem: supergroup=supergroup
12/07/31 02:13:49 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/07/31 02:13:49 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/07/31 02:13:50 INFO common.Storage: Image file of size 110 saved in 0 seconds.
12/07/31 02:13:50 INFO common.Storage: Storage directory /usr/local/hadoop/hdfs/name has been successfully formatted.
12/07/31 02:13:50 INFO namenode.NameNode: SHUTDOWN_MSG:



      (2) Start Hadoop:
[root@centos1 hadoop-0.21.0]# bin/start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-mapred.sh
starting namenode, logging to /usr/hadoop-0.21.0/bin/../logs/hadoop-root-namenode-centos1.out
192.168.77.90: starting datanode, logging to /usr/hadoop-0.21.0/bin/../logs/hadoop-root-datanode-centos2.out
192.168.77.92: starting datanode, logging to /usr/hadoop-0.21.0/bin/../logs/hadoop-root-datanode-centos4.out
192.168.77.91: starting datanode, logging to /usr/hadoop-0.21.0/bin/../logs/hadoop-root-datanode-centos3.out
The authenticity of host '192.168.77.89 (192.168.77.89)' can't be established.
RSA key fingerprint is 2b:e9:15:76:32:35:6b:d5:c4:29:2c:40:6f:5b:30:25.
Are you sure you want to continue connecting (yes/no)? yes
192.168.77.89: Warning: Permanently added '192.168.77.89' (RSA) to the list of known hosts.
192.168.77.89: starting secondarynamenode, logging to /usr/hadoop-0.21.0/bin/../logs/hadoop-root-secondarynamenode-centos1.out
starting jobtracker, logging to /usr/hadoop-0.21.0/bin/../logs/hadoop-root-jobtracker-centos1.out
192.168.77.90: starting tasktracker, logging to /usr/hadoop-0.21.0/bin/../logs/hadoop-root-tasktracker-centos2.out
192.168.77.92: starting tasktracker, logging to /usr/hadoop-0.21.0/bin/../logs/hadoop-root-tasktracker-centos4.out
192.168.77.91: starting tasktracker, logging to /usr/hadoop-0.21.0/bin/../logs/hadoop-root-tasktracker-centos3.out
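
To confirm that all three datanodes actually registered with the namenode, an optional check of my own is the dfsadmin report, which should list three live datanodes:

[root@centos1 hadoop]# bin/hdfs dfsadmin -report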


      (3) Check the Java processes now running with jps:
[root@centos1 hadoop]# jps
1869 NameNode
2058 SecondaryNameNode
2154 JobTracker
2258 Jps

[root@centos2 hadoop]# jps
1806 DataNode
1933 Jps
1892 TaskTracker

centos3 and centos4 show the same jps processes as centos2.


At this point the cluster's overall status can already be viewed from a browser on the physical host:

centos1 Hadoop Map/Reduce Administration:
http://192.168.77.89:50030/jobtracker.jsp
OR: http://centos1:50030/jobtracker.jsp

NameNode 'centos1:9000':
http://192.168.77.89:50070/dfshealth.jsp
OR: http://centos1:50070/dfshealth.jsp


      (4) Test the cluster with the wordcount example
List the files currently in the cluster:
[root@centos1 hadoop]# bin/hadoop fs -ls
12/08/01 10:08:56 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
12/08/01 10:08:56 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id

Prepare the input files:
[root@centos1 hadoop]# mkdir in
[root@centos1 hadoop]# cp conf/*xml  in
[root@centos1 hadoop]# ls in/
capacity-scheduler.xml    fair-scheduler.xml  hdfs-site.xml      mapred-site.xml
core-site.xml        hadoop-policy.xml   mapred-queues.xml


Upload the files to HDFS:
[root@centos1 hadoop]# bin/hadoop fs -put in in
12/08/01 10:10:50 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
12/08/01 10:10:50 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id

[root@centos1 hadoop]# bin/hadoop fs -ls
12/08/01 10:10:57 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
12/08/01 10:10:57 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
 
Found 1 items
drwxr-xr-x   - root supergroup          0 2012-08-01 10:10 /user/root/in


Count the frequency of each word in the files:
[root@centos1 hadoop]# bin/hadoop jar hadoop-mapred-examples-0.21.0.jar wordcount in out
12/08/01 10:11:41 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
12/08/01 10:11:41 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
 
12/08/01 10:11:41 WARN mapreduce.JobSubmitter: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/08/01 10:11:42 INFO input.FileInputFormat: Total input paths to process : 7
12/08/01 10:11:42 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
12/08/01 10:11:42 INFO mapreduce.JobSubmitter: number of splits:7
12/08/01 10:11:42 INFO mapreduce.JobSubmitter: adding the following namenodes' delegation tokens:null
12/08/01 10:11:42 INFO mapreduce.Job: Running job: job_201208010835_0001
12/08/01 10:11:43 INFO mapreduce.Job:  map 0% reduce 0%
12/08/01 10:11:53 INFO mapreduce.Job:  map 57% reduce 0%
12/08/01 10:11:54 INFO mapreduce.Job:  map 85% reduce 0%
12/08/01 10:11:59 INFO mapreduce.Job:  map 100% reduce 0%
12/08/01 10:12:05 INFO mapreduce.Job:  map 100% reduce 100%
12/08/01 10:12:07 INFO mapreduce.Job: Job complete: job_201208010835_0001
12/08/01 10:12:07 INFO mapreduce.Job: Counters: 33
 
    FileInputFormatCounters
        BYTES_READ=12940
    FileSystemCounters
        FILE_BYTES_READ=10491
        FILE_BYTES_WRITTEN=21242
        HDFS_BYTES_READ=13783
        HDFS_BYTES_WRITTEN=5894
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    Job Counters
        Data-local map tasks=7
        Total time spent by all maps waiting after reserving slots (ms)=0
        Total time spent by all reduces waiting after reserving slots (ms)=0
        SLOTS_MILLIS_MAPS=26944
        SLOTS_MILLIS_REDUCES=9322
        Launched map tasks=7
        Launched reduce tasks=1
    Map-Reduce Framework
        Combine input records=1477
        Combine output records=630
        Failed Shuffles=0
        GC time elapsed (ms)=326
        Map input records=332
        Map output bytes=17598
        Map output records=1477
        Merged Map outputs=7
        Reduce input groups=416
        Reduce input records=630
        Reduce output records=416
        Reduce shuffle bytes=10527
        Shuffled Maps =7
        Spilled Records=1260
        SPLIT_RAW_BYTES=843


View the results:
[root@centos1 hadoop]# bin/hadoop fs -ls
12/08/01 10:13:07 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
12/08/01 10:13:07 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
 
Found 2 items
drwxr-xr-x   - root supergroup          0 2012-08-01 10:10 /user/root/in
drwxr-xr-x   - root supergroup          0 2012-08-01 10:12 /user/root/out
[root@centos1 hadoop]# bin/hadoop fs -cat out/*
12/08/01 10:13:16 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
12/08/01 10:13:16 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
 
"*"    10
 
"AS    2
 
"License");    2
 
"alice,bob    10
 
'*',    2
 
':'    1
 
'aclsEnabled'    1

A long stretch of further results follows; omitted here.
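
To read the full result outside HDFS, the output directory can be copied to the local filesystem (an optional step of my own; part-r-00000 is the usual name of the single reducer's output file in this example):

[root@centos1 hadoop]# bin/hadoop fs -get out ./wordcount-out
[root@centos1 hadoop]# more ./wordcount-out/part-r-00000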