Building a Hadoop Cluster with Docker
1. Preparation
1.1 Download the software
Download the following files:
jdk-8u60-linux-x64.tar.gz
hadoop-2.7.0.tar.gz
The Hadoop tarball on the official site is built for 32-bit systems; to run it on a 64-bit system you would normally have to recompile it. Here is a pre-built 64-bit Hadoop 2.7.0 package: http://pan.baidu.com/s/1c0HD0Nu
1.2 Prepare the mounted volume
Create a directory on the host to hold the files you just downloaded; here I created ~/dockerspace/hadoop-docker/. Copy jdk-8u60-linux-x64.tar.gz and hadoop-2.7.0.tar.gz into it. Also create a file named sources.list, which will be used to replace the container's apt sources; here we use the Aliyun mirror. Note that the release codename in the list must match the base image's release (trusty for ubuntu:14.04).
sources.list:
deb http://mirrors.aliyun.com/ubuntu/ trusty main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ trusty-security main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ trusty-updates main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ trusty-proposed main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ trusty-backports main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ trusty main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ trusty-security main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ trusty-updates main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ trusty-proposed main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ trusty-backports main restricted universe multiverse
For reference, the NetEase (163) mirror equivalents:
deb http://mirrors.163.com/ubuntu/ trusty main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ trusty-security main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ trusty-updates main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ trusty-proposed main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ trusty-backports main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ trusty main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ trusty-security main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ trusty-updates main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ trusty-proposed main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ trusty-backports main restricted universe multiverse
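Typing ten nearly identical lines is error-prone, so the list can also be generated. A minimal sketch, assuming the mirror URL and the trusty codename that matches the ubuntu:14.04 base image:

```shell
#!/bin/sh
# Write an Ubuntu sources.list pointing at the Aliyun mirror.
# CODENAME must match the base image's release (trusty for ubuntu:14.04).
CODENAME=trusty
MIRROR=http://mirrors.aliyun.com/ubuntu/
OUT=sources.list

: > "$OUT"   # truncate/create the output file
for pocket in "" -security -updates -proposed -backports; do
    echo "deb $MIRROR ${CODENAME}${pocket} main restricted universe multiverse" >> "$OUT"
done
for pocket in "" -security -updates -proposed -backports; do
    echo "deb-src $MIRROR ${CODENAME}${pocket} main restricted universe multiverse" >> "$OUT"
done
```

Swapping MIRROR for the 163 URL produces the NetEase variant.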
1.3 Pull the base image
Pull an image to serve as the base; the Hadoop image will be built on top of it. It is recommended to switch your Ubuntu system's sources to a domestic mirror before pulling.
docker pull ubuntu:14.04
2.安装jdk
Start a container:
xx@xx-desktop:~/dockerspace/hadoop-docker/config$ docker run -it -v ~/dockerspace/hadoop-docker/:/root/software ubuntu:14.04 # start an Ubuntu container
root@6ccfe3ce6d3b:/# cd /root/
root@6ccfe3ce6d3b:~# ls
software
The command above, docker run -it -v ~/dockerspace/hadoop-docker/:/root/software ubuntu:14.04, mounts the host directory ~/dockerspace/hadoop-docker/ at /root/software inside the container.
Switch the apt sources:
root@6ccfe3ce6d3b:~# cp /etc/apt/sources.list /etc/apt/sources.list.bak # back up the original sources
root@6ccfe3ce6d3b:~# cp /root/software/sources.list /etc/apt/ # switch to the Aliyun mirror
root@6ccfe3ce6d3b:~# apt-get update # refresh the package lists
Install vim:
apt-get install vim
Install the Java environment
Create the directory /root/jdk, extract jdk-8u60-linux-x64.tar.gz, move the result under /root/jdk, and rename it:
root@fc3caf2b3183:~# ls
software
root@fc3caf2b3183:~# cd software/
root@fc3caf2b3183:~/software# ls
authorized_keys hosts sources.list
hadoop-2.7.0.tar.gz jdk-8u60-linux-x64.tar.gz zookeeper-3.4.6.tar.gz
root@fc3caf2b3183:~/software# tar -zxf jdk-8u60-linux-x64.tar.gz # extract the JDK archive
root@fc3caf2b3183:~/software# ls
authorized_keys hosts jdk1.8.0_60 zookeeper-3.4.6.tar.gz
hadoop-2.7.0.tar.gz jdk-8u60-linux-x64.tar.gz sources.list
root@fc3caf2b3183:~/software# cd ..
root@fc3caf2b3183:~# ls
software
root@fc3caf2b3183:~# mkdir jdk
root@fc3caf2b3183:~# mv software/jdk1.8.0_60/ jdk/
root@fc3caf2b3183:~# cd jdk/
root@fc3caf2b3183:~/jdk# ls
jdk1.8.0_60
root@fc3caf2b3183:~/jdk# mv jdk1.8.0_60/ jdk-1.8 # rename
root@fc3caf2b3183:~/jdk# ls
jdk-1.8
root@fc3caf2b3183:~/jdk#
Configure the Java environment variables: run vim /etc/profile and append the following at the end of the file:
export JAVA_HOME=/root/jdk/jdk-1.8
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=$PATH:${JAVA_HOME}/bin
export HADOOP_HOME=/root/hadoop/hadoop-2.7.0
export HADOOP_CONFIG_HOME=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
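If you prefer not to edit the file by hand, the same lines can be appended with a heredoc. A sketch; PROFILE defaults to a local file here so the snippet can be tried anywhere, while inside the container it would be /etc/profile:

```shell
#!/bin/sh
# Append the JDK and Hadoop environment variables to a profile file.
# Inside the container, set PROFILE=/etc/profile.
PROFILE=${PROFILE:-profile.local}

# Quoted 'EOF' keeps ${JAVA_HOME} etc. literal in the written file.
cat >> "$PROFILE" <<'EOF'
export JAVA_HOME=/root/jdk/jdk-1.8
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=$PATH:${JAVA_HOME}/bin
export HADOOP_HOME=/root/hadoop/hadoop-2.7.0
export HADOOP_CONFIG_HOME=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
EOF
```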
Apply the changes and verify that the JDK is installed:
root@fc3caf2b3183:~/jdk/jdk-1.8# source /etc/profile # apply the environment variables
root@fc3caf2b3183:~/jdk/jdk-1.8# java -version # verify the JDK installation
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
root@fc3caf2b3183:~/jdk/jdk-1.8#
3. Configure Hadoop
Next we create the HADOOP_HOME directory referenced in the profile above:
root@fc3caf2b3183:~# ls
jdk software
root@fc3caf2b3183:~# mkdir hadoop # create the hadoop directory
root@fc3caf2b3183:~# cd software/
root@fc3caf2b3183:~/software# ls
authorized_keys hosts sources.list
hadoop-2.7.0.tar.gz jdk-8u60-linux-x64.tar.gz zookeeper-3.4.6.tar.gz
root@fc3caf2b3183:~/software# tar -zxf hadoop-2.7.0.tar.gz # extract the Hadoop tarball
root@fc3caf2b3183:~/software# ls
authorized_keys hadoop-2.7.0.tar.gz jdk-8u60-linux-x64.tar.gz zookeeper-3.4.6.tar.gz
hadoop-2.7.0 hosts sources.list
root@fc3caf2b3183:~/software# mv hadoop-2.7.0 ../hadoop/ # move it into place so it becomes $HADOOP_HOME
root@fc3caf2b3183:~/software# cd ../hadoop/
root@fc3caf2b3183:~/hadoop# ls
hadoop-2.7.0
root@fc3caf2b3183:~/hadoop# cd hadoop-2.7.0/
root@fc3caf2b3183:~/hadoop/hadoop-2.7.0# ls
LICENSE.txt NOTICE.txt README.txt bin etc include lib libexec sbin share
root@fc3caf2b3183:~/hadoop/hadoop-2.7.0#
Create the following directories under /root/hadoop/hadoop-2.7.0:
root@6ccfe3ce6d3b:~/hadoop/hadoop-2.7.0# mkdir namenode
root@6ccfe3ce6d3b:~/hadoop/hadoop-2.7.0# mkdir datanode
root@6ccfe3ce6d3b:~/hadoop/hadoop-2.7.0# mkdir tmp
root@6ccfe3ce6d3b:~/hadoop/hadoop-2.7.0# cd $HADOOP_CONFIG_HOME
root@6ccfe3ce6d3b:~/hadoop/hadoop-2.7.0/etc/hadoop#
Configure hadoop-env.sh: locate the JAVA_HOME setting and change it to the following:
export JAVA_HOME=/root/jdk/jdk-1.8
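This edit can also be done non-interactively with sed. A sketch; ENV_FILE defaults to a local stand-in so the snippet can be tried anywhere, while in the container it would be $HADOOP_CONFIG_HOME/hadoop-env.sh:

```shell
#!/bin/sh
# Point hadoop-env.sh at the JDK without opening an editor.
# Inside the container, set ENV_FILE=$HADOOP_CONFIG_HOME/hadoop-env.sh.
ENV_FILE=${ENV_FILE:-hadoop-env.sh}

# Create a minimal stand-in if the real file is absent (for trying this out only);
# the shipped file contains a line of the form: export JAVA_HOME=${JAVA_HOME}
[ -f "$ENV_FILE" ] || echo 'export JAVA_HOME=${JAVA_HOME}' > "$ENV_FILE"

# Replace whatever JAVA_HOME line is there with the fixed path.
sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/root/jdk/jdk-1.8|' "$ENV_FILE"
```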
Configure core-site.xml (fs.default.name is deprecated in Hadoop 2.x in favor of fs.defaultFS, but the old key still works):
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/root/hadoop/hadoop-2.7.0/tmp/</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
<final>true</final>
<description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
Configure hdfs-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
<final>true</final>
<description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/root/hadoop/hadoop-2.7.0/namenode</value>
<final>true</final>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/root/hadoop/hadoop-2.7.0/datanode</value>
<final>true</final>
</property>
</configuration>
Configure mapred-site.xml. This file does not exist by default, so create it from the template with cp mapred-site.xml.template mapred-site.xml, then edit it:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master:9001</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
</configuration>
Then format the file system with hadoop namenode -format (the script will note that hdfs namenode -format is now the preferred form):
root@6ccfe3ce6d3b:~/hadoop/hadoop-2.7.0/etc/hadoop# hadoop namenode -format
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
16/02/22 08:45:13 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = 6ccfe3ce6d3b/172.17.0.6
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.7.0
....
....
16/02/22 08:45:14 INFO util.GSet: 0.029999999329447746% max memory 889 MB = 273.1 KB
16/02/22 08:45:14 INFO util.GSet: capacity = 2^15 = 32768 entries
16/02/22 08:45:14 INFO namenode.FSImage: Allocated new BlockPoolId: BP-1624198475-172.17.0.6-1456130714548
16/02/22 08:45:14 INFO common.Storage: Storage directory /root/hadoop/hadoop-2.7.0/namenode has been successfully formatted.
16/02/22 08:45:14 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
16/02/22 08:45:14 INFO util.ExitUtil: Exiting with status 0
16/02/22 08:45:14 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at 6ccfe3ce6d3b/172.17.0.6
************************************************************/
4. Configure SSH
Install and configure ssh:
apt-get install ssh # install ssh
root@fc3caf2b3183:~/hadoop/hadoop-2.7.0/etc/hadoop# ssh-keygen -t rsa # generate an RSA key pair
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Created directory '/root/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
cd:68:e6:cc:42:13:68:55:50:33:5c:1e:da:c9:41:f0 root@fc3caf2b3183
The key's randomart image is:
+---[RSA 2048]----+
| o=*+= |
| o .O + |
| o . . E |
| . . + |
| o S o |
| . B |
| . + |
| . |
| |
+-----------------+
root@fc3caf2b3183:~/hadoop/hadoop-2.7.0/etc/hadoop# ssh-keygen -t dsa # generate a DSA key pair
Generating public/private dsa key pair.
Enter file in which to save the key (/root/.ssh/id_dsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_dsa.
Your public key has been saved in /root/.ssh/id_dsa.pub.
The key fingerprint is:
4e:49:2c:31:90:83:34:8a:6e:eb:32:c9:36:f4:b6:02 root@fc3caf2b3183
The key's randomart image is:
+---[DSA 1024]----+
| .o..oo |
|....o + |
|o .. o |
|. o . |
| o S |
|E.. o |
|+o. . |
|== o |
|oo+.. |
+-----------------+
root@fc3caf2b3183:~/hadoop/hadoop-2.7.0/etc/hadoop# cd ~/.ssh/
root@fc3caf2b3183:~/.ssh# ls
id_dsa id_dsa.pub id_rsa id_rsa.pub
root@fc3caf2b3183:~/.ssh# cat id_rsa.pub >> authorized_keys # enable passwordless login
root@fc3caf2b3183:~/.ssh# cat id_dsa.pub >> authorized_keys # enable passwordless login
root@fc3caf2b3183:~/.ssh# /etc/init.d/ssh start # start the ssh service
* Starting OpenBSD Secure Shell server sshd [ OK ]
root@fc3caf2b3183:~/.ssh#
Test it:
root@fc3caf2b3183:~/.ssh# ssh localhost
Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 3.19.0-49-generic x86_64)
* Documentation: https://help.ubuntu.com/
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
root@fc3caf2b3183:~#
Login succeeded.
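The authorized_keys step above boils down to concatenating public keys and setting the permissions sshd expects. A minimal sketch; it runs against placeholder key files so it can be tried outside the container, where SSH_DIR would be /root/.ssh:

```shell
#!/bin/sh
# Collect all public keys in an ssh directory into authorized_keys
# and tighten permissions. Inside the container, set SSH_DIR=/root/.ssh.
SSH_DIR=${SSH_DIR:-./ssh-test}
mkdir -p "$SSH_DIR"

# Fabricate placeholder public keys for trying this out; in the container
# these are the real id_rsa.pub / id_dsa.pub produced by ssh-keygen.
[ -f "$SSH_DIR/id_rsa.pub" ] || echo "ssh-rsa PLACEHOLDER root@test" > "$SSH_DIR/id_rsa.pub"
[ -f "$SSH_DIR/id_dsa.pub" ] || echo "ssh-dss PLACEHOLDER root@test" > "$SSH_DIR/id_dsa.pub"

# Concatenate every public key, then lock down the directory and file.
cat "$SSH_DIR"/*.pub >> "$SSH_DIR/authorized_keys"
chmod 700 "$SSH_DIR"
chmod 600 "$SSH_DIR/authorized_keys"
```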
5. Commit the container as an image
Type exit to leave the container, then turn it into a Docker image.
xx@xx-desktop:~$ docker ps -a # find the container's id
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
fc3caf2b3183 ubuntu:14.04 "/bin/bash" 41 minutes ago Exited (0) 3 seconds ago kickass_davinci
xx@xx-desktop:~$ docker commit -m "hadoop install" fc3c ubuntu:hadoop # commit the container as an image
bfc32f70813f1a6f3ec68dd4b5514ec59c3dbcf1516114a57b5f8b9e933b8ded
xx@xx-desktop:~$ docker images # check the newly created image
REPOSITORY TAG IMAGE ID CREATED VIRTUAL SIZE
ubuntu hadoop bfc32f70813f 15 seconds ago 941.8 MB
6. Start the cluster
Now we can build the actual distributed cluster. Open three terminals and start three containers from the ubuntu:hadoop image we just created: master, slave1, and slave2.
docker run -it -h=master ubuntu:hadoop
docker run -it -h=slave1 ubuntu:hadoop
docker run -it -h=slave2 ubuntu:hadoop
Edit /etc/hosts in each of the three containers, adding the IPs of the other containers:
172.17.0.5 master
172.17.0.6 slave1
172.17.0.7 slave2
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
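Editing /etc/hosts by hand in three containers is easy to get wrong; the additions can be scripted idempotently. A sketch; HOSTS_FILE defaults to a local copy so it can be tried outside a container, and the IPs are the example values above:

```shell
#!/bin/sh
# Append the cluster's name/IP pairs to a hosts file, skipping entries
# that are already present. Inside a container, set HOSTS_FILE=/etc/hosts.
HOSTS_FILE=${HOSTS_FILE:-./hosts}
touch "$HOSTS_FILE"

add_host() {
    # $1 = IP, $2 = hostname; append only if the hostname is not listed yet
    grep -qw "$2" "$HOSTS_FILE" || echo "$1 $2" >> "$HOSTS_FILE"
}

add_host 172.17.0.5 master
add_host 172.17.0.6 slave1
add_host 172.17.0.7 slave2
add_host 172.17.0.5 master   # repeated call is a no-op: already present
```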
Some services may not be running when a container starts, ssh among them, so we may need to start them by hand; the environment variables also need to be reloaded with source /etc/profile:
root@master:/# /etc/init.d/ssh start
* Starting OpenBSD Secure Shell server sshd [ OK ]
root@master:/# source /etc/profile
Repeat this in each container, then verify that the containers can log into each other without a password:
root@master:/# ssh slave1
The authenticity of host 'slave1 (172.17.0.6)' can't be established.
ECDSA key fingerprint is 74:d2:98:c8:dc:f2:ad:4b:48:80:b0:47:dc:37:ae:d5.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'slave1,172.17.0.6' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 3.19.0-49-generic x86_64)
* Documentation: https://help.ubuntu.com/
Last login: Tue Feb 23 03:01:15 2016 from localhost
root@slave1:~#
Next, configure the slaves file on master:
root@master:~/hadoop/hadoop-2.7.0/etc/hadoop# vim slaves
root@master:~/hadoop/hadoop-2.7.0/etc/hadoop# cat slaves
slave1
slave2
root@master:~/hadoop/hadoop-2.7.0/etc/hadoop#
The whole environment is now in place; all that remains is to start Hadoop on master:
root@master:~/hadoop/hadoop-2.7.0/etc/hadoop# start-all.sh
Finally, run jps inside the two slaves; you should see the DataNode and NodeManager processes (plus Jps itself):
root@slave2:~/hadoop/hadoop-2.7.0/etc/hadoop# jps
146 DataNode
254 NodeManager
351 Jps
We can now browse to http://172.17.0.5:50070 from the host, where the IP is that of the master container.
7. Verify the cluster
Finally, let's test the cluster with the grep example that ships with Hadoop:
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/root
hdfs dfs -put etc/hadoop input
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
If it succeeds you will see output like the following:
root@master:~/hadoop/hadoop-2.7.0# hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
16/02/23 03:34:40 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
16/02/23 03:34:40 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
16/02/23 03:34:40 INFO input.FileInputFormat: Total input paths to process : 30
16/02/23 03:34:40 INFO mapreduce.JobSubmitter: number of splits:30
16/02/23 03:34:40 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1427717513_0001
16/02/23 03:34:40 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/02/23 03:34:40 INFO mapreduce.Job: Running job: job_local1427717513_0001
16/02/23 03:34:40 INFO mapred.LocalJobRunner: OutputCommitter set in config null
....
....
....
16/02/23 03:34:45 INFO mapreduce.Job: Counters: 35
File System Counters
FILE: Number of bytes read=1224838
FILE: Number of bytes written=2240055
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=159752
HDFS: Number of bytes written=1271
HDFS: Number of read operations=155
HDFS: Number of large read operations=0
HDFS: Number of write operations=16
Map-Reduce Framework
Map input records=13
Map output records=13
Map output bytes=323
Map output materialized bytes=355
Input split bytes=127
Combine input records=0
Combine output records=0
Reduce input groups=5
Reduce shuffle bytes=355
Reduce input records=13
Reduce output records=13
Spilled Records=26
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=0
Total committed heap usage (bytes)=1062207488
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=513
File Output Format Counters
Bytes Written=245
And the final counts:
root@master:~/hadoop/hadoop-2.7.0# hdfs dfs -get output output
16/02/23 03:40:47 WARN hdfs.DFSClient: DFSInputStream has been closed already
16/02/23 03:40:47 WARN hdfs.DFSClient: DFSInputStream has been closed already
root@master:~/hadoop/hadoop-2.7.0# cat output/*
6 dfs.audit.logger
4 dfs.class
3 dfs.server.namenode.
2 dfs.audit.log.maxbackupindex
2 dfs.period
2 dfs.audit.log.maxfilesize
1 dfsmetrics.log
1 dfsadmin
1 dfs.servers
1 dfs.replication
1 dfs.file
1 dfs.datanode.data.dir
1 dfs.namenode.name.dir
root@master:~/hadoop/hadoop-2.7.0#
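What the example job computes can be sanity-checked locally: it extracts every token matching dfs[a-z.]+ from the input files and counts occurrences, much like a grep/sort/uniq pipeline. A rough local equivalent, run against a throwaway sample file rather than HDFS:

```shell
#!/bin/sh
# Approximate the hadoop-mapreduce-examples "grep" job with plain
# grep/sort/uniq on a small local sample instead of HDFS input.
cat > sample.xml <<'EOF'
<name>dfs.replication</name>
<name>dfs.namenode.name.dir</name>
<name>dfs.datanode.data.dir</name>
<name>dfs.replication</name>
EOF

# -o: print only the matched tokens, -h: suppress filenames;
# then count duplicates and rank them by frequency, as the job does.
grep -ohE 'dfs[a-z.]+' sample.xml | sort | uniq -c | sort -rn
```

On this sample, dfs.replication comes out on top with a count of 2, mirroring the shape of the job's output above.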