Building a Hadoop Cluster with Docker
1. Preparation
1.1 Download the software
Download the following files:
jdk-8u60-linux-x64.tar.gz
hadoop-2.7.0.tar.gz
The Hadoop tarball on the official site is built for 32-bit systems; to run it on a 64-bit system you would normally have to recompile it. Here is a pre-built 64-bit Hadoop 2.7.0 package: http://pan.baidu.com/s/1c0HD0Nu
1.2 Prepare the mounted volume
Create a directory on the host to hold the files you just downloaded; here I created ~/dockerspace/hadoop-docker/. Copy jdk-8u60-linux-x64.tar.gz and hadoop-2.7.0.tar.gz into it. Also create a file named sources.list, which will be used to replace the container's apt sources; here we use the Aliyun mirror. Note that the release codename in the list must match the base image's release (trusty for ubuntu:14.04).
sources.list:
deb http://mirrors.aliyun.com/ubuntu/ trusty main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ trusty-security main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ trusty-updates main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ trusty-proposed main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ trusty-backports main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ trusty main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ trusty-security main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ trusty-updates main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ trusty-proposed main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ trusty-backports main restricted universe multiverse
For reference, the NetEase (163) mirror equivalents:
deb http://mirrors.163.com/ubuntu/ trusty main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ trusty-security main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ trusty-updates main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ trusty-proposed main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ trusty-backports main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ trusty main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ trusty-security main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ trusty-updates main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ trusty-proposed main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ trusty-backports main restricted universe multiverse
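Typing ten nearly identical lines is error-prone, so the list can also be generated. A minimal sketch, assuming the mirror URL and the trusty codename that matches the ubuntu:14.04 base image:

```shell
#!/bin/sh
# Write an Ubuntu sources.list pointing at the Aliyun mirror.
# CODENAME must match the base image's release (trusty for ubuntu:14.04).
CODENAME=trusty
MIRROR=http://mirrors.aliyun.com/ubuntu/
OUT=sources.list

: > "$OUT"   # truncate/create the output file
for pocket in "" -security -updates -proposed -backports; do
    echo "deb $MIRROR ${CODENAME}${pocket} main restricted universe multiverse" >> "$OUT"
done
for pocket in "" -security -updates -proposed -backports; do
    echo "deb-src $MIRROR ${CODENAME}${pocket} main restricted universe multiverse" >> "$OUT"
done
```

Swapping MIRROR for the 163 URL produces the NetEase variant.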
1.3 Pull the base image
Pull an image to serve as the base; the Hadoop image will be built on top of it. It is recommended to switch your Ubuntu system's sources to a domestic mirror before pulling.
docker pull ubuntu:14.04
2.安装jdk
Start a container:
xx@xx-desktop:~/dockerspace/hadoop-docker/config$ docker run -it -v ~/dockerspace/hadoop-docker/:/root/software ubuntu:14.04 # start an Ubuntu container
root@6ccfe3ce6d3b:/# cd /root/
root@6ccfe3ce6d3b:~# ls
software
The command above, docker run -it -v ~/dockerspace/hadoop-docker/:/root/software ubuntu:14.04, mounts the host directory ~/dockerspace/hadoop-docker/ at /root/software inside the container.
Switch the apt sources:
root@6ccfe3ce6d3b:~# cp /etc/apt/sources.list /etc/apt/sources.list.bak # back up the original sources
root@6ccfe3ce6d3b:~# cp /root/software/sources.list /etc/apt/ # switch to the Aliyun mirror
root@6ccfe3ce6d3b:~# apt-get update # refresh the package lists
Install vim:
apt-get install vim
Install the Java environment
Create the directory /root/jdk, extract jdk-8u60-linux-x64.tar.gz, move the result under /root/jdk, and rename it:
root@fc3caf2b3183:~# ls
software
root@fc3caf2b3183:~# cd software/
root@fc3caf2b3183:~/software# ls
authorized_keys hosts sources.list
hadoop-2.7.0.tar.gz jdk-8u60-linux-x64.tar.gz zookeeper-3.4.6.tar.gz
root@fc3caf2b3183:~/software# tar -zxf jdk-8u60-linux-x64.tar.gz # extract the JDK archive
root@fc3caf2b3183:~/software# ls
authorized_keys hosts jdk1.8.0_60 zookeeper-3.4.6.tar.gz
hadoop-2.7.0.tar.gz jdk-8u60-linux-x64.tar.gz sources.list
root@fc3caf2b3183:~/software# cd ..
root@fc3caf2b3183:~# ls
software
root@fc3caf2b3183:~# mkdir jdk
root@fc3caf2b3183:~# mv software/jdk1.8.0_60/ jdk/
root@fc3caf2b3183:~# cd jdk/
root@fc3caf2b3183:~/jdk# ls
jdk1.8.0_60
root@fc3caf2b3183:~/jdk# mv jdk1.8.0_60/ jdk-1.8 # rename
root@fc3caf2b3183:~/jdk# ls
jdk-1.8
root@fc3caf2b3183:~/jdk#
Configure the Java environment variables: run vim /etc/profile and append the following at the end of the file:
export JAVA_HOME=/root/jdk/jdk-1.8
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=$PATH:${JAVA_HOME}/bin
export HADOOP_HOME=/root/hadoop/hadoop-2.7.0
export HADOOP_CONFIG_HOME=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
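If you prefer not to edit the file by hand, the same lines can be appended with a heredoc. A sketch; PROFILE defaults to a local file here so the snippet can be tried anywhere, while inside the container it would be /etc/profile:

```shell
#!/bin/sh
# Append the JDK and Hadoop environment variables to a profile file.
# Inside the container, set PROFILE=/etc/profile.
PROFILE=${PROFILE:-profile.local}

# Quoted 'EOF' keeps ${JAVA_HOME} etc. literal in the written file.
cat >> "$PROFILE" <<'EOF'
export JAVA_HOME=/root/jdk/jdk-1.8
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=$PATH:${JAVA_HOME}/bin
export HADOOP_HOME=/root/hadoop/hadoop-2.7.0
export HADOOP_CONFIG_HOME=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
EOF
```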
Apply the changes and verify that the JDK is installed:
root@fc3caf2b3183:~/jdk/jdk-1.8# source /etc/profile # apply the environment variables
root@fc3caf2b3183:~/jdk/jdk-1.8# java -version # verify the JDK installation
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
root@fc3caf2b3183:~/jdk/jdk-1.8#
3. Configure Hadoop
Next we create the HADOOP_HOME directory referenced in the profile above:
root@fc3caf2b3183:~# ls
jdk software
root@fc3caf2b3183:~# mkdir hadoop # create the hadoop directory
root@fc3caf2b3183:~# cd software/
root@fc3caf2b3183:~/software# ls
authorized_keys hosts sources.list
hadoop-2.7.0.tar.gz jdk-8u60-linux-x64.tar.gz zookeeper-3.4.6.tar.gz
root@fc3caf2b3183:~/software# tar -zxf hadoop-2.7.0.tar.gz # extract the Hadoop tarball
root@fc3caf2b3183:~/software# ls
authorized_keys hadoop-2.7.0.tar.gz jdk-8u60-linux-x64.tar.gz zookeeper-3.4.6.tar.gz
hadoop-2.7.0 hosts sources.list
root@fc3caf2b3183:~/software# mv hadoop-2.7.0 ../hadoop/ # move it into place so it becomes $HADOOP_HOME
root@fc3caf2b3183:~/software# cd ../hadoop/
root@fc3caf2b3183:~/hadoop# ls
hadoop-2.7.0
root@fc3caf2b3183:~/hadoop# cd hadoop-2.7.0/
root@fc3caf2b3183:~/hadoop/hadoop-2.7.0# ls
LICENSE.txt NOTICE.txt README.txt bin etc include lib libexec sbin share
root@fc3caf2b3183:~/hadoop/hadoop-2.7.0#
Create the following directories under /root/hadoop/hadoop-2.7.0:
root@6ccfe3ce6d3b:~/hadoop/hadoop-2.7.0# mkdir namenode
root@6ccfe3ce6d3b:~/hadoop/hadoop-2.7.0# mkdir datanode
root@6ccfe3ce6d3b:~/hadoop/hadoop-2.7.0# mkdir tmp
root@6ccfe3ce6d3b:~/hadoop/hadoop-2.7.0# cd $HADOOP_CONFIG_HOME
root@6ccfe3ce6d3b:~/hadoop/hadoop-2.7.0/etc/hadoop#
Configure hadoop-env.sh: locate the JAVA_HOME setting and change it to the following:
export JAVA_HOME=/root/jdk/jdk-1.8
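This edit can also be done non-interactively with sed. A sketch; ENV_FILE defaults to a local stand-in so the snippet can be tried anywhere, while in the container it would be $HADOOP_CONFIG_HOME/hadoop-env.sh:

```shell
#!/bin/sh
# Point hadoop-env.sh at the JDK without opening an editor.
# Inside the container, set ENV_FILE=$HADOOP_CONFIG_HOME/hadoop-env.sh.
ENV_FILE=${ENV_FILE:-hadoop-env.sh}

# Create a minimal stand-in if the real file is absent (for trying this out only);
# the shipped file contains a line of the form: export JAVA_HOME=${JAVA_HOME}
[ -f "$ENV_FILE" ] || echo 'export JAVA_HOME=${JAVA_HOME}' > "$ENV_FILE"

# Replace whatever JAVA_HOME line is there with the fixed path.
sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/root/jdk/jdk-1.8|' "$ENV_FILE"
```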
Configure core-site.xml (fs.default.name is deprecated in Hadoop 2.x in favor of fs.defaultFS, but the old key still works):
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/root/hadoop/hadoop-2.7.0/tmp/</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
<final>true</final>
<description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
Configure hdfs-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
<final>true</final>
<description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/root/hadoop/hadoop-2.7.0/namenode</value>
<final>true</final>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/root/hadoop/hadoop-2.7.0/datanode</value>
<final>true</final>
</property>
</configuration>
Configure mapred-site.xml. This file does not exist by default, so create it from the template with cp mapred-site.xml.template mapred-site.xml, then edit it:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master:9001</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
</configuration>
Then format the file system with hadoop namenode -format (the script will note that hdfs namenode -format is now the preferred form):
root@6ccfe3ce6d3b:~/hadoop/hadoop-2.7.0/etc/hadoop# hadoop namenode -format
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
16/02/22 08:45:13 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = 6ccfe3ce6d3b/172.17.0.6
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.7.0
....
....
16/02/22 08:45:14 INFO util.GSet: 0.029999999329447746% max memory 889 MB = 273.1 KB
16/02/22 08:45:14 INFO util.GSet: capacity = 2^15 = 32768 entries
16/02/22 08:45:14 INFO namenode.FSImage: Allocated new BlockPoolId: BP-1624198475-172.17.0.6-1456130714548
16/02/22 08:45:14 INFO common.Storage: Storage directory /root/hadoop/hadoop-2.7.0/namenode has been successfully formatted.
16/02/22 08:45:14 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
16/02/22 08:45:14 INFO util.ExitUtil: Exiting with status 0
16/02/22 08:45:14 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at 6ccfe3ce6d3b/172.17.0.6
************************************************************/
4. Configure SSH
Install and configure ssh:
apt-get install ssh # install ssh
root@fc3caf2b3183:~/hadoop/hadoop-2.7.0/etc/hadoop# ssh-keygen -t rsa # generate an RSA key pair
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Created directory '/root/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
cd:68:e6:cc:42:13:68:55:50:33:5c:1e:da:c9:41:f0 root@fc3caf2b3183
The key's randomart image is:
+---[RSA 2048]----+
| o=*+= |
| o .O + |
| o . . E |
| . . + |
| o S o |
| . B |
| . + |
| . |
| |
+-----------------+
root@fc3caf2b3183:~/hadoop/hadoop-2.7.0/etc/hadoop# ssh-keygen -t dsa # generate a DSA key pair
Generating public/private dsa key pair.
Enter file in which to save the key (/root/.ssh/id_dsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_dsa.
Your public key has been saved in /root/.ssh/id_dsa.pub.
The key fingerprint is:
4e:49:2c:31:90:83:34:8a:6e:eb:32:c9:36:f4:b6:02 root@fc3caf2b3183
The key's randomart image is:
+---[DSA 1024]----+
| .o..oo |
|....o + |
|o .. o |
|. o . |
| o S |
|E.. o |
|+o. . |
|== o |
|oo+.. |
+-----------------+
root@fc3caf2b3183:~/hadoop/hadoop-2.7.0/etc/hadoop# cd ~/.ssh/
root@fc3caf2b3183:~/.ssh# ls
id_dsa id_dsa.pub id_rsa id_rsa.pub
root@fc3caf2b3183:~/.ssh# cat id_rsa.pub >> authorized_keys # enable passwordless login
root@fc3caf2b3183:~/.ssh# cat id_dsa.pub >> authorized_keys # enable passwordless login
root@fc3caf2b3183:~/.ssh# /etc/init.d/ssh start # start the ssh service
* Starting OpenBSD Secure Shell server sshd [ OK ]
root@fc3caf2b3183:~/.ssh#
Test it:
root@fc3caf2b3183:~/.ssh# ssh localhost
Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 3.19.0-49-generic x86_64)
* Documentation: https://help.ubuntu.com/
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
root@fc3caf2b3183:~#
Login succeeded.
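The authorized_keys step above boils down to concatenating public keys and setting the permissions sshd expects. A minimal sketch; it runs against placeholder key files so it can be tried outside the container, where SSH_DIR would be /root/.ssh:

```shell
#!/bin/sh
# Collect all public keys in an ssh directory into authorized_keys
# and tighten permissions. Inside the container, set SSH_DIR=/root/.ssh.
SSH_DIR=${SSH_DIR:-./ssh-test}
mkdir -p "$SSH_DIR"

# Fabricate placeholder public keys for trying this out; in the container
# these are the real id_rsa.pub / id_dsa.pub produced by ssh-keygen.
[ -f "$SSH_DIR/id_rsa.pub" ] || echo "ssh-rsa PLACEHOLDER root@test" > "$SSH_DIR/id_rsa.pub"
[ -f "$SSH_DIR/id_dsa.pub" ] || echo "ssh-dss PLACEHOLDER root@test" > "$SSH_DIR/id_dsa.pub"

# Concatenate every public key, then lock down the directory and file.
cat "$SSH_DIR"/*.pub >> "$SSH_DIR/authorized_keys"
chmod 700 "$SSH_DIR"
chmod 600 "$SSH_DIR/authorized_keys"
```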
5. Commit the container as an image
Type exit to leave the container, then turn it into a Docker image.
xx@xx-desktop:~$ docker ps -a # find the container's id
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
fc3caf2b3183 ubuntu:14.04 "/bin/bash" 41 minutes ago Exited (0) 3 seconds ago kickass_davinci
xx@xx-desktop:~$ docker commit -m "hadoop install" fc3c ubuntu:hadoop # commit the container as an image
bfc32f70813f1a6f3ec68dd4b5514ec59c3dbcf1516114a57b5f8b9e933b8ded
xx@xx-desktop:~$ docker images # check the newly created image
REPOSITORY TAG IMAGE ID CREATED VIRTUAL SIZE
ubuntu hadoop bfc32f70813f 15 seconds ago 941.8 MB
6. Start the cluster
Now we can build the actual distributed cluster. Open three terminals and start three containers from the ubuntu:hadoop image we just created: master, slave1, and slave2.
docker run -it -h=master ubuntu:hadoop
docker run -it -h=slave1 ubuntu:hadoop
docker run -it -h=slave2 ubuntu:hadoop
Edit /etc/hosts in each of the three containers, adding the IPs of the other containers:
172.17.0.5 master
172.17.0.6 slave1
172.17.0.7 slave2
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
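Editing /etc/hosts by hand in three containers is easy to get wrong; the additions can be scripted idempotently. A sketch; HOSTS_FILE defaults to a local copy so it can be tried outside a container, and the IPs are the example values above:

```shell
#!/bin/sh
# Append the cluster's name/IP pairs to a hosts file, skipping entries
# that are already present. Inside a container, set HOSTS_FILE=/etc/hosts.
HOSTS_FILE=${HOSTS_FILE:-./hosts}
touch "$HOSTS_FILE"

add_host() {
    # $1 = IP, $2 = hostname; append only if the hostname is not listed yet
    grep -qw "$2" "$HOSTS_FILE" || echo "$1 $2" >> "$HOSTS_FILE"
}

add_host 172.17.0.5 master
add_host 172.17.0.6 slave1
add_host 172.17.0.7 slave2
add_host 172.17.0.5 master   # repeated call is a no-op: already present
```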
Some services may not be running when a container starts, ssh among them, so we may need to start them by hand; the environment variables also need to be reloaded with source /etc/profile:
root@master:/# /etc/init.d/ssh start
* Starting OpenBSD Secure Shell server sshd [ OK ]
root@master:/# source /etc/profile
Repeat this in each container, then verify that the containers can log into each other without a password:
root@master:/# ssh slave1
The authenticity of host 'slave1 (172.17.0.6)' can't be established.
ECDSA key fingerprint is 74:d2:98:c8:dc:f2:ad:4b:48:80:b0:47:dc:37:ae:d5.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'slave1,172.17.0.6' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 3.19.0-49-generic x86_64)
* Documentation: https://help.ubuntu.com/
Last login: Tue Feb 23 03:01:15 2016 from localhost
root@slave1:~#
Next, configure the slaves file on master:
root@master:~/hadoop/hadoop-2.7.0/etc/hadoop# vim slaves
root@master:~/hadoop/hadoop-2.7.0/etc/hadoop# cat slaves
slave1
slave2
root@master:~/hadoop/hadoop-2.7.0/etc/hadoop#
The whole environment is now in place; all that remains is to start Hadoop on master:
root@master:~/hadoop/hadoop-2.7.0/etc/hadoop# start-all.sh
Finally, run jps inside the two slaves; you should see the DataNode and NodeManager processes (plus Jps itself):
root@slave2:~/hadoop/hadoop-2.7.0/etc/hadoop# jps
146 DataNode
254 NodeManager
351 Jps
We can now browse to http://172.17.0.5:50070 from the host, where the IP is that of the master container.
7. Verify the cluster
Finally, let's test the cluster with the grep example that ships with Hadoop:
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/root
hdfs dfs -put etc/hadoop input
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
If it succeeds you will see output like the following:
root@master:~/hadoop/hadoop-2.7.0# hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
16/02/23 03:34:40 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
16/02/23 03:34:40 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
16/02/23 03:34:40 INFO input.FileInputFormat: Total input paths to process : 30
16/02/23 03:34:40 INFO mapreduce.JobSubmitter: number of splits:30
16/02/23 03:34:40 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1427717513_0001
16/02/23 03:34:40 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/02/23 03:34:40 INFO mapreduce.Job: Running job: job_local1427717513_0001
16/02/23 03:34:40 INFO mapred.LocalJobRunner: OutputCommitter set in config null
....
....
....
16/02/23 03:34:45 INFO mapreduce.Job: Counters: 35
File System Counters
FILE: Number of bytes read=1224838
FILE: Number of bytes written=2240055
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=159752
HDFS: Number of bytes written=1271
HDFS: Number of read operations=155
HDFS: Number of large read operations=0
HDFS: Number of write operations=16
Map-Reduce Framework
Map input records=13
Map output records=13
Map output bytes=323
Map output materialized bytes=355
Input split bytes=127
Combine input records=0
Combine output records=0
Reduce input groups=5
Reduce shuffle bytes=355
Reduce input records=13
Reduce output records=13
Spilled Records=26
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=0
Total committed heap usage (bytes)=1062207488
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=513
File Output Format Counters
Bytes Written=245
And the final counts:
root@master:~/hadoop/hadoop-2.7.0# hdfs dfs -get output output
16/02/23 03:40:47 WARN hdfs.DFSClient: DFSInputStream has been closed already
16/02/23 03:40:47 WARN hdfs.DFSClient: DFSInputStream has been closed already
root@master:~/hadoop/hadoop-2.7.0# cat output/*
6 dfs.audit.logger
4 dfs.class
3 dfs.server.namenode.
2 dfs.audit.log.maxbackupindex
2 dfs.period
2 dfs.audit.log.maxfilesize
1 dfsmetrics.log
1 dfsadmin
1 dfs.servers
1 dfs.replication
1 dfs.file
1 dfs.datanode.data.dir
1 dfs.namenode.name.dir
root@master:~/hadoop/hadoop-2.7.0#
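What the example job computes can be sanity-checked locally: it extracts every token matching dfs[a-z.]+ from the input files and counts occurrences, much like a grep/sort/uniq pipeline. A rough local equivalent, run against a throwaway sample file rather than HDFS:

```shell
#!/bin/sh
# Approximate the hadoop-mapreduce-examples "grep" job with plain
# grep/sort/uniq on a small local sample instead of HDFS input.
cat > sample.xml <<'EOF'
<name>dfs.replication</name>
<name>dfs.namenode.name.dir</name>
<name>dfs.datanode.data.dir</name>
<name>dfs.replication</name>
EOF

# -o: print only the matched tokens, -h: suppress filenames;
# then count duplicates and rank them by frequency, as the job does.
grep -ohE 'dfs[a-z.]+' sample.xml | sort | uniq -c | sort -rn
```

On this sample, dfs.replication comes out on top with a count of 2, mirroring the shape of the job's output above.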