For the past two months I have been working on a log-analysis project. I had previously created a single VM with 1.6 TB of block storage attached, and it is no longer enough. It so happens that another OpenStack cluster has roughly 12 TB of spare storage, plus some free CPU and RAM, so I decided to build a Hadoop/HDFS cluster on it. If we need Spark later, it can be installed directly on top of this setup.
Contents
0. References
- https://www.linode.com/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster/
- https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
- https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
1. Environment preparation
1.1 Virtual machines
Five VMs in total: one master node (NameNode and ResourceManager) and four slave nodes (DataNode and NodeManager).
- Flavor: 10 vCPU, 61 GB memory, 213 GB disk
- Boot image: ubuntu-18.04-2018-11-22
- When creating the VMs in Horizon, select the pre-prepared nova key pair for login
- All VMs use the tenant network. Associate a floating IP with the master so it can be reached from outside
The cloud cluster I use is built, managed, and maintained by a dedicated team at the company. The odd-looking flavor is something they defined according to the configuration of the physical hosts, so that it makes the most efficient use of the physical resources. The image is likewise built for this cloud IaaS and differs slightly from the official Ubuntu cloud image; most notably, the default login user of the official image is ubuntu, whereas here it is root. All commands below are run on the VMs as root.
After creation, nova list shows the following:
nova list | grep hadoop
| a6ac60ca-b6f0-4c27-bafc-6282bafd2f8f | hadoop-master-1 | ACTIVE | - | Running | cIMS-RnD-1=192.168.0.37, 10.129.65.45 |
| 03c8fe62-8cde-43a7-9d4c-23ed995e42f4 | hadoop-slave-1 | ACTIVE | - | Running | cIMS-RnD-1=192.168.0.12 |
| 1f191e0c-281d-431c-8823-5f18ebf47457 | hadoop-slave-2 | ACTIVE | - | Running | cIMS-RnD-1=192.168.0.17 |
| 5a07624e-20b6-4a23-ac26-e92c69066eee | hadoop-slave-3 | ACTIVE | - | Running | cIMS-RnD-1=192.168.0.20 |
| 8beb47a7-484e-422c-ae34-024e52a02aef | hadoop-slave-4 | ACTIVE | - | Running | cIMS-RnD-1=192.168.0.15 |
1.2 Logging in without IPs, using a default key pair
Do the following on every VM:
- Upload the ssh key pair prepared in advance. Put it under /root/.ssh/, name the private key id_rsa, and append the public key to /root/.ssh/authorized_keys. The key pair I use here is also configured in GitLab, so it can be used later for pushing code as well.
- Add the following entries to /etc/hosts, so that later on, both on the command line and in the Hadoop configuration files, host names such as slave1 can be used instead of IP addresses (a combined sketch of both steps follows the list below):
192.168.0.37 master1
192.168.0.12 slave1
192.168.0.17 slave2
192.168.0.20 slave3
192.168.0.15 slave4
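A minimal sketch of these two steps, assuming the key pair has been copied to /root on the VM as id_rsa and id_rsa.pub (those file names are my assumption):
# place the private key and register the public key
install -d -m 700 /root/.ssh
install -m 600 /root/id_rsa /root/.ssh/id_rsa
cat /root/id_rsa.pub >> /root/.ssh/authorized_keys
chmod 600 /root/.ssh/authorized_keys
# host-name mappings used throughout the rest of this post
cat >> /etc/hosts <<'EOF'
192.168.0.37 master1
192.168.0.12 slave1
192.168.0.17 slave2
192.168.0.20 slave3
192.168.0.15 slave4
EOF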
1.3 Cinder Volumes
Create 5 Cinder volumes: one of 400 GB for the master and four of 2800 GB for the slaves. The command to create the master's volume:
cinder create --name hadoop-master-1 400
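The four slave volumes can be created the same way; a quick sketch of the loop (assuming the naming convention above):
for i in 1 2 3 4; do cinder create --name hadoop-slave-$i 2800; done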
Here is the information for all the volumes:
[root@eecloud-56-key haxue]# cinder list | grep hadoop
| 06a53038-8241-41ab-9890-79a3b31dc77a | available | hadoop-slave-2 | 2800 | ceph | false | |
| 9b40cd6c-cd3c-49fa-8602-48de956e4df1 | available | hadoop-master-1 | 400 | ceph | false | |
| 9fba113c-c454-4ce5-83aa-86d9edbc6a2e | available | hadoop-slave-4 | 2800 | ceph | false | |
| c6595a51-ddf0-4c31-a4dd-a60c7774794b | available | hadoop-slave-1 | 2800 | ceph | false | |
| cf771466-21e4-40fa-ae80-98d8f55a74d7 | available | hadoop-slave-3 | 2800 | ceph | false | |
Attach each volume to its corresponding VM, for example the master node:
nova volume-attach hadoop-master-1 9b40cd6c-cd3c-49fa-8602-48de956e4df1 auto
+----------+--------------------------------------+
| Property | Value |
+----------+--------------------------------------+
| device | /dev/vdb |
| id | 9b40cd6c-cd3c-49fa-8602-48de956e4df1 |
| serverId | a6ac60ca-b6f0-4c27-bafc-6282bafd2f8f |
| volumeId | 9b40cd6c-cd3c-49fa-8602-48de956e4df1 |
+----------+--------------------------------------+
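The slave volumes are attached the same way. A rough loop that pulls each volume ID out of the cinder list output, assuming the VM and volume names match exactly as shown above:
for i in 1 2 3 4; do
  vol_id=$(cinder list | grep "hadoop-slave-$i " | awk '{print $2}')
  nova volume-attach hadoop-slave-$i $vol_id auto
done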
Log in to each VM and format the volume; here it is formatted as ext4 (be careful not to get the device wrong: according to the volume-attach output above it is /dev/vdb). Then mount it at /mnt/data.
root@hadoop-master-1:~# mkfs.ext4 /dev/vdb
mke2fs 1.44.1 (24-Mar-2018)
Creating filesystem with 104857600 4k blocks and 26214400 inodes
Filesystem UUID: a15ae607-581d-4c69-8f17-ade75e8782d1
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000
Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information:
done
root@hadoop-master-1:~# mkdir /mnt/data
root@hadoop-master-1:~# mount /dev/vdb /mnt/data
root@hadoop-master-1:~# ls /mnt/data
lost+found
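Note that this mount does not survive a reboot. A hedged extra step, on each VM, to make it persistent via /etc/fstab (replace the placeholder with the UUID reported by blkid):
blkid /dev/vdb    # prints the filesystem UUID
echo "UUID=<uuid-from-blkid> /mnt/data ext4 defaults,nofail 0 2" >> /etc/fstab
mount -a          # re-reads fstab; errors here mean the entry needs fixing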
1.4 hadoop user setup and Java installation
The remaining preparation can be done in batch, including:
- Create the hadoop user
- Set up its ssh key
- Change the owner of /mnt/data
- Install Java 8
The script /root/script.sh is shown below; make it executable.
#!/bin/bash
# note: useradd -p expects an already-hashed password, so "hadoop" here is not a
# usable plain-text password; logins rely on the ssh keys copied below
useradd -m -p hadoop -s /bin/bash hadoop
mkdir /home/hadoop/.ssh
cp /root/.ssh/id_rsa /root/.ssh/authorized_keys /home/hadoop/.ssh/
chown -R hadoop:hadoop /home/hadoop/.ssh
chown -R hadoop:hadoop /mnt/data
apt update
apt install -y openjdk-8-jdk openjdk-8-jre   # -y keeps the batch run non-interactive
echo JAVA_HOME=\"/usr/lib/jvm/java-8-openjdk-amd64\" >> /etc/environment
Run the script on all nodes in batch:
# copy the script to every VM
for host in slave1 slave2 slave3 slave4; do scp script.sh root@$host:~/; done
# run it
for host in master1 slave1 slave2 slave3 slave4; do ssh root@$host /root/script.sh; done
2. Hadoop configuration and operation
2.1 Download, extract, and set environment variables
Switch to the hadoop user and prepare the following script as /home/hadoop/script.sh, making it executable:
#!/bin/bash
wget http://us.mirrors.quenda.co/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
tar -zxf hadoop-3.2.1.tar.gz
mv hadoop-3.2.1 hadoop
echo "export HADOOP_HOME=/home/hadoop/hadoop" >> .bashrc
echo "export PATH=${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin" >> .bashrc
Run it on all nodes:
# copy the script to every VM
for host in slave1 slave2 slave3 slave4; do scp script.sh hadoop@$host:~/; done
# run it
for host in master1 slave1 slave2 slave3 slave4; do ssh hadoop@$host /home/hadoop/script.sh; done
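An optional sanity check that every node ended up with the same Java and Hadoop versions:
for host in master1 slave1 slave2 slave3 slave4; do
  echo "== $host =="
  ssh hadoop@$host 'java -version 2>&1 | head -1; ~/hadoop/bin/hadoop version | head -1'
done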
2.2 Configuration files
Five files are modified in total; copy them to the same directory on every VM, overwriting the originals (a small distribution loop is sketched after the files below). They are listed here first; more explanation may follow later.
(For now the configuration simply follows the official documentation and other references found online, so parts of it may well be suboptimal; suggestions and corrections are welcome.)
/home/hadoop/hadoop/etc/hadoop/workers
slave1
slave2
slave3
slave4
/home/hadoop/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master1:9000</value>
</property>
</configuration>
/home/hadoop/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/mnt/data/nameNode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/mnt/data/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
/home/hadoop/hadoop/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master1</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>20480</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>4096</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>256</value>
</property>
</configuration>
/home/hadoop/hadoop/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>1536</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>1536</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1024M</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>3072</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx2560M</value>
</property>
</configuration>
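Having edited the five files on master1, one way to push them to the other nodes (run as the hadoop user on master1; this loop is just a sketch) is:
for host in slave1 slave2 slave3 slave4; do
  scp ~/hadoop/etc/hadoop/workers \
      ~/hadoop/etc/hadoop/core-site.xml \
      ~/hadoop/etc/hadoop/hdfs-site.xml \
      ~/hadoop/etc/hadoop/yarn-site.xml \
      ~/hadoop/etc/hadoop/mapred-site.xml \
      hadoop@$host:~/hadoop/etc/hadoop/
done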
2.3 Start and test
First format HDFS:
~/hadoop/bin/hdfs namenode -format
Then everything can be started:
~/hadoop/sbin/start-all.sh
Running the jps command on the master node shows the NameNode and ResourceManager (plus the SecondaryNameNode):
hadoop@hadoop-master-1:~$ jps
506 Jps
30972 SecondaryNameNode
30716 NameNode
31231 ResourceManager
On the slaves you can see the DataNode and NodeManager:
hadoop@hadoop-slave-2:~$ jps
27072 DataNode
27219 NodeManager
16549 Jps
The HDFS web UI can be opened at <master-ip>:9870, and the Hadoop (YARN) job-management UI at <master-ip>:8088.
For a simple test, create a directory and upload a file from the command line:
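As a quick reachability check from the master itself (the security group may still need to allow these ports before they can be reached through the floating IP), something like:
curl -s -o /dev/null -w "%{http_code}\n" http://master1:9870/   # HDFS NameNode UI, expect 200
curl -s -o /dev/null -w "%{http_code}\n" http://master1:8088/   # YARN ResourceManager UI, expect 200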
hadoop@hadoop-master-1:~$ ~/hadoop/bin/hdfs dfs -mkdir test_dir
hadoop@hadoop-master-1:~$ echo A test file. > test.txt
hadoop@hadoop-master-1:~$ ~/hadoop/bin/hdfs dfs -put test.txt test_dir
2019-12-30 09:52:57,146 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
hadoop@hadoop-master-1:~$ ~/hadoop/bin/hdfs dfs -ls test_dir
Found 1 items
-rw-r--r-- 1 hadoop supergroup 13 2019-12-30 09:52 test_dir/test.txt
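To exercise YARN and MapReduce as well, the examples jar shipped in the tarball can be used; a hedged example (the jar name below matches the 3.2.1 layout, adjust if needed):
~/hadoop/bin/yarn jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar pi 4 100
The job should show up on the <master-ip>:8088 page and finish with an estimate of Pi.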
3. Additional notes
3.1 Two master nodes
Only one master node has been created so far. I did try creating a second one, master2, to run the ResourceManager there, but after running start-all.sh or start-yarn.sh the ResourceManager never came up: the log complained that it could not bind the address because the port was in use, while running start-yarn.sh directly on master2 worked fine. Doing it this way would also require a floating IP for master2, or port forwarding. Since there is no actual need for master2 at the moment, both the ResourceManager and the SecondaryNameNode stay on master1 and master2 is not used.
3.2 A pitfall when installing pydoop
The project uses the Python library pydoop to work with Hadoop from Python. Installing pydoop failed with:
ValueError: hadoop home not found, try setting HADOOP_HOME
The HADOOP_HOME environment variable itself was actually fine.
A solution turned up on Stack Overflow: https://stackoverflow.com/questions/29645985/python-2-7-6-pydoop-installation-fail-on-ubuntu
sudo sh -c "export PATH=/home/hadoop/hadoop/bin:/home/hadoop/hadoop/sbin:$PATH; export HADOOP_HOME=/home/hadoop/hadoop; export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64; pip3 install pydoop"
The likely cause is that sudo does not pass the current environment variables through to the command it runs.
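An alternative I have not tried on this cluster, avoiding sudo altogether, is a user-level install for the hadoop user (the environment variables are the same ones as above):
export HADOOP_HOME=/home/hadoop/hadoop
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
pip3 install --user pydoop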