Table of Contents
- Hadoop Installation on VMs (VMware)
- Setup Three VMs with CentOS 7 (MacBook M1, ARM architecture)
- Config Static IP Address
- SSH Without Password
- Install Java 8 and Hadoop 3.4.0
- Config HDFS, YARN, MapReduce
- Start Hadoop (HDFS, YARN)
- Run the MapReduce Example Jar
- Stop Hadoop
Hadoop Installation on VMs (VMware)
Setup Three VMs with CentOS 7 (MacBook M1, ARM architecture)
- VM1: hadoop1: 4G RAM + 20G Disk
- VM2: hadoop2: 2G RAM + 20G Disk
- VM3: hadoop3: 2G RAM + 20G Disk
Take host “hadoop2” as the VM setup example:
- Select the ISO image (the image file on the host machine)
- Select the OS type: Debian 10
- Select "Install CentOS 7"
- Select the startup disk
- Select GNOME GUI (desktop install)
- Select the timezone
- Enable the network and set the hostname
  - Note down the network interface: ens160
  - IP address: 192.168.57.135
  - Default route (gateway): 192.168.57.2
- Create user hadoop
- Begin the installation
- During the installation
- When the installation finishes, click reboot
- Accept the license
- Complete the CentOS installation
- Log in to the GUI as user hadoop
- Enable automatic date & time updates (keeps clocks in sync between nodes)
Config Static IP Address
- Use FinalShell to SSH into the three VMs
- Log in as user hadoop
- Edit the network config file on all 3 machines:
- hadoop1 192.168.57.134
- hadoop2 192.168.57.135
- hadoop3 192.168.57.136
# Edit ifcfg-{network interface}
sudo vim /etc/sysconfig/network-scripts/ifcfg-ens160
...
# change BOOTPROTO from dhcp to static
BOOTPROTO=static
...
# append the following
IPADDR=192.168.57.134
GATEWAY=192.168.57.2
NETMASK=255.255.255.0
DNS1=192.168.57.2
DNS2=114.114.114.114
PREFIX=24
- Restart the network:
sudo systemctl restart network
- Check the new static IP:
ifconfig
- Try to ping google.com to verify connectivity
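Because the netmask is 255.255.255.0 (/24), all three static addresses must share the gateway's first three octets. A quick local sanity check can be sketched in shell (the `net_of` helper is just for this demo):

```shell
# Check that each static address sits in the gateway's /24 network
# (valid only because NETMASK=255.255.255.0 above).
net_of() { echo "${1%.*}"; }   # first three octets of a dotted quad
GATEWAY=192.168.57.2
for ip in 192.168.57.134 192.168.57.135 192.168.57.136; do
  if [ "$(net_of "$ip")" = "$(net_of "$GATEWAY")" ]; then
    echo "$ip ok"
  else
    echo "$ip is outside the gateway's /24"
  fi
done
```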
SSH Without Password
1. Declare the host-to-IP mappings on all 3 hosts:
sudo vim /etc/hosts
# append
192.168.57.134 hadoop1
192.168.57.135 hadoop2
192.168.57.136 hadoop3
2. Generate a key pair as user “hadoop” on all 3 hosts:
su hadoop
ssh-keygen -t rsa
3. Distribute the public key to all 3 hosts (run on each machine):
ssh-copy-id hadoop@hadoop1
ssh-copy-id hadoop@hadoop2
ssh-copy-id hadoop@hadoop3
# check the public keys this host has collected from the others
cat ~/.ssh/authorized_keys
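After every node has run `ssh-copy-id`, each host's `authorized_keys` should hold one entry per node, i.e. three lines. A quick count, demonstrated on a sample file (on the real nodes, point `AUTH` at `~/.ssh/authorized_keys` instead):

```shell
# Each host should end up with one public key per node (3 in total).
AUTH=$(mktemp)
cat > "$AUTH" <<'EOF'
ssh-rsa AAAA...key1 hadoop@hadoop1
ssh-rsa AAAA...key2 hadoop@hadoop2
ssh-rsa AAAA...key3 hadoop@hadoop3
EOF
grep -c '^ssh-rsa' "$AUTH"   # expect 3
```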
4. Test passwordless SSH to the other hosts:
# from hadoop1
ssh hadoop@hadoop2
# from hadoop3
ssh hadoop@hadoop1
5. Disable the firewall on all 3 hosts (important):
sudo systemctl stop firewalld
sudo systemctl disable firewalld
Install Java 8 and Hadoop 3.4.0
1. Download the packages:
- jdk (ARM64 Compressed Archive): https://www.oracle.com/java/technologies/downloads/#java8
- hadoop 3.4.0: https://dlcdn.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0-aarch64.tar.gz
2. Create /opt/software and /opt/modules on all 3 hosts:
sudo mkdir /opt/modules
sudo mkdir /opt/software
sudo chown hadoop:hadoop /opt/modules
sudo chown hadoop:hadoop /opt/software
3. Upload both packages to /opt/software on hadoop1 via the FinalShell GUI
4. Extract them to /opt/modules on hadoop1:
su hadoop
tar -zxvf /opt/software/hadoop-3.4.0-aarch64.tar.gz -C /opt/modules
tar -zxvf /opt/software/jdk-8u411-linux-aarch64.tar.gz -C /opt/modules
cd /opt/modules
mv jdk1.8.0_411 jdk1.8.0
ls -l
5. Change the system default Java (optional):
su -
# add my jdk to list
update-alternatives --install /usr/bin/java java /opt/modules/jdk1.8.0/bin/java 1
update-alternatives --install /usr/bin/javac javac /opt/modules/jdk1.8.0/bin/javac 1
# choose my jdk as default
update-alternatives --config java
update-alternatives --config javac
# check default java
ls -l /etc/alternatives/java
ls -l /etc/alternatives/javac
java -version
javac -version
6. Add the JDK and Hadoop to $PATH for all users:
# root user
su -
vim /etc/profile
# append at the end
export JAVA_HOME=/opt/modules/jdk1.8.0
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/opt/modules/hadoop-3.4.0
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
source /etc/profile
# test hadoop
hadoop version
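Since /etc/profile gets edited on each host (and copied again later), an idempotent append avoids duplicated export lines when the steps are re-run. A small sketch, using a scratch file for the demo (on the real hosts `PROFILE` would be /etc/profile, edited as root):

```shell
# Append an export line only if it is not already present,
# so repeated runs never duplicate entries.
PROFILE=$(mktemp)
append_once() {
  grep -qxF "$1" "$PROFILE" || echo "$1" >> "$PROFILE"
}
append_once 'export JAVA_HOME=/opt/modules/jdk1.8.0'
append_once 'export JAVA_HOME=/opt/modules/jdk1.8.0'   # no-op the second time
grep -c 'JAVA_HOME' "$PROFILE"   # expect 1
```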
Config HDFS, YARN, MapReduce
- Go to the Hadoop config directory:
su hadoop
cd /opt/modules/hadoop-3.4.0/etc/hadoop
- Edit the following config files:
vi core-site.xml
# or use VS Code / any other editor
# core-site.xml
<configuration>
<!-- internal HDFS (NameNode) address and port -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop1:9000</value>
</property>
<!-- base directory for HDFS data and metadata -->
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/data</value>
</property>
</configuration>
# hadoop-env.sh
export JAVA_HOME=/opt/modules/jdk1.8.0
# The language environment in which Hadoop runs. Use the English
# environment to ensure that logs are printed as expected.
export LANG=en_US.UTF-8
# Location of Hadoop. By default, Hadoop will attempt to determine
# this location based upon its execution path.
# export HADOOP_HOME=
export HADOOP_HOME=/opt/modules/hadoop-3.4.0
# hdfs-site.xml
<configuration>
<!-- NameNode web UI address -->
<property>
<name>dfs.namenode.http-address</name>
<value>hadoop1:9870</value>
</property>
<!-- SecondaryNameNode web UI address -->
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop2:9868</value>
</property>
</configuration>
# mapred-site.xml
<configuration>
<!-- run MapReduce on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
</configuration>
# yarn-site.xml
<configuration>
<!-- hadoop1 is the ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop1</value>
</property>
<!-- enable the shuffle service -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- class the NodeManager loads for the shuffle handler -->
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<!-- enable log aggregation -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<!-- log aggregation server (job history) URL -->
<property>
<name>yarn.log.server.url</name>
<value>http://hadoop1:19888/jobhistory/logs</value>
</property>
<!-- keep aggregated logs for 7 days -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
</configuration>
# workers: list every DataNode host, one per line, no extra whitespace
hadoop1
hadoop2
hadoop3
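Stray whitespace in the workers file can keep a worker from starting, so a quick check is worthwhile. A sketch on a demo file (`WORKERS` is a scratch file here; on hadoop1 check etc/hadoop/workers instead):

```shell
# Flag any line in the workers file that contains a space or tab.
WORKERS=$(mktemp)
printf 'hadoop1\nhadoop2\nhadoop3\n' > "$WORKERS"
if grep -qE '[[:space:]]' "$WORKERS"; then
  echo "workers file contains whitespace"
else
  echo "workers file is clean"
fi
```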
- Copy /opt/modules to the other 2 hosts:
# on hadoop1
scp -r /opt/modules/* hadoop@hadoop2:/opt/modules
scp -r /opt/modules/* hadoop@hadoop3:/opt/modules
- Copy /etc/profile to the other 2 hosts:
# on hadoop1
scp /etc/profile root@hadoop2:/etc
# on hadoop2
source /etc/profile
# on hadoop1
scp /etc/profile root@hadoop3:/etc
# on hadoop3
source /etc/profile
# change the default Java on hadoop2 and hadoop3 (optional)
su -
# add my jdk to list
update-alternatives --install /usr/bin/java java /opt/modules/jdk1.8.0/bin/java 1
update-alternatives --install /usr/bin/javac javac /opt/modules/jdk1.8.0/bin/javac 1
# choose my jdk as default
update-alternatives --config java
update-alternatives --config javac
- Whenever the Hadoop config changes on hadoop1, sync it to the others with:
# on hadoop1
rsync -avz /opt/modules/hadoop-3.4.0/etc/hadoop/ hadoop@hadoop2:/opt/modules/hadoop-3.4.0/etc/hadoop/
rsync -avz /opt/modules/hadoop-3.4.0/etc/hadoop/ hadoop@hadoop3:/opt/modules/hadoop-3.4.0/etc/hadoop/
Start Hadoop (HDFS, YARN)
- hadoop1: NameNode, DataNode, ResourceManager, NodeManager
- hadoop2: SecondaryNameNode, DataNode, NodeManager
- hadoop3: DataNode, NodeManager
1. Format the NameNode:
# on hadoop1
hdfs namenode -format
You should see that the NameNode metadata directory has been created:
# on hadoop1
cd /home/hadoop/data/dfs/name/current
cat VERSION
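The VERSION file written by the format step records the clusterID that every DataNode must match. Parsing it can be sketched as below (sample content for the demo; on hadoop1 read /home/hadoop/data/dfs/name/current/VERSION, where the IDs will differ):

```shell
# Extract the clusterID from a VERSION-style key=value file.
V=$(mktemp)
cat > "$V" <<'EOF'
namespaceID=1409492583
clusterID=CID-example-0000
storageType=NAME_NODE
layoutVersion=-66
EOF
awk -F= '$1 == "clusterID" {print $2}' "$V"   # prints CID-example-0000
```

If the NameNode is ever re-formatted, delete the data directories on all nodes first; otherwise the DataNodes keep the old clusterID and refuse to join.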
2. Start all daemons:
# on hadoop1
start-all.sh
- hadoop1: check the processes with jps
- hadoop2: check the processes with jps
- hadoop3: check the processes with jps
3. Start the MapReduce job history server:
# on any host
mapred --daemon start historyserver
4. Check the web UIs from the VMs:
- HDFS: http://hadoop1:9870
- YARN: http://hadoop1:8088
  There should be 3 active nodes; if only one shows up, check that the firewall is disabled on every host.
- MapReduce History Server: http://hadoop1:19888
5. To access the web UIs from the host machine, declare the hosts there as well:
sudo vi /etc/hosts
192.168.57.134 hadoop1
192.168.57.135 hadoop2
192.168.57.136 hadoop3
Run the MapReduce Example Jar
1. Create an input directory in HDFS:
hdfs dfs -mkdir /input
2. Create a wordcount input file and upload it to the input directory:
vim ~/words.txt
hello hadoop
hello world
hello hadoop
mapreduce
hdfs dfs -put ~/words.txt /input
hdfs dfs -ls /input
3. Run the example program:
# /output must not already exist
hadoop jar /opt/modules/hadoop-3.4.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.0.jar wordcount /input /output
4. Print the wordcount result:
hdfs dfs -ls /output
hdfs dfs -cat /output/part-r-00000
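The same wordcount can be reproduced locally with a plain shell pipeline, which is handy for knowing the expected result before reading part-r-00000:

```shell
# Local wordcount over the sample input: split words onto their own
# lines, sort, then count duplicates.
WORDS=$(mktemp)
printf 'hello hadoop\nhello world\nhello hadoop\nmapreduce\n' > "$WORDS"
tr -s ' ' '\n' < "$WORDS" | sort | uniq -c | awk '{print $2, $1}'
# hadoop 2
# hello 3
# mapreduce 1
# world 1
```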
5. Check the finished job in the YARN web UI and the history server
Stop Hadoop
1. Stop all daemons:
# on hadoop1
mapred --daemon stop historyserver && stop-all.sh
2. Power off the 3 machines:
poweroff
3. Take a snapshot of each machine
For writing Python MapReduce programs with Hadoop Streaming, see: https://blog.csdn.net/Jacob12138/article/details/138908010