Hadoop 3 is an open-source distributed computing platform that provides the ability to store and process very large datasets. Its core components are HDFS (the Hadoop Distributed File System) and MapReduce, which together give users efficient, reliable large-scale data processing. Hadoop 3 introduced many new features, including support for containerization, GPU scheduling, and erasure coding, which make it better suited to modern large-scale data processing. On top of it, the wider ecosystem offers tools such as Hive, Pig, and Spark that make processing and analysing big data much more convenient.
Setting up a distributed Hadoop cluster without Docker means preparing several virtual machines, which is time-consuming and can bring a low-spec machine to a crawl, so running the Hadoop master and worker services in Docker containers is an excellent alternative.
Here I use scripts to build the containers and configure the HDFS cluster automatically; a working Hadoop environment comes up within a few minutes, ready for learning and experimentation.
Prerequisites
- CentOS 7 (8 and 9 should in principle work as well)
- Docker
The configuration files are in the netdisk link at the end of this article; the netdisk copy is the most up-to-date version.
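Before running anything, a quick pre-flight check can save time; the following is a minimal sketch (extend the command list with whatever your setup needs):

```shell
# Check that the commands the scripts rely on are available on PATH
missing=0
for cmd in docker tar sh; do
  if ! command -v "$cmd" >/dev/null 2>&1; then
    echo "missing: $cmd" >&2
    missing=1
  fi
done
if [ "$missing" -eq 0 ]; then
  echo "prerequisites look OK"
else
  echo "install the missing tools first" >&2
fi
```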
Overview
The default is 3 nodes (hadoop-master, hadoop-slave1, hadoop-slave2). To change the node count, edit start.sh and replace the 3 with another number (or pass the count as the script's first argument):
# the default node number is 3
N=${1:-3}
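The `${1:-3}` parameter expansion means "use the first argument, or 3 if none was given", so the node count can also be set at invocation time (e.g. `sh start.sh 5`). A minimal illustration:

```shell
# ${1:-default} expands to the first argument if it is set, otherwise the default
node_count() {
  local n=${1:-3}
  echo "$n"
}
node_count      # prints 3 (no argument, so the default applies)
node_count 5    # prints 5 (an explicit argument wins)
```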
Directory structure
The script files; the config directory holds the Hadoop configuration files.
Building the image
Dockerfile
# Use a base image
FROM centos:centos7.9.2009
MAINTAINER thly <nmmbwan@163.com>
# install openssh-server, openjdk and wget
RUN yum update -y && yum install -y openssh-server openssh-clients java-1.8.0-openjdk java-1.8.0-openjdk-devel wget net-tools sudo systemctl
# install hadoop 3.2.3
RUN wget https://mirrors.aliyun.com/apache/hadoop/core/hadoop-3.2.3/hadoop-3.2.3.tar.gz && \
tar -xzvf hadoop-3.2.3.tar.gz && \
mv hadoop-3.2.3 /usr/local/hadoop && \
rm -f hadoop-3.2.3.tar.gz
# set environment variable
ENV JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
ENV HADOOP_HOME=/usr/local/hadoop
ENV PATH=$PATH:/usr/local/hadoop/bin:/usr/local/hadoop/sbin
# Create or overwrite the /etc/profile.d/my_env.sh file
RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk' > /etc/profile.d/my_env.sh && \
echo 'export HADOOP_HOME=/usr/local/hadoop' >> /etc/profile.d/my_env.sh && \
echo 'export PATH=$PATH:/usr/local/hadoop/bin:/usr/local/hadoop/sbin' >> /etc/profile.d/my_env.sh
# Make the script executable
RUN chmod +x /etc/profile.d/my_env.sh
# Create user hadoop
RUN useradd hadoop
# Set password for user hadoop to 'hadoop'
RUN echo 'hadoop:hadoop' | chpasswd
RUN echo 'root:qwer1234' | chpasswd
COPY config/* /tmp/
RUN mv /tmp/hadoop-env.sh /usr/local/hadoop/etc/hadoop/hadoop-env.sh && \
mv /tmp/hdfs-site.xml $HADOOP_HOME/etc/hadoop/hdfs-site.xml && \
mv /tmp/core-site.xml $HADOOP_HOME/etc/hadoop/core-site.xml && \
mv /tmp/mapred-site.xml $HADOOP_HOME/etc/hadoop/mapred-site.xml && \
mv /tmp/yarn-site.xml $HADOOP_HOME/etc/hadoop/yarn-site.xml && \
mv /tmp/workers $HADOOP_HOME/etc/hadoop/workers && \
# mv /tmp/start-hadoop.sh ~/start-hadoop.sh && \
mv /tmp/run-wordcount.sh ~/run-wordcount.sh
#chmod +x ~/start-hadoop.sh && \
RUN chmod +x ~/run-wordcount.sh && \
chmod +x $HADOOP_HOME/sbin/start-dfs.sh && \
chmod +x $HADOOP_HOME/sbin/start-yarn.sh
RUN echo 'hadoop ALL=(ALL) ALL' > /etc/sudoers.d/hadoop && \
chmod 440 /etc/sudoers.d/hadoop
WORKDIR /home/hadoop
# Switch to user hadoop
USER hadoop
RUN mkdir -p ~/hdfs/namenode && \
mkdir -p ~/hdfs/datanode && \
mkdir $HADOOP_HOME/logs && \
chmod -R 755 ~/hdfs && \
chmod -R 755 $HADOOP_HOME/logs
# ssh without key
RUN ssh-keygen -t rsa -f ~/.ssh/id_rsa -P '' && \
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# format namenode
# RUN /usr/local/hadoop/bin/hdfs namenode -format
USER root
# Only the last CMD takes effect: /sbin/init boots systemd, which in turn starts sshd
# CMD [ "sh", "-c", "systemctl start sshd;bash" ]
CMD ["/sbin/init"]
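One detail worth noting in the my_env.sh lines above: the single quotes keep `$PATH` literal in the generated file, so it is expanded at login time rather than frozen at build time. A small illustration of the difference:

```shell
# Single quotes write the literal text; double quotes expand $PATH immediately
single='export PATH=$PATH:/usr/local/hadoop/bin'
double="export PATH=$PATH:/usr/local/hadoop/bin"
echo "$single"   # $PATH stays literal here, to be expanded later by each login shell
```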
Generating the image
The build-image.sh script:
#!/bin/bash
echo -e "\nbuild docker hadoop image\n"
### thly/hadoop:3.0 can be changed to a name of your own; if you change it, also update the image name that start.sh launches to match
docker build -t thly/hadoop:3.0 .
echo -e "\nImage built successfully; the image name is thly/hadoop:3.0"
Run sh build-image.sh to build the image.
Starting the containers
Run sh start.sh to start the containers. By default it brings up one master and two DataNode slaves; edit the script or pass a node count to start more.
#!/bin/bash
IMAGE_NAME="thly/hadoop:3.0"
NETWORK_NAME="bigdata"
SUBNET="172.10.1.0/24"
NODE_IP="172.10.1"
GATEWAY="172.10.1.1"
# Define a variable to store IP addresses and hostnames
declare -A IP_AND_HOSTNAMES
# the default node number is 3
N=${1:-3}
# check whether the image exists
if docker image inspect $IMAGE_NAME &> /dev/null; then
echo "Image $IMAGE_NAME exists."
else
echo "Image $IMAGE_NAME not found; building it."
sh build-image.sh
fi
# Check if the network already exists
if ! docker network inspect $NETWORK_NAME &> /dev/null; then
# Create the network with specified subnet and gateway
docker network create --subnet=$SUBNET --gateway $GATEWAY $NETWORK_NAME
echo "Created network $NETWORK_NAME with subnet $SUBNET and gateway $GATEWAY"
else
echo "Network $NETWORK_NAME already exists"
fi
IP_AND_HOSTNAMES["$NODE_IP.2"]="hadoop-master"
for ((j=1; j<$N; j++)); do
IP_AND_HOSTNAMES["$NODE_IP.$((j+2))"]="hadoop-slave$j"
done
# start hadoop master container
docker rm -f hadoop-master &> /dev/null
echo "start hadoop-master container..."
docker run -itd \
--net=$NETWORK_NAME \
--ip $NODE_IP.2 \
-p 9870:9870 \
-p 9864:9864 \
-p 19888:19888 \
--name hadoop-master \
--hostname hadoop-master \
--privileged=true \
thly/hadoop:3.0
# start hadoop slave container
i=1
while [ $i -lt $N ]
do
docker rm -f hadoop-slave$i &> /dev/null
echo "start hadoop-slave$i container..."
if [ $i -eq 1 ]; then
port="-p 8088:8088"
else
port=""
fi
docker run -itd \
--net=$NETWORK_NAME \
--ip $NODE_IP.$((i+2)) \
--name hadoop-slave$i \
--hostname hadoop-slave$i \
$port \
-p $((9864+i)):9864 \
--privileged=true \
thly/hadoop:3.0
i=$(( $i + 1 ))
done
# iterate over the associative array and append each IP/hostname pair to /etc/hosts inside every container
for key in "${!IP_AND_HOSTNAMES[@]}"; do
value="${IP_AND_HOSTNAMES[$key]}"
echo "Key: $key, Value: $value"
# skip writing a container's own entry to itself
if [ "$value" != "hadoop-master" ]; then
echo "Configure hadoop-master container"
docker exec -it hadoop-master bash -c "sudo echo '$key $value' >> /etc/hosts;sudo -u hadoop ssh-copy-id -o StrictHostKeyChecking=no $key"
fi
for ((i=1; i<$N; i++)); do
if [[ "$value" != hadoop-slave$i ]]; then
echo "Configure hadoop-slave$i container"
docker exec -it hadoop-slave$i bash -c "sudo echo '$key $value' >> /etc/hosts;sudo -u hadoop ssh-copy-id -o StrictHostKeyChecking=no $key"
fi
done
done
# format namenode
docker exec -it -u hadoop hadoop-master bash -c '$HADOOP_HOME/bin/hdfs namenode -format'
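The address-assignment logic in start.sh can be exercised on its own, without Docker. A standalone bash sketch with N=3, matching the defaults above:

```shell
#!/bin/bash
# Rebuild the IP -> hostname map exactly as start.sh does for N=3
NODE_IP="172.10.1"
N=3
declare -A IP_AND_HOSTNAMES
IP_AND_HOSTNAMES["$NODE_IP.2"]="hadoop-master"
for ((j=1; j<N; j++)); do
  IP_AND_HOSTNAMES["$NODE_IP.$((j+2))"]="hadoop-slave$j"
done
# Print the resulting cluster layout
for ((i=0; i<N; i++)); do
  ip="$NODE_IP.$((i+2))"
  echo "$ip ${IP_AND_HOSTNAMES[$ip]}"   # e.g. 172.10.1.2 hadoop-master
done
```

The master always gets .2, and slave i gets .(i+2); on the host side, slave i additionally publishes its DataNode web UI on port 9864+i, as the `-p $((9864+i)):9864` flag above shows.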
Starting Hadoop
There are two ways to start the services: (1) via docker exec, (2) by logging in over ssh.
1. Start via docker exec
Run sh start-hadoop.sh to start the Hadoop services: HDFS on the master, YARN on slave1, and the JobHistory server on the master.
#!/bin/bash
echo "starting hadoop-master DFS..."
docker exec -d -u hadoop hadoop-master bash -c '$HADOOP_HOME/sbin/start-dfs.sh'
sleep 60
echo "HDFS started"
echo "starting hadoop-slave1 YARN..."
docker exec -d -u hadoop hadoop-slave1 bash -c '$HADOOP_HOME/sbin/start-yarn.sh'
echo "YARN started"
echo "starting hadoop-master historyserver ..."
docker exec -d -u hadoop hadoop-master bash -c '$HADOOP_HOME/bin/mapred --daemon start historyserver'
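The fixed `sleep 60` above is a blunt wait; polling until HDFS actually responds is gentler. A sketch of such a helper (the URL and retry counts are assumptions to adapt to your setup):

```shell
# Retry a command until it succeeds or attempts run out; returns non-zero on timeout.
# Could replace the fixed sleep with, for example:
#   wait_for "curl -sf http://localhost:9870/" 30 2
wait_for() {
  local cmd=$1 attempts=${2:-30} delay=${3:-2} i
  for ((i = 1; i <= attempts; i++)); do
    if eval "$cmd" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}
```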
2. Start via ssh (this requires that the host can ssh into the containers as user hadoop; that account's password is hadoop)
#!/bin/bash
# Get hadoop-master IP address
HADOOP_MASTER_IP=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' hadoop-master)
# Get hadoop-slave1 IP address
HADOOP_SLAVE1_IP=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' hadoop-slave1)
echo " =================== starting the Hadoop cluster ==================="
echo " --------------- starting HDFS ---------------"
ssh hadoop@$HADOOP_MASTER_IP '$HADOOP_HOME/sbin/start-dfs.sh'
echo " --------------- starting YARN ---------------"
ssh hadoop@$HADOOP_SLAVE1_IP '$HADOOP_HOME/sbin/start-yarn.sh'
echo " --------------- starting the historyserver ---------------"
ssh hadoop@$HADOOP_MASTER_IP '$HADOOP_HOME/bin/mapred --daemon start historyserver'
Accessing Hadoop
Open http://IP:9870 for the web UI, where IP is the host machine's address.
To get a shell in the Hadoop environment, run docker exec -it -u hadoop hadoop-master bash; run all hadoop commands as the hadoop user.
The hadoop account's password is hadoop; the root password is qwer1234.
Stopping Hadoop
1. Stop via docker exec
#!/bin/bash
echo "stopping hadoop-master historyserver ..."
docker exec -it -u hadoop hadoop-master bash -c '$HADOOP_HOME/bin/mapred --daemon stop historyserver'
echo "stopping hadoop-slave1 YARN..."
docker exec -it -u hadoop hadoop-slave1 bash -c '$HADOOP_HOME/sbin/stop-yarn.sh'
echo "YARN stopped"
echo "stopping hadoop-master DFS..."
docker exec -it -u hadoop hadoop-master bash -c '$HADOOP_HOME/sbin/stop-dfs.sh'
echo "HDFS stopped"
2. Stop via ssh
#!/bin/bash
# Get hadoop-master IP address
HADOOP_MASTER_IP=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' hadoop-master)
# Get hadoop-slave1 IP address
HADOOP_SLAVE1_IP=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' hadoop-slave1)
echo " =================== stopping the Hadoop cluster ==================="
echo " --------------- stopping the historyserver ---------------"
ssh hadoop@$HADOOP_MASTER_IP '$HADOOP_HOME/bin/mapred --daemon stop historyserver'
echo " --------------- stopping YARN ---------------"
ssh hadoop@$HADOOP_SLAVE1_IP '$HADOOP_HOME/sbin/stop-yarn.sh'
echo " --------------- stopping HDFS ---------------"
ssh hadoop@$HADOOP_MASTER_IP '$HADOOP_HOME/sbin/stop-dfs.sh'
Removing the containers and the image
Remove the hadoop-master and hadoop-slave containers and the hadoop image:
#!/bin/bash
prefixes=("hadoop-slave" "hadoop-master")
for prefix in "${prefixes[@]}"; do
# Get a list of container IDs with the specified prefix
container_ids=$(docker ps -aq --filter "name=${prefix}")
# start remove master and slave container
if [ -z "$container_ids" ]; then
echo "No containers with the prefix '${prefix}' found."
else
for container_id in $container_ids; do
# Stop and remove each container
docker stop "$container_id"
docker rm "$container_id"
echo "Container ${container_id} removed."
done
fi
done
docker rmi thly/hadoop:3.0
Configuration files (docker-hadoop)
https://www.aliyundrive.com/s/DDgsifqEoML
Extraction code: 6n2u
Reference: https://zhuanlan.zhihu.com/p/56481503