Hadoop Cluster Setup (Part 2): CentOS 7 Example (all required installation archives have already been placed in /opt/software)

Originally published 2025-11-12 17:37:16

License: CC 4.0 BY-SA

Article tags:

#hadoop #centos #big-data

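I. Cluster Overview

1. Strictly speaking, a Hadoop cluster consists of two clusters:

An HDFS cluster and a YARN cluster. The two are logically separate but usually run on the same physical machines. The HDFS cluster stores massive data sets; its main roles are NameNode, DataNode, and SecondaryNameNode.

The YARN cluster schedules resources for large-scale computation; its main roles are ResourceManager and NodeManager.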

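2. So what is MapReduce?

It is a distributed computing programming framework, in other words an application development kit. Users write programs that follow its programming model, then package and submit them to the cluster, where they typically process data stored in HDFS and run under the resource scheduling of the YARN cluster.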

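3. Cluster deployment modes: a Hadoop cluster can be deployed in one of three modes, namely standalone mode, pseudo-distributed mode, and fully distributed mode (cluster mode), described below:

(1) Standalone mode: also called single-machine mode. No daemons need to run, and everything executes in a single JVM. Debugging MapReduce programs is very convenient in this mode, so it is generally used for debugging while learning or developing.

(2) Pseudo-distributed mode: all Hadoop daemons run on a single host. This mode is typically used to debug distributed Hadoop program code and to check that programs execute correctly. It is a special case of fully distributed mode.

(3) Fully distributed mode: the Hadoop daemons run on a cluster built from multiple hosts, with different nodes taking different roles. In real-world application development, this is the mode normally used to build enterprise-grade Hadoop systems. In a Hadoop environment, all server nodes are divided into just two roles: master (the master node, one) and slave (worker nodes, several). Pseudo-distributed mode is therefore a special case of cluster mode in which the master node and the worker node are the same machine.

Next, using a single virtual machine (master) as the example, we walk through installing and configuring a Hadoop cluster in pseudo-distributed mode.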

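II. Preparing Hadoop

1. Extract the prepared archive:

cd /opt/software  # go to the directory holding the prepared archives
tar zxvf hadoop-3.1.3.tar.gz -C /opt/module/  # extract the archive into the target directory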

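2. Hadoop directory layout:

bin : Hadoop's most basic management and usage scripts. The management scripts in sbin are built on top of these, and users can also run them directly to manage and use Hadoop.

etc : the directory containing the Hadoop configuration files.

include : header files of the programming libraries that Hadoop exposes (the dynamic and static libraries themselves are in lib). The headers are written in C++ and are typically used by C++ programs that access HDFS or implement MapReduce programs.

lib : the dynamic and static programming libraries that Hadoop exposes, used together with the headers in include.

libexec : the shell configuration files for each service, used to configure basics such as log output and startup parameters (for example JVM options).

sbin : Hadoop's management scripts, mainly the start/stop scripts for the various HDFS and YARN services.

share : the compiled JARs of each Hadoop module, including the official bundled examples.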

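III. Editing the Hadoop Configuration Files

1. About the files:

Hadoop ships with two kinds of configuration files. The first kind is the read-only default configuration: core-default.xml, hdfs-default.xml, mapred-default.xml, and yarn-default.xml, which hold the default values of Hadoop's parameters. The second kind is the site-specific configuration that you edit when customizing a cluster: core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, and others, located in etc/hadoop/ under the extracted Hadoop directory (most of them contain no settings initially). Set a parameter in these files to override its default as needed; Hadoop gives the values in the site files precedence over the defaults.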

2. What each configuration file does:

hadoop-env.sh: sets the environment variables Hadoop needs to run:

# Set JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_212/
# Set the user that runs the shell commands for each role
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
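
A wrong JAVA_HOME is a common reason the daemons refuse to start, so it can be worth checking the path before going any further. The following is a minimal sketch and not part of the original setup: check_java_home is a hypothetical helper name, and the path in the usage comment is simply the JAVA_HOME value assumed above.

```shell
# Hypothetical helper: succeed only if <dir>/bin/java exists and is executable.
check_java_home() {
    if [ -x "$1/bin/java" ]; then
        echo "OK: $1/bin/java is executable"
    else
        echo "ERROR: no executable java under $1/bin" >&2
        return 1
    fi
}

# Usage (path taken from hadoop-env.sh above; adjust to your JDK location):
#   check_java_home /opt/module/jdk1.8.0_212
```

If the check fails, fix JAVA_HOME in hadoop-env.sh before copying the file into place.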

core-site.xml: Hadoop's core global configuration file; it can be referenced from the other configuration files:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- Name of the default file system. The scheme in the URI selects the file system. -->
    <!-- Examples: file:/// (local file system), hdfs:// (Hadoop Distributed File System), gfs://. -->
    <!-- HDFS access address: hdfs://nn_host:8020. -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:8020</value>
    </property>
    <!-- Common base directory on local disk where Hadoop stores its data -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/data/hadoop-3.1.3</value>
    </property>
    <!-- User name used when accessing HDFS through the Web UI. -->
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>root</value>
    </property>
    <!-- Hosts from which root (the superuser) is allowed to act as a proxy -->
    <property>
        <name>hadoop.proxyuser.root.hosts</name>
        <value>*</value>
    </property>
    <!-- Groups whose members root (the superuser) is allowed to proxy -->
    <property>
        <name>hadoop.proxyuser.root.groups</name>
        <value>*</value>
    </property>
    <!-- Users that root (the superuser) is allowed to proxy -->
    <property>
        <name>hadoop.proxyuser.root.users</name>
        <value>*</value>
    </property>
</configuration>
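
To double-check what a site file actually sets, a small helper can pull a single property out of it. This is only a sketch: get_site_property is a hypothetical name, and the awk script assumes the one-element-per-line layout used in the listing above rather than parsing XML in general.

```shell
# Hypothetical helper: print the <value> that follows <name>KEY</name> in FILE.
# Assumes each <name> and <value> element sits on its own line.
# Usage: get_site_property FILE KEY
get_site_property() {
    awk -v key="$2" '
        # Remember that we have just seen the requested property name.
        index($0, "<name>" key "</name>") { found = 1; next }
        # The next <value> line belongs to that property: strip the tags and print.
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$1"
}

# Usage (path assumes the install location used in this article):
#   get_site_property /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml fs.defaultFS
```

Once Hadoop is on the PATH, hdfs getconf -confKey fs.defaultFS reports the value Hadoop itself resolves, which is the more reliable check.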

hdfs-site.xml: the HDFS configuration file; it inherits from core-site.xml:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- Host and port where the SecondaryNameNode (SNN) runs. -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>master:9868</value>
    </property>
</configuration>

mapred-site.xml: the MapReduce configuration file; it inherits from core-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- Default execution mode for MR programs: yarn = cluster mode, local = local mode. -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- Environment variables for the MR App Master. -->
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
    <!-- Environment variables for MR map tasks. -->
    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
    <!-- Environment variables for MR reduce tasks. -->
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
</configuration>

yarn-site.xml: the YARN configuration file; it inherits from core-site.xml:

<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>
    <!-- Machine that runs the ResourceManager (RM), the master role of the YARN cluster. -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
    <!-- Auxiliary service run on each NodeManager. Must be set to mapreduce_shuffle for MR programs to run. -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Minimum memory allocated per container request (in MB). -->
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>512</value>
    </property>
    <!-- Maximum memory allocated per container request (in MB). -->
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>2048</value>
    </property>
    <!-- Virtual memory check: enabled by default, disabled here -->
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
    <!-- Maximum CPU vcores per container: default 4, set to 1 here -->
    <property>
        <name>yarn.scheduler.maximum-allocation-vcores</name>
        <value>1</value>
    </property>
</configuration>

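3. Put the files above into a directory named hadoop_config and upload it to the server (here as /hadoop_config), then copy them straight into the Hadoop configuration directory with cp:

cd /hadoop_config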
cp core-site.xml hdfs-site.xml hadoop-env.sh mapred-site.xml yarn-site.xml /opt/module/hadoop-3.1.3/etc/hadoop/

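4. workers: this file lists the hostnames of all worker nodes in the Hadoop cluster (the HDFS DataNodes and the YARN NodeManagers), so that the scripts can start every worker node in one step. Open the file, delete its existing content (localhost by default), and enter the following:

vim /opt/module/hadoop-3.1.3/etc/hadoop/workers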

master
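
If you would rather not edit the file interactively, the same result can be scripted. The sketch below is an illustration only: write_workers is a hypothetical helper that overwrites the given file with one hostname per line, which also removes the default localhost entry.

```shell
# Hypothetical helper: overwrite FILE with the given hostnames, one per line.
# Usage: write_workers FILE HOST [HOST...]
write_workers() {
    workers_file=$1
    shift
    # Refuse to write an empty worker list.
    [ "$#" -ge 1 ] || return 1
    printf '%s\n' "$@" > "$workers_file"
}

# Usage for the single-node setup in this article:
#   write_workers /opt/module/hadoop-3.1.3/etc/hadoop/workers master
```

For a fully distributed cluster you would pass every worker hostname instead of just master.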

5. Configure the Hadoop environment variables. On the master server, add the following to /etc/profile:

#HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-3.1.3
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Reload the environment variables and verify that they took effect:

source /etc/profile
hadoop version
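
When this step is scripted, it helps to avoid appending the same export lines every time the script runs. A minimal sketch, assuming the variables live in /etc/profile as above; append_once is a hypothetical helper name.

```shell
# Hypothetical helper: append LINE to FILE only if FILE does not already
# contain that exact line.
# Usage: append_once FILE LINE
append_once() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Usage (single quotes keep $PATH and $HADOOP_HOME unexpanded in the file):
#   append_once /etc/profile 'export HADOOP_HOME=/opt/module/hadoop-3.1.3'
#   append_once /etc/profile 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin'
```

Running it a second time leaves the file unchanged, so the PATH entry is not duplicated.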

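IV. Formatting

HDFS must be formatted before it is started for the first time.

1. Formatting is essentially cleanup and preparation work, because at this point HDFS does not yet physically exist. It only needs to be run on the master server, with the following commands:

cd /opt/module/hadoop-3.1.3/
hdfs namenode -format  # perform the format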
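
Because formatting is meant for the first start only, a guard against running it a second time can be useful. The sketch below is an assumption-laden illustration: namenode_is_formatted is a hypothetical helper, and it assumes dfs.namenode.name.dir is left at its default of dfs/name under hadoop.tmp.dir, where a formatted NameNode keeps a current/VERSION file.

```shell
# Hypothetical helper: succeed if the NameNode directory under the given
# hadoop.tmp.dir already holds a VERSION file, i.e. it has been formatted.
# Assumes dfs.namenode.name.dir keeps its default (<hadoop.tmp.dir>/dfs/name).
# Usage: namenode_is_formatted HADOOP_TMP_DIR
namenode_is_formatted() {
    [ -f "$1/dfs/name/current/VERSION" ]
}

# Usage (hadoop.tmp.dir as set in core-site.xml in this article):
#   if namenode_is_formatted /opt/data/hadoop-3.1.3; then
#       echo "NameNode already formatted, skipping"
#   else
#       hdfs namenode -format
#   fi
```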

2. Start and stop the HDFS cluster

start-dfs.sh  # start HDFS in one step

stop-dfs.sh  # stop HDFS in one step

3. Start and stop the YARN cluster

start-yarn.sh

stop-yarn.sh

4. Alternatively, the following scripts start and stop the HDFS and YARN clusters together:

start-all.sh
stop-all.sh
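
After starting everything, the JDK's jps command lists the running Java processes, which gives a quick way to confirm that the five daemons described in section I are up. The following sketch is an illustration, not part of the original procedure: missing_daemons is a hypothetical helper that reads jps output on standard input and prints the name of each expected daemon that is absent.

```shell
# Hypothetical helper: read "PID ClassName" lines (the jps output format) on
# stdin and print every expected Hadoop daemon that does not appear.
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # Exact match on the second column, so NameNode does not match
        # SecondaryNameNode.
        printf '%s\n' "$jps_output" \
            | awk -v d="$daemon" '$2 == d { ok = 1 } END { exit !ok }' \
            || printf '%s\n' "$daemon"
    done
}

# Usage on the master node after start-all.sh:
#   jps | missing_daemons
```

Empty output means all five daemons were found; any name printed points to a daemon whose log is worth checking.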