Hadoop Cluster Setup

Contents

1 Purpose
2 Prerequisites
3 Installation
4 Configuring Hadoop in Non-Secure Mode
5 Configuring Environment of Hadoop Daemons
6 Configuring the Hadoop Daemons
7 Monitoring Health of NodeManagers
8 Slaves File
9 Hadoop Rack Awareness
10 Logging
11 Operating the Hadoop Cluster
    Hadoop Startup
    Hadoop Shutdown
12 Web Interfaces


1 Purpose

This document describes how to install and configure Hadoop clusters ranging from a few nodes to extremely large clusters with thousands of nodes. To play with Hadoop, you may first want to install it on a single machine (see Single Node Setup).

This document does not cover advanced topics such as High Availability.

Important: all production Hadoop clusters use Kerberos to authenticate callers and secure access to HDFS data, as well as restricting access to computation services (YARN etc.).

These instructions do not cover integration with any Kerberos services; everyone bringing up a production cluster should include connecting to their organisation’s Kerberos infrastructure as a key part of the deployment.

See Security for details on how to secure a cluster.

2 Prerequisites

  • Install Java. See the Hadoop Wiki for known good versions.
  • Download a stable version of Hadoop from Apache mirrors.

3 Installation

Installing a Hadoop cluster typically involves unpacking the software on all the machines in the cluster or installing it via a packaging system as appropriate for your operating system. It is important to divide up the hardware into functions.

Typically one machine in the cluster is designated as the NameNode and another machine as the ResourceManager, exclusively. These are the masters. Other services (such as Web App Proxy Server and MapReduce Job History server) are usually run either on dedicated hardware or on shared infrastructure, depending upon the load.

The rest of the machines in the cluster act as both DataNode and NodeManager. These are the workers.

4 Configuring Hadoop in Non-Secure Mode

Hadoop’s Java configuration is driven by two types of important configuration files:

  • Read-only default configuration - core-default.xml, hdfs-default.xml, yarn-default.xml and mapred-default.xml.

  • Site-specific configuration - etc/hadoop/core-site.xml, etc/hadoop/hdfs-site.xml, etc/hadoop/yarn-site.xml and etc/hadoop/mapred-site.xml.
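As an illustration, a minimal site-specific etc/hadoop/core-site.xml might override just the default filesystem URI. This is only a sketch; the hostname and port below are placeholders, not values from this document:

```xml
<?xml version="1.0"?>
<!-- etc/hadoop/core-site.xml: site-specific overrides of core-default.xml -->
<configuration>
  <property>
    <!-- URI of the NameNode; host and port are placeholders for your cluster -->
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
```

Any property not overridden here keeps the value from the read-only core-default.xml.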

Additionally, you can control the Hadoop scripts found in the bin/ directory of the distribution, by setting site-specific values via the etc/hadoop/hadoop-env.sh and etc/hadoop/yarn-env.sh.

To configure the Hadoop cluster you will need to configure the environment in which the Hadoop daemons execute as well as the configuration parameters for the Hadoop daemons.

HDFS daemons are NameNode, SecondaryNameNode, and DataNode. YARN daemons are ResourceManager, NodeManager, and WebAppProxy. If MapReduce is to be used, then the MapReduce Job History Server will also be running. For large installations, these are generally running on separate hosts.

5 Configuring Environment of Hadoop Daemons

Administrators should use the etc/hadoop/hadoop-env.sh and optionally the etc/hadoop/mapred-env.sh and etc/hadoop/yarn-env.sh scripts to do site-specific customization of the Hadoop daemons’ process environment.

At the very least, you must specify the JAVA_HOME so that it is correctly defined on each remote node.
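For example, in etc/hadoop/hadoop-env.sh (the JDK path below is a placeholder; point it at your site's installation):

```shell
# etc/hadoop/hadoop-env.sh -- sourced by every Hadoop daemon on this node.
# The JDK location is a placeholder; substitute your actual installation path.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
```

Setting it here, rather than relying on each user's login environment, guarantees the daemons see the same value on every node.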

Administrators can configure individual daemons using the configuration options shown below in the table:

Daemon                         Environment Variable
NameNode                       HDFS_NAMENODE_OPTS
DataNode                       HDFS_DATANODE_OPTS
Secondary NameNode             HDFS_SECONDARYNAMENODE_OPTS
ResourceManager                YARN_RESOURCEMANAGER_OPTS
NodeManager                    YARN_NODEMANAGER_OPTS
WebAppProxy                    YARN_PROXYSERVER_OPTS
Map Reduce Job History Server  MAPRED_HISTORYSERVER_OPTS

For example, to configure the NameNode to use parallel GC and a 4GB Java heap, the following statement should be added in hadoop-env.sh:

  export HDFS_NAMENODE_OPTS="-XX:+UseParallelGC -Xmx4g"

See etc/hadoop/hadoop-env.sh for other examples.

Other useful configuration parameters that you can customize include:

  • HADOOP_PID_DIR - The directory where the daemons’ process id files are stored.
  • HADOOP_LOG_DIR - The directory where the daemons’ log files are stored. Log files are automatically created if they don’t exist.
  • HADOOP_HEAPSIZE_MAX - The maximum amount of memory to use for the Java heapsize. Units supported by the JVM are also supported here. If no unit is present, it will be assumed the number is in megabytes. By default, Hadoop will let the JVM determine how much to use. This value can be overridden on a per-daemon basis using the appropriate _OPTS variable listed above. For example, setting HADOOP_HEAPSIZE_MAX=1g and HADOOP_NAMENODE_OPTS="-Xmx5g" will configure the NameNode with a 5GB heap.
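The interaction between the global heap setting and a per-daemon override described above can be sketched in hadoop-env.sh as follows (the values are the ones quoted in the text, not tuning recommendations):

```shell
# Sketch for hadoop-env.sh: 1 GB default heap for all daemons...
export HADOOP_HEAPSIZE_MAX=1g
# ...with the NameNode overridden via its _OPTS variable, as in the example above.
export HADOOP_NAMENODE_OPTS="-Xmx5g"
```

An explicit -Xmx in an _OPTS variable always wins over HADOOP_HEAPSIZE_MAX for that daemon.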

In most cases, you should specify the HADOOP_PID_DIR and HADOOP_LOG_DIR directories such that they can only be written to by the users that are going to run the hadoop daemons. Otherwise there is the potential for a symlink attack.
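One way to set this up, assuming a dedicated hadoop account runs the daemons (the account name and paths here are placeholders, not prescribed values):

```shell
# Placeholders: adjust the paths and the daemon account to your site.
export HADOOP_PID_DIR=/var/run/hadoop
export HADOOP_LOG_DIR=/var/log/hadoop
# As root, make the daemon user the sole writer of both directories,
# which closes off the symlink attack mentioned above:
#   mkdir -p /var/run/hadoop /var/log/hadoop
#   chown hadoop:hadoop /var/run/hadoop /var/log/hadoop
#   chmod 755 /var/run/hadoop /var/log/hadoop
```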

It is also traditional to configure HADOOP_HOME in the system-wide shell environment configuration. For example, a simple script inside /etc/profile.d:

  HADOOP_HOME=/path/to/hadoop
  export HADOOP_HOME
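A common companion step (an assumption here, not part of the original snippet) is to put the Hadoop commands on the PATH in the same profile script:

```shell
# /etc/profile.d script (the install path is the same placeholder as above).
HADOOP_HOME=/path/to/hadoop
export HADOOP_HOME
# bin/ holds the user commands (hadoop, hdfs, yarn); sbin/ holds start/stop scripts.
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
```

With this in place, every login shell can run hadoop, hdfs, and the daemon control scripts without a full path.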