Hadoop Cluster Setup

Contents

1 Purpose
2 Prerequisites
3 Installation
4 Configuring Hadoop in Non-Secure Mode
5 Configuring Environment of Hadoop Daemons
6 Configuring the Hadoop Daemons
7 Monitoring Health of NodeManagers
8 Slaves File
9 Hadoop Rack Awareness
10 Logging
11 Operating the Hadoop Cluster
    Hadoop Startup
    Hadoop Shutdown
12 Web Interfaces


1 Purpose

This document describes how to install and configure Hadoop clusters ranging from a few nodes to extremely large clusters with thousands of nodes. To play with Hadoop, you may first want to install it on a single machine (see Single Node Setup).

This document does not cover advanced topics such as High Availability.

Important: all production Hadoop clusters use Kerberos to authenticate callers and secure access to HDFS data, as well as restricting access to computation services (YARN etc.).

These instructions do not cover integration with any Kerberos services; everyone bringing up a production cluster should include connecting to their organisation’s Kerberos infrastructure as a key part of the deployment.

See Security for details on how to secure a cluster.

2 Prerequisites

  • Install Java. See the Hadoop Wiki for known good versions.
  • Download a stable version of Hadoop from Apache mirrors.

3 Installation

Installing a Hadoop cluster typically involves unpacking the software on all the machines in the cluster or installing it via a packaging system as appropriate for your operating system. It is important to divide up the hardware into functions.

Typically one machine in the cluster is designated as the NameNode and another machine as the ResourceManager, exclusively. These are the masters. Other services (such as Web App Proxy Server and MapReduce Job History server) are usually run either on dedicated hardware or on shared infrastructure, depending upon the load.

The rest of the machines in the cluster act as both DataNode and NodeManager. These are the workers.

4 Configuring Hadoop in Non-Secure Mode

Hadoop’s Java configuration is driven by two types of important configuration files:

  • Read-only default configuration - core-default.xml, hdfs-default.xml, yarn-default.xml and mapred-default.xml.

  • Site-specific configuration - etc/hadoop/core-site.xml, etc/hadoop/hdfs-site.xml, etc/hadoop/yarn-site.xml and etc/hadoop/mapred-site.xml.
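As an illustration, a minimal site-specific etc/hadoop/core-site.xml might override just the default filesystem URI. This is only a sketch; the hostname and port below are placeholders, not values from this document:

```xml
<?xml version="1.0"?>
<!-- etc/hadoop/core-site.xml: site-specific overrides of core-default.xml -->
<configuration>
  <property>
    <!-- URI of the NameNode; host and port are placeholders for your cluster -->
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
```

Any property not overridden here keeps the value from the read-only core-default.xml.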

Additionally, you can control the Hadoop scripts found in the bin/ directory of the distribution, by setting site-specific values via the etc/hadoop/hadoop-env.sh and etc/hadoop/yarn-env.sh.

To configure the Hadoop cluster you will need to configure the environment in which the Hadoop daemons execute as well as the configuration parameters for the Hadoop daemons.

HDFS daemons are NameNode, SecondaryNameNode, and DataNode. YARN daemons are ResourceManager, NodeManager, and WebAppProxy. If MapReduce is to be used, then the MapReduce Job History Server will also be running. For large installations, these are generally running on separate hosts.

5 Configuring Environment of Hadoop Daemons

Administrators should use the etc/hadoop/hadoop-env.sh and optionally the etc/hadoop/mapred-env.sh and etc/hadoop/yarn-env.sh scripts to do site-specific customization of the Hadoop daemons’ process environment.

At the very least, you must specify the JAVA_HOME so that it is correctly defined on each remote node.
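For example, in etc/hadoop/hadoop-env.sh (the JDK path below is a placeholder; point it at your site's installation):

```shell
# etc/hadoop/hadoop-env.sh -- sourced by every Hadoop daemon on this node.
# The JDK location is a placeholder; substitute your actual installation path.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
```

Setting it here, rather than relying on each user's login environment, guarantees the daemons see the same value on every node.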

Administrators can configure individual daemons using the configuration options shown below in the table:

Daemon                         Environment Variable
NameNode                       HDFS_NAMENODE_OPTS
DataNode                       HDFS_DATANODE_OPTS
Secondary NameNode             HDFS_SECONDARYNAMENODE_OPTS
ResourceManager                YARN_RESOURCEMANAGER_OPTS
NodeManager                    YARN_NODEMANAGER_OPTS
WebAppProxy                    YARN_PROXYSERVER_OPTS
Map Reduce Job History Server  MAPRED_HISTORYSERVER_OPTS

For example, to configure the NameNode to use parallel GC and a 4GB Java heap, the following statement should be added in hadoop-env.sh:

  export HDFS_NAMENODE_OPTS="-XX:+UseParallelGC -Xmx4g"

See etc/hadoop/hadoop-env.sh for other examples.

Other useful configuration parameters that you can customize include:

  • HADOOP_PID_DIR - The directory where the daemons’ process id files are stored.
  • HADOOP_LOG_DIR - The directory where the daemons’ log files are stored. Log files are automatically created if they don’t exist.
  • HADOOP_HEAPSIZE_MAX - The maximum amount of memory to use for the Java heapsize. Units supported by the JVM are also supported here. If no unit is present, it will be assumed the number is in megabytes. By default, Hadoop will let the JVM determine how much to use. This value can be overridden on a per-daemon basis using the appropriate _OPTS variable listed above. For example, setting HADOOP_HEAPSIZE_MAX=1g and HADOOP_NAMENODE_OPTS="-Xmx5g" will configure the NameNode with a 5GB heap.
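The interaction between the global heap setting and a per-daemon override described above can be sketched in hadoop-env.sh as follows (the values are the ones quoted in the text, not tuning recommendations):

```shell
# Sketch for hadoop-env.sh: 1 GB default heap for all daemons...
export HADOOP_HEAPSIZE_MAX=1g
# ...with the NameNode overridden via its _OPTS variable, as in the example above.
export HADOOP_NAMENODE_OPTS="-Xmx5g"
```

An explicit -Xmx in an _OPTS variable always wins over HADOOP_HEAPSIZE_MAX for that daemon.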

In most cases, you should specify the HADOOP_PID_DIR and HADOOP_LOG_DIR directories such that they can only be written to by the users that are going to run the hadoop daemons. Otherwise there is the potential for a symlink attack.
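One way to set this up, assuming a dedicated hadoop account runs the daemons (the account name and paths here are placeholders, not prescribed values):

```shell
# Placeholders: adjust the paths and the daemon account to your site.
export HADOOP_PID_DIR=/var/run/hadoop
export HADOOP_LOG_DIR=/var/log/hadoop
# As root, make the daemon user the sole writer of both directories,
# which closes off the symlink attack mentioned above:
#   mkdir -p /var/run/hadoop /var/log/hadoop
#   chown hadoop:hadoop /var/run/hadoop /var/log/hadoop
#   chmod 755 /var/run/hadoop /var/log/hadoop
```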

It is also traditional to configure HADOOP_HOME in the system-wide shell environment configuration. For example, a simple script inside /etc/profile.d:

  HADOOP_HOME=/path/to/hadoop
  export HADOOP_HOME
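A common companion step (an assumption here, not part of the original snippet) is to put the Hadoop commands on the PATH in the same profile script:

```shell
# /etc/profile.d script (the install path is the same placeholder as above).
HADOOP_HOME=/path/to/hadoop
export HADOOP_HOME
# bin/ holds the user commands (hadoop, hdfs, yarn); sbin/ holds start/stop scripts.
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
```

With this in place, every login shell can run hadoop, hdfs, and the daemon control scripts without a full path.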