Hadoop: Setting up a Single Node Cluster.

Purpose

  This document describes how to set up and configure a single-node Hadoop installation so that you can quickly perform simple operations using Hadoop MapReduce and the Hadoop Distributed File System (HDFS).

Prerequisites

Supported Platforms

  • GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
  • Windows is also a supported platform but the following steps are for Linux only. To set up Hadoop on Windows, see the wiki page.

Required Software
Required software for Linux includes:

  1. Java™ must be installed. Recommended Java versions are described at HadoopJavaVersions.
  2. ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons if the optional start and stop scripts are to be used. Additionally, it is recommended that pdsh also be installed for better ssh resource management.

Installing Software
If your cluster doesn’t have the requisite software you will need to install it.
For example on Ubuntu Linux:

  $ sudo apt-get install ssh
  $ sudo apt-get install pdsh
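
After installing, you can check that the ssh daemon is actually running before continuing; on Ubuntu with systemd, for example:

  $ sudo systemctl status ssh
  $ which pdsh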

Download

To get a Hadoop distribution, download a recent stable release from one of the Apache Download Mirrors.

Prepare to Start the Hadoop Cluster

Note: run all of the following commands from the Hadoop installation directory; every path in this document is relative to it.

  Unpack the downloaded Hadoop distribution. In the distribution, edit the file etc/hadoop/hadoop-env.sh to define some parameters as follows:

  # set to the root of your Java installation
  export JAVA_HOME=/usr/java/latest
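
If you are not sure where Java lives on your machine, one common way to find a candidate JAVA_HOME (assuming java is on your PATH) is:

  $ dirname $(dirname $(readlink -f $(which java)))

Depending on your JDK layout, the printed path or its parent is the value to export.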

Try the following command:

  $ bin/hadoop

This will display the usage documentation for the hadoop script.

Now you are ready to start your Hadoop cluster in one of the three supported modes:

  • Local (Standalone) Mode
  • Pseudo-Distributed Mode
  • Fully-Distributed Mode

If you have questions about the three modes, see this post: https://blog.csdn.net/qiulinsama/article/details/86216394

Standalone Operation

  By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.
  The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.

  $ mkdir input
  $ cp etc/hadoop/*.xml input
  $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar grep input output 'dfs[a-z.]+'
  $ cat output/*
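
On an unmodified distribution the job typically finds a single match, so the final command prints something like:

  1    dfsadmin

though the exact matches depend on the contents of your configuration files.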

Pseudo-Distributed Operation

  Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.

Configuration

Use the following:

etc/hadoop/core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

etc/hadoop/hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
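
To sanity-check that these settings are picked up, you can query the effective configuration; getconf only reads the config files, so no daemons need to be running yet:

  $ bin/hdfs getconf -confKey fs.defaultFS
  hdfs://localhost:9000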

Setup passphraseless ssh

Now check that you can ssh to the localhost without a passphrase:

  $ ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands:

  $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
  $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  $ chmod 0600 ~/.ssh/authorized_keys
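
Then repeat the check; it should now log you in without prompting for a passphrase (type exit to return to your original shell):

  $ ssh localhost
  $ exit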

Execution

  The following instructions are to run a MapReduce job locally. If you want to execute a job on YARN, see YARN on Single Node.

  1. Format the filesystem:
  $ bin/hdfs namenode -format
  2. Start NameNode daemon and DataNode daemon:
  $ sbin/start-dfs.sh
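
To confirm the daemons came up, you can list the running JVMs with the JDK's jps tool; NameNode, DataNode, and SecondaryNameNode entries should appear (a quick sanity check, not part of the official steps):

  $ jps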

The hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).

Some users may see errors like the following when starting the daemons (here as root):

root@debdutta-Lenovo-G50-80:~# $HADOOP_PREFIX/sbin/start-dfs.sh
WARNING: HADOOP_PREFIX has been replaced by HADOOP_HOME. Using value of HADOOP_PREFIX.
Starting namenodes on [localhost]
ERROR: Attempting to operate on hdfs namenode as root
ERROR: but there is no HDFS_NAMENODE_USER defined. 
Aborting operation.
Starting datanodes
ERROR: Attempting to operate on hdfs datanode as root
ERROR: but there is no HDFS_DATANODE_USER defined. 
Aborting operation.
Starting secondary namenodes [debdutta-Lenovo-G50-80]
ERROR: Attempting to operate on hdfs secondarynamenode as root
ERROR: but there is no HDFS_SECONDARYNAMENODE_USER defined. Aborting operation.

These variables are undefined; add the following lines to the end of etc/hadoop/hadoop-env.sh:

export HDFS_NAMENODE_USER="root"
export HDFS_DATANODE_USER="root"
export HDFS_SECONDARYNAMENODE_USER="root"
export YARN_RESOURCEMANAGER_USER="root"
export YARN_NODEMANAGER_USER="root"
  3. Browse the web interface for the NameNode; by default it is available at:
NameNode - http://localhost:9870/
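On a headless machine you can probe the port from the command line instead (assuming curl is installed):

  $ curl -s http://localhost:9870/ | head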
  4. Make the HDFS directories required to execute MapReduce jobs:
  $ bin/hdfs dfs -mkdir /user
  $ bin/hdfs dfs -mkdir /user/<username>
  5. Copy the input files into the distributed filesystem:
  $ bin/hdfs dfs -mkdir /input
  $ bin/hdfs dfs -put etc/hadoop/*.xml /input
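Before running the job, you can confirm the files actually landed in HDFS:

  $ bin/hdfs dfs -ls /input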
  6. Run some of the examples provided (note the /input and /output paths, matching the directories used above):
  $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar grep /input /output 'dfs[a-z.]+'
  7. Examine the output files: Copy the output files from the distributed filesystem to the local filesystem and examine them:
  $ bin/hdfs dfs -get /output/* output
  $ cat output/*

or

View the output files on the distributed filesystem:

  $ bin/hdfs dfs -cat /output/*
  8. When you’re done, stop the daemons with:
  $ sbin/stop-dfs.sh

YARN on a Single Node

  You can run a MapReduce job on YARN in a pseudo-distributed mode by setting a few parameters and running ResourceManager daemon and NodeManager daemon in addition.
  The following instructions assume that steps 1 through 4 of the above instructions have already been executed.

  1. Configure parameters as follows:

etc/hadoop/mapred-site.xml:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- The HADOOP_MAPRED_HOME environment variable is not set in this walkthrough,
         so this property is skipped for now:
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
    -->
</configuration>

etc/hadoop/yarn-site.xml:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Environment variables that containers may override, rather than using the
         NodeManager's defaults. This property is also skipped here for now:
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
    -->
</configuration>
  2. Start ResourceManager daemon and NodeManager daemon:
  $ sbin/start-yarn.sh
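Once both daemons are up, you can check that the NodeManager has registered with the ResourceManager (a quick sanity check, not part of the official steps):

  $ bin/yarn node -list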
  3. Browse the web interface for the ResourceManager; by default it is available at:
ResourceManager - http://localhost:8088/
  4. Run a MapReduce job.
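For example, you can rerun the grep job from the HDFS steps above; with mapreduce.framework.name set to yarn it now runs as a YARN application instead of a local process. If /output still exists from the earlier run, remove it first:

  $ bin/hdfs dfs -rm -r /output
  $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar grep /input /output 'dfs[a-z.]+'

If the job fails because MapReduce classes cannot be found, you may need the mapreduce.application.classpath property shown (commented out) above.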
  5. When you’re done, stop the daemons with:
  $ sbin/stop-yarn.sh

Fully-Distributed Operation

  For information on setting up fully-distributed, non-trivial clusters see Cluster Setup.
