Hadoop Installation Tutorial (Hadoop 1.x)

Update: if you are new to Hadoop and trying to install it, please check the newer version: Hadoop Installation Tutorial (Hadoop 2.x).

Hadoop mainly consists of two parts: Hadoop MapReduce and HDFS. Hadoop MapReduce is a programming model and software framework for writing applications; it is an open-source variant of MapReduce, which was initially designed and implemented by Google for processing and generating large data sets [1]. HDFS is Hadoop's underlying data persistency layer, loosely modelled after the Google File System (GFS) [2]. Hadoop has seen active development and increasing adoption. Many cloud computing services, such as Amazon EC2, provide MapReduce functions, and the research community uses MapReduce and Hadoop to solve data-intensive problems in bioinformatics, computational finance, chemistry, and environmental science [3]. Although MapReduce has its limitations [3], it is an important framework for processing large data sets.

This tutorial introduces how to set up a Hadoop environment on a cluster. We set up a Hadoop cluster in which one node runs as the NameNode, one node runs as the JobTracker, and many nodes run as TaskTrackers (slaves).

First, we assume that we have created a Linux user "hadoop" on each node that we use and that the "hadoop" user's home directory is "/home/hadoop/".

Enable password-less SSH login to the slaves for the "hadoop" user

Just for our convenience, make sure the "hadoop" user on the NameNode and the JobTracker can SSH to the slaves without a password, so that we do not need to input the password every time.

Details about password-less SSH login can be found in Enabling Password-less SSH Login.
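
As a minimal sketch (assuming OpenSSH's ssh-keygen and ssh-copy-id are available, and a 'nodes' file listing the slave IPs or host names as used later in this tutorial), the keys can be set up roughly as follows:

# generate a key pair with an empty passphrase (skip if one already exists)
$ ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# append the public key to each slave's authorized_keys (asks for the password once per node)
$ for i in `cat nodes`; do ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@$i; done;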

Install software needed by Hadoop

Java JDK

The Java JDK can be downloaded from http://java.sun.com/. Then we can install the Java JDK (actually, copy the JDK directory) on all nodes of the Hadoop cluster.

As an example in this tutorial, the JDK is installed into

/home/hadoop/jdk1.6.0_24

I provide a simple bash script to duplicate the JDK directory to all nodes:

$ for i in `cat nodes`; do scp -rq /home/hadoop/jdk1.6.0_24 hadoop@$i:/home/hadoop/; done;

'nodes' is a file that contains the IPs or host names of all the nodes, one per line.
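
For illustration only, a hypothetical 'nodes' file could look like this (use your actual slave IPs or host names):

10.1.1.3
10.1.1.4
10.1.1.5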

Hadoop

The Hadoop software can be downloaded from here. In this tutorial, we use Hadoop 0.20.203.0.

Then we can install Hadoop on all nodes of the Hadoop cluster.

We can directly unpack it to a directory. In this example, we store it in

/home/hadoop/hadoop/

which is a directory under the hadoop Linux user’s home directory.
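
A minimal sketch of the unpacking step (the tarball and extracted directory names here are assumptions; use the names of the archive you actually downloaded):

$ cd /home/hadoop/
$ tar xzf hadoop-0.20.203.0rc1.tar.gz
$ mv hadoop-0.20.203.0 hadoop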

The hadoop directory can also be duplicated to all nodes using the script above.

Configure environment variables of the "hadoop" user

We assume the "hadoop" user uses bash as its shell.

Add these two lines at the bottom of ~/.bashrc on all nodes:

export HADOOP_COMMON_HOME="/home/hadoop/hadoop/"
export PATH=$HADOOP_COMMON_HOME/bin/:$PATH

The HADOOP_COMMON_HOME environment variable is used by Hadoop's utility scripts, and it must be set; otherwise the scripts may report the error message "Hadoop common not found".

The second line adds Hadoop's bin directory to the PATH so that we can run Hadoop commands directly without specifying the full path.
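
After editing ~/.bashrc, the variables can be loaded into the current shell session (new login shells pick them up automatically):

$ source ~/.bashrc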

Configure Hadoop

conf/hadoop-env.sh

Add or change these lines to specify JAVA_HOME and the directory to store the logs:

export JAVA_HOME=/home/hadoop/jdk1.6.0_24
export HADOOP_LOG_DIR=/home/hadoop/data/logs

conf/core-site.xml

Here the NameNode runs on 10.1.1.30.

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://10.1.1.30:9000</value>
</property>
</configuration>

conf/hdfs-site.xml

<configuration>

<property>
<name>dfs.replication</name>
<value>3</value>
</property>

<property>
<name>dfs.name.dir</name>
<value>/lhome/hadoop/data/dfs/name/</value>
</property>

<property>
<name>dfs.data.dir</name>
<value>/lhome/hadoop/data/dfs/data/</value>
</property>

</configuration>

dfs.replication is the number of replicas of each block. dfs.name.dir is the path on the local filesystem where the NameNode stores the namespace and transaction logs persistently. dfs.data.dir is a comma-separated list of paths on the local filesystem of a DataNode where it stores its blocks.
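
Optionally, the local directories can be created ahead of time (a sketch using the paths configured above; run the first command on the NameNode and the second on each DataNode):

$ mkdir -p /lhome/hadoop/data/dfs/name/
$ mkdir -p /lhome/hadoop/data/dfs/data/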

conf/mapred-site.xml

Here the JobTracker runs on 10.1.1.2.

<configuration>

<property>
<name>mapred.job.tracker</name>
<value>10.1.1.2:9001</value>
</property>

<property>
<name>mapred.system.dir</name>
<value>/hadoop/data/mapred/system/</value>
</property>

<property>
<name>mapred.local.dir</name>
<value>/lhome/hadoop/data/mapred/local/</value>
</property>

</configuration>

mapred.job.tracker is the host (or IP) and port of the JobTracker. mapred.system.dir is the path on HDFS where the Map/Reduce framework stores system files. mapred.local.dir is a comma-separated list of paths on the local filesystem where temporary MapReduce data is written.

conf/slaves

Delete localhost and add the names of all the TaskTrackers, each on one line. For example:

jobtrackname1
jobtrackname2
jobtrackname3
jobtrackname4
jobtrackname5
jobtrackname6

Duplicate Hadoop configuration files to all nodes

We may duplicate the configuration files under the conf directory to all nodes. The script mentioned above can be used.
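
For instance, a sketch that reuses the same 'nodes' file and the paths used in this tutorial:

$ for i in `cat nodes`; do scp -q /home/hadoop/hadoop/conf/* hadoop@$i:/home/hadoop/hadoop/conf/; done;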

By now, we have finished copying the Hadoop software and configuring Hadoop. Now let's have some fun with Hadoop.

Start Hadoop

We need to start both HDFS and MapReduce to start Hadoop.

Format a new HDFS

On NameNode (10.1.1.30):

$ hadoop namenode -format

Remember to delete HDFS's local files on all nodes before re-formatting it:

$ rm -rf /home/hadoop/data /tmp/hadoop-hadoop

Start HDFS

On NameNode (10.1.1.30):

$ start-dfs.sh

Check the HDFS status:

On NameNode (10.1.1.30):

$ hadoop dfsadmin -report

There may be fewer nodes listed in the report than we actually have. We can try it again.

Start mapred:

On JobTracker (10.1.1.2):

$ start-mapred.sh

Check job status:

$ hadoop job -list

Run Hadoop jobs

A simple example

We run a simple example bundled with Hadoop's distribution. For easy-to-run and larger tests, please consider A Simple Sort Benchmark on Hadoop.

Copy the input files into the distributed filesystem:

$ hadoop fs -put /home/hadoop/hadoop/conf input

Run some of the examples:

$ hadoop jar /home/hadoop/hadoop/hadoop-examples-*.jar grep input output 'dfs[a-z.]+'

Examine the output files:

Copy the output files from the distributed filesystem to the local filesystem and examine them:

$ hadoop fs -get output output
$ cat output/*

or

View the output files on the distributed filesystem:

$ hadoop fs -cat output/*
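
Another example bundled in the same examples jar is word count; a sketch (the output directory name here is just an illustration):

$ hadoop jar /home/hadoop/hadoop/hadoop-examples-*.jar wordcount input output-wordcount
$ hadoop fs -cat output-wordcount/*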

Shut down the Hadoop cluster

We can stop Hadoop when we no longer use it.

Stop HDFS on NameNode (10.1.1.30):

$ stop-dfs.sh

Stop JobTracker and TaskTrackers on JobTracker (10.1.1.2):

$ stop-mapred.sh

Some possible problems

Firewall blocks connections

Configure iptables: if these nodes are in a secure local area network (which is usually the case), we can configure iptables to allow all connections by running these commands on all nodes:

# iptables -F
# service iptables save

For a list of the default ports used by Hadoop, please refer to: Hadoop Default Ports.
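
Alternatively, instead of flushing all rules, we could open only the ports used here (a sketch assuming the ports configured above: 9000 for the NameNode and 9001 for the JobTracker; the full list of defaults, including the web UI and DataNode ports, is in the reference above):

# iptables -I INPUT -p tcp --dport 9000 -j ACCEPT
# iptables -I INPUT -p tcp --dport 9001 -j ACCEPT
# service iptables save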

Pitfalls and Lessons

Please also check Pitfalls and Lessons on Configuring and Tuning Hadoop.

References

[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in Proc. of the 6th Symposium on Operating Systems Design and Implementation (OSDI'04), San Francisco, CA, 2004, pp. 137–150.
[2] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," in Proc. of the 19th ACM Symposium on Operating Systems Principles (SOSP'03), 2003, pp. 29–43.
[3] Z. Ma and L. Gu, "The Limitation of MapReduce: A Probing Case and a Lightweight Solution," in CLOUD COMPUTING 2010: Proc. of the 1st Intl. Conf. on Cloud Computing, GRIDs, and Virtualization, 2010, pp. 68–73.

Other Hadoop tutorials

Cluster Setup from Apache.
Managing a Hadoop Cluster from Yahoo.

Additional content

Some additional content for this post.

An example of Hadoop configuration files

Added on Dec. 20, 2012.

An example of Hadoop 1.0.3 configuration files.

Shown here as changes to the default conf directory.

diff -rupN conf/core-site.xml /lhome/hadoop/hadoop-1.0.3/conf/core-site.xml
--- conf/core-site.xml  2012-05-09 04:34:50.000000000 +0800
+++ /lhome/hadoop/hadoop-1.0.3/conf/core-site.xml   2012-07-26 15:45:41.372840027 +0800
@@ -4,5 +4,8 @@
 <!-- Put site-specific property overrides in this file. -->

 <configuration>
-
+<property>
+<name>fs.default.name</name>
+<value>hdfs://hadoop0:9000</value>
+</property>
 </configuration>
diff -rupN conf/hadoop-env.sh /lhome/hadoop/hadoop-1.0.3/conf/hadoop-env.sh
--- conf/hadoop-env.sh  2012-05-09 04:34:50.000000000 +0800
+++ /lhome/hadoop/hadoop-1.0.3/conf/hadoop-env.sh   2012-07-26 15:49:41.025839796 +0800
@@ -6,7 +6,7 @@
 # remote nodes.

 # The java implementation to use.  Required.
-# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
+export JAVA_HOME=/usr/java/jdk1.6.0_24/

 # Extra Java CLASSPATH elements.  Optional.
 # export HADOOP_CLASSPATH=
@@ -32,6 +32,7 @@ export HADOOP_JOBTRACKER_OPTS="-Dcom.sun

 # Where log files are stored.  $HADOOP_HOME/logs by default.
 # export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
+export HADOOP_LOG_DIR=/lhome/hadoop/data/logs

 # File naming remote slave hosts.  $HADOOP_HOME/conf/slaves by default.
 # export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
diff -rupN conf/hdfs-site.xml /lhome/hadoop/hadoop-1.0.3/conf/hdfs-site.xml
--- conf/hdfs-site.xml  2012-05-09 04:34:50.000000000 +0800
+++ /lhome/hadoop/hadoop-1.0.3/conf/hdfs-site.xml   2012-07-26 15:46:06.185839356 +0800
@@ -4,5 +4,18 @@
 <!-- Put site-specific property overrides in this file. -->

 <configuration>
+<property>
+<name>dfs.replication</name>
+<value>3</value>
+</property>

+<property>
+<name>dfs.name.dir</name>
+<value>/lhome/hadoop/data/dfs/name/</value>
+</property>
+
+<property>
+<name>dfs.data.dir</name>
+<value>/lhome/hadoop/data/dfs/data/</value>
+</property>
 </configuration>
diff -rupN conf/mapred-site.xml /lhome/hadoop/hadoop-1.0.3/conf/mapred-site.xml
--- conf/mapred-site.xml    2012-05-09 04:34:50.000000000 +0800
+++ /lhome/hadoop/hadoop-1.0.3/conf/mapred-site.xml 2012-07-26 15:47:39.586907398 +0800
@@ -5,4 +5,24 @@

 <configuration>

+<property>
+<name>mapred.job.tracker</name>
+<value>hadoop0:9001</value>
+</property>
+
+<property>
+<name>mapred.tasktracker.reduce.tasks.maximum</name>
+<value>1</value>
+</property>
+
+<property>
+<name>mapred.tasktracker.map.tasks.maximum</name>
+<value>1</value>
+</property>
+
+<property>
+<name>mapred.local.dir</name>
+<value>/lhome/hadoop/data/mapred/local/</value>
+</property>
+
 </configuration>
diff -rupN conf/slaves /lhome/hadoop/hadoop-1.0.3/conf/slaves
--- conf/slaves 2012-05-09 04:34:50.000000000 +0800
+++ /lhome/hadoop/hadoop-1.0.3/conf/slaves  2012-07-26 15:48:54.811839973 +0800
@@ -1 +1,2 @@
-localhost
+hadoop1
+hadoop2

Source: https://www.systutorials.com/hadoop-installation-tutorial/
