Pro Hadoop (4) - Getting Started with Hadoop - Installing Hadoop

Latest recommended article published on 2023-06-19 19:51:30

罗伯特北京


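1.1 Installing Hadoop

Like any other software, Hadoop has prerequisites. If you install Cygwin, you can also run and develop Hadoop applications on Windows, but we strongly recommend Linux as the production platform for Hadoop.

Note that you need a basic knowledge of Linux and Java to use Hadoop. The book's examples are launched with Bash scripts.

1.1.1 Installation Prerequisites

The book's examples require the following environment:

Fedora 8
Sun Java 6
Hadoop 0.19.0 or later

Hadoop releases earlier than 0.18.2 are not compatible with the book's examples, which cannot be compiled against them. Java versions earlier than 1.6 do not support all of the language features that Hadoop Core requires. In addition, Hadoop Core appears to behave better on the Sun JDK; users of other vendors' JDKs regularly ask for help. The examples in later chapters of the book are based on Hadoop 0.17.0, which requires JDK 1.6.

Hadoop runs on any modern Linux distribution. I prefer the Red Hat Package Manager (RPM) used by Red Hat, Fedora, and CentOS, so the book's examples follow an RPM-based installation.

The Fedora project, which has a large user base, provides torrents (downloaded with BitTorrent) for each Fedora release (http://torrent.fedoraproject.org/). If you want to skip the update process, Fedora Unity provides releases with the updates already applied, known as re-spins, at http://spins.fedoraunity.org/spins. Re-spins of older releases are not provided. Re-spins must be downloaded with the custom download tool Jigdo.

If you are new to Linux and only want to try it out, a Live CD or a USB stick with persistent storage gives you a simple and quick test environment. More experienced users can download VMware Linux installation images from http://www.vmware.com/appliances/directory/cat/45?sort=changed.

1.1.1.1 Installing Hadoop on Linux

After installing Linux, you must work out where the JDK is installed, because the installation path is needed to set the JAVA_HOME and PATH environment variables.

You can use the rpm command with a few options to list the files that an RPM package contains: -q queries the package, -l lists its files, and -p gives the path of the package file being queried. The output is piped to egrep, which searches it for the simple regular expression '/bin/javac$'.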

cloud9: ~/Downloads$ rpm -q -l -p ~/Downloads/jdk-6u7-linux-i586.rpm | egrep '/bin/javac$'

On my machine, the output is:

/usr/java/jdk1.6.0_07/bin/javac

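Note that the single quotes around /bin/javac$ are essential. Without the single quotes, or with double quotes instead, the shell may try to interpret the $ as the start of a variable reference.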
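The effect of the anchored pattern can be tried without a JDK package at hand. The following is a minimal sketch, not part of the book's code: the file list is an illustrative sample (only the javac path comes from the output above), and grep -E is used as the equivalent of egrep.

```shell
# Illustrative sample of a package file list; only the javac line is taken
# from the output shown above, the other two lines are assumptions.
paths='/usr/java/jdk1.6.0_07/bin/java
/usr/java/jdk1.6.0_07/bin/javac
/usr/java/jdk1.6.0_07/bin/javadoc'

# The single quotes keep the shell from touching the $, so grep sees it
# as the end-of-line anchor and only the javac line matches.
match=$(printf '%s\n' "$paths" | grep -E '/bin/javac$')
echo "$match"
```

The trailing $ is what excludes javadoc, and the leading /bin/ is what excludes any other file whose name merely ends in javac.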

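We assume that the JDK installer is run from the ~/Downloads directory; the installer unpacks the bundled RPM file into the current working directory.

The output shows that the JDK is installed in /usr/java/jdk1.6.0_07 and that the Java executables are in /usr/java/jdk1.6.0_07/bin.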
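Going from the javac path to the JDK directory is a matter of stripping the /bin/javac suffix. A small sketch, assuming the path reported above; the variable names javac_path and java_home are chosen here for illustration only:

```shell
# Path reported by the rpm query above.
javac_path=/usr/java/jdk1.6.0_07/bin/javac

# ${var%suffix} removes the shortest matching suffix from the value.
java_home=${javac_path%/bin/javac}

# Print the two settings; the single quotes keep the second line literal
# so that it is expanded later, when the shell start-up file is read.
echo "export JAVA_HOME=${java_home}"
echo 'export PATH=${JAVA_HOME}/bin:${PATH}'
```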

Add the following two lines to your .bashrc or .bash_profile:

export JAVA_HOME=/usr/java/jdk1.6.0_07

export PATH=${JAVA_HOME}/bin:${PATH}
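If you add these lines by script rather than by hand, it helps to avoid appending them twice. The sketch below is one way to do that; append_once is a hypothetical helper name, not something the book provides, and the demonstration writes to a scratch file instead of your real ~/.bashrc.

```shell
# append_once is a hypothetical helper, not part of the book's scripts.
# It appends a line to a file only if that exact line is not already there.
append_once() {
    line=$1
    file=$2
    # -q: quiet, -x: match the whole line, -F: treat the line as a fixed string
    if ! grep -qxF -e "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Demonstrate on a scratch file rather than the real ~/.bashrc.
rc_file=$(mktemp)
append_once 'export JAVA_HOME=/usr/java/jdk1.6.0_07' "$rc_file"
append_once 'export PATH=${JAVA_HOME}/bin:${PATH}' "$rc_file"
# Running the same calls again leaves the file unchanged.
append_once 'export JAVA_HOME=/usr/java/jdk1.6.0_07' "$rc_file"
append_once 'export PATH=${JAVA_HOME}/bin:${PATH}' "$rc_file"
cat "$rc_file"
```

To use it for real, pass ~/.bashrc or ~/.bash_profile as the second argument.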

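Listing 1-1 shows the update_env.sh script, which sets up these environment variables for you (the script is included in the code that accompanies the book). Download the JDK RPM installer before running the script.

Listing 1-1 The update_env.sh script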

#! /bin/sh
# This script attempts to work out the installation directory of the jdk,
# given the installer file.
# The script assumes that the installer is an rpm based installer and
# that the name of the downloaded installer ends in
# -rpm.bin
#
# The script first attempts to verify there is one argument and the
# argument is an existing file
# The file may be either the installer binary, the -rpm.bin
# or the actual installation rpm that was unpacked by the installer
#
# The script will use the rpm command to work out the
# installation package name from the rpm file, and then
# use the rpm command to query the installation database,
# for where the files of the rpm were installed.
# This query of the installation is done rather than
# directly querying the rpm, on the off
# chance that the installation was installed in a different root
# directory than the default.
# Finally, the proper environment set commands are appended
# to the user's .bashrc and .bash_profile file, if they exist, and
# echoed to the standard out so the user may apply them to
# their currently running shell sessions.

# Verify that there was a single command line argument
# which will be referenced as $1
if [ $# != 1 ]; then
    echo "No jdk rpm specified" 1>&2
    echo "Usage: $0 jdk.rpm" 1>&2
    exit 1
fi

# Verify that the command argument exists in the file system
if [ ! -e "$1" ]; then
    echo "the argument specified ($1) for the jdk rpm does not exist" 1>&2
    exit 1
fi

# Does the argument end in '-rpm.bin' which is the suggested install
# file, is the argument the actual .rpm file, or something else
# set the variable RPM to the expected location of the rpm file that
# was extracted from the installer file
if echo "$1" | grep -q -e '-rpm\.bin$'; then
    RPM=`dirname "$1"`/`basename "$1" -rpm.bin`.rpm
elif echo "$1" | grep -q -e '\.rpm$'; then
    RPM=$1
else
    echo -n "$1 does not appear to be the downloaded rpm.bin file or" 1>&2
    echo " the extracted rpm file" 1>&2
    exit 1
fi

# Verify that the rpm file exists and is readable
if [ ! -r "$RPM" ]; then
    echo -n "The jdk rpm file (${RPM}) does not appear to exist," 1>&2
    echo " have you run \"sh ${RPM}\" as root?" 1>&2
    exit 1
fi

# Work out the actual installed package name using the rpm command.
# See man rpm for details
INSTALLED=`rpm -q --qf '%{Name}-%{Version}-%{Release}' -p "${RPM}"`
if [ $? -ne 0 ]; then
    (echo -n "Unable to extract package name from rpm (${RPM}),"
    echo " have you installed it yet?") 1>&2
    exit 1
fi

# Where did the rpm install process place the java compiler program 'javac'
JAVAC=`rpm -q -l "${INSTALLED}" | egrep '/bin/javac$'`

# If there was no javac found, then issue an error
if [ $? -ne 0 ]; then
    (echo -n "Unable to determine the JAVA_HOME location from $RPM, "
    echo "was the rpm installed? Try rpm -Uvh ${RPM} as root.") 1>&2
    exit 1
fi

# If we found javac, then we can compute the setting for JAVA_HOME
JAVA_HOME=`echo "$JAVAC" | sed -e 's;/bin/javac$;;'`
echo "The setting for the JAVA_HOME environment variable is ${JAVA_HOME}"
echo -n "update the user's .bashrc if they have one with the"
echo " setting for JAVA_HOME and the PATH."
if [ -w ~/.bashrc ]; then
    echo "Updating the ~/.bashrc file with the java environment variables";
    (echo export JAVA_HOME=${JAVA_HOME} ;
    echo export PATH='${JAVA_HOME}'/bin:'${PATH}' ) >> ~/.bashrc
    echo
fi
echo -n "update the user's .bash_profile if they have one with the"
echo " setting for JAVA_HOME and the PATH."
if [ -w ~/.bash_profile ]; then
    echo "Updating the ~/.bash_profile file with the java environment variables";
    (echo export JAVA_HOME=${JAVA_HOME} ;
    echo export PATH='${JAVA_HOME}'/bin:'${PATH}' ) >> ~/.bash_profile
    echo
fi
echo "paste the following two lines into your running shell sessions"
echo export JAVA_HOME=${JAVA_HOME}
echo export PATH='${JAVA_HOME}'/bin:'${PATH}'

Running the script in Listing 1-1 locates the JDK installation directory and then updates your environment variables so that the installed JDK is used.

update_env.sh "FULL_PATH_TO_DOWNLOADED_JDK"

./update_env.sh ~/Downloads/jdk-6u7-linux-i586-rpm.bin

The setting for the JAVA_HOME environment variable is /usr/java/jdk1.6.0_07

update the user's .bashrc if they have one with the setting for JAVA_HOME and the PATH.

Updating the ~/.bashrc file with the java environment variables

update the user's .bash_profile if they have one with the setting for JAVA_HOME and the PATH.

Updating the ~/.bash_profile file with the java environment variables

paste the following two lines into your running shell sessions

export JAVA_HOME=/usr/java/jdk1.6.0_07

export PATH=${JAVA_HOME}/bin:${PATH}
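The step in update_env.sh that maps its argument to the RPM file name can be looked at in isolation. The following is a sketch of that one step written with a case statement; rpm_path_for is a hypothetical function name introduced here, and the logic is a restatement for illustration, not the book's code.

```shell
# rpm_path_for is a hypothetical helper used only for illustration.
# Given the downloaded installer (-rpm.bin) or the unpacked package (.rpm),
# print the path where the .rpm file is expected to be.
rpm_path_for() {
    case "$1" in
        *-rpm.bin)
            # The installer unpacks NAME.rpm next to NAME-rpm.bin
            echo "$(dirname "$1")/$(basename "$1" -rpm.bin).rpm" ;;
        *.rpm)
            echo "$1" ;;
        *)
            echo "$1 is neither a -rpm.bin installer nor an .rpm file" 1>&2
            return 1 ;;
    esac
}

rpm_path_for ~/Downloads/jdk-6u7-linux-i586-rpm.bin
```

A case pattern matches the whole string, so the suffix test cannot be fooled by a name that merely contains "rpm" somewhere in the middle.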

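1.1.1.2 Installing Hadoop on Windows: Procedure and Common Problems

To use Hadoop on Windows, first install the Sun JDK and the Cygwin environment (Cygwin can be downloaded from http://sources.redhat.com/cygwin).

Start the Cygwin Bash shell by clicking the icon shown in Figure 2-3. Create a symbolic link named ~/java that points to the JDK installation directory, so that cd ~/java takes you into the JDK installation directory. JAVA_HOME should then be set to ~/java. Processes locate the java executable through their environment variables; Hadoop, for example, needs the Java installation directory in order to run its tasks.

Figure 2-3 The Cygwin Bash shell icon

If the path that JAVA_HOME points to contains spaces, the bin/hadoop script does not run correctly. The JDK is normally installed under C:\Program Files\Java\jdkRELEASE_VERSION. If you create a symbolic link and point JAVA_HOME at the link, bin/hadoop works. This is how I usually set it up in my Cygwin installation: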

$ echo $JAVA_HOME

/home/Jason/jdk1.6.0_12

$ ls -l /home/Jason/jdk1.6.0_12

lrwxrwxrwx 1 Jason None 43 Mar 20 16:32 /home/Jason/jdk1.6.0_12 -> /cygdrive/c/Program Files/Java/jdk1.6.0_12/
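The symbolic-link workaround can be rehearsed in a scratch directory before touching a real installation. This is a sketch under stated assumptions: the directory layout below is created on the fly to imitate the one above, and the variable names work, jdk_dir, and link are illustrative.

```shell
# Build a throwaway directory tree that imitates a JDK installed under
# a path containing a space.
work=$(mktemp -d)
jdk_dir="$work/Program Files/Java/jdk1.6.0_12"
mkdir -p "$jdk_dir/bin"

# Create a link whose own path has no space and that points at the JDK.
link="$work/jdk1.6.0_12"
ln -s "$jdk_dir" "$link"

# The link path is what JAVA_HOME would be set to.
case "$link" in
    *" "*) echo "link path still contains a space" ;;
    *)     echo "link path is free of spaces: $link" ;;
esac

# ls -l shows the link and its target separated by ->
ls -ld "$link"
```

On a real system the ln -s target would be the /cygdrive/c/... path of your JDK and the link would live in your Cygwin home directory.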

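Cygwin maps each Windows drive letter to /cygdrive/X, where X is the drive letter. In addition, the Cygwin path separator is "/", whereas the Windows path separator is "\".

When you run the bin/hadoop script, keep in mind that your files have two sets of paths. The bin/hadoop script and all of the Cygwin utilities see a file system that is a subtree of the Windows file system, with the Windows drives mapped under /cygdrive. Windows programs, however, see the traditional C:\ file system. Take /tmp as an example: in a standard Cygwin installation, /tmp is also the directory C:\cygwin\tmp. Java, on the other hand, resolves /tmp as C:\tmp, which is a completely different directory. If you launch a Windows application from Cygwin and get a file-not-found error, the usual cause is that the application (for example, the Java executable) is looking for the file under the wrong path.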
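The drive mapping itself is mechanical, as the sketch below illustrates. cygdrive_to_windows is a hypothetical function written for this note and handles only /cygdrive/X/... paths; on a real Cygwin system the cygpath utility (for example cygpath -w) does this conversion and also knows about mount points such as /tmp, which this sketch does not.

```shell
# cygdrive_to_windows is a hypothetical helper for illustration only.
# It rewrites /cygdrive/X/rest/of/path as X:\rest\of\path.
cygdrive_to_windows() {
    case "$1" in
        /cygdrive/[a-zA-Z]/*)
            rest=${1#/cygdrive/}     # e.g. c/Program Files/Java
            drive=${rest%%/*}        # e.g. c
            tail=${rest#*/}          # e.g. Program Files/Java
            drive=$(printf '%s' "$drive" | tr '[:lower:]' '[:upper:]')
            # Swap the separators and put the drive letter in front
            printf '%s:\\%s\n' "$drive" "$(printf '%s' "$tail" | tr '/' '\\')"
            ;;
        *)
            # Not under /cygdrive: this sketch cannot translate it
            return 1 ;;
    esac
}

cygdrive_to_windows '/cygdrive/c/Program Files/Java/jdk1.6.0_12'
```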

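Note that you may need to adjust these settings for your own system, depending on how the Sun JDK and Windows are installed. In particular, the user name may not be Jason, the JDK version may not be 1.6.0_12, and the JDK may not be installed under C:\Program Files\Java.

1.1.2 Installing Hadoop

Once you have Linux, or Windows with Cygwin, installed, the next step is to download and install Hadoop.

Open the Hadoop download page at http://www.apache.org/dyn/closer.cgi/hadoop/core/. Find the tar.gz file of your choice, which is the file discussed in the introduction, and download it.

If you are cautious, go back to the download page and also fetch the PGP signature and the MD5 checksum of the file.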
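Checking the MD5 value can be scripted. The sketch below assumes the md5sum utility from GNU coreutils is available; md5_matches is a hypothetical helper name, and the demonstration uses a scratch file because the published checksum of the Hadoop archive is not reproduced here.

```shell
# md5_matches is a hypothetical helper for illustration only.
# $1 = file to check, $2 = expected MD5 value as a hex string
md5_matches() {
    # md5sum prints "<hash>  <name>"; keep only the hash
    actual=$(md5sum < "$1" | awk '{print $1}')
    [ "$actual" = "$2" ]
}

# Demonstration on a scratch file. For the real download, $expected would
# be the value published next to the tar.gz file.
sample=$(mktemp)
printf 'hello\n' > "$sample"
expected=$(md5sum < "$sample" | awk '{print $1}')

if md5_matches "$sample" "$expected"; then
    echo "checksum OK"
else
    echo "checksum MISMATCH"
fi
```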

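Unpack the tar file into whatever directory you want to use for a test installation. I usually unpack it into a src directory under my home directory, ~jason/src:

mkdir ~/src

cd ~/src

tar zxf ~/Downloads/hadoop-0.19.0.tar.gz

This creates a new directory, hadoop-0.19.0, under ~/src.

Add the following two lines to your .bashrc or .bash_profile:

export HADOOP_HOME=~/src/hadoop-0.19.0

export PATH=${HADOOP_HOME}/bin:${PATH}

If you use a directory other than ~/src, adjust these export statements to match the path you chose.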

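1.1.3 Checking Your Environment

After installing Hadoop, check that the JAVA_HOME and HADOOP_HOME environment variables are set correctly. Your PATH should contain ${JAVA_HOME}/bin and ${HADOOP_HOME}/bin, and they should come before any other Java or Hadoop installation, ideally as the first elements of PATH. In addition, your shell's current working directory should be ${HADOOP_HOME}. These settings are required to run the book's examples.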
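The ordering requirement can be checked mechanically by finding where a directory sits in a PATH-style string. The following is a minimal sketch; path_position is a hypothetical helper name, and the sample PATH value is an assumption built from the directories used in this section.

```shell
# path_position is a hypothetical helper for illustration only.
# $1 = directory to look for, $2 = colon-separated PATH-style string
# Prints the 1-based position of the directory, or fails if it is absent.
path_position() {
    i=1
    old_ifs=$IFS
    IFS=:                      # split $2 on colons in the for loop below
    for entry in $2; do
        if [ "$entry" = "$1" ]; then
            IFS=$old_ifs
            echo "$i"
            return 0
        fi
        i=$((i + 1))
    done
    IFS=$old_ifs
    return 1
}

# Sample value; in practice pass "$PATH" as the second argument.
sample_path="/home/scyrus/src/hadoop-0.19.0/bin:/usr/java/jdk1.6.0_07/bin:/usr/bin:/bin"
path_position /usr/java/jdk1.6.0_07/bin "$sample_path"
```

A position of 1 or 2 for both bin directories means nothing else on the PATH can shadow them.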

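Listing 1-2 shows the check_basic_env.sh script, which verifies your runtime environment (the script is included in the book's downloadable sample code).

Listing 1-2 The check_basic_env.sh script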

#! /bin/sh
# This block is trying to do the basics of checking to see if
# the HADOOP_HOME and the JAVA_HOME variables have been set correctly
# and if they have not been set, suggest a setting in line with the earlier examples
# The script actually tests for:
# the presence of the java binary and the hadoop script,
# and verifies that the expected versions are present
# that the version of java and hadoop is as expected (warning if not)
# that the version of java and hadoop referred to by the
# JAVA_HOME and HADOOP_HOME environment variables are default version to run.
#
#
# The 'if [' construct you see is a shortcut for 'if test' ....
# the -z tests for a zero length string
# the -d tests for a directory
# the -x tests for the execute bit
# -eq tests numbers
# = tests strings
# man test will describe all of the options
# The '1>&2' construct directs the standard output of the
# command to the standard error stream.
if [ -z "$HADOOP_HOME" ]; then
    echo "The HADOOP_HOME environment variable is not set" 1>&2
    if [ -d ~/src/hadoop-0.19.0 ]; then
        echo "Try export HADOOP_HOME=~/src/hadoop-0.19.0" 1>&2
    fi
    exit 1;
fi
# This block is trying to do the basics of checking to see if
# the JAVA_HOME variable has been set
# and if it hasn't been set, suggest a setting in line with the earlier examples
if [ -z "$JAVA_HOME" ]; then
    echo "The JAVA_HOME environment variable is not set" 1>&2
    if [ -d /usr/java/jdk1.6.0_07 ]; then
        echo "Try export JAVA_HOME=/usr/java/jdk1.6.0_07" 1>&2
    fi
    exit 1
fi
# We are now going to see if a java program and hadoop programs
# are in the path, and if they are the ones we are expecting.
# The which command returns the full path to the first instance
# of the program in the PATH environment variable
#
JAVA_BIN=`which java`
HADOOP_BIN=`which hadoop`
# Check for the presence of java in the path and suggest an
# appropriate path setting if java is not found
if [ -z "${JAVA_BIN}" ]; then
    echo "The java binary was not found using your PATH settings" 1>&2
    if [ -x "${JAVA_HOME}/bin/java" ]; then
        echo 'Try export PATH=${JAVA_HOME}/bin:${PATH}' 1>&2
    fi
    exit 1
fi
# Check for the presence of hadoop in the path and suggest an
# appropriate path setting if hadoop is not found
if [ -z "${HADOOP_BIN}" ]; then
    echo "The hadoop binary was not found using your PATH settings" 1>&2
    if [ -x "${HADOOP_HOME}/bin/hadoop" ]; then
        echo 'Try export PATH=${HADOOP_HOME}/bin:${PATH}' 1>&2
    fi
    exit 1
fi
# Double check that the version of java installed in ${JAVA_HOME}
# is the one stated in the examples.
# If you have installed a different version your results may vary.
# The -F option makes grep treat the version as a fixed string.
if ! "${JAVA_HOME}/bin/java" -version 2>&1 | grep -q -F '1.6.0_07'; then
    (echo -n "Your JAVA_HOME version of java is not the"
    echo -n " 1.6.0_07 version, your results may vary from"
    echo " the book examples.") 1>&2
fi
# Double check that the java in the PATH is the expected version.
if ! java -version 2>&1 | grep -q -F '1.6.0_07'; then
    (echo -n "Your default java version is not the 1.6.0_07 "
    echo -n "version, your results may vary from the book"
    echo " examples.") 1>&2
fi
# Try to get the location of the hadoop core jar file
# This is used to verify the version of hadoop installed
# Errors from ls are discarded; an empty result means "not found"
HADOOP_JAR=`ls -1 "${HADOOP_HOME}"/hadoop-0.19.0-core.jar 2>/dev/null`
HADOOP_ALT_JAR=`ls -1 "${HADOOP_HOME}"/hadoop-*-core.jar 2>/dev/null`
# If a hadoop jar was not found, either the installation
# was incorrect or a different version installed
if [ -z "${HADOOP_JAR}" -a -z "${HADOOP_ALT_JAR}" ]; then
    (echo -n "Your HADOOP_HOME does not provide a hadoop"
    echo -n " core jar. Your installation probably needs"
    echo -n " to be redone or the HADOOP_HOME environment"
    echo " variable needs to be correctly set.") 1>&2
    exit 1
fi
if [ -z "${HADOOP_JAR}" -a ! -z "${HADOOP_ALT_JAR}" ]; then
    (echo -n "Your hadoop version appears to be different"
    echo -n " than the 0.19.0 version, your results may vary"
    echo " from the book examples.") 1>&2
fi
if [ "`pwd`" != "${HADOOP_HOME}" ]; then
    (echo -n 'Please change your working directory to'
    echo ' ${HADOOP_HOME}. cd ${HADOOP_HOME} <Enter>') 1>&2
    exit 1
fi
echo "You are good to go"
echo -n "your JAVA_HOME is set to ${JAVA_HOME} which "
echo "appears to exist and be the right version for the examples."
echo -n "your HADOOP_HOME is set to ${HADOOP_HOME} which "
echo "appears to exist and be the right version for the examples."
echo "your java program is the one in ${JAVA_HOME}"
echo "your hadoop program is the one in ${HADOOP_HOME}"
echo -n "The shell current working directory is ${HADOOP_HOME} "
echo "as the examples require."
if [ "${JAVA_BIN}" = "${JAVA_HOME}/bin/java" ]; then
    echo "Your PATH appears to have the JAVA_HOME java program as the default java."
else
    echo -n "Your PATH does not appear to provide the JAVA_HOME"
    echo " java program as the default java."
fi
if [ "${HADOOP_BIN}" = "${HADOOP_HOME}/bin/hadoop" ]; then
    echo -n "Your PATH appears to have the HADOOP_HOME"
    echo " hadoop program as the default hadoop."
else
    echo -n "Your PATH does not appear to provide the HADOOP_HOME "
    echo "hadoop program as the default hadoop program."
fi
exit 0

Then run the script:

[scyrus@localhost ~]$ ./check_basic_env.sh

Please change your working directory to ${HADOOP_HOME}. cd ${HADOOP_HOME} <Enter>

[scyrus@localhost ~]$ cd $HADOOP_HOME
[scyrus@localhost hadoop-0.19.0]$
[scyrus@localhost hadoop-0.19.0]$ ~/check_basic_env.sh

You are good to go
your JAVA_HOME is set to /usr/java/jdk1.6.0_07 which appears to exist and be the right version for the examples.
your HADOOP_HOME is set to /home/scyrus/src/hadoop-0.19.0 which appears to exist and be the right version for the examples.
your java program is the one in /usr/java/jdk1.6.0_07
your hadoop program is the one in /home/scyrus/src/hadoop-0.19.0
The shell current working directory is /home/scyrus/src/hadoop-0.19.0 as the examples require.
Your PATH appears to have the JAVA_HOME java program as the default java.
Your PATH appears to have the HADOOP_HOME hadoop program as the default hadoop.