1.1 Installing Hadoop
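As with any other software, using Hadoop has some prerequisites. If you install Cygwin, it is also possible to run and develop Hadoop applications on Windows. However, we strongly recommend Linux as the production platform for running Hadoop.
Note that you need a basic knowledge of Linux and Java to use Hadoop. The book's examples are launched with Bash scripts.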
1.1.1 Installation Prerequisites
The examples in this book were run in the following environment:
- Fedora 8
- Sun Java 6
- Hadoop 0.19.0 or later
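Hadoop versions earlier than 0.18.2 are not compatible with the book's examples, which cannot be compiled against them. Java versions earlier than 1.6 do not support all of the language features that Hadoop Core requires. In addition, Hadoop Core appears to perform better on the Sun JDK, and users of other vendors' JDKs regularly ask for help. The examples in later chapters of this book are based on Hadoop 0.17.0, which requires JDK 1.6.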
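Hadoop runs on any modern Linux distribution. I prefer the Red Hat Package Manager (RPM) used by Red Hat, Fedora, and CentOS, so the examples in this book follow an RPM-based installation procedure.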
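The Fedora Project, which has a large user base, provides torrents (BitTorrent downloads) for each Fedora release at http://torrent.fedoraproject.org/. If you want to skip the update process, Fedora Unity provides so-called re-spins, which are releases with the updates already applied, at http://spins.fedoraunity.org/spins. Fedora Unity does not provide packages for older releases. Downloading a re-spin requires the custom download tool Jigdo.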
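If you are new to Linux and just want to try it out, a Live CD or a USB stick with persistent storage gives you a simple, quick test environment. Experienced users can download VMware Linux installation images from http://www.vmware.com/appliances/directory/cat/45?sort=changed.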
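1.1.1.1 Installing Hadoop on Linux
After you have installed Linux, you need to work out where the JDK is installed, because the JDK installation path is needed to set the JAVA_HOME and PATH environment variables.
You can use the rpm command with a few options to list the files contained in an RPM package: -q runs a query, -l lists all of the files, and -p gives the path of the package file being queried. Pipe the output through egrep to search for the string '/bin/javac$', which egrep treats as a simple regular expression.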
cloud9: ~/Downloads$ rpm -q -l -p ~/Downloads/jdk-6u7-linux-i586.rpm | egrep '/bin/javac$'
On my machine, the output is:
/usr/java/jdk1.6.0_07/bin/javac
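Note that the single quotes around the string /bin/javac$ are essential. If you leave the quotes off or use double quotes, the shell may interpret the $ as introducing an environment variable.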
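The effect of the two quoting styles is easiest to see with a pattern in which $ is followed by a name. The following is a minimal sketch; the variable javac and its value EXPANDED are hypothetical and exist only to make the expansion visible.

```shell
#!/bin/sh
# Hypothetical variable, defined only so that an expansion is visible.
javac='EXPANDED'

# Single quotes: the shell passes the text through untouched.
single='/bin/$javac'
# Double quotes: the shell replaces $javac with the variable's value.
double="/bin/$javac"

printf 'single quotes: %s\n' "$single"
printf 'double quotes: %s\n' "$double"
```

With single quotes, egrep receives exactly the characters you typed, which is the safe habit for regular expressions.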
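We assume that the JDK installer is run from the ~/Downloads directory; the installer unpacks the bundled RPM file into the current working directory.
The output shows that the JDK is installed in /usr/java/jdk1.6.0_07 and that the Java executables are in /usr/java/jdk1.6.0_07/bin.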
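The step from the javac path to the JDK directory can be sketched in a few lines of shell. This is only an illustration; the variable names javac_path and java_home are assumptions, and the path is the one reported by the rpm query on my machine.

```shell
#!/bin/sh
# Path reported by the rpm query (the value from this chapter's example).
javac_path=/usr/java/jdk1.6.0_07/bin/javac

# Remove the trailing /bin/javac to obtain the JDK installation directory.
java_home=${javac_path%/bin/javac}

# Print the two lines that belong in .bashrc or .bash_profile.
echo "export JAVA_HOME=${java_home}"
echo 'export PATH=${JAVA_HOME}/bin:${PATH}'
```

The second echo uses single quotes on purpose, so that ${JAVA_HOME} and ${PATH} are written out literally and are expanded later, when the startup file is read.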
Add the following two lines to your .bashrc or .bash_profile:
export JAVA_HOME=/usr/java/jdk1.6.0_07
export PATH=${JAVA_HOME}/bin:${PATH}
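If a startup file is read more than once in a session, a plain PATH assignment like the one above adds the same directory again each time. One way to avoid that is sketched below; the helper name path_prepend is hypothetical and is not part of the book's code.

```shell
#!/bin/sh
# path_prepend LIST DIR
# Print the colon-separated LIST with DIR moved (or added) to the front,
# so that DIR appears exactly once.
path_prepend() {
    # Split LIST into lines, drop lines equal to DIR, and join the rest again.
    rest=$(printf '%s' "$1" | tr ':' '\n' | grep -v -x -F -e "$2" | paste -s -d ':' -)
    if [ -n "$rest" ]; then
        printf '%s:%s\n' "$2" "$rest"
    else
        printf '%s\n' "$2"
    fi
}

# Example with literal values; nothing in the running shell is changed.
path_prepend "/usr/bin:/bin" "/usr/java/jdk1.6.0_07/bin"
```

In a startup file it could be used as PATH=$(path_prepend "$PATH" "${JAVA_HOME}/bin"), followed by export PATH.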
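Listing 1-1 shows update_env.sh, a script that performs this configuration for you (you can find it in the code that accompanies this book). Before running the script, download the JDK RPM installer package.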
#! /bin/sh
# This script attempts to work out the installation directory of the jdk,
# given the installer file.
# The script assumes that the installer is an rpm based installer and
# that the name of the downloaded installer ends in
# -rpm.bin
#
# The script first attempts to verify there is one argument and the
# argument is an existing file
# The file may be either the installer binary, the -rpm.bin
# or the actual installation rpm that was unpacked by the installer
#
# The script will use the rpm command to work out the
# installation package name from the rpm file, and then
# use the rpm command to query the installation database,
# for where the files of the rpm were installed.
# This query of the installation is done rather than
# directly querying the rpm, on the off
# chance that the installation was installed in a different root
# directory than the default.
# Finally, the proper environment set commands are appended
# to the user's .bashrc and .bash_profile file, if they exist, and
# echoed to the standard out so the user may apply them to
# their currently running shell sessions.

# Verify that there was a single command line argument
# which will be referenced as $1
if [ $# != 1 ]; then
    echo "No jdk rpm specified" 1>&2
    echo "Usage: $0 jdk.rpm" 1>&2
    exit 1
fi

# Verify that the command argument exists in the file system
if [ ! -e "$1" ]; then
    echo "the argument specified ($1) for the jdk rpm does not exist" 1>&2
    exit 1
fi

# Does the argument end in '-rpm.bin' which is the suggested install
# file, is the argument the actual .rpm file, or something else
# set the variable RPM to the expected location of the rpm file that
# was extracted from the installer file
if echo "$1" | grep -q -e '-rpm\.bin$'; then
    RPM=`dirname "$1"`/`basename "$1" -rpm.bin`.rpm
elif echo "$1" | grep -q -e '\.rpm$'; then
    RPM=$1
else
    echo -n "$1 does not appear to be the downloaded rpm.bin file or" 1>&2
    echo " the extracted rpm file" 1>&2
    exit 1
fi

# Verify that the rpm file exists and is readable
if [ ! -r "$RPM" ]; then
    echo -n "The jdk rpm file (${RPM}) does not appear to exist" 1>&2
    echo " have you run \"sh ${RPM}\" as root?" 1>&2
    exit 1
fi

# Work out the actual installed package name using the rpm command
# man rpm for details
INSTALLED=`rpm -q --qf '%{Name}-%{Version}-%{Release}' -p "${RPM}"`
if [ $? -ne 0 ]; then
    (echo -n "Unable to extract package name from rpm (${RPM}),"
     echo " have you installed it yet?") 1>&2
    exit 1
fi

# Where did the rpm install process place the java compiler program 'javac'
JAVAC=`rpm -q -l ${INSTALLED} | egrep '/bin/javac$'`
# If there was no javac found, then issue an error
if [ $? -ne 0 ]; then
    (echo -n "Unable to determine the JAVA_HOME location from $RPM, "
     echo "was the rpm installed? Try rpm -Uvh ${RPM} as root.") 1>&2
    exit 1
fi

# If we found javac, then we can compute the setting for JAVA_HOME
JAVA_HOME=`echo $JAVAC | sed -e 's;/bin/javac;;'`
echo "The setting for the JAVA_HOME environment variable is ${JAVA_HOME}"

echo -n "update the user's .bashrc if they have one with the"
echo " setting for JAVA_HOME and the PATH."
if [ -w ~/.bashrc ]; then
    echo "Updating the ~/.bashrc file with the java environment variables";
    (echo export JAVA_HOME=${JAVA_HOME} ;
     echo export PATH='${JAVA_HOME}'/bin:'${PATH}' ) >> ~/.bashrc
    echo
fi

echo -n "update the user's .bash_profile if they have one with the"
echo " setting for JAVA_HOME and the PATH."
if [ -w ~/.bash_profile ]; then
    echo "Updating the ~/.bash_profile file with the java environment variables";
    (echo export JAVA_HOME=${JAVA_HOME} ;
     echo export PATH='${JAVA_HOME}'/bin:'${PATH}' ) >> ~/.bash_profile
    echo
fi

echo "paste the following two lines into your running shell sessions"
echo export JAVA_HOME=${JAVA_HOME}
echo export PATH='${JAVA_HOME}'/bin:'${PATH}'
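Running the script in Listing 1-1 finds the JDK installation directory and then updates your environment variables so that the installed JDK is the one used.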
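1.1.1.2 Installing Hadoop on Windows: Approach and Common Problems
To use Hadoop on Windows, you first need to install the Sun JDK and the Cygwin environment (you can download Cygwin from http://sources.redhat.com/cygwin).
Start the Cygwin Bash shell by clicking the icon shown in Figure 2-3. Create a symbolic link named ~/java that points to the JDK installation directory, so that running cd ~/java takes you to the JDK installation directory. JAVA_HOME should then be set to ~/java. Processes locate the java executable through their environment variables; Hadoop, for example, needs to find the Java installation directory to run its tasks.
Figure 2-3 The Cygwin Bash shell icon
The bin/hadoop script does not work correctly if the path in the JAVA_HOME environment variable contains spaces. The JDK is usually installed under C:\Program Files\Java\jdkRELEASE_VERSION. If you create a symbolic link and point JAVA_HOME at that link, bin/hadoop works correctly. This is how I usually set it up in my Cygwin installation: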
$ echo $JAVA_HOME
/home/Jason/jdk1.6.0_12
$ ls -l /home/Jason/jdk1.6.0_12
lrwxrwxrwx 1 Jason None 43 Mar 20 16:32 /home/Jason/jdk1.6.0_12 -> /cygdrive/c/Program Files/Java/jdk1.6.0_12/
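A link like the one in this listing can be created with ln -s. The sketch below wraps that command in two hypothetical helpers, link_jdk and has_space, which are not part of the book's code; the directory names in the comment are the ones from my installation and will differ on yours.

```shell
#!/bin/sh
# link_jdk TARGET LINK
# Create (or replace) the symbolic link LINK so that it points at TARGET.
# The quotes keep a TARGET that contains spaces together as one argument.
link_jdk() {
    ln -s -f -n "$1" "$2"
}

# has_space PATH
# Succeed if PATH contains a space, the case that bin/hadoop cannot handle.
has_space() {
    case "$1" in
        *" "*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Intended use inside Cygwin (not executed here):
#   link_jdk "/cygdrive/c/Program Files/Java/jdk1.6.0_12" ~/jdk1.6.0_12
#   export JAVA_HOME=~/jdk1.6.0_12
```

has_space can be run against ${JAVA_HOME} after the export to confirm that the value handed to bin/hadoop is free of spaces.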
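Cygwin maps Windows drive letters to /cygdrive/X, where X is the drive letter. In addition, the Cygwin path separator is /, whereas the Windows path separator is \.
When you run the bin/hadoop script, remember that your files have two sets of paths. The bin/hadoop script and all of the Cygwin utilities see a file system that is a subtree of the Windows file system, with the Windows drives mapped under the /cygdrive directory. Windows programs, however, see the traditional C:\ file system. Take /tmp as an example: in a standard Cygwin installation, /tmp is also the directory C:\cygwin\tmp. Java will interpret /tmp as C:\tmp, which is a completely different directory. If you launch a Windows application from Cygwin and get a file-not-found error, the usual cause is that the application (the Java executable, for example) is looking for the file under the wrong path.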
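The /cygdrive mapping can be illustrated with a small function. This is a sketch under stated assumptions: cygdrive_to_windows is a hypothetical name, and the function handles only paths that start with /cygdrive/X/. It cannot translate a path such as /tmp, because that depends on where Cygwin itself is installed; inside Cygwin, the cygpath utility is the tool intended for such conversions.

```shell
#!/bin/sh
# cygdrive_to_windows PATH
# Print the Windows form of a /cygdrive/X/... path.
# Fail for any path that is not under /cygdrive.
cygdrive_to_windows() {
    case "$1" in
        /cygdrive/[A-Za-z]/*)
            # Turn /cygdrive/c/ into c:/ and then every / into \.
            printf '%s\n' "$1" |
                sed -e 's;^/cygdrive/\([a-zA-Z]\)/;\1:/;' -e 's;/;\\;g'
            ;;
        *)
            return 1
            ;;
    esac
}

cygdrive_to_windows '/cygdrive/c/Program Files/Java/jdk1.6.0_12'
# prints c:\Program Files\Java\jdk1.6.0_12
```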
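Note that you will probably need to adjust these settings for your own system, because they depend on how the Sun JDK and Windows were installed. In particular, your user name is unlikely to be Jason, the JDK version may not be 1.6.0_12, and the JDK may not be installed under C:\Program Files\Java.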
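1.1.2 Installing Hadoop
Once you have Linux, or Windows with Cygwin, installed, the next step is to download and install Hadoop.
Go to the Hadoop download site at http://www.apache.org/dyn/closer.cgi/hadoop/core/. Find the tar.gz file of your choice (the one discussed in the introductory chapter) and download it.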
If you are a cautious person, go back to the site and also get the PGP signature and the MD5 digest for the file.
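Checking the MD5 digest can be sketched as follows. The helper name verify_md5 is hypothetical, the sketch assumes the md5sum utility is available, and the expected digest must be copied from the download site; none is reproduced here.

```shell
#!/bin/sh
# verify_md5 FILE EXPECTED
# Succeed if the MD5 digest of FILE equals the hex string EXPECTED.
verify_md5() {
    # md5sum prints "<digest>  -" when reading standard input; keep the digest.
    actual=$(md5sum < "$1" | cut -d ' ' -f 1)
    [ "$actual" = "$2" ]
}

# Intended use (not executed here); replace DIGEST with the published value:
#   verify_md5 ~/Downloads/hadoop-0.19.0.tar.gz DIGEST && echo "digest matches"
```

A matching digest only shows that the download is intact; checking the PGP signature is a separate step, normally done with gpg --verify.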
Unpack the tar file in whatever directory you want to use for a test installation. I usually unpack it into a src directory under my home directory, ~jason/src:
mkdir ~/src
cd ~/src
tar zxf ~/Downloads/hadoop-0.19.0.tar.gz
This creates a new directory, hadoop-0.19.0, under ~/src.
Add the following two lines to your .bashrc or .bash_profile file:
export HADOOP_HOME=~/src/hadoop-0.19.0
export PATH=${HADOOP_HOME}/bin:${PATH}
If you used a directory other than ~/src, adjust these export statements to match the path you chose.
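1.1.3 Checking Your Environment
After installing Hadoop, check that the JAVA_HOME and HADOOP_HOME environment variables are set correctly. Your PATH environment variable should contain ${JAVA_HOME}/bin and ${HADOOP_HOME}/bin, and both should come before any other Java or Hadoop installation in PATH, preferably as its first elements. In addition, your shell's current working directory should be ${HADOOP_HOME}. These settings are required to run the examples in this book.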
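The reason the order of PATH elements matters is that the shell runs the first match it finds. The following sketch imitates that lookup; first_in_path is a hypothetical helper written for illustration, and in everyday use the which command gives the same answer for your real PATH.

```shell
#!/bin/sh
# first_in_path NAME LIST
# Print the first executable file called NAME found in the
# colon-separated directory LIST; fail if there is none.
first_in_path() {
    name=$1
    list=$2
    old_ifs=$IFS
    IFS=:
    for dir in $list; do
        if [ -x "$dir/$name" ] && [ ! -d "$dir/$name" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Intended use (not executed here):
#   [ "$(first_in_path java "$PATH")" = "${JAVA_HOME}/bin/java" ] &&
#       echo "JAVA_HOME java is the default java"
```

If another Java installation appears earlier in the list, the function prints that one instead.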
The check_basic_env.sh script shown in Listing 1-2 verifies your runtime environment (you can find it in the downloadable example code that accompanies this book).
Listing 1-2 The check_basic_env.sh script
#! /bin/sh
# This block is trying to do the basics of checking to see if
# the HADOOP_HOME and the JAVA_HOME variables have been set correctly
# and if they have not been set, suggest a setting in line with the earlier examples
# The script actually tests for:
# the presence of the java binary and the hadoop script,
# and verifies that the expected versions are present
# that the version of java and hadoop is as expected (warning if not)
# that the version of java and hadoop referred to by the
# JAVA_HOME and HADOOP_HOME environment variables are default version to run.
#
#
# The 'if [' construct you see is a shortcut for 'if test' ....
# the -z tests for a zero length string
# the -d tests for a directory
# the -x tests for the execute bit
# -eq tests numbers
# = tests strings
# man test will describe all of the options
# The '1>&2' construct directs the standard output of the
# command to the standard error stream.
if [ -z "$HADOOP_HOME" ]; then
    echo "The HADOOP_HOME environment variable is not set" 1>&2
    if [ -d ~/src/hadoop-0.19.0 ]; then
        echo "Try export HADOOP_HOME=~/src/hadoop-0.19.0" 1>&2
    fi
    exit 1;
fi

# This block is trying to do the basics of checking to see if
# the JAVA_HOME variable has been set
# and if it hasn't been set, suggest a setting in line with the earlier examples
if [ -z "$JAVA_HOME" ]; then
    echo "The JAVA_HOME environment variable is not set" 1>&2
    if [ -d /usr/java/jdk1.6.0_07 ]; then
        echo "Try export JAVA_HOME=/usr/java/jdk1.6.0_07" 1>&2
    fi
    exit 1
fi

# We are now going to see if a java program and hadoop programs
# are in the path, and if they are the ones we are expecting.
# The which command returns the full path to the first instance
# of the program in the PATH environment variable
#
JAVA_BIN=`which java`
HADOOP_BIN=`which hadoop`

# Check for the presence of java in the path and suggest an
# appropriate path setting if java is not found
if [ -z "${JAVA_BIN}" ]; then
    echo "The java binary was not found using your PATH settings" 1>&2
    if [ -x "${JAVA_HOME}/bin/java" ]; then
        echo 'Try export PATH=${JAVA_HOME}/bin:${PATH}' 1>&2
    fi
    exit 1
fi

# Check for the presence of hadoop in the path and suggest an
# appropriate path setting if hadoop is not found
if [ -z "${HADOOP_BIN}" ]; then
    echo "The hadoop binary was not found using your PATH settings" 1>&2
    if [ -x "${HADOOP_HOME}/bin/hadoop" ]; then
        echo 'Try export PATH=${HADOOP_HOME}/bin:${PATH}' 1>&2
    fi
    exit 1
fi

# Double check that the version of java installed in ${JAVA_HOME}
# is the one stated in the examples.
# If you have installed a different version your results may vary.
#
if ! "${JAVA_HOME}/bin/java" -version 2>&1 | grep -q 1.6.0_07; then
    (echo -n "Your JAVA_HOME version of java is not the"
     echo -n " 1.6.0_07 version, your results may vary from"
     echo " the book examples.") 1>&2
fi

# Double check that the java in the PATH is the expected version.
if ! java -version 2>&1 | grep -q 1.6.0_07; then
    (echo -n "Your default java version is not the 1.6.0_07 "
     echo -n "version, your results may vary from the book"
     echo " examples.") 1>&2
fi

# Try to get the location of the hadoop core jar file
# This is used to verify the version of hadoop installed
HADOOP_JAR=`ls -1 ${HADOOP_HOME}/hadoop-0.19.0-core.jar 2>/dev/null`
HADOOP_ALT_JAR=`ls -1 ${HADOOP_HOME}/hadoop-*-core.jar 2>/dev/null`

# If a hadoop jar was not found, either the installation
# was incorrect or a different version installed
if [ -z "${HADOOP_JAR}" -a -z "${HADOOP_ALT_JAR}" ]; then
    (echo -n "Your HADOOP_HOME does not provide a hadoop"
     echo -n " core jar. Your installation probably needs"
     echo -n " to be redone or the HADOOP_HOME environment"
     echo " variable needs to be correctly set.") 1>&2
    exit 1
fi

if [ -z "${HADOOP_JAR}" -a ! -z "${HADOOP_ALT_JAR}" ]; then
    (echo -n "Your hadoop version appears to be different"
     echo -n " than the 0.19.0 version, your results may vary"
     echo " from the book examples.") 1>&2
fi

if [ "`pwd`" != "${HADOOP_HOME}" ]; then
    (echo -n 'Please change your working directory to'
     echo ' ${HADOOP_HOME}. cd ${HADOOP_HOME} <Enter>') 1>&2
    exit 1
fi

echo "You are good to go"
echo -n "your JAVA_HOME is set to ${JAVA_HOME} which "
echo "appears to exist and be the right version for the examples."
echo -n "your HADOOP_HOME is set to ${HADOOP_HOME} which "
echo "appears to exist and be the right version for the examples."
echo "your java program is the one in ${JAVA_HOME}"
echo "your hadoop program is the one in ${HADOOP_HOME}"
echo -n "The shell current working directory is ${HADOOP_HOME} "
echo "as the examples require."

if [ "${JAVA_BIN}" = "${JAVA_HOME}/bin/java" ]; then
    echo "Your PATH appears to have the JAVA_HOME java program as the default java."
else
    echo -n "Your PATH does not appear to provide the JAVA_HOME"
    echo " java program as the default java."
fi

if [ "${HADOOP_BIN}" = "${HADOOP_HOME}/bin/hadoop" ]; then
    echo -n "Your PATH appears to have the HADOOP_HOME"
    echo " hadoop program as the default hadoop."
else
    echo -n "Your PATH does not appear to provide the HADOOP_HOME "
    echo "hadoop program as the default hadoop program."
fi
exit 0
Then run the script:
[scyrus@localhost ~]$ ./check_basic_env.sh
Please change your working directory to ${HADOOP_HOME}. cd ${HADOOP_HOME} <Enter>
[scyrus@localhost ~]$ cd $HADOOP_HOME
[scyrus@localhost hadoop-0.19.0]$
[scyrus@localhost hadoop-0.19.0]$ ~/check_basic_env.sh
You are good to go
your JAVA_HOME is set to /usr/java/jdk1.6.0_07 which appears to exist and be the right version for the examples.
your HADOOP_HOME is set to /home/scyrus/src/hadoop-0.19.0 which appears to exist and be the right version for the examples.
your java program is the one in /usr/java/jdk1.6.0_07
your hadoop program is the one in /home/scyrus/src/hadoop-0.19.0
The shell current working directory is /home/scyrus/src/hadoop-0.19.0 as the examples require.
Your PATH appears to have the JAVA_HOME java program as the default java.
Your PATH appears to have the HADOOP_HOME hadoop program as the default hadoop.