Environment
- OS: OS X 10.9.2
- Hadoop version: 2.3.0
Preparation
Installing Java
I used Java 1.7, which can be downloaded here.
After downloading and installing it, the Java environment variables need to be configured (I edit ~/.bash_profile directly). Setting this up on OS X is a bit awkward; the approach commonly recommended online (see "Setting JAVA_HOME on Mac OS") is:
export JAVA_HOME=`/usr/libexec/java_home`
Setting up SSH keys
A single-node pseudo-distributed deployment requires passwordless SSH to localhost. Generate a key pair with:
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
If the SSH service is not already running, enable it under "System Preferences -> Sharing -> Remote Login".
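sshd refuses to use authorized_keys when its permissions are too open, which makes passwordless login fail silently. A minimal sketch to tighten them (assuming the key pair from above already exists):

```shell
# Make sure the SSH directory and key file exist with permissions sshd accepts:
# ~/.ssh must be 700 and authorized_keys must be 600.
mkdir -p ~/.ssh
touch ~/.ssh/authorized_keys
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
```

Afterwards, ssh localhost should log in without prompting for a password.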
下载Hadoop 2.3.0
在 官网 下载。网上建议的方法是新建一个用户专门用来进行Hadoop环境的配置和管理,偷懒的就在当前用户目录下找个地方解压。
以下是环境变量的配置,后文中都用$HADOOP_HOME表示hadoop根目录。
# hadoop
export HADOOP_HOME=~/Environment/hadoop-2.3.0
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
The hadoop command should now be available; verify it with:
[ 502 ~ ]$hadoop version
Hadoop 2.3.0
Subversion http://svn.apache.org/repos/asf/hadoop/common -r 1567123
Compiled by jenkins on 2014-02-11T13:40Z
Compiled with protoc 2.5.0
From source with checksum dfe46336fbc6a044bc124392ec06b85
This command was run using /Users/chenshijiang/Environment/hadoop-2.3.0/share/hadoop/common/hadoop-common-2.3.0.jar
Hadoop configuration
A few configuration files need to be edited before Hadoop can be used. In Hadoop 2.3.0 they all live under $HADOOP_HOME/etc/hadoop. The changes for each file are listed below.
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/Users/username/Environment/hadoop-2.3.0/tmp</value>
<description>A base for other temporary directories.</description>
</property>
</configuration>
Note the hadoop.tmp.dir setting: by default Hadoop stores its data under /tmp, which gets wiped on reboot, and a missing tmp directory is a common reason the NameNode fails to start.
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/Users/username/Environment/hadoop-2.3.0/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/Users/username/Environment/hadoop-2.3.0/hdfs/datanode</value>
</property>
</configuration>
Note that the namenode and datanode directories configured above must be created before Hadoop is started.
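A quick sketch of creating those directories up front; the paths mirror the values in core-site.xml and hdfs-site.xml above (adjust HADOOP_HOME if your install lives elsewhere):

```shell
# Create the local storage directories referenced by the config files.
# HADOOP_HOME is assumed to point at the unpacked hadoop-2.3.0 tree.
HADOOP_HOME="${HADOOP_HOME:-$HOME/Environment/hadoop-2.3.0}"
mkdir -p "$HADOOP_HOME/hdfs/namenode" \
         "$HADOOP_HOME/hdfs/datanode" \
         "$HADOOP_HOME/tmp"
```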
mapred-site.xml
(The distribution ships only mapred-site.xml.template; copy it to mapred-site.xml in the same directory first.)
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>40960</value>
</property>
</configuration>
That completes the preparation.
Trying out Hadoop
Starting Hadoop
(On the very first run, format the NameNode with hadoop namenode -format before starting anything else.)
Earlier Hadoop versions were started with start-all.sh; that approach is now deprecated.
[ 503 ~ ]$start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
As the output suggests, we start HDFS and YARN in turn; after each step, run jps to see which daemons are up:
[ 505 ~ ]$start-dfs.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /log/path
localhost: starting datanode, logging to /log/path
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /log/path
[ 506 ~ ]$jps
27592 Jps
27310 NameNode
27519 SecondaryNameNode
27405 DataNode
[ 507 ~ ]$start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /log/path
localhost: starting nodemanager, logging to /log/path
[ 508 ~ ]$jps
27737 NodeManager
27777 Jps
27640 ResourceManager
27310 NameNode
27519 SecondaryNameNode
27405 DataNode
Hadoop is now running. Open localhost:50070 and localhost:8088 in a browser to see the HDFS and YARN web UIs respectively.
Now we can try running a job.
Testing a job
Run a simple built-in example job with the following commands:
[ 510 ~ ]$cd $HADOOP_HOME
[ 511 ~/Environment/hadoop-2.3.0 ]$hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar pi 10 5
# log omitted for brevity
Alternatively, first upload a file to HDFS:
[513 ~/Environment/hadoop-2.3.0 ]$hadoop fs -mkdir hdfs://localhost:9000/user/
[514 ~/Environment/hadoop-2.3.0 ]$hadoop fs -mkdir hdfs://localhost:9000/user/username
[517 ~/Environment/hadoop-2.3.0 ]$hadoop fs -copyFromLocal README.txt hdfs://localhost:9000/user/username/readme.txt
Then run a job that uses it:
[ 518 ~/Environment/hadoop-2.3.0 ]$hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar wordcount readme.txt out
* For more usage information, run hadoop without arguments to see the built-in help, or search online.
After both jobs finish, their status and results can be inspected in the YARN web UI.
Stopping Hadoop
Stopping Hadoop mirrors starting it: just replace the start-*.sh scripts with the corresponding stop-*.sh scripts.
[ 521 ~/Environment/hadoop-2.3.0 ]$stop-yarn.sh
stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
localhost: nodemanager did not stop gracefully after 5 seconds: killing with kill -9
no proxyserver to stop
[ 522 ~/Environment/hadoop-2.3.0 ]$stop-dfs.sh
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
Problems you may hit when running jobs
14/03/21 13:49:11 INFO mapreduce.Job: Job job_1395379328591_0005 failed with state FAILED due to: Application application_1395379328591_0005 failed 2 times due to AM Container for appattempt_1395379328591_0005_000002 exited with exitCode: 127 due to: Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException:
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
This happens because YARN resolves a different JAVA_HOME than the one the system uses. One fix is:
[ 583 ~/Environment/hadoop-2.3.0 ]$sudo ln -s /usr/bin/java /bin/java
Password:
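An alternative that avoids creating a system-wide symlink (a sketch, not taken from the original post) is to hard-code JAVA_HOME in Hadoop's own environment file, $HADOOP_HOME/etc/hadoop/hadoop-env.sh, so every daemon sees the same JDK:

```shell
# In $HADOOP_HOME/etc/hadoop/hadoop-env.sh, replace the default
#   export JAVA_HOME=${JAVA_HOME}
# line with an absolute path; on OS X, resolve it dynamically:
export JAVA_HOME=`/usr/libexec/java_home`
```

Restart the daemons afterwards for the change to take effect.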
Other problems
If the NameNode itself fails to start, it may never have been formatted; initialize it with:
hadoop namenode -format
(In Hadoop 2.x, hdfs namenode -format is the preferred form of this command.)
References
- Official quick-start guide: http://hadoop.apache.org/docs/r1.0.4/cn/quickstart.html
- "Hadoop namenode won't start" (the hadoop.tmp.dir fix)
- http://stackoverflow.com/questions/20390217/mapreduce-job-in-headless-environment-fails-n-times-due-to-am-container-exceptio