Hi everyone. I've recently become very interested in Hadoop, so today I spent some time setting up a development environment and wrote the process up as this article.
First, a quick look at Hadoop's three run modes:
Standalone mode
Standalone is Hadoop's default mode. When you first unpack the Hadoop package, Hadoop knows nothing about your hardware environment and conservatively falls back to a minimal configuration: all three XML configuration files are empty. With empty configuration files, Hadoop runs entirely on the local machine. Since there is no need to talk to other nodes, standalone mode uses the local filesystem instead of HDFS and starts none of the Hadoop daemons. It is mainly used for developing and debugging the application logic of MapReduce programs.
Pseudo-distributed mode
Pseudo-distributed mode runs Hadoop on a "single-node cluster": all of the daemons run on the same machine. On top of standalone mode it adds the ability to debug against real daemons, letting you examine memory usage, HDFS input and output, and the interaction between daemons.
Fully distributed mode
The Hadoop daemons run across a cluster of machines.
Environment: VMware 8.0 and Ubuntu 11.04
Step 1: Install the JDK and Hadoop
1.1 Download JDK 1.7
Note: be sure to download the 32-bit Linux build of JDK 1.7, not the 64-bit build (the Ubuntu guest here is 32-bit).
1.2 Download hadoop-0.20.2
http://labs.mop.com/apache-mirror/hadoop/common/hadoop-0.22.0/hadoop-0.22.0.tar.gz
(Note: this mirror link points at hadoop-0.22.0; make sure the tarball you actually fetch matches the 0.20.2 release used throughout this article.)
1.3 Unpack both archives into /home/tanglg1987
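Which JDK build you need in step 1.1 depends on the guest's architecture; you can check it from a shell before downloading:

```shell
# Print the machine hardware name of the running kernel:
#   i686 / i386 -> 32-bit, use the Linux x86 JDK
#   x86_64      -> 64-bit, use the Linux x64 JDK
uname -m
```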
Step 2: Set the environment variables
Switch to root and open /etc/profile in vim:
su root
vim /etc/profile
Append the following lines at the end of the file:
export JAVA_HOME=/home/tanglg1987/jdk1.7.0_07
export HADOOP_HOME=/home/tanglg1987/hadoop-0.20.2
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
export CLASSPATH=$JAVA_HOME/lib
Then reload the profile so the changes take effect:
source /etc/profile
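To confirm the new variables actually took effect, you can check that both bin directories landed on PATH; a small sketch, using the same install paths as above:

```shell
# Set the variables exactly as /etc/profile does in this walkthrough.
export JAVA_HOME=/home/tanglg1987/jdk1.7.0_07
export HADOOP_HOME=/home/tanglg1987/hadoop-0.20.2
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
export CLASSPATH=$JAVA_HOME/lib

# Verify each bin directory appears as a PATH component.
ok=1
case ":$PATH:" in *":$JAVA_HOME/bin:"*) ;; *) ok=0 ;; esac
case ":$PATH:" in *":$HADOOP_HOME/bin:"*) ;; *) ok=0 ;; esac
[ "$ok" -eq 1 ] && echo "PATH ok" || echo "PATH missing entries"
```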
Step 3: Verify the installation
java -version
java version "1.7.0_07"
Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
Java HotSpot(TM) Client VM (build 23.3-b01, mixed mode)
hadoop version
Hadoop 0.20.2
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707
Step 4: Configure pseudo-distributed mode
With only a single machine, pseudo-distributed mode is the only real option: the Hadoop daemons all run on the local machine, simulating a small cluster.
hadoop-env.sh configuration (the daemons pick up JAVA_HOME from conf/hadoop-env.sh, not from your login shell):
export JAVA_HOME=/home/tanglg1987/jdk1.7.0_07
core-site.xml configuration
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9100</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
hdfs-site.xml configuration
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
mapred-site.xml configuration
<property>
<name>mapred.job.tracker</name>
<value>localhost:9101</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
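Each of the three `<property>` blocks above belongs inside the `<configuration>` root element of its file (the files live in Hadoop's conf/ directory); Hadoop's configuration loader expects that root element. As an illustration, a complete minimal core-site.xml would look like:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9100</value>
  </property>
</configuration>
```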
Step 5: Configure SSH
1.1 Install ssh
sudo apt-get install ssh
1.2 Generate a new SSH key with an empty passphrase, to enable passwordless login:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
Generating public/private rsa key pair.
Your identification has been saved in /home/xiaoming/.ssh/id_rsa.
Your public key has been saved in /home/xiaoming/.ssh/id_rsa.pub.
The key fingerprint is:
19:41:d5:4a:97:04:7f:a8:3d:ee:fc:20:07:9f:33:47 xiaoming@ustc
The key's randomart image is:
+--[ RSA 2048]----+
| .o.o+.. |
| ...+. |
| .. oo . |
| o.o . |
| S o o E |
| + + |
| . O . |
| = = |
| o.. |
+-----------------+
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
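If you want to rehearse the key setup without touching your real ~/.ssh, the same steps work in a scratch directory (the directory below is a throwaway; the actual setup uses ~/.ssh as shown above):

```shell
# Work in a temporary directory instead of ~/.ssh.
KEYDIR=$(mktemp -d)

# Generate an RSA key pair with an empty passphrase (-P ''), quietly.
ssh-keygen -q -t rsa -P '' -f "$KEYDIR/id_rsa"

# Appending the public key to authorized_keys is what enables
# passwordless login for the account that owns the file.
cat "$KEYDIR/id_rsa.pub" >> "$KEYDIR/authorized_keys"

# sshd ignores authorized_keys files with loose permissions.
chmod 600 "$KEYDIR/authorized_keys"

ls "$KEYDIR"
```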
1.3 Test the passwordless login:
ssh localhost
Welcome to Ubuntu 11.04 (GNU/Linux 2.6.38-13-generic i686)
* Documentation: https://help.ubuntu.com/
Last login: Fri Apr 27 17:54:39 2012 from localhost
Step 6: Create a start.sh script in /home/tanglg1987. Every time the virtual machine boots, we wipe /tmp, clear the old logs, re-format the NameNode, and bring the daemons up. The script:
# wipe old HDFS data under /tmp so the re-formatted NameNode and the
# DataNode don't end up with mismatched namespace IDs
sudo rm -rf /tmp/*
# clear the old logs
rm -rf /home/tanglg1987/hadoop-0.20.2/logs
# re-format HDFS (formatting applies to the NameNode only; the DataNode
# initialises itself against the fresh namespace on first startup)
hadoop namenode -format
# start NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker
start-all.sh
# a freshly formatted NameNode comes up in safe mode; leave it so writes work
hadoop dfsadmin -safemode leave
Make it executable and run it:
chmod +x start.sh
./start.sh
Step 7: Check that everything is running
1. Check the log files under /home/tanglg1987/hadoop-0.20.2/logs
2. Check the HDFS report:
hadoop dfsadmin -report
Configured Capacity: 20079898624 (18.7 GB)
Present Capacity: 11551305743 (10.76 GB)
DFS Remaining: 11551281152 (10.76 GB)
DFS Used: 24591 (24.01 KB)
DFS Used%: 0%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)
Name: 127.0.0.1:50010
Decommission Status : Normal
Configured Capacity: 20079898624 (18.7 GB)
DFS Used: 24591 (24.01 KB)
Non DFS Used: 8528592881 (7.94 GB)
DFS Remaining: 11551281152(10.76 GB)
DFS Used%: 0%
DFS Remaining%: 57.53%
Last contact: Mon Oct 15 23:07:59 CST 2012
3. Check the web UIs: with Hadoop 0.20 defaults, the NameNode's web interface is at http://localhost:50070 and the JobTracker's at http://localhost:50030.
Conclusion: Hadoop is up and running on Ubuntu! I'm a little excited and can't wait to start doing some development against it and dig deeper into Hadoop's internals. Onward!
PS: standalone mode and pseudo-distributed mode are both meant for development and debugging. A real Hadoop cluster runs in the third mode, fully distributed. To be continued.