Operating system: Ubuntu 12.04
1. $ sudo apt-get install ssh (note: you will need to answer yes; this installs the openssh server and one other package whose name I forget)
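An optional sanity check that the SSH server is actually running after the install (on Ubuntu 12.04):
$ sudo service ssh status    # should report ssh start/running
$ ssh localhost              # asks for your password for now; passwordless login is configured in step 9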
2. The official docs also call for: $ sudo apt-get install rsync (this was already installed on my system)
3 Install Java (my own install method)
$chmod +x jdk-6u30-linux-i586.bin
$./jdk-6u30-linux-i586.bin
Locate the extracted jdk1.6.0_30 directory
$ sudo mv jdk1.6.0_30 /usr/java (if /usr/java does not exist, create it first)
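If /usr/java doesn't exist yet, creating it is just:
$ sudo mkdir -p /usr/java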
(At this point java -version will not print the version information; this does not affect Hadoop. If you want to set it up properly, see my other post http://www.cnblogs.com/xioyaozi/archive/2012/05/21/2511562.html)
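If you do want java -version to work from any shell, a minimal sketch (assuming the JDK ended up in /usr/java/jdk1.6.0_30 as above) is to export JAVA_HOME and PATH in ~/.bashrc:
$ echo 'export JAVA_HOME=/usr/java/jdk1.6.0_30' >> ~/.bashrc
$ echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc
$ source ~/.bashrc
$ java -version    # should now print java version "1.6.0_30"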
4 Unpack the Hadoop tarball
$ tar -zxvf hadoop-1.0.3-bin.tar.gz
5 After unpacking, go into the hadoop-1.0.3 directory and edit conf/hadoop-env.sh:
export JAVA_HOME=/usr/java/jdk1.6.0_30 (in a shell script a leading # marks a comment, so remove it from this line)
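For reference, in the stock conf/hadoop-env.sh this line ships commented out and points at an older JDK path (something like /usr/lib/j2sdk1.5-sun), so the edit is simply removing the leading # and substituting your own path:
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun     <- before
export JAVA_HOME=/usr/java/jdk1.6.0_30       <- after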
6 In the hadoop-1.0.3 directory, run $ bin/hadoop. If it runs successfully (prints the usage message), the setup is working.
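Another quick sanity check, assuming the stock 1.0.3 release:
$ bin/hadoop version    # prints the Hadoop version (1.0.3); an error about JAVA_HOME means hadoop-env.sh still needs fixing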
Now you are ready to start your Hadoop cluster in one of the three supported modes:
- Local (Standalone) Mode
- Pseudo-Distributed Mode
- Fully-Distributed Mode
7 Standalone mode (the English below is from the official docs; it's a quick way to test the install)
By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.
The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.
$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
$ cat output/*
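One gotcha when re-running the example: Hadoop refuses to start if the output directory already exists, so remove it between runs:
$ rm -rf output
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'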
8 Pseudo-Distributed Mode
Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.
Configuration
Use the following:
conf/core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
conf/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
conf/mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
9 Set up passwordless ssh login. Running $ ssh localhost will ask for a password, so configure it as follows: run $ ssh-keygen -t dsa and just press Enter at each prompt. The .ssh directory is created automatically (it is hidden, so you won't see it in a normal listing, which doesn't matter). Then $ cd .ssh and, inside ~/.ssh: $ cp id_dsa.pub authorized_keys. After that, $ ssh localhost logs in without a password. (An equivalent setup with stricter permissions is sketched after step 10 below.)
10 Fully-distributed mode I haven't configured yet. OK, that's it. I'm a newbie myself.
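As noted in step 9, an equivalent way to set this up (the form the official quickstart uses, plus tightening the file permissions, which helps when sshd refuses keys in a loosely-permissioned authorized_keys) is roughly:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa        # empty passphrase, default dsa key location
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys                # sshd may ignore the file if it is group/world writable
$ ssh localhost                                   # should now log in without prompting for a password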
Steps from the official documentation (the official steps are already good, so I won't write my own):
1 Format a new distributed filesystem: $ bin/hadoop namenode -format (formats the distributed filesystem)
2 Start the Hadoop daemons: $ bin/start-all.sh (run this; if you don't, later commands fail with errors about not being able to connect to the host). A quick check that the daemons are up is sketched after the URL list below.
The hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).
Browse the web interface for the NameNode and the JobTracker; by default they are available at:
- NameNode - http://localhost:50070/
- JobTracker - http://localhost:50030/
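To confirm that all the daemons actually started, the JDK's jps tool gives a quick check; for a pseudo-distributed setup I would expect to see these five processes:
$ jps
# expected (PIDs will differ): NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker, Jps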
3 Copy the input files into the distributed filesystem: $ bin/hadoop fs -put conf input (this uploads the local conf directory into HDFS as the input directory)
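To verify the upload landed in HDFS, listing it works like a normal filesystem:
$ bin/hadoop fs -ls input    # should list the xml and other files copied from the local conf directory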
4 Run some of the examples provided: $ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+' (this runs a Java Hadoop example program that searches for strings matching the given regular expression; the word-count example is $ bin/hadoop jar hadoop-examples-*.jar wordcount input output; the * stands for the version number, so just use Tab completion when typing the command).
Examine the output files:
5 Copy the output files from the distributed filesystem to the local filesystem and examine them (pull the results from HDFS to view locally): $ bin/hadoop fs -get output output then $ cat output/*
or
View the output files on the distributed filesystem (view them directly in HDFS): $ bin/hadoop fs -cat output/*
6 When you're done, stop the daemons with: $ bin/stop-all.sh