Summary of Installing Hadoop on Windows


For Mahout, I installed Hadoop on my PC today. Here is the installation guide; I hope it is useful :)
A very detailed illustrated installation tutorial: http://ebiquity.umbc.edu/Tutorials/Hadoop/00 - Intro.html
Required Software
1. Java 1.6.x
2. Cygwin: a Linux-like environment for Windows; it is required for shell support in addition to the software above.
3. SSH must be installed and sshd must be running in order to use the Hadoop scripts that manage remote Hadoop daemons.
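Once the pieces below are installed, a quick sanity check (a minimal sketch; the version strings will differ on your machine) is:

$ java -version        # should report a 1.6.x JDK
$ ssh -V               # confirms an SSH client is present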

Install Cygwin:
1. Download setup.exe from http://www.cygwin.com/
2. Select "Install from Internet", and specify the download folder and the install folder; for downloading, please select a nearby mirror.
3. After the component list has been downloaded, search for "SSH" (it is in the Net category) and change the default "Skip" to a version of OpenSSH.
4. Download and install the components.


After the installation you will see a Cygwin icon on your desktop; run it to get a bash shell, which provides a Linux-like environment inside Windows.
Your Linux file system lives under your Cygwin install folder (%Your_Cygwin_Install%).
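If you are unsure how a Cygwin path maps to a Windows path (or the other way around), the cygpath utility that ships with Cygwin does the conversion. A small sketch, assuming Cygwin is installed at C:\cygwin and "yourname" is a placeholder user:

$ cygpath -w /home/yourname            # prints C:\cygwin\home\yourname
$ cygpath -u 'C:\Program Files\Java'   # prints /cygdrive/c/Program Files/Java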

SSH Configuration:

1. Add system environment variables:
A. Add a new system environment variable named CYGWIN with the value 'ntsec tty'.
B. Edit the system environment variable PATH and add your 'Cygwin\bin' folder to it.
2. Configure SSH
A. Change to the bin folder: "cd /bin"
B. Execute the configuration command "ssh-host-config". When the "CYGWIN=" prompt comes up, enter "ntsec tty"; when asked whether privilege separation should be used, answer no. After this, the SSH service is installed as a Windows service; then please restart your computer.

C. Change to the home folder under your Cygwin install folder; you will see that a folder named after your Windows user account has been generated.
D. Execute the connect command: "ssh yourname@127.0.0.1"
If you connect successfully, your configuration is correct; it prints something like "Last login: Sun Jun 8 19:47:14 2008 from localhost".
If the connection fails, you may need to allow SSH through your firewall; the default SSH port is 22.
E. If you want to log in without being asked for a password, generate a key pair with an empty passphrase and append the public key to the authorized keys, using the following commands:
"ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa"
"$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys"
After that you should no longer be asked for a password when connecting through SSH; a quick check is shown below.
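A quick check of the passwordless setup (a minimal sketch; it assumes the service installed by ssh-host-config is named "sshd"):

$ net start sshd     # start the service if it is not already running
$ ssh localhost      # should now log you in without asking for a password
$ exit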
SSH setup supplement: if you are still prompted for a password even after setting an empty passphrase, you need to do the following.

Configure sshd

Enter the following commands on the Cygwin command line:

$ cd /etc
$ chmod 666 sshd_config
$ vi sshd_config

Modify the following settings in sshd_config:

PermitRootLogin no       # forbid root login
StrictModes yes          # security setting for CYGWIN=ntsec
RhostsRSAAuthentication no   # disable rhosts authentication
IgnoreRhosts yes         # disable rhosts authentication
PasswordAuthentication no    # disable password authentication
ChallengeResponseAuthentication no    # disable password authentication
PermitEmptyPasswords no     # forbid logins with empty passwords

Finally, change the permissions of sshd_config back to 644.

$ chmod 644 sshd_config
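For the changes to take effect, sshd has to be restarted. A minimal sketch, assuming ssh-host-config registered the service under the name "sshd":

$ cygrunsrv --stop sshd
$ cygrunsrv --start sshd
(or, from a Windows command prompt: net stop sshd and net start sshd)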

Hadoop Install and Configuration
1. Download the Hadoop ".tar.gz" file and extract it under your Cygwin file system; usr\local is suggested.
2. Configure hadoop-env.sh under the hadoop/conf folder:
export JAVA_HOME=<Your Java Location>  // putting Java under the Cygwin tree makes the location easier to specify

 export HADOOP_IDENT_STRING=MYHADOOP

Setting JAVA_HOME: the path here must not be written as a Windows path; it has to be in Unix form. For example, c:\Program Files\Java\jdk1.6.0 should be written as /cygdrive/c/Program Files/Java/jdk1.6.0. Also note that if the path contains a space, such as the one in Program Files, the part with the space should be wrapped in double quotes: /cygdrive/c/"Program Files"/Java/jdk1.6.0. In addition, remove the # in front of export JAVA_HOME; it is a comment marker, and the line will not be read otherwise. Even after editing hadoop-env.sh, starting Hadoop right away may still fail with an error like "/bin/java: No such file or directoryva/jdk1.6.0_10". This is a DOS vs. Unix text-encoding (line-ending) issue, and dos2unix can perform the DOS <=> UNIX text file conversion. So switch to the conf directory and convert hadoop-env.sh to Unix format: dos2unix hadoop-env.sh. After that, Hadoop starts normally. If you see "bin/hadoop: line 258: /cygdrive/c/Program: No such file or directory", the cause is unclear for the moment.
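Putting the pieces above together, a hadoop-env.sh sketch might look like this (the JDK path is only an example taken from the text above; adjust it to your own machine, and remember the dos2unix conversion):

# conf/hadoop-env.sh
export JAVA_HOME=/cygdrive/c/"Program Files"/Java/jdk1.6.0
export HADOOP_IDENT_STRING=MYHADOOP

$ cd conf
$ dos2unix hadoop-env.sh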

After the configuration, you can use the following commands to verify your installation:
cd /usr/local/hadoop
bin/hadoop version
It should print out:
Hadoop 0.17.0
Subversion http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r 656523
Compiled by hadoopqa on Thu May 15 07:22:55 UTC 2008
3. Hadoop can also be run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.
For the "Pseudo-Distributed Operation" mode, you need to make the following configurations:
A. in conf/core-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
B. in conf/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
C. in conf/mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
4. Execution
A. Format a new distributed-filesystem:
$ bin/hadoop namenode -format
B. Start the hadoop daemons:
$ bin/start-all.sh   (you can also start the daemons individually with $ bin/start-dfs.sh and $ bin/start-mapred.sh)
The hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to  ${HADOOP_HOME}/logs).
C. Browse the web interface for the NameNode and the JobTracker; by default they are available at:
* NameNode - http://localhost:50070/
* JobTracker - http://localhost:50030/
D. Copy the input files into the distributed filesystem:
$ bin/hadoop fs -put conf input
E. Run some of the examples provided:
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
F. Examine the output files:
Copy the output files from the distributed filesystem to the local filesystem and examine them:
$ bin/hadoop fs -get output output
$ cat output/*
   or
View the output files on the distributed filesystem:
$ bin/hadoop fs -cat output/*
G. When you're done, stop the daemons with:
$ bin/stop-all.sh



For the detailed installation process, refer to:

http://v-lad.org/Tutorials/Hadoop/00%20-%20Intro.html

Key points to note
1. Hadoop 0.203 has problems and throws errors; switch to 0.20.2.

2. When installing OpenSSH in Cygwin, see the two options in the figure (the exact appearance may differ): server, client and the runtime environment; do not forget the environment variables.

3. The pseudo-distributed configuration files differ a bit from the tutorial, mainly because after 0.20 the configuration was split into three files. Also, use ports 9100 and 9001 rather than 9000, which was reported as already in use (a sketch follows after this list of notes).

conf/hadoop-env.sh  export  JAVA_HOME=/cygdrive/c/Java/jdk1.6.0_05/

4. For MyEclipse 8, the plugin can simply be placed in the dropins folder; no extra steps are needed. Note that the plugin is somewhat incompatible with Eclipse 3.5: programs cannot be "run on hadoop", and it complains that a class cannot be loaded. Searching turns up a patch, but the plugin has to be recompiled with ant; cloudria has a precompiled one, which can be downloaded and used, and it works.

For the plugin's Hadoop location, just use ports 9100 and 9101.
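As a concrete illustration of note 3, a minimal sketch of the two relevant properties with the alternative ports (9100 for HDFS; the notes above mention both 9001 and 9101 for the JobTracker, so use whichever port you also enter in the plugin's location):

in conf/core-site.xml:
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9100</value>
</property>
in conf/mapred-site.xml:
<property>
<name>mapred.job.tracker</name>
<value>localhost:9101</value>
</property>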

 

Supplement: the JAVA_HOME issue

The problem of a JDK path containing spaces: JAVA_HOME must be set in hadoop-env.sh so that Hadoop can find Java. Because Cygwin merely simulates a Linux environment on the Windows platform, what is actually used is the Java installed in a Windows directory. On Windows the JDK is habitually installed under Program Files, but that is a path containing a space, which causes problems later on. Some people say that quoting the path, or escaping the space with a backslash, solves the space-in-path problem, but in practice this did not seem to work. The safest solution is therefore to reinstall the JDK into a path without spaces. The official site also describes another approach, creating a link file; the original text is at http://hbase.apache.org/docs/current/cygwin.html

One important thing to remember in shell scripting in general (i.e. *nix and Windows) is that managing, manipulating and assembling path names that contains spaces can be very hard, due to the need to escape and quote those characters and strings. So we try to stay away from spaces in path names. *nix environments can help us out here very easily by using symbolic links.

Create a link in /usr/local to the Java home directory by using the following command, substituting the name of your chosen Java environment:
ln -s /cygdrive/c/Program\ Files/Java/<jre name> /usr/local/<jre name>
Test your Java installation by changing to your Java folder with cd /usr/local/<jre name> and issuing the command ./bin/java -version. This should output the version of the chosen JRE.

How to refer to a Windows drive and its directories under Cygwin: because the JDK is actually installed under Windows, JAVA_HOME should contain that path, and a Windows path is expressed in Unix style under Cygwin as /cygdrive/<drive letter>. So add to hadoop-env.sh: export JAVA_HOME=/cygdrive/e/Java/jdk1.6.0_21 (assuming the JDK is installed in the Java directory on drive E). (Note: this environment variable is only used by Hadoop itself; echo $JAVA_HOME in Cygwin will not show it.)

The problem of formatting the HDFS filesystem with Hadoop: on Linux, if core-site.xml contains the following:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data0/pwzfb/hadooptmp</value>
  <description>A base for other temporary directories.</description>
</property>
then, if the format succeeds, new folders (such as dfs) are created under /data0/pwzfb/hadooptmp. In Cygwin, however, after running hadoop namenode -format you will not see any files under /data0/pwzfb/hadooptmp. The reason is that Hadoop and Cygwin map / to different locations: Cygwin treats / as the c:\cygwin directory (if Cygwin is installed at the root of C:), while Hadoop maps / to c:\. So after a successful format you should look under c:\ to confirm. (This tripped me up badly: I kept assuming the format had failed and spent a long time on it, until I happened to notice the data0 directory under c:\; at that point I thought the configuration file was wrong and wasted even more time.)
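A quick way to see this mapping difference from inside Cygwin (a sketch using the hadoop.tmp.dir value from the example above):

$ ls /data0/pwzfb/hadooptmp              # Cygwin resolves / to c:\cygwin, so this path does not exist there
$ ls /cygdrive/c/data0/pwzfb/hadooptmp   # Hadoop resolved / to c:\, so the dfs folder shows up here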
The problem of incomplete Hadoop process information: after a successful format, configure the path of the Hadoop scripts in .bash_profile and run start-all.sh to start Hadoop. Checking with jps out of habit (ps -ef also works), five Java processes do appear, but the output looks like this:
$ jps
1048 NameNode
3540 -- process information unavailable
5344 JobTracker
5372 Jps
3076 -- process information unavailable
7936 -- process information unavailable
The datanode, secondarynamenode and tasktracker processes all show "process information unavailable". I suspect this is a jps bug; as long as the Hadoop wordcount program runs correctly, the message can be ignored (a sketch of that check follows below). (At the time I fell over this problem yet again: I could create a directory with mkdir, but could not upload a single file with put, and with the "process information unavailable" messages on top I became even more convinced Hadoop had not started properly. Later I found that put works for a folder that contains a file, and wordcount then ran successfully, so I stopped worrying about "process information unavailable". The next day, after a reboot, uploading a single file also worked, which was puzzling; since the problem could not be reproduced, I could not trace the source code to confirm it.) (Note: if you connect to the Cygwin environment with another client such as PuTTY and start Hadoop there, the process names are recognized correctly, so this may also be a Cygwin issue; just ignore it.)
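The wordcount check mentioned above, as a minimal sketch (it assumes the conf files were already uploaded to an HDFS directory named input, as in the examples earlier; wc-out is just an arbitrary output directory name):

$ bin/hadoop jar hadoop-*-examples.jar wordcount input wc-out
$ bin/hadoop fs -cat wc-out/*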
The problem of Cygwin not logging out properly: just remember to shut down Hadoop before issuing the exit command.

 

Errors from put

This error message means that it wanted to store a file but not a single node was available to accept it, so we need to check:

whether the system or HDFS still has free space (that was my case); whether the number of datanodes is normal; whether HDFS is in safe mode; and the read/write permissions. If all of these check out, the only thing left is to wipe everything and start over.

PS: I went through the checks above. 1) There is enough system space (check with df -hl). 2) The datanode count is 2; checking the datanode processes with jps shows they are all started. 3) Is it in safe mode? After running hadoop dfsadmin -safemode leave, copying worked normally. Perhaps the earlier steps together with this one did the trick.
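Gathered into one place, those checks look roughly like this (a sketch; the safemode subcommand of dfsadmin is part of the stock bin/hadoop script):

$ df -hl                               # 1) local disk space
$ jps                                  # 2) are the DataNode processes up?
$ bin/hadoop dfsadmin -safemode get    # 3) is HDFS in safe mode?
$ bin/hadoop dfsadmin -safemode leave  #    leave safe mode if it is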

On the problem of the datanode failing to start

When experimenting with the "pseudo-distributed operation" mode on a single machine, the first run produced correct output.

On running it again, a problem appeared: ps -ef showed only 4 extra processes, whereas there should normally be 5 (namenode, secondarynamenode, datanode, jobtracker, tasktracker). Checking cat /tmp/*.pid showed that the datanode process had not started; bin/stop-all.sh also clearly printed "no datanode to stop".

I repeated this several times with the same result. Looking at logs/*-datanode.log under the Hadoop directory:

2008-10-19 16:39:53,546 ERROR org.apache.hadoop.dfs.DataNode: java.io.IOException: Incompatible namespaceIDs in C:\tmp\hadoop-SYSTEM\dfs\data: namenode namespaceID = 26465944; datanode namespaceID = 453380336
at org.apache.hadoop.dfs.DataStorage.doTransition(DataStorage.java:226)
at org.apache.hadoop.dfs.DataStorage.recoverTransitionRead(DataStorage.java:141)
at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:273)
at org.apache.hadoop.dfs.DataNode.<init>(DataNode.java:190)
at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:2987)
at org.apache.hadoop.dfs.DataNode.instantiateDataNode(DataNode.java:2942)
at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:2950)
at org.apache.hadoop.dfs.DataNode.main(DataNode.java:3072)
It can be seen that every time bin/hadoop namenode -format is executed, a new namespaceID is generated for the namenode, but the datanode data under the tmp folder still keeps the previous namespaceID. At startup the mismatched namespaceIDs prevent the datanode from starting. So simply delete the "temporary folder" before each bin/hadoop namenode -format and the datanode will start successfully (a cleanup sketch follows below).
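A minimal cleanup sketch, assuming the data directory is the one shown in the log above (C:\tmp\hadoop-SYSTEM\dfs\data, which is /cygdrive/c/tmp/hadoop-SYSTEM under Cygwin); note that deleting it throws away any data already stored in HDFS:

$ bin/stop-all.sh
$ rm -rf /cygdrive/c/tmp/hadoop-SYSTEM
$ bin/hadoop namenode -format
$ bin/start-all.sh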

