本文主要是我实践过程中遇到的bug集锦,主要包括win7下eclipse中连接局域网内的伪分布hadoop的项目的启动过程中和运行过程中遇到的bug的集合。
一、 Incompatible clusterIDs
今天启动Hadoop2.2.0集群后,发现datanode进程没启动,查看日志发现如下报错:
2014-05-15 14:46:50,788 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-2020521428-192.168.0.166-1397704506565 (storage id DS-432251277-192.168.0.166-50010-1397704557407) service to singlehadoop/192.168.0.166:8020
java.io.IOException:
Incompatible clusterIDs in /home/casliyang/hadoop2/hadoop-2.2.0/metadata/data:
namenode clusterID = CID-2cc69ada-3730-4c79-8384-c725fa85859a
;
datanode clusterID = CID-3e649eb6-cdb3-4a0c-aad8-5948c66bf282
at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:391)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:191)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:219)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:837)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:808)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
at java.lang.Thread.run(Thread.java:722)
上网查了下,有些文章说的解决办法是删掉数据文件,格式化,重启集群,但这办法实在太暴力,根本无法在生产环境实施,所以还是参考另一类文章的解决办法,修改clusterID:
step1:
查看hdfs-site.xml,找到存namenode元数据和datanode元数据的路径:
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/
/
/home/casliyang/hadoop2/hadoop-2.2.0/metadata/name
</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/
/
/
home/casliyang/hadoop2/hadoop-2.2.0/metadata/data
</value>
</property>
step2:
打开namenode路径下的current/VERSION文件:
casliyang@singlehadoop:~/hadoop2/hadoop-2.2.0/metadata/name/current$ cat VERSION
#Thu May 15 14:46:39 CST 2014
namespaceID=1252551786
clusterID=
CID-2cc69ada-3730-4c79-8384-c725fa85859a
cTime=0
storageType=NAME_NODE
blockpoolID=BP-2020521428-192.168.0.166-1397704506565
layoutVersion=-47
打开datanode路径下的current/VERSION文件:
casliyang@singlehadoop:~/hadoop2/hadoop-2.2.0/metadata/data/current$ cat VERSION
#Thu Apr 17 11:15:57 CST 2014
storageID=DS-432251277-192.168.0.166-50010-1397704557407
clusterID=
CID-3e649eb6-cdb3-4a0c-aad8-5948c66bf282
cTime=0
storageType=DATA_NODE
layoutVersion=-47
我们可以看到,name节点元数据的clusterID和data节点元数据的clusterID不一致了,并且和报错信息完全对应上!
接下来
将data节点的clusterID修改成和name节点的clusterID一致
,重启集群即可。
二、UnknownHostException
原因:hosts文件包含中文字符等造成hosts文件损坏 或 配置不当
注:host文件的修改不需要重启系统
Caused by: java.net.UnknownHostException: free-switch: Name or service not known
修改/etc/hosts文件,添加:127.0.0.1 free-switch
三、解除 "Name node is in safe mode"
原因:在分布式文件系统启动的时候,开始的时候会有安全模式,当分布式文件系统处于安全模式的情况下,文件系统中的内容不允许修改也不允许删除,直到安全模式结束。安全模式主要是为了系统启动的时候检查各个DataNode上数据块的有效性,同时根据策略必要的复制或者删除部分数据块。运行期通过命令也可以进入安全模式。在实践过程中,系统启动的时候去修改和删除文件也会有安全模式不允许修改的出错提示,只需要等待一会儿即可。
如此场景会造成namenode处于安全模式:在hadoop执行过程中使用了"ctrl+c"操作 ,再次使用hadoop时出现“Name node is in safe mode”提示。
解决办法:$
hdfs dfsadmin -safemode leave
四、 No FileSystem for scheme
启动dfs client时候报错:“
java.io.IOException: No FileSystem for scheme: hdfs
",是因为缺少hadoop-hdfs jar包,下面
两个都不能少
:
- <dependency>
- <groupId>org.apache.hadoop</groupId>
- <artifactId>hadoop-common</artifactId>
- <version>2.2.0</version>
- <exclusions>
- <exclusion>
- <artifactId>jdk.tools</artifactId>
- <groupId>jdk.tools</groupId>
- </exclusion>
- </exclusions>
- </dependency>
- <dependency>
- <groupId>org.apache.hadoop</groupId>
- <artifactId>hadoop-hdfs</artifactId>
- <version>2.2.0</version>
- <exclusions>
- <exclusion>
- <artifactId>jdk.tools</artifactId>
- <groupId>jdk.tools</groupId>
- </exclusion>
- </exclusions>
- </dependency>
五、 HDFS客户端的权限错误:Permission denied
跟踪代码进入 FileSystem.get-->CACHE.get()-->Key key = new Key(uri, conf);到这里的时候发现key值里面已经有kingbull了,所以关键肯定是在new key的过程。继续跟踪UserGroupInformation.getCurrentUser()-->getLoginUser()-->login.login()到这一步的时候发现用户名已经确定了,但是这个方法是Java的一个通用的安全认证包,关键代码为:
- if (!isSecurityEnabled() && (user == null)) {
- String envUser = System.getenv(HADOOP_USER_NAME);
- if (envUser == null) {
- envUser = System.getProperty(HADOOP_USER_NAME);
- }
- user = envUser == null ? null : new User(envUser);
- }
解决办法:在Eclipse的Debug Configuration中增加
环境变量:HADOOP_USER_NAME=kingbull
六、Cannot initialize Cluster
创建JobClient时候总出现这个错误:
- Exception in thread "main" java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
- at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:119)
- at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:81)
- at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:74)
- at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1188)
- at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1184)
- at java.security.AccessController.doPrivileged(Native Method)
- at javax.security.auth.Subject.doAs(Subject.java:396)
- at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
- at org.apache.hadoop.mapreduce.Job.connect(Job.java:1183)
- at org.apache.hadoop.mapreduce.Job.submit(Job.java:1212)
- at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1236)
这是因为 Cluster 中的 frameworkLoader 什么也没加载到:
- private static ServiceLoader<ClientProtocolProvider> frameworkLoader =
- ServiceLoader.load(ClientProtocolProvider.class);
于是把包含 ClientProtocolProvider 子类的两个包全部引入后便正常了。
- <groupId>org.apache.hadoop</groupId>
- <artifactId>hadoop-mapreduce-client-common</artifactId>
- <version>2.2.0</version>
- <groupId>org.apache.hadoop</groupId>
- <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
- <version>2.2.0</version>
七、 NullPointerexception shell.runcommand
- Exception in thread "main" java.lang.NullPointerException
- at java.lang.ProcessBuilder.start(Unknown Source)
- at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
原因:缺少winutils.exe文件,从360云盘上下载后,存放在$HADOOP_HOME/bin下面(记得增加
$HADOOP_HOME
环境变量)
八、 UnsatisfiedLinkError
- UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows
原因:缺少hadoop.dll,把这个文件拷贝到C:\Windows\System32下面即可。
九、Datanode denied communication with namenode:
Datanode denied communication with namenode: DatanodeRegistration(0.0.0.0, storageID=DS-2053126411-127.0.0.1-50010-1395699655564, infoPort=50075, ipcPort=50020, storageInfo=lv=-47;cid=CID-ba95b66c-d94b-4390-8d44-adb486ba5682;nsid=2073699254;c=0)
原因:RSA没有给局域网ip授权,造成data node没有正常启动。(参考:Permanently added 'free-switch' (RSA) to the list of known hosts.)
解决办法:/ets/hosts中增加:172.16.100.33 free-switch,然后重启,启动中选择:
- The authenticity of host 'free-switch (172.16.100.33)' can't be established.
- RSA key fingerprint is ea:61:cd:e3:cb:8b:fa:07:04:7b:92:1a:3f:f4:50:b6.
- Are you sure you want to continue connecting (yes/no)? y
- Please type 'yes' or 'no': yes