Hadoop Troubleshooting

Error: org.apache.hadoop.security.AccessControlException: Permission denied
Cause: the current user does not have permission.
Solution: add the following to hdfs-site.xml:
<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>
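
Note that dfs.permissions=false turns off permission checking for the whole cluster and needs a NameNode restart to take effect. A narrower alternative is to grant the user access to just the paths it needs; the user name and path below are only examples:
hdfs dfs -chown -R someuser /user/someuser
hdfs dfs -chmod -R 755 /user/someuser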

Error: org.apache.hadoop.hdfs.server.namenode.SafeModeException
Cause: the NameNode enters safe mode when it first starts up.
Solution: if this is the cause, simply wait for startup to finish and rerun the program.
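
If you would rather check or leave safe mode explicitly instead of waiting (forcing an exit before the NameNode has received enough block reports is risky), dfsadmin can do it:
hdfs dfsadmin -safemode get     # check whether safe mode is still on
hdfs dfsadmin -safemode leave   # force the NameNode out of safe mode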

Errors when accessing Hadoop on a remote server from Windows:

Problem:
INFO client.RMProxy: Connecting to ResourceManager at namenode:8032
Exception in thread "main" java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "namenode":8032; java.net.UnknownHostException;
Solution: change the value of yarn.resourcemanager.hostname in yarn-site.xml from the hostname to a reachable IP address. If Hadoop is deployed inside a NAT subnet, you also need to set up port mapping.
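
For reference, a minimal yarn-site.xml snippet for that fix; 192.168.1.100 is a placeholder for your ResourceManager's reachable IP:
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>192.168.1.100</value>
</property>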

Problem:
Exception in thread "main" org.apache.hadoop.security.AccessControlException: Permission denied: user=xxx, access=EXECUTE, inode="/tmp":hadoop:supergroup:drwx------
Solution: run hadoop fs -rm -R /tmp/hadoop-yarn and rerun the job.

Problem:
ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
Solution: download hadoop-windows-native-master.zip and copy the files under hadoop-windows-native-master\2.5.2\bin\VS2013\x64\bin into the bin directory of your local Hadoop installation.
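
The null in null\bin\winutils.exe usually means HADOOP_HOME / hadoop.home.dir is not set, so besides copying the binaries you may also need to point it at the local Hadoop directory; the path below is a placeholder, e.g. in a Windows console:
set HADOOP_HOME=D:\hadoop-2.5.2
set PATH=%PATH%;%HADOOP_HOME%\bin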

Problem:
INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/zhangchen/.staging/job_1457095941103_0036
Exception in thread "main" org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/hadoop-yarn/staging/zhangchen/.staging/job_1457095941103_0036/job.jar could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and 3 node(s) are excluded in this operation.
There are 0 datanode(s) running and no node(s) are excluded in this operation.
ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: RECEIVED SIGNAL 15: SIGTERM
INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
Solution: run stop-all.sh, delete everything under /home/hadoop/hd_space/hdfs/ on the namenode and on all datanodes, run hdfs namenode -format, then start-dfs.sh and start-yarn.sh.
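
A sketch of that reset sequence, assuming the scripts are run from $HADOOP_HOME and the storage path matches this cluster's hdfs-site.xml:
sbin/stop-all.sh
rm -rf /home/hadoop/hd_space/hdfs/*   # on the namenode and on every datanode
bin/hdfs namenode -format             # DANGEROUS: wipes all HDFS metadata
sbin/start-dfs.sh
sbin/start-yarn.sh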

Problem:
Failed to APPEND_FILE test.txt for DFSClient_NONMAPREDUCE_-656499820_1 on 127.0.0.1 because lease recovery is in progress.
Failed to APPEND_FILE test1.txt for DFSClient_NONMAPREDUCE_1211927326_1 on 127.0.0.1 because this file lease is currently owned by DFSClient_NONMAPREDUCE_-90637176_1 on 127.0.0.1
Solution: the lease hard limit expires after one hour, so either wait an hour or delete the file manually, and remember to close the output stream next time.
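
If waiting is not an option, recent Hadoop versions can also force lease recovery on the stuck file; the path below matches the first error above:
hdfs debug recoverLease -path /test.txt -retries 3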

Problem:
Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try.
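A commonly cited workaround for small clusters is to relax the HDFS client's replace-datanode policy in hdfs-site.xml; use it with care, since it trades away some write reliability:
<property>
    <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
    <value>NEVER</value>
</property>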

ssh localhost asks for a password on macOS
reason: ~/.ssh/config specifies an IdentityFile other than ~/.ssh/id_rsa
solution: add a line at the end of the config file: IdentityFile ~/.ssh/id_rsa
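
For example:
echo "IdentityFile ~/.ssh/id_rsa" >> ~/.ssh/config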

Directory /private/tmp/hadoop-zhangchen/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.
Reason: I followed the setup guide from Hadoop's official documentation, which does not tell you to set the name and data directories explicitly, so they default to the /tmp directory on my Mac. After a reboot, everything went wrong.
Solution:
add the following config to etc/hadoop/core-site.xml

<property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/share/data/hadoop</value>
</property>

re-format the namenode (DANGEROUS!!!)
bin/hdfs namenode -format
restart hdfs
sbin/start-dfs.sh

ssh: Could not resolve hostname 186.3.168.191.isp.timbrasil.com.br: nodename nor servname provided, or not known

solution: just fix your hostname and add it to /etc/hosts
e.g. (on macOS)
scutil --set HostName mbp-zc
then add 127.0.0.1 mbp-zc to /etc/hosts

WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
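
This warning is usually harmless, since Hadoop falls back to the built-in Java classes. To see which native libraries Hadoop can actually load on your platform:
hadoop checknative -a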

java.io.IOException: Incompatible clusterIDs in /private/tmp/hadoop-zhangchen/dfs/data: namenode clusterID = CID-92121909-a8e9-4802-8021-0b52ec8ba1d6; datanode clusterID = CID-c7c456df-8de1-44eb-aa0a-6da8fd5e7aa0
solution: make the datanode's clusterID match the namenode's, either by editing the VERSION file under the datanode data directory or by deleting the datanode data directory and restarting the datanode.
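
One way to do that, assuming the default name/data directories from the error above:
cat /private/tmp/hadoop-zhangchen/dfs/name/current/VERSION   # note the namenode's clusterID
vi  /private/tmp/hadoop-zhangchen/dfs/data/current/VERSION   # set clusterID to the same value, then restart the datanode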


Port 9000: connection refused
Stop the firewall on CentOS 7:
systemctl stop firewalld
systemctl disable firewalld
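
If you would rather not disable the firewall entirely, opening just the NameNode port should also work:
firewall-cmd --permanent --add-port=9000/tcp
firewall-cmd --reload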


hadoop fs -ls / reports an error

GSSException: no valid credentials provided (Mechanism level: Failed to find any kerberos tgt)

reason: Kerberos is a network authentication protocol. Hadoop uses Kerberos as the basis for strong authentication and identity propagation for both users and services.

solution:
run this command, replacing <keytab file> with your keytab file path and <principal> with your principal name:
kinit -k -t <keytab file> <principal>
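
Afterwards you can confirm that the ticket was obtained with:
klist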


Exception in thread "main" java.io.IOException: Incomplete HDFS URI, no host: hdfs:///user/ds/Wikipedia
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:143)

solution: if running the code in IntelliJ IDEA, just copy core-site.xml into the resources directory:
cp $HADOOP_HOME/etc/hadoop/core-site.xml .

which should include the defaultFS host config

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    ...
</configuration>
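
To check which value the client actually picks up, hdfs getconf can read it back:
hdfs getconf -confKey fs.defaultFS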

running sbin/start-yarn.sh reports an error:
localhost: ERROR: Cannot set priority of nodemanager process 16149

this error occurred after I added the following config to yarn-site.xml

<property>
    <name>yarn.nodemanager.resource-plugins</name>
    <value>yarn.io/gpu</value>
</property>

when I checked the nodemanager log, I found the root cause:

2021-06-15 10:15:58,500 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer: Failed to locate GPU device discovery binary, tried paths: [/bin/nvidia-smi, /usr/bin/nvidia-smi, /usr/local/nvidia/bin/nvidia-smi]! Please double check the value of config yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables. Using default binary: nvidia-smi
2021-06-15 10:15:58,501 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state INITED
org.apache.hadoop.yarn.exceptions.YarnException: Failed to find GPU discovery executable, please double check yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting. Also tried to find the executable in the default directories: [/usr/bin, /bin, /usr/local/nvidia/bin]
...
2021-06-15 10:15:58,502 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.yarn.exceptions.YarnException: Failed to find GPU discovery executable, please double check yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting. Also tried to find the executable in the default directories: [/usr/bin, /bin, /usr/local/nvidia/bin]

Because I was testing YARN's GPU scheduling on a laptop without an NVIDIA GPU, I had to remove the resource plugin config above to let YARN start up normally.

Another workaround is to cheat YARN with a mocked nvidia-smi output:

cat /usr/local/nvidia/bin/nvidia-smi

#!/bin/sh
sample_output=/Users/zhangchen/code/hadoop-rel-release-3.2.2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/resources/nvidia-smi-sample-output.xml
cat ${sample_output}
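
Presumably the mock also has to be executable so the discovery code can run it:
chmod +x /usr/local/nvidia/bin/nvidia-smi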

ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler: Failed to bootstrap configured resource subsystems!

Because YARN in non-Docker mode uses cgroups for resource isolation, and there are no cgroups on macOS, this has to be done on CentOS (or another Linux) instead.

