Table of Contents
1. All 3 nodes of the Hadoop cluster start normally, but the namenode and datanodes cannot communicate
2. A MapReduce job will not start: the namenode is in safe mode
3. A MapReduce job's map phase succeeds, but the reduce phase fails with insufficient virtual memory
4. A MapReduce job's map phase completes normally, but reduce fails with a shuffle fetch error
1. All 3 nodes of the Hadoop cluster start normally, but the namenode and datanodes cannot communicate
hdfs dfs -put xxxx.txt /user/root
Error message (truncated): 0 datanode ... 2 datanode(s) running but excluded ...
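Before digging into logs, it helps to see the cluster from the namenode's point of view. If the report below shows 0 live datanodes even though the DataNode processes are running, registration between the namenode and the datanodes is broken (firewall or block pool ID mismatch, as diagnosed next):

hdfs dfsadmin -report   # run on the namenode; lists live/dead datanodes and capacity
jps                     # run on each node; confirms the DataNode process itself is up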
1. Check the firewall
2. Check the datanode startup log to confirm whether the block pool ID is consistent with the namenode's
If it is not, delete the file system on the datanodes and reformat the namenode.
Location of the startup log: /opt/hadoop-2.8.5/logs/hadoop-root-datanode-hadoop02.log
cat /opt/hadoop-2.8.5/logs/hadoop-root-datanode-hadoop02.log
Look at the most recent error messages and fix accordingly.
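When the IDs have diverged (typically because the namenode was reformatted while the datanodes kept their old data directories), the datanode log usually contains a line mentioning incompatible cluster IDs. A quick way to look for it:

grep -i "Incompatible clusterIDs" /opt/hadoop-2.8.5/logs/hadoop-root-datanode-hadoop02.log
tail -f /opt/hadoop-2.8.5/logs/hadoop-root-datanode-hadoop02.log   # or watch the log live while restarting the datanode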
Solutions:
If the firewall is not closed:
Close the firewall. It should be closed on every node; run the following commands:
systemctl stop firewalld
systemctl disable firewalld
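A quick sketch to confirm the firewall really is down on all three nodes; the hadoop01-hadoop03 hostnames and passwordless ssh as root are assumptions based on this cluster's setup:

# "inactive" is the desired answer on every node
for h in hadoop01 hadoop02 hadoop03; do
  echo -n "$h: "; ssh root@$h "systemctl is-active firewalld"
done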
For the second issue (inconsistent block pool ID):
Step 1: delete the data blocks on every datanode (the data directory is /opt/hadoop-2.8.5/tmp/dfs/data):
cd /opt/hadoop-2.8.5/tmp/dfs/data
rm -rf current/
Step 2: reformat the namenode:
hdfs namenode -format
I have not run into this problem myself; it happened to people around me, so it is recorded here for reference.
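Before deleting anything, the mismatch can be confirmed by comparing the clusterID recorded in the two VERSION files. The namenode path below assumes the default layout under this cluster's hadoop.tmp.dir (/opt/hadoop-2.8.5/tmp), which is an assumption on my part:

# On the namenode (path assumed from hadoop.tmp.dir)
grep clusterID /opt/hadoop-2.8.5/tmp/dfs/name/current/VERSION
# On each datanode; if the two IDs differ, the datanode cannot register
grep clusterID /opt/hadoop-2.8.5/tmp/dfs/data/current/VERSION

If keeping the existing HDFS data matters, an alternative to reformatting is to copy the namenode's clusterID into each datanode's VERSION file and restart the datanodes.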
2. A MapReduce job will not start: the namenode is in safe mode
While the cluster is starting up there is a brief safe-mode window during which the namenode does not accept jobs; wait a moment and it will exit on its own.
Or take it out of safe mode manually:
hdfs dfsadmin -safemode leave
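Before forcing the namenode out, it is worth checking whether safe mode is going to end by itself; if it persists, blocks are usually under-reported because datanodes are down:

hdfs dfsadmin -safemode get    # reports "Safe mode is ON" or "Safe mode is OFF"
hdfs dfsadmin -safemode wait   # blocks until the namenode leaves safe mode on its own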
3. A MapReduce job's map phase succeeds, but the reduce phase fails with insufficient virtual memory
The error came up when submitting the shuffle-test MapReduce job from the case in this post:
案例 3 Hadoop集群Shuffle加密_Siobhan_明鑫的博客-CSDN博客 (Case 3: Hadoop Cluster Shuffle Encryption, Siobhan_明鑫's blog on CSDN)
22/03/10 16:00:48 INFO mapreduce.Job: Task Id : attempt_1646898479823_0006_m_000010_1, Status : FAILED
Container [pid=3174,containerID=container_1646898479823_0006_01_000015] is running beyond virtual memory limits. Current usage: 70.9 MB of 1 GB physical memory used; 2.1 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1646898479823_0006_01_000015 :
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Cause: YARN's NodeManager enforces a virtual memory limit on each container. By default the limit is 2.1 times the physical memory (yarn.nodemanager.vmem-pmem-ratio defaults to 2.1), so the 1 GB container above gets a 1 GB x 2.1 = 2.1 GB virtual memory cap, which is exactly the "2.1 GB of 2.1 GB virtual memory used" in the log. The system checks virtual memory first and kills the container once the cap is exceeded.
Check the disk utilization (screenshot in the original post, omitted here).
Solution: stop Hadoop from enforcing the virtual memory check.
Edit yarn-site.xml and add the following.
Option 1: disable the virtual memory check entirely:
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
    <description>Whether virtual memory limits will be enforced for containers</description>
</property>
Option 2: raise the ratio of virtual memory to physical memory:
<property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>4</value>
    <description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
</property>
[root@hadoop01 hadoop]# scp yarn-site.xml root@hadoop02:/opt/hadoop-2.8.5/etc/hadoop/
[root@hadoop01 hadoop]# scp yarn-site.xml root@hadoop03:/opt/hadoop-2.8.5/etc/hadoop/
This copies the updated yarn-site.xml to the hadoop02 and hadoop03 nodes so all nodes share the same configuration.
Stop the services, restart them, and resubmit the job.
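A minimal restart sequence using the standard Hadoop 2.8.5 sbin scripts on hadoop01 (only YARN reads yarn-site.xml, so restarting YARN alone is enough; restarting HDFS as well is the conservative option):

cd /opt/hadoop-2.8.5/sbin
./stop-yarn.sh      # stops the ResourceManager and all NodeManagers
./start-yarn.sh     # starts them again, picking up the new yarn-site.xml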
4. A MapReduce job's map phase completes normally, but reduce fails with the following error:
fetcher#4 failed more times than the maximum number of attempts, so it is bailing-out (i.e., giving up).
Error message:
Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#2
        at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
        at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:370)
        at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:292)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.openShuffleUrl(Fetcher.java:282)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:321)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
1. To be clear, this is not an out-of-memory problem.
2. The nodes must be able to communicate; /etc/hosts and the hostnames were checked and are fine here (an unlikely cause, but worth ruling out).
3. After commenting out the shuffle-encryption settings in mapred-site.xml (wrapping them in <!-- -->), the MapReduce job ran normally. The root cause was a certificate misconfiguration: with encrypted shuffle enabled, the reducer fetches map output from the map nodes during the shuffle, and without a valid certificate the fetch fails, so no data can be retrieved.
Every node must have the SSL certificate configured before the encrypted shuffle can work; see the sketch below.
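If encrypted shuffle is to be kept on, the certificate material has to exist on every node. A minimal sketch following the same scp pattern used above for yarn-site.xml; ssl-server.xml and ssl-client.xml are the standard Hadoop files that point at the keystore/truststore, while the /opt/keystore/hadoop.jks path is a hypothetical location, not taken from the original setup:

# Distribute the SSL config and the (hypothetical) keystore to every worker node
for h in hadoop02 hadoop03; do
  scp /opt/hadoop-2.8.5/etc/hadoop/ssl-server.xml root@$h:/opt/hadoop-2.8.5/etc/hadoop/
  scp /opt/hadoop-2.8.5/etc/hadoop/ssl-client.xml root@$h:/opt/hadoop-2.8.5/etc/hadoop/
  scp /opt/keystore/hadoop.jks root@$h:/opt/keystore/   # hypothetical keystore path
done
# Sanity-check that the keystore is readable and contains the expected certificate
keytool -list -keystore /opt/keystore/hadoop.jks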