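In the Cloudera Manager (CM) console, two services were showing errors.
Finding the problem
- First things first: restart
This is my local cluster, mind you. If yours is a production cluster, please skip this step…
Result: the problem is still there.
- Look at the health details
We know that Hive depends on HDFS, so the Hive problem may well be caused by HDFS.
HDFS reports two messages:
HDFS Canary: this test reads and writes HDFS to verify that HDFS is available.
The NameNode is currently in safe mode. So the second problem is what makes the canary check fail, and we need to look into the NameNode's safe mode.
Query it from the command line: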
hdfs dfsadmin -safemode get
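If you want to use this check in a script, the status line can be reduced to a single word. The following is a minimal sketch, assuming the command prints a line containing "Safe mode is ON" or "Safe mode is OFF"; the function name `safemode_state` is my own invention, not part of Hadoop.

```shell
# Hypothetical helper (not part of Hadoop): reads the text printed by
# `hdfs dfsadmin -safemode get` on stdin and prints ON, OFF, or UNKNOWN.
# If several NameNodes report and any of them is ON, the result is ON.
safemode_state() {
  status_text=$(cat)
  case "$status_text" in
    *"Safe mode is ON"*)  echo ON ;;
    *"Safe mode is OFF"*) echo OFF ;;
    *)                    echo UNKNOWN ;;
  esac
}

# Usage against a live cluster (not run here):
#   hdfs dfsadmin -safemode get | safemode_state
```

Keeping the parsing separate from the `hdfs` call means the helper can be tried on saved output without a cluster.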
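Thinking it through
You probably know how HDFS starts up:
1. When the services start, each DataNode reports its block information to the NameNode, which starts out in safe mode. Once the proportion of blocks that meet the minimum replica count (dfs.namenode.replication.min) reaches the configured threshold (dfs.namenode.safemode.threshold-pct), the NameNode leaves safe mode and then re-replicates any blocks that are short of replicas.
Since we are stuck in safe mode, the block counts must not add up. Is something wrong with my cluster?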
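The threshold rule can be sketched as a small calculation. This is a rough approximation of the NameNode's check, not its actual implementation: it assumes the default threshold of 0.999 for dfs.namenode.safemode.threshold-pct, and `safemode_can_exit` is a hypothetical helper name.

```shell
# Hypothetical helper: compares the fraction of minimally replicated blocks
# with the safe mode threshold (assumed default: 0.999).
# Arguments: <minimally_replicated_blocks> <total_blocks> [threshold]
safemode_can_exit() {
  awk -v ok="$1" -v total="$2" -v pct="${3:-0.999}" 'BEGIN {
    if (total == 0) { print "yes"; exit }   # no blocks, nothing to wait for
    if (ok / total >= pct) { print "yes" } else { print "no" }
  }'
}

# 69 of 71 blocks are the counts that fsck reports later in this post.
safemode_can_exit 69 71
```

With 69 of 71 blocks minimally replicated the ratio is about 0.972, which is below 0.999, so under these assumptions the NameNode would stay in safe mode.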
I went to check the host status.
The nodes are all healthy. There is another command that can verify whether the blocks are OK:
hadoop fsck /
Diagnosis and fix
[gugu@slave1 ~]$ hadoop fsck /
WARNING: Use of this script to execute fsck is deprecated.
WARNING: Attempting to execute replacement "hdfs fsck" instead.
Connecting to namenode via http://master.com:9870/fsck?ugi=gugu&path=%2F
FSCK started by gugu (auth:SIMPLE) from /192.168.2.101 for path / at Wed Dec 09 22:07:50 CST 2020
/tmp/logs/gugu/logs/application_1607435161095_0001/slave1.com_8041: MISSING 1 blocks of total size 37242 B.
/tmp/logs/gugu/logs/application_1607435161095_0001/slave2.com_8041: CORRUPT blockpool BP-572917739-192.168.2.100-1601750079728 block blk_1073745376
/tmp/logs/gugu/logs/application_1607435161095_0001/slave2.com_8041: CORRUPT 1 blocks of total size 12697 B.
Status: CORRUPT
Number of data-nodes: 3
Number of racks: 1
Total dirs: 325
Total symlinks: 0
Replicated Blocks:
Total size: 4994301 B
Total files: 76 (Files currently being written: 3)
Total blocks (validated): 71 (avg. block size 70342 B) (Total open file blocks (not validated) : 3)
********************************
UNDER MIN REPL'D BLOCKS: 2 (2.8169014 %)
dfs.namenode.replication.min: 1
CORRUPT FILES: 2
MISSING BLOCKS: 1
MISSING SIZE: 37242 B
CORRUPT BLOCKS: 1
CORRUPT SIZE: 12697 B
********************************
Minimally replicated blocks: 69 (97.1831 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 0.97183096
Missing blocks: 1
Corrupt blocks: 1
Missing replicas: 0 (0.0 %)
Blocks queued for replication: 0
Erasure Coded Block Groups:
Total size: 0 B
Total files: 0
Total block groups (validated): 0
Minimally erasure-coded block groups: 0
Over-erasure-coded block groups: 0
Under-erasure-coded block groups: 0
Unsatisfactory placement block groups: 0
Average block group size: 0.0
Missing block groups: 0
Corrupt block groups: 0
Missing internal blocks: 0
Blocks queued for replication: 0
FSCK ended at Wed Dec 09 22:07:50 CST 2020 in 16 milliseconds
The filesystem under path '/' is CORRUPT
We can see:
The filesystem under path '/' is CORRUPT
This means the root path failed the check. The lines near the top of the fsck output show where the trouble is:
FSCK started by gugu (auth:SIMPLE) from /192.168.2.101 for path / at Wed Dec 09 22:07:50 CST 2020
/tmp/logs/gugu/logs/application_1607435161095_0001/slave1.com_8041: MISSING 1 blocks of total size 37242 B.
/tmp/logs/gugu/logs/application_1607435161095_0001/slave2.com_8041: CORRUPT blockpool BP-572917739-192.168.2.100-1601750079728 block blk_1073745376
/tmp/logs/gugu/logs/application_1607435161095_0001/slave2.com_8041: CORRUPT 1 blocks of total size 12697 B.
These are the files at fault.
On closer inspection, they are my Spark log files.
So why not just delete them, hehe? (For reference only.)
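On a bigger report it is tedious to pick the affected files out by eye. Here is a minimal sketch that filters them from the fsck text, assuming the per-file lines look like the ones above (a path, a colon, then MISSING or CORRUPT); `fsck_bad_files` is a hypothetical name of mine.

```shell
# Hypothetical helper: reads `hdfs fsck /` output on stdin and prints each
# file path flagged as MISSING or CORRUPT, one per line, without duplicates.
fsck_bad_files() {
  grep -E '^/.*: (MISSING|CORRUPT) ' \
    | sed -E 's/: (MISSING|CORRUPT) .*$//' \
    | sort -u
}

# Usage against a live cluster (not run here):
#   hdfs fsck / | fsck_bad_files
```

fsck also has a `-list-corruptfileblocks` option that lists missing blocks and the files they belong to, which may be the simpler route if its output format suits you.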
Solution 1
Repair the files
1. Leave safe mode
hdfs dfsadmin -safemode leave
2. Try to repair the files
hdfs debug recoverLease -path /hbase/oldWALs/slave1.com%2C16020%2C1608213311522.meta.1608216920810.meta -retries 2
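Notes: -path is the path of the file to repair
-retries is the number of attempts
Once all the files have been repaired, check the status again and restart: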
hdfs fsck /
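When there are several files to repair, typing the command for each one gets old fast. A minimal sketch, assuming you already have the paths one per line: it only prints the recoverLease commands so they can be reviewed first. `print_recover_commands` is a hypothetical helper, and paths that contain a single quote are not handled.

```shell
# Hypothetical helper: reads HDFS paths on stdin (one per line) and prints a
# recoverLease command for each, so the list can be reviewed before running.
# Argument: number of retries (defaults to 2, as in the command above).
print_recover_commands() {
  retries="${1:-2}"
  while IFS= read -r path; do
    [ -n "$path" ] || continue          # skip blank lines
    printf "hdfs debug recoverLease -path '%s' -retries %s\n" "$path" "$retries"
  done
}

# Usage (not run here): review the printed commands, then pipe them to sh.
#   print_recover_commands 2 < bad_files.txt
```

Printing instead of executing is a deliberate choice here: with the NameNode forced out of safe mode, I would rather read each command once before it touches the cluster.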
Solution 2
First we have to leave safe mode; otherwise we cannot upload, update, or delete files.
hdfs dfsadmin -safemode leave
So off I went to happily delete my files (-rmr is deprecated, so this uses -rm -r):
hdfs dfs -rm -r /tmp/logs/gugu/logs/application_1607435161095_0001
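Restart the services.
Huh, why is there still an error?
Let me open it and take a look.
Let me check the HDFS UI as well.
OK, I see.
The deleted files went into the trash.
So we empty the trash (after leaving safe mode first), or we skip the trash when deleting.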
hdfs dfsadmin -safemode leave
hdfs dfs -expunge
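This time I remembered: after cleaning up, run the check again.
Then restart the services. It works now, hahaha.
To skip the trash when deleting:
hdfs dfs -rm -r -f -skipTrash /tmp/logs/gugu/logs/application_1607435161095_0001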
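Reflections
The NameNode never left safe mode on its own because some of our files were damaged. Then I remembered that I had set the HDFS replication factor to 1 to save resources. With a single replica, losing one copy of a block means losing the block. I had better restore it to the default of 3, which also lowers the chance of this kind of problem. My virtual machines do have image backups, but any backup is a last line of defense, and we generally would not resort to it.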
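To confirm what the cluster is currently using, the replication factor can be read straight from the fsck report. A minimal sketch, assuming the report contains a "Default replication factor" line like the one shown earlier; `fsck_default_replication` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `hdfs fsck /` output on stdin and prints the
# value on the "Default replication factor" line.
fsck_default_replication() {
  awk '/Default replication factor:/ { print $NF; exit }'
}

# Usage against a live cluster (not run here):
#   hdfs fsck / | fsck_default_replication
#
# Raising the factor on files that already exist (not run here; -w waits
# for replication to finish and can take a long time on a large tree):
#   hdfs dfs -setrep -w 3 /
```

Note that `-setrep` only affects files that already exist. New files follow the dfs.replication setting, which I would change in the HDFS configuration as well.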