1. Description
A cluster can report missing blocks for many reasons, and the Cloudera Manager (CM) web UI will usually raise an alert when it happens. How do we fix this situation?
2. Repair
First, detect the corrupted blocks.
The fsck command checks for the various kinds of block inconsistency:
hdfs fsck <path>
[-list-corruptfileblocks |
[-move | -delete | -openforwrite]
[-files [-blocks [-locations | -racks | -replicaDetails | -upgradedomains]]]
[-includeSnapshots] [-showprogress]
[-storagepolicies] [-maintenance]
[-blockId <blk_Id>]
sudo -u hdfs hdfs fsck / | grep 'Under replicated' | awk -F':' '{print $1}' > test.log
Run the command in the background; the output file contains a summary of the missing blocks together with the path and name of each affected file.
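The filtering step above can be sketched offline. The fsck report below is a simulated three-line sample with hypothetical paths; on a real cluster you would pipe `hdfs fsck /` into the same filter instead:

```shell
#!/bin/sh
# Sketch: pull just the affected file paths out of an fsck report.
# fsck_output is a simulated sample (hypothetical paths); on a real cluster,
# replace it with the output of:  hdfs fsck /
fsck_output='/data/file1: CORRUPT blockpool BP-1:blk_1073741825
/data/file2: MISSING 1 blocks of total size 134217728 B
/data/file3: Under replicated BP-1:blk_1073741826. Target Replicas is 3 but found 2 replica(s).'

# Keep only lines flagging a problem, then print the path before ": ".
bad_files=$(printf '%s\n' "$fsck_output" \
  | awk -F': ' '/CORRUPT|MISSING|Under replicated/ {print $1}' \
  | sort -u)
printf '%s\n' "$bad_files"
```

The same filter works whichever problem keyword you grep for; `sort -u` deduplicates files that appear more than once in the report.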
If the file is not important, delete the corrupted file directly:
hdfs fsck -delete /tmp/hadoop-yarn/staging/yebowen/.staging/job_1537174906503_876513/job.xml
If a backup exists, you can also delete the whole file and copy a fresh replica back into the cluster:
hdfs dfs -rm -r /tmp/hadoop-yarn/staging/yebowen/.staging/job_1537174906503_876513/job.xml
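The restore-from-backup step can be sketched as below. The backup location `./backup/job.xml` and the target path `/user/demo/job.xml` are hypothetical; the commands are only recorded into a variable here so the sketch runs anywhere, and on a real cluster you would execute them directly:

```shell
#!/bin/sh
# Sketch: replace a damaged HDFS file with a local backup.
# ./backup/job.xml and /user/demo/job.xml are hypothetical paths.
# Commands are recorded rather than executed so this runs without a cluster;
# on a real cluster, run them directly instead of via record().
CMDS=""
record() { CMDS="$CMDS$*; "; }

record hdfs dfs -rm -r /user/demo/job.xml                  # drop the damaged copy
record hdfs dfs -put ./backup/job.xml /user/demo/job.xml   # upload the backup
printf '%s\n' "$CMDS"
```

After the upload, re-running fsck on the path is a cheap way to confirm the file is healthy again.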
3. If the file is important
Repair the damaged file by recovering its lease. The command is:
hdfs debug recoverLease -path <path> -retries <num-retries>
e.g.:
hdfs debug recoverLease -path /tmp/hadoop-yarn/staging/yebowen/.staging/job_1537174906503_876513/job.xml -retries 10
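When many files are affected, the recovery command above can be looped over a file list. Here `corrupt.list` and its paths are hypothetical (in practice you might build the list from `hdfs fsck / -list-corruptfileblocks`), and `DRY_RUN=1` only prints each command instead of calling the cluster:

```shell
#!/bin/sh
# Sketch: attempt lease recovery for every path listed in corrupt.list.
# The list contents are hypothetical example paths; on a real cluster,
# generate the list from fsck and unset DRY_RUN to actually run the commands.
DRY_RUN=1
printf '%s\n' /demo/a.xml /demo/b.xml > corrupt.list

recovered=""
while read -r path; do
  cmd="hdfs debug recoverLease -path $path -retries 10"
  if [ -n "$DRY_RUN" ]; then
    echo "$cmd"          # dry run: show what would be executed
  else
    $cmd                 # real run: call the HDFS client
  fi
  recovered="$recovered $path"
done < corrupt.list
```

The redirection form (`done < corrupt.list`) keeps the loop in the current shell, so `recovered` still holds the processed paths afterwards.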