1. Introduction
①: hdfs fsck /path
Checks the health of the files under /path.
②: hdfs fsck /path -files -blocks -locations
Prints the location of each block (-locations); must be used together with -files and -blocks.
③: hdfs fsck /path -list-corruptfileblocks
Lists the corrupt blocks and the files they belong to (-list-corruptfileblocks).
④: hdfs fsck /path -delete
Deletes the corrupt files (the files on HDFS itself).
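The first three checks above can be wrapped in a small helper. A minimal dry-run sketch: the function only prints the command lines so you can review them before running (the hdfs CLI and the sample path are assumptions about your environment, and -delete is deliberately left out because it destroys data):

```shell
# Dry-run sketch: print the fsck invocations for a given path.
fsck_cmds() {
  p="$1"
  echo "hdfs fsck $p"                             # overall health
  echo "hdfs fsck $p -files -blocks -locations"   # per-block locations
  echo "hdfs fsck $p -list-corruptfileblocks"     # corrupt blocks only
}

fsck_cmds /blocktest
```

Piping the printed lines to sh would execute the checks for real.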
2. Practice
① Create a directory on HDFS, upload a test file, and check its health
[hadoop@hadoop001 ~]$ hdfs dfs -mkdir /blocktest
[hadoop@hadoop001 ~]$ hdfs dfs -put testHdfsFile.txt /blocktest/
[hadoop@hadoop001 ~]$ hdfs dfs -ls /blocktest/
Found 1 items
-rw-r--r-- 3 hadoop hadoop 60 2019-08-21 14:23 /blocktest/testHdfsFile.txt
[hadoop@hadoop001 ~]$ hdfs fsck /blocktest/
Connecting to namenode via http://hadoop001:50070/fsck?ugi=hadoop&path=%2Fblocktest
FSCK started by hadoop (auth:SIMPLE) from /172.19.252.139 for path /blocktest at Wed Aug 21 14:25:05 CST 2019
.Status: HEALTHY
Total size: 60 B
Total dirs: 1
Total files: 1
Total symlinks: 0
Total blocks (validated): 1 (avg. block size 60 B)
Minimally replicated blocks: 1 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 3
Number of racks: 1
FSCK ended at Wed Aug 21 14:25:05 CST 2019 in 1 milliseconds
The filesystem under path '/blocktest' is HEALTHY
[hadoop@hadoop001 ~]$
② Find the block's locations, then delete one block replica and its block metadata file
[root@hadoop001 subdir0]# hdfs fsck /blocktest/testHdfsFile.txt -files -blocks -locations
-bash: hdfs: command not found
[root@hadoop001 subdir0]# su - hadoop
Last login: Wed Aug 21 14:12:44 CST 2019 on pts/0
[hadoop@hadoop001 ~]$ hdfs fsck /blocktest/testHdfsFile.txt -files -blocks -locations
Connecting to namenode via http://hadoop001:50070/fsck?ugi=hadoop&files=1&blocks=1&locations=1&path=%2Fblocktest%2FtestHdfsFile.txt
FSCK started by hadoop (auth:SIMPLE) from /172.19.252.139 for path /blocktest/testHdfsFile.txt at Wed Aug 21 14:47:17 CST 2019
/blocktest/testHdfsFile.txt 60 bytes, 1 block(s): OK
0. BP-577895678-172.19.252.139-1566271200217:blk_1073741826_1002 len=60 Live_repl=3 [DatanodeInfoWithStorage[172.19.252.141:50010,DS-ffd3fa19-ddbb-4f5a-b487-d1ecb6a6d95b,DISK], DatanodeInfoWithStorage[172.19.252.140:50010,DS-ce5c4933-ca59-4955-bfcd-b1c6c0276f1f,DISK], DatanodeInfoWithStorage[172.19.252.139:50010,DS-afdf9c32-a7f5-4b9b-b9ff-32bf4ea876e2,DISK]]
Status: HEALTHY
Total size: 60 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 1 (avg. block size 60 B)
Minimally replicated blocks: 1 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 3
Number of racks: 1
FSCK ended at Wed Aug 21 14:47:17 CST 2019 in 1 milliseconds
The filesystem under path '/blocktest/testHdfsFile.txt' is HEALTHY
[hadoop@hadoop001 ~]$ logout
[root@hadoop001 subdir0]# find / -name "*blk_1073741826_1002*"
/home/hadoop/data/dfs/data/current/BP-577895678-172.19.252.139-1566271200217/current/finalized/subdir0/subdir0/blk_1073741826_1002.meta
[root@hadoop001 subdir0]# cd /home/hadoop/data/dfs/data/current/BP-577895678-172.19.252.139-1566271200217/current/finalized/subdir0/subdir0/
[root@hadoop001 subdir0]# ll
total 20
-rw-rw-r-- 1 hadoop hadoop 4233 Aug 20 12:32 blk_1073741825
-rw-rw-r-- 1 hadoop hadoop 43 Aug 20 12:32 blk_1073741825_1001.meta
-rw-rw-r-- 1 hadoop hadoop 60 Aug 21 14:23 blk_1073741826
-rw-rw-r-- 1 hadoop hadoop 11 Aug 21 14:23 blk_1073741826_1002.meta
# delete the block file and its meta file
[root@hadoop001 subdir0]# rm -rf blk_1073741826*
[root@hadoop001 subdir0]# ll
total 12
-rw-rw-r-- 1 hadoop hadoop 4233 Aug 20 12:32 blk_1073741825
-rw-rw-r-- 1 hadoop hadoop 43 Aug 20 12:32 blk_1073741825_1001.meta
[root@hadoop001 subdir0]#
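Pulling the block file name out of the fsck -files -blocks -locations output can be scripted instead of read by eye. A minimal sketch, using the location line from the transcript above (shortened) as sample input; the field positions are an assumption about this fsck output format:

```shell
# Sample line copied (shortened) from the fsck output above.
line='0. BP-577895678-172.19.252.139-1566271200217:blk_1073741826_1002 len=60 Live_repl=3'

# Field 2 is "BP-...:blk_...": take the part after the colon,
# then strip the generation stamp (_1002) to get the block file name.
blk_gs=$(echo "$line" | awk '{print $2}' | awk -F: '{print $2}')
blk=${blk_gs%_*}
echo "$blk"    # blk_1073741826

# On a DataNode, the replica and its meta file can then be located with:
# find /home/hadoop/data/dfs/data -name "${blk}*"
```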
③ Restart HDFS so the corruption takes effect, then detect it with hdfs fsck /path
[hadoop@hadoop001 subdir0]$ hdfs fsck /
Connecting to namenode via http://hadoop001:50070
FSCK started by hadoop (auth:SIMPLE) from /127.0.0.1 for path / at Mon Apr 29 18:51:06 CST 2019
..
/blockrecover/testcorruptfiles.txt: CORRUPT blockpool BP-2041209051-127.0.0.1-1556350579057 block blk_1073741890
/blockrecover/testcorruptfiles.txt: MISSING 1 blocks of total size 51 B.............Status: CORRUPT
Total size: 654116 B
Total dirs: 12
Total files: 14
Total symlinks: 0
Total blocks (validated): 14 (avg. block size 46722 B)
********************************
CORRUPT FILES: 1
MISSING BLOCKS: 1
MISSING SIZE: 51 B
CORRUPT BLOCKS: 1
********************************
Minimally replicated blocks: 13 (92.85714 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 0.9285714
Corrupt blocks: 1
Missing replicas: 0 (0.0 %)
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Mon Apr 29 18:51:06 CST 2019 in 41 milliseconds
The filesystem under path '/' is CORRUPT
[hadoop@hadoop001 subdir0]$
Corrupt blocks: 1
One block is corrupt.
(This simulation ran on a pseudo-distributed setup. On a real cluster, restarting HDFS may already trigger automatic repair, so you might not see the corrupt block at all.)
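For monitoring, the corrupt-block count can be read out of the fsck summary rather than scanned by hand. A sketch using an abridged copy of the summary above as sample input; in real use you would feed it the output of hdfs fsck / (assumption: the "Corrupt blocks:" summary line keeps this format):

```shell
# Abridged sample of the fsck summary from the run above.
summary='Corrupt blocks: 1
The filesystem under path / is CORRUPT'

# Extract the number after "Corrupt blocks:".
corrupt=$(echo "$summary" | awk -F: '/Corrupt blocks/ {gsub(/ /, "", $2); print $2}')
if [ "$corrupt" -gt 0 ]; then
  echo "ALERT: $corrupt corrupt block(s) found"
fi
```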
3. Repair
① Manual repair with hdfs debug (recommended)
First delete the corrupt block by hand. Remember: delete the corrupt block file and its meta file on the DataNode's local disk, not the file on HDFS. Then repair the file with:
[hadoop@hadoop001 subdir0]$ hdfs debug recoverLease -path /blocktest/testHdfsFile.txt -retries 10
-retries: the number of retries
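When several files are corrupt, the -list-corruptfileblocks output can be fed into recoverLease one file at a time. A dry-run sketch: the leading "echo" only prints the commands, and the sample line plus its "blockname path" layout is an assumption about this fsck option's output format:

```shell
# Hypothetical sample line in the assumed "blockname path" layout.
corrupt_list='blk_1073741890 /blockrecover/testcorruptfiles.txt'

# Field 2 is the file path; drop the leading "echo" to run for real.
echo "$corrupt_list" | awk '{print $2}' | while read -r f; do
  echo hdfs debug recoverLease -path "$f" -retries 10
done
```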
② Manual repair, method two
First download the file from HDFS to local disk, then delete the corresponding file on HDFS, and finally re-upload it:
hdfs dfs -ls /xxxx
hdfs dfs -get /xxxx ./
hdfs dfs -rm /xxx
hdfs dfs -put xxx / # after the put, HDFS automatically replicates the file to 3 copies
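The get / rm / put sequence above can be collected into a script. A minimal dry-run sketch: RUN=echo makes it only print the commands; setting RUN to empty would execute them (assumptions: the hdfs CLI is on PATH and the path below is your own file):

```shell
# Re-upload repair: get the file, remove it from HDFS, put it back.
RUN=echo                            # set RUN= (empty) to execute for real
SRC=/blocktest/testHdfsFile.txt     # example path, replace with yours
LOCAL=./$(basename "$SRC")

$RUN hdfs dfs -get "$SRC" "$LOCAL"
$RUN hdfs dfs -rm "$SRC"
$RUN hdfs dfs -put "$LOCAL" "$SRC"  # re-replicated to 3 copies by default
```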
③ Automatic repair
A corrupt block is not detected until the DataNode runs its directory scan;
the directory scan runs every 6 hours by default:
dfs.datanode.directoryscan.interval : 21600
The block is not recovered until the DataNode sends its block report to the NameNode;
the block report is also sent every 6 hours by default:
dfs.blockreport.intervalMsec : 21600000
Only after the NameNode receives the block report does it start the recovery.
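On a test cluster, both intervals can be lowered in hdfs-site.xml so the automatic path reacts faster. A sketch showing the two properties; the values are the 6-hour defaults quoted above:

```xml
<!-- hdfs-site.xml: defaults shown; lower them only for testing. -->
<property>
  <name>dfs.datanode.directoryscan.interval</name>
  <value>21600</value>      <!-- seconds: DataNode directory scan interval -->
</property>
<property>
  <name>dfs.blockreport.intervalMsec</name>
  <value>21600000</value>   <!-- milliseconds: block report interval -->
</property>
```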
4. Summary
①: Understand the relationship between an HDFS file and its blocks (by default, each block has 3 replicas).
②: In production I generally prefer the manual repair approach, but only after deleting the corrupt block by hand.
Remember: delete the corrupt block file and its meta file, not the file on HDFS.
Alternatively, you can get the file to local disk first, delete it from HDFS, then re-upload it.
Never casually run hdfs fsck / -delete: it deletes the corrupt files outright, and that data is simply gone. Do it only if losing the data is acceptable, or you are confident you can backfill it into HDFS from elsewhere.