This article simulates block corruption on HDFS: how to locate the corrupted block, and how to repair it.
About the hdfs fsck command
HDFS provides the fsck command to check the health of files and directories on HDFS and to retrieve a file's block and location information.
Note: the fsck command must be run by the HDFS superuser; ordinary users do not have permission.
Running hdfs fsck with no arguments prints the usage:
[hadoop@hadoop001 logs]$ hdfs fsck
(usage output omitted)
A summary of the commonly used options:
- 1. List the corrupt blocks of files (-list-corruptfileblocks)
hdfs fsck /blockrecover/ -list-corruptfileblocks
- 2. Move corrupt files to /lost+found (-move)
hdfs fsck /blockrecover/testcorruptfiles.txt -move
- 3. Delete corrupt files (-delete)
hdfs fsck /blockrecover/testcorruptfiles.txt -delete
- 4. Check and list the status of all files (-files)
hdfs fsck /blockrecover/ -files
- 5. Check and print files currently open for writing (-openforwrite)
hdfs fsck /blockrecover/ -openforwrite
- 6. Print a file's block report (-blocks); must be used together with -files.
hdfs fsck /blockrecover/testcorruptfiles.txt -files -blocks
- 7. Print the location of each block (-locations); must be used together with -files -blocks.
hdfs fsck /blockrecover/testcorruptfiles.txt -files -blocks -locations
- 8. Print the rack each block location belongs to (-racks)
hdfs fsck /blockrecover/testcorruptfiles.txt -files -blocks -locations -racks
- 9. Check the overall health of HDFS
hdfs fsck /
Create a file and upload it to HDFS
[hadoop@hadoop001 ~]$ hdfs dfs -mkdir /blockrecover
[hadoop@hadoop001 ~]$ echo "This is a corrupt block. It needs to be recovered." > testcorruptfiles.txt
[hadoop@hadoop001 ~]$ hdfs dfs -put testcorruptfiles.txt /blockrecover/
[hadoop@hadoop001 ~]$ hdfs dfs -ls /blockrecover
Found 1 items
-rw-r--r-- 1 hadoop supergroup 51 2019-04-29 17:23 /blockrecover/testcorruptfiles.txt
#Check HDFS health; no problems found
[hadoop@hadoop001 ~]$ hdfs fsck /
Connecting to namenode via http://hadoop001:50070
FSCK started by hadoop (auth:SIMPLE) from /127.0.0.1 for path / at Mon Apr 29 18:05:16 CST 2019
..............Status: HEALTHY
Total size: 654116 B
Total dirs: 12
Total files: 14
Total symlinks: 0
Total blocks (validated): 14 (avg. block size 46722 B)
Minimally replicated blocks: 14 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 1.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Mon Apr 29 18:05:16 CST 2019 in 7 milliseconds
The filesystem under path '/' is HEALTHY
Delete one replica of one of the file's blocks
#Find the name of the block that holds the file
[hadoop@hadoop001 ~]$ hdfs fsck /blockrecover/testcorruptfiles.txt -files -blocks -locations
Connecting to namenode via http://hadoop001:50070
FSCK started by hadoop (auth:SIMPLE) from /127.0.0.1 for path /blockrecover/testcorruptfiles.txt at Mon Apr 29 18:14:27 CST 2019
/blockrecover/testcorruptfiles.txt 51 bytes, 1 block(s): OK
0. BP-2041209051-127.0.0.1-1556350579057:blk_1073741890_1066 len=51 Live_repl=1 [DatanodeInfoWithStorage[127.0.0.1:50010,DS-8683d6d0-d0a5-4ff3-a21a-f94ab236a523,DISK]]
Status: HEALTHY
Total size: 51 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 1 (avg. block size 51 B)
Minimally replicated blocks: 1 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 1.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Mon Apr 29 18:14:27 CST 2019 in 1 milliseconds
The filesystem under path '/blockrecover/testcorruptfiles.txt' is HEALTHY
[hadoop@hadoop001 ~]$
#The output above shows the block is blk_1073741890_1066; now find its path on the local Linux filesystem
[hadoop@hadoop001 ~]$ find ./ -name "BP-2041209051-127.0.0.1-1556350579057"
./app/hadoop-2.6.0-cdh5.7.0/hadoop_tmp/dfs/data/current/BP-2041209051-127.0.0.1-1556350579057
[hadoop@hadoop001 ~]$ cd app/hadoop-2.6.0-cdh5.7.0/hadoop_tmp/dfs/data/current/BP-2041209051-127.0.0.1-1556350579057
[hadoop@hadoop001 BP-2041209051-127.0.0.1-1556350579057]$ ls
current scanner.cursor tmp
[hadoop@hadoop001 BP-2041209051-127.0.0.1-1556350579057]$ find ./ -name "*blk_1073741890_1066*"
./current/finalized/subdir0/subdir0/blk_1073741890_1066.meta
[hadoop@hadoop001 BP-2041209051-127.0.0.1-1556350579057]$ cd current/finalized/subdir0/subdir0/
[hadoop@hadoop001 subdir0]$ ls
blk_1073741825 blk_1073741841_1017.meta blk_1073741871 blk_1073741887_1063.meta
blk_1073741825_1001.meta blk_1073741855 blk_1073741871_1047.meta blk_1073741888
blk_1073741839 blk_1073741855_1031.meta blk_1073741872 blk_1073741888_1064.meta
blk_1073741839_1015.meta blk_1073741856 blk_1073741872_1048.meta blk_1073741889
blk_1073741840 blk_1073741856_1032.meta blk_1073741873 blk_1073741889_1065.meta
blk_1073741840_1016.meta blk_1073741857 blk_1073741873_1049.meta blk_1073741890
blk_1073741841 blk_1073741857_1033.meta blk_1073741887 blk_1073741890_1066.meta
[hadoop@hadoop001 subdir0]$
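The two find steps above can be collapsed into a short Python sketch (a hypothetical helper of mine, not part of Hadoop): walk a DataNode data directory and match both the block file and its .meta file by block id. The demo builds a throwaway directory shaped like the current/finalized/subdir0/subdir0/ layout shown above:

```python
import os
import tempfile

def find_block_files(data_dir, block_id):
    """Walk a DataNode data directory and return paths whose file
    name contains the given block id (block file plus .meta file)."""
    matches = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if block_id in name:
                matches.append(os.path.join(root, name))
    return sorted(matches)

# Demo: fake DataNode layout with a few empty block files.
tmp = tempfile.mkdtemp()
subdir = os.path.join(tmp, "current", "finalized", "subdir0", "subdir0")
os.makedirs(subdir)
for name in ("blk_1073741890", "blk_1073741890_1066.meta", "blk_1073741825"):
    open(os.path.join(subdir, name), "w").close()

hits = find_block_files(tmp, "blk_1073741890")
print([os.path.basename(p) for p in hits])
# → ['blk_1073741890', 'blk_1073741890_1066.meta']
```

On a real node you would point `data_dir` at the block pool directory found with the first find command.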
#Delete the block file and its meta file
[hadoop@hadoop001 subdir0]$ rm -rf blk_1073741890 blk_1073741890_1066.meta
Restart HDFS to make the simulated corruption take effect, then check with fsck:
#Note: Corrupt blocks: 1 (one block is corrupt)
[hadoop@hadoop001 subdir0]$ hdfs fsck /
Connecting to namenode via http://hadoop001:50070
FSCK started by hadoop (auth:SIMPLE) from /127.0.0.1 for path / at Mon Apr 29 18:51:06 CST 2019
..
/blockrecover/testcorruptfiles.txt: CORRUPT blockpool BP-2041209051-127.0.0.1-1556350579057 block blk_1073741890
/blockrecover/testcorruptfiles.txt: MISSING 1 blocks of total size 51 B.............Status: CORRUPT
Total size: 654116 B
Total dirs: 12
Total files: 14
Total symlinks: 0
Total blocks (validated): 14 (avg. block size 46722 B)
********************************
CORRUPT FILES: 1
MISSING BLOCKS: 1
MISSING SIZE: 51 B
CORRUPT BLOCKS: 1
********************************
Minimally replicated blocks: 13 (92.85714 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 0.9285714
Corrupt blocks: 1
Missing replicas: 0 (0.0 %)
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Mon Apr 29 18:51:06 CST 2019 in 41 milliseconds
The filesystem under path '/' is CORRUPT
[hadoop@hadoop001 subdir0]$
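If you want to script this health check, here is a minimal sketch (my own helper, not a Hadoop tool) that pulls the Corrupt blocks counter out of an fsck report, fed an abridged copy of the output above, so e.g. a cron job could alert whenever it is non-zero:

```python
import re

# Abridged fsck output from the run above.
FSCK_REPORT = """\
/blockrecover/testcorruptfiles.txt: CORRUPT blockpool BP-2041209051-127.0.0.1-1556350579057 block blk_1073741890
 Total size: 654116 B
 Total blocks (validated): 14 (avg. block size 46722 B)
 Corrupt blocks: 1
The filesystem under path '/' is CORRUPT
"""

def corrupt_block_count(report):
    """Return the 'Corrupt blocks' counter from an fsck report, 0 if absent."""
    m = re.search(r"Corrupt blocks:\s*(\d+)", report)
    return int(m.group(1)) if m else 0

print(corrupt_block_count(FSCK_REPORT))  # → 1
```

In practice the report text would come from running `hdfs fsck /` and capturing its stdout.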
Locate the corrupted block
To find out which machines the file's blocks live on, so the corrupted block files can be removed by hand on Linux, combine these options:
-files shows the file's block information;
-blocks shows block details (only together with -files);
-locations shows the DataNode IP holding each block (only together with -blocks);
-racks shows the rack placement (only together with -files).
①. Detect missing/corrupt blocks:
hdfs fsck / -list-corruptfileblocks
②. Inspect one particular file:
hdfs fsck /blockrecover/testcorruptfiles.txt -files -blocks -locations
For example:
[hadoop@hadoop001 ~]$ hdfs fsck /blockrecover/testcorruptfiles.txt -files -blocks -locations
Connecting to namenode via http://hadoop001:50070
FSCK started by hadoop (auth:SIMPLE) from /127.0.0.1 for path /blockrecover/testcorruptfiles.txt at Mon Apr 29 21:28:27 CST 2019
/blockrecover/testcorruptfiles.txt 51 bytes, 1 block(s):
/blockrecover/testcorruptfiles.txt: CORRUPT blockpool BP-2041209051-127.0.0.1-1556350579057 block blk_1073741890
MISSING 1 blocks of total size 51 B
0. BP-2041209051-127.0.0.1-1556350579057:blk_1073741890_1066 len=51 MISSING!
Status: CORRUPT
Total size: 51 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 1 (avg. block size 51 B)
********************************
CORRUPT FILES: 1
MISSING BLOCKS: 1
MISSING SIZE: 51 B
CORRUPT BLOCKS: 1
********************************
Minimally replicated blocks: 0 (0.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 0.0
Corrupt blocks: 1
Missing replicas: 0
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Mon Apr 29 21:28:27 CST 2019 in 1 milliseconds
The filesystem under path '/blockrecover/testcorruptfiles.txt' is CORRUPT
The line /blockrecover/testcorruptfiles.txt: CORRUPT blockpool BP-2041209051-127.0.0.1-1556350579057 block blk_1073741890 is the key: it identifies the bad block. (I then located it on disk with: find ./ -name "blk_1073741890")
③. You can also check the logs: from the command output above, identify which machine held the block, then inspect that machine's logs for the cause of the corruption and the location of the lost block file.
Manual repair
Take a look at the hdfs debug command; it is important:
[hadoop@hadoop001 ~]$ hdfs debug
Usage: hdfs debug <command> [arguments]
verify [-meta <metadata-file>] [-block <block-file>]
recoverLease [-path <path>] [-retries <num-retries>]
[hadoop@hadoop001 ~]$
Manual repair command:
[hadoop@hadoop001 subdir0]$ hdfs debug recoverLease -path /blockrecover/testcorruptfiles.txt -retries 10
Because my Hadoop deployment is pseudo-distributed with a single DataNode, each block has only one replica; once that one replica is deleted, there is nothing left to recover from, so the repair cannot be demonstrated here. In a normal cluster with three replicas, deleting one replica still leaves two good copies, and the block can then be repaired from them.
Automatic repair
After a block is damaged, the DataNode does not notice until it runs a directory scan; by default the directory scan runs every 6 hours:
dfs.datanode.directoryscan.interval : 21600
The block is not recovered until the DataNode sends its block report to the NameNode; by default the block report also runs every 6 hours:
dfs.blockreport.intervalMsec : 21600000
Recovery starts only after the NameNode receives the block report.
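Note that the two defaults use different units (seconds versus milliseconds); a quick arithmetic check confirms both work out to 6 hours:

```python
# dfs.datanode.directoryscan.interval is in seconds,
# dfs.blockreport.intervalMsec is in milliseconds.
directoryscan_interval_s = 21600
blockreport_interval_ms = 21600000

directoryscan_hours = directoryscan_interval_s / 3600
blockreport_hours = blockreport_interval_ms / 1000 / 3600

print(directoryscan_hours, blockreport_hours)  # → 6.0 6.0
```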
Both dfs.datanode.directoryscan.interval and dfs.blockreport.intervalMsec appear in the default hdfs-default.xml configuration on the Hadoop website.
The CDH configuration UI, however, does not expose these two parameters, so how do you adjust them there?
To be added.
Summary
① After a block is damaged, recovery only starts once the NameNode receives a block report, which can take a long time. In production, manually repairing the damaged block is usually preferred.
② Manual repair is generally the production choice, but the damaged block must first be removed by hand. Remember: delete the damaged block file and its meta file on the DataNode, not the HDFS file itself. Then run the repair command:
hdfs debug recoverLease -path <hdfs-file-path> -retries <num-retries> # repair the given HDFS file, retrying several times
③ Alternatively, download the file with get, delete it from HDFS, and upload it again. For example, assuming the only copy of the data is the one on HDFS:
hdfs dfs -ls /xxxx
hdfs dfs -get /xxxx ./
hdfs dfs -rm /xxx
hdfs dfs -put xxx / #after the put, the file is re-replicated, e.g. back to 3 copies
Remember: do NOT delete with hdfs fsck / -delete, because it deletes the damaged files (on HDFS) and the data is lost; only do that if losing the data is acceptable, or if you are confident you can reload it into HDFS from another source!
④ If you know where the file came from, i.e. it can be reloaded from another store, you can simply delete the damaged HDFS file and reload the data.
hdfs fsck / -delete removes every damaged file on HDFS; it only deletes damaged files, but it is time-consuming.
You can also scope it to one path, e.g. "hdfs fsck /path/to/file -delete". For example, if a file came from a MySQL table, just reload that table's data onto HDFS. This is also the only option when there is a single replica (as in my pseudo-distributed setup) or all replicas are already damaged.
HDFS interview question: how do you handle lost file blocks in Hadoop?
First locate which blocks were lost: the hdfs fsck command finds the block locations, or you can check the logs. Once the location of the lost blocks is known, if the file is not important you can simply delete it and copy a fresh one onto the cluster; if it cannot be deleted, restore it from a backup (clusters generally keep backups).
Also see this article on block scanning, HDFS DataNode Scanners and Disk Checker Explained:
https://blog.cloudera.com/blog/2016/12/hdfs-datanode-scanners-and-disk-checker-explained/
http://fatkun.com/2017/07/hdfs-health-check.html