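Checking for and Handling Corrupt or Missing HDFS Blocks

1. Introduction to the fsck command

fsck is short for "file system check". Running hdfs fsck with no arguments prints the usage text with all supported options.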

[hdfs@rtn01 ~]$ hdfs fsck
Usage: DFSck <path> [-list-corruptfileblocks | [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]] [-maintenance]
	<path>	start checking from this path
	-move	move corrupted files to /lost+found
	-delete	delete corrupted files
	-files	print out files being checked
	-openforwrite	print out files opened for write
	-includeSnapshots	include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it
	-list-corruptfileblocks	print out list of missing blocks and files they belong to
	-blocks	print out block report
	-locations	print out locations for every block
	-racks	print out network topology for data-node locations

	-maintenance	print out maintenance state node details
	-blockId	print out which file this blockId belongs to, locations (nodes, racks) of this block, and other diagnostics info (under replicated, corrupted or not, etc)

Please Note:
	1. By default fsck ignores files opened for write, use -openforwrite to report such files. They are usually  tagged CORRUPT or HEALTHY depending on their block allocation status
	2. Option -includeSnapshots should not be used for comparing stats, should be used only for HEALTH check, as this may contain duplicates if the same file present in both original fs tree and inside snapshots.

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

[hdfs@rtn01 ~]$


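# Notes on the main options:
-move: move corrupted files to the /lost+found directory
-delete: delete corrupted files
-openforwrite: also report files that are currently open for write
-list-corruptfileblocks: list the corrupt blocks and the files they belong to
-files: print the files being checked
-blocks: print a detailed block report (must be used together with -files)
-locations: print the location of every block (must be used together with -files -blocks)
-racks: print the rack of each block location (must be used together with -files -blocks)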

2. Detecting missing blocks

  • List the corrupt blocks and the files they belong to
hdfs fsck -list-corruptfileblocks


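  • Print the affected files and their block information (lines of progress dots and replica-related lines are filtered out)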
hdfs fsck / | egrep -v '^\.+$' | grep -v eplica

The last lines of the output are reproduced below:

/user/hive/warehouse/sdata.db/s002_lm_pm_shd/dt=20191019/000105_0.snappy: MISSING 1 blocks of total size 39250374 B..
/user/hive/warehouse/sdata.db/s002_lm_pm_shd/dt=20191019/000106_0.snappy: CORRUPT blockpool BP-1033365284-50.27.1.1-1534905241284 block blk_1310129758


/user/hive/warehouse/sdata.db/s002_lm_pm_shd/dt=20191019/000106_0.snappy: MISSING 1 blocks of total size 30618782 B....
/user/hive/warehouse/sdata.db/s002_lm_pm_shd/dt=20191019/000109_0.snappy: CORRUPT blockpool BP-1033365284-50.27.1.1-1534905241284 block blk_1310129895


/user/hive/warehouse/sdata.db/s002_lm_pm_shd/dt=20191019/000109_0.snappy: MISSING 1 blocks of total size 38601489 B...............................................................
..............................................................................Status: CORRUPT
 Total size:	39846282415994 B (Total open files size: 124242 B)
 Total dirs:	22313
 Total files:	718378
 Total symlinks:		0 (Files currently being written: 2)
 Total blocks (validated):	814197 (avg. block size 48939362 B) (Total open file blocks (not validated): 2)
  ********************************
  UNDER MIN REPL'D BLOCKS:	3104 (0.38123453 %)
  CORRUPT FILES:	3104
  MISSING BLOCKS:	3104
  MISSING SIZE:		101125381421 B
  CORRUPT BLOCKS: 	3104
  ********************************
 Corrupt blocks:		3104
 Number of data-nodes:		4
 Number of racks:		2
FSCK ended at Mon Nov 25 14:00:35 CST 2019 in 10141 milliseconds


The filesystem under path '/' is CORRUPT
[hdfs@rtn01 ~]$
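
With thousands of affected files, the report above is easier to work with once it is reduced to a plain list of paths. The following is a minimal sketch, assuming the per-file lines always have the form "<path>: CORRUPT ..." or "<path>: MISSING ..." as in the output shown here; the function name list_damaged_files is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `hdfs fsck <path>` output on stdin and print
# every file that has a CORRUPT or MISSING line, once per file.
list_damaged_files() {
  # Keep only per-file lines ("<path>: CORRUPT ..." / "<path>: MISSING ..."),
  # cut everything from the status keyword onward, then de-duplicate.
  grep -E '^/.*: (CORRUPT|MISSING) ' \
    | sed -E 's/: (CORRUPT|MISSING) .*$//' \
    | sort -u
}

# Example usage (commented out because it needs a running cluster):
#   hdfs fsck / | list_damaged_files > damaged_files.txt
```

Paths that themselves contain ": CORRUPT " or ": MISSING " would be cut short by this sketch; that case is assumed not to occur.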

  • Inspect one of the files reported by the command above
hdfs fsck /user/hive/warehouse/sdata.db/s002_lm_pm_shd/dt=20191019/000106_0.snappy -locations -blocks -files

The output is reproduced below:

[hdfs@rtn01 ~]$ hdfs fsck /user/hive/warehouse/sdata.db/s002_lm_pm_shd/dt=20191019/000106_0.snappy -locations -blocks -files
Connecting to namenode via http://rtn02:50070
FSCK started by hdfs (auth:SIMPLE) from /50.27.1.1 for path /user/hive/warehouse/sdata.db/s002_lm_pm_shd/dt=20191019/000106_0.snappy at Mon Nov 25 14:07:19 CST 2019
/user/hive/warehouse/sdata.db/s002_lm_pm_shd/dt=20191019/000106_0.snappy 30618782 bytes, 1 block(s):
/user/hive/warehouse/sdata.db/s002_lm_pm_shd/dt=20191019/000106_0.snappy: CORRUPT blockpool BP-1033365284-50.27.1.1-1534905241284 block blk_1310129758
 Replica placement policy is violated for BP-1033365284-50.27.1.1-1534905241284:blk_1310129758_236410095. Block should be additionally replicated on 1 more rack(s).
 MISSING 1 blocks of total size 30618782 B
0. BP-1033365284-50.27.1.1-1534905241284:blk_1310129758_236410095 len=30618782 MISSING!

Status: CORRUPT
 Total size:	30618782 B
 Total dirs:	0
 Total files:	1
 Total symlinks:		0
 Total blocks (validated):	1 (avg. block size 30618782 B)
  ********************************
  UNDER MIN REPL'D BLOCKS:	1 (100.0 %)
  dfs.namenode.replication.min:	1
  CORRUPT FILES:	1
  MISSING BLOCKS:	1
  MISSING SIZE:		30618782 B
  CORRUPT BLOCKS: 	1
  ********************************
 Minimally replicated blocks:	0 (0.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		1 (100.0 %)
 Default replication factor:	1
 Average block replication:	0.0
 Corrupt blocks:		1
 Missing replicas:		0
 Number of data-nodes:		4
 Number of racks:		2
FSCK ended at Mon Nov 25 14:07:19 CST 2019 in 1 milliseconds


The filesystem under path '/user/hive/warehouse/sdata.db/s002_lm_pm_shd/dt=20191019/000106_0.snappy' is CORRUPT
[hdfs@rtn01 ~]$
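
The block-level lines in this report (index, block pool, block ID, length, status) can be pulled out for follow-up work, for example with the -blockId option listed in the usage text. The following is a minimal sketch, assuming missing blocks are always printed as "<index>. <pool>:blk_<id>_<genstamp> len=<bytes> MISSING!" as in the output above; list_missing_blocks is a name invented for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `hdfs fsck <path> -files -blocks` output on stdin
# and print "<block name> <length in bytes>" for every block flagged MISSING!.
list_missing_blocks() {
  awk '/^[0-9]+\. .*MISSING!/ {
         split($2, parts, ":")   # parts[1] = block pool, parts[2] = blk_<id>_<genstamp>
         sub(/^len=/, "", $3)    # keep only the byte count
         print parts[2], $3
       }'
}

# Example usage (commented out because it needs a running cluster):
#   hdfs fsck /some/file -files -blocks -locations | list_missing_blocks
```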

3. Recovering from lost blocks

3.1 Only some replicas of a block are corrupt

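  • Option 1: wait for Hadoop to detect and repair the block automatically (6 hours or more with the default settings)
Detection phase:
After a block is damaged, the DataNode does not notice it until its next directory scan, which by default runs every 6 hours.
dfs.datanode.directoryscan.interval : 21600 (seconds)
Recovery phase:
The block is not recovered until the DataNode sends its next block report to the NameNode, which by default also happens every 6 hours.
dfs.blockreport.intervalMsec : 21600000 (milliseconds)

The NameNode starts recovery only after it receives that block report, so in the worst case about 12 hours pass between the damage and the repair.
  • Option 2: restart the HDFS service manually, after which the block is repaired automatically
Restarting HDFS triggers a check for bad blocks, and any bad block that is found is repaired automatically (restarting the cluster services from time to time therefore helps protect the blocks).
  • Option 3: repair manually (recommended)
# Fetching the file with hdfs dfs -get and writing it back with -put is usually enough; the command below also works.
hdfs debug recoverLease -path /user/hive/warehouse/sdata.db/s002_lm_pm_shd/dt=20191019/000106_0.snappy -retries 10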

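The command reported success. However, my cluster keeps only a single replica, so a corrupt replica cannot actually be repaired manually there; the run only demonstrates the command.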
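
The roughly 12-hour worst case quoted for the first option above (automatic detection and repair) is simply the sum of the two intervals. The following is a minimal sketch of that arithmetic, assuming the two-stage model described in this section and plain numeric values (seconds and milliseconds); worst_case_repair_hours is a name invented for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: worst-case delay, in whole hours, before the NameNode
# learns about a damaged replica under the two-stage model described above.
#   $1: dfs.datanode.directoryscan.interval, in seconds
#   $2: dfs.blockreport.intervalMsec, in milliseconds
worst_case_repair_hours() {
  local scan_s=$1 report_ms=$2
  # One full directory-scan interval plus one full block-report interval.
  echo $(( (scan_s + report_ms / 1000) / 3600 ))
}

# With the default values quoted above:
#   worst_case_repair_hours 21600 21600000    # prints 12
#
# Reading the live values instead (assumes getconf returns bare numbers;
# some versions may append a time-unit suffix that would need stripping):
#   worst_case_repair_hours \
#     "$(hdfs getconf -confKey dfs.datanode.directoryscan.interval)" \
#     "$(hdfs getconf -confKey dfs.blockreport.intervalMsec)"
```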

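3.2 All replicas of a block are corrupt

  • Single replica (pseudo-distributed mode): bad blocks appear very easily, because a memory, disk, or network problem can each cause one

  • Multiple replicas: every replica of one or more blocks has been lost

In either case, only the two approaches below remain.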

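3.2.1 If the file is not important

# Leave safe mode
hdfs dfsadmin -safemode leave
# Delete the files that contain corrupt (missing) blocks
hdfs fsck /path -delete

Confirm the following two points before running the commands above:

  1. The NameNode has left safe mode (hdfs dfsadmin -safemode leave).

  2. /path is correct. Avoid using / where possible; otherwise important files whose replicas are all corrupt are deleted before they have been dealt with, which makes later recovery difficult.

Note: this approach loses data, because the files that contain the corrupt blocks are deleted.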
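
The second precaution above can be enforced with a small wrapper. The following is a minimal sketch, assuming the only guard wanted is a refusal of the root path and of relative paths; guarded_fsck_delete is a name invented for this example and is not part of Hadoop.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run `hdfs fsck <path> -delete` only for a specific,
# absolute, non-root path, and list the affected files first.
guarded_fsck_delete() {
  local path=$1
  case "$path" in
    ''|/) echo "refusing to run fsck -delete on '${path}'" >&2; return 1 ;;
    /*)   ;;  # absolute path below the root: accepted
    *)    echo "path must be absolute: ${path}" >&2; return 1 ;;
  esac
  # Show what is about to be deleted, then delete it.
  hdfs fsck "$path" -list-corruptfileblocks
  hdfs fsck "$path" -delete
}

# Example usage (commented out because it deletes data on a real cluster):
#   guarded_fsck_delete /user/hive/warehouse/sdata.db/s002_lm_pm_shd/dt=20191019
```

The sketch does not normalize paths, so variants such as // are not caught; it only illustrates the idea of scoping the delete to one directory.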

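3.2.2 If the file is important

  • Scenario 1: the data came from another Hadoop cluster; fetch the file again and upload it to the affected directory
# On the source Hadoop cluster: fetch the file
hdfs dfs -get /user/hive/warehouse/sdata.db/s002_lm_pm_shd/dt=20191019/000106_0.snappy
# On the cluster that lost the block: upload it (-f overwrites the damaged file that still exists at the destination)
hdfs dfs -put -f 000106_0.snappy /user/hive/warehouse/sdata.db/s002_lm_pm_shd/dt=20191019/
  • Scenario 2: the data was generated inside Hadoop and has to be regenerated
# If the block belongs to a Hive table, regenerate the table rather than only operating on the HDFS file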

4. Check the blocks again after the repair

hdfs fsck -list-corruptfileblocks


hdfs fsck /

The output of both commands shows that the corrupt (missing) blocks are gone.
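
This final check can also be scripted. The following is a minimal sketch, assuming a clean report ends with the line "The filesystem under path '<path>' is HEALTHY" (the counterpart of the "is CORRUPT" line shown earlier); fsck_is_healthy is a name invented for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit status 0) only if the fsck report read
# from stdin contains a HEALTHY verdict for the checked path.
fsck_is_healthy() {
  grep -q "^The filesystem under path '.*' is HEALTHY"
}

# Example usage (commented out because it needs a running cluster):
#   if hdfs fsck / | fsck_is_healthy; then
#     echo "no corrupt or missing blocks left"
#   else
#     echo "fsck still reports problems" >&2
#   fi
```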
