Hadoop3 Safemode

Cluster environment:

CDH: 6.1 (Hadoop 3.0)
Environment: MYTEST
OS: CentOS Linux release 7.7.1908 (Core)
Roles: 1 NN, 5 DN

Symptoms

In Cloudera Manager (CM), the HDFS service showed red health alerts. Drilling in revealed that the cluster was in safemode and that HDFS blocks were missing.

Opening the dfshealth check page (note: the NameNode web UI port is 50070 before Hadoop 3 and 9870 from Hadoop 3 onward) confirmed that the cluster was indeed in safemode.

Hadoop HDFS health-check page URL:
http://${HOSTNAME}:9870/dfshealth.html#tab-overview

Overview '${HOSTNAME}:8020' (active)
Started:	Fri Feb 28 18:40:22 +0800 2020
Version:	3.0.0-cdh6.1.0, rb8dd3044ff414ac0bf14b77ab23d55ca291464a9
Compiled:	Fri Dec 07 09:00:00 +0800 2018 by jenkins from Unknown
Cluster ID:	cluster14
Block Pool ID:	BP-1007471777-${NN_IP}-1579591487193
Summary
Security is off.

Safe mode is ON. The reported blocks 0 needs additional 1393 blocks to reach the threshold 1.0000 of total blocks 1393. The number of live datanodes 5 has reached the minimum number 1. Safe mode will be turned off automatically once the thresholds have been reached.
In plain terms: the total block count is 1393, DataNodes have reported 0 of them to the NameNode, so 1393 more reported blocks are needed to reach the 1.0000 threshold. The live-DataNode count (5) already satisfies the minimum of 1. Once the threshold is reached, safemode will switch off automatically.
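The exit condition in that message can be sketched numerically. A minimal sketch with the values from the message above (the threshold property is dfs.namenode.safemode.threshold-pct, whose stock default is 0.999; this cluster reports 1.0000):

```shell
# Safemode lifts once reported_blocks / total_blocks >= threshold
# (and the live-DataNode minimum is met). Values from the message above:
reported=0; total=1393; threshold=1.0000
awk -v r="$reported" -v t="$total" -v th="$threshold" 'BEGIN {
  need = t * th - r
  if (need < 0) need = 0
  printf "blocks still needed: %d\n", need
}'
# prints: blocks still needed: 1393
```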

1,483 files and directories, 1,393 blocks (1,393 replicated blocks, 0 erasure coded block groups) = 2,876 total filesystem object(s).

Heap Memory used 300.29 MB of 1.19 GB Heap Memory. Max Heap Memory is 1.19 GB.

Non Heap Memory used 86.86 MB of 88.77 MB Commited Non Heap Memory. Max Non Heap Memory is <unbounded>.

Troubleshooting

①. Check the state of the HDFS blocks:

As the query below shows, 1393 blocks are missing:
export HADOOP_USER_NAME=hdfs; hdfs fsck / -files -blocks -locations

/user/oozie/share/lib/lib_20200121152632/sqoop/websocket-server-9.3.20.v20170531.jar 35066 bytes, replicated: replication=3, 1 block(s):  MISSING 1 blocks of total size 35066 B
0. BP-1007471777-${NN_IP}-1579591487193:blk_1073741911_1087 len=35066 MISSING!

/user/oozie/share/lib/lib_20200121152632/sqoop/websocket-servlet-9.3.20.v20170531.jar 18188 bytes, replicated: replication=3, 1 block(s):  MISSING 1 blocks of total size 18188 B
0. BP-1007471777-${NN_IP}-1579591487193:blk_1073741922_1098 len=18188 MISSING!

/user/oozie/share/lib/lib_20200121152632/sqoop/xz-1.6.jar 103131 bytes, replicated: replication=3, 1 block(s):  MISSING 1 blocks of total size 103131 B
0. BP-1007471777-${NN_IP}-1579591487193:blk_1073742027_1203 len=103131 MISSING!

/user/oozie/share/lib/lib_20200121152632/sqoop/zookeeper.jar 1459956 bytes, replicated: replication=3, 1 block(s):  MISSING 1 blocks of total size 1459956 B
0. BP-1007471777-${NN_IP}-1579591487193:blk_1073742005_1181 len=1459956 MISSING!

/user/spark <dir>
/user/spark/applicationHistory <dir>
/user/yarn <dir>
/user/yarn/mapreduce <dir>
/user/yarn/mapreduce/mr-framework <dir>
/user/yarn/mapreduce/mr-framework/3.0.0-cdh6.1.0-mr-framework.tar.gz 222906481 bytes, replicated: replication=1, 2 block(s):  MISSING 2 blocks of total size 222906481 B
0. BP-1007471777-${NN_IP}-1579591487193:blk_1073742800_1976 len=134217728 MISSING!
1. BP-1007471777-${NN_IP}-1579591487193:blk_1073743045_2221 len=88688753 MISSING!


Status: CORRUPT
 Number of data-nodes:	5
 Number of racks:		1
 Total dirs:			85
 Total symlinks:		0

Replicated Blocks:
 Total size:	1426108438 B
 Total files:	1392 (Files currently being written: 6)
 Total blocks (validated):	1393 (avg. block size 1023767 B)
  ********************************
  UNDER MIN REPL'D BLOCKS:	1393 (100.0 %)
  dfs.namenode.replication.min:	1
  CORRUPT FILES:	1392
  MISSING BLOCKS:	1393
  MISSING SIZE:		1426108438 B
  ********************************
 Minimally replicated blocks:	0 (0.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	3
 Average block replication:	0.0
 Missing blocks:		1393
 Corrupt blocks:		0
 Missing replicas:		0

Erasure Coded Block Groups:
 Total size:	0 B
 Total files:	0
 Total block groups (validated):	0
 Minimally erasure-coded block groups:	0
 Over-erasure-coded block groups:	0
 Under-erasure-coded block groups:	0
 Unsatisfactory placement block groups:	0
 Average block group size:	0.0
 Missing block groups:		0
 Corrupt block groups:		0
 Missing internal blocks:	0
FSCK ended at Wed Mar 25 17:40:40 CST 2020 in 237 milliseconds


The filesystem under path '/' is CORRUPT

②. Leave safemode
Log in to the NN node and run the leave command to exit safemode:

  • hdfs dfsadmin -safemode leave # leave safemode
  • hdfs dfsadmin -safemode get # get safemode status ("Safe mode is OFF")
  • hdfs dfsadmin -safemode enter # enter safemode
  • hdfs dfsadmin -safemode wait # wait until safemode ends
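In scripts, -safemode get (or wait) is the usual guard before batch work. A minimal sketch of parsing the get output; here the hdfs call is stubbed with echo so the logic is visible without a cluster:

```shell
# In production, replace the echo with: status=$(hdfs dfsadmin -safemode get)
status=$(echo "Safe mode is OFF")
case "$status" in
  *"Safe mode is ON"*)  echo "NameNode in safemode; aborting batch job" ;;
  *"Safe mode is OFF"*) echo "safemode off; proceeding" ;;
esac
# prints: safemode off; proceeding
```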

While the NameNode is in safemode, many operations are rejected:

directories cannot be created
files cannot be uploaded
files cannot be deleted

As shown below, even the recoverLease command cannot run while in safemode:

[root@${NN_HOST} cloudera-scm-server]# hdfs debug recoverLease -path /user/oozie/share/lib/lib_20200121152632/sqoop/xz-1.6.jar -retries 2
recoverLease got exception: Cannot recover the lease of /user/oozie/share/lib/lib_20200121152632/sqoop/xz-1.6.jar. Name node is in safe mode.
It was turned on manually. Use "hdfs dfsadmin -safemode leave" to turn safe mode off. NamenodeHostName:{NN_HOST}
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.newSafemodeException(FSNamesystem.java:1446)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1433)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:2463)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.recoverLease(NameNodeRpcServer.java:816)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.recoverLease(ClientNamenodeProtocolServerSideTranslatorPB.java:737)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)

Retrying in 5000 ms...
Retry #1

③. Recover the HDFS file leases

1. Manual repair:
hdfs fsck / # first, check which blocks are missing
hdfs debug recoverLease -path <file path> -retries <retry count> # recover the lease of the given HDFS file, with several retries
This is usually enough to repair HDFS. Be very careful with "hdfs fsck / -delete": it deletes the affected files outright, so the data is permanently lost. It is only appropriate when a file has a single replica, or when every replica is already damaged.
2. Automatic repair
HDFS does repair damaged blocks on its own, but the damage goes unnoticed until the DataNode runs its directory scan (which cross-checks the DN's in-memory block metadata against the blocks on disk); that scan runs every 6 hours:
dfs.datanode.directoryscan.interval : 21600
And nothing is recovered until the DN sends its block report to the NN; the block report interval is also 6 hours:
dfs.blockreport.intervalMsec : 21600000
Recovery only starts once the NN receives the block report.
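For reference, both intervals live in hdfs-site.xml; the values below are the ones cited above (adjustable per cluster):

```xml
<!-- hdfs-site.xml: how long damage can go unnoticed before auto-repair kicks in -->
<property>
  <name>dfs.datanode.directoryscan.interval</name>
  <value>21600</value><!-- seconds: 6 hours -->
</property>
<property>
  <name>dfs.blockreport.intervalMsec</name>
  <value>21600000</value><!-- milliseconds: 6 hours -->
</property>
```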
In production, manual repair is the preferred way to fix damaged blocks.

Recover a single file:

hdfs debug recoverLease -path /user/oozie/share/lib/lib_20200121152632/sqoop/zookeeper.jar -retries 2

List all MISSING files:

hdfs fsck / -files -blocks -locations | grep '.jar'|awk '{print $1}'

Batch-recover all MISSING files:

hdfs fsck / -files -blocks -locations | grep '.jar'|awk '{print $1}'|while read jar; do hdfs debug recoverLease -path $jar -retries 2; done

[root@{NN_HOST} cloudera-scm-server]# hdfs fsck / -files -blocks -locations | grep '.jar'|awk '{print $1}'|while read jar; do hdfs debug recoverLease -path $jar -retries 2; done
Connecting to namenode via http://{NN_HOST}:9870/fsck?ugi=hdfs&files=1&blocks=1&locations=1&path=%2F
recoverLease SUCCEEDED on /user/oozie/share/lib/lib_20200121152632/distcp/hadoop-distcp.jar
recoverLease SUCCEEDED on /user/oozie/share/lib/lib_20200121152632/distcp/netty-all-4.1.17.Final.jar
recoverLease SUCCEEDED on /user/oozie/share/lib/lib_20200121152632/distcp/oozie-sharelib-distcp-5.0.0-cdh6.1.0.jar
recoverLease SUCCEEDED on /user/oozie/share/lib/lib_20200121152632/distcp/oozie-sharelib-distcp.jar
recoverLease SUCCEEDED on /user/oozie/share/lib/lib_20200121152632/hcatalog/HikariCP-2.6.1.jar
recoverLease SUCCEEDED on /user/oozie/share/lib/lib_20200121152632/hcatalog/HikariCP-java7-2.4.12.jar
recoverLease SUCCEEDED on /user/oozie/share/lib/lib_20200121152632/hcatalog/ST4-4.0.4.jar
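One caveat on the extraction pipeline above: in grep '.jar' the dot is a regex metacharacter that matches any character, while grep -F '.jar' matches the literal string. The parsing step can be exercised offline against a captured fsck line (sample file path below is illustrative):

```shell
# Sample lines as emitted by 'hdfs fsck / -files -blocks -locations'
cat <<'EOF' > /tmp/fsck_sample.txt
/user/oozie/share/lib/lib_20200121152632/sqoop/xz-1.6.jar 103131 bytes, replicated: replication=3, 1 block(s):  MISSING 1 blocks of total size 103131 B
/user/spark <dir>
EOF
# Extract just the paths of .jar files (literal match with -F)
grep -F '.jar' /tmp/fsck_sample.txt | awk '{print $1}'
# prints: /user/oozie/share/lib/lib_20200121152632/sqoop/xz-1.6.jar
```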

④. After the step above, the MISSING blocks problem was still not resolved.
Checking whether the HDFS files still exist shows that they do:

[root@${NN_HOST} ~]# hdfs dfs -ls /user/oozie/share/lib/lib_20200121152632/distcp/netty-all-4.1.17.Final.jar
-rwxrwxr-x 3 oozie oozie 3780056 2020-01-21 15:26 /user/oozie/share/lib/lib_20200121152632/distcp/netty-all-4.1.17.Final.jar

⑤. Try reading the contents of a jar file (a binary jar is not really readable; the point is to force the filesystem client to go look up its blocks)

[root@tech50 ~]# hdfs dfs -ls /user/oozie/share/lib/lib_20200121152632/distcp/netty-all-4.1.17.Final.jar
-rwxrwxr-x   3 oozie oozie    3780056 2020-01-21 15:26 /user/oozie/share/lib/lib_20200121152632/distcp/netty-all-4.1.17.Final.jar
[root@techtest204-50 ~]# hdfs dfs -tail /user/oozie/share/lib/lib_20200121152632/distcp/netty-all-4.1.17.Final.jar
20/03/26 15:54:18 WARN hdfs.DFSClient: No live nodes contain block BP-1007471777-${NN_HOST}-1579591487193:blk_1073742432_1608 after checking nodes = [], ignoredNodes = null
20/03/26 15:54:18 INFO hdfs.DFSClient: No node available for BP-1007471777-${NN_HOST}-1579591487193:blk_1073742432_1608 file=/user/oozie/share/lib/lib_20200121152632/distcp/netty-all-4.1.17.Final.jar
20/03/26 15:54:18 INFO hdfs.DFSClient: Could not obtain BP-1007471777-${NN_HOST}-1579591487193:blk_1073742432_1608 from any node:  No live nodes contain current block Block locations: Dead nodes: . Will get new block locations from namenode and retry...
20/03/26 15:54:18 WARN hdfs.DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 2875.0318767294357 msec.
20/03/26 15:54:21 WARN hdfs.DFSClient: No live nodes contain block BP-1007471777-${NN_HOST}-1579591487193:blk_1073742432_1608 after checking nodes = [], ignoredNodes = null
20/03/26 15:54:21 INFO hdfs.DFSClient: No node available for BP-1007471777-${NN_HOST}-1579591487193:blk_1073742432_1608 file=/user/oozie/share/lib/lib_20200121152632/distcp/netty-all-4.1.17.Final.jar
20/03/26 15:54:21 INFO hdfs.DFSClient: Could not obtain BP-1007471777-${NN_HOST}-1579591487193:blk_1073742432_1608 from any node:  No live nodes contain current block Block locations: Dead nodes: . Will get new block locations from namenode and retry...
20/03/26 15:54:21 WARN hdfs.DFSClient: DFS chooseDataNode: got # 2 IOException, will wait for 5921.653390282358 msec.
20/03/26 15:54:27 WARN hdfs.DFSClient: No live nodes contain block BP-1007471777-${NN_HOST}-1579591487193:blk_1073742432_1608 after checking nodes = [], ignoredNodes = null
20/03/26 15:54:27 INFO hdfs.DFSClient: No node available for BP-1007471777-${NN_HOST}-1579591487193:blk_1073742432_1608 file=/user/oozie/share/lib/lib_20200121152632/distcp/netty-all-4.1.17.Final.jar
20/03/26 15:54:27 INFO hdfs.DFSClient: Could not obtain BP-1007471777-${NN_HOST}-1579591487193:blk_1073742432_1608 from any node:  No live nodes contain current block Block locations: Dead nodes: . Will get new block locations from namenode and retry...
20/03/26 15:54:27 WARN hdfs.DFSClient: DFS chooseDataNode: got # 3 IOException, will wait for 7743.24114377171 msec.
20/03/26 15:54:34 WARN hdfs.DFSClient: No live nodes contain block BP-1007471777-${NN_HOST}-1579591487193:blk_1073742432_1608 after checking nodes = [], ignoredNodes = null
20/03/26 15:54:34 WARN hdfs.DFSClient: Could not obtain block: BP-1007471777-${NN_HOST}-1579591487193:blk_1073742432_1608 file=/user/oozie/share/lib/lib_20200121152632/distcp/netty-all-4.1.17.Final.jar No live nodes contain current block Block locations: Dead nodes: . Throwing a BlockMissingException
20/03/26 15:54:34 WARN hdfs.DFSClient: No live nodes contain block BP-1007471777-${NN_HOST}-1579591487193:blk_1073742432_1608 after checking nodes = [], ignoredNodes = null
20/03/26 15:54:34 WARN hdfs.DFSClient: Could not obtain block: BP-1007471777-${NN_HOST}-1579591487193:blk_1073742432_1608 file=/user/oozie/share/lib/lib_20200121152632/distcp/netty-all-4.1.17.Final.jar No live nodes contain current block Block locations: Dead nodes: . Throwing a BlockMissingException
20/03/26 15:54:34 WARN hdfs.DFSClient: DFS Read
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1007471777-${NN_HOST}-1579591487193:blk_1073742432_1608 file=/user/oozie/share/lib/lib_20200121152632/distcp/netty-all-4.1.17.Final.jar
	at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:879)
	at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:862)
	at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:841)
	at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:567)
	at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757)
	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829)
	at java.io.DataInputStream.read(DataInputStream.java:100)
	at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:92)
	at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:66)
	at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:127)
	at org.apache.hadoop.fs.shell.Tail.dumpFromOffset(Tail.java:96)
	at org.apache.hadoop.fs.shell.Tail.processPath(Tail.java:73)
	at org.apache.hadoop.fs.shell.Command.processPathInternal(Command.java:367)
	at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:331)
	at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:304)
	at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:286)
	at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:270)
	at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:119)
	at org.apache.hadoop.fs.shell.Command.run(Command.java:177)
	at org.apache.hadoop.fs.FsShell.run(FsShell.java:326)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
	at org.apache.hadoop.fs.FsShell.main(FsShell.java:389)
tail: Could not obtain block: BP-1007471777-${NN_HOST}-1579591487193:blk_1073742432_1608 file=/user/oozie/share/lib/lib_20200121152632/distcp/netty-all-4.1.17.Final.jar

The telling lines are:
20/03/26 15:54:34 WARN hdfs.DFSClient: Could not obtain block: BP-1007471777-${NN_HOST}-1579591487193:blk_1073742432_1608 file=/user/oozie/share/lib/lib_20200121152632/distcp/netty-all-4.1.17.Final.jar No live nodes contain current block Block locations: Dead nodes: . Throwing a BlockMissingException

org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1007471777-${NN_HOST}-1579591487193:blk_1073742432_1608 file=/user/oozie/share/lib/lib_20200121152632/distcp/

This indicates the block was not found on any node in the current active node list; its replicas lived on nodes that are now dead.

⑥. Root cause & resolution

Root cause:
Before CDH 6.1 was installed, this 6-node Hadoop cluster ran CDH 5.10.2. The likely cause is that the DataNode data directories (dfs.datanode.data.dir) were not cleaned out before the CDH 6.1 install, leaving the new filesystem metadata inconsistent with the blocks on disk.

After an overall assessment, the missing files were judged to have no real impact on the cluster, so the decision was made to simply delete the missing blocks' files via the NameNode:

hdfs fsck / -files -blocks -locations | grep '.jar'|awk '{print $1}'|while read jar; do hdfs fsck -delete $jar ; done

Checking the dfshealth web UI and CM afterwards showed everything back to normal.
