Both NameNodes active: data cannot be written

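The cluster ran into an HDFS safe mode problem that prevented data from being written. The cause was corrupt blocks putting the NameNode into safe mode; the corrupt blocks were located and deleted with `hdfs fsck` and repaired with `hdfs debug recoverLease`. An edit log sync error appeared afterwards and was resolved by adjusting `dfs.qjournal.write-txns.timeout.ms` and `ha.health-monitor.rpc-timeout.ms`. The cluster is currently running stably.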

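Background

Our company's cluster in Heilongjiang has been in use for 7 years. Recently it has repeatedly run into a situation where both NameNodes are active and data cannot be written.

Problem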

2022-03-28 11:02:38,318 INFO  ipc.Server (Server.java:run(2172)) - IPC Server handler 49 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 136.192.59.173:43484 Call#4 Retry#3
org.apache.hadoop.ipc.RetriableException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Zero blocklocations for /data/data_asiainfo/hive_data/db_app/flg_asia_serv_info_d/day_id=20220224/000218_0.lzo_deflate. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1810)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Zero blocklocations for /data/data_asiainfo/hive_data/db_app/flg_asia_serv_info_d/day_id=20220224/000218_0.lzo_deflate. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1806)
	... 12 more
2022-03-28 11:02:38,319 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,319 INFO  ipc.Server (Server.java:run(2172)) - IPC Server handler 68 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 136.192.59.178:54519 Call#22650 Retry#2
org.apache.hadoop.ipc.RetriableException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Zero blocklocations for /data/data_asiainfo/hive_data/in_bil/contact_push_msg/day_id=20211120/000162_0.lzo_deflate. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1810)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Zero blocklocations for /data/data_asiainfo/hive_data/in_bil/contact_push_msg/day_id=20211120/000162_0.lzo_deflate. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1806)
	... 12 more
2022-03-28 11:02:38,319 INFO  ipc.Server (Server.java:run(2172)) - IPC Server handler 89 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 136.192.59.132:46251 Call#24119 Retry#1
org.apache.hadoop.ipc.RetriableException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Zero blocklocations for /data/data_asiainfo/hive_data/in_bil/contact_task/day_id=20210924/000197_0.lzo_deflate. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1810)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Zero blocklocations for /data/data_asiainfo/hive_data/in_bil/contact_task/day_id=20210924/000197_0.lzo_deflate. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1806)
	... 12 more
2022-03-28 11:02:38,320 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,320 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,321 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,320 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,321 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,320 INFO  ipc.Server (Server.java:run(2172)) - IPC Server handler 50 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.complete from 136.192.59.66:60785 Call#38290463 Retry#1
org.apache.hadoop.ipc.RetriableException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot complete file /data/oc_guoxin_self_data/boncdb/dwa/dwa_mobile_data_event_new/20220328/0469/20220328100252.txt. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1331)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3494)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:851)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:536)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot complete file /data/oc_guoxin_self_data/boncdb/dwa/dwa_mobile_data_event_new/20220328/0469/20220328100252.txt. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1327)
	... 12 more
2022-03-28 11:02:38,320 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,320 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,320 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,320 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,383 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,322 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,321 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,321 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,321 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,321 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,321 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,321 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,321 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,321 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,385 INFO  ipc.Server (Server.java:run(2172)) - IPC Server handler 39 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 136.192.59.167:49984 Call#22762 Retry#2
org.apache.hadoop.ipc.RetriableException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Zero blocklocations for /data/data_asiainfo/hive_data/in_bil/contact_push_msg/day_id=20211224/000051_0.lzo_deflate. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1810)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Zero blocklocations for /data/data_asiainfo/hive_data/in_bil/contact_push_msg/day_id=20211224/000051_0.lzo_deflate. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1806)
	... 12 more
2022-03-28 11:02:38,321 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,385 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,386 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,384 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,383 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,386 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,386 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,386 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,386 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,387 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,387 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,387 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,387 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,390 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,390 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,390 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,390 INFO  ipc.Server (Server.java:run(2172)) - IPC Server handler 30 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.complete from 136.192.59.69:42099 Call#55478308 Retry#1
org.apache.hadoop.ipc.RetriableException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot complete file /data/oc_guoxin_self_data/boncdb/itf/t_4glog_roam_d/20220328/08/LOGROAM_23_5MIN_2022032808500_20220328080959_20220328080722_mbl_dpi_HrcWhlgq86699.gz. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1331)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3494)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:851)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:536)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot complete file /data/oc_guoxin_self_data/boncdb/itf/t_4glog_roam_d/20220328/08/LOGROAM_23_5MIN_2022032808500_20220328080959_20220328080722_mbl_dpi_HrcWhlgq86699.gz. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1327)
	... 12 more
2022-03-28 11:02:38,390 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,390 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,390 INFO  ipc.Server (Server.java:run(2172)) - IPC Server handler 61 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.complete from 136.192.59.66:60785 Call#38290464 Retry#1
org.apache.hadoop.ipc.RetriableException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot complete file /data/oc_guoxin_self_data/boncdb/dwa/dwa_mobile_data_event_new/20220328/0452/20220328105703.txt. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1331)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3494)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:851)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:536)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot complete file /data/oc_guoxin_self_data/boncdb/dwa/dwa_mobile_data_event_new/20220328/0452/20220328105703.txt. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1327)
	... 12 more
2022-03-28 11:02:38,391 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,392 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,746 INFO  destination.HDFSAuditDestination (HDFSAuditDestination.java:createConfiguration(263)) - Returning HDFS Filesystem Config: Configuration: core-default.xml, core-site.xml, hdfs-default.xml, hdfs-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml
2022-03-28 11:02:38,795 INFO  destination.HDFSAuditDestination (HDFSAuditDestination.java:getLogFileStream(224)) - Checking whether log file exists. hdfPath=hdfs://hljcluster/ranger/audit/hdfs/20220328/hdfs_ranger_audit_hljnn01.asiainfo.com.log, UGI=ocdp (auth:SIMPLE)
2022-03-28 11:02:38,818 INFO  BlockStateChange (BlockManager.java:processReport(1939)) - BLOCK* processReport: from storage DS-dab00d6e-491d-4155-9c29-2967706ef4ed node DatanodeRegistration(136.192.59.179:50010, datanodeUuid=53d93291-9230-4348-8f2d-58357969f186, infoPort=50075, infoSecurePort=0, ipcPort=8010, storageInfo=lv=-56;cid=CID-64402ccb-cbc1-4522-bd2a-bf5883a9e04c;nsid=1379355448;c=0), blocks: 528412, hasStaleStorage: true, processing time: 426 msecs
2022-03-28 11:02:38,818 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,818 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,818 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,820 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,820 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,851 WARN  security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,852 INFO  ipc.Server (Server.java:run(2172)) - IPC Server handler 87 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 136.192.59.178:54446 Call#4 Retry#3
org.apache.hadoop.ipc.RetriableException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Zero blocklocations for /data/data_asiainfo/hive_data/db_app/flg_asia_serv_info_m/month_id=202202/000257_0.lzo_deflate. Name node is in safe mode.
The reported blocks 23282231 needs additional 41919168 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1810)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Zero blocklocations for /data/data_asiainfo/hive_data/db_app/flg_asia_serv_info_m/month_id=202202/000257_0.lzo_deflate. Name node is in safe mode.
The reported blocks 23282231 needs additional 41919168 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1806)
	... 12 more

The cluster has corrupt blocks, which put the NameNode into safe mode. Data cannot be written, and the business is affected.
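
The safe-mode message in the log above carries enough numbers to estimate how far the NameNode is from leaving safe mode on its own. The following is a minimal Python sketch, not part of the original troubleshooting; the function names and the regular expression are illustrative assumptions based only on the message format seen in this log.

```python
import re

# Pattern for the safe-mode status line as it appears in the NameNode log.
SAFE_MODE_RE = re.compile(
    r"The reported blocks (\d+) needs additional (\d+) blocks "
    r"to reach the threshold ([\d.]+) of total blocks (\d+)"
)


def parse_safe_mode_line(line):
    """Return (reported, additional, threshold, total), or None if no match."""
    m = SAFE_MODE_RE.search(line)
    if m is None:
        return None
    reported, additional, threshold, total = m.groups()
    return int(reported), int(additional), float(threshold), int(total)


def safe_mode_progress(line):
    """Summarise how far block reporting has progressed."""
    parsed = parse_safe_mode_line(line)
    if parsed is None:
        raise ValueError("not a safe-mode status line")
    reported, additional, threshold, total = parsed
    return {
        "reported_fraction": reported / total,
        "target_blocks": reported + additional,
        "threshold": threshold,
        "total": total,
    }


line = ("The reported blocks 22878610 needs additional 42322789 blocks "
        "to reach the threshold 0.9900 of total blocks 65859998.")
progress = safe_mode_progress(line)
print(f"{progress['reported_fraction']:.1%} of blocks reported, "
      f"target {progress['target_blocks']} of {progress['total']}")
```

For the first log entry this shows that only about 35% of the blocks had been reported at that moment. A later entry in the same log reports 23282231 blocks, so the count was still rising as block reports were processed.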

Solution

Handling corrupt blocks

# List corrupt blocks
hdfs fsck /user/hadoop-twq/cmd -list-corruptfileblocks
# Delete corrupt blocks (this removes the affected files)
hdfs fsck /data/data_asiainfo/hive_data/in_bil/contact_task/day_id=20220204/ -delete
# Repair: the path must point to a specific file
hdfs debug recoverLease -path /data/data_asiainfo/hive_data/in_bil/contact_task/day_id=20220204/1.file
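
Since `-delete` removes the affected files permanently, it can help to assemble the commands first and review them before running anything. The sketch below only builds argument lists for the commands used above, plus `hdfs dfsadmin -safemode get` to check the current safe-mode state. The helper names are hypothetical, and nothing is executed unless `run=True` is passed.

```python
import subprocess


def fsck_list_corrupt(path):
    """Command that lists corrupt file blocks under `path`."""
    return ["hdfs", "fsck", path, "-list-corruptfileblocks"]


def fsck_delete(path):
    """Command that deletes corrupted files under `path` (destructive)."""
    return ["hdfs", "fsck", path, "-delete"]


def recover_lease(file_path, retries=3):
    """Command that recovers the lease on a single file, not a directory."""
    return ["hdfs", "debug", "recoverLease",
            "-path", file_path, "-retries", str(retries)]


def plan(directory, files):
    """Ordered command list: check safe mode, inspect, then recover leases."""
    commands = [["hdfs", "dfsadmin", "-safemode", "get"],
                fsck_list_corrupt(directory)]
    commands += [recover_lease(f) for f in files]
    return commands


def execute(commands, run=False):
    """Print each command; only invoke it when run=True."""
    for cmd in commands:
        print(" ".join(cmd))
        if run:
            subprocess.run(cmd, check=True)


directory = "/data/data_asiainfo/hive_data/in_bil/contact_task/day_id=20220204/"
execute(plan(directory, [directory + "1.file"]))
```

The destructive `fsck_delete` command is deliberately left out of `plan`, so deletion stays an explicit, separate decision after the listing has been reviewed.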

Tracking the problem

After the previous fix the cluster was fine for a while, but later the following new problem appeared.

2022-04-02 12:01:26,670 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [136.192.59.131:8485, 136.192.59.133:8485, 136.192.59.132:8485], stream=QuorumOutputStream starting at txid 14906286046))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
136.192.59.132:8485: IPC's epoch 288 is less than the last promised epoch 289
	at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
	at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
	at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
	at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
	at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
	at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)

136.192.59.133:8485: IPC's epoch 288 is less than the last promised epoch 289
	at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
	at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
	at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
	at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
	at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
	at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)

136.192.59.131:8485: IPC's epoch 288 is less than the last promised epoch 289
	at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
	at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
	at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
	at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
	at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
	at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)

	at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
	at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
	at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
	at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
	at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
	at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
	at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
	at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
	at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
	at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:647)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setPermission(FSNamesystem.java:1660)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setPermission(NameNodeRpcServer.java:760)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setPermission(ClientNamenodeProtocolServerSideTranslatorPB.java:453)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
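
The `IPC's epoch 288 is less than the last promised epoch 289` lines mean that all three JournalNodes had already promised themselves to a writer with a newer epoch, so writes from the NameNode still using epoch 288 were rejected and no quorum could be reached. The following toy model illustrates that check. It is a simplified sketch with hypothetical class and function names, not the actual Hadoop implementation.

```python
class JournalNode:
    """Toy journal that remembers the highest epoch it has promised."""

    def __init__(self, name, last_promised_epoch):
        self.name = name
        self.last_promised_epoch = last_promised_epoch
        self.edits = []

    def journal(self, writer_epoch, txn):
        # Reject writers whose epoch is older than the promised one.
        if writer_epoch < self.last_promised_epoch:
            raise IOError(
                f"{self.name}: IPC's epoch {writer_epoch} is less than "
                f"the last promised epoch {self.last_promised_epoch}"
            )
        self.edits.append(txn)


def write_to_quorum(nodes, writer_epoch, txn):
    """Return True if a majority of nodes accepted the write."""
    needed = len(nodes) // 2 + 1
    accepted, errors = 0, []
    for node in nodes:
        try:
            node.journal(writer_epoch, txn)
            accepted += 1
        except IOError as exc:
            errors.append(str(exc))
    for err in errors:
        print(err)
    return accepted >= needed


nodes = [JournalNode(f"136.192.59.{i}:8485", 289) for i in (131, 132, 133)]
print(write_to_quorum(nodes, 288, "setPermission"))  # stale writer
print(write_to_quorum(nodes, 289, "setPermission"))  # current writer
```

A newer epoch is issued when another NameNode takes over as the writer, so this error typically shows up on the previously active NameNode after a failover.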

Solution

Modify hdfs-site.xml

#Write timeout in milliseconds when writing to a quorum of remote journals.
dfs.qjournal.write-txns.timeout.ms=60000

Modify core-site.xml

#Timeout for the actual monitorHealth() calls.
ha.health-monitor.rpc-timeout.ms=180000
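
The two settings above are written as key=value pairs, whereas Hadoop's `*-site.xml` files store each setting as a `<property>` element. As a small sketch, the Python snippet below renders the same two values in that form. The `render_properties` helper is a hypothetical name, and the output is only a fragment to place inside the existing `<configuration>` element of the respective file.

```python
import xml.etree.ElementTree as ET


def render_properties(settings):
    """Render a dict of Hadoop settings as <property> XML fragments."""
    fragments = []
    for name, value in settings.items():
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
        fragments.append(ET.tostring(prop, encoding="unicode"))
    return "\n".join(fragments)


# Values taken from the article; one dict per target file.
hdfs_site = {"dfs.qjournal.write-txns.timeout.ms": 60000}
core_site = {"ha.health-monitor.rpc-timeout.ms": 180000}

print("hdfs-site.xml:")
print(render_properties(hdfs_site))
print("core-site.xml:")
print(render_properties(core_site))
```

Changes like these generally take effect only after the affected daemons have been restarted.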

Result

The cluster has been running stably so far.
