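Background
Our company's cluster in Heilongjiang has been running for 7 years. Recently it has repeatedly ended up with both NameNodes active at once (dual-active), leaving data unwritable.
Problem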
2022-03-28 11:02:38,318 INFO ipc.Server (Server.java:run(2172)) - IPC Server handler 49 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 136.192.59.173:43484 Call#4 Retry#3
org.apache.hadoop.ipc.RetriableException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Zero blocklocations for /data/data_asiainfo/hive_data/db_app/flg_asia_serv_info_d/day_id=20220224/000218_0.lzo_deflate. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1810)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Zero blocklocations for /data/data_asiainfo/hive_data/db_app/flg_asia_serv_info_d/day_id=20220224/000218_0.lzo_deflate. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1806)
... 12 more
2022-03-28 11:02:38,319 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,319 INFO ipc.Server (Server.java:run(2172)) - IPC Server handler 68 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 136.192.59.178:54519 Call#22650 Retry#2
org.apache.hadoop.ipc.RetriableException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Zero blocklocations for /data/data_asiainfo/hive_data/in_bil/contact_push_msg/day_id=20211120/000162_0.lzo_deflate. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1810)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Zero blocklocations for /data/data_asiainfo/hive_data/in_bil/contact_push_msg/day_id=20211120/000162_0.lzo_deflate. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1806)
... 12 more
2022-03-28 11:02:38,319 INFO ipc.Server (Server.java:run(2172)) - IPC Server handler 89 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 136.192.59.132:46251 Call#24119 Retry#1
org.apache.hadoop.ipc.RetriableException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Zero blocklocations for /data/data_asiainfo/hive_data/in_bil/contact_task/day_id=20210924/000197_0.lzo_deflate. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1810)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Zero blocklocations for /data/data_asiainfo/hive_data/in_bil/contact_task/day_id=20210924/000197_0.lzo_deflate. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1806)
... 12 more
2022-03-28 11:02:38,320 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,320 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,321 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,320 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,321 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,320 INFO ipc.Server (Server.java:run(2172)) - IPC Server handler 50 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.complete from 136.192.59.66:60785 Call#38290463 Retry#1
org.apache.hadoop.ipc.RetriableException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot complete file /data/oc_guoxin_self_data/boncdb/dwa/dwa_mobile_data_event_new/20220328/0469/20220328100252.txt. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1331)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3494)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:851)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:536)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot complete file /data/oc_guoxin_self_data/boncdb/dwa/dwa_mobile_data_event_new/20220328/0469/20220328100252.txt. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1327)
... 12 more
2022-03-28 11:02:38,320 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,320 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,320 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,320 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,383 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,322 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,321 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,321 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,321 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,321 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,321 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,321 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,321 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,321 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,385 INFO ipc.Server (Server.java:run(2172)) - IPC Server handler 39 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 136.192.59.167:49984 Call#22762 Retry#2
org.apache.hadoop.ipc.RetriableException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Zero blocklocations for /data/data_asiainfo/hive_data/in_bil/contact_push_msg/day_id=20211224/000051_0.lzo_deflate. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1810)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Zero blocklocations for /data/data_asiainfo/hive_data/in_bil/contact_push_msg/day_id=20211224/000051_0.lzo_deflate. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1806)
... 12 more
2022-03-28 11:02:38,321 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,385 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,386 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,384 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,383 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,386 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,386 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,386 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,386 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,387 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,387 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,387 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,387 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,390 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,390 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,390 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,390 INFO ipc.Server (Server.java:run(2172)) - IPC Server handler 30 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.complete from 136.192.59.69:42099 Call#55478308 Retry#1
org.apache.hadoop.ipc.RetriableException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot complete file /data/oc_guoxin_self_data/boncdb/itf/t_4glog_roam_d/20220328/08/LOGROAM_23_5MIN_2022032808500_20220328080959_20220328080722_mbl_dpi_HrcWhlgq86699.gz. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1331)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3494)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:851)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:536)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot complete file /data/oc_guoxin_self_data/boncdb/itf/t_4glog_roam_d/20220328/08/LOGROAM_23_5MIN_2022032808500_20220328080959_20220328080722_mbl_dpi_HrcWhlgq86699.gz. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1327)
... 12 more
2022-03-28 11:02:38,390 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,390 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,390 INFO ipc.Server (Server.java:run(2172)) - IPC Server handler 61 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.complete from 136.192.59.66:60785 Call#38290464 Retry#1
org.apache.hadoop.ipc.RetriableException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot complete file /data/oc_guoxin_self_data/boncdb/dwa/dwa_mobile_data_event_new/20220328/0452/20220328105703.txt. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1331)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3494)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:851)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:536)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot complete file /data/oc_guoxin_self_data/boncdb/dwa/dwa_mobile_data_event_new/20220328/0452/20220328105703.txt. Name node is in safe mode.
The reported blocks 22878610 needs additional 42322789 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1327)
... 12 more
2022-03-28 11:02:38,391 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,392 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,746 INFO destination.HDFSAuditDestination (HDFSAuditDestination.java:createConfiguration(263)) - Returning HDFS Filesystem Config: Configuration: core-default.xml, core-site.xml, hdfs-default.xml, hdfs-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml
2022-03-28 11:02:38,795 INFO destination.HDFSAuditDestination (HDFSAuditDestination.java:getLogFileStream(224)) - Checking whether log file exists. hdfPath=hdfs://hljcluster/ranger/audit/hdfs/20220328/hdfs_ranger_audit_hljnn01.asiainfo.com.log, UGI=ocdp (auth:SIMPLE)
2022-03-28 11:02:38,818 INFO BlockStateChange (BlockManager.java:processReport(1939)) - BLOCK* processReport: from storage DS-dab00d6e-491d-4155-9c29-2967706ef4ed node DatanodeRegistration(136.192.59.179:50010, datanodeUuid=53d93291-9230-4348-8f2d-58357969f186, infoPort=50075, infoSecurePort=0, ipcPort=8010, storageInfo=lv=-56;cid=CID-64402ccb-cbc1-4522-bd2a-bf5883a9e04c;nsid=1379355448;c=0), blocks: 528412, hasStaleStorage: true, processing time: 426 msecs
2022-03-28 11:02:38,818 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,818 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,818 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,820 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,820 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_guoxin
2022-03-28 11:02:38,851 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user oc_asiainfo
2022-03-28 11:02:38,852 INFO ipc.Server (Server.java:run(2172)) - IPC Server handler 87 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 136.192.59.178:54446 Call#4 Retry#3
org.apache.hadoop.ipc.RetriableException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Zero blocklocations for /data/data_asiainfo/hive_data/db_app/flg_asia_serv_info_m/month_id=202202/000257_0.lzo_deflate. Name node is in safe mode.
The reported blocks 23282231 needs additional 41919168 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1810)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Zero blocklocations for /data/data_asiainfo/hive_data/db_app/flg_asia_serv_info_m/month_id=202202/000257_0.lzo_deflate. Name node is in safe mode.
The reported blocks 23282231 needs additional 41919168 blocks to reach the threshold 0.9900 of total blocks 65859998.
The number of live datanodes 53 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1806)
... 12 more
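The cluster has corrupt blocks, which caused the NameNode to enter safe mode (the log shows only 22878610 of 65859998 blocks reported, against a threshold of 0.9900). Data cannot be written, which affects the business.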
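To see how far the NameNode is from leaving safe mode, the numbers in the safe-mode message can be turned into a percentage. The following is a minimal sketch, not part of the original troubleshooting; the function name `safemode_progress` is made up for illustration, and it only parses the message text shown in the log above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize safe-mode progress from NameNode log text.
# Reads stdin and looks for lines of the form
#   "The reported blocks N needs additional M blocks to reach the threshold T of total blocks K."
safemode_progress() {
  # Pull out N, M, T, K, then print reported/total as a percentage.
  sed -n 's/.*reported blocks \([0-9]*\) needs additional \([0-9]*\) blocks to reach the threshold \([0-9.]*\) of total blocks \([0-9]*\).*/\1 \2 \3 \4/p' |
    awk '{ printf "reported=%d total=%d pct=%.1f threshold_pct=%.1f\n", $1, $4, 100 * $1 / $4, 100 * $3 }'
}
```

Piping the first message from the log above through `safemode_progress` shows that only about a third of the blocks had been reported at that moment. On a live cluster, `hdfs dfsadmin -safemode get` reports whether safe mode is currently on.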
Solution
Handle the corrupt blocks
# List corrupt blocks
hdfs fsck /user/hadoop-twq/cmd -list-corruptfileblocks
# Delete files that have corrupt blocks
hdfs fsck /data/data_asiainfo/hive_data/in_bil/contact_task/day_id=20220204/ -delete
# Repair requires a path to a specific file, not a directory
hdfs debug recoverLease -path /data/data_asiainfo/hive_data/in_bil/contact_task/day_id=20220204/1.file
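Because `fsck -delete` removes the affected files outright, it can help to save the list of affected paths first. A small sketch under one assumption: that `-list-corruptfileblocks` prints one line per corrupt block in the form `blk_<id><TAB><path>` (check this against the output of your Hadoop version). The function name `corrupt_paths` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the unique file paths from the output of
#   hdfs fsck <dir> -list-corruptfileblocks
# Assumption: each corrupt block appears as "blk_<id><TAB><path>";
# header and summary lines do not start with "blk_" and are skipped.
corrupt_paths() {
  awk -F'\t' '/^blk_/ { print $2 }' | sort -u
}

# Usage on a host with the hdfs client configured (the output file name is arbitrary):
#   hdfs fsck /data -list-corruptfileblocks | corrupt_paths > corrupt_files.txt
```

Reviewing the saved list before running `-delete` makes it clear which files will need to be regenerated from upstream data.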
Tracking the problem
After the previous fix things were fine for a while, but later the following new problem appeared:
2022-04-02 12:01:26,670 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [136.192.59.131:8485, 136.192.59.133:8485, 136.192.59.132:8485], stream=QuorumOutputStream starting at txid 14906286046))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
136.192.59.132:8485: IPC's epoch 288 is less than the last promised epoch 289
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
136.192.59.133:8485: IPC's epoch 288 is less than the last promised epoch 289
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
136.192.59.131:8485: IPC's epoch 288 is less than the last promised epoch 289
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:647)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setPermission(FSNamesystem.java:1660)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setPermission(NameNodeRpcServer.java:760)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setPermission(ClientNamenodeProtocolServerSideTranslatorPB.java:453)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
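The key line in this trace is `IPC's epoch 288 is less than the last promised epoch 289`. Each JournalNode remembers the highest epoch it has promised to a writer and rejects requests that carry an older one, so this NameNode (epoch 288) was still trying to write after another NameNode had taken over with epoch 289. The sketch below is a simplified model of that comparison only, not the actual `Journal.checkRequest` code; `check_epoch` is a made-up name.

```shell
#!/usr/bin/env bash
# Simplified model of the JournalNode fencing check (illustration only).
# $1 = epoch carried by the writer's request
# $2 = last epoch the JournalNode has promised
# Returns 0 if the request would be accepted, 1 if it would be rejected.
check_epoch() {
  local writer_epoch=$1 last_promised=$2
  if [ "$writer_epoch" -lt "$last_promised" ]; then
    echo "IPC's epoch $writer_epoch is less than the last promised epoch $last_promised"
    return 1
  fi
  return 0
}
```

With the values from the log, `check_epoch 288 289` is rejected on all three JournalNodes, so the quorum of 2/3 cannot be reached and the flush fails. Which NameNode is currently active can be checked with `hdfs haadmin -getServiceState <serviceId>`, using the NameNode IDs from your own configuration.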
Solution
Modify hdfs-site.xml
#Write timeout in milliseconds when writing to a quorum of remote journals.
dfs.qjournal.write-txns.timeout.ms=60000
Modify core-site.xml
#Timeout for the actual monitorHealth() calls.
ha.health-monitor.rpc-timeout.ms=180000
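The two settings above are written as `key=value`, but `hdfs-site.xml` and `core-site.xml` expect `<property>` elements. The helper below is a small sketch that converts the shorthand into that form; `kv_to_property` is a hypothetical name, and the generated stanzas still have to be pasted inside the `<configuration>` element of the right file by hand (or set through your cluster manager).

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn "name=value" lines into Hadoop <property> stanzas.
# Blank lines and lines starting with '#' are skipped.
kv_to_property() {
  while IFS='=' read -r name value; do
    case "$name" in
      '' | '#'*) continue ;;
    esac
    printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$name" "$value"
  done
}

# Usage:
#   echo 'dfs.qjournal.write-txns.timeout.ms=60000' | kv_to_property   # for hdfs-site.xml
#   echo 'ha.health-monitor.rpc-timeout.ms=180000'  | kv_to_property   # for core-site.xml
```

After restarting the affected services, `hdfs getconf -confKey dfs.qjournal.write-txns.timeout.ms` should print the new value if the change was picked up.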
Result
Running stably so far.