From an Oversized Hive Log File Filling the Disk to a NameNode Safe Mode Problem: Diagnosis and Fix


Contents

Discovering the NameNode Safe Mode Problem

Initial Guess: A Full Disk Caused Safe Mode

What NameNode Safe Mode Means

Additional Explanation

Trying a DataNode Restart

Restarting the NameNode

Summary


Discovering the NameNode Safe Mode Problem

By chance I wanted to check the Hive version, only to find that the hive command could no longer start the Hive CLI. The error was as follows:

ls: cannot access /usr/local/spark/lib/spark-assembly-*.jar: No such file or directory
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: UseCMSCompactAtFullCollection is deprecated and will likely be removed in a future release.
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: UseCMSCompactAtFullCollection is deprecated and will likely be removed in a future release.

Logging initialized using configuration in file:/usr/local/apache-hive-1.2.1-bin/conf/hive-log4j.properties
Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create directory /tmp/scratchdir/hadoop/a6952441-35b8-492b-ac7e-b94b9e0687e2. Name node is in safe mode.
Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE:  If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.

Initial Guess: A Full Disk Caused Safe Mode

The error says the NameNode has entered safe mode. My first thought was that, as last time, the huge Hive log files had filled the disk and that this had pushed the NameNode into safe mode.

Checking disk usage with df -h showed that the /usr partition was indeed full.

Following the same procedure as last time, I went into /usr and listed the 30 largest directories by disk usage:

du -m |sort -n -r|head -30

As in the previous investigation, the biggest consumers were the log files under /usr/local/apache-hive-1.2.1-bin/logs/hadoop/. Inside that directory, a single day's log file had grown to roughly 8 GB.

I then opened hive.log.2019-04-17 and found a large number of Java heap space errors:

Caused by: java.lang.OutOfMemoryError: Java heap space
2019-04-17 16:42:48,669 INFO  [HiveServer2-Background-Pool: Thread-1086796]: log.PerfLogger (PerfLogger.java:PerfLogEnd(148)) - </PERFLOG method=Driver.execute start=1555489632937 end=1555490568669 duration=935732 from=org.apache.hadoop.hive.ql.Driver>
2019-04-17 16:42:48,668 INFO  [HiveServer2-Background-Pool: Thread-1086757]: log.PerfLogger (PerfLogger.java:PerfLogEnd(148)) - </PERFLOG method=Driver.execute start=1555487793258 end=1555490568668 duration=2775410 from=org.apache.hadoop.hive.ql.Driver>
2019-04-17 16:42:48,674 INFO  [HiveServer2-Background-Pool: Thread-1086757]: ql.Driver (SessionState.java:printInfo(951)) - MapReduce Jobs Launched: 
2019-04-17 16:42:48,674 INFO  [HiveServer2-Background-Pool: Thread-1086757]: ql.Driver (SessionState.java:printInfo(951)) - Stage-Stage-9: Map: 12   Cumulative CPU: 160.61 sec   HDFS Read: 7887118 HDFS Write: 983268 SUCCESS
2019-04-17 16:42:48,675 INFO  [HiveServer2-Background-Pool: Thread-1086757]: ql.Driver (SessionState.java:printInfo(951)) - Total MapReduce CPU Time Spent: 2 minutes 40 seconds 610 msec
2019-04-17 16:42:48,680 WARN  [HiveServer2-Handler-Pool: Thread-635777]: cli.CLIService (CLIService.java:getOperationStatus(396)) - OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=92894a39-42a6-4c27-a5b0-cccf51ca4e09]: The background operation was aborted
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:206)
	at org.apache.hive.service.cli.CLIService.getOperationStatus(CLIService.java:387)
	at org.apache.hive.service.cli.thrift.ThriftCLIService.GetOperationStatus(ThriftCLIService.java:610)
	at org.apache.hive.service.cli.thrift.TCLIService$Processor$GetOperationStatus.getResult(TCLIService.java:1473)
	at org.apache.hive.service.cli.thrift.TCLIService$Processor$GetOperationStatus.getResult(TCLIService.java:1458)
	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
	at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space
2019-04-17 16:42:48,681 WARN  [HiveServer2-Handler-Pool: Thread-635777]: cli.CLIService (CLIService.java:getOperationStatus(396)) - OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=92894a39-42a6-4c27-a5b0-cccf51ca4e09]: The background operation was aborted
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space

To get more context on the cause of the OOM, I located the earliest occurrence of the error:

grep -n -C 10 'java.lang.OutOfMemoryError' /usr/local/apache-hive-1.2.1-bin/logs/hadoop/hive.log.2019-04-17 | head -1000

Then, using the line numbers from the matches, I printed the log lines surrounding the first occurrence:

sed -n '349031,349181p' /usr/local/apache-hive-1.2.1-bin/logs/hadoop/hive.log.2019-04-17

There was another error in that region saying, roughly, that namenode01 was in safe mode and namenode02 was in standby; unfortunately I did not save that part of the log.

At the time I wrongly assumed that the full disk was what had caused safe mode, so I deleted all of the useless Hive logs, which brought disk usage down to 62%.

After waiting for a while, the NameNode web UI still showed it stuck in safe mode:

Security is off.
Safe mode is ON. Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE: If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
752478 files and directories, 596637 blocks = 1349115 total filesystem object(s).
Heap Memory used 469.58 MB of 753.5 MB Heap Memory. Max Heap Memory is 889 MB.
Non Heap Memory used 90.51 MB of 93.56 MB Commited Non Heap Memory. Max Non Heap Memory is -1 B.

I then searched for "Name node is in safe mode" and realized how naive my assumption had been.

What NameNode Safe Mode Means

The NameNode enters safe mode when it starts. If the fraction of blocks missing from DataNode reports exceeds (1 - dfs.safemode.threshold.pct), the system stays in safe mode, i.e. read-only.

dfs.safemode.threshold.pct (default 0.999f) means that on HDFS startup the NameNode can only leave safe mode once the number of blocks reported by the DataNodes reaches 0.999 of the number recorded in its metadata; until then it stays in this read-only mode. If the value is set to 1, HDFS effectively never leaves safe mode on its own.
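A quick way to check the value in effect (a sketch; hdfs getconf ships with Hadoop 2.x, where the property also goes by the newer name dfs.namenode.safemode.threshold-pct):

# Print the safe-mode block-report threshold currently configured (default 0.999)
hdfs getconf -confKey dfs.namenode.safemode.threshold-pct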

The following line is taken from a NameNode startup log (the reported block ratio of 1.0000 reached the threshold of 0.9990):

The ratio of reported blocks 1.0000 has reached the threshold 0.9990. Safe mode will be turned off automatically in 18 seconds.

There are two ways to leave this safe mode:

1. Lower dfs.safemode.threshold.pct to a smaller value (the default is 0.999).

2. Force the NameNode out of safe mode with: hadoop dfsadmin -safemode leave
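
For reference, a minimal sketch of the related dfsadmin subcommands (the safemode subcommands below are standard in Hadoop 2.x):

# Check whether the NameNode is currently in safe mode
hdfs dfsadmin -safemode get

# Block in a script until the NameNode leaves safe mode on its own
hdfs dfsadmin -safemode wait

# Force the NameNode out of safe mode manually (does not fix the underlying problem)
hdfs dfsadmin -safemode leave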

Here is another explanation that I found more complete:

Safe mode is a state of the NameNode (active / standby / safemode).
Why the NameNode enters safe mode:
  a. When the NameNode finds that the proportion of missing blocks in the cluster has reached a certain threshold (0.1% with the default settings), it enters safe mode. In safe mode, clients cannot modify any data; they can only read metadata (e.g. ls / mkdir).
  b. How do you leave safe mode?
Find and fix the underlying problem (for example, repair the DataNode that went down),
or force the NameNode out of safe mode manually (which does not actually solve the problem): hdfs dfsadmin -safemode leave
  c. During a normal cold start of an HDFS cluster, the NameNode also stays in safe mode for quite a long time; there is no need to do anything, just wait for it to exit safe mode on its own.
(Why: the NameNode's in-memory metadata contains file paths, replication factors, block IDs and, for each block, the DataNodes that hold it, whereas the fsimage does not contain block-to-DataNode locations.
So on a cold start, the in-memory metadata is loaded only from the fsimage and has no block location information ->
the NameNode considers all blocks missing -> it enters safe mode -> as the DataNodes start up, they periodically report the block IDs they hold ->
as these reports come in, the NameNode fills in the block locations in its in-memory metadata -> once all block locations are known, it exits safe mode automatically.)
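
To see what safe mode is actually waiting on, the standard reporting commands are enough (a sketch; both commands ship with Hadoop 2.x):

# Cluster-wide summary: live/dead DataNodes, capacity, under-replicated and missing blocks
hdfs dfsadmin -report | head -30

# Filesystem check; the summary at the end lists missing, corrupt and under-replicated blocks
hdfs fsck / | tail -30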
 

Additional Explanation

I believe this safe mode episode was also related to the decommissioning test we ran on the test environment a while back. We had started decommissioning some DataNodes there, but because those nodes held too many blocks the decommission was going to take a very long time, so we cancelled it by removing the nodes from the decommission list. During decommissioning, however, the NameNode copies the blocks awaiting replication on the decommissioning DataNode to other nodes. If the decommission is aborted, the extra replicas created along the way stay on those nodes until the DataNode restarts and sends its block report; only then does the NameNode notice the excess replicas and schedule them for deletion from that node. This is probably why the data on some test-environment nodes ended up unbalanced.

Trying a DataNode Restart

Plan: based on what I had found online and my own guess, I decided to restart the DataNode process on a node with high disk utilization, so that it would re-report its blocks to the NameNode and the excess blocks could be deleted.
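
For reference, bouncing a single DataNode is just the per-daemon script on that host (a sketch; hadoop-daemon.sh ships in $HADOOP_HOME/sbin for Hadoop 2.x, and the install path below follows the NameNode's path seen in the logs, assuming the DataNodes use the same layout):

# Run on the DataNode host to restart only the DataNode process
/usr/local/hadoop-2.6.3/sbin/hadoop-daemon.sh stop datanode
/usr/local/hadoop-2.6.3/sbin/hadoop-daemon.sh start datanode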

After the restart, the number and total size of blocks on the node were essentially the same as before, and the NameNode was still in safe mode.

NameNode log output while the DataNode was being restarted:

[hadoop@namenodetest01.bi ~]$ tail -100f /usr/local/hadoop-2.6.3/logs/hadoop-hadoop-namenode-namenodetest01.bi.log |grep 10.104.101.45
2019-04-29 14:56:18,164 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* removeDeadDatanode: lost heartbeat from 10.104.101.45:50010
2019-04-29 14:56:18,489 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/10.104.101.45:50010
2019-04-29 14:59:55,630 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* registerDatanode: from DatanodeRegistration(10.104.101.45, datanodeUuid=6e1fe3f8-bce4-41bc-82f8-aaf554d60abd, infoPort=50075, ipcPort=50020, storageInfo=lv=-56;cid=hadoop2cluster;nsid=1577803622;c=0) storage 6e1fe3f8-bce4-41bc-82f8-aaf554d60abd
2019-04-29 14:59:55,630 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/10.104.101.45:50010
2019-04-29 14:59:55,630 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/10.104.101.45:50010
2019-04-29 14:59:55,630 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: [DISK]DS-6269c9dd-9936-4f13-935a-be590ab8e542:NORMAL:10.104.101.45:50010 failed.
2019-04-29 14:59:55,630 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Removed storage [DISK]DS-6269c9dd-9936-4f13-935a-be590ab8e542:FAILED:10.104.101.45:50010 from DataNode10.104.101.45:50010
2019-04-29 14:59:55,678 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Adding new storage ID DS-6269c9dd-9936-4f13-935a-be590ab8e542 for DN 10.104.101.45:50010
2019-04-29 14:59:55,972 INFO BlockStateChange: BLOCK* processReport: from storage DS-6269c9dd-9936-4f13-935a-be590ab8e542 node DatanodeRegistration(10.104.101.45, datanodeUuid=6e1fe3f8-bce4-41bc-82f8-aaf554d60abd, infoPort=50075, ipcPort=50020, storageInfo=lv=-56;cid=hadoop2cluster;nsid=1577803622;c=0), blocks: 53296, hasStaleStorages: false, processing time: 75 msecs
2019-04-29 15:14:28,733 INFO BlockStateChange: BLOCK* processReport: from storage DS-6269c9dd-9936-4f13-935a-be590ab8e542 node DatanodeRegistration(10.104.101.45, datanodeUuid=6e1fe3f8-bce4-41bc-82f8-aaf554d60abd, infoPort=50075, ipcPort=50020, storageInfo=lv=-56;cid=hadoop2cluster;nsid=1577803622;c=0), blocks: 53296, hasStaleStorages: false, processing time: 23 msecs

Restarting the NameNode

My suspicion was that the block information held in the NameNode's memory had not been refreshed, so the excess blocks were never deleted. I therefore decided to restart the NameNode that was stuck in safe mode. (Addendum: after reading the source, the real reason is that while safe mode is on, the replication queues are not processed, so the block deletion/replication work is never recomputed.)

 /**
   * Periodically calls computeReplicationWork().
   */
  private class ReplicationMonitor implements Runnable {

    @Override
    public void run() {
      while (namesystem.isRunning()) {
        try {
          // Process replication work only when active NN is out of safe mode.
          if (namesystem.isPopulatingReplQueues()) {
            computeDatanodeWork();
            processPendingReplications();
          }
          Thread.sleep(replicationRecheckInterval);
        } catch (Throwable t) {
          if (!namesystem.isRunning()) {
            LOG.info("Stopping ReplicationMonitor.");
            if (!(t instanceof InterruptedException)) {
              LOG.info("ReplicationMonitor received an exception"
                  + " while shutting down.", t);
            }
            break;
          } else if (!checkNSRunning && t instanceof InterruptedException) {
            LOG.info("Stopping ReplicationMonitor for testing.");
            break;
          }
          LOG.fatal("ReplicationMonitor thread received Runtime exception. ", t);
          terminate(1, t);
        }
      }
    }
  }
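
For reference, restarting a single NameNode daemon is typically done with the same per-daemon script on the NameNode host (a sketch; in an HA setup make sure the standby is healthy, or fail over first, before bouncing the active NameNode):

# Run on the NameNode host that is stuck in safe mode
/usr/local/hadoop-2.6.3/sbin/hadoop-daemon.sh stop namenode
/usr/local/hadoop-2.6.3/sbin/hadoop-daemon.sh start namenode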

After the NameNode restart, the web UI showed that safe mode was off.

The logs also showed blocks on the DataNode16 node being deleted:

2019-04-29 16:21:40,993 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.104.101.45:50010 is added to blk_1081360355_7652507{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-01785055-ace5-4380-bb9f-3b7383c15009:NORMAL:10.104.108.172:50010|RBW], ReplicaUnderConstruction[[DISK]DS-b6dfd747-96f9-4f02-871f-715734e22c0b:NORMAL:10.104.132.196:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6269c9dd-9936-4f13-935a-be590ab8e542:NORMAL:10.104.101.45:50010|RBW]]} size 0
2019-04-29 16:21:40,994 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.104.132.196:50010 is added to blk_1081360355_7652507{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-01785055-ace5-4380-bb9f-3b7383c15009:NORMAL:10.104.108.172:50010|RBW], ReplicaUnderConstruction[[DISK]DS-b6dfd747-96f9-4f02-871f-715734e22c0b:NORMAL:10.104.132.196:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6269c9dd-9936-4f13-935a-be590ab8e542:NORMAL:10.104.101.45:50010|RBW]]} size 0
2019-04-29 16:21:40,995 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.104.108.172:50010 is added to blk_1081360355_7652507{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-01785055-ace5-4380-bb9f-3b7383c15009:NORMAL:10.104.108.172:50010|RBW], ReplicaUnderConstruction[[DISK]DS-b6dfd747-96f9-4f02-871f-715734e22c0b:NORMAL:10.104.132.196:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6269c9dd-9936-4f13-935a-be590ab8e542:NORMAL:10.104.101.45:50010|RBW]]} size 0
2019-04-29 16:21:41,152 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.104.101.45:50010 is added to blk_1081360344_7652496{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-01785055-ace5-4380-bb9f-3b7383c15009:NORMAL:10.104.108.172:50010|RBW], ReplicaUnderConstruction[[DISK]DS-2657606b-79b3-4695-a2cf-c6ce775816de:NORMAL:10.104.108.211:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6269c9dd-9936-4f13-935a-be590ab8e542:NORMAL:10.104.101.45:50010|RBW]]} size 231
2019-04-29 16:21:41,153 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.104.108.211:50010 is added to blk_1081360344_7652496{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-01785055-ace5-4380-bb9f-3b7383c15009:NORMAL:10.104.108.172:50010|RBW], ReplicaUnderConstruction[[DISK]DS-2657606b-79b3-4695-a2cf-c6ce775816de:NORMAL:10.104.108.211:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6269c9dd-9936-4f13-935a-be590ab8e542:NORMAL:10.104.101.45:50010|RBW]]} size 231
2019-04-29 16:21:41,154 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.104.108.172:50010 is added to blk_1081360344_7652496{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-01785055-ace5-4380-bb9f-3b7383c15009:NORMAL:10.104.108.172:50010|RBW], ReplicaUnderConstruction[[DISK]DS-2657606b-79b3-4695-a2cf-c6ce775816de:NORMAL:10.104.108.211:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6269c9dd-9936-4f13-935a-be590ab8e542:NORMAL:10.104.101.45:50010|RBW]]} size 231
2019-04-29 16:21:41,385 INFO BlockStateChange: BLOCK* addToInvalidates: blk_1081360332_7652484 10.104.101.45:50010 10.104.108.157:50010 10.104.108.170:50010 
2019-04-29 16:21:41,664 INFO BlockStateChange: BLOCK* BlockManager: ask 10.104.101.45:50010 to delete [blk_1081360332_7652484]

From the web UI, the number of blocks on that node did appear to drop somewhat.

While the NameNode was restarting, users also reported that jobs could not be submitted. Their logs showed that, with the NameNode in safe mode, jobs failed at startup because deleting old history data during container reallocation failed.

Some time after the restart, the NameNode had left safe mode and the DataNodes looked basically normal; once the users resubmitted, their jobs ran fine again.

Summary

  1. Hive frequently produces huge log files that fill the disk; the logs need to be rotated by size and deleted on a schedule (see the sketch after this list).
  2. Restarting only the DataNode did not make the NameNode issue the commands to delete the excess blocks on that node; restarting the NameNode was needed for that to take effect.
  3. While the NameNode is in safe mode, some jobs may fail to submit because they cannot delete their historical data.
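
A minimal cleanup sketch for point 1 (just an illustration; the path follows this article and the 7-day retention window is arbitrary):

# Cron entry: every day at 03:00, delete rotated hive logs older than 7 days
0 3 * * * find /usr/local/apache-hive-1.2.1-bin/logs/hadoop -name 'hive.log.*' -mtime +7 -delete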

References:

https://blog.csdn.net/xfg0218/article/details/51968944?utm_source=blogxgwz8

https://www.cnblogs.com/laodao/p/4484971.html
