2014-05-12 was fated to be an eventful day. After 595 days of fault-free service, our Hadoop server finally broke. Nobody had logged in or touched the machine beforehand; the failure happened on its own, and at the time the cause was unknown.
Symptom: the NameNode would not start. The error log follows:
2014-05-12 07:17:39,447 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = DC.aws/127.0.0.1
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.205.0
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205 -r 1179940; compiled by 'hortonfo' on Fri Oct 7 06:20:32 UTC 2011
************************************************************/
2014-05-12 07:17:39,600 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2014-05-12 07:17:39,613 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source MetricsSystem,sub=Stats registered.
2014-05-12 07:17:39,614 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2014-05-12 07:17:39,614 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system started
2014-05-12 07:17:39,764 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi registered.
2014-05-12 07:17:39,773 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source jvm registered.
2014-05-12 07:17:39,774 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source NameNode registered.
2014-05-12 07:17:39,800 INFO org.apache.hadoop.hdfs.util.GSet: VM type = 64-bit
2014-05-12 07:17:39,800 INFO org.apache.hadoop.hdfs.util.GSet: 2% max memory = 17.77875 MB
2014-05-12 07:17:39,800 INFO org.apache.hadoop.hdfs.util.GSet: capacity = 2^21 = 2097152 entries
2014-05-12 07:17:39,800 INFO org.apache.hadoop.hdfs.util.GSet: recommended=2097152, actual=2097152
2014-05-12 07:17:39,823 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=root
2014-05-12 07:17:39,823 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2014-05-12 07:17:39,823 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
2014-05-12 07:17:39,829 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.block.invalidate.limit=100
2014-05-12 07:17:39,829 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
2014-05-12 07:17:40,045 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStateMBean and NameNodeMXBean
2014-05-12 07:17:40,065 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names occuring more than 10 times
2014-05-12 07:17:40,078 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 3349287
2014-05-12 07:18:01,677 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at java.io.DataInputStream.readLong(DataInputStream.java:399)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:902)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:817)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:362)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:97)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:384)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:358)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:497)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1268)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1277)
2014-05-12 07:18:01,678 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at java.io.DataInputStream.readLong(DataInputStream.java:399)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:902)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:817)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:362)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:97)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:384)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:358)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:497)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1268)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1277)
2014-05-12 07:18:01,679 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at DC.aws/127.0.0.1
************************************************************/
I searched for ages and found no ready-made solution. We run a pseudo-distributed setup, so what are we supposed to do?
Formatting the NameNode would be the nuclear option, and that I simply cannot bring myself to do...
Additional notes:
The fsimage file on my NameNode is 445 MB.
The fsimage file on my SecondaryNameNode is 281 MB.
The two clearly differ. I now have something of a lead and am working on rescuing the server.
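The size comparison above can be scripted. This is a minimal sketch; the helper name `compare_fsimage` and the example paths are my own inventions, and in practice the directories come from `dfs.name.dir` (NameNode) and `fs.checkpoint.dir` (SecondaryNameNode) in your configuration:

```shell
# compare_fsimage DIR...  -- print the byte size of the fsimage under
# each given storage directory's current/ subdirectory.
compare_fsimage() {
    for d in "$@"; do
        if [ -f "$d/current/fsimage" ]; then
            printf '%s: %s bytes\n' "$d/current/fsimage" \
                "$(wc -c < "$d/current/fsimage" | tr -d ' ')"
        else
            # Missing fsimage is itself a red flag worth reporting.
            echo "no fsimage under $d/current" >&2
        fi
    done
}

# Example invocation; substitute your real dfs.name.dir / fs.checkpoint.dir:
compare_fsimage /data/hadoop/dfs/name /data/hadoop/dfs/namesecondary
```

A large size gap between the two copies, as seen here (445 MB vs 281 MB), points to a checkpoint that is well behind the live image.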
----------------------------------------
After a battle of more than two hours, it is finally fixed... most of that time was actually spent tracking down and talking to the developer who had since left the company.
Comparing the sizes of the current directories and fsimage files on the SNN and the NN, I found they had diverged, which strongly suggests data had already been lost. In that situation, the only option was the approach below, to minimize the loss and get the service back up as quickly as possible.
The core of the fix:
Overwrite the NN's current directory and fsimage with the SNN's copies. Naturally, backing up before overwriting is mandatory for any ops engineer! Also make sure to discuss this with the developers and management, and only proceed once the risk has been agreed on.
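The overwrite-with-backup step can be sketched roughly as follows. This is an assumption-laden outline, not a tested recovery tool: the function name `restore_from_checkpoint` and the example paths are hypothetical, the real directories are your `dfs.name.dir` and `fs.checkpoint.dir` values, and HDFS must be fully stopped before running anything like this:

```shell
# restore_from_checkpoint NN_DIR SNN_DIR BACKUP_DIR
# Archives the NameNode's (damaged) current/ directory, then replaces it
# with the SecondaryNameNode's last checkpoint copy.
restore_from_checkpoint() {
    nn=$1; snn=$2; backup=$3
    stamp=$(date +%Y%m%d-%H%M%S)
    # Step 1: back up the damaged metadata first; never skip this.
    tar -czf "$backup/nn-current-$stamp.tar.gz" -C "$nn" current || return 1
    # Step 2: swap in the SNN checkpoint as the new NN metadata.
    rm -rf "$nn/current"
    cp -a "$snn/current" "$nn/current"
}

# Example (hypothetical paths; stop the cluster first, then restart
# with start-dfs.sh and verify with "hadoop fsck /" afterwards):
#   restore_from_checkpoint /data/hadoop/dfs/name \
#       /data/hadoop/dfs/namesecondary /root/backup
```

The backup matters because this procedure is destructive: if the SNN checkpoint turns out to be unusable too, the tarball is the only way back to the starting state.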
Drawback: this cannot recover 100% of the data. Everything written after the SNN's last checkpoint is necessarily lost.
Improvements: move to a genuinely distributed deployment to avoid the single point of storage. Or change the architecture altogether: discuss with the developers about replacing HDFS with a plain ext4 filesystem and writing new supporting code for it.
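Short of a full re-architecture, one mitigation available even on 0.20 is to list more than one directory in `dfs.name.dir`: the NameNode then writes its fsimage and edits to every listed directory, so a second local disk or an NFS mount gives you a surviving copy when one goes bad. A sketch for hdfs-site.xml; the paths here are assumptions:

```xml
<property>
  <name>dfs.name.dir</name>
  <!-- Comma-separated list: NameNode metadata is mirrored to each directory. -->
  <value>/data/hadoop/dfs/name,/mnt/nfs/hadoop/dfs/name</value>
</property>
```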
Finally, my thanks to 广州-no-python (QQ withheld until I have permission to share it) and 北京-乾坤-运维 (QQ likewise withheld) for their generous help. Both of them spent their own valuable time guiding my troubleshooting, and I admire them greatly. I will learn from their example and help other ops folks in trouble in the future!
Reposted from: https://blog.51cto.com/jishuweiwang/1409901