Today I found that one of our TaskTrackers had been placed on the Graylisted Nodes list. Its log showed the following error:
2015-04-10 11:53:49,539 INFO org.apache.hadoop.mapred.JobLocalizer: Initializing user rc on this TT.
2015-04-10 11:53:49,549 WARN org.apache.hadoop.mapred.TaskTracker: Exception while localization ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:692)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:647)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
at org.apache.hadoop.mapred.JobLocalizer.initializeJobLogDir(JobLocalizer.java:239)
at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:196)
at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1231)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083)
at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1206)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1121)
at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2406)
at java.lang.Thread.run(Thread.java:662)
2015-04-10 11:53:49,549 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:adlog cause:ENOENT: No such file or directory
2015-04-10 11:53:49,549 WARN org.apache.hadoop.mapred.TaskTracker: Error initializing attempt_201411061445_1566728_m_000202_0:
ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:692)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:647)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
at org.apache.hadoop.mapred.JobLocalizer.initializeJobLogDir(JobLocalizer.java:239)
at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:196)
at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1231)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083)
at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1206)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1121)
at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2406)
at java.lang.Thread.run(Thread.java:662)
2015-04-10 11:53:49,549 ERROR org.apache.hadoop.mapred.TaskStatus: Trying to set finish time for task attempt_201411061445_1566728_m_000202_0 when no start time is set, stackTrace is : java.lang.Exception
at org.apache.hadoop.mapred.TaskStatus.setFinishTime(TaskStatus.java:154)
at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.kill(TaskTracker.java:3118)
at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2416)
at java.lang.Thread.run(Thread.java:662)
Restarting the TaskTracker did not help; the same error reappeared.
Following the stack trace into the source code showed that a directory under userlogs could not be created. Inspecting that directory revealed a huge number of subdirectories; counting them with `wc -l` gave: 31998.
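A quick way to reproduce that count (the path here is a throwaway stand-in created for the demo; on a real TaskTracker it would be something like `${HADOOP_LOG_DIR}/userlogs`):

```shell
# Stand-in for the real userlogs directory, created just for this demo:
userlogs=$(mktemp -d)
mkdir "$userlogs/job_1" "$userlogs/job_2" "$userlogs/job_3"

# Count first-level subdirectories:
count=$(find "$userlogs" -mindepth 1 -maxdepth 1 -type d | wc -l)
echo "$count"   # 3

# On ext2/ext3/ext4 a directory's hard-link count is 2 + (number of
# subdirectories), so stat gives the same answer without a full listing.
# (Some filesystems, e.g. btrfs, always report 1 for directories.)
stat -c %h "$userlogs"
```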
A web search turned up the explanation: a single directory can hold at most 31998 first-level subdirectories (so the cause was clear: the directory count had hit the limit). The number comes from the filesystem's cap on hard links per inode, which on ext3 is 32000.
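The arithmetic behind 31998: a directory always carries two links that are not subdirectories (its entry in the parent and its own `.`), and every subdirectory adds one more via its `..` entry, so the ext3 link cap of 32000 leaves room for 31998 subdirectories:

```shell
# ext3's per-inode hard-link cap:
EXT3_LINK_MAX=32000
# Two links are always taken: the directory's entry in its parent,
# and its own "." entry. Each subdirectory adds one ".." link.
max_subdirs=$((EXT3_LINK_MAX - 2))
echo "$max_subdirs"   # 31998
```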
With the cause identified, I deleted the stale entries under that directory (they all belonged to jobs that had already completed).
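The cleanup can be scripted. The sketch below uses a throwaway directory as a stand-in for userlogs and a one-day age cutoff; both are assumptions. On a real node, print the candidate list first and confirm the jobs are really finished before removing anything:

```shell
# Stand-in for the real userlogs directory (an assumption for this sketch):
userlogs=$(mktemp -d)
mkdir "$userlogs/job_old" "$userlogs/job_new"
touch -d '3 days ago' "$userlogs/job_old"   # simulate a stale job directory

# First list job directories older than one day, then remove them:
find "$userlogs" -mindepth 1 -maxdepth 1 -type d -mtime +1 -print
find "$userlogs" -mindepth 1 -maxdepth 1 -type d -mtime +1 -exec rm -rf {} +

remaining=$(ls "$userlogs")
echo "$remaining"   # job_new
```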
After restarting the TaskTracker, the node returned to normal.
Summary: under normal operation the directories under userlogs are deleted automatically. On this machine a full disk had caused a number of tasks to fail, and that is most likely the root cause: the failed tasks never removed their directories under userlogs, so the count eventually hit the filesystem limit.
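Since the root cause was cleanup falling behind, one preventive knob is to shorten how long the TaskTracker retains user logs. The property below exists in Hadoop 1.x MRv1 (which the stack trace above belongs to); the 6-hour value is only an example, not a recommendation:

```xml
<!-- mapred-site.xml: purge task user logs after 6 hours instead of the
     default 24, so stale directories under userlogs accumulate more slowly. -->
<property>
  <name>mapred.userlog.retain.hours</name>
  <value>6</value>
</property>
```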