Our production NodeManagers (NM) have recently been crashing. The error log shows:

2014-06-19 00:01:22,308 FATAL org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Error: Shutting down
java.util.ConcurrentModificationException
        at java.util.LinkedList$ListItr.checkForComodification(LinkedList.java:761)
        at java.util.LinkedList$ListItr.next(LinkedList.java:696)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource.toString(LocalizedResource.java:120)
        at java.lang.String.valueOf(String.java:2826)
        at java.lang.StringBuilder.append(StringBuilder.java:115)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:656)
2014-06-19 00:01:22,308 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Public cache exiting
2014-06-19 00:03:40,685 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Downloading public rsrc:{ hdfs://bipcluster/tmp/hive-hdfs/hive_2014-06-19_00-05-51_049_5891972191087895437/-mr-10004/a1495555-b0dc-4356-8b68-1c881012e123, 1403107405580, FILE, null }
2014-06-19 00:03:40,685 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
java.util.concurrent.RejectedExecutionException
        at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:1768)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
        at java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:152)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:618)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:514)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:456)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:128)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
        at java.lang.Thread.run(Thread.java:662)
2014-06-19 00:03:40,685 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye.

The NM exits abnormally because of unsynchronized concurrent updates from multiple threads during resource localization.
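The second FATAL entry in the log (RejectedExecutionException thrown from ExecutorCompletionService.submit) is what a thread pool with the default AbortPolicy does when a task is submitted after the pool has been shut down. The sketch below reproduces only that JDK behavior. That the PublicLocalizer shuts its pool down when it logs "Public cache exiting" is an assumption drawn from the log ordering, and the class and method names here are hypothetical.

```java
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical demo class; it does not use any YARN code.
class RejectedAfterShutdownDemo {

    // Returns true if submitting to an already shut-down pool is rejected.
    static boolean submitAfterShutdownIsRejected() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        ExecutorCompletionService<String> queue = new ExecutorCompletionService<>(pool);

        // Assumption: stands in for the worker thread dying and taking its pool with it.
        pool.shutdownNow();

        try {
            // A later event still tries to hand work to the dead pool.
            queue.submit(() -> "downloaded");
            return false;
        } catch (RejectedExecutionException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("rejected: " + submitAfterShutdownIsRejected());
    }
}
```

If that assumption holds, the first exception is the root cause and the second is only a follow-on failure in the dispatcher thread.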
This is a known bug, tracked as:
https://issues.apache.org/jira/browse/YARN-573
Bug description:

Shared data structures in Public Localizer and Private Localizer are not Thread safe.
PublicLocalizer
1) pending accessed by addResource (part of event handling) and run method (as a part of PublicLocalizer.run() ).
PrivateLocalizer (LocalizerRunner?)
1) pending accessed by addResource (part of event handling) and findNextResource (i.remove()).
Also update method should be fixed. It too is sharing pending list.

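Resource localization is driven by two threads, PublicLocalizer and LocalizerRunner: the former manages downloads of public files, the latter downloads of private files. Both operate on pending, so the fix is to add synchronization. This bug has already been fixed in the YARN shipped with CDH 5.2.0.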
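As a rough illustration of "adding synchronization", the sketch below guards a shared pending list with a single lock so that the event-handling thread and the localizer thread never touch it at the same time. It is not the actual YARN-573 patch: the class PendingResources, its String entries, and runDemo are hypothetical, and only the method names addResource and findNextResource are borrowed from the bug description.

```java
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

// Hypothetical stand-in for a localizer's shared "pending" list; not the YARN class.
class PendingResources {
    private final List<String> pending = new LinkedList<>();

    // Called from the event-handling thread.
    synchronized void addResource(String rsrc) {
        pending.add(rsrc);
    }

    // Called from the localizer thread: removes and returns the next entry, or null if empty.
    synchronized String findNextResource() {
        Iterator<String> i = pending.iterator();
        if (!i.hasNext()) {
            return null;
        }
        String next = i.next();
        i.remove();
        return next;
    }

    // Iterating for logging must hold the same lock as the writers.
    @Override
    public synchronized String toString() {
        return pending.toString();
    }

    // Runs one producer and one consumer against the same list; returns how many entries were drained.
    static int runDemo(int n) {
        PendingResources shared = new PendingResources();
        int[] drained = {0};
        Thread producer = new Thread(() -> {
            for (int k = 0; k < n; k++) {
                shared.addResource("rsrc-" + k);
            }
        });
        Thread consumer = new Thread(() -> {
            while (drained[0] < n) {
                if (shared.findNextResource() != null) {
                    drained[0]++;
                } else {
                    Thread.yield();
                }
            }
        });
        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return drained[0];
    }

    public static void main(String[] args) {
        System.out.println("drained " + runDemo(10000) + " entries without an exception");
    }
}
```

A single coarse lock is the simplest choice here. A concurrent collection would also work, as long as every code path that iterates the list, including toString, is covered.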
For what triggers java.util.ConcurrentModificationException, see:

http://examples.javacodegeeks.com/java-basics/exceptions/java-util-concurrentmodificationexception-how-to-handle-concurrent-modification-exception/
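For a quick local reproduction, here is a minimal sketch of the same failure mode seen in the stack trace: a LinkedList is structurally modified while an iterator over it is still in use, so the next call to next() fails in checkForComodification. To keep it deterministic it uses a single thread rather than two racing threads; the class name and list contents are made up for illustration.

```java
import java.util.ConcurrentModificationException;
import java.util.LinkedList;
import java.util.List;

// Hypothetical demo class; it does not use any YARN code.
class CmeDemo {

    // Returns true if modifying the list during iteration raised the exception.
    static boolean triggersCme() {
        List<String> pending = new LinkedList<>();
        pending.add("a");
        pending.add("b");
        pending.add("c");
        try {
            for (String s : pending) {
                if (s.equals("a")) {
                    // Structural change behind the iterator's back.
                    pending.add("d");
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("ConcurrentModificationException thrown: " + triggersCme());
    }
}
```

In the NM case the modification comes from another thread, so the failure is timing-dependent rather than guaranteed on every run.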