Today our production ResourceManager (RM) raised an alert.
The error log was as follows:
    2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:xxxx:53356 Timed out after 600 secs
    2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node xxxx:53356 as it is now LOST
    2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: xxxx:53356 Node Transitioned from UNHEALTHY to LOST
    2014-07-08 13:22:54,118 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_REMOVED to the scheduler
    java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeNode(FairScheduler.java:715)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:974)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:108)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:378)
        at java.lang.Thread.run(Thread.java:662)
    2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
    2014-07-08 13:22:54,119 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 1000
    2014-07-08 13:22:54,119 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 2000
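This is a known bug, tracked as YARN-502: https://issues.apache.org/jira/browse/YARN-502
According to the bug description, it can be triggered when the RM removes a NodeManager (NM) that is marked UNHEALTHY: the node has already been removed from the scheduler once, so a later attempt to remove it again fails.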
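To make the failure mode concrete, here is a minimal sketch of how removing the same node twice can end in a NullPointerException. The class `DoubleRemoveSketch`, its map-based bookkeeping, and the 8192 MB capacity are all hypothetical and are not taken from FairScheduler; the actual statement at FairScheduler.java:715 is not shown in this post, so the exact null dereference there is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a scheduler's node bookkeeping; NOT the real FairScheduler.
class DoubleRemoveSketch {
    // Per-node capacity in MB, keyed by a host:port string.
    private final Map<String, Integer> nodeCapacityMb = new HashMap<>();
    private int clusterCapacityMb = 0;

    void addNode(String nodeId, int capacityMb) {
        nodeCapacityMb.put(nodeId, capacityMb);
        clusterCapacityMb += capacityMb;
    }

    // Unguarded removal: for a node that is no longer tracked, remove()
    // returns null and unboxing it throws NullPointerException.
    void removeNode(String nodeId) {
        Integer capacity = nodeCapacityMb.remove(nodeId);
        clusterCapacityMb -= capacity; // NPE when capacity is null
    }

    // Returns true if removing the same node twice throws NullPointerException.
    static boolean secondRemoveThrows() {
        DoubleRemoveSketch scheduler = new DoubleRemoveSketch();
        scheduler.addNode("xxxx:53356", 8192);
        scheduler.removeNode("xxxx:53356"); // first removal (node became UNHEALTHY)
        try {
            scheduler.removeNode("xxxx:53356"); // second removal (node became LOST)
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("second remove throws NPE: " + secondRemoveThrows());
    }
}
```

The first call succeeds because the node is still tracked; the second finds nothing to remove, which mirrors the UNHEALTHY to LOST sequence in the log above.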
Following the stack trace into the code:
    // In org.apache.hadoop.yarn.server.resourcemanager.ResourceManager; the
    // scheduler field has type
    // org.apache.hadoop.yarn.server.resourcemanager.scheduler.ResourceScheduler.
    protected ResourceScheduler scheduler;

    private final class EventProcessor implements Runnable {
      // An EventProcessor thread is started to process events
      @Override
      public void run() {
        SchedulerEvent event;
        while (!stopped && !Thread.currentThread().isInterrupted()) {
          try {
            event = eventQueue.take(); // take an event off the event queue
          } catch (InterruptedException e) {
            LOG.error("Returning, interrupted : " + e);
            return; // TODO: Kill RM.
          }
          try {
            scheduler.handle(event); // handle the event
          } catch (Throwable t) {
            // Catch anything thrown while handling the event
            // An error occurred, but we are shutting down anyway.
            // If it was an InterruptedException, the very act of
            // shutdown could have caused it and is probably harmless.
            if (stopped) {
              LOG.warn("Exception during shutdown: ", t);
              break;
            }
            LOG.fatal("Error in handling event type "
                + event.getType() // per the log, event.getType() here is NODE_REMOVED
                + " to the scheduler", t);
            if (shouldExitOnError
                && !ShutdownHookManager.get().isShutdownInProgress()) {
              LOG.info("Exiting, bbye..");
              System.exit(-1);
            }
          }
        }
      }
    }
As the code shows, shouldExitOnError controls whether the RM process exits when event handling fails.
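The effect of that flag can be sketched with a stripped-down event loop. Everything here is hypothetical: `ExitOnErrorSketch`, the string event names, and the injected `exitAction` (standing in for `System.exit(-1)` so the sketch can be run without killing the JVM). Unlike the real code, it catches only RuntimeException rather than Throwable.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical, simplified event loop; NOT the real SchedulerEventDispatcher.
class ExitOnErrorSketch {
    private final boolean shouldExitOnError;
    private final Runnable exitAction; // stands in for System.exit(-1)

    ExitOnErrorSketch(boolean shouldExitOnError, Runnable exitAction) {
        this.shouldExitOnError = shouldExitOnError;
        this.exitAction = exitAction;
    }

    // Processes events in order and returns how many were handled successfully.
    int process(List<String> events, Consumer<String> handler) {
        int handled = 0;
        for (String event : events) {
            try {
                handler.accept(event);
                handled++;
            } catch (RuntimeException e) {
                System.err.println("Error in handling event type " + event);
                if (shouldExitOnError) {
                    exitAction.run();
                    return handled; // a real System.exit would never return
                }
                // Flag is false: log the error and move on to the next event.
            }
        }
        return handled;
    }

    // Runs three events where NODE_REMOVED always fails with an NPE.
    static int run(boolean exitOnError) {
        ExitOnErrorSketch loop = new ExitOnErrorSketch(exitOnError,
                () -> System.err.println("Exiting, bbye.."));
        List<String> events = Arrays.asList("NODE_ADDED", "NODE_REMOVED", "NODE_UPDATE");
        return loop.process(events, event -> {
            if (event.equals("NODE_REMOVED")) {
                throw new NullPointerException("node already removed");
            }
        });
    }

    public static void main(String[] args) {
        System.out.println("handled with exit-on-error:    " + run(true));
        System.out.println("handled without exit-on-error: " + run(false));
    }
}
```

With the flag set, the loop stops at the first failing event; with it cleared, the failure is logged and later events are still processed.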
    private boolean shouldExitOnError = false; // initially false

    @Override
    public synchronized void init(Configuration conf) {
      // During initialization the value is read from the configuration
      this.shouldExitOnError =
          conf.getBoolean(Dispatcher.DISPATCHER_EXIT_ON_ERROR_KEY,
              Dispatcher.DEFAULT_DISPATCHER_EXIT_ON_ERROR); // both constants are defined in Dispatcher
      super.init(conf);
    }
The org.apache.hadoop.yarn.event.Dispatcher interface:

    public interface Dispatcher {
      // Configuration to make sure dispatcher crashes but doesn't do system-exit in
      // case of errors. By default, it should be false, so that tests are not
      // affected. For all daemons it should be explicitly set to true so that
      // daemons can crash instead of hanging around.
      public static final String DISPATCHER_EXIT_ON_ERROR_KEY =
          "yarn.dispatcher.exit-on-error"; // the controlling parameter
      public static final boolean DEFAULT_DISPATCHER_EXIT_ON_ERROR = false; // defaults to false

      EventHandler getEventHandler();

      void register(Class<? extends Enum> eventType, EventHandler handler);
    }
In the init method of the ResourceManager class:
    @Override
    public synchronized void init(Configuration conf) {
      this.conf = conf;
      // For the RM the value is set to true here, overriding the
      // DEFAULT setting in the Dispatcher interface
      this.conf.setBoolean(Dispatcher.DISPATCHER_EXIT_ON_ERROR_KEY, true);
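In other words, by default the RM exits when it hits a dispatcher error.
Whether to exit on error is governed by the yarn.dispatcher.exit-on-error parameter (which, as the excerpt above shows, ResourceManager.init sets to true). Changing that behavior would have a broad impact, so it is better to leave it alone and fix the problem with a patch.
The official patch is fairly simple: when removing the NM, it adds a check to prevent a second removal.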
    --- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
    +++ hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
    @@ -501,8 +501,13 @@ public DeactivateNodeTransition(NodeState finalState) {
         public void transition(RMNodeImpl rmNode, RMNodeEvent event) {
           // Inform the scheduler
           rmNode.nodeUpdateQueue.clear();
    -      rmNode.context.getDispatcher().getEventHandler().handle(
    -          new NodeRemovedSchedulerEvent(rmNode));
    +      // If the current state is NodeState.UNHEALTHY
    +      // Then node is already been removed from the
    +      // Scheduler
    +      if (!rmNode.getState().equals(NodeState.UNHEALTHY)) {
    +        rmNode.context.getDispatcher().getEventHandler()
    +          .handle(new NodeRemovedSchedulerEvent(rmNode));
    +      }
           rmNode.context.getDispatcher().getEventHandler().handle(
               new NodesListManagerEvent(
                   NodesListManagerEventType.NODE_UNUSABLE, rmNode));
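The idea behind the patch can be sketched in isolation. This is a toy model under stated assumptions, not the real RMNodeImpl state machine: the class `GuardedDeactivateSketch`, its three-state enum, and the string event list are hypothetical. It follows the patch comment in assuming that a node entering UNHEALTHY has already been removed from the scheduler.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of the guard in the patch; NOT the real RMNodeImpl.
class GuardedDeactivateSketch {
    enum NodeState { RUNNING, UNHEALTHY, LOST }

    private NodeState state = NodeState.RUNNING;
    // Events that would be sent to the scheduler, recorded for inspection.
    final List<String> schedulerEvents = new ArrayList<>();

    // RUNNING -> UNHEALTHY: the node is removed from the scheduler here.
    void markUnhealthy() {
        schedulerEvents.add("NODE_REMOVED");
        state = NodeState.UNHEALTHY;
    }

    // Any state -> LOST: notify the scheduler only if it still tracks the node.
    void deactivate() {
        if (state != NodeState.UNHEALTHY) {
            schedulerEvents.add("NODE_REMOVED");
        }
        state = NodeState.LOST;
    }

    // Counts NODE_REMOVED events for a node that is lost, optionally after
    // first having been marked unhealthy.
    static int removalsFor(boolean unhealthyFirst) {
        GuardedDeactivateSketch node = new GuardedDeactivateSketch();
        if (unhealthyFirst) {
            node.markUnhealthy();
        }
        node.deactivate();
        return node.schedulerEvents.size();
    }

    public static void main(String[] args) {
        System.out.println("RUNNING -> UNHEALTHY -> LOST removals: " + removalsFor(true));
        System.out.println("RUNNING -> LOST removals: " + removalsFor(false));
    }
}
```

Either path yields exactly one removal event, so the scheduler is never asked to remove a node it no longer tracks.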
This article is reposted from 菜菜光's 51CTO blog. Original link: http://blog.51cto.com/caiguangguang/1436087. To repost it, please contact the original author directly.