今天又再次出现了CPU100%的情况(严格来讲是4个CPU都被占满了达到了400%)
通过剖析当前进程发现,有7个线程全部堵塞在HashMap.getEntry这个方法上面,部分thread运行信息如下:
18326:
===== FULL THREAD DUMP ===============
Tue Jan 8 09:40:49 2013
Oracle JRockit(R) R28.1.5-20-146757-1.6.0_29-20111004-1750-linux-x86_64
"Main Thread" id=1 idx=0x4 tid=18419 prio=5 alive, sleeping, native_waiting
at java/lang/Thread.sleep(J)V(Native Method)
at org/apache/hadoop/hive/ql/Driver.pollTasks(Driver.java:1364)
at org/apache/hadoop/hive/ql/Driver.execute(Driver.java:1122)
at org/apache/hadoop/hive/ql/Driver.run(Driver.java:951)
at org/apache/hadoop/hive/cli/CliDriver.processLocalCmd(CliDriver.java:258)
at org/apache/hadoop/hive/cli/CliDriver.processCmd(CliDriver.java:215)
at org/apache/hadoop/hive/cli/CliDriver.processLine(CliDriver.java:406)
at org/apache/hadoop/hive/cli/CliDriver.processLine(CliDriver.java:341)
at org/apache/hadoop/hive/cli/CliDriver.processReader(CliDriver.java:439)
at org/apache/hadoop/hive/cli/CliDriver.processFile(CliDriver.java:449)
at org/apache/hadoop/hive/cli/CliDriver.run(CliDriver.java:647)
at org/apache/hadoop/hive/cli/CliDriver.main(CliDriver.java:557)
at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
at jrockit/vm/Reflect.invokeMethod(Ljava/lang/Object;Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Native Method)
at sun/reflect/NativeMethodAccessorImpl.invoke0(Ljava/lang/reflect/Method;Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Native Method)
at sun/reflect/NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java/lang/reflect/Method.invoke(Method.java:597)
at org/apache/hadoop/util/RunJar.main(RunJar.java:156)
at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
-- end of trace
"(Signal Handler)" id=2 idx=0x8 tid=18420 prio=5 alive, native_blocked, daemon
"(OC Main Thread)" id=3 idx=0xc tid=18421 prio=5 alive, native_waiting, daemon
"(GC Worker Thread 1)" id=? idx=0x10 tid=18422 prio=5 alive, daemon
"(GC Worker Thread 2)" id=? idx=0x14 tid=18423 prio=5 alive, daemon
"(GC Worker Thread 3)" id=? idx=0x18 tid=18424 prio=5 alive, daemon
"(GC Worker Thread 4)" id=? idx=0x1c tid=18425 prio=5 alive, daemon
"(Code Generation Thread 1)" id=4 idx=0x20 tid=18426 prio=5 alive, native_waiting, daemon
"(Code Optimization Thread 1)" id=5 idx=0x24 tid=18427 prio=5 alive, native_waiting, daemon
"(VM Periodic Task)" id=6 idx=0x28 tid=18428 prio=10 alive, native_waiting, daemon
"Finalizer" id=7 idx=0x2c tid=18429 prio=8 alive, native_waiting, daemon
at jrockit/memory/Finalizer.waitForFinalizees(J[Ljava/lang/Object;)I(Native Method)
at jrockit/memory/Finalizer.access$700(Finalizer.java:12)
at jrockit/memory/Finalizer$4.run(Finalizer.java:189)
at java/lang/Thread.run(Thread.java:662)
at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
-- end of trace
"Reference Handler" id=8 idx=0x30 tid=18430 prio=10 alive, native_waiting, daemon
at java/lang/ref/Reference.waitForActivatedQueue(J)Ljava/lang/ref/Reference;(Native Method)
at java/lang/ref/Reference.access$100(Reference.java:11)
at java/lang/ref/Reference$ReferenceHandler.run(Reference.java:82)
at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
-- end of trace
"(Sensor Event Thread)" id=9 idx=0x34 tid=18431 prio=5 alive, native_blocked, daemon
"VM JFR Buffer Thread" id=10 idx=0x38 tid=18432 prio=5 alive, in native, daemon
"LeaseChecker" id=21 idx=0x58 tid=18443 prio=5 alive, sleeping, native_waiting, daemon
at java/lang/Thread.sleep(J)V(Native Method)
at org/apache/hadoop/hdfs/DFSClient$LeaseChecker.run(DFSClient.java:1379)
at java/lang/Thread.run(Thread.java:662)
at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
-- end of trace
"Timer-19" id=416 idx=0x2a8 tid=19295 prio=5 alive, waiting, native_blocked, daemon
-- Waiting for notification on: java/util/TaskQueue@0xe07c5070[fat lock]
at jrockit/vm/Threads.waitForNotifySignal(JLjava/lang/Object;)Z(Native Method)
at java/lang/Object.wait(J)V(Native Method)
at java/lang/Object.wait(Object.java:485)
at java/util/TimerThread.mainLoop(Timer.java:483)
^-- Lock released while waiting: java/util/TaskQueue@0xe07c5070[fat lock]
at java/util/TimerThread.run(Timer.java:462)
at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
-- end of trace
"Thread-434" id=456 idx=0x30c tid=19808 prio=5 alive, native_blocked
at java/util/HashMap.getEntry(HashMap.java:426)
at java/util/HashMap.containsKey(HashMap.java:415)
at java/util/HashSet.contains(HashSet.java:184)
at org/apache/hadoop/hive/ql/QueryPlan.updateCountersInQueryPlan(QueryPlan.java:310)
at org/apache/hadoop/hive/ql/QueryPlan.getQueryPlan(QueryPlan.java:426)[inlined]
at org/apache/hadoop/hive/ql/QueryPlan.toString(QueryPlan.java:598)[inlined]
at org/apache/hadoop/hive/ql/history/HiveHistory.logPlanProgress(HiveHistory.java:497)[optimized]
at org/apache/hadoop/hive/ql/exec/Task.executeTask(Task.java:132)
at org/apache/hadoop/hive/ql/exec/TaskRunner.runSequential(TaskRunner.java:57)
at org/apache/hadoop/hive/ql/exec/TaskRunner.run(TaskRunner.java:47)
at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
-- end of trace
注意红色字体部分。
当前top监控使用状态,如下图:
目前此问题是间歇性的,需要对各种指标进行分析。
2013-01-16问题再次出现,检查当前GC运行状态,如下图:
ganglia监控页面,如下图:
今天查问题,我自己有个缺陷,就是对top命令不是很熟悉,没有去查看到底是那些thread导致CPU占有率高。后面会对top命令再复习一次,并对这次的查问题的过程做一个总结,为以后提高检查问题的效率节约时间(现在只能等下一次了)。
今天再次出现,通过top -H -p <pid>的命令截图如下:
部分栈运行截图如下: