记一次排查NN CPU过高线程卡住问题

背景

在使用 hadoop fs -ls /xxxhadoop fs -du -h /xxx时,出现特别卡顿的情况。怀疑namenode机器大CPU进程占用。由此开始排查之旅。

过程

登录nn机器(active和standby都要看),查看top数。
关于top的使用
通常使用top H,H表示显示进程和线程,默认只看进程。

top H
   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
134105 hadoop    20   0  0.220t 0.156t   7212 R 64.1 63.3 270:32.10 java
134092 hadoop    20   0  0.220t 0.156t   7212 R 58.4 63.3 273:13.89 java
134095 hadoop    20   0  0.220t 0.156t   7212 R 57.2 63.3 275:16.29 java
134141 hadoop    20   0  0.220t 0.156t   7212 R 56.6 63.3 270:43.97 java
134123 hadoop    20   0  0.220t 0.156t   7212 R 54.7 63.3 269:13.20 java
134106 hadoop    20   0  0.220t 0.156t   7212 R 53.1 63.3 271:42.05 java
134148 hadoop    20   0  0.220t 0.156t   7212 R 52.8 63.3 272:43.75 java
134117 hadoop    20   0  0.220t 0.156t   7212 R 52.5 63.3 269:06.99 java
134135 hadoop    20   0  0.220t 0.156t   7212 R 52.2 63.3 269:10.03 java
 84753 hadoop    20   0  0.220t 0.156t   7212 R 52.2 63.3   1:17.63 java
134111 hadoop    20   0  0.220t 0.156t   7212 R 51.2 63.3 269:20.19 java
134147 hadoop    20   0  0.220t 0.156t   7212 R 51.2 63.3 270:16.40 java
134118 hadoop    20   0  0.220t 0.156t   7212 R 50.9 63.3 274:13.64 java

可以看到PID=134105的进程(或线程占用CPU较高,测试使用,实际已降)

通过ps --forest 可以看父进程

分析jstack -l pid找到线程栈:

2019-12-12 17:47:33
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.77-b03 mixed mode):

"Attach Listener" #174318 daemon prio=9 os_prio=0 tid=0x00007f614a1ac000 nid=0x21985 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"qtp212683148-174317" #174317 daemon prio=5 os_prio=0 tid=0x00007f2910910800 nid=0x21783 waiting on condition [0x00007f29216cc000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007f30c805df18> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:392)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:563)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.access$800(QueuedThreadPool.java:48)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626)
        at java.lang.Thread.run(Thread.java:745)

"qtp212683148-174316" #174316 daemon prio=5 os_prio=0 tid=0x00007f6165cfd800 nid=0x216c2 runnable [0x00007f29219cf000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <0x00007f30c806bbc0> (a sun.nio.ch.Util$2)
        - locked <0x00007f30c806bbd8> (a java.util.Collections$UnmodifiableSet)
        - locked <0x00007f30c80821a8> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
        at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.select(ManagedSelector.java:243)
        at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:191)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:249)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
        at java.lang.Thread.run(Thread.java:745)

"IPC Client (895511170) connection to bigdata-nmgv3-hdpnn03.nmg02/10.76.50.29:8485 from hadoop" #174312 daemon prio=5 os_prio=0 tid=0x00007f616690e000 nid=0x21094 in Object.wait() [0x00007f2920ac4000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:1024)
        - locked <0x00007f51d2aa11a8> (a org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1068)

"IPC Client (895511170) connection to bigdata-nmgv3-hdpnn02.nmg02/10.76.60.58:8485 from hadoop" #174311 daemon prio=5 os_prio=0 tid=0x00007f29104aa800 nid=0x21093 in Object.wait() [0x00007f29229d9000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:1024)
        - locked <0x00007f51d2aa3728> (a org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1068)

"IPC Client (895511170) connection to bigdata-nmgv3-hdpnn01.nmg02/10.76.32.39:8485 from hadoop" #174310 daemon prio=5 os_prio=0 tid=0x00007f6165487000 nid=0x21092 in Object.wait() [0x00007f29208c2000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:1024)
        - locked <0x00007f51d2aa5ca8> (a org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1068)

"IPC Parameter Sending Thread #58519" #174272 daemon prio=5 os_prio=0 tid=0x00007f61674a9800 nid=0x1f350 waiting on condition [0x00007f29230e0000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007f30c8048980> (a java.util.concurrent.SynchronousQueue$TransferStack)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
        at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
        at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

"IPC Parameter Sending Thread #58518" #174271 daemon prio=5 os_prio=0 tid=0x00007f2919de0800 nid=0x1f34e waiting on condition [0x00007f2922fdf000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007f30c8048980> (a java.util.concurrent.SynchronousQueue$TransferStack)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
        at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
        at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

"IPC Parameter Sending Thread #58517" #174270 daemon prio=5 os_prio=0 tid=0x00007f2910efc000 nid=0x1f34c waiting on condition [0x00007f2920bc5000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007f30c8048980> (a java.util.concurrent.SynchronousQueue$TransferStack)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
        at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
        at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

"qtp212683148-174164" #174164 daemon prio=5 os_prio=0 tid=0x00007f616719f800 nid=0x1c257 runnable [0x00007f2922ada000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <0x00007f30c806f9a8> (a sun.nio.ch.Util$2)
        - locked <0x00007f30c806f9c0> (a java.util.Collections$UnmodifiableSet)
        - locked <0x00007f30c8056a08> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
        at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.select(ManagedSelector.java:243)
        at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:191)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:249)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
2019-12-12 17:47:33
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.77-b03 mixed mode):

"Attach Listener" #174318 daemon prio=9 os_prio=0 tid=0x00007f614a1ac000 nid=0x21985 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"qtp212683148-174317" #174317 daemon prio=5 os_prio=0 tid=0x00007f2910910800 nid=0x21783 waiting on condition [0x00007f29216cc000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007f30c805df18> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:392)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:563)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.access$800(QueuedThreadPool.java:48)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626)
        at java.lang.Thread.run(Thread.java:745)

"qtp212683148-174316" #174316 daemon prio=5 os_prio=0 tid=0x00007f6165cfd800 nid=0x216c2 runnable [0x00007f29219cf000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <0x00007f30c806bbc0> (a sun.nio.ch.Util$2)
        - locked <0x00007f30c806bbd8> (a java.util.Collections$UnmodifiableSet)
        - locked <0x00007f30c80821a8> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
        at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.select(ManagedSelector.java:243)
        at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:191)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:249)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
        at java.lang.Thread.run(Thread.java:745)

将上述pid转成16进制:

printf %x 134105
20bd9

在jstack中,我们搜索20bd9找到了这一条记录:

"IPC Server handler 19 on 8020" #138 daemon prio=5 os_prio=0 tid=0x00007f6165d9d000 nid=0x20bd9 runnable [0x00007f2929b48000]
   java.lang.Thread.State: RUNNABLE
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
        at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
        at org.apache.hadoop.net.NetworkTopology.getNode(NetworkTopology.java:263)
        at org.apache.hadoop.net.NetworkTopology.countNumOfAvailableNodes(NetworkTopology.java:678)
        at org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:533)
        at org.apache.hadoop.hdfs.net.DFSNetworkTopology.chooseRandomWithStorageTypeTwoTrial(DFSNetworkTopology.java:122)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:901)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:798)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant.chooseOnce(BlockPlacementPolicyRackFaultTolerant.java:224)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant.chooseTargetInOrder(BlockPlacementPolicyRackFaultTolerant.java:94)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:438)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:310)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:149)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:174)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2239)
        at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2802)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:876)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:567)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:882)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2709)

可以定位到在dn选择块的时候耗时。

同时我们看一下namenode日志,找一下WRAN的,或者exception的日志:

agePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable:  unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2019-12-12 17:38:04,818 INFO org.apache.hadoop.ipc.Server: IPC Server handler 33 on 8020, call Call#221 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.76.57.59:32800
java.io.IOException: File /erasurecoding_backup/hdp/g_swan/prod/swan/china/dw_online/publiclog/gsdriverloc/2019/11/21/16/.distcp.tmp.attempt_1561732060733_7707095_m_000070_0 could only be written to 4 of the 6 required nodes for RS-6-3-1024k. There are 1701 datanode(s) running and 1701 node(s) are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2254)
        at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2802)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:876)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:567)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:882)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2709)

可以定位到错误。是distcp的时候出现了错误。

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值