UNIXProcess + Java: a full analysis of Linux overcommit behavior and internals

Overcommit means over-allocation: the kernel may grant processes more virtual memory than it can actually back. The vm.overcommit_memory parameter selects one of three policies:

OVERCOMMIT_GUESS=0: when a process requests memory, the kernel estimates whether enough memory remains and refuses obviously excessive requests; otherwise the allocation succeeds. This heuristic guards against wildly unrealistic allocations while still permitting overcommit, since applications frequently request far more than they actually touch; for example, a process may ask for 1 GB but end up using only 1 KB of it.

Heuristic overcommit handling. Obvious overcommits of address space are refused. Used for a typical system. It ensures a seriously wild allocation fails while allowing overcommit to reduce swap usage. root is allowed to allocate slightly more memory in this mode. This is the default.

OVERCOMMIT_ALWAYS=1: when a process requests memory, the kernel performs no check at all and assumes enough memory is available; failures only surface later, once the memory actually used exceeds what the system can provide.

Always overcommit. Appropriate for some scientific applications.

OVERCOMMIT_NEVER=2: the total amount of memory that processes are allowed to commit may not exceed a fixed limit. That limit is derived from the overcommit_ratio parameter (see the description below), and it is not the same notion of "available memory" as in the two modes above.

Don't overcommit. The total address space commit for the system is not permitted to exceed swap + a configurable percentage (default is 50) of physical RAM. Depending on the percentage you use, in most situations this means a process will not be killed while accessing pages but will receive errors on memory allocation as appropriate.
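Under mode 2 the limit can be checked directly in /proc/meminfo: CommitLimit is swap plus overcommit_ratio percent of physical RAM, and Committed_AS is how much address space has already been committed. Below is a minimal Java sketch for inspecting these values on a Linux host (the class name is mine; the /proc paths are the standard kernel interfaces):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class OvercommitStatus {
    // Read a whole procfs file (e.g. /proc/sys/vm/overcommit_memory holds a single number).
    static String readProc(String path) throws IOException {
        BufferedReader r = new BufferedReader(new FileReader(path));
        try {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = r.readLine()) != null) sb.append(line).append('\n');
            return sb.toString().trim();
        } finally {
            r.close();
        }
    }

    // Extract a field such as "CommitLimit:    8167896 kB" from /proc/meminfo.
    static String meminfoField(String key) throws IOException {
        for (String line : readProc("/proc/meminfo").split("\n")) {
            if (line.startsWith(key + ":")) return line.substring(key.length() + 1).trim();
        }
        return "unknown";
    }

    public static void main(String[] args) throws IOException {
        System.out.println("vm.overcommit_memory = " + readProc("/proc/sys/vm/overcommit_memory"));
        System.out.println("vm.overcommit_ratio  = " + readProc("/proc/sys/vm/overcommit_ratio"));
        // Mode 2: allocations fail once Committed_AS would exceed CommitLimit,
        // where CommitLimit = swap + physical RAM * overcommit_ratio / 100.
        System.out.println("CommitLimit  = " + meminfoField("CommitLimit"));
        System.out.println("Committed_AS = " + meminfoField("Committed_AS"));
    }
}
```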

In the afternoon I changed the overcommit_memory parameter on dp3 to 2. The first problem was that no shell command could be executed any more; every attempt failed with "fork: can't allocate enough memory", i.e. fork could not get the memory it needed. After logging out of the session there was no way to log back in to dp3. The main cause was that the JVM had essentially filled physical memory, overcommit_ratio was 50%, and there was no swap, so no further memory could be allocated.
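The reason even tiny commands fail is that the JVM spawns children via fork() plus exec(): under strict accounting, fork() must be able to commit a copy-on-write duplicate of the parent's writable address space, so a JVM that already occupies most of the commit limit cannot fork anything, however small the child is. A rough sketch of that failure mode (class name and heap sizing are mine; whether the fork actually fails depends entirely on the host's RAM, swap and overcommit settings):

```java
import java.io.IOException;

// Run with a heap close to the machine's commit limit on a host where
// vm.overcommit_memory=2, e.g.:  java -Xmx6g ForkUnderPressure
public class ForkUnderPressure {
    public static void main(String[] args) throws Exception {
        // Fill most of the heap so the JVM's address space is actually committed.
        int chunks = (int) (Runtime.getRuntime().maxMemory() / 2 / (16 << 20));
        byte[][] hold = new byte[chunks][];
        for (int i = 0; i < chunks; i++) hold[i] = new byte[16 << 20]; // 16 MB each

        try {
            // Same code path Hadoop's DU/Shell helper uses: UNIXProcess forks, then execs "du".
            Process p = new ProcessBuilder("du", "-sk", "/tmp").start();
            p.waitFor();
            System.out.println("fork/exec succeeded");
        } catch (IOException e) {
            // On dp3 this surfaced as: java.io.IOException: error=12, Cannot allocate memory
            System.out.println("fork/exec failed: " + e);
        }
        System.out.println("held " + hold.length + " chunks"); // keep the array live
    }
}
```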

From /var/log/syslog it is clear that many programs were affected after the change: ganglia died, cron could no longer fork processes, and init could not allocate any more ttys, which is why we could not log in. In ganglia the memory and CPU graphs turned into flat lines, not because the system was stable but because gmond had crashed.

Nov 8 18:07:04 dp3 /usr/sbin/gmond[1664]: [PYTHON] Can't call the metric handler function for [diskstat_sdd_reads] in the python module [diskstat].#012

Nov 8 18:07:04 dp3 /usr/sbin/gmond[1664]: [PYTHON] Can't call the metric handler function for [diskstat_sdd_writes] in the python module [diskstat].#012

Nov 8 18:07:28 dp3 console-kit-daemon[1760]: WARNING: Error writing state file: No space left on device

Nov 8 18:07:28 dp3 console-kit-daemon[1760]: WARNING: Cannot write to file /var/run/ConsoleKit/database~

Nov 8 18:07:28 dp3 console-kit-daemon[1760]: WARNING: Unable to spawn /usr/lib/ConsoleKit/run-session.d/pam-foreground-compat.ck: Failed to fork (Cannot allocate memory)

Nov 8 18:07:28 dp3 console-kit-daemon[1760]: WARNING: Error writing state file: No space left on device

Nov 8 18:07:28 dp3 console-kit-daemon[1760]: WARNING: Cannot write to file /var/run/ConsoleKit/database~

Nov 8 18:07:28 dp3 console-kit-daemon[1760]: WARNING: Cannot unlink /var/run/ConsoleKit/database: No such file or directory

Nov 8 18:08:12 dp3 /usr/sbin/gmond[1664]: slurpfile() open() error on file /proc/stat: Too many open files

Nov 8 18:08:12 dp3 /usr/sbin/gmond[1664]: update_file() got an error from slurpfile() reading /proc/stat

Nov 8 18:08:12 dp3 /usr/sbin/gmond[1664]: slurpfile() open() error on file /proc/stat: Too many open files

Nov 8 18:08:12 dp3 /usr/sbin/gmond[1664]: update_file() got an error from slurpfile() reading /proc/stat

Nov 8 18:08:12 dp3 /usr/sbin/gmond[1664]: slurpfile() open() error on file /proc/stat: Too many open files

Nov 8 18:08:12 dp3 /usr/sbin/gmond[1664]: update_file() got an error from slurpfile() reading /proc/stat

Nov 8 18:08:12 dp3 /usr/sbin/gmond[1664]: slurpfile() open() error on file /proc/stat: Too many open files

Nov 8 18:08:12 dp3 /usr/sbin/gmond[1664]: update_file() got an error from slurpfile() reading /proc/stat

Nov 8 18:08:12 dp3 /usr/sbin/gmond[1664]: slurpfile() open() error on file /proc/stat: Too many open files

Nov 8 18:08:12 dp3 /usr/sbin/gmond[1664]: update_file() got an error from slurpfile() reading /proc/stat

Nov 8 18:08:12 dp3 kernel: [4319715.969327] gmond[1664]: segfault at ffffffffffffffff ip 00007f52e0066f34 sp 00007fff4e428620 error 4 in libganglia-3.1.2.so.0.0.0[7f52e0060000+13000]

Nov 8 18:10:01 dp3 cron[1637]: (CRON) error (can't fork)

Nov 8 18:13:53 dp3 init: tty1 main process (2341) terminated with status 1

Nov 8 18:13:53 dp3 init: tty1 main process ended, respawning

Nov 8 18:13:53 dp3 init: Temporary process spawn error: Cannot allocate memory

Meanwhile, the Hadoop datanode log contained errors like the following (only some of the exceptions are shown):

2012-11-08 18:07:01,283 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.18.10.56:50010, storageID=DS-1599419066-10.18.10.47-50010-1329122718923, infoPort=50075, ipcPort=50020):DataXceiver

java.io.EOFException: while trying to read 65557 bytes

at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:290)

at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:334)

at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:398)

at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:577)

at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:480)

at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:171)

2012-11-08 18:07:02,163 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.18.10.56:50010, storageID=DS-1599419066-10.18.10.47-50010-1329122718923, infoPort=50075, ipcPort=50020):DataXceiverServer: Exiting due to:java.lang.OutOfMemoryError: unable to create new native thread

at java.lang.Thread.start0(Native Method)

at java.lang.Thread.start(Thread.java:640)

at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:131)

at java.lang.Thread.run(Thread.java:662)

2012-11-08 18:07:04,964 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.18.10.56:50010, storageID=DS-1599419066-10.18.10.47-50010-1329122718923, infoPort=50075, ipcPort=50020):DataXceiver

java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[closed]. 0 millis timeout left.

at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)

at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)

at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)

at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)

at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)

at java.io.BufferedInputStream.read(BufferedInputStream.java:317)

at java.io.DataInputStream.read(DataInputStream.java:132)

at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:287)

at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:334)

at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:398)

at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:577)

at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:480)

at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:171)

2012-11-08 18:07:04,965 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-1079258682690587867_32990729 1 Exception java.io.EOFException

at java.io.DataInputStream.readFully(DataInputStream.java:180)

at java.io.DataInputStream.readLong(DataInputStream.java:399)

at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:120)

at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:937)

at java.lang.Thread.run(Thread.java:662)

2012-11-08 18:07:05,057 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_1523791863488769175_32972264 1 Exception java.nio.channels.ClosedChannelException

at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)

at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)

at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)

at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)

at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)

at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)

at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)

at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)

at java.io.DataOutputStream.flush(DataOutputStream.java:106)

at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1047)

at java.lang.Thread.run(Thread.java:662)

2012-11-08 18:07:04,972 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.18.10.56:50010, storageID=DS-1599419066-10.18.10.47-50010-1329122718923, infoPort=50075, ipcPort=50020):DataXceiver

java.io.IOException: Interrupted receiveBlock

at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:622)

at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:480)

at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:171)

2012-11-08 18:08:02,003 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 1

2012-11-08 18:08:02,025 WARN org.apache.hadoop.util.Shell: Could not get disk usage information

java.io.IOException: Cannot run program "du": java.io.IOException: error=12, Cannot allocate memory

at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)

at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)

at org.apache.hadoop.util.Shell.run(Shell.java:182)

at org.apache.hadoop.fs.DU.access$200(DU.java:29)

at org.apache.hadoop.fs.DU$DURefreshThread.run(DU.java:84)

at java.lang.Thread.run(Thread.java:662)

Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory

at java.lang.UNIXProcess.&lt;init&gt;(UNIXProcess.java:148)

at java.lang.ProcessImpl.start(ProcessImpl.java:65)

at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)

After that it hung, printing only the following log line over and over:

2012-11-08 18:08:52,015 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 1

The HDFS web UI showed the machine as a dead node, even though the datanode process was actually still alive. The cause was presumably the same: it ran into these problems because it could not allocate enough memory.

As for why I was eventually able to log in again, my guess is that the datanode had died and the regionserver on the box had not allocated its memory yet, so there was enough memory left for init to open a tty.

The value has now been set back to its original setting, 0. Fortunately, the change had no impact on the production jobs running during this window.
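Restoring the setting is just a write to the same procfs file (root required). A trivial sketch of that step, equivalent to running sysctl vm.overcommit_memory=0:

```java
import java.io.FileWriter;
import java.io.IOException;

// Must run as root; this writes through the same interface the sysctl command uses.
public class RestoreOvercommit {
    public static void main(String[] args) throws IOException {
        FileWriter w = new FileWriter("/proc/sys/vm/overcommit_memory");
        try {
            w.write("0"); // back to the heuristic default
        } finally {
            w.close();
        }
    }
}
```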
