所谓的overcommit是过量使用的意思。
OVERCOMMIT_GUESS=0 用户申请内存的时候,系统会判断剩余的内存有多少。若是不够的话那么就会失败。这种方式是比较保守的,由于有时候好比用户申请1G内存可是可能只是会使用其中1K.
Heuristic overcommit handling. Obvious overcommits of address space are refused. Used for a typical system. It ensures a seriously wild allocation fails while allowing overcommit to reduce swap usage. root is allowed to allocate slighly more memory in this mode. This is the default.
OVERCOMMIT_ALWAYS=1 用户申请内存的时候,系统不进行任何检查认为内存足够使用,直到使用内存超过可用内存。
Always overcommit. Appropriate for some scientific applications.
OVERCOMMIT_NEVER=2 用户一次申请内存的大小不容许超过的大小。关于这个的大小计算能够看下面overcommit_ration这个参数,能够上面两种所说的可用内存不太同样。
Don't overcommit. The total address space commit for the system is not permitted to exceed swap + a configurable percentage (default is 50) of physical RAM. Depending on the percentage you use, in most situations this means a process will not be killed while accessing pages but will receive errors on memory allocation as appropriate.
下午将dp3的overcommit_memory参数修改为为2以后,首先出现的问题就是不可以再执行任何shell命令了,错误是fork can't allocate enough memory,就是fork没有那么多的内存可用。而后推出会话以后没有办法再登录dp3了。这个主要是由于jvm应该基本上占用满了物理内存,而overcommit_ration=0.5,而且没有swap空间,因此没有办法allocate更多的memory了。
从/var/log/syslog里面能够看到,修改了这个参数以后,不少程序受到影响(ganglia挂掉了,cron不可以fork出进程了,init也不可以分配出更多的tty,致使咱们没有办法登录上去)在ganglia里面看到内存以及CPU使用都是一条直线,不是由于系统稳定而是由于gmond挂掉了。
Nov 8 18:07:04 dp3 /usr/sbin/gmond[1664]: [PYTHON] Can't call the metric handler function for [diskstat_sdd_reads] in the python module [diskstat].#012
Nov 8 18:07:04 dp3 /usr/sbin/gmond[1664]: [PYTHON] Can't call the metric handler function for [diskstat_sdd_writes] in the python module [diskstat].#012
Nov 8 18:07:28 dp3 console-kit-daemon[1760]: WARNING: Error writing state file: No space left on device
Nov 8 18:07:28 dp3 console-kit-daemon[1760]: WARNING: Cannot write to file /var/run/ConsoleKit/database~
Nov 8 18:07:28 dp3 console-kit-daemon[1760]: WARNING: Unable to spawn /usr/lib/ConsoleKit/run-session.d/pam-foreground-compat.ck: Failed to fork (Cannot allocate memory)
Nov 8 18:07:28 dp3 console-kit-daemon[1760]: WARNING: Error writing state file: No space left on device
Nov 8 18:07:28 dp3 console-kit-daemon[1760]: WARNING: Cannot write to file /var/run/ConsoleKit/database~
Nov 8 18:07:28 dp3 console-kit-daemon[1760]: WARNING: Cannot unlink /var/run/ConsoleKit/database: No such file or directory
Nov 8 18:08:12 dp3 /usr/sbin/gmond[1664]: slurpfile() open() error on file /proc/stat: Too many open files
Nov 8 18:08:12 dp3 /usr/sbin/gmond[1664]: update_file() got an error from slurpfile() reading /proc/stat
Nov 8 18:08:12 dp3 /usr/sbin/gmond[1664]: slurpfile() open() error on file /proc/stat: Too many open files
Nov 8 18:08:12 dp3 /usr/sbin/gmond[1664]: update_file() got an error from slurpfile() reading /proc/stat
Nov 8 18:08:12 dp3 /usr/sbin/gmond[1664]: slurpfile() open() error on file /proc/stat: Too many open files
Nov 8 18:08:12 dp3 /usr/sbin/gmond[1664]: update_file() got an error from slurpfile() reading /proc/stat
Nov 8 18:08:12 dp3 /usr/sbin/gmond[1664]: slurpfile() open() error on file /proc/stat: Too many open files
Nov 8 18:08:12 dp3 /usr/sbin/gmond[1664]: update_file() got an error from slurpfile() reading /proc/stat
Nov 8 18:08:12 dp3 /usr/sbin/gmond[1664]: slurpfile() open() error on file /proc/stat: Too many open files
Nov 8 18:08:12 dp3 /usr/sbin/gmond[1664]: update_file() got an error from slurpfile() reading /proc/stat
Nov 8 18:08:12 dp3 kernel: [4319715.969327] gmond[1664]: segfault at ffffffffffffffff ip 00007f52e0066f34 sp 00007fff4e428620 error 4 in libganglia-3.1.2.so.0.0.0[7f52e0060000+13000]
Nov 8 18:10:01 dp3 cron[1637]: (CRON) error (can't fork)
Nov 8 18:13:53 dp3 init: tty1 main process (2341) terminated with status 1
Nov 8 18:13:53 dp3 init: tty1 main process ended, respawning
Nov 8 18:13:53 dp3 init: Temporary process spawn error: Cannot allocate memory
而在hadoop的datanode日志里面,有下面这些错误(只是给出部分exception):
2012-11-08 18:07:01,283 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.18.10.56:50010, storageID=DS-1599419066-10.18.10.47-50010-1329122718923, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 65557 bytes
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:290)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:334)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:398)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:577)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:480)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:171)
2012-11-08 18:07:02,163 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.18.10.56:50010, storageID=DS-1599419066-10.18.10.47-50010-1329122718923, infoPort=50075, ipcPort=50020):DataXceiverServer: Exiting due to:java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:640)
at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:131)
at java.lang.Thread.run(Thread.java:662)
2012-11-08 18:07:04,964 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.18.10.56:50010, storageID=DS-1599419066-10.18.10.47-50010-1329122718923, infoPort=50075, ipcPort=50020):DataXceiver
java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[closed]. 0 millis timeout left.
at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
at java.io.DataInputStream.read(DataInputStream.java:132)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:287)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:334)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:398)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:577)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:480)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:171)
2012-11-08 18:07:04,965 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-1079258682690587867_32990729 1 Exception java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at java.io.DataInputStream.readLong(DataInputStream.java:399)
at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:120)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:937)
at java.lang.Thread.run(Thread.java:662)
2012-11-08 18:07:05,057 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_1523791863488769175_32972264 1 Exception java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
at java.io.DataOutputStream.flush(DataOutputStream.java:106)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1047)
at java.lang.Thread.run(Thread.java:662)
2012-11-08 18:07:04,972 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.18.10.56:50010, storageID=DS-1599419066-10.18.10.47-50010-1329122718923, infoPort=5
0075, ipcPort=50020):DataXceiver
java.io.IOException: Interrupted receiveBlock
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:622)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:480)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:171)
2012-11-08 18:08:02,003 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 1
2012-11-08 18:08:02,025 WARN org.apache.hadoop.util.Shell: Could not get disk usage information
java.io.IOException: Cannot run program "du": java.io.IOException: error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)
at org.apache.hadoop.util.Shell.run(Shell.java:182)
at org.apache.hadoop.fs.DU.access$200(DU.java:29)
at org.apache.hadoop.fs.DU$DURefreshThread.run(DU.java:84)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
at java.lang.UNIXProcess.(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
接着以后就一直打印下面日志hang住了
2012-11-08 18:08:52,015 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 1
hdfs web页面上面显示dead node,可是实际上这个datanode进程还存活。缘由估计也是由于不可以分配足够的内存出现这些问题的吧。
最后能够登录上去的缘由,我猜测应该是datanode挂掉了,上面的regionserver暂时没有分配内存因此有足够的内存空间,init能够开辟tty。
如今已经将这个值调整成为原来的值,也就是0。索性的是,在这个期间,这个修改对于线上的任务执行没有什么影响。