Problem Description
The application uses the DBCP connection pool. Every time a database access times out, one connection in the pool becomes permanently occupied. After several timeouts the pool is exhausted: subsequent database requests cannot obtain a connection and all hang.
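For context, this is roughly how such a pool is wired in a Spring + iBATIS application. This is a sketch, not the actual configuration from the incident: maxActive and maxWait are real commons-dbcp 1.x BasicDataSource properties, but the driver, URL, and values here are illustrative assumptions. With maxWait at -1 (the DBCP default), an exhausted pool blocks callers forever, which matches the observed hang.

```xml
<!-- Hypothetical Spring bean definition; values are illustrative, not from the incident. -->
<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close">
    <property name="driverClassName" value="com.mysql.jdbc.Driver"/>  <!-- assumed driver -->
    <property name="url" value="jdbc:mysql://localhost:3306/app"/>    <!-- assumed URL -->
    <property name="maxActive" value="8"/>  <!-- pool ceiling: 8 leaked connections = total hang -->
    <property name="maxWait" value="-1"/>   <!-- <= 0 means block forever when exhausted -->
</bean>
```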
Code Analysis
1 Acquiring a connection for a database access
DBCP builds a GenericObjectPool and manages database connections through it. When a database access occurs, a connection object is obtained via GenericObjectPool.borrowObject().
If an idle connection exists, or the number of active connections has not yet reached the maxActive limit, a connection is allocated and numActive (the active-connection count) is incremented. The code is as follows:
// activate & validate the object
try {
    _factory.activateObject(latch.getPair().value);
    if (_testOnBorrow &&
            !_factory.validateObject(latch.getPair().value)) {
        throw new Exception("ValidateObject failed");
    }
    synchronized (this) {
        _numInternalProcessing--;
        _numActive++;
    }
    return latch.getPair().value;
}
If numActive (active connections) plus numInternalProcessing (requests currently being allocated a connection) has reached the maxActive limit, the pool is considered exhausted. What happens next depends on the exhausted-action policy. The default is WHEN_EXHAUSTED_BLOCK: when maxWait is not greater than 0, the thread waits forever, until some other thread releases a connection and calls latch.notify().
// If no object was allocated from the pool above
if (latch.getPair() == null) {
    // check if we were allowed to create one
    if (latch.mayCreate()) {
        // allow new object to be created
    } else {
        // the pool is exhausted
        switch (whenExhaustedAction) {
            case WHEN_EXHAUSTED_BLOCK:
                try {
                    synchronized (latch) {
                        // Before we wait, make sure another thread didn't allocate us an object
                        // or permit a new object to be created
                        if (latch.getPair() == null && !latch.mayCreate()) {
                            if (maxWait <= 0) {
                                latch.wait();
                            } else {
                                // this code may be executed again after a notify then continue cycle
                                // so, need to calculate the amount of time to wait
                                final long elapsed = (System.currentTimeMillis() - starttime);
                                final long waitTime = maxWait - elapsed;
                                if (waitTime > 0) {
                                    latch.wait(waitTime);
                                }
                            }
                        } else {
                            break;
                        }
                    }
                }
The hung threads are blocked exactly here, which confirms that the pool is exhausted (numActive + numInternalProcessing >= maxActive).
So why does the pool run out of connections? Let's debug the code and see what actually happens when a database access times out.
2 Releasing the connection after a database access timeout
From the exception stack we traced into SqlMapClientTemplate. The timeout exception is thrown inside the try block, where it is caught and translated; the connection-release logic sits in the finally block. So in theory, whether or not an exception occurs, DataSourceUtils.doReleaseConnection is called in the finally block to release the connection.
// Execute given callback...
try {
    return action.doInSqlMapClient(session);
}
catch (SQLException ex) {
    throw getExceptionTranslator().translate("SqlMapClient operation", null, ex);
}
finally {
    try {
        if (springCon != null) {
            if (transactionAware) {
                springCon.close();
            }
            else {
                DataSourceUtils.doReleaseConnection(springCon, dataSource);
            }
        }
    }
    catch (Throwable ex) {
        logger.debug("Could not close JDBC Connection", ex);
    }
}
Could something be going wrong during the release itself, so that the connection is never returned to the pool?
Debugging gave us the answer (debug mode exposes more of the exception stack). An exception is indeed thrown during release: because the database access timed out, the underlying connection had already been closed, and an "Already closed" exception was thrown.
Look at the logic of PoolableConnection.close:
public synchronized void close() throws SQLException {
    boolean isClosed = false;
    try {
        isClosed = isClosed();
    } catch (SQLException e) {
        try {
            _pool.invalidateObject(this);
        } catch (Exception ie) {
            // DO NOTHING the original exception will be rethrown
        }
        throw new SQLNestedException("Cannot close connection (isClosed check failed)", e);
    }
    if (isClosed) {
        throw new SQLException("Already closed.");
    } else {
        try {
            _pool.returnObject(this);
        } catch (SQLException e) {
            throw e;
        } catch (RuntimeException e) {
            throw e;
        } catch (Exception e) {
            throw new SQLNestedException("Cannot close connection (return to pool failed)", e);
        }
    }
}
If the underlying connection is already closed, the isClosed condition is true and the method throws the "Already closed." exception directly, without doing anything else (such as returning the connection to the pool or decrementing the active-connection count).
Returning a connection requires calling GenericObjectPool.returnObject:
public void returnObject(Object obj) throws Exception {
    try {
        addObjectToPool(obj, true);
    }
    ...
}

private void addObjectToPool(Object obj, boolean decrementNumActive) throws Exception {
    ...
    // Add instance to pool if there is room and it has passed validation
    // (if testOnReturn is set)
    synchronized (this) {
        if (isClosed()) {
            shouldDestroy = true;
        } else {
            if ((_maxIdle >= 0) && (_pool.size() >= _maxIdle)) {
                shouldDestroy = true;
            } else if (success) {
                // borrowObject always takes the first element from the queue,
                // so for LIFO, push on top, FIFO add to end
                if (_lifo) {
                    _pool.addFirst(new ObjectTimestampPair(obj));
                } else {
                    _pool.addLast(new ObjectTimestampPair(obj));
                }
                if (decrementNumActive) {
                    _numActive--;
                }
                allocate();
            }
        }
    }
    ...
}
Only after the connection object has been added back to the pool or destroyed is the numActive count decremented.
3 Conclusion
When a database access times out, the underlying connection is already closed, so the release path throws immediately and never decrements the active-connection count. Eventually numActive reaches maxActive and can never decrease again. The pool is then effectively exhausted, and new requests hang forever because no connection can be obtained.
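The failure mode above can be reproduced with a minimal, self-contained sketch (plain JDK, not real DBCP code): a counter-only "pool" whose release path throws on a dead connection before decrementing the active count, exactly like PoolableConnection.close in 1.2.1.

```java
// LeakDemo.java — a simplified model of the 1.2.1 leak, NOT actual DBCP source.
public class LeakDemo {

    /** Simplified pool: only tracks the numActive counter against maxActive. */
    static class TinyPool {
        final int maxActive;
        int numActive = 0;

        TinyPool(int maxActive) { this.maxActive = maxActive; }

        /** Mirrors borrowObject(): fails (here: returns false) once the pool is exhausted.
            In real DBCP with WHEN_EXHAUSTED_BLOCK and maxWait<=0 this would wait() forever. */
        synchronized boolean tryBorrow() {
            if (numActive >= maxActive) {
                return false;
            }
            numActive++;
            return true;
        }

        /** Mirrors the buggy 1.2.1 close(): throws on a dead connection WITHOUT decrementing. */
        synchronized void release(boolean underlyingClosed) throws Exception {
            if (underlyingClosed) {
                throw new Exception("Already closed."); // numActive is never decremented
            }
            numActive--;
        }
    }

    public static void main(String[] args) {
        TinyPool pool = new TinyPool(2);
        // Two requests time out: each borrow succeeds, each release throws.
        for (int i = 0; i < 2; i++) {
            pool.tryBorrow();
            try {
                pool.release(true); // underlying connection already closed by the timeout
            } catch (Exception e) {
                // the caller just logs and moves on, like SqlMapClientTemplate's finally block
            }
        }
        // The pool is now permanently exhausted even though no connection is in use.
        System.out.println("numActive=" + pool.numActive + " borrowSucceeds=" + pool.tryBorrow());
        // prints: numActive=2 borrowSucceeds=false
    }
}
```

After two timeouts, numActive equals maxActive forever, so every later borrow attempt fails (or, in the real pool, blocks).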
Solution
The problem occurred with commons-dbcp-1.2.1.jar; upgrading to commons-dbcp-1.4.jar resolves it. Look at the logic of PoolableConnection.close in commons-dbcp-1.4.jar:
public synchronized void close() throws SQLException {
    ...
    if (!isUnderlyingConectionClosed) {
        // Normal close: underlying connection is still open, so we
        // simply need to return this proxy to the pool
        try {
            _pool.returnObject(this); // XXX should be guarded to happen at most once
        } catch (IllegalStateException e) {
            // pool is closed, so close the connection
            passivate();
            getInnermostDelegate().close();
        } catch (SQLException e) {
            throw e;
        } catch (RuntimeException e) {
            throw e;
        } catch (Exception e) {
            throw (SQLException) new SQLException("Cannot close connection (return to pool failed)").initCause(e);
        }
    } else {
        // Abnormal close: underlying connection closed unexpectedly, so we
        // must destroy this proxy
        try {
            _pool.invalidateObject(this); // XXX should be guarded to happen at most once
        } catch (IllegalStateException e) {
            // pool is closed, so close the connection
            passivate();
            getInnermostDelegate().close();
        } catch (Exception ie) {
            // DO NOTHING, "Already closed" exception thrown below
        }
        throw new SQLException("Already closed.");
    }
}
As the code shows, when the underlying connection is already closed, GenericObjectPool.invalidateObject is called to destroy the connection object:
public void invalidateObject(Object obj) throws Exception {
    try {
        if (_factory != null) {
            _factory.destroyObject(obj);
        }
    } finally {
        synchronized (this) {
            _numActive--;
            allocate();
        }
    }
}
The destruction logic decrements numActive in a finally block, so the counter is corrected even if destroyObject itself fails.
This is why commons-dbcp-1.4.jar fixes the pool-exhaustion problem present in commons-dbcp-1.2.1.jar.
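Reduced to the same counter-only sketch as before (again a simplified model, not real DBCP code), the 1.4 behavior looks like this: the dead-connection branch still throws "Already closed.", but it invalidates the connection first, so the active-connection slot is reclaimed.

```java
// FixedCloseDemo.java — a simplified model of the 1.4 close() logic, NOT actual DBCP source.
public class FixedCloseDemo {

    /** Simplified pool: only tracks the numActive counter against maxActive. */
    static class TinyPool {
        final int maxActive;
        int numActive = 0;

        TinyPool(int maxActive) { this.maxActive = maxActive; }

        /** Mirrors borrowObject(): fails (here: returns false) once the pool is exhausted. */
        synchronized boolean tryBorrow() {
            if (numActive >= maxActive) return false;
            numActive++;
            return true;
        }

        /** Mirrors returnObject(): the normal path back to the idle list. */
        synchronized void returnObject() { numActive--; }

        /** Mirrors invalidateObject(): the destroy path; the finally block still decrements. */
        synchronized void invalidateObject() {
            try {
                // destroyObject(obj) would run here and may itself fail
            } finally {
                numActive--;
            }
        }

        /** Mirrors the 1.4 close(): branch on the state of the underlying connection. */
        synchronized void close(boolean underlyingClosed) throws Exception {
            if (!underlyingClosed) {
                returnObject();
            } else {
                invalidateObject();                      // the counter is fixed up first...
                throw new Exception("Already closed.");  // ...then the exception is still thrown
            }
        }
    }

    public static void main(String[] args) {
        TinyPool pool = new TinyPool(2);
        pool.tryBorrow();
        try {
            pool.close(true); // the timeout killed the underlying connection
        } catch (Exception e) {
            // the caller still sees "Already closed.", but the slot was reclaimed
        }
        System.out.println("numActive=" + pool.numActive + " borrowSucceeds=" + pool.tryBorrow());
        // prints: numActive=0 borrowSucceeds=true
    }
}
```

The caller-visible exception is unchanged between the two versions; the only difference is that in 1.4 the accounting happens before the throw, which is exactly what keeps the pool from leaking.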