Tomcat c3p0连接池故障一例

负责维护的系统出现c3p0连接池泄露的故障,导致Oracle DB响应极慢,本文对当时故障处理做简单记录,虽然最后直接换掉tomcat版本、弃用c3p0连接池解决了故障,还是希望本文能为遇到类似问题的同学提供一些思路。

环境

数据库和应用在同一台服务器上。

  • OS: Windows Server 2008R2 Enterprise
  • 应用: Tomcat 6.0.32,部署一个webservice接口服务
  • 数据库: Oracle 11g R2 11.2.0.1 64bit

故障现象

每天上午08:00开始业务,约至10:00左右开始出现异常,tomcat catalina.log报错”java.sql.SQLException: Io 异常: The Network Adapter could not establish the connection”,c3p0连接池无法获取新的connection。同时Oracle Database响应极慢。

这里写图片描述

当前c3p0连接池配置

查看tomcat c3p0连接池配置参数如下。

c3p0.minPoolSize=1
c3p0.maxPoolSize=300    <----
c3p0.idleConnectionTestPeriod=35    <----
c3p0.maxIdleTime=35    <----
c3p0.maxStatements=0
c3p0.maxStatementsPerConnection=0
c3p0.preferredTestQuery=select 1 from dual    <----
c3p0.acquireIncrement=2
c3p0.acquireRetryDelay=1000
c3p0.acquireRetryAttempts=10
c3p0.initialPoolSize=1

查询session分布

v$session,按machine统计Oracle database会话情况,发现90%都是本服务器tomcat发起。其发起的session数在1000以上,远超c3p0 maxPoolSize=300的配置,猜想是c3p0连接池配置问题,或者某种原因导致连接池泄露。

SELECT b.MACHINE, b.PROGRAM, COUNT (*)
   FROM v$process a, v$session b
   WHERE a.ADDR = b.PADDR AND b.USERNAME IS NOT NULL
GROUP BY b.MACHINE, b.PROGRAM
ORDER BY COUNT (*) DESC;

Oracle Zabbix监控

在Zabbix(orabbix)监控上,看到session在08:00时约200个,此后开始迅速增长,到10:15左右达到1400个。中午12:00安排重启tomcat,下午13:00业务开始后,session依旧快速增长,故障再次出现。图上还可以看出session主体是”inactive session”,”Active Session”、”System Session”任然算是在正常范围内。想象一下有1000个active session!? 那数据库早就崩了!
这里写图片描述

Oracle AWR分析

从11月16日10:00~11:00 1个小时内的性能awr报告可以看到,平均每秒钟31.4次用户登录(logon),60min内有113040次登录。
这里写图片描述
按照对CPU Time的消耗,对1小时内SQL语句统计排序↓,排在前5的语句,均是oracle在创建新session时,用户登录阶段获取环境参数、权限信息、连接信息的底层SQL语句。这些语句分别执行的次数基本和登录次数一致,其中“cm5vu20fhtnq1”执行次数约为登录次数的4倍。

SQL_IDSQL Text
459f3z9u4fb3uselect value fromprops where name = GLOBAL_DB_NAME’
cm5vu20fhtnq1select /+ connect_by_filtering / privilege#, level from sysauth$ connect by grantee#=prior privilege# and privilege#>0 start with grantee#=:1 and privilege#>0
0ws7ahf1d78qaselect SYS_CONTEXT(‘USERENV’, ‘SERVER_HOST’), SYS_CONTEXT(‘USERENV’, ‘DB_UNIQUE_NAME’), SYS_CONTEXT(‘USERENV’, ‘INSTANCE_NAME’), SYS_CONTEXT(‘USERENV’, ‘SERVICE_NAME’), INSTANCE_NUMBER, STARTUP_TIME, SYS_CONTEXT(‘USERENV’, ‘DB_DOMAIN’) from v$instance where INSTANCE_NAME=SYS_CONTEXT(‘USERENV’, ‘INSTANCE_NAME’)
0k8522rmdzg4kselect privilege# from sysauth$ where (grantee#=:1 or grantee#=1) and privilege#>0
5ur69atw3vfhjselect decode(failover_method, NULL, 0 , ‘BASIC’, 1, ‘PRECONNECT’, 2 , ‘PREPARSE’, 4 , 0), decode(failover_type, NULL, 1 , ‘NONE’, 1 , ‘SESSION’, 2, ‘SELECT’, 4, 1), failover_retries, failover_delay, flags from service$ where name = :1

而应用业务有关的sql其实执行次数以及CPU、IO消耗很小↓↓。
这里写图片描述

收集同一时段内,业务核心数据库(2节点Oracle RAC)的AWR,每秒oracle用户登录次数为0.31,是故障服务器上数据库的1/100;并且耗时较多的sql语句均是应用业务相关的。↓经过和开发同事确认,该webservice的访问量远小于核心应用,相应数据库的logon次数不应该比核心库还大的多。
这里写图片描述

分析session的SQL记录

那么到底这些session在变为inactive之前做了什么操作呢?到底这些inactive session空闲了多久?
需要查v$session,统计session的空闲时间,及其最后执行的SQL语句。v$session中的last_call_et字段,反映了session最后一次执行sql到当前时刻共逝去多少时间(单位:秒)。以下语句统计了当前所有session从最后一次执行sql以来的空闲时间,以及最后一次执行的sql语句ID。

select username,
        s.SID,
        s.SERIAL#,
        s.logon_time,
        s.PREV_SQL_ID,
        s.last_call_et
   from v$session s
  where s.username is not null
  order by s.LAST_CALL_ET desc;

可以看出业务用户的所有1200多个session中,空闲时间从0秒~34秒不等,不超过35秒(c3p0 maxIdleTime = 35)。成功执行的最后一个sql语句都是(520mkxqpf15q8):select 1 from dual,也即是c3p0连接池的connect testquery语句/idleConnectionTest测试语句。↓↓↓
这里写图片描述

尝试分析原因

应用获取数据库连接(connection)的开销比较大,所以才有了连接池的设计。创建连接池,就是希望申请到的connection存放在pool里,达到“创建一次,循环使用”的状态。
应用使用getConnection()方法向c3p0 pool申请connection,如果pool里有空闲的,则应用得到一个connection;如果pool里没有空闲的则由c3p0到数据库去获取新的connection,然后分配给应用;应用程序的数据库操作完成后,使用close()方法将connection放回到pool中,标记为idle,以便其他应用线程调用。
现场的情况是c3p0的poolsize逐渐增大,当poolsize达到预设值300时,并没有停止而是继续申请connection,并且通过idleConnectiontest不断检查pool中所有连接的状态。

以下是从网上找到几个可能的原因,但是都跟本次场景都不符合:

  • 从pool里获取connection时,使用了getConnection(username, password)格式,并且一个DataSource配置里有多个用户名。
    举例,比如一个DataSource配置中含有dbuser1、dbuser2、dbuser3、dbuser4、dbuser5等5个,则实际的pool最大连接数是maxpoolsize * 5=1500。**
  • 程序归还connection时close()方法有问题
  • tomcat的server.xml 配置里autoDeploy=true,并且在线修改webapp应用程序的配置时,autoDeploy功能会导致自动生成一个新的pool。

几个c3p0配置的官方建议

  • Managing Pool Size and Connection Age
    maxConnectionAge/maxIdleTime不易过小
    Some general advice about all of these timeout parameters: Slow down! The point of Connection pooling is to bear the cost of acquiring a Connection only once, and then to reuse the Connection many, many times. Most databases support Connections that remain open for hours at a time. There’s no need to churn through all your Connections every few seconds or minutes.Setting maxConnectionAge or maxIdleTime to 1800 (30 minutes) is quite aggressive. For most databases, several hours may be more appropriate. You can ensure the reliability of your Connections by testing them, rather than by tossing them.
    maxIdleTime配置为30min就已经很激进了,建议以小时为单位。可以使用connection test定期去检查连接的可靠性,检测到不可用的connection时,再将其从pool里剔除出去。
    怀疑是maxpoolsize=35太小,现场将其修改为3600,重启tomcat观察故障依旧:(。

  • Simple advice on Connection testing
    If you’d like to improve performance by eliminating Connection testing from clients’ code path:
    检查修正了以下的建议配置,重启tomcat问题依旧。最后开发同事从别的服务器上copy了更新版本的tomcat,弃用c3p0连接池,恢复业务,总算是解决了这个故障,故障根本原因就没再花时间去测试了

a. Set testConnectionOnCheckout to false

b. Set testConnectionOnCheckin to true

c. Set idleConnectionTestPeriod to 30, fire up you application and observe. This is a pretty robust setting, all Connections will tested on check-in and every 30 seconds thereafter while in the pool. Your application should experience broken or stale Connections only very rarely, and the pool should recover from a database shutdown and restart quickly. But there is some overhead associated with all that Connection testing.

d. If database restarts will be rare so quick recovery is not an issue, consider reducing the frequency of tests by idleConnectionTestPeriod to, say, 300, and see whether clients are troubled by stale or broken Connections. If not, stick with 300, or try an even bigger number. Consider setting testConnectionOnCheckin back to false to avoid unnecessary tests on checkin. Alternatively, if your application does encounter bad Connections, consider reducing idleConnectionTestPeriod and set testConnectionOnCheckin back to true. There are no correct or incorrect values for these parameters: you are trading off overhead for reliability in deciding how frequently to test. The exact numbers are not so critical. It’s usually easy to find configurations that perform well. It’s rarely worth spending time in pursuit of “optimal” values here.

参考:
c3p0 connection pool queries
http://stackoverflow.com/questions/13942540/c3p0-connection-pool-queries
c3p0 connection pool is not shrinking
http://stackoverflow.com/questions/21343741/c3p0-connection-pool-is-not-shrinking?rq=1
C3P0连接池配置
http://blog.csdn.net/caihaijiang/article/details/6843496
Does c3p0 connection pooling ensures max pool size?
http://stackoverflow.com/questions/16932846/does-c3p0-connection-pooling-ensures-max-pool-size
c3p0 Connection Pool not closing connections
http://stackoverflow.com/questions/9824879/c3p0-connection-pool-not-closing-connections
c3p0 - JDBC3 Connection and Statement Pooling
http://www.mchange.com/projects/c3p0/#contents

  • 0
    点赞
  • 3
    收藏
    觉得还不错? 一键收藏
  • 1
    评论
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值