Business topology fails to start after switching to the new database: troubleshooting

In this project, the business runs on the JStorm distributed real-time computation engine. After switching to a new in-memory database for a performance comparison test, we hit functional anomalies and missing logs: some messages were not processed correctly, and the available logs pointed vaguely at a memory-database connection timeout or unreleased resources. Adjusting the JStorm configuration and raising the connection limit did not help; restarting the relevant services restored the system temporarily, but the root cause was never confirmed. The initial suspicion is that slow database queries caused the timeout.

Project Scenario

Project background in brief:
Test requirement: after the business is switched from the old in-memory database to the new one, compare the performance of the two.
Test context: the business runs on the JStorm distributed real-time computation engine and uses ZooKeeper to exchange heartbeat information.


Problem Description

Problems encountered in the project:
1. Before the performance test, the connection to the old database was repointed at the new database; the program started normally and functional testing completed.
2. With business logging disabled, a batch of data was run; some messages produced no correct result, and when we tried to localize the issue from the logs, no log entries useful for diagnosis were found.
3. Having failed to localize the issue, we stopped the topology. After the stop, the business process still existed but was no longer registered in ZooKeeper (so messages could not get through).

The following are the error logs from starting the topology:

/xxxxx/xxxxxx/xxxxx/xiao>more topology_XN_SM_V10.0.0.5_01-worker-17024-gc.log
Java HotSpot(TM) 64-Bit Server VM (25.101-b13) for linux-amd64 JRE (1.8.0_101-b13), built on Jun 22 2016 02:59:44 by "java_re" with gcc 4.3.0 20080428 (Red Hat 4.3.0-8)
Memory: 4k page, physical 263727068k(203680628k free), swap 0k(0k free)
CommandLine flags: -XX:CMSFullGCsBeforeCompaction=5 -XX:CMSInitiatingOccupancyFraction=70 -XX:CMSMaxAbortablePrecleanTime=5000 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/project/bill03/testbill_xn/tools/env/12.0.0/jstorm/logs/topology_XN_SM_V10.0.0.5_01/java-topology_XN_SM_V10.0.0.5_01-1-1649656886-20220411140659.hprof -XX:InitialHeapSize=3221225472 -XX:MaxHeapSize=3221225472 -XX:MaxNewSize=1610612736 -XX:NewSize=1610612736 -XX:OldPLABSize=16 -XX:ParallelGCThreads=4 -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:SurvivorRatio=4 -XX:+UseCMSCompactAtFullCollection -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
2022-04-11T14:07:09.917+0800: 10.211: [GC (CMS Initial Mark) [1 CMS-initial-mark: 0K(1572864K)] 985667K(2883584K), 0.0229559 secs] [Times: user=0.07 sys=0.00, real=0.02 secs]
2022-04-11T14:07:09.940+0800: 10.234: [CMS-concurrent-mark-start]
2022-04-11T14:07:09.961+0800: 10.254: [CMS-concurrent-mark: 0.020/0.020 secs] [Times: user=0.02 sys=0.00, real=0.02 secs]
2022-04-11T14:07:09.961+0800: 10.254: [CMS-concurrent-preclean-start]
2022-04-11T14:07:09.966+0800: 10.259: [CMS-concurrent-preclean: 0.005/0.005 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2022-04-11T14:07:09.966+0800: 10.259: [CMS-concurrent-abortable-preclean-start]
2022-04-11T14:07:10.094+0800: 10.387: [GC (Allocation Failure) 2022-04-11T14:07:10.094+0800: 10.387: [ParNew: 1048576K->18721K(1310720K), 0.0218188 secs] 1048576K->18721K(2883584K), 0.0219212 secs] [Times: user=0.05 sys=0.01, real=0.02 secs]
 CMS: abort preclean due to time 2022-04-11T14:07:14.984+0800: 15.278: [CMS-concurrent-abortable-preclean: 2.656/5.018 secs] [Times: user=4.97 sys=0.16, real=5.02 secs]
2022-04-11T14:07:14.984+0800: 15.278: [GC (CMS Final Remark) [YG occupancy: 340108 K (1310720 K)]2022-04-11T14:07:14.985+0800: 15.278: [Rescan (parallel) , 0.0093420 secs]2022-04-11T14:07:14.994+0800: 15.287: [weak refs processing, 0.0000257 secs]2022-04-11T14:07:14.994+0800: 15.287: [class unloading, 0.0049656 secs]2022-04-11T14:07:14.999+0800: 15.292: [scrub symbol table, 0.0023143 secs]2022-04-11T14:07:15.001+0800: 15.295: [scrub string table, 0.0006851 secs][1 CMS-remark: 0K(1572864K)] 340108K(2883584K), 0.0183145 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]
2022-04-11T14:07:15.003+0800: 15.296: [CMS-concurrent-sweep-start]
2022-04-11T14:07:15.003+0800: 15.296: [CMS-concurrent-sweep: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2022-04-11T14:07:15.003+0800: 15.297: [CMS-concurrent-reset-start]
2022-04-11T14:07:15.010+0800: 15.303: [CMS-concurrent-reset: 0.007/0.007 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2022-04-11T14:07:33.222+0800: 33.515: [GC (Allocation Failure) 2022-04-11T14:07:33.222+0800: 33.516: [ParNew: 1067297K->63521K(1310720K), 0.0384631 secs] 1067297K->63521K(2883584K), 0.0389787 secs] [Times: user=0.12 sys=0.01, real=0.03 secs]
Heap
 par new generation   total 1310720K, used 318744K [0x0000000700000000, 0x0000000760000000, 0x0000000760000000)
  eden space 1048576K,  24% used [0x0000000700000000, 0x000000070f93db40, 0x0000000740000000)
  from space 262144K,  24% used [0x0000000740000000, 0x0000000743e084c0, 0x0000000750000000)
  to   space 262144K,   0% used [0x0000000750000000, 0x0000000750000000, 0x0000000760000000)
 concurrent mark-sweep generation total 1572864K, used 0K [0x0000000760000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 31104K, capacity 31653K, committed 31964K, reserved 1077248K
  class space    used 4399K, capacity 4545K, committed 4580K, reserved 1048576K

/xxxxx/xxxxxx/xxxxx/xiao>more topology_XN_SM_V10.0.0.5_01-worker-17027.log
[ERROR 2022-04-11 14:07:44 c.a.j.d.w.Worker:475 main] Failed to create worker, topologyId:topology_XN_SM_V10.0.0.5_01-1-1649656886, port:17027, workerId:8189cd88-f3dd-4316-b18f-c178ef64b251, jar_path:/project/bill03/testbill_xn/tools/env/12.0.0/jstorm/data/supervisor/stormdist/topology_XN_SM_V10.0.0.5_01-1-1649656886/stormjar.jar

java.io.FileNotFoundException: File '/project/bill03/testbill_xn/tools/env/12.0.0/jstorm/data/supervisor/stormdist/topology_XN_SM_V10.0.0.5_01-1-1649656886/stormconf.ser' does not exist
        at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:299) ~[commons-io-2.4.jar:2.4]
        at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1763) ~[commons-io-2.4.jar:2.4]
        at com.alibaba.jstorm.cluster.StormConfig.readLocalObject(StormConfig.java:552) ~[jstorm-core-2.1.1_1.6.jar:na]
        at com.alibaba.jstorm.cluster.StormConfig.read_supervisor_topology_conf(StormConfig.java:454) ~[jstorm-core-2.1.1_1.6.jar:na]
        at com.alibaba.jstorm.daemon.worker.WorkerData.<init>(WorkerData.java:413) ~[jstorm-core-2.1.1_1.6.jar:na]
        at com.alibaba.jstorm.daemon.worker.Worker.<init>(Worker.java:67) ~[jstorm-core-2.1.1_1.6.jar:na]
        at com.alibaba.jstorm.daemon.worker.Worker.mk_worker(Worker.java:270) ~[jstorm-core-2.1.1_1.6.jar:na]
        at com.alibaba.jstorm.daemon.worker.Worker.main(Worker.java:467) ~[jstorm-core-2.1.1_1.6.jar:na]
        
/xxxxx/xxxxxx/xxxxx/xiao>more topology_XN_SM_V10.0.0.5_01-worker-17025.log
ERROR 2022-04-11 14:05:49 b.s.t.ShellBolt:674 Thread-97] Halting process: ShellBolt died. Command: [/project/bill03/testbill_xn/tools/env/12.0.0/jstorm/bin/cstorm, ReqBolt, topology_XN_SM_V10.0.0.5_01], ProcessInfo pid:50569, name:ReqBolt exitCode:143, errorString:
java.lang.RuntimeException: backtype.storm.multilang.NoOutputException: Pipe to subprocess seems to be broken! Currently read output: null
Serializer Exception:
WARN 2022-04-11 14:01:49.799006 TID_139800562489152 DCACXmlConfig.cpp:660] REDO_LOG/ORACLE_CONN is not encripted,it wouldn't be safe!
WARN 2022-04-11 14:01:49.799175 TID_139800562489152 DCACXmlConfig.cpp:660] REDO_LOG/ORACLE_CONN is not encripted,it wouldn't be safe!
DEBUG 2022-04-11 14:01:49.804063 TID_139800562489152 DCACDatabaseMgr.cpp:731] DCACDatabaseMgr::openDatabase,startAddr adjust from [123259084288000] to [123259084288000],procId=0
DEBUG 2022-04-11 14:01:49.804097 TID_139800562489152 DCACDatabaseMgr.cpp:738] dbSize(byte)=536875008
DEBUG 2022-04-11 14:01:49.804103 TID_139800562489152 DCACDatabaseMgr.cpp:739] DCACDatabaseMgr::openDatabase,procId=0,addr=123259084288000
DEBUG 2022-04-11 14:01:49.804219 TID_139800562489152 DCACConnectionImpl.cpp:366] DCACConnectionImpl::connect,IpcKeyMsgQueue: 0x04094f0d
DEBUG 2022-04-11 14:01:49.804248 TID_139800562489152 DCACStaticClientBase.cpp:96] DCACStaticClientBase::initialize,IpcKeyStatic: 0x02094f0d


        at backtype.storm.utils.ShellProcess.readShellMsg(ShellProcess.java:170) ~[jstorm-core-2.1.1_1.6.jar:na]
        at backtype.storm.task.ShellBolt$ReadShellMsgRunnable.run(ShellBolt.java:698) ~[jstorm-core-2.1.1_1.6.jar:na]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_101]
[ERROR 2022-04-11 14:05:50 b.s.u.DisruptorQueueImpl:159 RoutBolt:64-BoltExecutors] InterruptedException null
[ERROR 2022-04-11 14:05:50 b.s.u.DisruptorQueueImpl:134 Thread-26] InterruptedException null
[ERROR 2022-04-11 14:06:49 b.s.m.JsonSerializer:189 Thread-36] process: 63423 invalid !!

Cause Analysis

From the topology startup logs we can conclude: at startup the business must establish connections to the new database, but the connection times out; the framework then treats the timed-out worker as failed, so the startup fails and the business process keeps being relaunched.

Possible causes of the timeout:
a. The new database's connection path performs no timeout check and never returns a timeout result to the framework, so the topology is killed and relaunched over and over.

b. The business established connections to the new database and the process was later killed, but the connections were never released, so the available connection count ran out.

c. nimbus.task.timeout.secs in JStorm's storm.yaml is set too short, so the worker times out while it is still starting up, gets killed, and is relaunched.


Solutions Tried

Timeout cause a: we described the problem and our analysis to the new database's developers and asked whether the client performs any timeout check. They confirmed that memory-database connections and queries generally return no timeout result; the business itself has to detect timeouts. This hypothesis can therefore be ruled out.
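Since the memory-database client reports no timeout of its own, the business side has to impose one. A minimal sketch of that idea, where `slowQuery()` is a hypothetical stand-in for the real client call (not the actual DB API):

```java
import java.util.concurrent.*;

public class QueryWithTimeout {
    // Hypothetical stand-in for a memory-DB query that may hang indefinitely.
    static String slowQuery() throws InterruptedException {
        Thread.sleep(5_000); // simulate a query that does not come back in time
        return "rows";
    }

    // Run the query on a worker thread and give up after timeoutMs, so the
    // caller can report a timeout instead of hanging through startup.
    static String queryWithTimeout(long timeoutMs) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> f = pool.submit(QueryWithTimeout::slowQuery);
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true); // interrupt the stuck query thread
            return "TIMEOUT";
        } catch (Exception e) {
            return "ERROR";
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(queryWithTimeout(200)); // prints TIMEOUT
    }
}
```

With a wrapper like this the bolt can fail a single message with a clear "DB timeout" log line rather than letting the whole worker miss its heartbeat.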

Timeout cause b: after starting the topology we watched how many connections the business held to the new database (how many at startup, and whether they are released when the program stops; they are). Based on actual usage we raised the connection limit to 500, but the topology still failed to start. This confirmed the connection count really was too low, yet it was not the root cause.
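One way to make sure a stopped worker actually frees its database slots is to track open connections and close them from a JVM shutdown hook, which does run on the SIGTERM behind the `exitCode:143` seen in the worker log (though not on `kill -9`). A sketch with a hypothetical `Conn` handle standing in for the real client connection:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;

public class ConnReleaser {
    // Hypothetical stand-in for a memory-DB connection handle.
    static class Conn {
        final AtomicBoolean closed = new AtomicBoolean(false);
        void close() { closed.set(true); }
    }

    static final List<Conn> OPEN = new CopyOnWriteArrayList<>();

    static {
        // On orderly shutdown (including SIGTERM) release every connection,
        // so the server-side connection quota is not silently exhausted.
        Runtime.getRuntime().addShutdownHook(new Thread(ConnReleaser::closeAll));
    }

    static Conn open() {
        Conn c = new Conn();
        OPEN.add(c);
        return c;
    }

    static void closeAll() {
        for (Conn c : OPEN) c.close();
        OPEN.clear();
    }
}
```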

Timeout cause c: we raised nimbus.task.timeout.secs in storm.yaml to 5 minutes and then to 10 minutes; the program still failed to start, so this did not fix it either.
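For reference, the knob in question lives in storm.yaml; a sketch of the change (the key name is the standard Storm/JStorm one, the value shown is the 10-minute attempt):

```yaml
# storm.yaml -- nimbus treats a task as dead if it misses heartbeats for
# this many seconds, then kills and reschedules the worker.
nimbus.task.timeout.secs: 600   # tried 300 (5 min) and 600 (10 min)
```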

After all the analysis and attempted fixes above, the real problem remained unsolved and the topology still failed to start.

Workaround:
Restart the new database's DMDB process, restart the business framework, then restart the business (the first restart brings it up normally, but after several restarts it fails again).
The actual root cause was never pinned down; the best guess is that database queries are so slow that startup times out.

Note that business startup does two database steps: initialization opens the connections and then runs the SQL prepare calls.
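If each prepare call round-trips to a slow database, startup time grows with the statement count and can exceed the framework timeout. A toy sketch of budgeting the prepare phase so the worker can fail fast with a precise error instead of being killed mid-startup (`prepareOne` and its 50 ms cost are simulated, not the real client API):

```java
public class StartupBudget {
    // Simulated prepare cost; stands in for one round-trip to the slow DB.
    static void prepareOne() throws InterruptedException {
        Thread.sleep(50);
    }

    // Prepare up to n statements but stop once the startup budget is spent,
    // returning how many succeeded so the caller can log exactly where it died.
    static int prepareAll(int n, long budgetMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + budgetMs;
        int done = 0;
        for (int i = 0; i < n && System.currentTimeMillis() < deadline; i++) {
            prepareOne();
            done++;
        }
        return done;
    }

    public static void main(String[] args) throws InterruptedException {
        int done = prepareAll(100, 200); // 100 statements, 200 ms budget
        System.out.println("prepared " + done + " of 100 before the budget ran out");
    }
}
```

Logging a line like this at startup would have turned the silent relaunch loop into a concrete clue about which phase, connect or prepare, was eating the time.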
