A static library overriding same-named classes and methods in a dynamic library

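One day a customer reported that, in their test environment, the audit information reported by clients could not be inserted into the Oracle database. They had recently upgraded Oracle and asked whether the upgrade was the cause. It so happened that a new requirement had come in around the same time and we had delivered a new package to them, so the bug could just as well have come from our own code changes rather than from the upgrade.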

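We asked an on-site colleague to capture the stack traces (this site cannot be accessed remotely). The useful parts are extracted below.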

Thread 16 (Thread 0x7f7cf07e0700 (LWP 5519)):
#0  0x00007f7db0d4b54d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7db0d46e9b in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007f7db0d46d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00000000012b6c31 in HIEUtil::RecMutex::lock (this=0x7f7db4c43548 <HOTLDBMgrOracle::sta_ins+936>)
    at ../HIERecMutex.cpp:249
#4  0x00007f7db49bd89a in HOTLDBMgrOracle::get_connect (this=0x7f7db4c431a0 <HOTLDBMgrOracle::sta_ins>, ui_id=@0x7f7cf07df9bc: 32637, str_exec_sql=...)
    at ../../../../../../BlowSnow_PUFA/source/DBOracle/HOTLDBMgrOracle.h:1991
#5  0x00007f7db49bbff7 in HOTLDBMgrOracle::do_exec_sql (this=0x7f7db4c431a0 <HOTLDBMgrOracle::sta_ins>, str_sql=..., i_buffer_size=4096)
    at ../../../../../../BlowSnow_PUFA/source/DBOracle/HOTLDBMgrOracle.h:1631
#6  0x00007f7db49c460c in DBBatInsertAuditInfoCommon::insert_general_config_warningInfo (lst_audit_infos=std::list<std::pair<AuditCommonInfo, LVPReportUserActionMonVT>, std::allocator<std::pair<AuditCommonInfo, LVPReportUserActionMonVT> > >::_Node: 
std::list) at DBBatInsertAuditInfoCenterOracle.cpp:947
#7  0x0000000000cba179 in DBBatInsertAuditInfoCommon::insert_audit_info(std::list<std::pair<AuditCommonInfo, LVPReportUserActionMonVT>, std::allocator<std::pair<AuditCommonInfo, LVPReportUserActionMonVT> > > const&) ()
#8  0x0000000000bdbf4c in DBBatInsertAuditInfoThread<std::pair<AuditCommonInfo, LVPReportUserActionMonVT> >::run() ()
#9  0x00000000012bb0b7 in HThreadReal::run (this=0x7f7d0c002440)
    at ../HThreadPool.cpp:213
#10 0x0000000001290827 in startHook (arg=0x7f7d0c002440)
    at ../HIEThread.cpp:606
#11 0x00007f7db0d44ea5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f7db0a6db0d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f7cfc7f8700 (LWP 6881)):
#0  0x00007f7db0d4b54d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7db0d46e9b in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007f7db0d46d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00000000012b6c31 in HIEUtil::RecMutex::lock (this=0x7f7db4c43548 <HOTLDBMgrOracle::sta_ins+936>)
    at ../HIERecMutex.cpp:249
#4  0x00007f7db49bd89a in HOTLDBMgrOracle::get_connect (this=0x7f7db4c431a0 <HOTLDBMgrOracle::sta_ins>, ui_id=@0x7f7cfc7f76dc: 32637, str_exec_sql=...)
    at ../../../../../../BlowSnow_PUFA/source/DBOracle/HOTLDBMgrOracle.h:1991
#5  0x00007f7db49bbff7 in HOTLDBMgrOracle::do_exec_sql (this=0x7f7db4c431a0 <HOTLDBMgrOracle::sta_ins>, str_sql=..., i_buffer_size=4096)
    at ../../../../../../BlowSnow_PUFA/source/DBOracle/HOTLDBMgrOracle.h:1631
#6  0x00007f7db49c2c70 in DBBatInsertAuditInfoCommon::insert_cloud_monitor (lst_audit_infos= const std::list<std::pair<AuditCommonInfo, LVPReportCoudMonitoringVT>, std::allocator<std::pair<AuditCommonInfo, LVPReportCoudMonitoringVT> > >::_Node: std::list) at DBBatInsertAuditInfoCenterOracle.cpp:701
#7  0x0000000000cba0e9 in DBBatInsertAuditInfoCommon::insert_audit_info(std::list<std::pair<AuditCommonInfo, LVPReportCoudMonitoringVT>, std::allocator<std::pair<AuditCommonInfo, LVPReportCoudMonitoringVT> > > const&) ()
#8  0x0000000000bdd85c in DBBatInsertAuditInfoThread<std::pair<AuditCommonInfo, LVPReportCoudMonitoringVT> >::run() ()
#9  0x00000000012bb0b7 in HThreadReal::run (this=0x7f7d0c03a9c0)
    at ../HThreadPool.cpp:213
#10 0x00000000012907d0 in startHook (arg=0x7f7d0c03a9c0)
    at ../HIEThread.cpp:585
#11 0x00007f7db0d44ea5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f7db0a6db0d in clone () from /lib64/libc.so.6

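Seeing this, the obvious first thought is that some other code path took the lock and got stuck, so the lock cannot be acquired here and these threads hang. But apart from these, there were no other thread stacks requesting the lock; the on-site colleague captured the stacks several times and they always looked like this. Since the SQL execution interface was not being blocked by a lock holder elsewhere, we suspected that the memory of the mutex had been corrupted.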

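Following this line of thought, we also found that the backend keeps a log, UniServer_DBEXInfo_0.log, that records SQL execution and information about exceptions, and every hang was accompanied by a similar SQL statement in it. So, going by this log and the stack, we located the HOTLDBMgrOracle::do_exec_sql function in HOTLDBMgrOracle.h and traced it step by step.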

[1][2022-07-01][11:01:07][T2784966400][HOTLDBMgrOracle.h   ][1489][E]Msg:ORA-00903: invalid table name
, STM:update  set  = -1 where uidrecordid  = '064420220628tNiyhZuTdpPCGl80', State:, Var:, id 0

Here is the HOTLDBMgrOracle::do_exec_sql function from HOTLDBMgrOracle.h:

HOTLStreamOracle* HOTLDBMgrOracle::do_exec_sql(const HString& str_sql,int i_buffer_size)
{
       ...
       ...
       // implementation omitted

        try
        {
               ...
        }
        catch( oracle::otl_exception &ex)
        {
                HString str_tips ;
                str_tips << L"Do exec " << str_sql << L" exception:"<<otl_exception2str(ex);
                m_file_ex_log.log(m_file_ex_log.get(LEL_TIPS,__WFILE__,__LINE__)<<str_tips);

                if (p_stream->pos)
                {
                        delete p_stream->pos;
                        p_stream->pos = 0;
                }
                delete p_stream;
                p_stream = 0;
                del_ref(ui_id);
                do_ex(ex, ui_id);
        }
        return p_stream;
}

// Follow the handling logic in the catch block into the do_ex function
void HOTLDBMgrOracle::do_ex(oracle::otl_exception& ex, unsigned int ui_id)
{	
	// otl_exception2str(ex) is what printed the log line quoted above (its source is shown below), so we keep tracing from here
	/*
	HString otl_exception2str(oracle::otl_exception& p)
	{
     	HString str;
        HString str_error((char*)(p.msg)) ;
        std::string ice_error((char*)p.msg);
        printf(ice_error.c_str());
        str_error.make_by_ice_str(ice_error);
        str << L"Msg:" << str_error << L", STM:" << p.stm_text << L", State:" << p.sqlstate
                << L", Var:" << p.var_info;
        return str;
	}
	*/
	
	HString str_show = otl_exception2str(ex) << L", id " << ui_id;
	m_file_ex_log.log(m_file_ex_log.get(LEL_ERROR , __WFILE__,__LINE__)<<str_show);

	if(...)
	{
		HLog(HGET_INFO << L"set to recon");
		HIEUtil::RecMutex::Lock lock(mutex);
		for (int i = 0; i < mvt_connector.size(); i++)
		{
			if (mvt_connector[i].i_status == HOTLCS_CON_ED)
			{
				mvt_connector[i].i_status = HOTLCS_CON_NO;
				if (!mb_thread_run)
				{
					mb_thread_run = true;
					HThreadDBMMaintain* pt = new HThreadDBMMaintain;
					pt->mdm = this;
					pt->start().detach();
				}
			}
		}
	}
}
// Next, follow the HThreadDBMMaintain class
class HThreadDBMMaintain: public HIEUtil::Thread
{
public:
	HOTLDBMgrOracle* mdm;
	virtual void run()
	{
		mdm->run();
	}
};

// Next, follow the run function
void HOTLDBMgrOracle::run()
{
	HLog(HGET_INFO << L"run start");
	while(1)
	{
		bool b_need = need_recon();
		if (b_need)
		{
			HLog(HGET_INFO << L"need_recon");
			oracle::otl_connect* p_connect = get_new_connect();
			HLog(HGET_INFO << L"get_new_connect ret " << (__int64)p_connect);
			if (p_connect == 0)
			{
				HEnvironment::Sleep(1000 * 20);
				continue;
			}
			add_new(p_connect);
		}
		else
		{
			HEnvironment::Sleep(1000 * 2);
		}
	}
}

However, in UniServer_0.log the file name shown for these messages is not HOTLDBMgrOracle.h but HOTLDBMgr.cpp:

[1351][2022-07-01][11:01:07][T2835322624][HOTLDBMgr.cpp       ][0879][I]run start
[1352][2022-07-01][11:01:07][T2835322624][HOTLDBMgr.cpp       ][0887][I]need_recon
[1353][2022-07-01][11:01:07][T2835322624][HOTLDBMgr.cpp       ][0747][I]mstr_user scmsbusinessdb mstr_dsn OracleDB
[1354][2022-07-01][11:01:07][T2835322624][HOTLDBMgr.cpp       ][1648][I]rlogon take time 58 ms
[1355][2022-07-01][11:01:07][T2835322624][HOTLDBMgr.cpp       ][0891][I]get_new_connect ret 140437713089200

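The "run start", "need_recon" and "get_new_connect ret" messages in the log are certainly the right ones, so why does the log show HOTLDBMgr.cpp as the file name when we expected HOTLDBMgrOracle.h? Reading through HOTLDBMgr.cpp, we found that both files define a class with the same name, HThreadDBMMaintain. On reflection, shouldn't that have produced a build error?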

class HThreadDBMMaintain  : public HIEUtil::Thread
{
public:
	HOTLDBMgr* mdm;
	virtual void run()
	{
		mdm->run();
	}
};

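After looking up references and testing with a demo of my own, I found that calls to same-named classes and methods in a dynamic library really are overridden by the static library's versions.
(The precondition is that the same-named class is actually used somewhere in the code. If it is only defined and never used, nothing is overridden even though the names clash, and each side still calls its own.)
HOTLDBMgr.cpp and its .h are built into a static library.
HOTLDBMgrOracle.h is built into a dynamic library.
Because of the way this project is set up, both end up linked into our program.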

======================================================
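Back to HOTLDBMgrOracle::do_ex. Because calls to the dynamic library's same-named class and methods are overridden by the static library's, HThreadDBMMaintain* pt = new HThreadDBMMaintain; creates the HThreadDBMMaintain from HOTLDBMgr.cpp. In that file the class's mdm member has type HOTLDBMgr*, whereas the this pointer here has type HOTLDBMgrOracle*. The code then executes pt->mdm = this;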

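Without remote access we could not use gcore, so the exact details are unknown. A reasonable guess is that an instance of one class is assigned to a pointer of a different class, and function calls and member assignments are then made through it. HOTLDBMgr and HOTLDBMgrOracle have different memory layouts, so the mutex member inside HOTLDBMgrOracle gets trampled, the lock call seen in the stack hangs, and the data never reaches the database.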

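Later we changed lock to trylock and found that it returned 16 (EBUSY on Linux), meaning the lock is reported as already held. That makes our guess plausible. Since we had no remote access and could not find out the details, we set that part aside for now.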

void HOTLDBMgrOracle::do_ex(oracle::otl_exception& ex, unsigned int ui_id)
{
	HString str_show = otl_exception2str(ex) << L", id " << ui_id;
	m_file_ex_log.log(m_file_ex_log.get(LEL_ERROR , __WFILE__,__LINE__)<<str_show);

	if(...)
	{
		HLog(HGET_INFO << L"set to recon");
		HIEUtil::RecMutex::Lock lock(mutex);
		for (int i = 0; i < mvt_connector.size(); i++)
		{
			if (mvt_connector[i].i_status == HOTLCS_CON_ED)
			{
				mvt_connector[i].i_status = HOTLCS_CON_NO;
				if (!mb_thread_run)
				{
					mb_thread_run = true;
					HThreadDBMMaintain* pt = new HThreadDBMMaintain;
					pt->mdm = this;
					pt->start().detach();
				}
			}
		}
	}
}

// pt->start() therefore ends up calling HOTLDBMgr::run(), and the interfaces it calls do read and even assign to member variables
void HOTLDBMgr::run()
{
        HLog(HGET_INFO << L"run start");
        while(1)
        {
                bool b_need = need_recon();
                if (b_need)
                {
                        HLog(HGET_INFO << L"need_recon");
                        otl_connect* p_connect = get_new_connect();
                        HLog(HGET_INFO << L"get_new_connect ret " << (__int64)p_connect);
                        if (p_connect == 0)
                        {
                                HEnvironment::Sleep(1000 * 20);
                                continue;
                        }
                        add_new(p_connect);
                }
                else
                {
                        HEnvironment::Sleep(1000 * 2);
                }
        }
}

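So we renamed the HThreadDBMMaintain class in HOTLDBMgrOracle.h, and every place that uses that class name, to HThreadDBOracleMMaintain, rebuilt, and sent the build to the site for verification. The problem of data not reaching the database no longer occurred.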

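A side note: this never showed up before because either no SQL statement failed, or one failed without satisfying the condition. The newly delivered test package happened to build a broken SQL statement, update set = -1 where uidrecordid = '064420220628tNiyhZuTdpPCGl80', which sent execution into the catch branch with the condition satisfied, reaching HThreadDBMMaintain* pt = new HThreadDBMMaintain; and that is what triggered the bug. Such is timing, such is fate.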
