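Problem: the radar-vision (雷视) device at the Deqing site restarts frequently. The generated coredump files show several threads triggering the restart; in four of them the arming (布防) thread raised signal 11.
Code related to the arming thread: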
int RADAR_COORDINATE_SERVER::PicServ_SDK_Recv(int sockfd, char *pbuf, UINT32 buflen, UINT32 *dwOutlen)
{
    ...
    NETRET_HEADER *pHead = NULL;
    pHead = (NETRET_HEADER *)pbuf;
    iRet = readn(sockfd, pbuf, sizeof(NETRET_HEADER));
    // According to the coredump, the crash happens on the next line
    dwPackLen = ntohl(pHead->length);
    ...
}

int RADAR_COORDINATE_SERVER::Send_UserExchange(void)
{
    int fd = -1;
    UINT32 dwRetLen = 0;
    UINT32 buff[PICSERV_BUFFERLEN] = {0};
    char *pbuf = (char *)buff;
    IPCM_SDK_NETCMD_HEADER head;
    Make_SDK_Head(NETCMD_USEREXCHANGE, (char *)&head, sizeof(head), &dwRetLen, &Addr, mac);
    if (PicServ_SDK_Recv(fd, pbuf, sizeof(buff), &dwRetLen) < 0)
        ...
    pHead = (NETRET_HEADER *)pbuf;
    ...
}
The coredump output is as follows:
warning: Could not load shared library symbols for 14 libraries, e.g. /lib/libhisdk.so.
Use the "info sharedlibrary" command to see the complete listing.
Do you need "set solib-search-path" or "set sysroot"?
Core was generated by `./hisi_app'.
Program terminated with signal 11, Segmentation fault.
#0  RADAR_COORDINATE_SERVER::PicServ_SDK_Recv (this=this@entry=0x3569478, sockfd=1154, pbuf=0x4 <Address 0x4 out of bounds>, pbuf@entry=0x54a05a60 "", buflen=buflen@entry=1024, dwOutlen=0x54a05940, dwOutlen@entry=0x54a05938) at dataManagement/picserv/RadarCoordinate_serv.cpp:237
warning: Source file is more recent than executable.
237         dwPackLen = ntohl(pHead->length); //coredump ??????????
(gdb) bt
#0  RADAR_COORDINATE_SERVER::PicServ_SDK_Recv (this=this@entry=0x3569478, sockfd=1154, pbuf=0x4 <Address 0x4 out of bounds>, pbuf@entry=0x54a05a60 "", buflen=buflen@entry=1024, dwOutlen=0x54a05940, dwOutlen@entry=0x54a05938) at dataManagement/picserv/RadarCoordinate_serv.cpp:237
#1  0x002950f4 in RADAR_COORDINATE_SERVER::AlarmUp (this=this@entry=0x3569478) at dataManagement/picserv/RadarCoordinate_serv.cpp:938
#2  0x0029587c in RADAR_COORDINATE_SERVER::taskRadarCoordinateServer (_p=0x3569478) at dataManagement/picserv/RadarCoordinate_serv.cpp:1216
#3  0x76e19b3c in ?? ()
Backtrace stopped: frame did not save the PC
One thing worth noting: the value of pbuf cannot be printed correctly (pbuf=0x4 <Address 0x4 out of bounds>), whereas dwOutlen prints normally:
(gdb) p dwOutlen
$1 = (UINT32 *) 0x54a05940
(gdb) p pbuf
$2 = 0x4 <Address 0x4 out of bounds>
Going up one stack frame and printing the same two addresses:
(gdb) up 1
#1  0x002950f4 in RADAR_COORDINATE_SERVER::AlarmUp (this=this@entry=0x3569478) at dataManagement/picserv/RadarCoordinate_serv.cpp:938
938         if(PicServ_SDK_Recv(m_iSockFd, pbuf, sizeof(buff), &dwRetLen) < 0)
(gdb) p dwRetLen
$3 = 256
(gdb) p &dwRetLen
$4 = (UINT32 *) 0x54a05940
(gdb) p pbuf
$5 = 0x54a05a60 ""
(gdb) p &buff[0]
$6 = (unsigned int *) 0x54a05a68
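The address of dwRetLen in this frame matches the address passed as the actual argument to the callee. pbuf is probably held in a register, which may be why it prints as 0x4 in the callee, so at this point we cannot be sure whether that value is actually wrong.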
(gdb) down 1
#0  RADAR_COORDINATE_SERVER::PicServ_SDK_Recv (this=this@entry=0x3569478, sockfd=1154, pbuf=0x4 <Address 0x4 out of bounds>, pbuf@entry=0x54a05a60 "", buflen=buflen@entry=1024, dwOutlen=0x54a05940, dwOutlen@entry=0x54a05938) at dataManagement/picserv/RadarCoordinate_serv.cpp:237
237         dwPackLen = ntohl(pHead->length);
(gdb) p pHead
$8 = (NETRET_HEADER *) 0x4
(gdb) p &pbuf
Address requested for identifier "pbuf" which is in register $r5
(gdb) p &pHead
Address requested for identifier "pHead" which is in register $r5
(gdb) p $r5
$10 = 4
The following output deserves attention:
- gdb prints:
(gdb) p pbuf
$5 = 0x54a05a60 ""
(gdb) p &buff[0]
$6 = (unsigned int *) 0x54a05a68
- while the code reads:
UINT32 buff[PICSERV_BUFFERLEN] = {0};
char *pbuf = (char *)buff;
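- According to the code these two addresses should be identical, yet the printed values differ by 8 bytes.
- I do not understand what the author of this code intended, and there are no comments. The file has a .cpp suffix and is built with the g++ toolchain, yet it uses C-style explicit casts. It is tempting to call that wrong, but the code has run stably for many years and this is the first failure at this spot; if the cast were the problem, it would have surfaced long ago.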
Aside: on explicit casts
- Section 4.11.3 of C++ Primer (5th edition) introduces the four named casts in C++:
static_cast, dynamic_cast, const_cast, reinterpret_cast
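- On reinterpret_cast, the book says:
Given the following cast:
int *ip;
char *pc = reinterpret_cast<char *>(ip);
we must never forget that the object pc actually points to is an int, not a character string. Using pc as if it were an ordinary character pointer is likely to fail at run time, for example:
string str(pc);
...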
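- Immediately after that comes a warning:
WARNING: reinterpret_cast is inherently machine dependent. Using reinterpret_cast safely requires a thorough understanding of the types involved and of how the compiler implements the cast.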
- How the C-style cast in the code relates to reinterpret_cast:
- When an old-style cast is used where a static_cast or const_cast would also be legal, it behaves exactly like the corresponding named cast. If neither is legal, it performs the equivalent of a reinterpret_cast. The cast from UINT32 * to char * used in the code therefore has the same effect as reinterpret_cast.
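To make the equivalence concrete, here is a minimal sketch, not taken from the project. The `UINT32` alias and both function names are assumptions introduced only for illustration. It compares the pointer produced by the old-style cast with the one produced by `reinterpret_cast`, and reads one byte of an integer through a `char` pointer, which is a well-defined use of such a cast.

```cpp
#include <cstdint>

// Hypothetical stand-in for the project's UINT32 typedef.
using UINT32 = std::uint32_t;

// Returns true when the old-style cast and reinterpret_cast
// produce the same address for the same array.
bool casts_agree() {
    UINT32 buff[4] = {0};
    char *c_style = (char *)buff;                    // old-style cast, as in the source
    char *named   = reinterpret_cast<char *>(buff);  // equivalent named cast
    return c_style == named &&
           static_cast<void *>(buff) == static_cast<void *>(named);
}

// Reads the first byte of a UINT32 through a char pointer.
// Which byte of the value this is depends on the machine's endianness.
unsigned first_byte(UINT32 value) {
    const char *p = reinterpret_cast<const char *>(&value);
    return static_cast<unsigned char>(*p);
}
```

Both casts only reinterpret the address and never change it, which is consistent with the observation that the cast by itself cannot explain the crash.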
The real cause of the restart
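- 1. The restart cannot be caused by the cast itself. However, pbuf is a pointer, and having its value changed is dangerous: if it ends up pointing at an illegal address, dereferencing it can cause a segmentation fault.
- 2. Because the stack grows from high addresses toward low addresses, it is very likely that an out-of-bounds access to a variable defined after pbuf changed the value of pbuf.
- 3. Before the call to PicServ_SDK_Recv() that triggers the restart, Make_SDK_Head() is given the address of the struct head:
int RADAR_COORDINATE_SERVER::Make_SDK_Head(UINT32 cmdtype, char *pbuff, UINT32 buflen, UINT32 *poutlen, HPR_ADDR_T *pAddr, UINT8 *mac)
{
    // 1. Note this check on the struct length
    usr_assert(pbuff != NULL && buflen >= sizeof(NETCMD_HEADER) && poutlen != NULL, ERROR);
    NETCMD_HEADER *pHead = NULL;
    INTER_ALARM_SEARCH_COND *pRadarUpPara = NULL;
    pHead = (NETCMD_HEADER *)pbuff;
    // ... unrelated code omitted
    *poutlen = sizeof(NETCMD_HEADER);
    // 2. The check at comment 1 only guarantees sizeof(NETCMD_HEADER) bytes, so the writes below go out of bounds
    pRadarUpPara = (INTER_ALARM_SEARCH_COND *)(pbuff + sizeof(NETCMD_HEADER));
    pRadarUpPara->dwAlarmComm = DVR_VCA_ALARM;
    pRadarUpPara->bySupport = GET_RADARDATA_SWITCH;
    pHead->length = htonl(sizeof(NETCMD_HEADER) + sizeof(INTER_ALARM_SEARCH_COND));
    *poutlen = sizeof(NETCMD_HEADER) + sizeof(INTER_ALARM_SEARCH_COND);
    pHead->checkSum = htonl(checkByteSum((char *)&(pHead->netCmd), sizeof(NETCMD_HEADER)-12));
    return OK;
}
- 4. Conclusion: the out-of-bounds access past the struct head in Make_SDK_Head() changed the value of pbuf, a variable at a higher address in the same stack frame. As a result, the later dereference of pbuf intermittently caused a segmentation fault.
- 5. How the bug was introduced: it happened when the code was ported. The message header had to be extended, but neither the parameter check nor the message header definition was updated to match, which produced this maddening, intermittent segmentation fault.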
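One way the parameter check could be brought in line with what the function actually writes is sketched below. This is an illustration under stated assumptions, not the project's fix: `Header`, `SearchCond`, `kOk`, `kError` and `make_head_checked` are hypothetical simplified stand-ins, the field values are placeholders, and the byte-order conversion (`htonl`) is left out to keep the sketch portable.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical, simplified stand-ins for NETCMD_HEADER and
// INTER_ALARM_SEARCH_COND; the real layouts are not shown in the source.
struct Header {
    std::uint32_t length;
    std::uint32_t checkSum;
    std::uint32_t netCmd;
};
struct SearchCond {
    std::uint32_t dwAlarmComm;
    std::uint8_t  bySupport;
    std::uint8_t  reserved[3];
};

constexpr int kOk = 0;
constexpr int kError = -1;

// Builds header + payload into pbuff, but only after checking that the
// buffer can hold everything the function is about to write.
int make_head_checked(char *pbuff, std::size_t buflen, std::size_t *poutlen) {
    const std::size_t needed = sizeof(Header) + sizeof(SearchCond);
    if (pbuff == nullptr || poutlen == nullptr || buflen < needed) {
        return kError;  // reject instead of writing past the caller's object
    }
    Header head{};
    SearchCond cond{};
    head.length = static_cast<std::uint32_t>(needed);  // htonl omitted in this sketch
    cond.dwAlarmComm = 1;  // placeholder value
    cond.bySupport = 1;    // placeholder value
    std::memcpy(pbuff, &head, sizeof(head));
    std::memcpy(pbuff + sizeof(head), &cond, sizeof(cond));
    *poutlen = needed;
    return kOk;
}
```

With a check of this shape, a caller that passes a buffer sized only for the header gets an error back instead of having the neighbouring stack variable silently overwritten. The caller's buffer would still need to be declared large enough for header plus payload.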