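One day, a colleague noticed that a process being traced under gdb had stopped at some point. Some bugs are very hard to reproduce, so when one does show up, never just glance at it and move on.
Aside 1:
After setting a breakpoint in gdb and typing continue, do not press Enter again. Once Continuing. appears, as in the transcript below, leave the keyboard alone; otherwise the stop you worked so hard to catch will be gone.
gdb remembers the last command, and a bare Enter repeats it, so a stray Enter re-runs continue as soon as the program stops. One extra Enter will be your nightmare.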
Loaded symbols for lib/libgcc_s.so.1
Reading symbols from /lib/ld-linux.so.3...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux.so.3
0xb6f44908 in pthread_cond_wait@@GLIBC_2.4 () from lib/libpthread.so.0
(gdb) b DRV_BaSignalHandler
Breakpoint 1 at 0x14270
(gdb) c
Continuing.
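Back to the story. Once the process stopped, the call stack showed which function the error occurred in and confirmed that a divide-by-zero exception was killing the process. However, because the binary's debug sections had been stripped, nothing more (such as line numbers) was visible.
Don't lose heart; the fix is simple.
Aside 2:
1. Fetch the matching revision of the code with svn co code_path project_name -r specific_version, then build a complete (unstripped) executable.
2. Open another Telnet session and upload the unstripped executable to some directory on the target, for example the root directory.
3. In gdb, load the uploaded file with the file command and carry on debugging; the symbols are now available.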
(gdb) b DRV_BaSignalHandler
Breakpoint 1 at 0x14258
(gdb) c
Continuing.
Program received signal SIGFPE, Arithmetic exception.
[Switching to Thread 0xb2dfe460 (LWP 944)]
0xb6e7e3a4 in raise () from lib/libpthread.so.0
(gdb) file /a8MediaServer
A program is being debugged already.
Are you sure you want to change the file? (y or n) y
Load new symbol table from "/a8MediaServer"? (y or n) y
Reading symbols from /a8MediaServer...done.
(gdb) bt
#0 0xb6e7e3a4 in raise () from lib/libpthread.so.0
#1 0x0004badc in __aeabi_ldiv0 () at ../../../gcc~linaro-4.8-2013.12/libgcc/config/arm/lib1funcs.S:1331
#2
#3
#4
#5
#6
#7
#8
#9
#10 0xb6e74e44 in start_thread () from lib/libpthread.so.0
#11 0xb6b8fd68 in ?? () from lib/libc.so.6
#12 0xb6b8fd68 in ?? () from lib/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) up
#1 0x0004badc in __aeabi_ldiv0 () at ../../../gcc~linaro-4.8-2013.12/libgcc/config/arm/lib1funcs.S:1331
1331 ../../../gcc~linaro-4.8-2013.12/libgcc/config/arm/lib1funcs.S: No such file or directory.
(gdb) up
#2 0x000211b0 in Fw_H264_parseSPS (gb=0xb2dfda88, width=0xb2dfdbb6, height=0xb2dfdbb4, frameRate=0xb2dfda9e, pusRefFrmNum=0xb2dfda9c)
at /home/ljfeng/soft-4188/src/sc/drv_es_decoder.c:697
697 /home/ljfeng/soft-4188/src/sc/drv_es_decoder.c: No such file or directory.
(gdb) p uiNumUnitsInStick
$1 = 2147483648
(gdb) p uiTimeScale
$2 = 0
(gdb) q
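Back to the story. I had not seen the exact source line and had already quit the debugger, so with the information at hand all I could do was review the code.
There are three divisions in the code.
The first two divide by step_x and step_y:
if (crop_left > (unsigned)INT_MAX / 4 / step_x ||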
crop_right > (unsigned)INT_MAX / 4 / step_x ||
crop_top > (unsigned)INT_MAX / 4 / step_y ||
crop_bottom> (unsigned)INT_MAX / 4 / step_y ||
(crop_left + crop_right ) * step_x >= mb_width * 16 ||
(crop_top + crop_bottom) * step_y >= mb_height * 16
)
However:
int vsub = (chroma_format_idc == 1) ? 1 : 0; /* 0 or 1 */
int hsub = (chroma_format_idc == 1 || chroma_format_idc == 2) ? 1 : 0; /* 0 or 1 */
int step_x = 1 << hsub; /* can never be 0 */
int step_y = (2 - frame_flag) << vsub; /* could be 0 only if frame_flag were 2 */
And frame_flag is obtained via (U8)result >>= 7;, so its value is either 0 or 1.
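The argument above can be checked mechanically. The following is a minimal sketch, not the project's code: the function names steps_nonzero and all_steps_nonzero are hypothetical, and the range 0..3 for chroma_format_idc is an assumption taken from the H.264 syntax rather than from the source shown here.

```c
#include <stdbool.h>

/* Hypothetical helper: recomputes step_x and step_y exactly as in the
   snippet above and reports whether both are non-zero. */
bool steps_nonzero(int chroma_format_idc, int frame_flag)
{
    int vsub = (chroma_format_idc == 1) ? 1 : 0;
    int hsub = (chroma_format_idc == 1 || chroma_format_idc == 2) ? 1 : 0;
    int step_x = 1 << hsub;
    int step_y = (2 - frame_flag) << vsub;
    return step_x != 0 && step_y != 0;
}

/* Hypothetical helper: tries every combination the decoder can produce,
   assuming chroma_format_idc is 0..3 and frame_flag is 0 or 1. */
bool all_steps_nonzero(void)
{
    for (int idc = 0; idc <= 3; ++idc) {
        for (int flag = 0; flag <= 1; ++flag) {
            if (!steps_nonzero(idc, flag)) {
                return false;
            }
        }
    }
    return true;
}
```

Under those assumptions all_steps_nonzero() returns true, which matches the reasoning that the first two divisions are safe; only an impossible frame_flag of 2 would make step_y zero.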
I ruled out the last division in half a second. Here is the code:
if( uiNumUnitsInStick > 0 )
{
usFrameRate = (U16)(uiTimeScale / (uiNumUnitsInStick * 2));
}
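At that point I could only hope the bug would reproduce so that I could track it down again. Persistence paid off: when it reproduced several days later, it was a hard slap in the face.
The result: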
(gdb) p uiNumUnitsInStick
$1 = 2147483648
(gdb) p (uiNumUnitsInStick * 2)
$2 = 0
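The wrap-around seen in gdb can be reproduced in isolation. The sketch below is only an illustration under stated assumptions: it uses the fixed-width types uint32_t and uint16_t in place of the project's own typedefs, the function names wrapped_divisor and safe_frame_rate are hypothetical, and saturating the result at 65535 is my choice rather than the original code's behavior (the original simply casts to U16).

```c
#include <stdint.h>

/* Mirrors the divisor in the original expression: the product is reduced
   to 32 bits, so 0x80000000 * 2 wraps around to 0. */
uint32_t wrapped_divisor(uint32_t num_units_in_tick)
{
    return num_units_in_tick * 2u;
}

/* Hypothetical safer variant: widen to 64 bits before multiplying, test
   the actual divisor, and saturate instead of truncating. Returns 0 when
   the frame rate cannot be computed. */
uint16_t safe_frame_rate(uint32_t time_scale, uint32_t num_units_in_tick)
{
    uint64_t divisor = (uint64_t)num_units_in_tick * 2u;
    if (divisor == 0) {
        return 0;
    }
    uint64_t rate = time_scale / divisor;
    return rate > UINT16_MAX ? (uint16_t)UINT16_MAX : (uint16_t)rate;
}
```

With the values from the gdb session, wrapped_divisor(2147483648u) is 0 while safe_frame_rate(0, 2147483648u) returns 0 without dividing by zero. The general point is to guard the expression that is actually used as the divisor, not just one of its operands.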
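In other words, 2147483648 (0x80000000) passes the > 0 check, but multiplying it by 2 wraps around to 0 in 32-bit unsigned arithmetic, so the division really is a divide by zero.
Parting words: I hope everyone sharpens their professional instincts and avoids the kind of careless assumption I made here (I did not write this code, but I could well have written something like it).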