Training yolo-v2 kept failing today, so I wanted to single-step through the code. Debugging with printf is tedious, so I used gdb instead.
First, edit the Makefile and change line 19 from
CC=gcc
to
CC=gcc -g
Then rebuild. Once the build finishes, start gdb with the following command:
$gdb darknet
You will see output like the following:
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-51.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/guoyana/my_files/local_install/darknet-v2/darknet...done.
(gdb)
Next, set the program's command-line arguments:
(gdb) set args yolo train ./cfg/yolo-voc.cfg
Press Enter and the arguments are set. Use show to check them:
(gdb) show args
Argument list to give program being debugged when it is started is "yolo train ./cfg/yolo-voc.cfg ".
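Next, set a breakpoint in yolo.c so that the program stops at a chosen location. From there we can single-step and see which step fails. Set the breakpoint like this: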
(gdb) b yolo.c:40
Breakpoint 1 at 0x451d44: file ./src/yolo.c, line 40.
Then start the program with the r command:
(gdb) r
Starting program: /home/guoyana/my_files/local_install/darknet-v2/darknet yolo train ./cfg/yolo-voc.cfg
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
yolo-voc
layer filters size input output
0 [New Thread 0x7fffc52e1700 (LWP 13524)]
conv 32 3 x 3 / 1 416 x 416 x 3 -> 416 x 416 x 32
1 max 2 x 2 / 2 416 x 416 x 32 -> 208 x 208 x 32
2 conv 64 3 x 3 / 1 208 x 208 x 32 -> 208 x 208 x 64
3 max 2 x 2 / 2 208 x 208 x 64 -> 104 x 104 x 64
4 conv 128 3 x 3 / 1 104 x 104 x 64 -> 104 x 104 x 128
5 conv 64 1 x 1 / 1 104 x 104 x 128 -> 104 x 104 x 64
6 conv 128 3 x 3 / 1 104 x 104 x 64 -> 104 x 104 x 128
7 max 2 x 2 / 2 104 x 104 x 128 -> 52 x 52 x 128
8 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256
9 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128
10 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256
11 max 2 x 2 / 2 52 x 52 x 256 -> 26 x 26 x 256
12 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512
13 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256
14 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512
15 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256
16 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512
17 max 2 x 2 / 2 26 x 26 x 512 -> 13 x 13 x 512
18 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024
19 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512
20 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024
21 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512
22 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024
23 conv 1024 3 x 3 / 1 13 x 13 x1024 -> 13 x 13 x1024
24 conv 1024 3 x 3 / 1 13 x 13 x1024 -> 13 x 13 x1024
25 route 16
26 reorg / 2 26 x 26 x 512 -> 13 x 13 x2048
27 route 26 24
28 conv 1024 3 x 3 / 1 13 x 13 x3072 -> 13 x 13 x1024
29 conv 125 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 125
30 detection
Learning Rate: 0.0001, Momentum: 0.9, Decay: 0.0005
Breakpoint 1, train_yolo (cfgfile=0x7fffffffe5bb "./cfg/yolo-voc.cfg", weightfile=<optimized out>) at ./src/yolo.c:41
41 char **paths = (char **)list_to_array(plist);
Missing separate debuginfos, use: debuginfo-install glibc-2.17-55.el7_0.3.x86_64 libgcc-4.8.2-16.2.el7_0.x86_64 libstdc++-4.8.2-16.2.el7_0.x86_64
(gdb)
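The program stopped at line 41. Why not line 40? A likely explanation is that the binary is still built with optimization (note weightfile=<optimized out> in the output above): -g only adds debug information, and in optimized code the mapping between instructions and source lines is approximate, so the reported line can be off by one.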
Next, type n and then keep pressing Enter (an empty line repeats the previous command). The session looks like this:
args.m = plist->size;
(gdb)
args.d = &buffer;
(gdb)
args.hue = net.hue;
(gdb)
args.d = &buffer;
(gdb)
args.hue = net.hue;
(gdb)
pthread_t load_thread = load_data_in_thread(args);
(gdb)
[New Thread 0x7fff384fd700 (LWP 16478)]
float avg_loss = -1;
(gdb)
pthread_t load_thread = load_data_in_thread(args);
(gdb)
float avg_loss = -1;
(gdb)
while(get_current_batch(net) < net.max_batches){
(gdb)
time=clock();
(gdb)
pthread_join(load_thread, 0);
(gdb)
time=clock();
(gdb)
pthread_join(load_thread, 0);
(gdb)
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff384fd700 (LWP 16478)]
0x00007fffeffd7595 in _int_malloc () from /lib64/libc.so.6
(gdb)
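As you can see, execution does not advance strictly line by line but jumps back and forth. Multithreading plays a part (a new thread starts in the middle of the session), but source lines repeating like this is also typical of stepping through optimized code, where the compiler reorders instructions. In the end the program crashed. What does SIGSEGV mean? Here is the explanation from Baidu Baike: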
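On POSIX-compliant platforms, SIGSEGV is the signal sent to a process when it makes an invalid memory reference, that is, when a segmentation fault occurs. The symbolic constant SIGSEGV is defined in the header file signal.h. Symbolic signal names are used because signal numbers can vary across platforms. Usually it is signal #11.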
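So it is a memory error. My first guess was that the machine was running out of memory, so I checked the memory usage. When reading the output, the "-/+ buffers/cache" row is the one that shows what is really used and free, because cached memory can be reclaimed. Keep in mind that a crash inside _int_malloc often points to heap corruption, such as an earlier out-of-bounds write, rather than to a lack of memory.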
$free -h
                   total   used   free   shared   buffers   cached
Mem:                125G   124G   1.1G     1.4G       50M      82G
-/+ buffers/cache:          41G    84G
Swap:                 0B     0B     0B
Commonly used gdb commands, for reference:
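Command | Description |
backtrace (or bt) | Show the call stack, with the arguments of each function call |
finish | Run until the current function returns, then stop and wait for a command |
frame (or f) frame-number | Select a stack frame |
info (or i) locals | Show the values of the local variables in the current stack frame |
list (or l) | List source code, continuing from where the last listing ended, 10 lines at a time |
list line-number | List the source code around the given line |
list function-name | List the source code around the start of the given function |
next (or n) | Execute the next line, stepping over function calls |
print (or p) | Print the value of an expression; the expression can also modify variables or call functions |
quit (or q) | Exit gdb |
set var | Modify the value of a variable |
start | Start the program and stop before the first statement of main, waiting for a command |
step (or s) | Execute the next line, stepping into the function if the line contains a call |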