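1. How it started
A while ago, code written by a colleague began to core dump intermittently after it went live. He spent several hours on it without finding the cause, and it really was hard to spot.
I enjoy hunting bugs, so I asked him about it. He said the crash happened in a sort call on a vector and that he was baffled.
From experience, I guessed that the compare function was implemented incorrectly and had him make a small change. Sure enough, the crash went away.
2. The core dump program
The original scenario was fairly complex, so for ease of explanation I have reduced it to a minimal program that reproduces the core dump: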
#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

// Comparator passed to sort. Note that it returns true when a == b.
bool compare(int a, int b)
{
    return a >= b;
}

int main(int argc, char *argv[])
{
    // Fill the vector with 17 equal elements.
    vector<int> vec;
    for (int i = 0; i < 17; i++)
    {
        int x = 0;
        vec.push_back(x);
    }
    sort(vec.begin(), vec.end(), compare);
    return 0;
}
Compile and run it:
ubuntu@VM-0-15-ubuntu:~/taoge/cpp$ g++ -g test.cpp
ubuntu@VM-0-15-ubuntu:~/taoge/cpp$ ./a.out
Segmentation fault (core dumped)
ubuntu@VM-0-15-ubuntu:~/taoge/cpp$
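As you can see, the program core dumped. What now? Debug it.
3. Debugging the program
Once the program has core dumped, you can analyze the core file directly with gdb a.out core (a.out is not run again in that case), or you can rerun the program under gdb with gdb a.out: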
ubuntu@VM-0-15-ubuntu:~/taoge/cpp$ gdb a.out
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from a.out...done.
(gdb)
(gdb)
(gdb)
(gdb) r
Starting program: /home/ubuntu/taoge/cpp/a.out
Program received signal SIGSEGV, Segmentation fault.
0x000000000040219f in __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(int, int)>::operator()<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > > > (
this=0x7fffffffe1a0, __it1=<error reading variable: Cannot access memory at address 0x638000>, __it2=0)
at /usr/include/c++/5/bits/predefined_ops.h:123
123 { return bool(_M_comp(*__it1, *__it2)); }
(gdb)
(gdb)
(gdb)
(gdb) bt
#0 0x000000000040219f in __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(int, int)>::operator()<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > > > (
this=0x7fffffffe1a0, __it1=<error reading variable: Cannot access memory at address 0x638000>, __it2=0)
at /usr/include/c++/5/bits/predefined_ops.h:123
#1 0x000000000040207c in std::__unguarded_partition<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(int, int)> > (__first=<error reading variable: Cannot access memory at address 0x638000>,
__last=0, __pivot=0, __comp=...) at /usr/include/c++/5/bits/stl_algo.h:1897
#2 0x0000000000401a6d in std::__unguarded_partition_pivot<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(int, int)> > (__first=0, __last=0, __comp=...)
at /usr/include/c++/5/bits/stl_algo.h:1918
#3 0x00000000004016d0 in std::__introsort_loop<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(int, int)> > (__first=0, __last=0, __depth_limit=13, __comp=...)
at /usr/include/c++/5/bits/stl_algo.h:1948
#4 0x00000000004012c9 in std::__sort<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(int, int)> > (__first=0, __last=0, __comp=...) at /usr/include/c++/5/bits/stl_algo.h:1963
#5 0x0000000000400de0 in std::sort<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, bool (*)(int, int)> (__first=0, __last=0, __comp=0x400aa6 <compare(int, int)>) at /usr/include/c++/5/bits/stl_algo.h:4729
#6 0x0000000000400b41 in main (argc=1, argv=0x7fffffffe448) at test.cpp:21
(gdb)
Now we can see the call stack. Next, switch to frame 1:
(gdb) f 1
#1 0x000000000040207c in std::__unguarded_partition<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(int, int)> > (__first=<error reading variable: Cannot access memory at address 0x638000>,
__last=0, __pivot=0, __comp=...) at /usr/include/c++/5/bits/stl_algo.h:1897
1897 while (__comp(__first, __pivot))
(gdb) i args
__first = <error reading variable __first (Cannot access memory at address 0x638000)>
__last = 0
__pivot = 0
__comp = {_M_comp = 0x400aa6 <compare(int, int)>}
(gdb) p *--__first
$6 = (int &) @0x637ffc: 0
(gdb)
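What next? Read it alongside the source.
4. Source code analysis
Combined with the source, this shows that __first has run out of bounds, while --__first is still just inside readable memory and holds 0. So an out-of-bounds read caused the core dump. The crash happens inside __unguarded_partition, whose code is as follows (shown here with the pivot passed by value; in the GCC 5 headers from the backtrace the pivot is passed as an iterator and the comparator wrapper dereferences it, but the loop structure is the same):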
/// This is a helper function...
template<typename _RandomAccessIterator, typename _Tp, typename _Compare>
_RandomAccessIterator
__unguarded_partition(_RandomAccessIterator __first,
                      _RandomAccessIterator __last,
                      _Tp __pivot, _Compare __comp)
{
    while (true)
    {
        while (__comp(*__first, __pivot))
            ++__first;
        --__last;
        while (__comp(__pivot, *__last))
            --__last;
        if (!(__first < __last))
            return __first;
        std::iter_swap(__first, __last);
        ++__first;
    }
}
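Suppose every element to be sorted is 0. Our custom compare returns true for equal values, so the following loop never finds an element that stops it:
    while (__comp(*__first, __pivot))
        ++__first;
It keeps running until __first leaves the valid range, at which point *__first is an out-of-bounds read. Recall the gdb session: *--__first was 0 at address 0x637ffc, while __first itself (0x638000) was unreadable. The fault address is page-aligned, which suggests that the scan had already left the vector's storage and kept reading adjacent heap memory until it reached the first unmapped page.
This compare function only runs out of bounds for particular inputs, which is why the core dump appears intermittent. If the values are all different, it will not necessarily core dump.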
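To make the runaway scan concrete without triggering undefined behavior, here is a minimal sketch that mimics the first inner loop of __unguarded_partition with an added bounds check. The names guarded_scan, greater_or_equal and strictly_greater are hypothetical helpers invented for this illustration; they are not part of libstdc++.

```cpp
#include <cstddef>
#include <vector>

bool greater_or_equal(int a, int b) { return a >= b; }  // the faulty comparator
bool strictly_greater(int a, int b) { return a > b; }   // the corrected comparator

// Hypothetical helper: walks forward while comp(v[i], pivot) holds, like the
// first inner loop of __unguarded_partition, but stops at the end of the
// vector instead of reading past it. A return value equal to v.size() means
// the real, unguarded loop would have left the range.
template <typename Compare>
std::size_t guarded_scan(const std::vector<int>& v, int pivot, Compare comp)
{
    std::size_t i = 0;
    while (i < v.size() && comp(v[i], pivot))
        ++i;
    return i;
}
```

For a vector of 17 zeros and a pivot of 0, guarded_scan with greater_or_equal returns 17 (nothing inside the range stops the scan), whereas with strictly_greater it returns 0 (the very first element stops it).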
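Also, if the 17 in the test program is changed to 16, it no longer core dumps. Why? libstdc++'s sort only partitions a range while it holds more than _S_threshold elements: the loop in __introsort_loop (frame 3 of the backtrace) runs only while __last - __first > _S_threshold. With 16 elements, __unguarded_partition is therefore never called, and the whole range is left to the final insertion sort pass, which uses the same threshold:
template<typename _RandomAccessIterator>
void __final_insertion_sort(_RandomAccessIterator __first,
                            _RandomAccessIterator __last) {
    if (__last - __first > _S_threshold)
    {
        __insertion_sort(__first, __first + _S_threshold);
        __unguarded_insertion_sort(__first + _S_threshold, __last);
    }
    else __insertion_sort(__first, __last);
}
And the value of _S_threshold is exactly 16: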
enum { _S_threshold = 16 };
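5. Fix and verification
From the debugging session and source analysis above, we reach an important conclusion: compare must return false for equal elements. std::sort requires the comparator to be a strict weak ordering, so compare(a, a) must always be false.
Clearly, the compare function above returns true when a and b are equal, and that is the bug.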
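As a quick sanity check of that rule, one can test a comparator for irreflexivity before handing it to sort. The following is only a sketch: is_irreflexive, descending_or_equal and descending are hypothetical names made up for this example, and the check covers just one necessary property of a strict weak ordering, not the full requirement.

```cpp
#include <cstddef>
#include <vector>

bool descending_or_equal(int a, int b) { return a >= b; }  // faulty: true for a == b
bool descending(int a, int b) { return a > b; }            // strict: false for a == b

// Hypothetical helper: reports whether comp(x, x) is false for every sample.
// This tests irreflexivity only; passing it does not prove that the
// comparator is a valid strict weak ordering.
template <typename Compare>
bool is_irreflexive(const std::vector<int>& samples, Compare comp)
{
    for (std::size_t i = 0; i < samples.size(); ++i)
        if (comp(samples[i], samples[i]))
            return false;
    return true;
}
```

Called with any non-empty sample vector, is_irreflexive rejects descending_or_equal and accepts descending. A check like this fits naturally in a unit test for custom comparators.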
Here is the fixed program:
#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

// Fixed comparator: returns false when a == b.
bool compare(int a, int b)
{
    return a > b;
}

int main(int argc, char *argv[])
{
    // Same input as before: 17 equal elements.
    vector<int> vec;
    for (int i = 0; i < 17; i++)
    {
        int x = 0;
        vec.push_back(x);
    }
    sort(vec.begin(), vec.end(), compare);
    return 0;
}
Compile and run it, and everything is fine:
ubuntu@VM-0-15-ubuntu:~/taoge/cpp$ g++ -g test.cpp
ubuntu@VM-0-15-ubuntu:~/taoge/cpp$ ./a.out
ubuntu@VM-0-15-ubuntu:~/taoge/cpp$
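As an alternative to a hand-written comparator, the standard library's std::greater<int> is already a strict comparison, so it cannot return true for equal elements. The sketch below assumes that descending order is what the comparator was meant to achieve; sorted_descending is a hypothetical helper name used only for this illustration.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Hypothetical helper: sorts a copy of the input in descending order using
// the standard strict comparator std::greater<int>.
std::vector<int> sorted_descending(std::vector<int> v)
{
    std::sort(v.begin(), v.end(), std::greater<int>());
    return v;
}
```

Calling sorted_descending on the vector of 17 zeros from the test program returns the same 17 zeros without any out-of-bounds access.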
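6. Final words
Hunting down bugs is an important skill in real development work, and it demands both clear thinking and experience. In later posts, I will cover n ways to debug core dumps.
It's Saturday, a working person's happiest day, so I'll stop here. I hope you all have a good time and that everything goes smoothly, and I hope this article was helpful. See you next time.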