Cracking an Intermittent STL Core Dump

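1. How it started

A while ago, a colleague's code started core dumping intermittently after it went live. He spent several hours without finding the cause, and it really was hard to spot.

I enjoy hunting bugs, so I asked about it. He said the crash was at a sort call on a vector, and he was baffled.

From experience, I guessed the compare function was implemented incorrectly and had him make a small change. Sure enough, the crash went away.

2. The crashing program

The original scenario was fairly complex. To keep things simple, I have reduced it to a minimal program so we can study the core dump together: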

#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

bool compare(int a, int b)
{
    return a >= b;
}

int main(int argc, char *argv[])
{
    vector<int> vec;

    for (int i = 0; i < 17; i++)
    {
        int x = 0;
        vec.push_back(x);
    }

    sort(vec.begin(), vec.end(), compare);

    return 0;
}

Compile and run it:

ubuntu@VM-0-15-ubuntu:~/taoge/cpp$ g++ -g test.cpp
ubuntu@VM-0-15-ubuntu:~/taoge/cpp$ ./a.out 
Segmentation fault (core dumped)
ubuntu@VM-0-15-ubuntu:~/taoge/cpp$ 

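As you can see, the program core dumped. What now? Debug it.

3. Debugging the program

Since the program core dumped, you can analyze the core file directly with gdb a.out core (a.out is not run in that case), or run the program again under gdb with gdb a.out: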

ubuntu@VM-0-15-ubuntu:~/taoge/cpp$ gdb a.out 
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from a.out...done.
(gdb) r
Starting program: /home/ubuntu/taoge/cpp/a.out 

Program received signal SIGSEGV, Segmentation fault.
0x000000000040219f in __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(int, int)>::operator()<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > > > (
    this=0x7fffffffe1a0, __it1=<error reading variable: Cannot access memory at address 0x638000>, __it2=0)
    at /usr/include/c++/5/bits/predefined_ops.h:123
123             { return bool(_M_comp(*__it1, *__it2)); }
(gdb) bt
#0  0x000000000040219f in __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(int, int)>::operator()<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > > > (
    this=0x7fffffffe1a0, __it1=<error reading variable: Cannot access memory at address 0x638000>, __it2=0)
    at /usr/include/c++/5/bits/predefined_ops.h:123
#1  0x000000000040207c in std::__unguarded_partition<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(int, int)> > (__first=<error reading variable: Cannot access memory at address 0x638000>, 
    __last=0, __pivot=0, __comp=...) at /usr/include/c++/5/bits/stl_algo.h:1897
#2  0x0000000000401a6d in std::__unguarded_partition_pivot<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(int, int)> > (__first=0, __last=0, __comp=...)
    at /usr/include/c++/5/bits/stl_algo.h:1918
#3  0x00000000004016d0 in std::__introsort_loop<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(int, int)> > (__first=0, __last=0, __depth_limit=13, __comp=...)
    at /usr/include/c++/5/bits/stl_algo.h:1948
#4  0x00000000004012c9 in std::__sort<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(int, int)> > (__first=0, __last=0, __comp=...) at /usr/include/c++/5/bits/stl_algo.h:1963
#5  0x0000000000400de0 in std::sort<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, bool (*)(int, int)> (__first=0, __last=0, __comp=0x400aa6 <compare(int, int)>) at /usr/include/c++/5/bits/stl_algo.h:4729
#6  0x0000000000400b41 in main (argc=1, argv=0x7fffffffe448) at test.cpp:21
(gdb) 

Now we can see the call stack. Next, switch to frame 1 and take a look:

(gdb) f 1
#1  0x000000000040207c in std::__unguarded_partition<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(int, int)> > (__first=<error reading variable: Cannot access memory at address 0x638000>, 
    __last=0, __pivot=0, __comp=...) at /usr/include/c++/5/bits/stl_algo.h:1897
1897              while (__comp(__first, __pivot))
(gdb) i args
__first = <error reading variable __first (Cannot access memory at address 0x638000)>
__last = 0
__pivot = 0
__comp = {_M_comp = 0x400aa6 <compare(int, int)>}
(gdb) p *--__first
$6 = (int &) @0x637ffc: 0
(gdb) 

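What next? Read it alongside the source.

4. Source analysis

The source makes the picture clear: __first has run out of bounds, while --__first is just back inside readable memory and holds the value 0. So an out-of-bounds memory access caused the core dump. The crash happens inside __unguarded_partition. Its code is shown below (this listing comes from an older libstdc++ release that passes the pivot by value; the GCC 5 frame above passes iterators to __comp instead, but the loop logic is the same):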

/// This is a helper function...
template<typename _RandomAccessIterator, typename _Tp, typename _Compare>
  _RandomAccessIterator
  __unguarded_partition(_RandomAccessIterator __first,
                        _RandomAccessIterator __last,
                        _Tp __pivot, _Compare __comp)
  {
    while (true)
      {
        while (__comp(*__first, __pivot))
          ++__first;
        --__last;
        while (__comp(__pivot, *__last))
          --__last;
        if (!(__first < __last))
          return __first;
        std::iter_swap(__first, __last);
        ++__first;
      }
  }

Suppose every element to be sorted is 0. Our custom compare returns true for equal values, so the following loop keeps running:

while (__comp(*__first, __pivot))
  ++__first;

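It runs until __first moves past the end of the vector, at which point *__first is an out-of-bounds read. Recall the gdb session: *--__first was still readable at 0x637ffc with value 0, while __first itself at 0x638000 could not be read. In other words, __first had only just stepped off the last readable address.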

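The compare above only runs out of bounds in particular situations, which is why the core dump looked intermittent. If the values are not all equal, the program does not necessarily crash. Strictly speaking, std::sort requires the comparator to be a strict weak ordering, so a comparator that returns true for equal elements is undefined behavior whether or not it happens to crash.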

Also, if you change the 17 in the program above to 16, it no longer core dumps. Why? Look at the source:

template<typename _RandomAccessIterator>
void __final_insertion_sort(_RandomAccessIterator __first,
                            _RandomAccessIterator __last)
{
    if (__last - __first > _S_threshold)
    {
        __insertion_sort(__first, __first + _S_threshold);
        __unguarded_insertion_sort(__first + _S_threshold, __last);
    }
    else
        __insertion_sort(__first, __last);
}

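_S_threshold is exactly 16. __introsort_loop uses the same threshold and only calls __unguarded_partition on ranges longer than it, so with 16 elements the unguarded scan never runs: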

enum { _S_threshold = 16 };

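5. Fix and verification

From the debugging session and the code analysis above, we reach one important conclusion: compare must return false for equal values.

In the compare function above, when a and b are equal, compare returns true. That is the bug.

Here is the fixed program: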

#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

bool compare(int a, int b)
{
    return a > b;
}

int main(int argc, char *argv[])
{
    vector<int> vec;

    for (int i = 0; i < 17; i++)
    {
        int x = 0;
        vec.push_back(x);
    }

    sort(vec.begin(), vec.end(), compare);

    return 0;
}

Compile and run it, and everything is fine:

ubuntu@VM-0-15-ubuntu:~/taoge/cpp$ g++ -g test.cpp
ubuntu@VM-0-15-ubuntu:~/taoge/cpp$ ./a.out 
ubuntu@VM-0-15-ubuntu:~/taoge/cpp$ 

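6. Closing words

Hunting down bugs is an important skill in day-to-day development, and it demands both clear thinking and experience. In later posts I will cover n ways to debug core dumps.

It's Saturday, the happiest day for working folks, so I'll stop here. I hope everyone has a good one and that this post was helpful. See you next time.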
