Symptom
A production service core dumped with the following stack trace:
#0 0x000000000045d145 in GetStackTrace(void**, int, int) ()
#1 0x000000000045ec22 in tcmalloc::PageHeap::GrowHeap(unsigned long) ()
#2 0x000000000045eeb3 in tcmalloc::PageHeap::New(unsigned long) ()
#3 0x0000000000459ee8 in tcmalloc::CentralFreeList::Populate() ()
#4 0x000000000045a088 in tcmalloc::CentralFreeList::FetchFromSpansSafe() ()
#5 0x000000000045a10a in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) ()
#6 0x000000000045c282 in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long) ()
#7 0x0000000000470766 in tc_malloc ()
#8 0x00007f75532cd4c2 in __conhash_get_rbnode (node=0x22c86870, hash=30) at build/release64/cm_sub/conhash/conhash_inter.c:88
#9 0x00007f75532cd76e in __conhash_add_replicas (conhash=0x24fbc7e0, iden=<value optimized out>) at build/release64/cm_sub/conhash/conhash_inter.c:45
#10 0x00007f75532cd1fa in conhash_add_node (conhash=0x24fbc7e0, iden=0) at build/release64/cm_sub/conhash/conhash.c:72
#11 0x00007f75532c651b in cm_sub::TopoCluster::initLBPolicyInfo (this=0x2593a400) at build/release64/cm_sub/topo_cluster.cpp:114
#12 0x00007f75532cad73 in cm_sub::TopoClusterManager::processClusterMapTable (this=0xa219e0, ref=0x267ea8c0) at build/release64/cm_sub/topo_cluster_manager.cpp:396
#13 0x00007f75532c5a93 in cm_sub::SubRespMsgProcess::reinitCluster (this=0x9c2f00, msg=0x4e738ed0) at build/release64/cm_sub/sub_resp_msg_process.cpp:157
...
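I checked the relevant application-level data structures, and the data was essentially intact. My initial suspicion was therefore that tcmalloc's internal bookkeeping had been corrupted and the failure only surfaced during allocation, which would make this stack trace a symptom rather than the cause. A few days later, another production service built on the same library also core dumped, with the same stack trace.
The first round of investigation focused on whatever had changed recently, including the server environment the services depend on, but nothing obvious turned up. In the end the only option was to start from the direct cause of the crash.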
Analyzing GetStackTrace
Pinpointing the exact location of the crash:
# the crash is at this instruction
0x000000000045d145 <_Z13GetStackTracePPvii+21>: mov 0x8(%rax),%r9
(gdb) p/x $rip # address of the faulting instruction
$9 = 0x45d145
(gdb) p/x $rax
$10 = 0x4e73aa58
(gdb) x/1a $rax+0x8 # rax + 8 = 0x4e73aa60
0x4e73aa60: 0x0
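The instruction tries to read from [0x4e73aa60] and faults because that memory location is not readable. To understand what the instruction actually means, it has to be mapped back to the source. The tcmalloc source shows that GetStackTrace has several implementations, selected by compile options, so the approach was to pick the most likely one and compare it against the disassembly to confirm that the code matches. The first candidate was stacktrace_x86-64-inl.h, which turned out not to match at all. The next was stacktrace_x86-inl.h, which also supports 64-bit platforms.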
stacktrace_x86-inl.h uses several macros to generate the function name and parameters. Stripped down, the code looks roughly like this:
int GET_STACK_TRACE_OR_FRAMES {
  void **sp;
  unsigned long rbp;
  __asm__ volatile ("mov %%rbp, %0" : "=r" (rbp));
  sp = (void **) rbp;
  int n = 0;
  while (sp && n < max_depth) {
    if (*(sp+1) == reinterpret_cast<void *>(0)) {
      break;
    }
    void **next_sp = NextStackFrame<!IS_STACK_FRAMES, IS_WITH_CONTEXT>(sp, ucp);
    if (skip_count > 0) {
      skip_count--;
    } else {
      result[n] = *(sp+1);
      n++;
    }
    sp = next_sp;
  }
  return n;
}
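To make the loop above easier to follow, here is a minimal sketch of the same frame-pointer walk run against a fake stack rather than a live one. It relies on the usual x86-64 layout when code is built with frame pointers: the slot at [rbp] holds the caller's saved rbp, and the slot at [rbp+8] holds the return address. The names WalkFrames and DemoWalk, the fake stack array, and the 0x100x "return addresses" are hypothetical and exist only for illustration. The sketch also assumes the next frame is simply whatever sp[0] points to, with none of the validation the real NextStackFrame may perform.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Walks a frame-pointer chain: sp[0] is the saved frame pointer of the
// caller, sp[1] is the return address. Mirrors the simplified loop, but
// takes the starting frame as an argument instead of reading %rbp.
int WalkFrames(void** sp, void** result, int max_depth, int skip_count) {
  int n = 0;
  while (sp && n < max_depth) {
    if (sp[1] == nullptr) {
      break;  // a null return address ends the walk
    }
    // Assumption: the next frame is just the saved pointer in sp[0].
    void** next_sp = static_cast<void**>(sp[0]);
    if (skip_count > 0) {
      --skip_count;          // drop the innermost frames on request
    } else {
      result[n++] = sp[1];   // record this frame's return address
    }
    sp = next_sp;
  }
  return n;
}

// Builds a fake three-frame stack and walks it, skipping the first frame.
std::vector<std::uintptr_t> DemoWalk() {
  void* stack[6];
  stack[0] = &stack[2];                                        // frame 0: saved fp
  stack[1] = reinterpret_cast<void*>(std::uintptr_t{0x1001});  // frame 0: return addr
  stack[2] = &stack[4];                                        // frame 1: saved fp
  stack[3] = reinterpret_cast<void*>(std::uintptr_t{0x1002});  // frame 1: return addr
  stack[4] = nullptr;                                          // frame 2: end of chain
  stack[5] = reinterpret_cast<void*>(std::uintptr_t{0x1003});  // frame 2: return addr

  void* result[10] = {};
  int n = WalkFrames(stack, result, 10, /*skip_count=*/1);
  assert(n == 2);

  std::vector<std::uintptr_t> out;
  for (int i = 0; i < n; ++i) {
    out.push_back(reinterpret_cast<std::uintptr_t>(result[i]));
  }
  return out;
}
```

On a 64-bit target, `*(sp+1)` is a load from 8 bytes past sp, which has the same shape as the faulting `mov 0x8(%rax),%r9`. In the sketch every saved pointer is valid by construction. In a real process, a saved frame pointer that is not a genuine frame address would send the same load to an arbitrary location.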
NextStackFrame is a template function that contains a lot of code, but once stripped down it is very simple:
template<bool STRICT_UNWINDING, bool WITH_CONTEXT>