Symptom
A production service core-dumped with the following stack:
#0 0x000000000045d145 in GetStackTrace(void**, int, int) ()
#1 0x000000000045ec22 in tcmalloc::PageHeap::GrowHeap(unsigned long) ()
#2 0x000000000045eeb3 in tcmalloc::PageHeap::New(unsigned long) ()
#3 0x0000000000459ee8 in tcmalloc::CentralFreeList::Populate() ()
#4 0x000000000045a088 in tcmalloc::CentralFreeList::FetchFromSpansSafe() ()
#5 0x000000000045a10a in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) ()
#6 0x000000000045c282 in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long) ()
#7 0x0000000000470766 in tc_malloc ()
#8 0x00007f75532cd4c2 in __conhash_get_rbnode (node=0x22c86870, hash=30)
at build/release64/cm_sub/conhash/conhash_inter.c:88
#9 0x00007f75532cd76e in __conhash_add_replicas (conhash=0x24fbc7e0, iden=<value optimized out>)
at build/release64/cm_sub/conhash/conhash_inter.c:45
#10 0x00007f75532cd1fa in conhash_add_node (conhash=0x24fbc7e0, iden=0) at build/release64/cm_sub/conhash/conhash.c:72
#11 0x00007f75532c651b in cm_sub::TopoCluster::initLBPolicyInfo (this=0x2593a400)
at build/release64/cm_sub/topo_cluster.cpp:114
#12 0x00007f75532cad73 in cm_sub::TopoClusterManager::processClusterMapTable (this=0xa219e0, ref=0x267ea8c0)
at build/release64/cm_sub/topo_cluster_manager.cpp:396
#13 0x00007f75532c5a93 in cm_sub::SubRespMsgProcess::reinitCluster (this=0x9c2f00, msg=0x4e738ed0)
at build/release64/cm_sub/sub_resp_msg_process.cpp:157
...
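I inspected the relevant application-level data structures, and the data all looked fine. So my initial suspicion was that tcmalloc's internal bookkeeping had been corrupted and the failure surfaced during allocation, with this stack being only a symptom. A few days later another production service built on the same library also core-dumped, with the same stack.
The first round of troubleshooting started from whatever had changed recently, including the server environment the service depends on, but nothing stood out. In the end the only option was to start from the immediate cause of the core.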
Analyzing GetStackTrace
Pin down exactly where the core occurred:
# the core is at this instruction
0x000000000045d145 <_Z13GetStackTracePPvii+21>: mov 0x8(%rax),%r9
(gdb) p/x $rip # address of the faulting instruction
$9 = 0x45d145
(gdb) p/x $rax
$10 = 0x4e73aa58
(gdb) x/1a $rax+0x8 # rax + 8 = 0x4e73aa60
0x4e73aa60: 0x0
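The instruction tries to read from [0x4e73aa60] and faults: that memory location is not readable. To understand what the instruction actually means, it has to be mapped back to the source. The tcmalloc source shows that GetStackTrace has several implementations selected by build options, so I picked the most likely one and compared it against the disassembly to check whether it matched. My first pick was stacktrace_x86-64-inl.h, which turned out not to match at all; I then tried stacktrace_x86-inl.h, which also supports 64-bit platforms.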
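The address arithmetic behind the faulting operand can be sketched as follows. This is only an illustration: the helper names EffectiveAddress and PointerSlotOffset are hypothetical, and the only real inputs are the %rax value and the 0x8 displacement from the gdb session. On a target with 8-byte pointers, a displacement of one pointer slot from a base register is the kind of access to look for when matching the instruction against candidate source.

```cpp
#include <cstddef>
#include <cstdint>

// Effective address of an AT&T-syntax memory operand of the form disp(%base).
constexpr std::uint64_t EffectiveAddress(std::uint64_t base, std::int64_t disp) {
    return base + static_cast<std::uint64_t>(disp);
}

// Byte offset of element `index` in an array of pointers (hypothetical helper).
constexpr std::size_t PointerSlotOffset(std::size_t index) {
    return index * sizeof(void*);
}

// Values from the gdb session: %rax = 0x4e73aa58, operand is 0x8(%rax).
static_assert(EffectiveAddress(0x4e73aa58, 0x8) == 0x4e73aa60,
              "0x8(%rax) addresses rax + 8");
```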
stacktrace_x86-inl.h uses a few macros to generate the function name and parameters; simplified, the code is roughly:
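// result, max_depth, skip_count and ucp come from the macro-generated signature.
int GET_STACK_TRACE_OR_FRAMES {
    void **sp;
    unsigned long rbp;
    // Start from the current frame pointer.
    __asm__ volatile ("mov %%rbp, %0" : "=r" (rbp));
    sp = (void **) rbp;
    int n = 0;
    while (sp && n < max_depth) {
        // *(sp+1) is the slot just above the saved frame pointer: the return address.
        if (*(sp+1) == reinterpret_cast<void *>(0)) {
            break;
        }
        void **next_sp = NextStackFrame<!IS_STACK_FRAMES, IS_WITH_CONTEXT>(sp, ucp);
        if (skip_count > 0) {
            skip_count--;
        } else {
            result[n] = *(sp+1);
            n++;
        }
        sp = next_sp;
    }
    return n;
}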
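To make the loop easier to follow, here is a minimal sketch that runs the same walk over a synthetic stack instead of the real %rbp chain. Everything in it is an assumption for illustration: WalkFrames and WalkSyntheticStack are hypothetical names, the return addresses are made-up values, and the next frame is found by a plain dereference rather than by the real NextStackFrame.

```cpp
#include <cassert>
#include <cstdint>

// Walks a chain of frames the way the simplified loop does. `fp` stands in
// for the value read from %rbp; each frame is two pointer-sized slots:
// slot 0 = caller's saved frame pointer, slot 1 = return address.
int WalkFrames(void** fp, void** result, int max_depth, int skip_count) {
    assert(result != nullptr && max_depth >= 0);
    void** sp = fp;
    int n = 0;
    while (sp && n < max_depth) {
        if (*(sp + 1) == nullptr) break;  // same shape as the read that faulted
        // Assumption: follow the saved frame pointer with no sanity checks.
        void** next_sp = static_cast<void**>(*sp);
        if (skip_count > 0) {
            skip_count--;
        } else {
            result[n++] = *(sp + 1);
        }
        sp = next_sp;
    }
    return n;
}

// Builds a synthetic three-frame stack and walks it.
int WalkSyntheticStack(void** result, int max_depth, int skip_count) {
    void* frames[3][2];
    frames[0][0] = frames[1];
    frames[0][1] = reinterpret_cast<void*>(std::uintptr_t{0x1001});
    frames[1][0] = frames[2];
    frames[1][1] = reinterpret_cast<void*>(std::uintptr_t{0x1002});
    frames[2][0] = nullptr;  // end of the chain
    frames[2][1] = reinterpret_cast<void*>(std::uintptr_t{0x1003});
    return WalkFrames(frames[0], result, max_depth, skip_count);
}
```

In this sketch the walk trusts every saved frame pointer it reads: if slot 0 of a frame held a value that was not a valid frame address, the next `*(sp + 1)` would dereference it anyway.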
NextStackFrame is a template function with a lot of code in it, but the simplified version is short:
template<bool STRICT_UNWINDING, bool WITH_CONTEXT>
static void **NextStackFrame(void **old_sp, const void *uc) {
    void **new_sp = (void **) *old_sp;
    if (STRICT_UNWINDING) {
        if (new_sp <= old_sp) return NULL;
        if ((uintptr_t)new_sp - (uintptr_t)old_sp > 100000) return NULL;
    } else {