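I had only been in my new department for a few days, and had barely downloaded the subsystem code, when a coredump issue landed on me.
Some non-essential details that might reveal product information have been redacted.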
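【Problem report】
I received a report that the xyzd process had crashed. After a first look, the tester judged it to be related to a component owned by our team.
【Log collection】
Connect to the environment and take the blunt approach: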
journalctl >xyzd_crash.log
[robot@xx-0:/var/lib/systemd/coredump]
$ ls
core.xyzd.760.23726ab5789e4c08a46dfaa540e075b8.29810.1597131685000000000000.lz4
After decompressing the core file and loading it in gdb, the call stack is as follows:
(gdb) bt
#0 0x00007ffff7f9139b in avahi_entry_group_free () from /lib64/libavahi-core.so.7
#1 0x000055555555cdf2 in xxoo::AvahiUserdata::~AvahiUserdata (this=0x7ffff645eb40, __in_chrg=<optimized out>)
at common/avahi/AvahiManager.cpp:470
#2 0x000055555555d3af in xxoo::AvahiManager::ThreadFunc (this=0x55555558b3a0, avahi_mode=xxoo::AvahiMode::PUBLISH)
at /usr/include/bits/syslog.h:31
#3 0x00007ffff79ac354 in execute_native_thread_routine () from /lib64/libstdc++.so.6
#4 0x00007ffff7ad74df in start_thread () from /lib64/libpthread.so.0
#5 0x00007ffff77f36a3 in clone () from /lib64/libc.so.6
【Analysis】
Stage 1:
The code found on master does not line up with the line numbers in the backtrace:
#0 0x00007ffff7f9139b in avahi_entry_group_free () from /lib64/libavahi-core.so.7
#1 0x000055555555cdf2 in xyz::AvahiUserdata::~AvahiUserdata (this=0x7ffff645eb40, __in_chrg=<optimized out>)
at common/avahi/AvahiManager.cpp:470
#2 0x000055555555d3af in xyz::AvahiManager::ThreadFunc (this=0x55555558b3a0, avahi_mode=xyz::AvahiMode::PUBLISH)
I downloaded the source from the official site; the corresponding group structure is as follows:
struct AvahiEntryGroup {
char *path;
AvahiEntryGroupState state;
int state_valid;
AvahiClient *client;
AvahiEntryGroupCallback callback;
void *userdata;
AVAHI_LLIST_FIELDS(AvahiEntryGroup, groups);
};
The corresponding memory:
(gdb) p this->group_
$1 = (AvahiEntryGroup *) 0x7fffe80045d0
(gdb) x/100xw 0x7fffe80045d0
0x7fffe80045d0: 0xe8001280 0x00007fff 0x7fffe8001280 --path
0x00000002 --state, i.e. AVAHI_ENTRY_GROUP_ESTABLISHED
0x00000001 --state_valid, meaning the state field is valid
0x7fffe80045e0: 0xe8000c60 0x00007fff 0x7fffe8000c60 --client
0x5555d130 0x00005555 0x55555555d130 --callback
0x7fffe80045f0: 0xf645eb40 0x00007fff 0x7ffff645eb40 --userdata
0x00000000 0x00000000 --next --prev are both NULL, so this is a single node
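The field-by-field reading of the dump can be cross-checked with offsetof. The following is only a sketch: the stub types (ClientStub, GroupStateStub, GroupCallbackStub) and the name EntryGroupLayout are hypothetical stand-ins that mirror the sizes of the real Avahi types, and it assumes an LP64 target such as x86-64 Linux.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins: only their sizes matter for the layout. */
typedef struct ClientStub ClientStub;             /* stands in for AvahiClient */
typedef enum { GROUP_STATE_STUB } GroupStateStub; /* stands in for AvahiEntryGroupState */
typedef void (*GroupCallbackStub)(void);          /* stands in for AvahiEntryGroupCallback */

typedef struct EntryGroupLayout {
    char *path;
    GroupStateStub state;
    int state_valid;
    ClientStub *client;
    GroupCallbackStub callback;
    void *userdata;
    /* AVAHI_LLIST_FIELDS(AvahiEntryGroup, groups) expands to a next/prev pair */
    struct EntryGroupLayout *groups_next, *groups_prev;
} EntryGroupLayout;

/* Returns 1 if the offsets agree with the addresses annotated in the dump
 * (all relative to the start of the object at 0x7fffe80045d0). */
int layout_matches_dump(void) {
    return offsetof(EntryGroupLayout, path) == 0x00
        && offsetof(EntryGroupLayout, state) == 0x08
        && offsetof(EntryGroupLayout, state_valid) == 0x0c
        && offsetof(EntryGroupLayout, client) == 0x10
        && offsetof(EntryGroupLayout, callback) == 0x18
        && offsetof(EntryGroupLayout, userdata) == 0x20;
}

/* Prints each offset so it can be compared with the gdb output by eye. */
void print_entry_group_offsets(void) {
    assert(sizeof(void *) == 8); /* the sketch assumes 8-byte pointers */
    printf("path        +0x%02zx\n", offsetof(EntryGroupLayout, path));
    printf("state       +0x%02zx\n", offsetof(EntryGroupLayout, state));
    printf("state_valid +0x%02zx\n", offsetof(EntryGroupLayout, state_valid));
    printf("client      +0x%02zx\n", offsetof(EntryGroupLayout, client));
    printf("callback    +0x%02zx\n", offsetof(EntryGroupLayout, callback));
    printf("userdata    +0x%02zx\n", offsetof(EntryGroupLayout, userdata));
}
```

Calling print_entry_group_offsets() from a small main and comparing against the three dump rows (0x...45d0, 0x...45e0, 0x...45f0) is a quick way to make sure the annotations are not off by one field.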
The path contents are: /Client13/EntryGroup1
0x7fffe8001280: 47 '/' 67 'C' 108 'l' 105 'i' 101 'e' 110 'n' 116 't' 49 '1'
0x7fffe8001288: 51 '3' 47 '/' 69 'E' 110 'n' 116 't' 114 'r' 121 'y' 71 'G'
0x7fffe8001290: 114 'r' 111 'o' 117 'u' 112 'p' 49 '1' 0 '\000'
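Two small manual steps were done on the dumps above: joining a pair of little-endian 32-bit words into one 64-bit pointer, and turning the byte values back into a string. A minimal sketch of both, with hypothetical helper names join_words and bytes_to_string:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* x/xw on x86-64 shows a 64-bit pointer as two 32-bit words, low half first. */
uint64_t join_words(uint32_t lo, uint32_t hi) {
    return ((uint64_t)hi << 32) | lo;
}

/* Copies byte values (as printed by gdb) into out until a 0 byte, the end of
 * the input, or a full buffer. Always NUL-terminates; returns the length. */
size_t bytes_to_string(const int *bytes, size_t n, char *out, size_t out_size) {
    assert(bytes != NULL && out != NULL && out_size > 0);
    size_t i = 0;
    while (i < n && i + 1 < out_size && bytes[i] != 0) {
        out[i] = (char)bytes[i];
        i++;
    }
    out[i] = '\0';
    return i;
}
```

For example, join_words(0xe8001280, 0x00007fff) gives the path pointer 0x7fffe8001280, and feeding the 22 byte values from the dump to bytes_to_string gives back the path string.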
Client
struct AvahiClient {
const AvahiPoll *poll_api;
DBusConnection *bus;
int error;
AvahiClientState state;
AvahiClientFlags flags;
/* Cache for some seldom changing server data */
char *version_string, *host_name, *host_name_fqdn, *domain_name;
uint32_t local_service_cookie;
int local_service_cookie_valid;
AvahiClientCallback callback;
void *userdata;
AVAHI_LLIST_HEAD(AvahiEntryGroup, groups);
AVAHI_LLIST_HEAD(AvahiDomainBrowser, domain_browsers);
AVAHI_LLIST_HEAD(AvahiServiceBrowser, service_browsers);
AVAHI_LLIST_HEAD(AvahiServiceTypeBrowser, service_type_browsers);
AVAHI_LLIST_HEAD(AvahiServiceResolver, service_resolvers);
AVAHI_LLIST_HEAD(AvahiHostNameResolver, host_name_resolvers);
AVAHI_LLIST_HEAD(AvahiAddressResolver, address_resolvers);
AVAHI_LLIST_HEAD(AvahiRecordBrowser, record_browsers);
};
(gdb) x/100xw 0x7fffe8000c60
0x7fffe8000c60: 0xe8000b60 0x00007fff poll_api
0x00000000 0x00000000 bus
0x7fffe8000c70: 0xffffffe9 0x00000064 error
0x00000002 state
0x00000000 flags
0x00000000 0x00000000 version_string
0x7fffe8000c80: host_name
0x00000000 0x00000000 host_name_fqdn
0x00000000 0x00000000 domain_name
0x00000000 local_service_cookie
0x7fffe8000c90: 0x00000000 0x00000000 local_service_cookie_valid
0x5555d920 0x00005555 callback
0x7fffe8000cb0: 0xf645eb40 0x00007fff userdata
0xe80045d0 0x00007fff groups
Nothing looks wrong in the structures themselves, so let's look at the disassembly. The crash is at 0x00007ffff7f9139b.
int avahi_entry_group_free(AvahiEntryGroup *group) {
AvahiClient *client = group->client;
int r = AVAHI_OK;
assert(group);
if (group->path && avahi_client_is_connected(client))
r = entry_group_simple_method_call(group, "Free");
AVAHI_LLIST_REMOVE(AvahiEntryGroup, groups, client->groups, group);
avahi_free(group->path);
avahi_free(group);
return r;
}
(gdb) disas avahi_entry_group_free
Dump of assembler code for function avahi_entry_group_free:
0x00007ffff7f91380 <+0>: push %rbp
0x00007ffff7f91381 <+1>: push %rbx
0x00007ffff7f91382 <+2>: sub $0x8,%rsp
0x00007ffff7f91386 <+6>: test %rdi,%rdi
0x00007ffff7f91389 <+9>: je 0x7ffff7f91464 <avahi_entry_group_free+228>
0x00007ffff7f9138f <+15>: mov %rsi,%rbp
0x00007ffff7f91392 <+18>: test %rsi,%rsi
0x00007ffff7f91395 <+21>: je 0x7ffff7f91445 <avahi_entry_group_free+197>
=> 0x00007ffff7f9139b <+27>: mov 0x60(%rsi),%rsi
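At this point the analysis was already close to the answer.
There were two clues that I did not notice at first.
Clue 1: #0 0x00007ffff7f9139b in avahi_entry_group_free () from /lib64/libavahi-core.so.7
Note that the crash is inside libavahi-core.so. I missed this clue because I did not know Avahi's background.
Clue 2: the output of disas avahi_entry_group_free should have been read all the way to the end to get the full listing. Without the knowledge there is no intuition, so I missed the answer again that day.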
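Stage 2:
Checking the source once more, I finally found that avahi_entry_group_free has two implementations, one in libavahi-core and one in libavahi-client. Judging from the code, the one that should be called is the libavahi-client version.
The registers and the disassembly confirm this: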
(gdb) info r
rax 0x0 0
rbx 0x7ffff645eb40 140737325165376
rcx 0x7fffe80008d0 140737085704400
rdx 0x7fffe8001910 140737085708560
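rsi 0x7 7 --this argument is clearly invalid: the instruction at 0x00007ffff7f9139b reads 0x60(%rsi), i.e. address 0x67, which is not mapped in the process, and that triggers the segmentation fault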
rdi 0x7fffe8004600 140737085720064
rbp 0x7 0x7
rsp 0x7ffff645eaf0 0x7ffff645eaf0
r8 0x7fffe80008d8 140737085704408
r9 0x6 6
r10 0x4000 16384
r11 0x0 0
r12 0x55555558b3a0 93824992457632
r13 0x0 0
r14 0x7ffff645eb78 140737325165432
r15 0x7ffff645eb2c 140737325165356
rip 0x7ffff7f9139b 0x7ffff7f9139b <avahi_entry_group_free+27>
eflags 0x10202 [ IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
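The fault address follows from the register value and the displacement in the faulting instruction. A tiny sketch of that arithmetic; the helper names are hypothetical, and the 4 KiB page size passed in the usage below is an assumption about the target:

```c
#include <assert.h>
#include <stdint.h>

/* Effective address of an operand written as disp(%reg): base plus displacement. */
uint64_t effective_address(uint64_t base, int64_t disp) {
    return base + (uint64_t)disp;
}

/* True if addr lies inside the first page, which Linux normally leaves
 * unmapped so that NULL-like pointers fault instead of reading data. */
int in_first_page(uint64_t addr, uint64_t page_size) {
    assert(page_size > 0);
    return addr < page_size;
}
```

With rsi = 0x7 and the operand 0x60(%rsi), effective_address(0x7, 0x60) is 0x67, and in_first_page(0x67, 4096) holds, which is consistent with a plain segmentation fault on a garbage pointer.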
The full disassembly is as follows:
(gdb) disas avahi_entry_group_free
Dump of assembler code for function avahi_entry_group_free:
0x00007ffff7f91380 <+0>: push %rbp
0x00007ffff7f91381 <+1>: push %rbx
0x00007ffff7f91382 <+2>: sub $0x8,%rsp
0x00007ffff7f91386 <+6>: test %rdi,%rdi
0x00007ffff7f91389 <+9>: je 0x7ffff7f91464 <avahi_entry_group_free+228>
0x00007ffff7f9138f <+15>: mov %rsi,%rbp
0x00007ffff7f91392 <+18>: test %rsi,%rsi
0x00007ffff7f91395 <+21>: je 0x7ffff7f91445 <avahi_entry_group_free+197>
=> 0x00007ffff7f9139b <+27>: mov 0x60(%rsi),%rsi
0x00007ffff7f9139f <+31>: mov %rdi,%rbx
0x00007ffff7f913a2 <+34>: test %rsi,%rsi
0x00007ffff7f913a5 <+37>: je 0x7ffff7f913c1 <avahi_entry_group_free+65>
0x00007ffff7f913a7 <+39>: nopw 0x0(%rax,%rax,1)
0x00007ffff7f913b0 <+48>: mov %rbx,%rdi
0x00007ffff7f913b3 <+51>: callq 0x7ffff7f88c70 <avahi_entry_free@plt>
0x00007ffff7f913b8 <+56>: mov 0x60(%rbp),%rsi
0x00007ffff7f913bc <+60>: test %rsi,%rsi
0x00007ffff7f913bf <+63>: jne 0x7ffff7f913b0 <avahi_entry_group_free+48>
0x00007ffff7f913c1 <+65>: mov 0x38(%rbp),%rdi
0x00007ffff7f913c5 <+69>: test %rdi,%rdi
0x00007ffff7f913c8 <+72>: je 0x7ffff7f913cf <avahi_entry_group_free+79>
0x00007ffff7f913ca <+74>: callq 0x7ffff7f89170 <avahi_time_event_free@plt>// with my half-baked knowledge, this looks like the key point
0x00007ffff7f913cf <+79>: mov 0x50(%rbp),%rdx
0x00007ffff7f913d3 <+83>: mov 0x58(%rbp),%rax
0x00007ffff7f913d7 <+87>: test %rdx,%rdx
0x00007ffff7f913da <+90>: je 0x7ffff7f913e4 <avahi_entry_group_free+100>
0x00007ffff7f913dc <+92>: mov %rax,0x58(%rdx)
0x00007ffff7f913e0 <+96>: mov 0x58(%rbp),%rax
0x00007ffff7f913e4 <+100>: test %rax,%rax
0x00007ffff7f913e7 <+103>: je 0x7ffff7f91410 <avahi_entry_group_free+144>
0x00007ffff7f913e9 <+105>: mov 0x50(%rbp),%rdx
0x00007ffff7f913ed <+109>: mov %rdx,0x50(%rax)
0x00007ffff7f913f1 <+113>: movq $0x0,0x58(%rbp)
0x00007ffff7f913f9 <+121>: mov %rbp,%rdi
0x00007ffff7f913fc <+124>: movq $0x0,0x50(%rbp)
0x00007ffff7f91404 <+132>: add $0x8,%rsp
0x00007ffff7f91408 <+136>: pop %rbx
0x00007ffff7f91409 <+137>: pop %rbp
0x00007ffff7f9140a <+138>: jmpq 0x7ffff7f893f0 <avahi_free@plt>
0x00007ffff7f9140f <+143>: nop
0x00007ffff7f91410 <+144>: cmp 0xf8(%rbx),%rbp
0x00007ffff7f91417 <+151>: jne 0x7ffff7f91426 <avahi_entry_group_free+166>
0x00007ffff7f91419 <+153>: mov 0x50(%rbp),%rax
0x00007ffff7f9141d <+157>: mov %rax,0xf8(%rbx)
0x00007ffff7f91424 <+164>: jmp 0x7ffff7f913f1 <avahi_entry_group_free+113>
0x00007ffff7f91426 <+166>: lea 0x18c93(%rip),%rcx # 0x7ffff7faa0c0
0x00007ffff7f9142d <+173>: mov $0x6d,%edx
0x00007ffff7f91432 <+178>: lea 0x1862a(%rip),%rsi # 0x7ffff7fa9a63
0x00007ffff7f91439 <+185>: lea 0x16dde(%rip),%rdi # 0x7ffff7fa821e
0x00007ffff7f91440 <+192>: callq 0x7ffff7f88770 <__assert_fail@plt>
0x00007ffff7f91445 <+197>: lea 0x18c74(%rip),%rcx # 0x7ffff7faa0c0
0x00007ffff7f9144c <+204>: mov $0x65,%edx
0x00007ffff7f91451 <+209>: lea 0x1860b(%rip),%rsi # 0x7ffff7fa9a63
0x00007ffff7f91458 <+216>: lea 0x19d3a(%rip),%rdi # 0x7ffff7fab199
0x00007ffff7f9145f <+223>: callq 0x7ffff7f88770 <__assert_fail@plt>
0x00007ffff7f91464 <+228>: lea 0x18c55(%rip),%rcx # 0x7ffff7faa0c0
0x00007ffff7f9146b <+235>: mov $0x64,%edx
0x00007ffff7f91470 <+240>: lea 0x185ec(%rip),%rsi # 0x7ffff7fa9a63
0x00007ffff7f91477 <+247>: lea 0x1b11c(%rip),%rdi # 0x7ffff7fac59a
0x00007ffff7f9147e <+254>: callq 0x7ffff7f88770 <__assert_fail@plt>
The function our own code intends to call is the following (the symbol is in /usr/lib64/libavahi-client.so):
int avahi_entry_group_free(AvahiEntryGroup *group) {
AvahiClient *client = group->client;
int r = AVAHI_OK;
assert(group);
if (group->path && avahi_client_is_connected(client))
r = entry_group_simple_method_call(group, "Free");
AVAHI_LLIST_REMOVE(AvahiEntryGroup, groups, client->groups, group);
avahi_free(group->path);
avahi_free(group);
return r;
}
But the one actually called is this (the symbol is in /usr/lib64/libavahi-core.so):
void avahi_entry_group_free(AvahiServer *s, AvahiSEntryGroup *g) {
assert(s);
assert(g);
while (g->entries)
avahi_entry_free(s, g->entries);
if (g->register_time_event)
avahi_time_event_free(g->register_time_event); /* seeing this call in the disassembly is what confirmed that this is the function being executed */
AVAHI_LLIST_REMOVE(AvahiSEntryGroup, groups, s->groups, g);
avahi_free(g);
}
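Working backwards: the caller passes a single AvahiEntryGroup *, but this function expects (AvahiServer *, AvahiSEntryGroup *). Neither argument it receives is what it expects (the second is simply whatever was left in %rsi, here 0x7), so any call that takes this path is bound to fail.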
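Why the second argument is garbage can be modeled in a few lines. This is a toy model of the x86-64 System V calling convention, not real Avahi code: ArgRegs, CoreArgs and both functions are hypothetical names, and only the two registers relevant here are represented.

```c
#include <assert.h>
#include <stdint.h>

/* The first two integer/pointer argument registers of the x86-64 System V ABI. */
typedef struct {
    uint64_t rdi; /* 1st argument */
    uint64_t rsi; /* 2nd argument */
} ArgRegs;

/* Caller built against the client header: avahi_entry_group_free(group).
 * Only rdi is loaded; rsi keeps whatever earlier code left in it. */
ArgRegs call_with_one_arg(ArgRegs before, uint64_t group) {
    assert(group != 0); /* a NULL group would be a different bug */
    before.rdi = group;
    return before;
}

/* What the core implementation, avahi_entry_group_free(s, g), sees on entry. */
typedef struct {
    uint64_t s; /* taken from rdi */
    uint64_t g; /* taken from rsi */
} CoreArgs;

CoreArgs core_callee_view(ArgRegs regs) {
    CoreArgs seen = { regs.rdi, regs.rsi };
    return seen;
}
```

Plugging in the values from the register dump (rdi = 0x7fffe8004600, stale rsi = 0x7), the callee treats a client-side pointer as its AvahiServer * and 0x7 as its AvahiSEntryGroup *, which is exactly the pointer that the faulting instruction dereferences.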
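Stage 3:
Only after asking a senior expert on the team did I learn the rest.
configure.ac contains the dependency configuration, and the Avahi libraries it checks for are: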
PKG_CHECK_MODULES([LIBAVAHI], [avahi-core avahi-client])
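avahi-core is actually not needed, and the header directories mentioned above bear this out.
Avahi is most likely built as a C library rather than a C++ library, so its symbols are not name-mangled, and two functions with the same name but different parameters end up with the same symbol name. When the subsystem was finally linked, the symbol resolved to the function in the core library, which caused the problem (I recall that the book Expert C Programming describes this as interposition). I had seen global variables that accidentally shared a name across components and got mixed up, but mixing up two functions with the same name and different parameters is a first in my career.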
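The effect can be illustrated with a toy model of name-only symbol lookup. This is a deliberate simplification and not how a real ELF linker is implemented (real lookup also involves scopes, versioning and static versus dynamic linking); ExportedSymbol and resolve are hypothetical names, and the assumption that the core library was searched first is inferred from the order in the PKG_CHECK_MODULES line, not verified.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One exported definition as the model sees it. */
typedef struct {
    const char *library;   /* which library provides it */
    const char *symbol;    /* unmangled C symbol name */
    const char *signature; /* informational only: C symbols carry no type info */
} ExportedSymbol;

/* Returns the first definition whose name matches, in search order.
 * The signature is never consulted, which is the whole problem. */
const ExportedSymbol *resolve(const ExportedSymbol *exports, size_t n,
                              const char *name) {
    assert(exports != NULL && name != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(exports[i].symbol, name) == 0)
            return &exports[i];
    }
    return NULL;
}
```

If the table lists the libavahi-core definition before the libavahi-client one, resolve(..., "avahi_entry_group_free") returns the core entry even though the caller was compiled against the client prototype. Dropping avahi-core from the dependency list removes the first entry, so the only match left is the intended one.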
Some additional background
PKG_CONFIG_PATH=':/opt/xyz/lib64/pkgconfig'
The corresponding avahi-client.pc can be found under this path:
[i9527@897a602fb1f3:/opt/xyz/lib64/pkgconfig]
$ cat avahi-client.pc
prefix=/usr
exec_prefix=${prefix}
libdir=/usr/lib64
includedir=${prefix}/include
Name: avahi-client
Description: Avahi Multicast DNS Responder (Client Support)
Version: 0.7
Libs: -L${libdir} -lavahi-common -lavahi-client
Cflags: -D_REENTRANT -I${includedir}
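The Libs and Cflags lines above are the linker and compiler flags that pkg-config supplies to the build.
-L adds the library directory /usr/lib64 to the search path, and the two libraries linked from it are avahi-common and avahi-client.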