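Originally, the program started a timer (timer_create) during initialization. When the timer's first expiration was set to a short interval, the program would crash with a core dump, and the crash location varied from run to run. Running it under valgrind reveals an invalid memory write: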
    ==31676== Invalid write of size 8
    ==31676==    at 0x37A540F852: _dl_allocate_tls_init (in /lib64/ld-2.5.so)
    ==31676==    by 0x4E26BD3: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.5.so)
    ==31676==    by 0x76E0B00: timer_helper_thread (in /lib64/librt-2.5.so)
    ==31676==    by 0x4E2673C: start_thread (in /lib64/libpthread-2.5.so)
    ==31676==    by 0x58974BC: clone (in /lib64/libc-2.5.so)
    ==31676==  Address 0xf84dbd0 is 0 bytes after a block of size 336 alloc'd
    ==31676==    at 0x4A05430: calloc (vg_replace_malloc.c:418)
    ==31676==    by 0x37A5410082: _dl_allocate_tls (in /lib64/ld-2.5.so)
    ==31676==    by 0x4E26EB8: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.5.so)
    ==31676==    by 0x76E0B00: timer_helper_thread (in /lib64/librt-2.5.so)
    ==31676==    by 0x4E2673C: start_thread (in /lib64/libpthread-2.5.so)
    ==31676==    by 0x58974BC: clone (in /lib64/libc-2.5.so)
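Searching Google for _dl_allocate_tls_init turned up a glibc bug, Bug 13862, which looks similar to my situation. This post discusses that bug and the related TLS implementation.
To read the glibc source, we first need to know which glibc version is in use. It can be checked like this: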
    $ /lib/libc.so.6
    GNU C Library stable release version 2.5, by Roland McGrath et al.
    ...
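For convenience, the source can also be browsed directly on the [glibc Cross Reference](http://osxr.org/glibc/source/?v=glibc-2.17) site. The version there (2.17) differs from ours, but that makes little difference here.
Bug description
According to its reporter, reproducing Bug 13862 requires the following conditions: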
The use of a relatively large number of dynamic libraries, loaded at runtime using dlopen.
The use of thread-local-storage within those libraries.
A thread exiting prior to the number of loaded libraries increasing a significant amount, followed by a new thread being created after the number of libraries has increased.
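Put simply: while a large number of shared libraries containing TLS variables are being loaded, a thread is started, and after that thread exits another thread is started.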
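The scenario in the bug report closely matches ours. The difference is that we use a timer, but a timer also starts a new thread when it fires, and that thread exits immediately: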
/nptl/sysdeps/unix/sysv/linux/timer_routines.c
timer_helper_thread(...) // helper thread that watches for timer expirations
{
    ...