A Deep Dive into TLS (Thread Local Storage), Using macOS as an Example

easylyou

Last modified: 2023-08-06 13:42:05


Tags: macos

First published: 2023-08-06 00:24:46

Article link: https://blog.csdn.net/easy_level1/article/details/131999164


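Foreword

There are few Chinese-language articles about how TLS works. Those that exist tend to be short on detail, and many are seven or eight years old; some describe the TLS implementation on Windows XP, for instance, which is quite dated.
My work requires a fairly deep understanding of TLS, so I am sharing what I have learned here.
This article uses an arm64 program on the Mac as its example and explains how clang, dyld, and macOS cooperate to implement TLS.
Note that Windows, Linux, and Apple's operating systems each implement TLS differently, and the details also vary between architectures such as x64 and arm, but the underlying principles are similar. Readers who are new to TLS should still find something useful here.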

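What is TLS

TLS, Thread Local Storage, is a storage mechanism that gives every thread its own private block of memory that other threads do not interfere with. Here is a simple example: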

#include <unistd.h>
#include <time.h>
#include <stdio.h>
long inner_global_var[2] = {0, 0}; // global variable
void record_start() { time(&inner_global_var[0]); }
void record_end() { time(&inner_global_var[1]); }
long record_seconds() { return inner_global_var[1]-inner_global_var[0]; }

int main() {
	record_start();
	for(int i=0; i<3; i++) {
		// do work ...
		sleep(1);
	}
	record_end();
	printf("Done! time used: %lds\n", record_seconds());
}

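In the example above, record_start(), record_end(), and record_seconds() record elapsed time, and their usage is self-explanatory. In a single-threaded program they work fine. In a multithreaded program they do not, as the following code shows: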

#include <unistd.h>
#include <time.h>
#include <stdio.h>
#include <pthread.h>
long inner_global_var[2] = {0, 0}; // global variable, shared by all threads
void record_start() { time(&inner_global_var[0]); }
void record_end() { time(&inner_global_var[1]); }
long record_seconds() { return inner_global_var[1] - inner_global_var[0];}

// pthread entry points must have the signature void *(*)(void *)
void *Thread1(void *arg) {
	(void)arg;
	record_start();
	for(int i=0; i<3; i++) {
		// do work ...
		sleep(1);
	}

	record_end();
	printf("Thread1 Done! time used: %lds\n", record_seconds());
	return NULL;
}

void *Thread2(void *arg) {
	(void)arg;
	record_start();
	for(int i=0; i<2; i++) {
		// do work ...
		sleep(2);
	}
	record_end();
	printf("Thread2 Done! time used: %lds\n", record_seconds());
	return NULL;
}

int main()
{
    pthread_t pid1, pid2;
    pthread_create(&pid1, NULL, Thread1, NULL);
    pthread_create(&pid2, NULL, Thread2, NULL);

    pthread_join(pid1, NULL);
    pthread_join(pid2, NULL);

    return 0;
}

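Clearly record_xxx() is not thread-safe: each call overwrites the result of the previous one, so in a multithreaded program the thread that runs later overwrites the data recorded by the thread that ran earlier.
You might object that record_xxx() is simply badly designed and should never be used in a multithreaded environment.
So does record_xxx() need to be redesigned? No. TLS solves the problem with a small change, as in the following code: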

#include <unistd.h>
#include <time.h>
#include <stdio.h>
#include <pthread.h>
__thread long inner_global_var[2] = {0, 0}; // global variable, now thread-local
void record_start() { time(&inner_global_var[0]); }
void record_end() { time(&inner_global_var[1]); }
long record_seconds() { return inner_global_var[1] - inner_global_var[0];}

// pthread entry points must have the signature void *(*)(void *)
void *Thread1(void *arg) {
	(void)arg;
	record_start();
	for(int i=0; i<3; i++) {
		// do work ...
		sleep(1);
	}

	record_end();
	printf("Thread1 Done! time used: %lds\n", record_seconds());
	return NULL;
}

void *Thread2(void *arg) {
	(void)arg;
	record_start();
	for(int i=0; i<2; i++) {
		// do work ...
		sleep(2);
	}
	record_end();
	printf("Thread2 Done! time used: %lds\n", record_seconds());
	return NULL;
}

int main()
{
    pthread_t pid1, pid2;
    pthread_create(&pid1, NULL, Thread1, NULL);
    pthread_create(&pid2, NULL, Thread2, NULL);

    pthread_join(pid1, NULL);
    pthread_join(pid2, NULL);

    return 0;
}

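The revised code changes only one thing about the data: the variable inner_global_var is now declared with the __thread keyword. With it, the variable lives at a different memory address in each thread, and the copies do not interfere with one another, which fixes the multithreading problem.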
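
To make the "different address in each thread" claim concrete, here is a minimal sketch. The names tls_value, struct report, and compare_with_worker are made up for this illustration and are not part of the example above; it assumes a compiler that supports __thread (clang and gcc do).

```c
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

static __thread long tls_value = 0;  // one instance per thread

struct report {          // what the worker thread observed
    uintptr_t addr;      // address of its own tls_value
    long      value;     // value after its own increment
};

static void *worker(void *arg) {
    struct report *r = arg;
    tls_value += 1;                      // touches only this thread's copy
    r->addr  = (uintptr_t)&tls_value;
    r->value = tls_value;
    return NULL;
}

// Hypothetical helper: the caller sets its own copy to 100, runs one worker
// thread, and returns the caller's copy afterwards (or -1 on failure).
long compare_with_worker(struct report *r, uintptr_t *caller_addr) {
    pthread_t t;
    tls_value = 100;
    *caller_addr = (uintptr_t)&tls_value;
    if (pthread_create(&t, NULL, worker, r) != 0)
        return -1;
    pthread_join(t, NULL);
    return tls_value;                    // unaffected by the worker's write
}
```

The worker starts from the initial value 0 rather than the caller's 100, and the two addresses differ because the caller's copy is still alive while the worker runs.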

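Quick answers to questions readers may have:
Q: How many copies of the TLS variable inner_global_var does the code above create?
A: Up to three: one for the main thread and one for each of the two pthread threads. (On macOS the storage is allocated lazily on a thread's first access, as the dyld source later in this article shows, so a thread that never touches the variable, such as the main thread here, does not actually get a copy.)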

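Q: Is TLS aware of threads being created and exiting?
A: Yes. When a new thread is created, the system sets aside space for the variables marked as thread-local, exactly as large as those variables need. When the thread exits, that space is released properly. (This answer is broadly correct; the details depend on the implementation.)

Q: Where in memory are TLS variables stored?
A: It depends on the implementation. On macOS they are stored on the heap, next to memory returned by malloc(). On Linux they are stored at high addresses, next to memory returned by mmap().

Q: How is TLS implemented?
A: On macOS it takes cooperation among three parts: the compiler (clang), the dynamic linker (dyld), and the operating system (macOS). The details follow.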

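TLS implementation: the compiler's job (macOS)

A TLS variable changes the compiler's output in two main ways: the assembly generated for accesses to the variable is different, and the header portion of the generated binary is different.

Looking at the generated assembly

Consider the following sample code. First, the ordinary case of returning a plain global variable.
(On the Mac we use otool to view the disassembly; otool is similar to objdump on Linux.)

Program without a TLS variable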

long gvar = 0; // global variable
long func() { 
	return gvar;
}

% otool -tvj a.out
a.out:
(__TEXT,__text) section
_func:
0000000100003f90	b0000008	adrp	x8, 1 ; 0x100004000
0000000100003f94	f9400100	ldr	x0, [x8]
0000000100003f98	d65f03c0	ret

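arm64 assembly explained:
adrp writes the address of the 4 KB page containing gvar, 0x100004000, into register x8 (gvar sits at offset 0 of that page, so this is also gvar's address); ldr then reads 8 bytes from that address into register x0. x0 is the function return-value register, like rax on x64.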
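
As a worked check of the 0x100004000 that otool prints next to adrp, the sketch below computes an adrp target from the instruction's address and its page immediate. adrp_target is a name invented for this illustration; the rule it encodes (clear the low 12 bits of the PC, then add the immediate times 4096) is how the instruction is defined.

```c
#include <stdint.h>

// Target of "adrp Xd, imm": the 4 KB page of the instruction plus imm pages.
uint64_t adrp_target(uint64_t pc, int64_t imm_pages) {
    uint64_t page_base = pc & ~(uint64_t)0xfff;       // clear the low 12 bits of PC
    return page_base + ((uint64_t)imm_pages << 12);   // add imm * 4096
}
```

With pc = 0x100003f90 and imm = 1 (the values in the listing), the page base is 0x100003000 and the result is 0x100004000.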

Program with a TLS variable

__thread long gvar = 0; // thread-local variable
long func() { 
	return gvar;
}

% otool -tvj a.out
a.out:
(__TEXT,__text) section
_func:
0000000100003f78	a9bf7bfd	stp	x29, x30, [sp, #-0x10]!
0000000100003f7c	910003fd	mov	x29, sp
0000000100003f80	b0000000	adrp	x0, 1 ; 0x100004000
0000000100003f84	91000000	add	x0, x0, #0x0
0000000100003f88	f9400008	ldr	x8, [x0]
0000000100003f8c	d63f0100	blr	x8
0000000100003f90	f9400000	ldr	x0, [x0]
0000000100003f94	a8c17bfd	ldp	x29, x30, [sp], #0x10
0000000100003f98	d65f03c0	ret

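arm64 assembly explained:
Adding __thread does make things considerably more complicated.
The instructions at 0000000100003f78 and 0000000100003f7c are the typical function prologue that saves the frame pointer and return address, equivalent to push rbp; mov rbp, rsp on x64.
The instruction at 0000000100003f94 is the typical function epilogue, equivalent to pop rbp on x64.

The key difference: after ldr reads 8 bytes from address 0x100004000 as before, blr branches to the address held in the register, equivalent to call rax on x64. The value returned in x0 is the address of the variable we want, and one more ldr dereferences it to obtain the value of gvar.

Conclusion:
For a variable marked __thread, the symbol's address does not hold the variable itself; it holds a function address. To obtain the variable's address, you call that function (with x0 still pointing at 0x100004000). The return value is the address of the target variable, and dereferencing it yields the variable's value.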
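
The instruction sequence above can be read as the following C, offered as a rough sketch rather than what clang literally emits. struct tlv_descriptor, fake_thunk, and fake_storage are hypothetical stand-ins; in particular the resolver here just returns one fixed address instead of picking per-thread storage.

```c
#include <stddef.h>

// Simplified, hypothetical descriptor: the function pointer comes first.
struct tlv_descriptor {
    void *(*thunk)(struct tlv_descriptor *self);
    unsigned long key;
    unsigned long offset;
};

static long fake_storage = 5;   // stands in for the calling thread's copy of gvar

// Stand-in resolver: a real one would locate the current thread's storage.
static void *fake_thunk(struct tlv_descriptor *self) {
    (void)self;
    return &fake_storage;
}

static struct tlv_descriptor gvar_descriptor = { fake_thunk, 0, 0 };

// Roughly what "return gvar;" becomes for a __thread variable.
long func(void) {
    struct tlv_descriptor *d = &gvar_descriptor;             // adrp + add
    void *(*resolver)(struct tlv_descriptor *) = d->thunk;   // ldr x8, [x0]
    long *addr = resolver(d);                                // blr x8
    return *addr;                                            // ldr x0, [x0]
}
```

The point of the indirection is that the same code works on every thread: only the resolver needs to know which thread is running.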

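Looking at the header of the compiled binary

Use MachOView to compare the Mach-O executables produced by the compiler. (Mach-O is the executable file format on Apple's operating systems; the counterparts are ELF on Linux and PE, the format of .exe files, on Windows.)

Program without a TLS variable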

long gvar = 1; // global variable
long func() { 
	return gvar;
}
int main()
{
    return 0;
}

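[Figure: MachOView screenshot]
As the MachOView screenshot above shows, the readable and writable __DATA segment contains only a single __data section. Because there is only one variable of type long, the section is 8 bytes in size, as expected.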

Program with a TLS variable

__thread long gvar = 1; // thread-local variable
long func() { 
	return gvar;
}
int main()
{
    return 0;
}

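[Figure: MachOView screenshot]
Note that when a TLS variable is present, sections with the __thread_ prefix appear in the __DATA segment. __thread_vars is always present and is described below. __thread_data is the counterpart of __data: it holds the thread-local variables that would otherwise live in __data, and in the screenshot above it is exactly 8 bytes. There may also be a __thread_bss section, the counterpart of __bss; because the code above has no zero-initialized thread-local variable, this example has no __thread_bss section.

__thread_vars holds the information about the TLS variables. Each TLS variable takes 24 bytes, laid out as struct TLV_Thunk, shown below. thunk is the function address mentioned above; calling it yields the address of the TLS variable. The function is actually _tlv_get_addr in dyld, written in assembly. key is the key generated by pthread_key_create_free, discussed later. offset is the variable's offset within the TLS memory (which in effect means that the TLS variables sit in one contiguous block, and a variable's actual address is base address + offset).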

// dyld-dyld-1042.1/common/MachOAnalyzer.h:384

    // the compiler statically allocated one of these thunks for each thread_local variable
    // the compiler codegens access to a thread_local by calling the thunk to get the address of the variable for the current thread
    struct TLV_Thunk
    {
        TLV_Resolver thunk;
        size_t	  key;
        size_t	  offset;
    };
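
The 24-byte figure follows directly from three 8-byte fields. The sketch below restates the struct in plain C and checks the layout at compile time; it assumes a 64-bit (LP64) target, and the TLV_Resolver signature shown is an assumption made for illustration rather than a quote from the dyld headers.

```c
#include <stddef.h>

// Assumed resolver signature, for illustration only.
typedef void *(*TLV_Resolver)(void *descriptor);

struct TLV_Thunk {
    TLV_Resolver thunk;    // bytes 0..7:   function that returns the variable's address
    size_t       key;      // bytes 8..15:  pthread key for the per-thread block
    size_t       offset;   // bytes 16..23: offset of the variable inside that block
};

// These hold on LP64 targets such as arm64 and x86_64 macOS.
_Static_assert(sizeof(struct TLV_Thunk) == 24, "descriptor is 24 bytes");
_Static_assert(offsetof(struct TLV_Thunk, key) == 8, "key at offset 8");
_Static_assert(offsetof(struct TLV_Thunk, offset) == 16, "offset at offset 16");
```

The offsets 8 and 16 are the same ones that _tlv_get_addr uses when it loads the key with ldr x16, [x0, #8] and the offset with ldr x16, [x0, #16].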


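TLS implementation: the operating system's job (xnu kernel)

Before turning to the kernel, let us motivate what it has to provide by looking at the four functions below.

POSIX standard: thread-specific data

First, a set of APIs defined by the POSIX standard: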

int   pthread_key_create(pthread_key_t *key, void (*destructor)(void *));
void* pthread_getspecific(pthread_key_t key);
int   pthread_setspecific(pthread_key_t key, const void *value);
int   pthread_key_delete(pthread_key_t key);

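For detailed usage of these four functions, see the man pages or the many articles available. In brief:
pthread_key_create creates a key. With that key you can store a pointer to a block of memory (pthread_setspecific) and later retrieve it (pthread_getspecific).
This looks a bit like a hash map, but the key behaves quite differently: even with the same key, get and set operate on a different slot in each thread. In effect, this is a way to add TLS variables dynamically through the libc API.

This raises a basic question: how does pthread_getspecific know which thread is currently running? The answer is the same as for pthread_self():
When the kernel scheduler switches the active thread, it writes an "id" to an agreed-upon location. That location differs across architectures and across platforms.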
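
A small sketch of the thread-specific data API in use may help here. slot_key, thread_slot, and slot_after_worker are names invented for this example; the pthread calls themselves are the standard POSIX ones listed above, plus pthread_once to create the key exactly once.

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_key_t  slot_key;
static pthread_once_t slot_once = PTHREAD_ONCE_INIT;

static void make_key(void) {
    // free() is called on a thread's stored pointer when that thread exits
    pthread_key_create(&slot_key, free);
}

// Returns the calling thread's private long, allocating it on first use.
long *thread_slot(void) {
    pthread_once(&slot_once, make_key);
    long *p = pthread_getspecific(slot_key);
    if (p == NULL) {                       // first access on this thread
        p = calloc(1, sizeof *p);
        if (p != NULL)
            pthread_setspecific(slot_key, p);
    }
    return p;
}

static void *worker(void *arg) {
    long *p = thread_slot();               // same key, different slot
    if (p == NULL)
        return NULL;
    *p = 7;
    *(long *)arg = *p;                     // report what the worker stored
    return NULL;
}

// Hypothetical helper: the caller stores 42, a worker stores 7 under the
// same key, and the caller's value is returned afterwards.
long slot_after_worker(long *seen_by_worker) {
    long *mine = thread_slot();
    if (mine == NULL)
        return -1;
    *mine = 42;
    pthread_t t;
    if (pthread_create(&t, NULL, worker, seen_by_worker) != 0)
        return -1;
    pthread_join(t, NULL);
    return *thread_slot();                 // still 42: the worker wrote elsewhere
}
```

Allocating on first access and registering free as the destructor mirrors, in miniature, the lazy allocation and cleanup that dyld performs for __thread variables.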

xnu kernel on arm64

The kernel code that sets the id is shown below. The key instruction is msr TPIDRRO_EL0, $1

// xnu/osfmk/arm64/cswitch.s:127
/*
 * set_thread_registers
 *
 * Updates thread registers during context switch
 *  arg0 - New thread pointer
 *  arg1 - Scratch register
 *  arg2 - Scratch register
 */
.macro	set_thread_registers
	msr		TPIDR_EL1, $0						// Write new thread pointer to TPIDR_EL1
	ldr		$1, [$0, ACT_CPUDATAP]
	str		$0, [$1, CPU_ACTIVE_THREAD]

	ldrsh	$2, [$1, CPU_NUMBER_GS]
	msr		TPIDR_EL0, $2

	ldr		$1, [$0, TH_CTH_SELF]				// Get cthread pointer
	msr		TPIDRRO_EL0, $1
.endmacro

The corresponding dyld code that reads the id:

	mrs		x17, TPIDRRO_EL0

xnu kernel on x86_64

The kernel code that sets the id is shown below. The key instruction is movq %rdx,%gs:CPU_ACTIVE_THREAD

// xnu/osfmk/x86_64/cswitch.s
/*
 * thread_t Switch_context(
 *		thread_t old,				// %rdi
 *		thread_continue_t continuation,		// %rsi
 *		thread_t new)				// %rdx
 *
 * returns 'old' thread in %rax
 */
Entry(Switch_context)
	popq	%rax				/* pop return PC */

	/* Test for a continuation and skip all state saving if so... */
	cmpq	$0, %rsi
	jne 	5f
	movq	%gs:CPU_KERNEL_STACK,%rcx	/* get old kernel stack top */
	
	/* save registers */
	movq	%rbp,KSS_RBP(%rcx)
	movq	%r12,KSS_R12(%rcx)
	movq	%r13,KSS_R13(%rcx)
	movq	%r14,KSS_R14(%rcx)
	movq	%r15,KSS_R15(%rcx)
	movq	%rax,KSS_RIP(%rcx)		/* save return PC */
	movq	%rsp,KSS_RSP(%rcx)		/* save SP */
5:
	movq	%rdi,%rax			/* return old thread */
	/* new thread in %rdx */
	movq    %rdx,%gs:CPU_ACTIVE_THREAD      /* new thread is active */
	movq	TH_KERNEL_STACK(%rdx),%rdx	/* get its kernel stack */
	lea	-IKS_SIZE(%rdx),%rcx
	add	EXT(kernel_stack_size)(%rip),%rcx /* point to stack top */

	movq	%rdx,%gs:CPU_ACTIVE_STACK	/* set current stack */
	movq	%rcx,%gs:CPU_KERNEL_STACK	/* set stack top */

	movq	KSS_RSP(%rcx),%rsp		/* switch stacks */
	movq	KSS_RBX(%rcx),%rbx		/* restore registers */
	movq	KSS_RBP(%rcx),%rbp
	movq	KSS_R12(%rcx),%r12
	movq	KSS_R13(%rcx),%r13
	movq	KSS_R14(%rcx),%r14
	movq	KSS_R15(%rcx),%r15
	jmp	*KSS_RIP(%rcx)			/* return old thread in %rax */

The corresponding dyld code that reads the id:

	movq	%gs:0x0(,%rax,8),%rax	// get thread value

linux kernel on x86_64

The kernel code that sets the id is shown below. The key calls are save_fsgs(prev_p) and x86_fsgsbase_load()

// linux/arch/x86/kernel/process_64.c
__no_kmsan_checks
__visible __notrace_funcgraph struct task_struct *
__switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
	struct thread_struct *prev = &prev_p->thread;
	struct thread_struct *next = &next_p->thread;
	struct fpu *prev_fpu = &prev->fpu;
	int cpu = smp_processor_id();
	/* We must save %fs and %gs before load_TLS() because
	 * %fs and %gs may be cleared by load_TLS().
	 *
	 * (e.g. xen_load_tls())
	 */
	save_fsgs(prev_p);
	/*
	 * Load TLS before restoring any segments so that segment loads
	 * reference the correct GDT entries.
	 */
	load_TLS(next, cpu);

	arch_end_context_switch(next_p);
	
	x86_fsgsbase_load(prev, next);
	...
}

The corresponding glibc code that reads the id:

#  define THREAD_SELF \
  ({ struct pthread *__self;						      \
     asm ("mov %%fs:%c1,%0" : "=r" (__self)				      \
	  : "i" (offsetof (struct pthread, header.self)));	 	      \
     __self;})
# endif

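Summary

On every context switch the kernel writes the thread's information to a fixed location.
Specifically, x86_64 Linux uses the fs register, x86_64 xnu uses the gs register, and arm64 xnu uses TPIDRRO_EL0.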

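TLS implementation: the linker's job (dyld)

Now for the key part. Let us walk through the dyld source to see what dyld actually does, starting from the implementation of dlopen().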

// dyld/DyldAPIs.cpp
void* APIs::dlopen(const char* path, int mode)
{
    void* callerAddress = __builtin_return_address(0);
    return dlopen_from(path, mode, callerAddress);
}

void* APIs::dlopen_from(const char* path, int mode, void* addressInCaller) {
		
		const Loader*   topLoader = nullptr;
		...
		// map the segments of the Mach-O file into memory one by one
        topLoader = Loader::getLoader(diag, *this, path, options);
		...
		// recursively load the dylibs this image depends on
		((Loader*)topLoader)->loadDependents(diag, *this, depOptions);


        // fixup: patch the symbols that need rebase and rebind
        for ( const Loader* ldr : newLoaders ) {
            ldr->applyFixups(diag, *this, cacheDataConst, allowLazyBinds);
        }

		// the part we care about: TLS initialization
        for ( const Loader* ldr : newLoaders ) {
            const MachOAnalyzer* ma = ldr->analyzer(*this);
            if ( ma->hasThreadLocalVariables() )
                setUpTLVs(ma);
        }
        
        ...
}

void RuntimeState::setUpTLVs(const MachOAnalyzer* ma)
{
    __block TLV_Info info;
    info.ma = ma;
    // Note: the space for thread local variables is allocated with
    // system malloc and freed on thread death with system free()
    info.key = 0;

	// forEachThreadLocalVariable walks the __thread_vars section of the Mach-O file mentioned above, one iteration per 24-byte entry.
    initialContent = ma->forEachThreadLocalVariable(diag, ^(MachOAnalyzer::TLV_Resolver *tlvThunkAddr, uintptr_t *keyAddr) {

        if (info.key == 0) {
            if ( this->libSystemHelpers->pthread_key_create_free(&info.key) != 0 )
                halt("could not create thread local variables pthread key");
        }

        *(intptr_t*)keyAddr = info.key;
        
        // getAddrFunc is _tlv_get_addr, a function written entirely in assembly
        *tlvThunkAddr = getAddrFunc;

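    });

    // initialContent is the value returned by forEachThreadLocalVariable,
    // so it is only available once the block above has finished
    info.initialContentOffset = (uint32_t)initialContent.runtimeOffset;
    info.initialContentSize   = (uint32_t)initialContent.size;
    withTLVLock(^() {
        _tlvInfos.push_back(info);
    });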
}

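That completes dyld's initialization of the TLS variables. As described earlier, it can be summarized as follows:
dyld walks every TLS variable descriptor, writes _tlv_get_addr into its first 8 bytes, and writes the key into its second 8 bytes. The key is created by pthread_key_create_free.

Next, the source of _tlv_get_addr: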

// libdyld/threadLocalHelpers.s
#if __arm64__
	// Parameters: X0 = descriptor
	// Result:  X0 = address of TLV
	// Note: all registers except X0, x16, and x17 are preserved
	.align 2
	.globl _tlv_get_addr
	.private_extern _tlv_get_addr
_tlv_get_addr:

	ldr		x16, [x0, #8]			// get key from descriptor

	mrs		x17, TPIDRRO_EL0
	and		x17, x17, #-8			// clear low 3 bits???
	ldr		x17, [x17, x16, lsl #3]	// get thread allocation address for this key

	cbz		x17, LlazyAllocate		// if NULL, lazily allocate
	ldr		x16, [x0, #16]			// get offset from descriptor

	add		x0, x17, x16			// return allocation+offset
	ret		lr

LlazyAllocate:
	stp		fp, lr, [sp, #-16]!
	mov		fp, sp
	sub		sp, sp, #288
	stp		x1, x2, [sp, #-16]!		// save all registers that C function might trash
	stp		x3, x4, [sp, #-16]!
	stp		x5, x6, [sp, #-16]!
	stp		x7, x8, [sp, #-16]!
	stp		x9, x10,  [sp, #-16]!
	stp		x11, x12, [sp, #-16]!
	stp		x13, x14, [sp, #-16]!
	stp		x15, x16, [sp, #-16]!
	stp		q0,  q1,  [sp, #-32]!
	stp		q2,  q3,  [sp, #-32]!
	stp		q4,  q5,  [sp, #-32]!
	stp		q6,  q7,  [sp, #-32]!
	stp		x0, x17,  [sp, #-16]!	// save descriptor

	mov		x0, x16					// use key from descriptor as parameter
	bl		_instantiateTLVs_thunk  // instantiateTLVs(key)
	ldp		x16, x17, [sp], #16		// pop descriptor

	ldr		x16, [x16, #16]			// get offset from descriptor

	add		x0, x0, x16				// return allocation+offset

	ldp		q6,  q7,  [sp], #32
	ldp		q4,  q5,  [sp], #32
	ldp		q2,  q3,  [sp], #32
	ldp		q0,  q1,  [sp], #32
	ldp		x15, x16, [sp], #16
	ldp		x13, x14, [sp], #16
	ldp		x11, x12, [sp], #16
	ldp		x9, x10,  [sp], #16
	ldp		x7, x8, [sp], #16
	ldp		x5, x6, [sp], #16
	ldp		x3, x4, [sp], #16
	ldp		x1, x2, [sp], #16

	mov		sp, fp
	ldp		fp, lr, [sp], #16

	ret
#endif

Note that LlazyAllocate calls _instantiateTLVs_thunk, which is also a function in dyld:

// called from threadLocalHelpers.s
extern "C" void* instantiateTLVs_thunk(pthread_key_t key);
VIS_HIDDEN
void* instantiateTLVs_thunk(pthread_key_t key)
{
    // Called by _tlv_get_addr on slow path to allocate thread
    // local storge for the current thread.
    return gDyld.apis->_instantiateTLVs(key);
}


// called lazily when TLV is first accessed
void* RuntimeState::_instantiateTLVs(pthread_key_t key)
{
    // find amount to allocate and initial content
    __block const uint8_t* initialContent     = nullptr;
    __block size_t         initialContentSize = 0;
    withTLVLock(^() {
        for ( const auto& info : _tlvInfos ) {
            if ( info.key == key ) {
                initialContent     = (uint8_t*)info.ma + info.initialContentOffset;
                initialContentSize = info.initialContentSize;
            }
        }
    });

    // no thread local storage in image: should never happen
    if ( initialContent == nullptr )
        return nullptr;

    // allocate buffer and fill with template
    // Note: the space for thread local variables is allocated with system malloc
    void* buffer = this->libSystemHelpers->malloc(initialContentSize);
    memcpy(buffer, initialContent, initialContentSize);

    // set this thread's value for key to be the new buffer.
    this->libSystemHelpers->pthread_setspecific(key, buffer);

    return buffer;
}
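
Putting the pieces together, the flow of setUpTLVs, _tlv_get_addr, and _instantiateTLVs can be modelled in portable C. This is only a sketch under simplifying assumptions: a single image, the standard pthread_key_create with free as the destructor standing in for pthread_key_create_free, and pthread_getspecific standing in for the direct TPIDRRO_EL0 lookup. All names starting with model_ and the struct tlv_image type are invented for this illustration.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <stddef.h>

struct tlv_descriptor {                   // same field order as TLV_Thunk
    void *(*thunk)(struct tlv_descriptor *);
    pthread_key_t key;
    size_t        offset;
};

struct tlv_image {                        // hypothetical: one per image with TLS
    pthread_key_t key;
    const void   *initial_content;        // template copied for each thread
    size_t        initial_size;
};

static struct tlv_image g_image;          // a single image keeps the sketch short

// Slow path: first access on this thread, so copy the template into a new buffer.
static void *model_instantiate_tlvs(pthread_key_t key) {
    void *buffer = malloc(g_image.initial_size);
    if (buffer == NULL)
        return NULL;
    memcpy(buffer, g_image.initial_content, g_image.initial_size);
    pthread_setspecific(key, buffer);
    return buffer;
}

// Model of _tlv_get_addr: this thread's block for the key, plus the offset.
void *model_tlv_get_addr(struct tlv_descriptor *d) {
    char *base = pthread_getspecific(d->key);     // fast path
    if (base == NULL)
        base = model_instantiate_tlvs(d->key);    // lazy allocation
    return base != NULL ? base + d->offset : NULL;
}

// Model of setUpTLVs: create one key, then patch every descriptor.
int model_setup_tlvs(struct tlv_descriptor *descs, size_t count,
                     const void *initial_content, size_t initial_size) {
    if (pthread_key_create(&g_image.key, free) != 0)
        return -1;
    g_image.initial_content = initial_content;
    g_image.initial_size    = initial_size;
    for (size_t i = 0; i < count; i++) {
        descs[i].thunk = model_tlv_get_addr;      // first field: resolver
        descs[i].key   = g_image.key;             // second field: key
    }
    return 0;
}
```

A caller would build one descriptor per variable with its offset filled in, call model_setup_tlvs once, and then reach each variable through d->thunk(d), just as the compiler-generated code does. Each thread's first call allocates and initializes its own block; later calls take the fast path.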